Heavy reports – the light way

If you have dealt with reports at all, you’ll recognize the case at a glance: you have an OLTP database that is also used for reporting; it has grown over the years to many gigabytes; and you need to aggregate millions of rows per day to provide useful business metrics. To make it more difficult, you need to update today’s metrics throughout the day, so that they stay close to real-time. This need is fulfilled either by jobs that run throughout the day, or by running online queries directly on the data.

It has probably been a long time since you encountered so many bad practices mentioned side by side in the same paragraph. But if you work in a shop where the legacy reporting infrastructure is exactly that, you know there’s nothing to gain by moaning and groaning. You need to deal with it, preferably in a way that works so well that you gain the time to design the data warehouse of tomorrow, or whatever.

There are things to do right here, right now, and this post is an account of how I dealt with cases like that and lived to brag about it. The necessary condition for this approach to work is that metrics are calculated on incoming events, and that all metrics are additive, which is a common case: total sales, total leads, total conversions, etc.

First of all, let’s identify why the legacy report taxes our machines: it aggregates too many things, too many times. To update today’s metrics, it visits all current rows repeatedly; this morning’s rows will be aggregated again and again until the end of the day. It follows that aggregating fewer things, just once, might make everything lighter. The key to that is to visit only a few new rows at a time, based on a moving cursor.

There are many ways to keep track of a moving cursor, and one of them should be applicable to your case. The simplest case is when the event has a sequential key (typically an integer). Then keeping track of the cursor is simply a matter of storing the last processed key. When there’s no sequential key, there’s bound to be a timestamp of some sort. When timestamps are guaranteed to be strictly monotonically increasing per table, they can be used exactly like a sequential key.
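Just to illustrate the sequential-key variant, here’s a minimal sketch; the table and column names (rpt.EventKeyCursor, dbo.EventRelation and its ID key) are made up for the example:

-- Sequential-key cursor: read the last processed key, grab a batch, move on.
declare @lastKey bigint = (select LastKey from rpt.EventKeyCursor)
declare @batchSize int = 10000

-- Grab the next batch of unprocessed events, in key order.
select top (@batchSize) ev.ID, ev.Tm, ev.Dimension1ID
into #Batch
from dbo.EventRelation ev
where ev.ID > @lastKey
order by ev.ID

-- ... calculate and upsert the metrics from #Batch here ...

-- Move the cursor to the last processed key (no-op on an empty batch).
update rpt.EventKeyCursor
set LastKey = (select isnull(max(ID), @lastKey) from #Batch)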

When there’s no guarantee that timestamps are strictly monotonically increasing, but they are monotonically increasing nonetheless, you have to lag behind the current timestamp just enough to ensure that once you move the cursor past a particular value, no row with that value can ever come again. For example, the granularity of datetime in SQL Server is about 3.33ms, so if you lag behind by one second you are fine.

When timestamps are not even monotonically increasing, but there is some upper limit to how out-of-order they can be (e.g. a few seconds), then you follow the same principle and arrange to lag behind long enough to be sure you will never process an event twice. It is a retroactively bounded temporal relation, look it up.
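In code, the lag just caps the window’s upper bound before you read anything. A minimal sketch of the idea, again with made-up names (rpt.EventTimeCursor holds a single row); @lagSeconds = 1 covers the SQL Server datetime granularity case, and a larger value covers boundedly out-of-order arrivals:

declare @lagSeconds int = 1	-- raise this to cover the worst out-of-order arrival
declare @fromTm datetime = (select Tm from rpt.EventTimeCursor)
declare @toTm datetime = dateadd(second, -@lagSeconds, getdate())

if @toTm > @fromTm
begin
	-- Aggregate only the rows in [@fromTm, @toTm) ...
	select count(*) as MetricCnt
	from dbo.EventRelation ev
	where ev.Tm >= @fromTm and ev.Tm < @toTm

	-- ... and only then move the cursor forward.
	update rpt.EventTimeCursor set Tm = @toTm
end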

When you don’t have even that, well, just go ahead and create a new column! Use a sequence, or a datetime column with the current timestamp as its default value. With the latter you have also created a new, interesting dimension in your report, since you now have a bitemporal relation in your hands (look it up, as well).
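Both options in a sketch; the column, constraint and sequence names here are invented for the example:

-- Option 1: a dedicated sequence to impose a processing order.
create sequence dbo.EventIngestSeq as bigint start with 1;
alter table dbo.EventRelation
	add IngestSeq bigint
		constraint DF_EventRelation_IngestSeq
		default (next value for dbo.EventIngestSeq)
		-- existing rows stay NULL; backfill them if the cursor must cover history

-- Option 2: record when the row arrived, independently of the event time.
-- Event time ("valid time") plus arrival time ("transaction time") is what
-- makes the relation bitemporal.
alter table dbo.EventRelation
	add IngestedTm datetime2 not null
		constraint DF_EventRelation_IngestedTm
		default (sysutcdatetime())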

Updating your report data is now a matter of moving your cursor forward to assemble a batch of events (anything from one row upwards is acceptable in the case of additive metrics), calculating the metrics and upserting them into the stored data, grouped by some suitable time interval and any other useful dimension.
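For completeness, here’s a sketch of the two supporting tables the procedure below assumes; adapt names and types to your own schema:

-- One-row table holding the cursor position.
create table rpt.PreparedReport_TimeCursor (
	Tm datetime NOT NULL
)

-- The prepared, pre-aggregated report data.
create table rpt.PreparedReport (
	EventTm datetime NOT NULL
	,Dimension1Id int NOT NULL
	,Dimension2Id int NOT NULL
	,MetricCnt int NOT NULL
	,constraint PK_PreparedReport
		primary key (EventTm, Dimension1Id, Dimension2Id)
)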

Here’s the accompanying code in Transact-SQL.

create procedure rpt.UpdatePreparedReport
as
begin
	set nocount on;

	declare @timeSpanMinutes int = 5	-- size of the aggregation window
	declare @lagMinutes int = 5		-- safety lag behind the current time
	declare @fromDate datetime = (select Tm from rpt.PreparedReport_TimeCursor);
	declare @toDate datetime = dateadd(minute, @timeSpanMinutes, @fromDate);

	-- Never process a window that overlaps the (lagged) present:
	-- rows may still arrive in it.
	if @toDate >= dateadd(minute, -@lagMinutes, getdate())
	begin
		print 'Window overlaps current time, aborting'
		return
	end

	declare @Metrics table (
		EventTm datetime NOT NULL
		,Dimension1Id int
		,Dimension2Id int
		,MetricCnt int
	)

	-- Aggregate only the rows inside the current window,
	-- grouped by day and by the two dimensions.
	;with T as (
		select cast(ev.Tm as date) EventTm
			,d1.ID Dimension1Id
			,d2.ID Dimension2Id
			,count(ev.PurchaseID) MetricCnt
		from dbo.EventRelation ev
			inner join dbo.Dimension1 d1 on d1.ID = ev.Dimension1ID
			inner join dbo.Dimension2 d2 on d2.ID = ev.Dimension2ID
		where @fromDate <= ev.Tm and ev.Tm < @toDate
		group by cast(ev.Tm as date), d1.ID, d2.ID
	)
	insert into @Metrics (EventTm, Dimension1Id, Dimension2Id, MetricCnt)
	select EventTm, Dimension1Id, Dimension2Id, MetricCnt
	from T

	-- Additive metrics: add the batch to existing rows, insert new ones.
	merge rpt.PreparedReport as target
	using @Metrics as source
	on target.Dimension1Id = source.Dimension1Id
		and target.Dimension2Id = source.Dimension2Id
		and target.EventTm = source.EventTm
	when matched then
		update set MetricCnt = target.MetricCnt + source.MetricCnt
	when not matched by target then
		insert (EventTm, Dimension1Id, Dimension2Id, MetricCnt)
		values (source.EventTm, source.Dimension1Id, source.Dimension2Id, source.MetricCnt);

	-- Move the cursor to the end of the processed window.
	update rpt.PreparedReport_TimeCursor set Tm = @toDate
end
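To drive this, schedule the procedure to run frequently, e.g. every minute from a SQL Server Agent job; each run processes at most one window and simply returns while the window still overlaps the lagged present:

exec rpt.UpdatePreparedReport

Depending on how much you trust the job to fail cleanly, you may also want to wrap the MERGE and the cursor update in a single transaction, so the cursor can never move past a batch that wasn’t applied.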
