It has probably happened to you, too. A non-technical person comes with specifications for a monthly report, in the form of a bullet-list of column descriptions. All of them seem to be counts of various events but… wait! There is one column that says “Active Customers”, or something to that effect. This post is not how to explain to non-technical people that additive counts and counts that can only be calculated at a particular instant do not mix, or how to arrive at a compromise and agree to a particular instant to calculate those quantities, or how to keep clarifying each month what these numbers mean because they will keep forgetting it. Even though these topics might have been of interest, because one must deal with them at one point or another, this post is more ambitious. I will forgo all these matters and push this question to its limit: what is the real meaning of “Number of X during that month”, if one were to be so cruel as to skip all negotiations and produce a report of exactly that. For simplicity, I will use the “AdventureWorks” sample database as common ground.
The object of discourse will be “number of open WorkOrders”. What could the “number of open WorkOrders” across time possibly mean? The number of open WorkOrders (which are open between their StartDate up until their EndDate) is a constantly shifting number. At a granularity of a day (this is the actual granularity of timestamps in this table), this number possibly changes every day. To describe what the real answer to this question is, I will now introduce some terminology from the field of Temporal Databases and I will use concepts and software from my open-source project in that field, which is Pandamator. You have to use at least the HEAD revision at the time of writing this post, which is rev. 44, because grouping queries without GROUP BY were not handled in earlier releases.
Relations that carry with them validity intervals, like the WorkOrder, are commonly called state relations, in contrast to the relations one uses in a regular relational database, which actually encode the current state of the world and are called snapshot relations.You can think of a state relation as encoding a snapshot relation across time. Or you can think of a snapshot relation as being what a state relation encodes for a particular point in time. This dualism is the basis of sequenced queries, which give you an answer at every instant in time (i.e. a state relation). Even though there are a small subset of the temporal queries one might like to ask, it is a useful subset that has been researched in depth and actually formed the basis of the proposed temporal additions to the SQL Standard (SQL/Temporal).
Pandamator supports sequenced queries (actually just them, for now), so let’s see what this count looks like, as a sequenced query. In fact, there are a number of steps before one can produce an answer, which I document here just in case one might want to follow them on one’s own:
- Pandamator is a bit inflexible as to the naming of the validity period columns, so one has to rename the original columns of the WorkOrder table and make sure validity intervals are not NULL. For simplicity, and because I did not plan to run any update statements on the table, I just added two calculated columns with the expected names: ValidFromDate and ValidToDate (the latter being
case when [EndDate] IS NULL then '3000-01-01' else [EndDate] end).
- Using the Query Express for Pandamator (branched off the Query Express project), one has to add temporal support to the AdventureWorks database (there’s a button for that).
- One then has to declare the WorkOrder table as a state relation, with the following command:
alter table Production.WorkOrder add validtime(STATE);
We are now ready to count WorkOrders. The query is:
validtime select count(*) from Production.WorkOrder;
Running this query produces a state relation with the result:
It does take more than half a minute on my machine but, then again, what it computes in not so simple, when translated into plain SQL (you can see it in the “Messages” tab):
WITH spanboundaries(TheDate) AS ( (SELECT ValidFromDate AS TheDate FROM Production.WorkOrder) union (SELECT ValidToDate AS TheDate FROM Production.WorkOrder) ), spans(ValidFromDate,ValidToDate) AS ( SELECT a.TheDate AS ValidFromDate,b.TheDate AS ValidToDate FROM spanboundaries AS a INNER JOIN spanboundaries AS b ON a.TheDate<b.TheDate and not exists(SELECT 0 FROM spanboundaries AS c WHERE a.TheDate<c.TheDate and c.TheDate<b.TheDate) ), R(t0,ValidFromDate,ValidToDate) AS ( SELECT count(*)AS t0,spans.ValidFromDate AS ValidFromDate,spans.ValidToDate AS ValidToDate FROM spans INNER JOIN Production.WorkOrder ON WorkOrder.ValidFromDate<=spans.ValidFromDate and spans.ValidToDate<=WorkOrder.ValidToDate GROUP BY spans.ValidFromDate,spans.ValidToDate ), CoalescedAux_R(t0,ValidFromDate,ValidToDate) AS ( (SELECT p2.t0 AS t0,p2.ValidFromDate AS ValidFromDate,p2.ValidToDate AS ValidToDate FROM R AS p2 WHERE not exists(SELECT 0 FROM R AS p1 WHERE(p1.t0=p2.t0 or coalesce(p1.t0,p2.t0) is null)and p1.ValidToDate = p2.ValidFromDate)) union all (SELECT p2.t0 AS t0,p1.ValidFromDate AS ValidFromDate,p2.ValidToDate AS ValidToDate FROM CoalescedAux_R AS p1 INNER JOIN R AS p2 ON p1.ValidToDate = p2.ValidFromDate and(p1.t0=p2.t0 or coalesce(p1.t0,p2.t0) is null)) ), Coalesced_R(t0,ValidFromDate,ValidToDate) AS ( SELECT p1.t0 AS t0,p1.ValidFromDate AS ValidFromDate,max(p1.ValidToDate)AS ValidToDate FROM CoalescedAux_R AS p1 GROUP BY t0,ValidFromDate ) SELECT Coalesced_R.t0 AS t0,Coalesced_R.ValidFromDate AS ValidFromDate,Coalesced_R.ValidToDate AS ValidToDate FROM Coalesced_R ORDER BY ValidFromDate,ValidToDate OPTION(MAXRECURSION 0);
Pretty scary, huh? You can read about the translation in the docs but, let’s go on and verify some of the values, to make sure this complicated query actually does what it promises.
Here are three queries to produce the numbers during the last three validity periods depicted:
SELECT count(*) from Production.WorkOrder where not (EndDate <= '2004-07-13' or StartDate >='2004-07-14') SELECT count(*) from Production.WorkOrder where not (EndDate <= '2004-07-14' or StartDate >='2004-07-16') SELECT count(*) from Production.WorkOrder where not (EndDate <= '2004-07-16' or StartDate >='2004-07-18')
Spend some time verifying the condition, to convince yourself that it does capture open WorkOrders. The numbers produced for these three periods are, indeed, 394, 210 and 138.
Next time one comes up with a report across time that asks what is essentially a snapshot question, instead of explaining the fallacy, you can take some comfort with the mental image of you, throwing the results of a sequenced query in their lap and leaving them to drown. I guess it suits them well!