What was the number of X during that month?

It has probably happened to you, too. A non-technical person comes with specifications for a monthly report, in the form of a bullet-list of column descriptions. All of them seem to be counts of various events but… wait! There is one column that says “Active Customers”, or something to that effect. This post is not how to explain to non-technical people that additive counts and counts that can only be calculated at a particular instant do not mix, or how to arrive at a compromise and agree to a particular instant to calculate those quantities, or how to keep clarifying each month what these numbers mean because they will keep forgetting it. Even though these topics might have been of interest, because one must deal with them at one point or another, this post is more ambitious. I will forgo all these matters and push this question to its limit: what is the real meaning of “Number of X during that month”, if one were to be so cruel as to skip all negotiations and produce a report of exactly that. For simplicity, I will use the “AdventureWorks” sample database as common ground.

The object of discourse will be “number of open WorkOrders”. What could the “number of open WorkOrders” across time possibly mean? The number of open WorkOrders (which are open between their StartDate up until their EndDate) is a constantly shifting number. At a granularity of a day (this is the actual granularity of timestamps in this table), this number possibly changes every day. To describe what the real answer to this question is, I will now introduce some terminology from the field of Temporal Databases and I will use concepts and software from my open-source project in that field, which is Pandamator. You have to use at least the HEAD revision at the time of writing this post, which is rev. 44, because grouping queries without GROUP BY were not handled in earlier releases.

Relations that carry with them validity intervals, like the WorkOrder, are commonly called state relations, in contrast to the relations one uses in a regular relational database, which actually encode the current state of the world and are called snapshot relations.You can think of a state relation as encoding a snapshot relation across time. Or you can think of a snapshot relation as being what a state relation encodes for a particular point in time. This dualism is the basis of  sequenced queries, which give you an answer at every instant in time (i.e. a state relation). Even though there are a small subset of the temporal queries one might like to ask, it is a useful subset that has been researched in depth and actually formed the basis of the proposed temporal additions to the SQL Standard (SQL/Temporal).

Pandamator supports sequenced queries (actually just them, for now), so let’s see what this count looks like, as a sequenced query. In fact, there are a number of steps before one can produce an answer, which I document here just in case one might want to follow them on one’s own:

  • Pandamator is a bit inflexible as to the naming of the validity period columns, so one has to rename the original columns of the WorkOrder table and make sure validity intervals are not NULL. For simplicity, and because I did not plan to run any update statements on the table, I just added two calculated columns with the expected names: ValidFromDate and ValidToDate (the latter being case when [EndDate] IS NULL then '3000-01-01' else [EndDate] end).
  • Using the Query Express for Pandamator (branched off the Query Express project), one has to add temporal support to the AdventureWorks database (there’s a button for that).
  • One then has to declare the WorkOrder table as a state relation, with the following command:
alter table Production.WorkOrder add validtime(STATE);

We are now ready to count WorkOrders. The query is:

validtime select count(*) from Production.WorkOrder;

Running this query produces a state relation with the result:

AdventureWorksCountQuery

It does take more than half a minute on my machine but, then again, what it computes in not so simple, when translated into plain SQL (you can see it in the “Messages” tab):

WITH spanboundaries(TheDate) AS (
(SELECT ValidFromDate AS TheDate
FROM Production.WorkOrder)
union
(SELECT ValidToDate AS TheDate
FROM Production.WorkOrder)
),
spans(ValidFromDate,ValidToDate) AS (
SELECT a.TheDate AS ValidFromDate,b.TheDate AS ValidToDate
FROM spanboundaries AS a
INNER JOIN spanboundaries AS b ON a.TheDate<b.TheDate and not exists(SELECT 0
FROM spanboundaries AS c
WHERE a.TheDate<c.TheDate and c.TheDate<b.TheDate)
),
R(t0,ValidFromDate,ValidToDate) AS (
SELECT count(*)AS t0,spans.ValidFromDate AS ValidFromDate,spans.ValidToDate AS ValidToDate
FROM spans
INNER JOIN Production.WorkOrder ON WorkOrder.ValidFromDate<=spans.ValidFromDate and spans.ValidToDate<=WorkOrder.ValidToDate
GROUP BY spans.ValidFromDate,spans.ValidToDate
),
CoalescedAux_R(t0,ValidFromDate,ValidToDate) AS (
(SELECT p2.t0 AS t0,p2.ValidFromDate AS ValidFromDate,p2.ValidToDate AS ValidToDate
FROM R AS p2
WHERE not exists(SELECT 0
FROM R AS p1
WHERE(p1.t0=p2.t0 or coalesce(p1.t0,p2.t0) is null)and p1.ValidToDate = p2.ValidFromDate))
union all
(SELECT p2.t0 AS t0,p1.ValidFromDate AS ValidFromDate,p2.ValidToDate AS ValidToDate
FROM CoalescedAux_R AS p1
INNER JOIN R AS p2 ON p1.ValidToDate = p2.ValidFromDate and(p1.t0=p2.t0 or coalesce(p1.t0,p2.t0) is null))
),
Coalesced_R(t0,ValidFromDate,ValidToDate) AS (
SELECT p1.t0 AS t0,p1.ValidFromDate AS ValidFromDate,max(p1.ValidToDate)AS ValidToDate
FROM CoalescedAux_R AS p1
GROUP BY t0,ValidFromDate
)
SELECT Coalesced_R.t0 AS t0,Coalesced_R.ValidFromDate AS ValidFromDate,Coalesced_R.ValidToDate AS ValidToDate
FROM Coalesced_R
ORDER BY ValidFromDate,ValidToDate OPTION(MAXRECURSION 0);

Pretty scary, huh? You can read about the translation in the docs but, let’s go on and verify some of the values, to make sure this complicated query actually does what it promises.
Here are three queries to produce the numbers during the last three validity periods depicted:

SELECT count(*) from Production.WorkOrder
where not (EndDate <= '2004-07-13' or StartDate >='2004-07-14')

SELECT count(*) from Production.WorkOrder
where not (EndDate <= '2004-07-14' or StartDate >='2004-07-16')

SELECT count(*) from Production.WorkOrder
where not (EndDate <= '2004-07-16' or StartDate >='2004-07-18')

Spend some time verifying the condition, to convince yourself that it does capture open WorkOrders. The numbers produced for these three periods are, indeed, 394, 210 and 138.

Next time one comes up with a report across time that asks what is essentially a snapshot question, instead of explaining the fallacy, you can take some comfort with the mental image of you, throwing the results of a sequenced query in their lap and leaving them to drown. I guess it suits them well!

Advertisements

2 thoughts on “What was the number of X during that month?

  1. Pingback: Sequenced Foreign Key | dsouflis

  2. Pingback: What’s in a Validity Interval? | dsouflis

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s