New Table Constraint Planned for Pandamator, part 2

As I promised in the previous relevant post, I am moving forward with the LTL constraint for Pandamator (as any of you who have been following the iteration on Flying Donut already know). Apart from some integration in the maintenance scripts, all that remains is actually running the Büchi automaton. Which, of course, is not an easy task to do in SQL… That’s why I’m going to stick to the positive side only, and give you an account of what I have done so far.

The actual syntax of the constraint is very close to what I wrote earlier (I used WITH instead of WHERE, because WITH is already the keyword used for defining things).

alter table Production.Product 
add validtime constraint weightneverdec 
check 'G(p0 -> G p1)' 
with p0 = weight > 0, p1 = weight > 0

This is a simple constraint for testing, which states that, whenever Product.weight becomes greater than zero, it never drops back to zero (or below) again.
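To make that concrete, here is a hypothetical version history (made-up data of mine, not AdventureWorks) that would violate it: the weight becomes positive in the second version and drops back to zero in the third.

SELECT 1 AS ProductID, '2004-01-01' AS ValidFromDate, '2004-02-01' AS ValidToDate, 0 AS weight -- p0 false: nothing is required yet
UNION ALL
SELECT 1, '2004-02-01', '2004-03-01', 5 -- p0 true: from here on, p1 must hold in every later version
UNION ALL
SELECT 1, '2004-03-01', '3000-01-01', 0; -- p1 false: the constraint is violated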

SPOT happily produces the following automaton for that expression.

weightneverdec

2 1t
0 1
1 0 -1 & p1 p0
0 0 -1 ! p0
-1
1 0
1 0 -1 p1
-1

The automaton is parsed internally to the following value.

val it : BuchiAutomaton =
  {acccondnum = 1;
   states =
    [|{initial = true;
       transitions =
        [{endstate = 1;
          acccond = [0];
          guard = Bin (And,Prop "p1",Prop "p0");}; {endstate = 0;
                                                    acccond = [0];
                                                    guard = Not (Prop "p0");}];};
      {initial = false;
       transitions = [{endstate = 1;
                       acccond = [0];
                       guard = Prop "p1";}];}|];}

My (simple) parser follows almost the full LBTT definition, except that I assume states are named with consecutive numbers starting at zero, because that is what SPOT produces.

The abstract syntax tree for the automaton is defined by the following types.

type LogicOp = And | Or | Impl | Xor | Equiv

type LogicExpr =
    | Prop of string
    | Const of bool
    | Not of LogicExpr
    | Bin of LogicOp * LogicExpr * LogicExpr

type BuchiAutomatonTransition = {
    endstate: int;
    acccond: int list;
    guard: LogicExpr;
}

type BuchiAutomatonState = {
    initial: bool;
    transitions: BuchiAutomatonTransition list;
}

type BuchiAutomaton = {
    acccondnum: int;
    states: BuchiAutomatonState[];
}

A simple translation layer transforms the propositional logic of the guards, together with the definition of the propositions in the constraint, to SQL.

let rec logexp2stringlist (rcl:PropositionDefinition list) (exp:LogicExpr) =
    match exp with
    | Const true -> ["(1=1)"]
    | Const false -> ["(1=0)"]
    | Not e -> ["NOT";"("] @ (logexp2stringlist rcl e) @ [")"]
    | Bin (o, x, y) -> 
        let lx = logexp2stringlist rcl x
        let ly = logexp2stringlist rcl y
        match o with
        | And -> ["(";"("] @ lx @ [")";"and";"("] @ ly @ [")";")"]
        | Or -> ["(";"("] @ lx @ [")";"or";"("] @ ly @ [")";")"]
        | Impl -> logexp2stringlist rcl (Bin(Or,Not(x),y))
        | Equiv -> logexp2stringlist rcl (Bin(Or,Bin(And,x,y),Bin(And,Not(x),Not(y)))) // (x and y) or (not x and not y)
        | Xor -> logexp2stringlist rcl (Bin(And,Bin(Or,x,y),Not(Bin(And,x,y)))) // (x or y) and not(x and y)
    | Prop p -> 
        let rc = rcl |> List.find (fun rc -> rc.name = p)
        "(" :: rc.expression :: [")"]

During ALTER VALIDTIME execution, a view is created to abstract away the exact expressions used for the propositions (as 0/1 values rather than booleans, since SQL does not recognize logical expressions outside WHERE).

CREATE VIEW [Production].[Product_weightneverdec_GuardView] WITH SCHEMABINDING as
SELECT
 ProductID,
 ValidToDate,
 row_number() over (partition by ProductID order by ValidToDate) AS VersionOrdinal,
 case when ( ( ( weight > 0 ) ) and ( ( weight > 0 ) ) ) then 1 else 0 end AS g0_0,
 case when NOT ( ( weight > 0 ) ) then 1 else 0 end AS g0_1,
 case when ( weight > 0 ) then 1 else 0 end AS g1_0
FROM [Production].[Product]

That’s it. I wish I had done more, but I believe I made good use of my scant free time to do as much as possible. Expressed like that, the view can be used to perform (non-deterministic, i.e. parallel…) runs of the automaton on the actual object histories. Stay tuned, I hope my next post on the subject will be a triumphant conclusion to creating the constraint.
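To give a taste of what such a run could look like, here is a rough sketch of one possible approach, assuming the view and guard columns generated above. It is not what Pandamator generates (that part is still to be done); it hand-codes the transition table of the automaton shown earlier and ignores the acceptance conditions and the subtleties of finite histories.

-- The Transitions table hand-encodes the automaton: from state 0, guard g0_0 leads to
-- state 1 and guard g0_1 stays in state 0; from state 1, guard g1_0 stays in state 1.
WITH Transitions (StartState, EndState, GuardColumn) AS (
    SELECT 0, 1, 'g0_0'
    UNION ALL SELECT 0, 0, 'g0_1'
    UNION ALL SELECT 1, 1, 'g1_0'
),
Run (ProductID, VersionOrdinal, State) AS (
    -- every object starts in the initial state, before consuming any version
    SELECT DISTINCT ProductID, CAST(0 AS bigint), 0
    FROM Production.Product_weightneverdec_GuardView
    UNION ALL
    -- follow every transition whose guard evaluates to 1 on the next version
    SELECT g.ProductID, g.VersionOrdinal, t.EndState
    FROM Run AS r
    INNER JOIN Production.Product_weightneverdec_GuardView AS g
        ON g.ProductID = r.ProductID AND g.VersionOrdinal = r.VersionOrdinal + 1
    INNER JOIN Transitions AS t
        ON t.StartState = r.State
       AND 1 = CASE t.GuardColumn WHEN 'g0_0' THEN g.g0_0
                                  WHEN 'g0_1' THEN g.g0_1
                                  ELSE g.g1_0 END
)
SELECT ProductID, VersionOrdinal, State
FROM Run
ORDER BY ProductID, VersionOrdinal OPTION (MAXRECURSION 0);

What exactly counts as a violation (stuck runs, acceptance over the last version, negating the formula first) is precisely the part that remains to be worked out.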

“Lenient Parsing” and why I regret it

Don’t look it up — “Lenient Parsing” is a term of my own. When writing Pandamator, I wanted to make the point that the query transformations I was using did not care what SQL expressions appeared in SELECT, WHERE and so on (except when they did; see the footnote). Which is one of my good/bad facets: sometimes I like to push an idea just to see how far it goes. In this case it went very far, because I could ignore the content of all expressions, along with the subtleties of Transact SQL syntax and its type system.

“Lenient parsing” was the way I made the point. It consists of not parsing at all. I just amass all the tokens and declare the result an expression. The side effect was that I had to delimit all statements by semicolons, which I rather preferred. (I actually did the same for whole statements, in an effort to read entire scripts and just ignore anything that was not a temporal statement. I have since aimed a little lower.) Anyway, I made my point, proved that I could get away with minimal syntactic transformations, and now “Lenient Parsing” (how on earth I came up with this weird term, I don’t know) gets in my way.

For one thing, I have been wanting to implement something I call (buckle up, because another weird term of my own invention is coming) “assisted nonsequenced queries”, which will be nonsequenced queries with an implicit “join” against the “spans” of the query, the spans being the largest contiguous periods during which no validity period of any object in any relation starts or ends. Within each such “span”, nothing changes, so spans behave like extended instants (you can read more on that in “SQL Extension for Interval Data”, Lorentzos, Mitsopoulos). This would allow one to write something akin to IXSQL, but then one would have to perform arithmetic on time, and not having real expressions to manipulate is an impediment.

Secondly, I’ve been wanting to write an extension to SQL ever since I started doing a lot of SQL work. It’s a good thing that I didn’t, though, because what I would have written at the time would have been what I thought I needed for better programming under Transact SQL (my name for it was SQL DoneRight, so that gives you the idea). However, my point of view has changed, and I have become convinced that the way to go is to avoid the procedural, Basic-like language altogether and go for pure SQL, working on whole relations at once and abolishing scalars. I hope you’re intrigued, but I’m not going to go into details just yet, as I’m preparing a “manifesto” blog post for that. I’m just going to remind you of my post about how to play to the strengths of an RDBMS and avoid shoehorning it into procedural, object-at-a-time computation. This extension will have a full type system, which necessitates handling real expressions.

In conclusion, “Lenient Parsing” was a great idea, but it’s now time to let it go and switch to real parsing of expressions, even if that means losing some obscure part of the Transact SQL grammar in the process.

Footnote: Inner subqueries, of course, had to be treated differently from other elementary expressions.

New table constraint planned for Pandamator (Constraints on workflow paths, part 3)

Building upon the thoughts in part 1 and part 2 of this series of posts, I’m planning a new table constraint for Pandamator.

First of all, I have created Pandamator as a public project in the Flying Donut tool. So you’ll be able to keep tabs on me while I’m moving Pandamator forward. While you’re at it, you can register for free and use it for your own projects. Don’t worry if you are not an expert Donut Flyer: a sample project will be created for you to play with when you register. I have great hopes for this tool and I have actually invested in its development. So I’m putting my money where my mouth is, by using it for Pandamator!

You’ll find an iteration (SCRUM terminology) for the new table constraint there, so I’m not going to explain the breakdown into items and tasks. I’m just going to give you an outline of how I plan for it to work. It’s the first big addition that diverges from ATSQL. Here’s what I’m thinking it should look like.

ALTER TABLE T ADD VALIDTIME CONSTRAINT C CHECK('[] p1 -> <> p2')
WHERE p1 = AMOUNT > 100, p2 = POSITION > 2

It’s a variation of the CHECK constraint where the constraint is not a logical expression but a string. This is done so that the constraint itself can be whatever SPOT accepts (LTL and PSL), without interfering with the SQL grammar. The connection to the SQL grammar will be in the WHERE clause that follows this novel CHECK constraint, which provides the meaning of the propositions used inside the LTL/PSL formula as SQL expressions on the object in question. I settled on that as a solution that ensures separation of concerns and a smooth transition between the two domains: SPOT will do its job and create the Büchi automaton, and Pandamator will generate proper SQL for the propositional logic formulas within it, by plugging in the propositions defined in the constraint. Before that, I toyed with grandiose ideas of extending the SQL grammar to add LTL/PSL, but the solution I’m proposing here seems a lot cleaner in the end.

I’ll post an update when this progresses a bit, and you can keep track of it yourselves at the relevant iteration.

What’s in a Validity Interval?

What’s in a Validity Interval? Could a Validity Interval be any different, and would it smell any different? Well, the short answer is: yes. But the long answer is longer than you might think, and to furnish that, I have to start at the very beginning.

First of all, the Validity Interval is just a method of coercing a relational database, which handles plain relational data, into representing sequenced data. By using a Validity Interval, you are actually using a particular encoding to achieve that. If you’d like a fascinating read, check out some of the articles (and the book) by Nikos Lorentzos on a way to think about temporal data (actually, any data having intervals of any kind) that does not even need to step away from the pure relational model.

Having settled on using Validity Intervals, there are still choices to be made. Let’s digress a bit and visit TSQL2 (spec), which is like the Latin of sequenced query languages. TSQL2 evolved while its proponents were pushing for its inclusion in SQL3, but in all its incarnations it had syntax to specify the granularity of time. For example, in its latest incarnation as the proposed SQL/Temporal part of SQL3, the following statement would transform a regular relation into a sequenced relation with valid-time support at a granularity of a day.

ALTER TABLE salary ADD VALIDTIME DAY

Presumably, since granularity can be defined per relation, one could mix different granularities in a database, although I never found an example of that in the TSQL2 material I have looked at. Which makes one think that referencing relations of different granularities in the same query would force it to adopt the finest granularity of all. That thought, and the fact that I needed a granularity of minutes for the application I was considering, made me treat time at the finest granularity possible. Unfortunately, for Sql Server 2005, which I was using at the time, that was the granularity of the DATETIME datatype, which is… fuzzy. Because of the way DATETIMEs are encoded, the granularity is 3–4 ms (“Rounded to increments of .000, .003, or .007 seconds”). And, to be frank, as an engineer I found it more natural to think of time as a continuous, rather than discrete, quantity.
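To see the fuzziness for yourself, here is a small illustration of my own: DATETIME stores time-of-day in 1/300-second ticks, so the milliseconds you write get rounded.

SELECT CAST('2004-07-03 00:00:00.001' AS datetime) AS a, -- rounds down to .000
       CAST('2004-07-03 00:00:00.008' AS datetime) AS b, -- rounds to .007
       CAST('2004-07-03 23:59:59.999' AS datetime) AS c; -- rounds up to midnight of the next day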

It didn’t matter a lot, except that it sort-of forced my hand during the next choice, which was between closed-closed ([a, b]) and closed-open ([a, b)) validity intervals. Not being able to succinctly compute the “next instant” function made it more compelling to use the closed-open interval. There are other advantages as well, explained in “Developing Time-Oriented Database Applications in SQL” (Richard T. Snodgrass, available here), since closed-open interval arithmetic is a little easier. You can easily test whether intervals meet end-to-end: [a, b) meets [b, c), and if you concatenate them you get [a, c).
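In SQL, with the ValidFromDate/ValidToDate columns used in the other posts, the common tests come out clean, with no “+1 granule” adjustments (a small sketch of my own):

SELECT a.WorkOrderID AS a_id, b.WorkOrderID AS b_id
FROM Production.WorkOrder AS a
INNER JOIN Production.WorkOrder AS b
    -- two versions overlap exactly when each one starts before the other ends
    ON a.WorkOrderID < b.WorkOrderID
   AND a.ValidFromDate < b.ValidToDate
   AND b.ValidFromDate < a.ValidToDate

Meeting end-to-end is simply a.ValidToDate = b.ValidFromDate, and the concatenation is [a.ValidFromDate, b.ValidToDate).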

And then there’s the issue of how to represent the end of the last validity period of the versions of an object. There are really two ways: avoid the issue altogether by keeping two separate tables for the historical portion and the current portion of the data (which complicates all queries that need to refer to both), or choose some value to use. NULL, which is not really a value, would be an obvious choice, as it usually stands for “don’t care”, “don’t know”, “not applicable”, or all of the above. However, trying to use NULL in actual calculations makes ORs pop up left and right, which is detrimental to already heavy execution plans. So, I took the way proposed by R. Snodgrass and used a marker value of 3000-01-01, which has the property of being usable in interval arithmetic, because it is a value that falls after all everyday dates. The end of the validity of an object, unless it is deleted from one valid time onwards, is “until the end of time”, and 3000-01-01 plays that role effectively (as a specific value, it is proposed by R. Snodgrass as the largest round date that can be represented in all the RDBMSs he had looked at).
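For instance (my own illustration), asking which WorkOrder versions are valid on a given day works with ordinary comparisons when the open end is 3000-01-01:

SELECT WorkOrderID
FROM Production.WorkOrder
WHERE ValidFromDate <= '2004-07-14' AND '2004-07-14' < ValidToDate

Had the open end been NULL, every such comparison would need an extra OR ValidToDate IS NULL, and those ORs are exactly what weighs the execution plans down.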

By now, you must have realized the (innocent) trick I played in the previous posts (2013-04-16 and 2013-05-14) to avoid introducing complications that were not essential to the points I wanted to make: the ends of the validity periods I defined for the AdventureWorks tables should actually have been one day in the future, since AdventureWorks effectively defines closed-closed intervals at a granularity of a day. I hope you will forgive me…

Sequenced Foreign Key

Last time I wrote about temporal databases, I had used a mostly-unaltered copy of the well-known AdventureWorks database sample. I thought it was best to base any material I present on that, because it is widespread and does not need much introduction. I see now that I will not be able to stick to that much longer because, as far as temporal data are concerned, AdventureWorks is quite limited out of the box. However, I do not have the time right now to prepare any other sample of significant size, so I will continue using it, for the time being, and explain where it is lacking. And one very crucial thing it has that gets in the way of demonstrating temporal databases is autonumbering columns.

Cue the concept of a sequenced Primary Key. Like the Primary Key in a conventional database (called a snapshot database when viewed from within a temporal setting), a sequenced Primary Key defines the identity of the data items (notice I don’t call them rows) and is unique among all items, but this time it is unique at each moment in time. Which means that, in the general case, a single data item can change through time and actually be represented by several rows of the relation (and that’s why I did not call items rows to begin with). Autonumbering columns (IDENTITY columns, as they’re called in Transact SQL) are a decent way to provide automatic surrogate keys in snapshot relations, but they make it impossible to use the table as-is to hold multiple versions of an item, which means I either have to start twiddling more with the AdventureWorks schema (and end up having to create a migration script), or skip the sequenced Primary Key for a while and jump directly to the sequenced Foreign Key.
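In other words (my own sketch, using the validity columns introduced a little further down), sequenced-PK uniqueness means that no two versions of the same ProductID may have overlapping validity periods; any row returned by the following query would be a violation.

SELECT p1.ProductID
FROM Production.Product AS p1
INNER JOIN Production.Product AS p2
    -- a later version of the same object that starts before the earlier one ends
    -- (versions starting at the very same instant are ignored by this simplified test)
    ON p1.ProductID = p2.ProductID
   AND p1.ValidFromDate < p2.ValidFromDate
   AND p2.ValidFromDate < p1.ValidToDate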

The sequenced Foreign Key, as you have probably already guessed, is a reference from one relation to another that is defined for each moment in time. And this means that, time being as vast as it is, one data item (let’s call them objects to simplify things) in the first relation can reference different objects from the second at different moments in time. But, just like its snapshot equivalent, a sequenced foreign key can only reference sequenced primary keys that exist for the moments of time involved.

Transposing this to my mostly-unaltered AdventureWorks database loses a bit of generality, because each object is only defined for a single validity period. Trust me, having multiple, non-coinciding (and possibly non-contiguous) validity intervals in both relations makes everything a little more exciting (read: vastly more complicated), but the essence of the sequenced Foreign Key will shine through, even with this example.

For this example, I’ll “upgrade” the Foreign Key that exists from the Production.WorkOrder relation to the Production.Product relation. I chose to use SellStartDate to SellEndDate as the validity period of the product. That does not mean the creators of the AdventureWorks schema had this interpretation in mind, but it can be used to make my point (and there are no other suitable columns in the table, anyway). After adding the validity period columns to Production.Product and making it a State relation, as I did to Production.WorkOrder in my previous post on temporal databases, I declare the integrity constraints. I actually had to modify Pandamator to allow that, because it used to ignore IDENTITY columns altogether, since they were antithetical to the presence of real sequenced relations.

alter table Production.Product add validtime constraint PK primary key(ProductID);

alter table Production.WorkOrder add validtime constraint PK primary key(WorkOrderID);

alter table Production.WorkOrder add validtime constraint FK_Product foreign key(ProductID) references Production.Product on delete restrict;

After declaring them, I create the temporal artifacts.

alter validtime

Of interest is the procedure Production.ChkFK_Product_WorkOrder_To_Product, which exists to be invoked automatically in the case of modifications that could violate the Foreign Key. Since my computed columns for the validity intervals are not updateable, you can’t actually update anything (Sql Server is smart enough to decline running the update to begin with), but you can trigger the check by doing a dummy update in Sql Management Studio.

UPDATE Production.Product SET Name = Name WHERE ProductID = -1

Just something to invoke the trigger. You’ll get back an error like the following.

Msg 50000, Level 16, State 2, Procedure TR_FK_Product_WorkOrder_To_Product_R, Line 15
Transaction violates sequenced referential constraint for FK_Product(RESTRICT) Production.WorkOrder to Production.Product, e.g. Production.WorkOrder.WorkOrderID=72587 ValidFromDate="Jul  3 2004 12:00AM" ValidToDate="Jul 14 2004 12:00AM"
Msg 3609, Level 16, State 1, Line 1
The transaction ended in the trigger. The batch has been aborted.

What this says is that the foreign key is violated (it is actually invalid to begin with, as Pandamator does not currently check integrity constraints at the time of the alter validtime command), and an example of the violation is the WorkOrder with WorkOrderId=72587 between July 3rd, 2004 and July 14th, 2004, which is the whole validity period for that WorkOrder, if you check it. Which makes sense, since the Product with ProductId=804, the one this WorkOrder references, is only valid between July 1st, 2002 and June 30th, 2003! We have uncovered inconsistencies between the validity intervals of the Products and WorkOrders in the AdventureWorks database! (Don’t take that too literally, as the interpretations I’m using do not actually exist in the original database.)
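If you want to enumerate these inconsistencies yourself, the following is a hand-written approximation of the check for this simplified case (one validity period per object), not the SQL that Pandamator actually generates: it lists every WorkOrder version that is not covered, for its whole validity, by a Product version with the same ProductID.

SELECT w.WorkOrderID, w.ProductID, w.ValidFromDate, w.ValidToDate
FROM Production.WorkOrder AS w
WHERE NOT EXISTS (SELECT 0
    FROM Production.Product AS p
    WHERE p.ProductID = w.ProductID
      AND p.ValidFromDate <= w.ValidFromDate
      AND w.ValidToDate <= p.ValidToDate) -- closed-open containment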

In a different article, I may go into more depth into what the sequenced Foreign Key delete rules mean (no, there are no update rules in Pandamator, as I don’t currently allow updating the sequenced Primary Key). I also promise to spend some time on how the validity interval works in Pandamator, and on how I’ve not been completely honest in interpreting the AdventureWorks datetime columns as validity endpoints.

What was the number of X during that month?

It has probably happened to you, too. A non-technical person comes with specifications for a monthly report, in the form of a bullet-list of column descriptions. All of them seem to be counts of various events but… wait! There is one column that says “Active Customers”, or something to that effect. This post is not about how to explain to non-technical people that additive counts and counts that can only be calculated at a particular instant do not mix, or how to arrive at a compromise and agree on a particular instant at which to calculate those quantities, or how to keep clarifying each month what these numbers mean, because they will keep forgetting. Even though these topics might have been of interest, because one must deal with them at one point or another, this post is more ambitious. I will forgo all these matters and push this question to its limit: what is the real meaning of “Number of X during that month”, if one were to be so cruel as to skip all negotiations and produce a report of exactly that? For simplicity, I will use the “AdventureWorks” sample database as common ground.

The object of discourse will be the “number of open WorkOrders”. What could the “number of open WorkOrders” across time possibly mean? The number of open WorkOrders (which are open from their StartDate up until their EndDate) is a constantly shifting number. At a granularity of a day (the actual granularity of the timestamps in this table), this number can change every day. To describe what the real answer to this question is, I will now introduce some terminology from the field of Temporal Databases, and I will use concepts and software from my open-source project in that field, which is Pandamator. You have to use at least the HEAD revision at the time of writing this post, which is rev. 44, because grouping queries without GROUP BY were not handled in earlier releases.

Relations that carry validity intervals with them, like WorkOrder, are commonly called state relations, in contrast to the relations one uses in a regular relational database, which encode the current state of the world and are called snapshot relations. You can think of a state relation as encoding a snapshot relation across time. Or you can think of a snapshot relation as being what a state relation encodes for a particular point in time. This dualism is the basis of sequenced queries, which give you an answer at every instant in time (i.e. a state relation). Even though sequenced queries are only a small subset of the temporal queries one might like to ask, they are a useful subset that has been researched in depth and actually formed the basis of the proposed temporal additions to the SQL Standard (SQL/Temporal).

Pandamator supports sequenced queries (actually just those, for now), so let’s see what this count looks like as a sequenced query. In fact, there are a number of steps before one can produce an answer, which I document here in case you want to follow along on your own:

  • Pandamator is a bit inflexible as to the naming of the validity period columns, so one has to rename the original columns of the WorkOrder table and make sure the validity intervals are not NULL. For simplicity, and because I did not plan to run any update statements on the table, I just added two calculated columns with the expected names: ValidFromDate and ValidToDate (the latter being case when [EndDate] IS NULL then '3000-01-01' else [EndDate] end; see the sketch after this list).
  • Using the Query Express for Pandamator (branched off the Query Express project), one has to add temporal support to the AdventureWorks database (there’s a button for that).
  • One then has to declare the WorkOrder table as a state relation, with the following command:
alter table Production.WorkOrder add validtime(STATE);
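For reference, here is one way the calculated columns from the first step could be added (my own sketch, not a command taken from the project):

ALTER TABLE Production.WorkOrder ADD
    ValidFromDate AS (StartDate),
    ValidToDate AS (CASE WHEN EndDate IS NULL THEN '3000-01-01' ELSE EndDate END);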

We are now ready to count WorkOrders. The query is:

validtime select count(*) from Production.WorkOrder;

Running this query produces a state relation with the result:

AdventureWorksCountQuery

It does take more than half a minute on my machine but, then again, what it computes is not so simple when translated into plain SQL (you can see it in the “Messages” tab):

WITH spanboundaries(TheDate) AS (
(SELECT ValidFromDate AS TheDate
FROM Production.WorkOrder)
union
(SELECT ValidToDate AS TheDate
FROM Production.WorkOrder)
),
spans(ValidFromDate,ValidToDate) AS (
SELECT a.TheDate AS ValidFromDate,b.TheDate AS ValidToDate
FROM spanboundaries AS a
INNER JOIN spanboundaries AS b ON a.TheDate<b.TheDate and not exists(SELECT 0
FROM spanboundaries AS c
WHERE a.TheDate<c.TheDate and c.TheDate<b.TheDate)
),
R(t0,ValidFromDate,ValidToDate) AS (
SELECT count(*)AS t0,spans.ValidFromDate AS ValidFromDate,spans.ValidToDate AS ValidToDate
FROM spans
INNER JOIN Production.WorkOrder ON WorkOrder.ValidFromDate<=spans.ValidFromDate and spans.ValidToDate<=WorkOrder.ValidToDate
GROUP BY spans.ValidFromDate,spans.ValidToDate
),
CoalescedAux_R(t0,ValidFromDate,ValidToDate) AS (
(SELECT p2.t0 AS t0,p2.ValidFromDate AS ValidFromDate,p2.ValidToDate AS ValidToDate
FROM R AS p2
WHERE not exists(SELECT 0
FROM R AS p1
WHERE(p1.t0=p2.t0 or coalesce(p1.t0,p2.t0) is null)and p1.ValidToDate = p2.ValidFromDate))
union all
(SELECT p2.t0 AS t0,p1.ValidFromDate AS ValidFromDate,p2.ValidToDate AS ValidToDate
FROM CoalescedAux_R AS p1
INNER JOIN R AS p2 ON p1.ValidToDate = p2.ValidFromDate and(p1.t0=p2.t0 or coalesce(p1.t0,p2.t0) is null))
),
Coalesced_R(t0,ValidFromDate,ValidToDate) AS (
SELECT p1.t0 AS t0,p1.ValidFromDate AS ValidFromDate,max(p1.ValidToDate)AS ValidToDate
FROM CoalescedAux_R AS p1
GROUP BY t0,ValidFromDate
)
SELECT Coalesced_R.t0 AS t0,Coalesced_R.ValidFromDate AS ValidFromDate,Coalesced_R.ValidToDate AS ValidToDate
FROM Coalesced_R
ORDER BY ValidFromDate,ValidToDate OPTION(MAXRECURSION 0);

Pretty scary, huh? You can read about the translation in the docs, but let’s go on and verify some of the values, to make sure this complicated query actually does what it promises.
Here are three queries to produce the numbers during the last three validity periods depicted:

SELECT count(*) from Production.WorkOrder
where not (EndDate <= '2004-07-13' or StartDate >='2004-07-14')

SELECT count(*) from Production.WorkOrder
where not (EndDate <= '2004-07-14' or StartDate >='2004-07-16')

SELECT count(*) from Production.WorkOrder
where not (EndDate <= '2004-07-16' or StartDate >='2004-07-18')

Spend some time verifying the condition, to convince yourself that it does capture open WorkOrders. The numbers produced for these three periods are, indeed, 394, 210 and 138.

Next time someone comes up with a report across time that asks what is essentially a snapshot question, instead of explaining the fallacy, you can take some comfort in the mental image of yourself throwing the results of a sequenced query in their lap and leaving them to drown. I guess it suits them well!