HoSql Type System proof-of-concept

Well, even “nugget”-sized posts have proved to be beyond my ability these days. You see, I’m in a new job, with everything that goes with that. It’s exciting but super-exhausting, invigorating but frustrating, and it would have given me a zillion things to blog about, had I the time. I’m back on the JVM, as you probably gathered from my love affair with Scala. But this post is not about that. This post is, actually, a continuation of the one about Higher-Order SQL, triggered by my discovery of Yeti.

Yeti is a strict language of the ML family (syntactically not close to Standard ML, though), with a characteristic that suits me very much: it supports structures in its typesystem, which is of the Hindley-Milner variety. Which means that some of the work for the “Higher-Order SQL” typesystem has already been done (the unification part — the constraint-satisfaction part I still have to do myself). How does this work? I’ll draw upon Yeti and papers on strongly-typed query languages to try and answer that.

Yeti supports structures, otherwise known as records. Staying informal, a structure type is a collection of labeled types. For example, the type of {bar="3", foo=3}, in Yeti, is {bar is string, foo is number} (that’s actually the way the REPL shows it). If you read how structures behave in Yeti, it will probably remind you of HList in Haskell (at least if you know as little Haskell as I do). BTW, even though Scala is not a functional-first language, its malleability has given opportunities for lots of functional explorations: check out HList in Shapeless by Miles Sabin, a fellow described by Dick Wall as “not being comfortable unless he’s programming with his hair on fire”. Here, I managed to slip a bit of Scala in, again.

We can use structure types to simulate relation types, if we forget, for a moment, that relations are sets of structures. Enclosing everything inside a set wouldn’t add anything to our discussion.
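
If it helps to see the same idea in a language from this blog’s usual rotation, here is a loose F# sketch of mine (F# anonymous records have no row polymorphism, so the input type has to spell out all of its fields, unlike Yeti’s {.a is 'a, ...} types):

type XRow = {| a: int; b: string; c: int |}

// “select x.a, x.b from x” as a function on a single structure:
let project (x: XRow) = {| a = x.a; b = x.b |}

let projected = project {| a = 0; b = ""; c = 0 |}
// projected : {| a: int; b: string |}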

Let’s compile a cookbook of HoSQL expressions together with static types and the Yeti expressions I used to coerce the compiler to perform the needed unifications. Newlines cannot be used in the REPL, nor are they always produced in the output, but I’ll insert some manually to wrap the expressions conveniently.

Projection. In HoSQL, unlike SQL, you don’t always have to use projection. But Yeti does not have an empty structure type, so I’ll skip over the bare from x expression, where x is completely unconstrained; it would just result in an unconstrained type. Apart from that, the dot-star shortcut can only be used as a syntactic shortcut for known relations, and is not part of HoSQL; I never intended HoSQL to represent syntactic abstractions.

HoSQL:

select x.a, x.b from x

Type:

{.a is 'a, .b is 'b} -> {a is 'a, b is 'b}

Ignore the dots; dots just mean inferred fields, while their absence means provided ones. In our case, one might interpret the dots as meaning that the input might have other fields, which we care not about, while the output is just what we inferred.

Yeti session:

> do x: {a=x.a, b=x.b} done
<code$> is {.a is 'a, .b is 'b} -> {a is 'a, b is 'b}

Unifying this with an actual relation produces a more specific type. I’m using a structure literal to generate the intended type, here, and this is the equivalent of a first-order term, which can be used to generate SQL (the actual value shown here, of course, is immaterial).

Type:

{a is number, b is string}

Yeti session:

> do x: {a=x.a, b=x.b} done { a=0, b="", c=0 }
{a=0, b=""} is {a is number, b is string}

Selection.

HoSQL:

from x where x.a>3 and length(x.b)>4

Type:

{.a is number, .b is string} -> {a is number, b is string}

Yeti session:

> do x: if x.a==0 and x.b=="" 
> then {a=x.a, b=x.b} else {a=x.a, b=x.b} fi done
<code$> is {.a is number, .b is string} -> {a is number, b is string}

The specific expressions are immaterial, inasmuch as we just care that they constrain the types appropriately.

Union. The good part begins here.

HoSQL:

select x.a, x.b from x
union
select y.a, y.b from y

Type:

{.a is 'a, .b is 'b} -> {.a is 'a, .b is 'b} -> {a is 'a, b is 'b}

Yeti session:

> do x y: if x.a==y.a and x.b==y.b 
> then {a=x.a, b=x.b} else {a=y.a, b=y.b} fi done
<code$> is {.a is 'a, .b is 'b} -> {.a is 'a, .b is 'b}
 -> {a is 'a, b is 'b}

Join. And the good part continues.

HoSQL:

x inner join y on x.a = y.a

Type:

('b is {.a is 'a}) -> {.a is 'a} -> {x is 'b, y is {.a is 'a}}

Yeti session:

> do x y: if x.a==y.a then {x, y} else {x, y} fi done
<code$> is ('b is {.a is 'a}) -> {.a is 'a} -> {x is 'b, y is {.a is 'a}}

Recursive CTE. But the best part is here.

HoSQL:

let t = 
 select x.a, x.b from x
 union
 select t.a+1, t.b-1 from t where t.b>0
t

Type:

{.a is 'a, .b is number} -> {a is 'a, b is number}

Yeti session:

> f x = if x.b==0 then { a=x.a, b=x.b } else f x fi
f is {.a is 'a, .b is number} -> {a is 'a, b is number} = <code$f>

These are the building blocks for most queries (and you can see how the little I left out would work from the same point of view). This was just a demonstration of real-life unification of structure types at work. I believe that adding value constraints to the typesystem can build on top of that without disturbing the waters. Next time, I hope I can return to the “nuggets” and continue the face-off with a demonstration of mixed-site variance in Kotlin.

Higher-Order SQL

Two years ago, when I started swimming in a sea of SQL at my job at the time, I first came up with the idea of a Transact SQL preprocessor (code-named SQL DoneRight™). Its main benefit was to overcome the myriad of arbitrary shortcomings evil brains have imposed on Transact SQL programming (don’t get me started…) and to add some extra functionality, like hints for caching particular CTEs and special-casing particular stored proc parameters. Thank God, I never got around to doing it. It would have been solving the wrong problem, the wrong way.

Brothers and sisters, I have seen the light! And I will share it with you, so help me Codd! Making Transact SQL, or any other dialect, a better procedural language, although it would come in handy, is not the way to go. The way to go is to abandon the procedural part of SQL, and make it work the way it does best: declaratively and massively.

I have already written about the performance boost from using cardinalities larger than one in your SQL code. I would go so far as proposing that

  • no procedure should take scalar arguments
  • no scalar variable should be declared anywhere
  • all conditionals on data should be folded inside the queries themselves

But this post is not meant to preach my (controversial?) best practices. It is meant to preach what I plan to do about them. I’ll focus on queries for now.

Typed, First-order SQL. First of all, I want my SQL to be typed and implicitly typed at that. I cannot understand why current-day stored procedures are not fully typechecked.

And I want a language where relations are first-class values, as Codd intended. Relations should be freely usable as function parameters, return values and variables (as far as their use as variables goes, think of them as “inline views” or “detached CTEs”). Not only that; all data will be in the form of relations and all operations on data will be done on relations. No scalars will exist outside of row constructors (tuples). In particular, this means that application code will only have the option of feeding in and getting back whole relations. And I want it to compile to plain SQL.

Imagine being able to do the following (I’m not actually proposing a console like that of F# Interactive; that’s just for demonstration). I’m defining foo as a relation-valued function and I’m going to show you how this is different from a table-valued function.

let foo r1 ids =
  let unqids = select unique id from ids
  let r2 = select id, nam from r1 inner join unqids on r1.id = unqids.id
  r2

==> 

foo : T -> rel(id any) -> T 
  where T = rel(id any, nam any)

Don’t read this the wrong way. Values unqids and r2 stand for the relations expressed by the corresponding query expressions. They do not stand for the queries themselves (I don’t propose a language to allow you to do query manipulations here, although the compiler itself will manipulate the queries as needed) nor for table variables (although compiling to and interfacing with plain SQL will inevitably create code using some table variables).

Of course, the programmer should be able to supply typings by hand, as well.

let foo r1 (ids:relation(id int)) =
  let unqids = select unique id from ids
  let r2 = select id, nam from r1 inner join unqids on r1.id = unqids.id
  r2

==> 

foo : T -> rel(id int) -> T
  where T = rel(id any, nam any)

It didn’t gain us much. Remember that it will all compile to SQL, and SQL equality does not constrain the types of its operands to any meaningful degree, so no information was added to the first argument or the return type. But if you type the first argument of the function, type inference should propagate information to the return type. Suppose there’s a table called dbo.Client defined as TABLE(id int, nam varchar(60), phon varchar(15), addr varchar(100)).

let foo r1 ids =
  let unqids = select unique id from ids
  let r2 = select id, nam from r1 inner join unqids on r1.id = unqids.id
  r2

let foo2 = foo dbo.Client

let foo3 = foo (select id, nam from dbo.Client where addr is not null) 

==> 

foo : T -> rel(id any) -> T
  where T = rel(id any, nam any)
foo2 : rel(id any) -> rel(id int, nam varchar(60))
foo3 : rel(id any) -> rel(id int, nam varchar(60))

The typing of the remaining curried argument was not affected, though, for the same reasons as before.

As you can see, I want any expression returning a relation (including the name of a table) to be usable as a parameter. If one were to compile foo2 and foo3 to SQL, it might look like the following.

create type dbo.foo2_ids as table(id int)
go

create procedure dbo.foo2 @ids dbo.foo2_ids readonly as
begin
 with unqids as ( select distinct id from @ids )
 select c.id, c.nam from dbo.Client c inner join unqids on c.id = unqids.id
end
go

create procedure dbo.foo3 @ids dbo.foo2_ids readonly as
begin
 with
 unqids as ( select distinct id from @ids ),
 t as (select id, nam from dbo.Client where addr is not null)
 select t.id, t.nam from t inner join unqids on t.id = unqids.id
end
go

At this stage, functions are little more than typed SQL procedures. The difference from SQL is that partial evaluation folds relation definitions inside functions. Make no mistake: this is not macro expansion.
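
To show the shape of what I mean, here is a hypothetical F# sketch (mine, not the HoSQL compiler; a relation expression is crudely represented as SQL text just to make the folding visible):

type RelExpr = { sql: string }   // stand-in for a typed relation expression

let foo (r1: RelExpr) (ids: RelExpr) : RelExpr =
    { sql = "with unqids as (select distinct id from (" + ids.sql + ") i) "
            + "select r1.id, r1.nam from (" + r1.sql + ") r1 inner join unqids on r1.id = unqids.id" }

let client = { sql = "select id, nam from dbo.Client" }

// Partial application fixes r1: foo2 is now a function of ids alone, and its body
// already contains the dbo.Client relation, ready to be emitted as one procedure.
let foo2 = foo client

The real compiler would manipulate typed relation expressions rather than strings, but the point stands: partially applying foo to dbo.Client yields something that can be folded into a single statement taking only @ids.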

You can also see that the typesystem implied by the examples just presented is not exactly the typesystem of SQL (was it that “any” that gave me away?). Let me go a bit more formal for a moment, to explain what I have in mind.

Relations consist of attributes. The type of an attribute is flagged for nullability (NULL or NOT NULL) and is either:

  • one of the base SQL types, depending on the targeted engine
  • one of a number of increasingly widening union types, e.g. “num” (all numbers), “string” (all strings), “bin”, “eq” (all types supporting equality and comparison), “blob” — I haven’t given it much thought yet; basically all cases that can be inferred from the use of functions or operators that constrain their arguments or operands
  • a relation (obviously not available when the targeted engine is flat relational)
  • the type class “any” (the union type of all attribute types)

Relations are typed using type variables and constraints on them. Type variables are inevitable if we are to capture sharing and refinement of types. Constraints on attributes can be conveniently gathered together in one place, say “rel(attr-constr-list)”, where attr-constr-list is a list of

  • attribute name / attribute type / nullability flag, or
  • pk(attr-list), where attr-list is the list of primary key attributes
  • unq(attr-list), where attr-list is a list of attributes in a UNIQUE constraint
  • chk(expr), where expr is a CHECK constraint

This arrangement, which brings together all intrinsic constraints, also aligns with the concept of a table type in the underlying engine. The only other kind of relation constraint, for now, is the foreign key constraint fk(R, fk-name), where R is the type of the primary relation and fk-name is the name of the foreign key constraint, as in the following.

rel( attr1 int, attr2 varchar(30), pk(attr1), chk(len(attr2) > 5) ), 
	fk( OtherRel, fk_name ) 
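
To make the description above concrete, here is a hypothetical F# sketch of the shapes involved (the names and the exact factoring are mine, not a committed design):

type AttrType =
    | Base of string          // a base SQL type of the target engine, e.g. "int" or "varchar(30)"
    | Union of string         // a widening union type such as "num", "string", "eq"
    | Relation of RelType     // relation-valued attribute (nested engines only)
    | Any                     // the union of all attribute types

and Nullability = AllowsNull | NotNull

and AttrConstraint =
    | Attr of string * AttrType * Nullability   // attribute name / attribute type / nullability
    | Pk of string list                         // pk(attr-list)
    | Unq of string list                        // unq(attr-list)
    | Chk of string                             // chk(expr), with the CHECK expression kept as text here

and RelConstraint =
    | Fk of RelType * string                    // fk(R, fk-name)

and RelType = { attrs: AttrConstraint list; others: RelConstraint list }

// The example just given, rel(attr1 int, attr2 varchar(30), pk(attr1), chk(len(attr2) > 5)), fk(OtherRel, fk_name):
let otherRel = { attrs = [ Attr ("attr1", Base "int", NotNull) ]; others = [] }
let example =
    { attrs =
        [ Attr ("attr1", Base "int", NotNull)
          Attr ("attr2", Base "varchar(30)", NotNull)
          Pk [ "attr1" ]
          Chk "len(attr2) > 5" ]
      others = [ Fk (otherRel, "fk_name") ] }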

I want the primary key to be usable inside the queries, in the form of r.pk, denoting the tuple of the primary key columns, able to participate in equalities and select lists (the way r.* does in select lists).

I want foreign keys to also be usable by naming them, hence the presence of fk-name, although I am thinking of a more explicit syntax for joins on foreign keys, which will make them inferrable more easily. Something like

foreign key join on fk_name

Note that the definition of relations is recursive, when the underlying engine contains relation-valued attributes, but a particular relation definition is not allowed to be recursive (or mutually recursive). Even nested relational databases do not allow that, and it will be a while until I contemplate dealing with graph databases (but you bet I will, at some point!).

The type of a function, now, is formed using the familiar arrow type constructor (yes, functions are curried and why should they not be?). Remember that I don’t want scalar-valued functions (you can write them in most SQL dialects directly), so all function arguments are constrained to be refinements of “rel()”, up to this point in the post. You have also seen that function definitions are nothing but incremental relation definitions (like you do using WITH in SQL).

You may say I’m a dreamer, but I’m not the only one. I know this typesystem is very ambitious, since it implies that types are richer than usual and also carry some theorems with them (remember my post on Dependent Types?). But I don’t plan to incorporate a full theorem prover at this point, just to infer some basic theorems from the relational operations used. I just want this aging language to become a little more intelligent.

Higher-order SQL. Second, I want functions to also be first-class values, which means that function arguments can also be other functions. During my work on Pandamator, I missed being able to express e.g. coalescing as a function taking a relation and returning a relation. Well, here it is (just imagine that Pandamator is the underlying engine and the “star” operator excludes the special timestamp columns).

let temporal_coalesce sourcerelation =
  let CoalAux =
	select p2.*, ValidFromDate, ValidToDate from sourcerelation p2
	where not exists(select 0 from sourcerelation p1 
		where p1.ValidToDate = p2.ValidFromDate
		and p1.pk = p2.pk
	)

	union all

	select p1.*, p1.ValidFromDate, p2.ValidToDate
	from CoalAux p1
	inner join sourcerelation p2 on p1.ValidToDate = p2.ValidFromDate
		and p1.pk = p2.pk

  let Coal = 
	select p1.*, p1.ValidFromDate, max(p1.ValidToDate)
	from CoalAux p1 
	group by p1.*, p1.ValidFromDate
  Coal

==>

temporal_coalesce : T -> T 
  where T = rel(ValidFromDate eq, ValidToDate eq)

I want that, but I also want this to be useful in practice, so I want to be able to compile everything that is first-order into plain SQL. I want to be able to compile any first-order function to an SQL-92 stored procedure or table-valued function (defunctionalization of higher-order functions).
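
Defunctionalization deserves a tiny illustration. The following is a generic F# sketch of mine, not the planned compiler: the function-valued parameter is replaced by a data type that enumerates the transformers actually used in the program, plus a first-order apply, and each resulting case is something that can be compiled to plain SQL on its own.

type RelExpr = { sql: string }                  // the same crude stand-in for a relation expression

// Higher-order: takes a relation transformer as an argument.
let twice (f: RelExpr -> RelExpr) (r: RelExpr) = f (f r)

// Defunctionalized: the possible transformers become data...
type Transformer =
    | TemporalCoalesce
    | KeepNonNullAddr

// ...and application becomes an ordinary first-order function.
let apply (t: Transformer) (r: RelExpr) =
    match t with
    | TemporalCoalesce -> { sql = "/* temporal_coalesce */ " + r.sql }
    | KeepNonNullAddr  -> { sql = "select * from (" + r.sql + ") r where addr is not null" }

// The first-order equivalent of twice, ready to become a stored procedure per Transformer case.
let twice' (t: Transformer) (r: RelExpr) = apply t (apply t r)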

This functionality allows one to define and use higher-order functions as if they were typed macros. Obviously I cannot promise I can sort out all weird interactions between recursive functions and recursive relations but, at least, I plan to support recursive relations. I’d rather omit recursive functions for now.

I’d rather omit one other thing I had in mind, as well, which was to relax the condition about compiling only first-order expressions to SQL by allowing parameters that are higher-order application code functions. Using these functions within SQL would break calculations into pipelined stages alternating between native code and SQL, and I wanted this pipelining to be transparent. This would necessitate generating code at the application level that handled the pipelining.

Even omitting all that, and even focusing just on composing complex queries, does not make it a piece of cake. It is still hard. But, somehow, I believe that adding back what will be missing will be easier once the hard part is done. And this first version will still be usable without them.

Not created in a vacuum. I did my homework and I found a number of articles defining type systems for query languages. The problem is that none of those query languages was SQL itself. Nevertheless, it seems like another research topic that has yet to be applied to everyday practice. This is where I come into the picture. Applying useful theory to practice is my harmless vice. This was going to be my summer project (for which I brushed up my TAPL), except I got somewhat sidetracked with another potential project, about which I cannot say more yet (hint: did you notice I included relation-valued attributes in the typesystem?). However, this other project can also benefit from the code for this one, so it’s about time I started both!

I was thinking of naming it HOSql (pronounced in a way that rhymes with Haskell — I have to admit that half the fun in doing a project like that is coming up with the fancy name!) but maybe I’ll stick with SQL DoneRight. Your vote is welcome!

SQL: The data model behind MongoDB

You know that MongoDB is a modern NoSQL database, meaning, among other things, that it is schemaless and its data model is more complex than what SQL can handle. However, in this post I will attempt to show you how SQL is sufficient to serve as a data model behind it. Hard to swallow? Follow me as I go through a list of misconceptions I need to debunk and then as I make my case.

Misconception #1. Let’s consider the claim that NoSQL is schemaless. Taking MongoDB as an example, a database that can answer the following query is clearly not as amorphous as this word implies.

db.somecollection.find( { foo: "foo" }, { bar: 1} )

This query expects all queried documents (not all documents in the collection, though) to have a “foo” field, which (sometimes) takes a string value, and a “bar” field (of whatever type). The net sum of all query/update code touching the database contributes to a definition of what data are supposed to be there. The data themselves, on the other hand, have a specific form, which does or does not correspond to this definition. In any case, this definition is a “schema”, by any other name. Consequently, it is more truthful (and some NoSQL material adheres to this truth) to talk about having flexible or dynamic schemas, instead of being schemaless. And keep in mind that the only flexibility is in extending the schema, as no facility exists for any other modification to it.

Misconception #2. Everyone knows that SQL represents “the relational model”. It might, depending on what you mean by “SQL”. But you can’t possibly mean SQL99, whose data model explicitly encompasses “groups”, “repeating groups” and “nested repeating groups”, i.e. tree data. SQL99 transcends the flat relational model (in other ways, too, which I won’t go into in this post), so confusing SQL as a standard with what vendors sell in the form of SQL is a classic “straw man” argument.

Nested relational models. Nested relational models can be queried using nested relational algebra. There is nothing extraordinary about nested relational algebra or its corresponding SQL. Its only difference from flat relational algebra is that attributes can be relations and, everywhere attributes can appear (in projection and selection), algebraic expressions can be used.

I’m going to use the example from the MongoDB documentation, which is “a collection named bios that contains documents with the following prototype” (note that it’s not a schema!).

{
  "_id" : 1,
  "name" : {
             "first" : "John",
             "last" :"Backus"
           },
  "birth" : ISODate("1924-12-03T05:00:00Z"),
  "death" : ISODate("2007-03-17T04:00:00Z"),
  "contribs" : [ "Fortran", "ALGOL", "Backus-Naur Form", "FP" ],
  "awards" : [
              {
                "award" : "W.W. McDowell Award",
                "year" : 1967,
                "by" : "IEEE Computer Society"
              },
              {
                "award" : "National Medal of Science",
                "year" : 1975,
                "by" : "National Science Foundation"
              },
              {
                "award" : "Turing Award",
                "year" : 1977,
                "by" : "ACM"
              },
              {
                "award" : "Draper Prize",
                "year" : 1993,
                "by" : "National Academy of Engineering"
              }
  ]
}

A query to return each document with awards from 1990 onwards would be the following:

select 
  _id, 
  name, 
  birth, 
  death, 
  contribs, 
  (select * from awards where year >= 1990) as recentawards
from bios;

A query to return only documents having awards from 1990 onwards would be the following:

select * from bios
where exists (select * from awards where year >= 1990);

It looks like normal SQL, doesn’t it? Unfortunately, it is not sufficient for everything you might want to do.

Shuffling information around. Consider what it would take to collect all distinct contribution names from all bios. Something more is needed to lift information from inside bios.contribs. This operation is called Unnest (μ) and its converse is Nest (ν). Unnesting contribs on a projection of bios that contains just _id and contribs results in the following relation.

{
  "_id" : 1,
  "contribution" : "Fortran"
}
{
  "_id" : 1,
  "contribution" : "ALGOL"
}
{
  "_id" : 1,
  "contribution" : "Backus-Naur Form"
}
{
  "_id" : 1,
  "contribution" : "FP"
}

Note that the operation of Unnest can be achieved with a join like SQL Server’s CROSS APPLY, where the right joined table expression is evaluated in the context of each row of the left table, as in the following unnesting of awards.

select
  _id, 
  name, 
  birth, 
  death, 
  contribs, 
  a.award,
  a.year,
  a.by
from bios cross apply bios.awards a

Nesting _id is an operation akin to what GROUP BY does before aggregates are computed.

{
  "ids" : [ 1 ],
  "contribution" : "Fortran"
}
{
  "ids" : [ 1 ],
  "contribution" : "ALGOL"
}
{
  "ids" : [ 1 ],
  "contribution" : "Backus-Naur Form"
}
{
  "ids" : [ 1 ],
  "contribution" : "FP"
}

In fact, the operation of GROUP BY can be defined within this model: Nest using the grouping columns, producing a new relation attribute e.g. TheGroup; compute the aggregates on relation TheGroup, producing a single-row relation attribute e.g. TheAggregates; Unnest TheAggregates, producing a flat relation again.
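
Since Nest and Unnest carry most of the weight here, a small F# sketch of mine may help, modelling a relation as a list of records and a relation-valued attribute as a list-valued field; GROUP BY then falls out as nest, aggregate, flatten:

type Flat = { id: int; contribution: string }
type Nested = { ids: int list; contribution: string }

// Unnest: one flat row per element of the nested attribute.
let unnest (rows: Nested list) : Flat list =
    rows |> List.collect (fun r -> r.ids |> List.map (fun i -> { id = i; contribution = r.contribution }))

// Nest on contribution: gather the remaining attribute into a relation-valued one.
let nest (rows: Flat list) : Nested list =
    rows
    |> List.groupBy (fun r -> r.contribution)
    |> List.map (fun (c, grp) -> { ids = grp |> List.map (fun r -> r.id); contribution = c })

// "GROUP BY contribution with COUNT(*)": nest, compute the aggregate on the nested relation, flatten.
let countPerContribution (rows: Flat list) =
    rows |> nest |> List.map (fun r -> r.contribution, List.length r.ids)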

Verso. There is also another viewpoint I found in a data model called Verso, where restructuring is an operation in its own right. Some restructurings can be shown to preserve all information, others to lose some, but it is a powerful operation that, combined with the rest of nested relational algebra, can express queries that are very cumbersome in flat relational algebra.

The transformation I presented earlier on a projection of bios that contains just _id and contribs would have been a single application of restructuring using the target schema (called a “format” in Verso). The initial format would be written as _id (contribution)*. Restructuring to contribution (_id)* would be done in a single step.

The data model behind MongoDB. Hierarchical databases, insofar as they hold useful data that we need to query, need a way for us to express those queries. Making the conscious decision not to support a full-blown query language inherently, and delegating that task to application code, does not negate the need to be able to define the queries in a concise way. I posit that SQL, in an incarnation that targets hierarchical data, adapted to flexible and untyped schemas and incorporating the rich amount of relevant research, is sufficiently powerful to serve as a data model for hierarchical databases like MongoDB.

Infusing MongoDB with some ACID

In the previous post, I demonstrated the obvious, viz. that in the absence of transactions, concurrent clients can step on each other’s feet. And now we find ourselves at a crossroads: either we admit that there’s a new science of data management whereby all problems have been solved once we denormalized everything into large tree structures, or we accept that there’s still a need for transactional support and we need some way to implement it. Lest I make it a habit to state the obvious, I won’t spend time on this dilemma. Let me just say that denormalizing, if anything, widens the need for transactions, since denormalized data must be constantly synchronized. Notwithstanding thousands of Google hits stating with almost religious fanaticism how everyone who seeks “Old SQL” properties in MongoDB just doesn’t get it.

Let’s revisit the item reordering exercise from the last post, and reformulate it in a slightly more elementary form: a particularly simple way to go about rearranging items in a list, which is swapping the locations of two items. With this formulation, the problem now is that we have multiple agents, each occasionally needing to get exclusive access to two shared resources and do some work, preferably without interfering too much with anyone else. Did I jog your memory? I bet I did. This problem is the Dining Philosophers one, of the venerable Edsger Dijkstra.

I will show you that MongoDB is, indeed, up to the task of handling this problem. I even hope that this will appease the MongoDB fans into flaming me less than usual for this kind of heretic talk. Let me set up the universe of discourse by way of showing you the code to create the data, set, get, test and reset them (yes, data is plural, by the way).

open MongoDB.Bson
open MongoDB.Driver

let mutex = new System.Object()
let global_counter = ref 0
let connectionString = "mongodb://localhost/?safe=true"
let server = MongoServer.Create(connectionString)
let mongo = server.GetDatabase("local")
let items = mongo.GetCollection<BsonDocument>("items")

let create_items () =
    let create_item (n:int) =
        let x = (new BsonDocument()).Add("name",BsonValue.Create(n))
        items.Insert(x) |> ignore
    for i in 1 .. 5 do create_item i
    let all = items.Count
    printfn "%A" all

let checkUniq listarr =
    (listarr |> Set.ofList |> Set.count) = listarr.Length

let check_items () =
    let cur = items.FindAll()
    let enumer = cur.GetEnumerator()
    let rec traverse l =
        if enumer.MoveNext () then traverse (enumer.Current :: l)
        else l
    let all = traverse []
    let allord = all |> List.map (fun d -> d.GetValue("ord").AsInt32)
    checkUniq allord

let get_item (n:int) =
    let query = new QueryDocument((new BsonDocument()).Add("name",BsonValue.Create(n)))
    let doc = items.FindOne(query)
    doc.GetValue("ord").AsInt32

let set_item (n:int) (o:int) =
    let query = new QueryDocument((new BsonDocument()).Add("name",BsonValue.Create(n)))
    let upd = new UpdateDocument((new BsonDocument()).Add("$set",BsonValue.Create((new BsonDocument()).Add("ord", BsonValue.Create(o)))))
    items.Update(query, upd) |> ignore

let reset_item (n:int) =
    let query = new QueryDocument((new BsonDocument()).Add("name",BsonValue.Create(n)))
    let upd = new UpdateDocument((new BsonDocument()).Add("$set",BsonValue.Create((new BsonDocument()).Add("ord", BsonValue.Create(n)).Add("locked", BsonValue.Create(0)))))
    items.Update(query, upd) |> ignore

let reset_items () = for i in 1 .. 5 do reset_item i

Simple enough. Five items, identified by integer names from one to five, holding integer places from one to five.

To make a long story short, I’ll skip the part where a naive solution to the problem leads to potential deadlock and adopt the practice of ordering resources to eliminate that. But let me show you how I implement locking, including the busy-wait loop to fill in for the missing blocking wait.

let r (lockfun:int->bool) (n:int) =
    let rec raux i =
        if i = 200 then failwith "Possible livelock" 
        let b = lockfun n
        if b then i
        else Async.RunSynchronously(Async.Sleep 10) ; raux (i+1)
    let i = raux 1
    lock mutex (fun () -> global_counter := !global_counter + i ; printfn "l %A -> %A" n i)

let simple_lock (n:int) =
    let query = new QueryDocument((new BsonDocument()).Add("name",BsonValue.Create(n)).Add("locked",BsonValue.Create(0)))
    let upd = new UpdateDocument((new BsonDocument()).Add("$set",BsonValue.Create((new BsonDocument()).Add("locked", BsonValue.Create(1)))))
    let wc = items.Update(query, upd)
    wc.DocumentsAffected = (int64)1

let simple_unlock (n:int) =
    let query = new QueryDocument((new BsonDocument()).Add("name",BsonValue.Create(n)))
    let upd = new UpdateDocument((new BsonDocument()).Add("$set",BsonValue.Create((new BsonDocument()).Add("locked", BsonValue.Create(0)))))
    let wc = items.Update(query, upd)
    wc.DocumentsAffected = (int64)1

Each swap operation is implemented by the following code.

let rec swap_simple_lock (ix:int) (iy:int) =
    if ix > iy then
        swap_simple_lock iy ix
    else
        r simple_lock ix
        let oix = get_item ix
        r simple_lock iy
        let oiy = get_item iy
        set_item ix oiy
        set_item iy oix
        simple_unlock iy |> ignore
        simple_unlock ix |> ignore
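
For the record, the naive version I skipped would look roughly like the following (a sketch of mine, using the same helpers): there is no ordering of resources, so two clients swapping the same pair in opposite argument orders can each grab one item and then wait for the other. With the busy-wait in r, this surfaces as the “Possible livelock” failure rather than a hang.

let swap_naive (ix:int) (iy:int) =
    r simple_lock ix                 // lock in argument order: no global ordering of resources
    let oix = get_item ix
    r simple_lock iy                 // client A may hold ix and wait for iy while client B holds iy and waits for ix
    let oiy = get_item iy
    set_item ix oiy
    set_item iy oix
    simple_unlock iy |> ignore
    simple_unlock ix |> ignore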

The code to run a full test is the following.

let worker (swapfun:int->int->unit) cnt =
    let random = new System.Random()
    async {
        for i = 1 to cnt 
            do
                let ix = random.Next(4) + 1
                let iytemp = random.Next(4) + 1
                let iy = if ix <> iytemp then iytemp else (iytemp % 4) + 1
                swapfun ix iy
    }

let run_test swapfun cnt wcnt =
    let s = seq { for i in [1 .. wcnt] -> worker swapfun cnt }
    Async.RunSynchronously (Async.Parallel s)

let cnt = 10
let wcnt = 10
let expected_cnt = 2 * cnt * wcnt
run_test swap_simple_lock cnt wcnt |> ignore
let b = check_items ()
printfn "check = %A" b
printfn "Expected_counter = %A" expected_cnt
printfn "Global_counter = %A" !global_counter

Running the test almost proves that the problem is solved and, for practical purposes, one might also say that it is. However, the only thing we did is push the problem into a corner, albeit a corner similar to the one the “Full SQL” solution pushes it into. Starvation due to livelock is still possible, particularly since the ordering of resources makes some workers less privileged than others. To demonstrate that, I’ll “rig” the test a little.

let rigged_worker ix iy (swapfun:int->int->unit) cnt =
    let random = new System.Random()
    async {
        for i = 1 to cnt 
            do
                let flip = random.Next(5)
                let ix = if flip = 0 then 1 else ix
                let iy = if flip = 0 then ix else iy
                swapfun ix iy;
        lock mutex (fun () -> printfn "swapped %A, %A for %A times" ix iy cnt)
    }

let run_rigged_test swapfun cnt wcnt =
    let s = seq { for i in [1 .. wcnt] -> rigged_worker 4 5 swapfun cnt }
    Async.RunSynchronously (Async.Parallel s)

run_rigged_test swap_simple_lock cnt wcnt |> ignore

This set of probabilities makes a livelock appear in almost every run, on my machine. So a livelock is still possible, but is it more likely than when running similar concurrent transactions in SQL? I must say I do not have an authoritative answer to that but, after monitoring systems running many millions of similar concurrent transactions per day, I can attest to only having witnessed one transaction being blocked by another when the latter took more time than expected. Modern RDBMSs include mechanisms to minimize starvation (here’s an article I found to that effect).

So, is this the end of it? I hope not, and I have some ideas on how to implement a solution that goes a bit further in the way of avoiding starvation. Rest assured that, should they come to fruition, you’ll be the first to know.

Missing the A in ACID

The A in ACID stands for “Atomic”, and I’m going to present some experiments about what can go wrong, and how often, if the A is not there. And it’s not there when you’re working with so-called NoSQL databases, in this case MongoDB. A post on MongoDB might seem peculiar, given my evident penchant for high-level abstractions, when the post-I-should-be-writing would have probably been how recursive SQL traverses a non-deterministic automaton, but things have been straying from normal lately. I’ve been downsized from my daytime job and focused more on Flying Donut. Until things settle on a new normal (if unemployment can be called that) I dug up old Flying Donut material to fill the void. Not to mention that it’s one more opportunity to show how well F# fares in such simulation tasks.

The context is the apparently simple action of moving an item from one place in a list to another. Conventional MongoDB wisdom says that, in the absence of atomic transactions, you design so that you have to update whole documents or sub-documents at once. Since I was near the Flying Donut team from the beginning, I learned about an early debate on how to rearrange items of a list, and I decided to run some experiments and find out how often things can go bad when parallel modifications are done without atomicity.

Both experiments are on a linked-list representation. The encoding is terse: nodes are numbered 0 to N and the “next node” function is represented by an array. There are locals numbered from 0 to 1, also stored in an array. You can read the actual code at the end of the post.

The first experiment concerns fine-grained actions that would be used in a typical server-side implementation of moving a node. It uses the following five atomic actions: T0 = X.next, X.next = T0.next, T1 = Y.next, T0.next = T1, Y.next = T0 (T0, T1 are the temps). I simulated all possible interleavings of those five actions between two clients, the first moving a node from 1 to 3 and the second moving a node from 2 to 4. The result is havoc: of the 252 interleavings (the number of ways to interleave two five-action sequences), 60 ended up with a well-formed list, and 192 ended up with loops or unlinked items (finegrainedRun).

The second experiment concerns actions that would be used in a client-side implementation which took advantage of knowledge of the whole state, to avoid extra reads. It makes use of the following actions: X.next = NX.next, NX.next = Y.next, Y.next = NX, implementing the macro move MOVE(x,nx,nnx,y,ny), meaning: Move nx, which is currently between x and nnx, to between y and ny. I thought that providing more information would make things better, but I was wrong. The client-side version is even less resilient than the server-side one, probably because the server-side case could adapt to the current state of the list and recover in some cases. 20 interleavings, all malformed (coarsegrainedRun)…

Conventional MongoDB wisdom seems to win, this time. There’s no substitute for the A in ACID. You have to save the whole list at once, using the only kind of A in MongoDB: a single update operation.

Just kidding. You know I’m no fan of conventional wisdom. I can’t possibly end the post with that remark. Although I do recommend that you follow the conventional wisdom in similar MongoDB problems, I’m going to be the devil’s advocate and show you how it is possible for a number of concurrent processes to move items around in a list, without interference, using enough of the I to make up for the lack of the A. But that has to wait until the next post!

The code

let rec interleave lst1 lst2 =
  match lst1, lst2 with
    | [], ys -> [ys]
    | xs, [] -> [xs]
    | x :: xs, y :: ys ->
        (List.map (fun zs -> x :: zs) (interleave xs (y::ys))) @
        (List.map (fun zs -> y :: zs) (interleave (x::xs) ys))

//let show () =
//    let res = interleave [1;2;3] [11;12;13]
//    res |> List.iter (fun x -> printf "["; x |> List.iter (printf "%O "); printf "]\n")
//    printfn "Done"

//printfn "Interleavings"
//do show ()

type locals = int array

type thelist = int array

type state = State of locals array * thelist

let checkUniq listarr =
    (listarr |> Set.ofArray |> Set.count) = listarr.Length

let showState2 listarr =
    printf "["
    listarr |> Array.iter (printf "%O ")
    printf "] : %O\n" (checkUniq listarr) 

let showState (State (locarr, listarr)) =
    printf "Process0 T0=%O T1=%O, Process1 T0=%O T1=%O, " locarr.[0].[0] locarr.[0].[1] locarr.[1].[0] locarr.[1].[1]
    showState2 listarr 

type movefun = int -> int -> int -> state -> state

// T0 = X.next
let getNextX (x:int) (y:int) (i:int) (State (locarr, listarr)) : state =
    let mylocals = locarr.[i]
    mylocals.[0] <- listarr.[x]
    State (locarr, listarr)
    
// X.next = T0.next
let linkX (x:int) (y:int) (i:int) (State (locarr, listarr)) : state =
    let mylocals = locarr.[i]
    listarr.[x] <- listarr.[mylocals.[0]]
    State (locarr, listarr)

// T1 = Y.next
let getNextY (x:int) (y:int) (i:int) (State (locarr, listarr)) : state =
    let mylocals = locarr.[i]
    mylocals.[1] <- listarr.[y]
    State (locarr, listarr)

// T0.next = T1
let linkNextX (x:int) (y:int) (i:int) (State (locarr, listarr)) : state =
    let mylocals = locarr.[i]
    listarr.[mylocals.[0]] <- mylocals.[1]
    State (locarr, listarr)

// Y.next = T0
let linkY (x:int) (y:int) (i:int) (State (locarr, listarr)) : state =
    let mylocals = locarr.[i]
    listarr.[y] <- mylocals.[0]
    State (locarr, listarr)

let finegrainedUnit () =
    let lst = [|1;2;3;4;5;6|]
    let loc1 = [|0;0|]
    let loc2 = [|0;0|]
    let state = State ([|loc1;loc2|],lst)
    let actions = [ getNextX 1 3 0; getNextX 2 4 1; linkX 1 3 0; linkX 2 4 1; getNextY 1 3 0; getNextY 2 4 1; linkNextX 1 3 0; linkNextX 2 4 1; linkY 1 3 0; linkY 2 4 1 ]
    let finalstate = actions |> List.fold (fun s fn -> fn s) state
    showState state
    ()

printfn "Linked list example"
do finegrainedUnit ()

let applyactions actions = 
    let lst = [|1;2;3;4;5;6|]
    let loc1 = [|0;0|]
    let loc2 = [|0;0|]
    let state = State ([|loc1;loc2|],lst)
    let res = actions |> List.fold (fun s fn -> fn s) state
    state

let finegrainedRun () =
    let actionsseq = [ getNextX ; linkX ; getNextY ; linkNextX ; linkY ]
    let actionsf x y i = actionsseq |> List.map (fun f -> f x y i)
    let actions1 = actionsf 1 3 0
    let actions2 = actionsf 2 4 1
    
    let interleavings = interleave actions1 actions2
    let res = interleavings |> List.map applyactions
    res |> List.iter showState
    ()

printfn "Linked list"
do finegrainedRun ()

type movefun2 = int -> int -> int -> int -> int -> thelist -> thelist

// X.next = NX.next
let linkX2 (x:int) (nx:int) (nnx:int) (y:int) (ny:int) (listarr:thelist) : thelist =
    listarr.[x] <- nnx
    listarr

// NX.next = Y.next
let linkNextX2 (x:int) (nx:int) (nnx:int) (y:int) (ny:int) (listarr:thelist) : thelist =
    listarr.[nx] <- ny
    listarr

// Y.next = NX
let linkY2 (x:int) (nx:int) (nnx:int) (y:int) (ny:int) (listarr:thelist) : thelist =
    listarr.[y] <- nx
    listarr

let coarsegrainedUnit () =
    let lst = [|1;2;3;4;5;6|]
    let loc1 = [|0;0|]
    let loc2 = [|0;0|]
    let actions = [ linkX2 1 2 3 3 4; linkX2 2 3 4 4 5; linkNextX2 1 2 3 3 4; linkNextX2 2 3 4 4 5; linkY2 1 2 3 3 4; linkY2 2 3 4 4 5 ]
    let finalstate = actions |> List.fold (fun s fn -> fn s) lst
    showState2 lst
    ()

printfn "Linked list 2nd set of actions example"
do coarsegrainedUnit ()

let applyactions2 actions = 
    let lst = [|1;2;3;4;5;6|]
    let loc1 = [|0;0|]
    let loc2 = [|0;0|]
    let res = actions |> List.fold (fun s fn -> fn s) lst
    lst

let coarsegrainedRun () =
    let actionsseq = [ linkX2 ; linkNextX2 ; linkY2 ]
    let actionsf x nx nnx y ny = actionsseq |> List.map (fun f -> f x nx nnx y ny)
    let actions1 = actionsf 1 2 3 3 4
    let actions2 = actionsf 2 3 4 4 5
    
    let interleavings = interleave actions1 actions2
    let res = interleavings |> List.map applyactions2
    res |> List.iter showState2
    ()

printfn "Linked list 2nd set of actions"
do coarsegrainedRun ()

printfn "Done"

New Table Constraint Planned for Pandamator, part 2

As I promised in the previous relevant post, I am moving forward with the LTL constraint for Pandamator (as any of you who have been following the iteration on Flying Donut already know). Apart from some integration in the maintenance scripts, all that remains is actually running the Büchi automaton. Which, of course, is not an easy task to do in SQL… That’s why I’m going to stick to the positive side only, and give you an account of what I have done so far.

The actual syntax of the constraint is very close to what I wrote earlier (I used WITH instead of WHERE because it’s already used as a keyword for defining things).

alter table Production.Product 
add validtime constraint weightneverdec 
check 'G(p0 -> G p1)' 
with p0 = weight > 0, p1 = weight > 0

This is a simple constraint for testing, which means that, whenever Product.weight becomes bigger than zero, it never becomes zero again.

SPOT happily produces the following automaton for that expression.

weightneverdec

2 1t
0 1
1 0 -1 & p1 p0
0 0 -1 ! p0
-1
1 0
1 0 -1 p1
-1

The automaton is parsed internally to the following value.

val it : BuchiAutomaton =
  {acccondnum = 1;
   states =
    [|{initial = true;
       transitions =
        [{endstate = 1;
          acccond = [0];
          guard = Bin (And,Prop "p1",Prop "p0");}; {endstate = 0;
                                                    acccond = [0];
                                                    guard = Not (Prop "p0");}];};
      {initial = false;
       transitions = [{endstate = 1;
                       acccond = [0];
                       guard = Prop "p1";}];}|];}

My (simple) parser follows almost the full LBTT definition, except that I assume states are named with consecutive numbers starting at zero, because that is what SPOT produces.

The abstract syntax tree for the automaton is defined by the following types.

type LogicOp = And | Or | Impl | Xor | Equiv

type LogicExpr =
    | Prop of string
    | Const of bool
    | Not of LogicExpr
    | Bin of LogicOp * LogicExpr * LogicExpr

type BuchiAutomatonTransition = {
    endstate: int;
    acccond: int list;
    guard: LogicExpr;
}

type BuchiAutomatonState = {
    initial: bool
    transitions: BuchiAutomatonTransition list;
}

type BuchiAutomaton = {
    acccondnum: int;
    states: BuchiAutomatonState[];
}

A simple translation layer transforms the propositional logic of the guards, together with the definition of the propositions in the constraint, to SQL.

let rec logexp2stringlist (rcl:PropositionDefinition list) (exp:LogicExpr) =
    match exp with
    | Const true -> ["(1=1)"]
    | Const false -> ["(1=0)"]
    | Not e -> ["NOT";"("] @ (logexp2stringlist rcl e) @ [")"]
    | Bin (o, x, y) -> 
        let lx = logexp2stringlist rcl x
        let ly = logexp2stringlist rcl y
        match o with
        | And -> ["(";"("] @ lx @ [")";"and";"("] @ ly @ [")";")"]
        | Or -> ["(";"("] @ lx @ [")";"or";"("] @ ly @ [")";")"]
        | Impl -> logexp2stringlist rcl (Bin(Or,Not(x),y))
        | Equiv -> logexp2stringlist rcl (Bin(Or,Bin(And,x,y),Bin(And,Not(x),Not(y)))) // (x and y) or (not x and not y)
        | Xor -> logexp2stringlist rcl (Bin(And,Bin(Or,x,y),Not(Bin(And,x,y)))) // (x or y) and not(x and y)
    | Prop p -> 
        let rc = rcl |> List.find (fun rc -> rc.name = p)
        "(" :: rc.expression :: [")"]

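As a quick illustration of what the translation produces, here is a hypothetical usage sketch (the PropositionDefinition record is not shown in this post; I am assuming it carries at least the proposition name and its SQL text):

type PropositionDefinition = { name: string; expression: string }   // assumed shape

let props =
    [ { name = "p0"; expression = "weight > 0" }
      { name = "p1"; expression = "weight > 0" } ]

// The guard of the first transition above, "& p1 p0":
let guardSql =
    Bin (And, Prop "p1", Prop "p0")
    |> logexp2stringlist props
    |> String.concat " "
// "( ( ( weight > 0 ) ) and ( ( weight > 0 ) ) )", i.e. the g0_0 column of the view below
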
During ALTER VALIDTIME execution, a view is created to abstract away the exact expressions used for the propositions (as boolean numbers 0/1, since SQL does not recognize logical expressions outside WHERE).

CREATE VIEW [Production].[Product_weightneverdec_GuardView] WITH SCHEMABINDING as
SELECT
 ProductID,
 ValidToDate,
 row_number() over (partition by ProductID order by ValidToDate) AS VersionOrdinal,
 case when ( ( ( weight > 0 ) ) and ( ( weight > 0 ) ) ) then 1 else 0 end AS g0_0,
 case when NOT ( ( weight > 0 ) ) then 1 else 0 end AS g0_1,
 case when ( weight > 0 ) then 1 else 0 end AS g1_0
FROM [Production].[Product]

That’s it. I wish I had done more, but I believe I made good use of my scant free time to do as much as possible. Expressed like that, the view can be used to perform (non-deterministic, i.e. parallel…) runs of the automaton on the actual object histories. Stay tuned, I hope my next post on the subject will be a triumphant conclusion to creating the constraint.
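
To give an idea of what such a run computes, here is an in-memory F# sketch of mine (it deliberately ignores the Büchi acceptance condition, which only matters for infinite runs): fold over one object’s history, where each version supplies a truth value for every proposition, carrying the set of reachable automaton states.

let evalGuard (env: Map<string, bool>) guard =
    let rec go g =
        match g with
        | Const b -> b
        | Prop p -> env.[p]
        | Not e -> not (go e)
        | Bin (And, x, y) -> go x && go y
        | Bin (Or, x, y) -> go x || go y
        | Bin (Impl, x, y) -> not (go x) || go y
        | Bin (Equiv, x, y) -> go x = go y
        | Bin (Xor, x, y) -> go x <> go y
    go guard

// One step: from every reachable state, follow every transition whose guard holds.
let step (ba: BuchiAutomaton) (states: Set<int>) (env: Map<string, bool>) =
    states
    |> Seq.collect (fun s ->
        ba.states.[s].transitions
        |> List.filter (fun t -> evalGuard env t.guard)
        |> List.map (fun t -> t.endstate))
    |> Set.ofSeq

let runAutomaton (ba: BuchiAutomaton) (history: Map<string, bool> list) =
    let initial =
        ba.states
        |> Array.mapi (fun i s -> i, s.initial)
        |> Array.choose (fun (i, isInit) -> if isInit then Some i else None)
        |> Set.ofArray
    history |> List.fold (step ba) initial
// The SQL version has to do the same stepping over the GuardView rows, one step per VersionOrdinal.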

“Lenient Parsing” and why I regret it

Don’t look it up — “Lenient Parsing” is a term of my own. When writing Pandamator, I wanted to make the point that the query transformations I was using did not care which SQL expressions were used for SELECT, WHERE etc. (except when they did; read the footnote). Which is one of my good/bad facets: sometimes I like to push an idea to see how far it goes. In this case it went very far, because I could ignore the content of all expressions and ignore the subtleties of Transact SQL syntax and its typesystem.

“Lenient parsing” was the way I made the point. It consists of not parsing at all. I just amass all tokens and declare it an expression. The side effect was that I had to delimit all statements by semicolons, which I rather preferred. (I actually did that for statements, as well, in an effort to read whole scripts and just ignore anything that was not temporal statements. I have since aimed a little lower.) Anyway, I made my point, proved that I could get away with minimal syntactic transformations, and now “Lenient Parsing” (how on earth I came up with this weird term, I don’t know) gets in my way.

For one thing, I have been wanting to implement something I call (buckle up, because another weird term of my own invention is coming) “assisted nonsequenced queries”, which will be nonsequenced queries with an implicit “join” with the “spans” of the query, the latter being the largest contiguous periods that do not straddle the boundary of any validity period of any object of any relation. Within each such “span”, nothing changes, so these are like extended instants (you can read more on that in “SQL Extension for Interval Data”, Lorentzos, Mitsopoulos). This would allow one to write something akin to IXSQL, but then one would have to perform arithmetic on time, and not having real expressions to manipulate is an impediment.
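
To make the “spans” idea concrete, this is how I would compute them, as a small F# sketch on in-memory data (my own reading of the definition, not anything that exists in Pandamator yet): collect every validity boundary of every object, sort them, and take the periods between consecutive boundaries; within such a period no validity interval starts or ends, so nothing changes.

open System

let spans (validityPeriods: (DateTime * DateTime) list) =
    validityPeriods
    |> List.collect (fun (fromDate, toDate) -> [ fromDate; toDate ])
    |> List.distinct
    |> List.sort
    |> List.pairwise          // consecutive boundaries delimit the spans

// spans [ DateTime(2020,1,1), DateTime(2020,6,1);  DateTime(2020,3,1), DateTime(2021,1,1) ]
// = [ (2020-01-01, 2020-03-01); (2020-03-01, 2020-06-01); (2020-06-01, 2021-01-01) ]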

Secondly, I’ve been wanting to write an extension to SQL ever since I started doing a lot of SQL work. It’s a good thing that I didn’t, though, because what I would have written at the time would have been what I thought I needed for better programming under Transact SQL (my name for that was SQL DoneRight, so that gives you the idea). However, my point of view has changed, and I have become convinced that the way to go is to avoid the procedural, Basic-like language altogether, and go for pure SQL, working on whole relations at once and abolishing scalars. I hope you’re intrigued, but I’m not going to go into details just yet, as I’m preparing a “manifesto” blog post for that. I’m just going to remind you of my post about how to play to the strengths of an RDBMS and avoid shoehorning it into procedural, object-at-a-time computation. This extension will have a full typesystem, which necessitates handling real expressions.

In conclusion, “Lenient Parsing” was a great idea, but it’s now time to let it go and switch to real parsing of expressions, even if that means losing some obscure part of the Transact SQL grammar in the process.

Footnote: Inner subqueries, of course, had to be treated differently from other elementary expressions.