Higher-Order SQL

Two years ago, when I started swimming in a sea of SQL at my job at the time, I first came up with the idea of a Transact SQL preprocessor (code-named SQL DoneRightTM). Its main benefit was to overcome the myriad of arbitrary shortcomings evil brains have imposed on Transact SQL programming (don’t get me started…) and add some extra functionality, like hints for caching particular CTEs and special-casing particular stored proc parameters. Thank God, I never got around to doing it. It would be solving the wrong problem, the wrong way.

Brothers and sisters, I have seen the light! And I will share it with you, so help me Codd! Making Transact SQL, or any other dialect, a better procedural language, although it would come handy, is not the way to go. The way to go is to abandon the procedural part of SQL, and make it work the way it does best: declaratively and massively.

I have already written about the performance boost from using cardinalities larger than one in your SQL code. I would go so far as proposing that

  • no procedure should take scalar arguments
  • no scalar variable should be declared anywhere
  • all conditionals on data should be folded inside the queries themselves

But this post is not meant to preach my (controversial?) best practices. It is meant to preach what I plan to do about them. I’ll focus on queries for now.

Typed, First-order SQL. First of all, I want my SQL to be typed and implicitly typed at that. I cannot understand why current-day stored procedures are not fully typechecked.

And I want a language where relations are first-class values, as Codd intended. Relations should be freely usable as function parameters, return values and variables (as far as their use as variables go, think of them as “inline views” or “detached CTEs”). Not only that; all data will be in the form of relations and all operations on data will be done on relations. No scalars will exist outside of row constructors (tuples). In particular, this means that application code will only have the option of feeding to and getting back whole relations. And I want it to compile to plain SQL.

Imagine being able to do the following (I’m not actually proposing a console like that of F# Interactive, that’s just for demonstration). I’m defining foo as a relation-valued function and I’m going to show you how this is different that a table-valued function.

let foo r1 ids =
  let unqids = select unique id from ids
  let r2 = select id, nam from r1 inner join unqids on r1.id = ids.id
  r2 

==> 

foo : T -> rel(id any) -> T 
  where T = rel(id any, nam any)

Don’t read this the wrong way. Values unqids and r2 stand for the relations expressed by the corresponding query expressions. They do not stand for the queries themselves (I don’t propose a language to allow you to do query manipulations here, although the compiler itself will manipulate the queries as needed) nor for table variables (although compiling to and interfacing with plain SQL will inevitably create code using some table variables).

Of course, the programmer should be able to supply typings by hand, as well.

let foo r1 (ids:relation(id int)) =
  let unqids = select unique id from ids
  let r2 = select id, nam from r1 inner join unqids on r1.id = ids.id
  r2 

==> 

foo : T -> rel(id int) -> T
  where T = rel(id any, nam any)

It didn’t gain us much. Remember that it will all compile to SQL, and SQL equality does not constrain the types of its operands in any meaningful degree, so no information was added to the first argument and the return type. But if you type the first argument of the function, type inference should propagate information to the return type. Suppose there’s a table called dbo.Client defined as TABLE(id int, nam varchar(60), phon varchar(15), addr varchar(100)).

let foo r1 ids =
  let unqids = select unique id from ids
  let r2 = select id, nam from r1 inner join unqids on r1.id = ids.id
  r2 

let foo2 = foo dbo.Client

let foo3 = foo (select id, nam from dbo.Client where addr is not null) 

==> 

foo : T -> rel(id any) -> T
  where T = rel(id any, nam any)
foo2 : rel(id any) -> rel(id int, nam varchar(60))
foo3 : rel(id any) -> rel(id int, nam varchar(60))

The typing of the remaining curried argument was not affected, though, for the same reasons as before.

As you can see, I want any expression returning a relation (including the name of a table) to be usable as a parameter. If one were to compile foo2 and foo3 to SQL, it might look like the following.

create type dbo.foo2_ids as table(id int)
go

create procedure dbo.foo2 @ids foo2_ids as
begin
 with unqids as ( select unique id from @ids )
 select id, nam from dbo.Client inner join unqids on r1.id = ids.id
end
go

create procedure dbo.foo3 @ids foo2_ids as
begin
 with
 unqids as ( select unique id from @ids ),
 t as (select id, nam from dbo.Client where addr is not null)
 select id, nam from t inner join unqids on r1.id = ids.id
end
go

As you can see, at this stage, functions are little more than typed SQL procedures. The difference with SQL is that partial evaluation folds relation definitions inside functions. Make no mistake: this is not macro expansion.

You can also see that the typesystem implied by the examples just presented is not exactly the typesystem of SQL (was it that “any” that gave me away?). Let me go a bit more formal for a moment, to explain what I have in mind.

Relations consist of attributes and the type of an attribute is flagged for nullability (NULL or NOT NULL) and is either:

  • a number of base SQL types, depending on the targeted engine
  • a number of increasingly widening union types e.g. “num” (all numbers), “string” (all strings), “bin”, “eq” (all types supporting equality and comparison), “blob” — I haven’t given it much thought yet, basically all cases that can be inferred by the use of functions or operations that restrain their arguments or operands
  • a relation (obviously not available when the targeted engine is flat relational)
  • type class “any” (union type of all attribute types)

Relations are typed using type variables and constraints on them. The presence of type variables are inevitable to capture sharing and refinement of types. Constraints on attributes can be conveniently gathered together in one place, say “rel(attr-constr-list)”, where attr-constr-list is a list of

  • attribute name / attribute type / nullability flag, or
  • pk(attr-list), where attr-list is the list of primary key attributes
  • unq(attr-list), where attr-list is a list of attributes in a UNIQUE constraint
  • chk(expr), where expr is a CHECK constraint

This arrangement, that brings together all intrinsic constraints, also aligns with the concept of a table type in the underlying engine. Other relation constraints are of the (sole one, for now) type of foreign key constraint fk(R,fk-name), where R is the type of the primary relation and fk-name the name of the foreign key constraint, like in the following.

rel( attr1 int, attr2 varchar(30), pk(attr1), chk(len(attr2) > 5) ), 
	fk( OtherRel, fk_name ) 

I want the primary key to be usable inside the queries, in the form of r.pk, denoting the tuple of the primary key columns, able to participate in equalities and select lists (the way r.* does in select lists).

I want foreign keys to also be usable by naming them, hence the presence of fk-name, although I am thinking of a more explicit syntax for joins on foreign keys, which will make them inferrable more easily. Something like

foreign key join on fk_name

Note that the definition of relations is recursive, when the underlying engine contains relation-valued attributes, but a particular relation definition is not allowed to be recursive (or mutually recursive). Even nested relational databases do not allow that, and it will be a while until I contemplate dealing with graph databases (but you bet I will, at some point!).

The type of a function, now, is formed using the familiar arrow type constructor (yes, functions are curried and why should they not be?). Remember that I don’t want scalar-valued functions (you can write them in most SQL dialects directly), so all function arguments are constrained to be refinements of “rel()”, up to this point in the post. You have also seen that function definitions are nothing but incremental relation definitions (like you do using WITH in SQL).

You may say I’m a dreamer, but I’m not the only one. I know this typesystem is very ambitious, since it implies that types are richer than usual and also carry with them some theorems about them (remember my post on Dependent Types?). But I don’t plan in incorporating a full theorem prover at this point, just infer some basic theorems from the relational operations used. I just want this aging language to become a little more intelligent.

Higher-order SQL. Second, I want functions to also be first-class values, which means that function arguments can also be other functions. During my work on Pandamator, I missed being able to express e.g. coalescing as a function taking a relation and returning a relation. Well, here it is (just imagine that Pandamator is the underlying engine and the “star” operator excludes the special timestamp columns).

let temporal_coalesce sourcerelation =
  let CoalAux =
	select p2.*, ValidFromDate, ValidToDate from sourcerelation p2
	where not exists(select 0 from sourcerelation p1 
		where p1.ValidToDate = p2.ValidFromDate
		and p1.pk = p2.pk
	)

	union all

	select p1.*, p1.ValidFromDate, p2.ValidToDate
	from CoalAux p1
	inner join sourcerelation p2 on p1.ValidToDate = p2.ValidFromDate
		and p1.pk = p2.pk

  let Coal = 
	select p1.*, p1.ValidFromDate, max(p1.ValidToDate)
	from CoalAux p1 
	group by p1.*, p1.ValidFromDate
  Coal

==>

temporal_coalesce : T -> T 
  where T = rel(ValidFromDate eq, ValidToDate eq)

I want that, but I also want this to be useful in practice, so I want to be able to compile everything that is first-order into plain SQL. I want to be able to compile any any first-order function to an Sql92 stored procedure or table-valued function (defunctionalization of higher-order functions).

This functionality allows one to define and use higher-order functions as if they were typed macros. Obviously I cannot promise I can sort out all weird interactions between recursive functions and recursive relations but, at least, I plan to support recursive relations. I’d rather omit recursive functions for now.

I’d rather omit one other thing I had in mind, as well, which was to relax the condition about compiling only first-order expressions to SQL by allowing parameters that are higher-order application code functions. Using these functions within SQL would break calculations into pipelined stages alternating between native code and SQL, and I wanted this pipelining to be transparent. This would necessitate generating code at the application level that handled the pipelining.

Even omitting all that and even focusing just on composing complex queries does not make it a piece of cake. I is still hard. But, somehow, I believe that adding back what will be missing will be easier once the hard part is done. And this first version will still be usable without them.

Not created in a vacuum. I did my homework and I found a number of articles defining type systems for query languages. The problem is that none of those query languages was SQL itself. Nevertheless, it seems like another research topic that has yet to be applied to everyday practice. This is where I come into the picture. Applying useful theory to practice is my harmless vice. This was going to be my summer project (for which I brushed up my TAPL), except I got somewhat sidetracked with another potential project, about which I cannot say more yet (hint: did you notice I included relation-valued attributes in the typesystem?). However, this other project can also benefit from the code for this one, so it’s about time I started both!

I was thinking of naming it HOSql (pronounced in a way that rhymes with Haskell — I have to admit that half the fun in doing a project like that is coming up with the fancy name!) but maybe I’ll stick with SQL DoneRight. Your vote is welcome!