SQL: The data model behind MongoDB

You know that MongoDB is a modern NoSQL database, meaning, among other things, that it is schemaless and its data model is more complex than what SQL can handle. However, in this post I will attempt to show you how SQL is sufficient to serve as a data model behind it. Hard to swallow? Follow me as I go through a list of misconceptions I need to debunk and then as I make my case.

Misconception #1. Let’s consider the claim that NoSQL is schemaless. Taking MongoDB as an example, clearly a database that can answer the following, is hardly as amorphous as this word implies.

db.somecollection.find( { foo: "foo" }, { bar: 1} )

This query expects all queried documents (not all documents in the collection, though) to have a “foo” field, which (some times) takes a string value, and a “bar” field (of whatever type). The net sum of all query/update code touching the database contributes to a definition of what data are supposed to be there. The data themselves, on the other hand, have a specific form, which does or does not correspond to this definition. In any case, this definition is a “schema”, by any other name. Consequently, it is more truthful (and some NoSQL material adheres to this truth) to talk about having flexible or dynamic schemas, instead of being schemaless. And keep in mind that the only flexibility is in extending the schema, as no facility exists for any other modification to it.

Misconception #2. Everyone knows that SQL represents “the relational model”. It might do, depending on what you mean by “SQL”. But you can’t possibly mean SQL99, whose data model explicitly encompasses “groups”, “repeating groups” and “nested repeating groups”, i.e. tree data. SQL99 transcends the flat relational model (in other ways, also, which I won’t go into in this post) so, confusing SQL as a standard with what vendors sell in the form of SQL is a classic “straw man” argument.

Nested relational models. Nested relational models can be queried using nested relational algebra. There is nothing extraordinary about nested relational algebra or its corresponding SQL. Its only difference with flat relational algebra is that attributes can be relations and, everywhere attributes can appear (in projection and selection), algebraic expressions can be used.

I’m going to use the example from the MongoDB documentation, which is “a collection named bios that contains documents with the following prototype” (note that it’s not a schema!).

{
  "_id" : 1,
  "name" : {
             "first" : "John",
             "last" :"Backus"
           },
  "birth" : ISODate("1924-12-03T05:00:00Z"),
  "death" : ISODate("2007-03-17T04:00:00Z"),
  "contribs" : [ "Fortran", "ALGOL", "Backus-Naur Form", "FP" ],
  "awards" : [
              {
                "award" : "W.W. McDowellAward",
                "year" : 1967,
                "by" : "IEEE Computer Society"
              },
              {
                "award" : "National Medal of Science",
                "year" : 1975,
                "by" : "National Science Foundation"
              },
              {
                "award" : "Turing Award",
                "year" : 1977,
                "by" : "ACM"
              },
              {
                "award" : "Draper Prize",
                "year" : 1993,
                "by" : "National Academy of Engineering"
              }
  ]
}

A query to return each document with awards from 1990 onwards would be the following:

select 
  _id, 
  name, 
  birth, 
  death, 
  contribs, 
  (select * from awards where year >= 1990) as recentawards
from bios;

A query to return only documents having awards from 1990 onwards would be the following:

select * from bios
where exists (select * from awards where year >= 1990);

It looks like normal SQL, doesn’t it? Unfortunately, it is not sufficient for everything you might want to do.

Shuffling information around. Consider what it would take to collect all distinct contribution names from all bios. Something more is needed to lift information from inside bios.contribs. This operation is called Unnest (μ) and its converse is Nest (ν). Unnesting contribs on a projection of bios that contains just _id and contribs, results in the following relation.

{
  "_id" : 1,
  "contribution" : "Fortran"
}
{
  "_id" : 1,
  "contribution" : "ALGOL"
}
{
  "_id" : 1,
  "contribution" : "Backus-Naur Form"
}
{
  "_id" : 1,
  "contribution" : "FP"
}

Note that the operation of Unnest can be achieved with a join like Sql Server’s CROSS APPLY, where the right joined table expression is evaluated in the context of each row of the left table, like in the following unnesting of awards.

select
  _id, 
  name, 
  birth, 
  death, 
  contribs, 
  a.award,
  a.year,
  a.by
from bios cross apply bios.awards a

Nesting _id is an operation akin to what GROUP BY does before aggregates are computed.

{
  "ids" : [ 1 ],
  "contribution" : "Fortran"
}
{
  "ids" : [ 1 ],
  "contribution" : "ALGOL"
}
{
  "ids" : [ 1 ],
  "contribution" : "Backus-Naur Form"
}
{
  "ids" : [ 1 ],
  "contribution" : "FP"
}

In fact, the operation of GROUP BY can be defined within this model: Nest using the grouping columns, producing a new relation attribute e.g. TheGroup; compute the aggregates on relation TheGroup, producing a single-row relation attribute e.g. TheAggregates; Unnest TheAggregates, producing a flat relation again.

Verso. There is also another viewpoint I found in a data model called Verso, where restructuring is a powerful operation in its own right. Some restructurings can be shown to preserve all information, others to lose some, but it’s a powerful operation that, combined with the rest of nested relational algebra, can express queries that are very cumbersome in flat relational algebra.

The transformation I presented earlier on a projection of bios that contains just _id and contribs would have been a single application of restructuring using the target schema (called “format” in verso). The initial “format” would be written as _id (contribution)*. Restructuring to contribution (_id)* would be done in a single step.

The data model behind MongoDB. Hierarchical databases, insofar as they hold useful data that we need to query, need a way for us to be able to express these queries. Making the conscious decision of not inherently supporting a full-blown query language and delegating this task to application code does not negate the need to be able to define the queries in a concise way. I posit that SQL, in its incarnation that targets hierarchical data, adapted to flexible and untyped schemas and by incorporating the rich amount of relevant research, is sufficiently powerful to serve as a data model for hierarchical databases, like MongoDb.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s