Why NoSQL Will Not Kill the RDBMS

  • warning: include(/tmp/fortune.txt): failed to open stream: No such file or directory in /home/mohawksoft/org/www/htdocs/includes/common.inc(1696) : eval()'d code on line 1.
  • warning: include(): Failed opening '/tmp/fortune.txt' for inclusion (include_path='.:/usr/share/php:/usr/share/pear') in /home/mohawksoft/org/www/htdocs/includes/common.inc(1696) : eval()'d code on line 1.

Introduction

There is a lot of discussions about “NoSQL” out there and a lot of developers are spending a lot of time advocating it over more established RDBMS. What is NoSQL? NoSQL is a movement that works on the premise that “relational databases” don't scale to fit the web, and are thus mostly obsolete in the web environment.

The NoSQL Argument

Lets try to understand what the problems the “NoSQL” crowd seek to solve. There seems to be three major focuses of the NoSQL movement: scalability of storage, rigid schemas, and the non-scalability of relational data access – “joins don't scale.”

Scalability of Storage

Yes, all systems will eventually hit limitations when you continue to increase load. That goes from everything from car engines and generators to software. There is no such thing as infinite in the real world. All systems implemented in the real world will hit a limit if the load increases beyond the physical capacity of the system. The strategy of NoSQL, therefor, is to create a system that can easily grow beyond any “current” implementation should the need arise.

Rigid Schemas

The next problem is “rigid schemas.” The main problem is the requirement of a normalized schema as most “data” does not easily fit into such a construct.

Relational Data, joins don't scale

Yes, joining from one table to another on two matching columns, in many cases is less efficient than merely reading from a data store.

Rebuttle

The previous section sums up the core problems the NoSQL movement wish to solve. They are real issues and any web site archetecture needs to address them. The focus of this section is to question the basis on which a NoSQL solution would be chosen over a more established and mature system.

Scalability

This is the biggest claim of the “NoSQL” systems. “Scalability” is a difficult problem to quantify. What is it that you want to scale? If it is just access to static data, a few apache web servers behind a load balancer on a good internet service will probably scale to the capacity. That is, of course, not what they are talking about, but in reality, what NoSQL provides isn't much more than this.

At issue is the big problem: quantifying what you want to scale. Are you scaling data storage, like a remote backup service? Are you scaling small transactions like twitter? Are you scaling mostly page views like Google? What is your ratio between reads and writes? What kind of information does your site serve? How coherent must your data be? Do you need to guarantee data integrity? To have any meaningful discussion on scalability, you need to quantify what it is you wish to scale and the limitations imposed by requirements.

There are a number of applications that may benefit from a NoSQL technology, but it would be difficult to identify those applications without fully understanding the nature of the application. An RDBMS such as PostgreSQL, DB2, oracle, or even MySQL is almost always a better starting point as they provide a host of capabilities and tools which simply do not exist on even the most mature NoSQL offerings. As your site grows, RDBMS databases can be scaled with faster hardware, data partitioning, better caching, and etc.

Rigid Schemas

This claim is both accurate and bogus at the same time for a number of reasons. There is a dose of reality needed here. Even in the most “schemaless” database, there is some normalcy and structure in the way the data is stored, otherwise the system would not be able to access it. After we accept that there must be some order and control over the data storage, whether or not we call it a schema is unimportant. Furthermore, the idea that SQL “schemas” are inflexible is a little disingenuous. One can add and usually remove columns in a SQL table.

Beyond the standard table definitions, it seems there is a lack of imagination or comfort in the NoSQL community when it comes to databases. One can easily create “generic” key/value tables in a standard SQL database in which the “value” column may contain virtually any sort of data, from a filename, to an XML stream, blobs, or even some user defined types. In NoSQL solutions which do not enforce a data definition, you can't search by data contained in “value,” so how is the SQL implementation any different? In addition, many SQL databases allow you to create an index based on the results of a function. If the “value” in your key/value table contains something like XML or JSON, you can create an index from specific member values extracted from a parser function. PostgreSQL for instance makes this easy.

Relational Data: Joins Don't Scale.

This is reminiscent of a joke, patient: “Doctor, it hurts when I do 'this'” doctor: “Then don't do 'that'.” A SQL RDBMS is a comprehensive set of tools, it does not require that data be 100% normalized nor does it need your solutions to adhere to any sort of idealogical discipline. This is engineering, not religion. There are no imams whom you must obey.

In many ways, this is like the continuing C vs C++ argument. C code can compile with a C++ compiler with a few changes that are generally backward compatible and will usually improve the overall quality of the code. You don't need to use STL, iostreams, or other “C++” features that you don't like. You can still use printf and fopen. Straight C type code compiled with a C++ compiler is just as efficient as straight C code compiled with a C compiler, except that you now have access to the additional features of C++. C++, like an RDBMS, is a set of tools and not an ideology.

Just because an RDBMS supports joins, triggers, foreign key constraints, primary keys, and so on, does not mean that you need to use them. You design your system for your needs. That's the beauty of an RDBMS. If you don't want a join, put all your data in one table. You want a key/value store? Create a key/value table. You still get all the tools and features of the RDBMS if you need them. In a NoSQL system, you don't have that option.

Lastly, “joins don't scale.” That isn't always true, in fact, joins can be more efficient than denormalized data. Take for instance two tables, “files” and “directories.” In files, the data is the base file name and extension. In directories is the whole path to a directory. Since the full path tends to be longer than the file name and there tends to be many files per directory, it is possible that there will be less real disk I/O when the data is queried because the duplicative data in directories only has to be retrieved once, and the file data ends up being smaller and taking up less space on disk requiring less I/O in total.

Additional Argument

The NoSQL movement, it seems, is confused. SQL is a language to access data and it supports an algebraic notation for identifying specific datum and making correlations between related data. There is no reason any of the NoSQL offerings could not be fitted with a SQL front end parser in order to make them easier to use. It would be limited by the capability of the underlying system, but there is even an ODBC SQL driver for text files, it would be a sucker's bet to assume someone was not working on one.

Where, for instance, would the NoSQL crowd be if some enterprising soul put a SQL parser in front of the Cassandra or the CouchDB projects? Would they be NoSQL databases with a SQL front end? Then they wouldn't really be NoSQL databases, would they? The NoSQL crowd is, in essence, creating nothing more than storage systems with primitive access APIs and logic. The fact that they claim to “scale” (for some subset of the definition of scaling) is almost besides the point.

The last important point is scaling itself. How big do you need to scale? In what time frame do you need to do it? Architects like to dream big. Building architects dream of building the tallest buildings. Software architects dream of building the biggest/best software. Yes, we all envision a huge site, capable of almost infinite load, but the reality is a lot less sexy and a lot easier. 99% of the sites on the internet can be handled with a single server. 99.99% of them can be handled with a multi-tier split function architecture where web servers can be load balanced and logical databases can be split on functional requirements.

If you want to make the next google or facebook, it will take you years to get there and cost many many millions of dollars along the way.