Introduction
There is a lot of discussions about “NoSQL” out there and a
lot of developers are spending a lot of time advocating it over more
established RDBMS. What is NoSQL? NoSQL is a movement that works on
the premise that “relational databases” don't scale to fit the
web, and are thus mostly obsolete in the web environment.
The NoSQL Argument
Lets try to understand what the problems the “NoSQL” crowd
seek to solve. There seems to be three major focuses of the NoSQL
movement: scalability of storage, rigid schemas, and the
non-scalability of relational data access – “joins don't scale.”
Scalability of Storage
Yes, all systems will eventually hit limitations when you continue
to increase load. That goes from everything from car engines and
generators to software. There is no such thing as infinite in the
real world. All systems implemented in the real world will hit a
limit if the load increases beyond the physical capacity of the
system. The strategy of NoSQL, therefor, is to create a system that
can easily grow beyond any “current” implementation should the
need arise.
Rigid Schemas
The next problem is “rigid schemas.” The main problem is the
requirement of a normalized schema as most “data” does not easily
fit into such a construct.
Relational Data, joins don't scale
Yes, joining from one table to another on two matching columns, in
many cases is less efficient than merely reading from a data store.
Rebuttle
The previous section sums up the core problems the NoSQL movement
wish to solve. They are real issues and any web site archetecture
needs to address them. The focus of this section is to question the
basis on which a NoSQL solution would be chosen over a more
established and mature system.
Scalability
This is the biggest claim of the “NoSQL” systems.
“Scalability” is a difficult problem to quantify. What is it that
you want to scale? If it is just access to static data, a few apache
web servers behind a load balancer on a good internet service will
probably scale to the capacity. That is, of course, not what they are
talking about, but in reality, what NoSQL provides isn't much more
than this.
At issue is the big problem: quantifying what you want to scale.
Are you scaling data storage, like a remote backup service? Are you
scaling small transactions like twitter? Are you scaling mostly page
views like Google? What is your ratio between reads and writes? What
kind of information does your site serve? How coherent must your data
be? Do you need to guarantee data integrity? To have any meaningful
discussion on scalability, you need to quantify what it is you wish
to scale and the limitations imposed by requirements.
There are a number of applications that may benefit from a NoSQL
technology, but it would be difficult to identify those applications
without fully understanding the nature of the application. An RDBMS
such as PostgreSQL, DB2, oracle, or even MySQL is almost always a
better starting point as they provide a host of capabilities and
tools which simply do not exist on even the most mature NoSQL
offerings. As your site grows, RDBMS databases can be scaled with
faster hardware, data partitioning, better caching, and etc.
Rigid Schemas
This claim is both accurate and bogus at the same time for a
number of reasons. There is a dose of reality needed here. Even in
the most “schemaless” database, there is some normalcy and
structure in the way the data is stored, otherwise the system would
not be able to access it. After we accept that there must be some
order and control over the data storage, whether or not we call it a
schema is unimportant. Furthermore, the idea that SQL “schemas”
are inflexible is a little disingenuous. One can add and usually
remove columns in a SQL table.
Beyond the standard table definitions, it seems there is a lack of
imagination or comfort in the NoSQL community when it comes to
databases. One can easily create “generic” key/value tables in a
standard SQL database in which the “value” column may contain
virtually any sort of data, from a filename, to an XML stream, blobs,
or even some user defined types. In NoSQL solutions which do not
enforce a data definition, you can't search by data contained in
“value,” so how is the SQL implementation any different? In
addition, many SQL databases allow you to create an index based on
the results of a function. If the “value” in your key/value table
contains something like XML or JSON, you can create an index from
specific member values extracted from a parser function. PostgreSQL
for instance makes this easy.
Relational Data: Joins Don't Scale.
This is reminiscent of a joke, patient: “Doctor, it hurts when I
do 'this'” doctor: “Then don't do 'that'.” A SQL RDBMS is a
comprehensive set of tools, it does not require that data be 100%
normalized nor does it need your solutions to adhere to any sort of
idealogical discipline. This is engineering, not religion. There are
no imams whom you must obey.
In many ways, this is like the continuing C vs C++ argument. C
code can compile with a C++ compiler with a few changes that are
generally backward compatible and will usually improve the overall
quality of the code. You don't need to use STL, iostreams, or other
“C++” features that you don't like. You can still use printf and
fopen. Straight C type code compiled with a C++ compiler is just as
efficient as straight C code compiled with a C compiler, except that
you now have access to the additional features of C++. C++, like an
RDBMS, is a set of tools and not an ideology.
Just because an RDBMS supports joins, triggers, foreign key
constraints, primary keys, and so on, does not mean that you need to
use them. You design your system for your needs. That's the beauty of
an RDBMS. If you don't want a join, put all your data in one table.
You want a key/value store? Create a key/value table. You still get
all the tools and features of the RDBMS if you need them. In a NoSQL
system, you don't have that option.
Lastly, “joins don't scale.” That isn't always true, in fact,
joins can be more efficient than denormalized data. Take for instance
two tables, “files” and “directories.” In files, the data is
the base file name and extension. In directories is the whole path to
a directory. Since the full path tends to be longer than the file
name and there tends to be many files per directory, it is possible
that there will be less real disk I/O when the data is queried
because the duplicative data in directories only has to be retrieved
once, and the file data ends up being smaller and taking up less
space on disk requiring less I/O in total.
Additional Argument
The NoSQL movement, it seems, is confused. SQL is a language to
access data and it supports an algebraic notation for identifying
specific datum and making correlations between related data. There is
no reason any of the NoSQL offerings could not be fitted with a SQL
front end parser in order to make them easier to use. It would be
limited by the capability of the underlying system, but there is even
an ODBC SQL driver for text files, it would be a sucker's bet to
assume someone was not working on one.
Where, for instance, would the NoSQL crowd be if some enterprising
soul put a SQL parser in front of the Cassandra or the CouchDB
projects? Would they be NoSQL databases with a SQL front end? Then
they wouldn't really be NoSQL databases, would they? The NoSQL crowd
is, in essence, creating nothing more than storage systems with
primitive access APIs and logic. The fact that they claim to “scale”
(for some subset of the definition of scaling) is almost besides the
point.
The last important point is scaling itself. How big do you need to
scale? In what time frame do you need to do it? Architects like to
dream big. Building architects dream of building the tallest
buildings. Software architects dream of building the biggest/best
software. Yes, we all envision a huge site, capable of almost
infinite load, but the reality is a lot less sexy and a lot easier.
99% of the sites on the internet can be handled with a single server.
99.99% of them can be handled with a multi-tier split function
architecture where web servers can be load balanced and logical
databases can be split on functional requirements.
If you want to make the next google or facebook, it will take you
years to get there and cost many many millions of dollars along the
way.