Why You Need NoSql in Your Toolbox

Even if you work for Oracle, you still need NoSql databases in your toolbox.

One size does not fit all for programming languages, operating systems, IDEs, shoes, bailouts, or anything else. But for a long time now, many developers have been told that relational databases are really the only choice for persistence. If you need to store some data, any data, shove it in a relational database. It is the safe choice, and gives us developers one less thing to worry about, right? Well, actually, wrong. Relational databases are great solutions to many problems, but there are better alternatives for others.

Quick overview of NoSql

A simple definition of a NoSql database is a schema-less database that does not support joins or ACID transactions. In a broad simplification, there are 2 types of NoSql databases:

Key/value — A “giant hash table in the sky” that stores key value pairs.

Pros:

  • very fast
  • very scalable
  • simple model
  • many are distributed

Cons: many data structures (objects) can’t be easily modeled as key value pairs.
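
To make the trade-off concrete, here is a minimal sketch in Python that uses a plain dictionary as a stand-in for a key/value store. The `put` and `get` helpers and the `user:42` key format are assumptions for illustration, not the API of any particular database.

```python
import json

# In-memory stand-in for a key/value store: one opaque value per key.
store = {}

def put(key, value):
    # The value is serialized to a string, so the store cannot see inside it.
    store[key] = json.dumps(value)

def get(key):
    raw = store.get(key)
    return None if raw is None else json.loads(raw)

put("user:42", {"name": "Ann", "city": "Denver"})
put("user:43", {"name": "Bob", "city": "Austin"})

# Looking up by key is a single hash lookup.
ann = get("user:42")

# A question about the values ("who lives in Denver?") has no index to use,
# so it has to scan and deserialize every entry.
in_denver = [key for key in store if get(key)["city"] == "Denver"]
```

The lookup by key is where this model shines. The scan at the end is the kind of query that is awkward once your data is more than simple key/value pairs.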

Schema-less — Databases somewhere in-between key/value and relational including: column-based (different rows in same table can have different columns), document-based, and graph-based.

Pros:

  • Schema-less data model is richer than key/value pairs
  • eventual consistency
  • many are distributed
  • still provide excellent performance and scalability

Cons: typically no ACID transactions or joins.

Relational database

Pros:

  • rich data models
  • ACID transactions
  • joins, reductions (group by, sum), ordering
  • lots of apps already integrate with RDBMS
  • mature, commercial support easy to find

Cons:

  • complex
  • object to relational mismatch
  • can be difficult to scale horizontally

Performance and Scalability: Fewer Features = More Speed/Scaling

“Constraints proceed performance.”

– Joe Gregorio, Google

“Constraints proceed performance” really sums up why many of the NoSql databases can be blazingly fast — they implement fewer features and have some significant constraints when compared to a relational database. The key is determining whether you can live with those constraints. If you can, you reap the performance and scalability benefits. But before you become enamored with the allure of more performance, here’s a list of things you’ll probably have to give up:

  • joins
  • group by
  • order by
  • indexes
  • ACID transactions
  • SQL as a sometimes frustrating but still powerful query language
  • easy integration with all of the other applications that support SQL

Yikes. That’s a pretty hefty price should you actually need one or more of those features. However, for many applications that perform CRUD operations on data largely partitioned by user, you might not need as many as you think.

On the scalability side of the equation, many of the NoSql databases were designed to be distributed so they can scale horizontally. Relational databases are not designed to scale horizontally. If your application really needs to scale up, vertically scaling your relational database is only going to get you so far because eventually you won’t be able to buy a big enough box to scale any further. At that point you have to look at sharding your data across multiple database instances, and then you lose many of the benefits of a relational database across instances such as joins, transactions, etc.
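
As a rough illustration of what sharding by user looks like, the sketch below routes each user to one of several database instances with a stable hash. The shard names and the `shard_for` helper are hypothetical, and real systems often use consistent hashing or a lookup table instead of a simple modulo.

```python
import hashlib

# Hypothetical names for four separate database instances.
SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(user_id):
    # Use a stable hash (not Python's per-process randomized hash()) so the
    # same user always maps to the same shard.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

users = ["alice", "bob", "carol", "dave"]

# Group users by the shard that owns their data.
by_shard = {}
for user in users:
    by_shard.setdefault(shard_for(user), []).append(user)
```

Everything for one user lives on one instance, so per-user CRUD stays simple. A query that spans users, such as a join or a transaction touching two users on different shards, now has to be coordinated by the application.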

In addition, to get the performance you need, you may well need to denormalize some of your data to eliminate slow joins. To really scale up, you may also need a distributed caching layer like memcached that requires key/value semantics, at which point you are starting to diverge from the pure relational model. So when making the comparison, it may be fairer to compare the alternative NoSql database with a sharded (data partitioned into multiple databases), partially denormalized relational database that uses a caching layer.

Productivity: Object to Relational Mapping Is Not Fun (Unless You Are Weird)

Object to relational mapping is difficult to implement, difficult to reason about, and easy to get wrong. As a developer, you either have to get in the weeds and write your own queries, or rely on an ORM tool like Hibernate, which is great until something doesn’t work and you have to try to find out why the magic isn’t working anymore.

A lot of effort has been put into solving the object to relational mapping problem, and it still hasn’t been perfected. This is where, depending on your data, a NoSql database can provide an abstraction that maps closer to your object model than relational tables. Column-based stores like BigTable and Cassandra essentially allow different rows in the same table to have different columns. Why is that useful? There are lots of cases where data is optional or slightly different between entities. If I were creating a ratings site, I could rate different things using different criteria, but they could all be stored in the same table. I wouldn’t need a table of “things” with a join table and a table of “criteria”.
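
The ratings example might look something like this sketch, where nested dictionaries stand in for a column-based table. The row keys, column names, and `average_score` helper are made up for illustration and do not reflect the BigTable or Cassandra APIs.

```python
# One "table" keyed by row id; each row carries only the columns it needs.
ratings = {
    "restaurant:luigis": {"type": "restaurant", "food": 4, "service": 5},
    "movie:up": {"type": "movie", "plot": 5, "acting": 4, "visuals": 5},
}

def average_score(row):
    # Every column except "type" is treated as a rating criterion.
    scores = [value for column, value in row.items() if column != "type"]
    return sum(scores) / len(scores)

restaurant_avg = average_score(ratings["restaurant:luigis"])
movie_avg = average_score(ratings["movie:up"])
```

Both kinds of things live in the same table, and adding a new criterion for movies does not require a schema change or touch the restaurant rows.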

On the other hand, why is that dangerous? Obviously the application could get back data it didn’t expect because the schema wasn’t enforced by the database. Document-based stores like CouchDB give you even more flexibility where you store entire documents that can contain anything you want. No longer are you a slave to that crusty database administrator and all of his rules, but on the other hand you could easily hang yourself with all of this new rope. What happens later when you need to run reports on this data and you don’t have a group by or sum function? You can write your own map reduce functions to essentially do the same thing, but you have to do the work.
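
To show what “doing the work” means, here is a small, self-contained Python sketch of a map/reduce that plays the role of `SELECT type, SUM(score) ... GROUP BY type`. The documents and function names are invented, and this is not CouchDB’s view API, just the general shape of the technique.

```python
from collections import defaultdict

docs = [
    {"type": "restaurant", "score": 4},
    {"type": "movie", "score": 5},
    {"type": "restaurant", "score": 5},
    {"type": "movie"},  # no score: nothing enforced a schema
]

def map_doc(doc):
    # Emit (key, value) pairs; skip documents that lack the field.
    if "score" in doc:
        yield doc["type"], doc["score"]

def reduce_values(key, values):
    # Combine all values emitted for one key.
    return sum(values)

def map_reduce(documents, mapper, reducer):
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)  # the "group by" step
    return {key: reducer(key, values) for key, values in grouped.items()}

totals = map_reduce(docs, map_doc, reduce_values)
```

The mapper also has to guard against the document with no score, which is exactly the unenforced-schema risk described above.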

Another point to consider is that many “rapid development” web frameworks like Ruby on Rails, Grails, Django, and Lift already have substantial ORM tools baked in. If you use NoSql for your persistence, you won’t be able to take full advantage of those frameworks. Some of the frameworks are starting to support some NoSql databases, but since NoSql is just a general label and not a specification like SQL, support is one-off. For instance, there is a Grails plugin to provide support for Google App Engine’s version of JPA.

You can’t have it all; you have to decide between the complex features that a relational database supports versus the constraints/benefits that a NoSql database provides. But whether you put 99% of your data in a relational database and only 1% in a NoSql store, or the other way around, both types of databases are likely to be valuable to your business. I like the interpretation of NoSql as “Not Only SQL” meaning both types of databases can and should co-exist within your organization.

Why Now?

Non-relational databases have been around since ENIAC, so why are we hearing so much about them now? My opinion is that there are 2 drivers:

  1. The massive scalability requirements of some websites have exposed one of the weaknesses of relational databases: scalability. Sites like Google, Amazon, Twitter, and Facebook all use NoSql databases for major components of their sites.
  2. A strong open source community means that there are many high-quality, production-ready NoSql databases available. There probably have been hundreds of very good, proprietary non-RDBMS databases created in the last 20 years, but since they are proprietary, they have been used at one company and no one has ever heard of them. Open source has changed that, and now companies like Amazon, Google, and Facebook are contributing incredibly valuable research and software to the community. Research papers such as Amazon’s Dynamo and Google’s BigTable inspired many open source implementations.

This is a fairly recent phenomenon. I’m not aware of any production-ready open source non-relational databases that existed 10 years ago, and if there were, there weren’t many and they weren’t popular. Now there are well over 15 (and probably many more; I don’t claim to know the exhaustive list) production-ready open source NoSql databases. Am I going to list them all? No, because if I do I’ll miss one, and I’ll get a comment saying “You didn’t mention XYZ database, what’s wrong with you?” But I’ll list the ones I’m currently looking at, in no particular order:

  • Voldemort
  • Cassandra
  • Redis
  • CouchDB
  • MongoDB
  • Terrastore

Some links for more reading:

  1. http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html
  2. http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  3. http://horicky.blogspot.com/2009/11/query-processing-for-nosql-db.html
  4. http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/#
  5. http://highscalability.com/blog/2009/10/29/paper-no-relation-the-mixed-blessings-of-non-relational-data.html
  6. http://debasishg.blogspot.com/2009/11/nosql-movement-excited-with-coexistence.html