Moving Away From Relational Storage

September 08, 2010

Don’t. Fooled you, didn’t I? If you’re already using a relational database, keep using it. If it’s scaling just fine with your hardware and workload, keep using it. If you aren’t running into any complexity problems, keep using it. There’s no reason to change the way you’re storing your data just because you read an article about how BrandNewStartup.com was able to increase uptime and throughput eleventy-four percent by utilizing a new key-value storage solution. That’s like rewriting your core product every time you read an article about how Ruby has excellent metaprogramming features or how Python’s use of significant whitespace can lead to more readable code. These are reasons to choose a language for a new product, but they are not reasons to change the language of an existing project.

If you’re not supposed to move away from relational storage, what is this all about?

When Should I Switch?

When you add new features to an application, or when you rewrite a feature, take a look at what you need. When we needed to cache large query results in a web application, we started using Velocity, which later became AppFabric. There’s some overhead serializing and deserializing objects into the storage mechanism, but could you imagine trying to dump that data to the database? The write overhead would be tremendous! We looked around at products that would work well with our existing application infrastructure, ASP.NET and SQL Server, and chose something that would play well in that garden, Windows Server AppFabric.

Before you start something new, ask yourself questions about what you’re trying to build. Question your assumptions about how the new feature or product will work. Question your assumptions about the existing infrastructure. Make sure that you aren’t shoehorning existing technology into a solution because you are familiar with it. Just because something feels familiar and safe doesn’t mean that it’s the best solution for a problem.

What Questions Should I Ask?

This is the trickiest part. It’s the part that I’ve struggled with and gotten wrong on more than one occasion. It’s okay to screw up; it’s how we learn. Here are the questions I’ve started asking myself when I start a new project or feature that needs additional data storage:

Why are we storing this data?
How much data will I collect in a week? A month? A year?
How long will this data need to live?
How will this data be used?
How structured does this data need to be?
How available does this data need to be?

Why Are We Storing This Data?

You need to understand how you’ll be using the data before you can figure out how you want to store it. Will you be doing ad hoc reporting? Will the data be aggregated and consumed by other applications? How often will you write this data? How often will you read it? Different types of databases have different use case profiles, and the way that you use the data will make a big difference in how you store it. You don’t want to store session state in a relational database – you’ll spend a lot of time writing transient data to disk. Likewise, you don’t want to store financial transactions in an in-memory cache.
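To make the session state point concrete, here is a minimal sketch of an in-memory store whose entries expire on their own. The class name SessionCache and its methods are made up for illustration; this is not the API of memcached, AppFabric, or any other product. It only shows why transient data can get by without ever touching disk.

```python
import time


class SessionCache:
    """Hypothetical toy store for transient data; nothing is written to disk."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._items = {}         # key -> (expires_at, value)

    def put(self, key, value):
        # Each write resets the expiry time for that key.
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily, on the next read.
            del self._items[key]
            return default
        return value


if __name__ == "__main__":
    cache = SessionCache(ttl_seconds=1800)
    cache.put("session-42", {"user_id": 7})
    print(cache.get("session-42"))
```

Everything in this store is lost when the process stops. That is acceptable for session state and exactly what you don’t want for financial transactions.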

How Much Data Will I Collect?

The volume of data that you’re working with will influence the way that you store it. Terabytes, and even petabytes, of data require different storage techniques and management strategies in a relational database. Why would it be different anywhere else? Handling huge quantities of data often requires splitting the load across multiple servers or purchasing a SAN. Either way, you’ll want to consider your long term budget and how the availability of those budget dollars might change over time. Just as important as the long term capacity of your data is the speed at which you’re collecting it. The faster you need to collect data, the more you need to look at how you’re storing that data. Every database engine employs strategies to maximize I/O throughput. The problem is that each engine chooses different strategies based on its intended use case.
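A back-of-the-envelope estimate is usually enough to tell which league you’re playing in. The sketch below is only arithmetic. The event rate, the record size, and the 1.5x overhead factor (standing in for indexes, replication, and so on) are invented numbers, so substitute your own measurements.

```python
def estimate_storage_bytes(events_per_second, bytes_per_event, days,
                           overhead_factor=1.5):
    """Rough storage estimate; overhead_factor is an assumed fudge factor."""
    seconds = days * 24 * 60 * 60
    return int(events_per_second * bytes_per_event * seconds * overhead_factor)


if __name__ == "__main__":
    # Hypothetical workload: 200 events per second at 500 bytes each.
    for label, days in (("week", 7), ("month", 30), ("year", 365)):
        total = estimate_storage_bytes(200, 500, days)
        print(f"one {label}: about {total / 1e9:.1f} GB")
```

The same three time spans appear in the question list above: a week, a month, a year. If the yearly number fits comfortably on one server, volume alone is probably not a reason to change anything.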

How Long Will This Data Need to Live?

The lifespan of data is incredibly important. Short term data need never touch disk – it can live in memory. If this is something like session state data, it’s possible to use a 100% in-memory storage solution (like memcached or AppFabric or even Riak’s riak_kv_cache_backend) to solve this problem. Disks are slow; memory is fast. If the data needs to live longer than a few seconds, then it’s time to consider how long it will really need to stick around. You need to look at different forms of persistence and how the strengths and weaknesses of those systems play into your long term choices. Some next-generation data stores will store data in memory and persist to disk in the background. This speeds up writes, but it does raise some durability concerns: what happens if the power fails? Other data stores use a write-ahead log, like relational databases, to make sure that the data is safe.
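For a feel of what a write-ahead log buys you, here is a deliberately tiny sketch. WalStore is a hypothetical class, not how any particular database implements its log. It appends every write to a file and forces it to disk before acknowledging, then rebuilds its in-memory state from that file on startup.

```python
import json
import os


class WalStore:
    """Hypothetical key-value store that logs each write before applying it."""

    def __init__(self, log_path):
        self._log_path = log_path
        self._data = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild in-memory state from the log left by a previous run.
        if not os.path.exists(self._log_path):
            return
        with open(self._log_path, "r", encoding="utf-8") as log:
            for line in log:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record, e.g. from a crash mid-write
                self._data[record["key"]] = record["value"]

    def put(self, key, value):
        # Durability first: append, flush, and fsync before touching memory.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def close(self):
        self._log.close()
```

The fsync on every put is the price of safety. A store that persists in the background skips that wait, which is why it writes faster and also why a power failure can cost it the most recent writes.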

How Will This Data Be Used?

The way that end users are going to use our data is important for our decision making. Some data stores (such as CouchDB) do not allow ad hoc querying. Others (graph databases) make it possible to easily navigate deeply nested and complex data structures. Relational databases are phenomenal general purpose data stores. They make it possible to store data in a variety of formats, but the general nature of SQL and the relational model can bring complications of its own. Massive volumes of data stored for statistical analysis have different storage and indexing requirements than data that needs to be instantly available from a variety of locations for atomic reads and writes.

How Structured Does This Data Need To Be?

Data structure can be important for a variety of reasons. Hierarchical data stores make it possible to traverse deeply nested category trees – think of how species are classified. Likewise, graph databases make it possible to navigate through complex relationships, much like the relationships that can be found in strongly object-oriented designs. Thinking about how the data is structured can make it much easier to decide on the data store that you’ll be using. If a lot of flexibility is required, a Bigtable-derived database, such as Cassandra or HBase, might meet your storage needs.

How Available Does This Data Need To Be?

One of the benefits of a relational database is that once a transaction commits, the data is immediately available to everyone querying that database (ignoring things like replication and log shipping). When you start working with distributed data stores, the data may not be immediately available. You need to ask yourself how available the data needs to be. In some data stores, the delay may be only a few milliseconds. In others, it may be longer. Network latency and hardware utilization play a large role in the latency of data replication.

The other side of the availability question is “How fault tolerant does this data need to be?” With a relational database, unless you’ve implemented a clustering solution, the data is dependent on a single server functioning appropriately. If that single server goes down, business could stop until a replacement is brought online. Some of the distributed data stores are massively fault tolerant – Riak is designed to be tolerant of hardware failure. If nodes fail, data will be replicated to new nodes, or even to new clusters, automatically. Replacement servers can be brought online and they will begin receiving new data immediately. Both the immediate availability of data after a write and its long term availability (fault tolerance) are important aspects of choosing a data storage solution.
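The gap between “written” and “visible everywhere” is easier to reason about with a toy model. The Primary and Replica classes below are invented for illustration and do not describe the replication protocol of Riak or any other store. Writes land on the primary right away and only reach the replica when replicate() runs.

```python
from collections import deque


class Replica:
    """Hypothetical read-only copy that lags behind the primary."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Hypothetical node that accepts writes and ships them to replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self._replicas = replicas
        self._pending = deque()  # writes not yet shipped to the replicas

    def write(self, key, value):
        self.data[key] = value
        self._pending.append((key, value))

    def read(self, key):
        return self.data.get(key)

    def replicate(self):
        """Apply queued writes to every replica, in order; return the count."""
        shipped = 0
        while self._pending:
            key, value = self._pending.popleft()
            for replica in self._replicas:
                replica.data[key] = value
            shipped += 1
        return shipped


if __name__ == "__main__":
    replica = Replica()
    primary = Primary([replica])
    primary.write("cart:7", ["book"])
    print("replica before replication:", replica.read("cart:7"))
    primary.replicate()
    print("replica after replication:", replica.read("cart:7"))
```

In a real system the replicate step happens continuously over the network, so the window of stale reads depends on latency and load. Whether a stale read in that window is harmless or a serious bug is the question to answer for your data.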

What’s Next?

Ask yourself these questions the next time that you’re implementing new functionality or an entirely new application. The answers to your questions might very well surprise you. In the next few weeks, I’ll be putting more thoughts together and showing examples of realistic scenarios of moving away from a relational database and utilizing a NoSQL solution.