A Multi-Column Index - How Should I Design This?

August 02, 2016

Design problems are fun. It’s a chance to build something that lasts and do it right. Plus, those bad decisions are going to hang around forever. This is our chance to make the right decision. [caption id=“attachment_1059” align=“alignright” width=“300”] Who needs design?[/caption]

Our Feature

We’re building a system to store events that have occurred in an our application. This is going to be the back end for an event sourcing system. The event sourcing system (in case you won’t want to read that article) will just be storing things that have happened in the system. Sometimes you don’t just want to know where an order is right now, you also want to how it got to that location. Event sourcing lets us store that information in a meaningful way. So, back to the app - we’re creating a back end for event sourcing in our application. Events are identified by an always increasing numeric column. Events are also associated with some kind of event owner. Doesn’t really matter what that is, just know that events are owned by something. The combination of event owner + event ID uniquely identifies each event. In effect, the event source storage is a time ordered log of activity in the system for a single application entity. In other words - if you’re tracking activity for an order, you’d have an order_events table.

Implementing the Event Source Storage

The Platform

What’s the best way to implement this in a database? For our purposes, we’re going to use SQL Server and talk about what we can and can’t do in there. Specifically, we’ll be working with SQL Server 2016 Developer Edition. Unless I have to, I won’t be using any SQL Server specific features and functionality for this. Instead, we’ll be looking at this from a general design perspective. I may look at using specific features in the future. We’ll see.

Logical Design

What do we know about the data that we’re going to be storing?

We have an Owner ID. Owner ID can be virtually anything, but it should match the primary key of the table we’re tracking.
We have an Event ID that should always increase.
Data is never updated once inserted into the event table.
Data is never deleted once inserted into the event table.
Data is only queried by the Owner ID

How should we go about implementing this physicall in the database?

Physical Design - Clustered Index

Since Event ID is constantly increasing, many database developers would suggest creating a table in SQL Server with a clustered index on the Event ID column. As a best first guess, this isn’t the worst option. Using a clustered index on Event ID is a decent approach because the Event ID should be always increasing. Depending on how the IDs are generated, there are scenarios where they may not be generated in purely sequential order (see rustflakes). In this case, we have to assume that Event IDs may be coming from anywhere and, as such, may not arrive in order. Even though we’re largely appending to the table, we may not be appending in a strict order. Using a clustered index to support the table isn’t the best option in this case - data will be inserted somewhat randomly. We’ll spend maintenance cycles defragmenting this data. Another downside to this approach is that data is largely queried by Owner ID. These aren’t unique, and one Owner ID could have many events or only a few events. To support our querying pattern we need to create a multi-column clustering key or create an index to support querying patterns.

Clustered index - We would need to cluster on Owner ID, Event ID to support our table. This results in inserts throughout the table and we’re back at having fragmentation.
Secondary index - In this case, we can create a non-clustered index on top of the clustered index with just the Owner ID column. Now we can pull back only the records we need. But, in this case, we’ll need to traverse two b-trees - one for the non-clustered index and one for the clustered index. On high throughput systems, this could become problematic.

Physical Design - Heap

What if we use a heap for base table in this case? Data in a heap is appended to the table, so this solves the problem of random inserts - it’s just not happening at the table level. We can make use of just one index on the Owner ID column. Sure, there will be fragmentation in this index as we add data to the database, but this is going to be considerably smaller than our clustered index in the previous example. To find our rows, we’ll have to traverse the b-tree to locate the Owner ID, but then it’s a straight RID lookup into the heap. Since there won’t be updates or deletes from the table, many of the benefits of the clustered index approach fall by the wayside. Instead, we need to focus on the most effective way to work with data to this table. Using a heap for base table storage and a non-clustered index on Owner ID solves the problem of fragmenting on insert - there won’t be any in the table. Fragmentation of the non-clustered index will be minimal. In addition, we can get directly to the rows we want through the non-clustered index.

Wrapping Up

Before implementing anything in your database, stop and think about both the physical and logical design. Both of those topics include more than just the table structure. Make sure you think about the indexes and query patterns - data retrieval and modification are important to designing the most effective database.

“Bad day at the office?” by Matthew Hutchinson licensed with CC BY 2.0

Original comments

Andrew Notarian - Aug 2, 2016

Finally a blog post about databases. I was starting to worry.

Serge Adler - Apr 5, 2017

Jeremiah, There is only one issue with a heap and a non-clustered index on Owner ID. Depending on statistics (percentage of rows returned) for a particular Owner ID, SQL query optimizer may decide to ignore the index in favour of a full table scan. This may become a problem when the table grows beyond a certain size. This will never be an issue with a clustered index on the Owner ID, Event ID - this will always result in a nice clustered index range scan.

Jeremiah Peschka - Apr 5, 2017

You are correct, statistics matter. There are corner cases for everything. The point of the post is that you should question the conventional wisdom of using a clustered index. Your approach can result in excessive fragmentation and you will have to face difficult decision about only reorganizing, only rebuilding during certain hours, or abandoning maintenance altogether. “Never” is a strong word for an approach that has significant physical implications. Which, by the by, could result in bad stats leading to table scans instead of range scans.