The Promise and Failure of Federated Data

March 15, 2011

One of the biggest problems facing businesses today is integrating data from multiple sources. The holy grail of data integration is called a federated database system. Basically, a federated database stores meta data about other databases and makes it easier to integrate them through a single interface. Many relational databases have features that support integrating with other relational databases through SQL Server’s linked servers or Oracle’s database links. One of the problems with these features is that they only allow relational databases to talk to other relational databases. As the volumes of data we collect every day increase, more and more of that data is being stored outside of relational databases in CSVs, spreadsheets, log files, PDFs, and plain text as well as in a variety of non-relational databases likeMongoDB, HBase, Riak, and Cassandra. If the only thing we’re looking for is access to meta data for external data, SQL Server will provide the [FileTable][8] data type in SQL Server 2011. Admittedly, FileTable isn’t an acceptable solution because it’s really intended to make it possible to reference files in the database that are being managed by external applications through the filesystem and Win32 APIs. Clearly, this doesn’t suit our need for querying external data. It’s also possible to use ETL tools to move data into relational databases. One of the problems with ETL tools (SSIS, Pentaho Data Integration/Kettle, orOracle Data Integrator) is that they are effectively batch operations. New data insertions will have to be triggered by some external event; the data isn’t available until it’s migrated into some master system. Luckily, there’s an extension to the SQL Standard to help us: SQL/MED. MED stands for Management of External Data. This is a way to link up any external data source to a database server. It doesn’t have to be another relational database – there’s already a twitter foreign data wrapper library. Unfortunately, PostgreSQL is currently the only major database player on the market with any potential for an implementation for SQL/MED. The implementation of SQL/MED just isn’t here, yet. Another promising project is HadoopDB. HadoopDB is a project coming out of Yale University. The aim is to make it possible to run analytical workloads in parallel across many commodity RDBMS servers. One of the goals of HadoopDB is to excel in areas where parallel data warehouses simple do not perform well. Many of these situations are outline in the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. To summarize, parallel data warehouses provide near linear scaling up to several hundred nodes running on homogeneous hardware. Parallel data warehouse also operate under the assumption that failures are rare. Google and others have demonstrated that hardware failure is inevitable at scale. HadoopDB presents a phenomenal way to _scale_databases and integrate disparate technologies. Despite its promise of scaling databases, HadoopDB still doesn’t solve the problems that we face when trying to build a federated database system. The truth is a depressing one – there is currently no solution for building federated databases that incorporate data from across the enterprise. Database vendors, DBAs, and more traditional corporate IT departments will tell you that this is a Good Thing™. I’ve mentioned before that you should choose the databasethat is best suited for the task at hand. Where do we go from here? SQL/MED doesn’t meet its own promises – only one vendor is implementing the SQL/MED standard and that support is going to depend on third parties releasing drivers. HadoopDB isn’t a federated database so much as it is a way to avoid scaling a relational database into thousands of cheap nodes and paying millions of dollars in licensing fees for Teradata, Microsoft’s Parallel Data Warehouse, or Oracle’s Exadata. The unfortunate truth is that if we want a federated database we’re going to have to build it ourselves. What sounds like a Sisyphean task isn’t as difficult as it sounds. If we’re collecting data in multiple databases, the problem is already almost solved. Some of those technologies are already here. LINQ lets us treat all data sources equally; we can query an array of objects as easily as we can query a database. ARel is a relational algebra for Ruby. While ARel is specifically focused on working with relational databases, it could be extended to work with many different data sources. Business intelligence vendor Jaspersoftrecently announced support for a number of non-relational databases to complement their existing business intelligence products. Quest Software, makes Toad for Cloud Databases – a tool for querying both relational and non-relational databases. A federated database may never materialize, but it’s already possible to build a hybrid database solution today.

References