Design and Implementation of the DBO Database System

This document gives an overview of the research project "III-COR-Medium: Design and Implementation of the DBO Database System," funded by the National Science Foundation under grant number 1007062 and managed by Chris Jermaine at Rice University, as well as by Alin Dobra at the University of Florida on a subcontract from Rice. This project is concerned with the design, implementation, and evaluation of two related database systems: DBO and DataPath. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. For more information or for any comments or questions, please contact Chris Jermaine.

Recent Project News:

Sept 26, 2012: Although we have not yet had an official release of the DataPath code base, you can download a version of the code from Google Code.

Why DBO and DataPath? Databases are ubiquitous, and every large company in existence today relies on relational database technology for its day-to-day operations. Databases are used to run websites, track orders, monitor inventory, manage payroll, and for countless other applications. However, in one sense, the ubiquity of database technology has led directly to significant dissatisfaction among many users of database systems: databases are now used for tasks that they were never designed for. The architecture of most modern database systems was first conceived more than 30 years ago, with one primary task in mind: transaction processing. A transaction is a sequence of small operations that must happen atomically (they cannot be partially executed)---a perfect example is the set of database operations that are invoked when a customer withdraws money from a bank ATM. Transactions are characterized by their "smallness", in the sense that they typically look at only a very small fraction of the database. They are also characterized by the fact that they update the data as well as read it. Modern database systems have significant machinery dedicated to ensuring correctness when the same data are concurrently read and written by thousands of users.

While many users would agree that databases excel at transaction processing, a new class of applications termed analytic processing applications has become prevalent in recent years. In analytic processing, transactional data is aggregated into a large repository called a data warehouse and used for subsequent analysis---figuring out how customers react to sales promotions, how websites are viewed, how and why customers complain, and so on. It is unfortunate (but not surprising) that 30-year-old engines tuned to locate and update specific records and ensure atomicity and correctness under concurrent reads and writes don't work that well when they are used to analyze 100 terabytes of analytic data at an aggregate level. For evidence of this, just peruse the latest TPC-H benchmark results in detail. You'll see that it is possible to spend a million dollars to archive a few hundred gigabytes of data (which is not that much nowadays) and still have to wait 20+ minutes to get an answer to a complicated SQL query.

This dissatisfaction among database users has led to a number of startup companies that build "shared-nothing" (that is, cluster-based) database systems for large-scale analytic processing: Greenplum, Netezza, AsterData, and Vertica are just a few of the companies now out there. It has also led to intense interest in the cluster-computing paradigm known as map-reduce, which is used by Internet companies such as Google and Yahoo! to analyze their own very large data repositories.

A Possible Solution: Online Query Processing. One way to address this problem is to use online query processing. In online query processing, the database makes use of randomized algorithms to come up with a quick guess as to the answer to a query. As the user waits, the guess is refined, until eventually the "guess" is totally accurate because query processing has completed. This has the advantage of allowing the user to stop waiting for the final query answer as soon as the guess is "good enough". The potential benefit should be obvious: if it takes ten hours to get the exact answer, but only five minutes to get a high-quality guess, then we have saved a huge amount of both computer and user time. Analytic queries over a data warehouse are particularly amenable to this sort of approximation because the questions that are asked are almost always statistical in nature (indeed, every one of the 22 TPC-H benchmark queries is statistical).

For an example of how online query processing might work, consider the simple SQL query:

SELECT SUM(e.SALARY)
FROM EMPLOYEE AS e
WHERE e.DEPARTMENT = 'Accounting'

To answer this query, we could simply scramble (offline) the records in the relation EMPLOYEE into a random order on disk, and then read them in, one at a time, in that random order when the query is issued. At all times, no matter how much data we have processed, we compute SUM(e.SALARY) for those records where e.DEPARTMENT = 'Accounting', and scale the current sum up by the inverse of the fraction of the records that we have processed so far. For example, if we have processed 50% of the records and the current sum is $1,000,000, then $2,000,000 is a good guess for the final answer to the query. In fact, this estimate is unbiased, and the error of the guess can be bounded using classical statistical methods.
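To make the arithmetic concrete, here is a small, self-contained sketch (in Python, over a made-up toy table; this is an illustration of the scale-up estimator described above, not DBO code) of how the running guess could be maintained as records arrive in random order:

    import random

    # Toy EMPLOYEE table: each record is (department, salary). Purely illustrative data.
    EMPLOYEE = [("Accounting", 60000), ("Sales", 55000), ("Accounting", 72000),
                ("Engineering", 90000), ("Accounting", 65000), ("Sales", 48000)]

    def online_sum_estimates(records):
        """Yield a running estimate of SUM(salary) for the Accounting department
        after each record, assuming the records are read in random order."""
        shuffled = records[:]          # the offline "scramble" of the relation
        random.shuffle(shuffled)
        total = len(shuffled)
        running_sum = 0
        for processed, (dept, salary) in enumerate(shuffled, start=1):
            if dept == "Accounting":
                running_sum += salary
            fraction_seen = processed / total
            # Scale the partial sum up by the inverse of the fraction processed so far.
            yield running_sum / fraction_seen

    for i, estimate in enumerate(online_sum_estimates(EMPLOYEE), start=1):
        print(f"after {i} records, estimated SUM(SALARY) = {estimate:,.0f}")

Because every prefix of the randomly shuffled file is a simple random sample of the relation, the expected value of the scaled-up partial sum equals the true answer, which is why the simple scale-up is unbiased.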

Redesigning the Database System From the Ground Up. While performing this sort of online aggregation for simple SUM queries over a single database table is relatively easy and was first proposed more than ten years ago, the applicability of prior work was quite limited, for several reasons. First, the only type of query that could be answered using online aggregation was the simple SELECT-FROM-WHERE-GROUP BY query, with no subqueries. Second, and perhaps more significant, existing algorithms for online aggregation were not scalable. As soon as enough data had been processed that all of the records read in so far could not be stored in memory, it was not known how a statistically meaningful guess for the final query answer could be computed. Given that a modern, relatively inexpensive hard disk array can produce a gigabyte of data per second, and that a high-end system might have 50GB of main memory, this means that online aggregation could only be used for around the first 50 seconds of query processing!

In response to this, the goal of this particular project is to redesign the database system from the ground up to support scalable, online analytic query processing. The prototype database system that we are working on is called DBO, which is short for "Database-Online". DBO has the following specific design goals:

  1. To be able to handle any and all of the 22 queries in the TPC-H benchmark.
  2. To easily scale to databases that are tens of terabytes in size.
  3. To be faster than a conventional relational system such as Oracle or SQL Server for running one or more of the TPC-H queries (possibly concurrently) from startup through completion.
  4. To be able to give an accurate, statistically meaningful guess for the final query answer from query startup through query completion.
  5. To be able to shrink the error of the guess smoothly and quickly as query processing progresses.

The ultimate goal is to fully implement and distribute a version of DBO that can be used to answer analytic queries over very large, terabyte-sized databases.

The Two Main Thrusts of the DBO Project. There are two main research efforts associated with the project.

Direction Number One: Database-Online. Our first main research effort is designing, implementing, and evaluating the randomized algorithms that allow for accurate, statistically meaningful guesses for the answers to analytic database queries from startup through completion. As the first year of the project comes to a close, we have designed and prototyped a new set of algorithms for online processing of queries containing arbitrary join trees. We call these algorithms Turbo DBO. The original version of DBO (see our SIGMOD 2007 paper) obtains its estimates by looking for combinations of database records that happen to be in main memory at the same time and can be combined to form a record that answers the underlying database query. The key innovation of Turbo DBO is that it makes use of novel algorithms that look for and remember what we term partial-match tuples in a randomized fashion. These are tuples that satisfy some of the Boolean predicates associated with the query, and can possibly be grown into tuples that actually contribute to the final query result at a later time. A paper describing Turbo DBO appeared at the VLDB 2009 conference in Lyon, France. The reference is:

Turbo-Charging Estimate Convergence in DBO. Alin Dobra, Chris Jermaine, Florin Rusu, and Fei Xu. In Proc. VLDB 2009.

This paper shows that the partial-match approach is able to obtain much tighter estimates (in some cases dramatically tighter) than the approach described in the original DBO paper.
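To give a feel for what a partial-match tuple is, the following toy sketch (ours, not the Turbo DBO algorithm; the relations and names are invented) streams the two sides of a join and remembers tuples seen so far on each side, so that a tuple arriving later can be "grown" into complete result tuples. The bookkeeping shown is essentially a symmetric hash join; Turbo DBO's contribution is doing this kind of growing in a randomized, memory-bounded, statistically principled way over arbitrary join trees.

    from collections import defaultdict

    # Toy relations for a two-way join of ORDERS and CUSTOMERS on cust_id
    # (purely illustrative data).
    ORDERS    = [(1, 100.0), (2, 250.0), (1, 75.0)]        # (cust_id, amount)
    CUSTOMERS = [(1, "gold"), (2, "silver"), (3, "gold")]  # (cust_id, tier)

    def streamed_join(orders, customers):
        """Consume the two inputs as one interleaved stream, remembering the
        tuples seen so far on each side ("partial matches") so that each new
        arrival can be grown into complete join results."""
        seen_orders = defaultdict(list)   # cust_id -> amounts seen so far
        seen_custs  = defaultdict(list)   # cust_id -> tiers seen so far
        results = []
        # Interleave the two streams; in a real online system the arrival
        # order would be randomized.
        stream = [("O", t) for t in orders] + [("C", t) for t in customers]
        for side, tup in stream:
            if side == "O":
                cust_id, amount = tup
                seen_orders[cust_id].append(amount)
                for tier in seen_custs[cust_id]:      # grow existing partial matches
                    results.append((cust_id, amount, tier))
            else:
                cust_id, tier = tup
                seen_custs[cust_id].append(tier)
                for amount in seen_orders[cust_id]:   # grow existing partial matches
                    results.append((cust_id, amount, tier))
        return results

    print(streamed_join(ORDERS, CUSTOMERS))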

We recently looked at the problem of extending the randomized algorithms used by DBO to a distributed environment. The resulting paper appeared in VLDB 2011:

Online Aggregation for Large MapReduce Jobs. Niketan Pansare, Vinayak Borkar, Chris Jermaine, and Tyson Condie. In Proc. VLDB 2011.

Direction Number Two: DataPath. Our second main research effort is designing and implementing the parallel database platform that DBO will run on top of, which we call DataPath. One of the major goals of the DBO project (stated above) is that we want DBO to actually be faster than a conventional database system for running a query to completion over a multi-terabyte archive. Since conventional databases are handicapped by their three-decade-old design, we actually don't think that this will be too hard to do! But since we have the opportunity to design the underlying engine from the ground up, we have the freedom to dream up a system architecture that is radically different from anything else that is out there.

In designing the DataPath architecture, we wanted to go beyond the ideas that are currently in vogue, especially the emphasis on shared-nothing architectures and cluster computing solutions (such as Map-Reduce). While shared-nothing and cluster computing are important, the simple fact is that for $50,000, one can now purchase a single-machine system with 50 high-end disks (totaling 15TB to 50TB of storage, depending on which disks are purchased), 256GB of RAM, and 32 compute cores, all interconnected via a super-fast NUMA architecture. As manufacturers add more and more cores to each machine, the RAM, disk, and core counts available on a single, relatively inexpensive machine such as this will only increase over time. Given that such a system suffices to store all of the data from all but the very largest databases, and given that it is not well understood how to perform analytic processing on such a "fat" machine, we have spent the first year of the project focused on the question of how to build a super-fast analytic engine for such hardware.

Our work to date has focused on the design and implementation of the DataPath architecture. This architecture makes use of a "data-centric" design paradigm to combat the fundamental problem one faces when moving billions of records through a system for large-scale analytic processing: the latency and bandwidth problems associated with all of the data transfer. In the data-centric paradigm, computation does not request data when it needs it (as it would in a traditional, compute-centric system). The problem with the compute-centric design is that it incurs heavy latency, as data must travel from the disk, through memory and cache, and onto the CPU. It is also bandwidth-inefficient: even when two different computations make use of the same data, the data must be moved onto the CPU multiple times. In a data-centric system, data is forced from the storage medium and up through the memory hierarchy in a stream, independent of the computation that is happening. As the data flows onto the CPU, any computation that needs the data can make use of it. In this way, the data-centric paradigm ensures that there is no latency (it is impossible to have latency when computation cannot request data!) and that transfer bandwidth is used minimally (data are forced onto the CPU only once, no matter how many computations make use of the data). We anticipate that the DataPath architecture will form a super-fast base upon which our final, open-source version of DBO will be constructed. A paper describing the DataPath architecture was presented at the SIGMOD 2010 conference in Indianapolis:

The DataPath System: A Data-Centric Analytic Processing Engine for Large Data Warehouses. Subramanian Arumugam, Alin Dobra, Chris Jermaine, Luis Perez, and Niketan Pansare. In Proc. SIGMOD 2010.
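The following toy sketch (again ours, and purely illustrative; it is not the DataPath code) contrasts the data-centric idea described above with the compute-centric one: chunks of records are pushed up through the system once, and every computation that has registered interest consumes each chunk as it streams by, rather than each computation pulling the same data separately.

    # Hypothetical push-based, data-centric pipeline: chunks of records are
    # driven through the system once, and all registered computations share them.

    def chunk_stream():
        """Simulate the storage layer pushing fixed-size chunks of records."""
        records = [("Accounting", 60000), ("Sales", 55000),
                   ("Accounting", 72000), ("Engineering", 90000)]
        chunk_size = 2
        for i in range(0, len(records), chunk_size):
            yield records[i:i + chunk_size]

    class SumSalary:
        def __init__(self, dept):
            self.dept, self.total = dept, 0
        def consume(self, chunk):
            self.total += sum(sal for d, sal in chunk if d == self.dept)

    class CountRows:
        def __init__(self):
            self.count = 0
        def consume(self, chunk):
            self.count += len(chunk)

    # Two independent computations share every chunk: the data moves up the
    # memory hierarchy once, no matter how many consumers are interested in it.
    computations = [SumSalary("Accounting"), CountRows()]
    for chunk in chunk_stream():
        for comp in computations:
            comp.consume(chunk)

    print(computations[0].total, computations[1].count)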

Project Personnel. The work of the following people has been sponsored at one time or another through the NSF funding associated with this project:

· Subramanian Arumugam - Post-doctoral associate, Rice University.
· Alin Dobra - Associate professor, University of Florida (PI on subcontract to UF).
· Chris Dudley - Undergraduate researcher, University of Florida.
· Chris Jermaine - Associate professor, Rice University (PI).
· Supriya Nirkhiwale - PhD student, University of Florida.
· Niketan Pansare - PhD student, Rice University.
· Luis Perez - PhD student, Rice University.
· Florin Rusu - PhD student, University of Florida (now graduated; assistant professor at UC Merced).
· Praveen Salitra - MSc student, University of Florida.
· Sarvesh Singh - MSc student, University of Florida.

 

Last modified: September 26, 2012