What Is SimSQL? 

SimSQL is a system for stochastic analytics implemented at Rice University.

At its most basic, SimSQL is a scalable, parallel, analytic relational database system that compiles SQL queries into Java code that runs on top of Hadoop. We have regularly used SimSQL to run SQL queries on 100-node Hadoop clusters with 1,000+ compute cores and terabytes of data.

In the sense that SimSQL is an SQL-based platform that runs on top of Hadoop, it is similar to platforms such as Hive. However, compared to systems such as Hive, SimSQL is much more of a "classical" relational database system. It has a fully functional query optimizer, and it doesn't just use an "SQL-like" scripting language: the language supported by SimSQL is very close to classical SQL, with full support for important SQL language constructs such as nested subqueries.

Stochastic Analytics Using SimSQL.

If all you do is use SimSQL as a parallel database system for running analytic queries, you will find it to be very useful. It has many of the performance optimizations found in commercial database systems, such as the ability to choose from a number of different join algorithms, and the ability to automatically pipeline operations.

But what makes SimSQL truly unique is its support for stochastic analytics. SimSQL has facilities that allow a user to define special database tables containing simulated data: data that are not actually stored in the database, but are produced by sampling from statistical distributions. Such simulated data can be queried just like any other database data. This is very useful because it allows one to use statistical distributions in place of data that are uncertain. In real life, uncertain data are commonplace: data may contain measurement errors, may not yet have been observed and must be forecast (think of sales figures for the upcoming quarter), or may never have been recorded and must be imputed. As long as you can come up with an appropriate statistical model, it is possible to use SimSQL to ask (and answer) sophisticated questions such as "What would my profits have been last year had I raised my prices by 10%?"

Crucially, when a query is issued over a table containing simulated data, an entire distribution of query results is returned. This distribution gives a user an indication of the uncertainty in the query result due to uncertainty in the data.
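
To make the idea of a distribution of query results concrete, here is a minimal Python sketch of the general Monte Carlo approach. It is not SimSQL code and does not use SimSQL's syntax or API. The demand model, the customer list, the 0.5 price-sensitivity factor, and the names simulate_sales, total_revenue, and query_distribution are all assumptions made up for illustration. The sketch generates many possible instances of an uncertain "sales" table, runs the same aggregate query on each instance, and collects the results.

```python
import random
import statistics


def simulate_sales(rng, customers, price_increase):
    """Generate one possible instance of an uncertain 'sales' table.

    Hypothetical demand model (an assumption, not from SimSQL): expected
    units sold fall by half the relative price increase, plus Gaussian noise.
    """
    rows = []
    for cust_id, mean_units in customers:
        expected = mean_units * (1.0 - 0.5 * price_increase)
        units = max(0.0, rng.gauss(expected, 0.1 * mean_units))
        rows.append((cust_id, units))
    return rows


def total_revenue(rows, unit_price):
    """The 'query', roughly: SELECT SUM(units * unit_price) FROM sales."""
    return sum(units * unit_price for _, units in rows)


def query_distribution(customers, base_price, price_increase,
                       n_worlds=1000, seed=0):
    """Run the query once per simulated table instance; return all results."""
    rng = random.Random(seed)
    price = base_price * (1.0 + price_increase)
    return [
        total_revenue(simulate_sales(rng, customers, price_increase), price)
        for _ in range(n_worlds)
    ]


# Hypothetical input: (customer id, historical mean units purchased).
customers = [(1, 100.0), (2, 250.0), (3, 40.0)]
results = query_distribution(customers, base_price=10.0, price_increase=0.10)

# The answer is a distribution, summarized here by its mean and spread.
print("mean revenue:", statistics.mean(results))
print("std. dev.   :", statistics.stdev(results))
```

The spread of the collected results is what tells the user how much the uncertainty in the data carries over into the answer. A system like SimSQL lets the user express this kind of computation declaratively in SQL rather than writing the simulation loop by hand.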

Bayesian Machine Learning Using SimSQL. 

SimSQL can also be used for large-scale Bayesian machine learning, which can be viewed as a special case of stochastic analytics.

The most common way in which Bayesian machine learning models are learned is via Markov chain Monte Carlo, or MCMC. To perform MCMC on "Big Data", a platform needs to be able to run Markov chain simulations on large data sets, in which new data are simulated via a set of recursive dependencies. Since tables of simulated data in SimSQL can have such recursive dependencies, it is easy to use SimSQL to run Markov chain simulations over "Big Data". This means that SimSQL is a great platform for very large-scale Bayesian machine learning. In fact, we have amassed a great deal of experimental evidence showing that SimSQL scales as well as (or better than) many other platforms for large-scale machine learning.
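
The recursive dependencies mentioned above can be illustrated with a small Gibbs sampler, one common form of MCMC. The following Python sketch is not SimSQL code. The model (normally distributed data with unknown mean and precision), the prior parameters, and the function name gibbs_normal are assumptions chosen for illustration. What matters is the structure: each new state of the chain is simulated from the previous state, much as one version of a simulated table would be generated from the version before it.

```python
import math
import random


def gibbs_normal(data, n_iters=2000, seed=0,
                 m0=0.0, p0=1e-3, a0=1e-3, b0=1e-3):
    """Gibbs sampler for x_i ~ Normal(mu, 1/tau).

    Assumed priors (illustrative only):
      mu  ~ Normal(m0, 1/p0)        (p0 is a precision)
      tau ~ Gamma(shape=a0, rate=b0)
    """
    rng = random.Random(seed)
    n = len(data)
    total = sum(data)
    mu, tau = 0.0, 1.0  # initial state of the chain
    chain = []
    for _ in range(n_iters):
        # mu at step i is simulated from tau at step i-1 (recursive dependency).
        prec = p0 + n * tau
        mean = (p0 * m0 + tau * total) / prec
        mu = rng.gauss(mean, 1.0 / math.sqrt(prec))

        # tau at step i is simulated from the mu just drawn.
        sse = sum((x - mu) ** 2 for x in data)
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        tau = rng.gammavariate(a0 + n / 2.0, 1.0 / (b0 + 0.5 * sse))

        chain.append((mu, tau))
    return chain


# Synthetic data with true mean 5.0 and true standard deviation 2.0.
data_rng = random.Random(42)
data = [data_rng.gauss(5.0, 2.0) for _ in range(500)]

chain = gibbs_normal(data)
burned = chain[500:]  # discard early iterations as burn-in
post_mu = sum(m for m, _ in burned) / len(burned)
post_sd = sum(1.0 / math.sqrt(t) for _, t in burned) / len(burned)
print("posterior mean of mu:", post_mu)
print("posterior mean of sd:", post_sd)
```

Here the loop over the data fits in memory on one machine. The point of running such a simulation on a platform like SimSQL is that each step can instead be carried out as a parallel query over a large data set.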

How Do I Get Started With SimSQL?

SimSQL is entirely open-source. We do all of our development on top of the Amazon EC2 platform. On this website, you'll find enough information to have SimSQL up and running on Amazon EC2 within about 30 minutes, even if you have somewhat limited computer systems skills. All you have to do is rent an Amazon EC2 machine and have it run the SimSQL machine image that we've created---if you can do that, it won't take you long to have a SimSQL Hadoop cluster up and running.

Additional Resources.

A presentation describing SimSQL can be found here. A technical paper describing how SimSQL can be used for Bayesian machine learning is here. An experimental comparison of SimSQL with a number of other platforms for large-scale machine learning is here. SimSQL is an extension of the earlier MCDB system for stochastic analytics; the definitive technical paper describing MCDB can be found here.