Using SimSQL for Stochastic Analytics.

SimSQL has special facilities that allow a user to define special database tables that have simulated data---these are data that are not actually stored in the database, but are produced by calls to statistical distributions. Such simulated data can be queried just like any other database data. This is very useful because it allows one to use statistical distributions in place of data that are uncertain. In real life, uncertain data are commonplace due to measurement errors, because they have not yet been observed and must be forecast (think of sales figures for the upcoming quarter), or because they were never recorded and must be imputed.

For an in-depth description of how stochastic analytics work in SimSQL, you should see our definitive paper on MCDB which can be found here. MCDB is the predecessor to SimSQL (and whose functionality SimSQL subsumes). This discussion will focusing on using the MCDB functionality in practice.

For an example of all of this using our flowers database, fire up SimSQL. At the command line, type:

SimSQL> CREATE TABLE newSightings (name, person, location, sighted) AS

> FOR EACH s IN sightings

> WITH temptable AS Categorical (

> SELECT s2.sight_id, 1.0

> FROM sightings AS s2)

> SELECT s3.name, s3.person, s3.location, s3.sighted

> FROM sightings s3, tempTable AS tt

> WHERE s3.sight_id = tt.out_id;

This defines a table newSightings that is a bootstrapped version of the original sightings table (to "bootstrap" a data set, you simply re-sample the data set with replacement; a lot of resources on the web describe the bootstrap). Bootstrapping is commonly used to understand the amount of variation in data. Naturally, SimSQL can support much more complicated models, but this will suffice for our example.

The way that this code works is that for each original sighting, we replace it with one drawn at random (using the Categorical "VG function") from the set of all sightings that are already in the database.

You can now ask questions about this simulated table. For example, we can ask: how many sightings each of the people in the simulated database had:

SimSQL> SELECT person, COUNT (*) FROM newSightings GROUP BY person;

Invoking the optimizer...

Executing the plan. Number of operations to go is...

5->4->3 (running Selection as Hadoop job...)

2 (running VGWrapper as Hadoop job...)

1 (running AggregateChained as Hadoop job...)

mcdb_count_____12 | mcdb_s3_person_13

---------------------------------------------------------------

30 | Donna

21 | John

28 | Sandra

12 | Robert

70 | Maria

48 | Helen

4 | Brad

20 | James

5 | Pete

58 | Michael

17 | Tim

113 | Jennifer

And we can ask how many times each genus of flower was seen in the simulated database:

SimSQL> SELECT f.genus, COUNT (*) FROM newSightings AS n, flowers AS f WHERE n.name = f.comname GROUP BY f.genus;

Invoking the optimizer...

Executing the plan. Number of operations to go is...

6->5->4->3 (running Selection as Hadoop job...)

2 (running VGWrapper as Hadoop job...)

1 (running AggregateChained as Hadoop job...)

mcdb_count_____5 | mcdb_f_genus_2

-----------------------------------------------------------

1 | Aquilegia

9 | Triphysaria

18 | Penstemon

2 | Ligusticum

7 | Lupinus

4 | Erigeron

9 | Geranium

2 | Eriophyllum

6 | Carex

8 | Eriogonum

7 | Juncus

3 | Angelica

5 | Draperia

1 | Orthilia

21 | Mimulus

…

One nice feature of SimSQL is that it can efficiently answer questions over many simulated versions of a data set. At the command line, type:

SimSQL> show params

parameter name | current value

-------------------------------------------------------

numCPUs | 8

memoryPerCPUInMB | 1000

numIterations | 1

optIterations | 150

optPlans | 1

debug | false

SimSQL> set numIterations 20

We have just set the system so that it will run 20 Monte Carlo iterations of all queries. Now, we can ask how many times each of the flowers in the 'Fremontodendron' genus were seen in 20 different versions of the simulated database:

SimSQL> SELECT f.comname, COUNT (*) FROM newSightings AS n, flowers AS f WHERE n.name = f.comname and f.genus = 'Penstemon' GROUP BY f.comname;

Invoking the optimizer...

Executing the plan. Number of operations to go is...

7->6 (running Selection as Hadoop job...)

5->4 (running Join as Hadoop job...)

3->2 (running VGWrapper as Hadoop job...)

1 (running AggregateChained as Hadoop job...)

mcdb_count_____10 | mcdb_f_comname_6

----------------------------------------------------------------------------------------

6, 7, 1, 3, 5, 9, 5, 7, 2, 5, 6, 5, 8, 2, 1, 2, 3, 3, 5, 5 | Purple penstemon

16, 13, 9, 15, 15, 15, 8, 14, 10, 12, 19, 9, 10, 14, 7, 7, 13, 11, 7, 19 | Alpine penstemon

This is telling you, for each simulated database, how many of each of the two penstemon varieties were sighted.

Here is a second, more complicated example of the sort of stochastic analysis that you can do using SimSQL. Imagine that we want to estimate how many of each variety of flowers we are likely to find if we visit the location 'Burton Mill' and collect 100 specimens. We want to use a relatively sophisticated Bayesian model to do this.

What we will do is to assume a Dirichlet prior the probability that each flower variety is going to be collected. When choosing the hyperparameters for this Dirichlet prior, we'll assume that on expectation (without any observations at a location), the probability of collecting a flower at a location is going to be proportional to the fraction of all sightings (regardless of the location) at which that flower was seen. Specifically, we'll begin by creating our table of hyperparameters:

SimSQL> CREATE VIEW hyperparams (flow_id, val) AS

> SELECT f.flow_id, COUNT(*)

> FROM flowers AS f, sightings AS s

> WHERE f.comname = s.name

> GROUP BY f.flow_id;

And then we'll write code for a random table that uses the standard Dirichlet-Multinomial conjugacy to generate a statistical guess as to the actual probability of sighting each type of flower at each location. That is, it will combine the Dirichlet prior with the number of times each flower was observed at each location in order to guess the desired probability:

SimSQL> CREATE VIEW priorStrength (num) AS SELECT * FROM VALUES(0.01);

SimSQL> CREATE TABLE flowerProb (flow_id, location_id, probability) AS

> FOR EACH f IN features

> WITH temptable AS Dirichlet_Conjugate (

> (SELECT p.flow_id, p.val * ps.num

> FROM params AS p, priorStrength AS ps),

> (SELECT f2.flow_id AS flow_id, COUNT (*) as cnt

> FROM sightings AS s, flowers AS f2

> WHERE s.name = f2.comname AND f.location = s.location

> GROUP BY f2.flow_id))

> SELECT tt.out_id, f.loc_id, tt.probability

> FROM tempTable AS tt;

Thus, a tuple (flow_id, location_id, probability) in flowerProb means that the probability of observing the given flower at the given location is probability.

Finally, we can define a random table called sightingsPerLocation that uses flowerProb to guess how many of each variety of flower we would observe if we collected 100 samples at each location:

SimSQL> CREATE VIEW numSightings (num) AS SELECT * FROM VALUES(100);

SimSQL> CREATE TABLE sightingsPerLocation (location, flow_name, num) AS

> FOR EACH f IN features

> WITH temptable AS Multinomial (

> (SELECT fp.flow_id, fp.probability

> FROM flowerProb AS fp

> WHERE fp.location_id = f.loc_id),

> (SELECT ns.num FROM numSightings as ns))

> SELECT f.location, fl.comname, tt.out_count

> FROM temptable AS tt, flowers AS fl

> WHERE fl.flow_id = tt.out_id;

Now we can ask queries over this table. For example, let us estimate how many of each variety we would see at 'Burton Mill', were we to collect 100 samples:

SimSQL> SET numIterations 10

SimSQL> SELECT location, flow_name, num FROM sightingsPerLocation WHERE location = 'Burton Mill';

Invoking the optimizer...

Executing the plan. Number of operations to go is...

15->14 (running TempTable as Hadoop job...)

13->12 (running TempTable as Hadoop job...)

11 (running Selection as Hadoop job...)

10 (running Selection as Hadoop job...)

9->8->7 (running Join as Hadoop job...)

6->5 (running Aggregate as Hadoop job...)

4 (running Aggregate as Hadoop job...)

3 (running VGWrapper as Hadoop job...)

2 (running VGWrapper as Hadoop job...)

1 (running Join as Hadoop job...)

mcdb_fl_comname_3 | mcdb_out_count_3 | mcdb_f_location_6

------------------------------------------------------------------

Sierra daisy | 0, 0, 0, 0, 0, 0, 2,... | Burton Mill

Sierra angelica | 0, 0, 0, 1, 0, 0, 0,... | Burton Mill

Snow plant | 0 | Burton Mill

Oak violet | 3, 2, 5, 3, 5, 0, 0,... | Burton Mill

Death camas | 8, 3, 5, 3, 7, 6, 11... | Burton Mill

Cow parsnip | 0, 0, 0, 0, 0, 0, 0,... | Burton Mill

Water groundsel | 0 | Burton Mill

Torreys lomatium | 0, 4, 0, 0, 0, 3, 2,... | Burton Mill

Sierra stonecrop | 4, 4, 3, 5, 5, 13, 5... | Burton Mill

Hoary buckwheat | 0, 1, 0, 1, 0, 0, 0,... | Burton Mill

Ithuriels spear | 2, 1, 6, 11, 7, 3, 6... | Burton Mill

Alpine penstemon | 4, 4, 15, 5, 9, 5, 4... | Burton Mill

Large-leaved lupine | 14, 0, 4, 8, 5, 2, 1... | Burton Mill

...

Note that a single '0' for the count in a particular row means that the flower variety was never observed in any of the simulations.