Using SimSQL for Bayesian Machine Learning


SimSQL's "Big Data" simulation capabilities mean that it is an excellent platform for large-scale Bayesian machine learning. Markov chain Monte Carlo (MCMC) is the most widely applicable inference (or "learning") method in Bayesian statistics and machine learning. Since MCMC over Big Data requires simulating a Markov chain whose state is huge (having millions or billions or even trillions of variables) SimSQL is a natural choice for very large-scale Bayesian inference.


Here we'll give a detailed example of how SimSQL can be used for solving problems in this domain. Specifically, we focus on using SimSQL for uncollapsed Gibbs sampling to learn a Latent Dirichlet Allocation (LDA) model. LDA is a standard model for text mining. For a note on collapsed vs. uncollapsed sampling, check out the bottom of this document.


To follow our example, begin by downloading the ubiquitous 20 newsgroups data set, as well as some software that we've prepared for processing this data set:

ubuntu@ec2-123-45-67-89$ wget http://cmj4.web.rice.edu/SimSQL/20news-19997.tar.gz

ubuntu@ec2-123-45-67-89$ gunzip 20news-19997.tar.gz

ubuntu@ec2-123-45-67-89$ tar xvf 20news-19997.tar


Next, download and run some software that we've built for processing text data sets like this:

ubuntu@ec2-123-45-67-89$ wget http://cmj4.web.rice.edu/SimSQL/ProcessCorpusForLDA.tar

ubuntu@ec2-123-45-67-89$ tar xvf ProcessCorpusForLDA.tar

ubuntu@ec2-123-45-67-89$ java -jar ProcessCorpusForLDA.jar

Enter the directory where the corpus is located: 20_newsgroups <typed by you!>

Enter the max number of docs to use in each subdirectory: 2000 <typed by you!>

20_newsgroups

Counting the number of occurs of each word in the corpus.....................Found 153832 unique words in the corpus.

How many of those words do you want to use to process the docs? 10000 <typed by you!>

Done creating the dictionary. Writing it out.

Converting the corpus to a set of tables. How many documents per group? 200 <typed by you!>

How many topics will you learn? 100 <typed by you!>

...................Done processing all of the docs!


This software will create a set of text files called topics, dictionary, groups, doc, wordsInDoc, and prior. The next step is to fire up SimSQL and, at the SQL prompt, create a database table for each of these text files and load the files into the database. You can do that by copying and pasting the text from this file into the SimSQL SQL prompt.
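

If it helps to see what these tables hold, here is a rough sketch of what the definitions of a couple of them might look like (the column names are taken from the queries used later in this document, but the types are only illustrative; the actual CREATE TABLE and load statements are the ones in the file above):

SimSQL> CREATE TABLE words (word_id INTEGER, word VARCHAR(30));

SimSQL> CREATE TABLE word_in_doc (group_id INTEGER, doc_id INTEGER, word_id INTEGER, count_num INTEGER);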


After you've created and loaded the data, you can type display tables at the SimSQL SQL command prompt to make sure you've loaded the tables successfully. You should see the tables topics, words, groups, docs, word_in_doc, and prior listed.


You can verify that you have loaded everything correctly by asking how many documents (grouped by newsgroup) have five or more occurrences of the words god, christ, or jesus:


SimSQL> SELECT d.label, COUNT(*) FROM docs AS d, word_in_doc AS wd, words AS w WHERE wd.doc_id = d.doc_id AND wd.group_id = d.group_id AND wd.word_id = w.word_id AND (w.word = 'god' OR w.word = 'christ' OR w.word = 'jesus') AND wd.count_num >= 5 GROUP BY d.label;

Invoking the optimizer...

Executing the plan. Number of operations to go is...

3->2->1 (running Aggregate as Hadoop job...)

mcdb_count____5                         | mcdb_d_label_5                                  

------------------------------------------------------------

4                                       | talk.politics.mideast                           

1                                       | talk.politics.guns                              

248                                     | soc.religion.christian                          

7                                       | talk.politics.misc                              

2                                       | sci.med                                         

63                                      | alt.atheism                                     

113                                     | talk.religion.misc                              

1                                       | comp.graphics 


Not surprisingly, the groups soc.religion.christian, alt.atheism, and talk.religion.misc have the most documents with five or more occurrences of one of these words (perhaps surprisingly, the computer graphics forum has one document like this!).


Next, we'll create a few different stochastic tables that will form a Markov chain. Simply copy and paste the contents of this text file into the SimSQL SQL command prompt. This code creates word_prob_in_topic[0], doc_info[0], word_prob_in_topic[i], and doc_info[i], which you can verify by typing display tables at the SQL prompt.


The schema for doc_info has seven attributes:

  1. rec_type. 1 means that it is telling us how many instances of a word in a document have been assigned to a given topic, 2 means that this is describing the topic mixture for a doc.
  2. group_id and doc_id together identify the document.
  3. word_id. Tells us which word is being described; this is -1 if rec_type is 2.
  4. topic_id. Tells us which topic is being described.
  5. count_num. Tells us how many instances of word_id in the doc were assigned to topic_id; this is -1 if rec_type is 2.
  6. probability. Tells us the prevalence of topic_id in this doc; this is -1 if rec_type is 1.

The schema for word_prob_in_topic has three attributes:

  1. topic_id. Tells us which topic is being described.
  2. word_id. Tells us which word is being described.
  3. probability. Tells us the prevalence of word_id in topic_id.
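

Before simulating the chain, you can take a quick look at the initial state of both tables. For example (the particular group, document, and topic IDs below are just placeholders; substitute any values that appear in your data):

SimSQL> SELECT * FROM doc_info[0] WHERE group_id = 1 AND doc_id = 1;

SimSQL> SELECT * FROM word_prob_in_topic[0] WHERE topic_id = 1;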

Now we are ready to simulate the Markov chain that will learn an LDA model over this data set. At the command prompt, type:


SimSQL> SELECT * FROM word_prob_in_topic[60];


This will cause SimSQL to simulate 60 steps of the Markov chain that has been specified, and will have the side effect of materializing doc_info[60] and word_prob_in_topic[60], each of which you can then query using SQL. Note that 60 iterations should be plenty because the Multi_Dir_DocWordTopic VG function used to create doc_info actually runs 10 "mini-Monte Carlo" iterations internally in sequence, updating the various word types and then the topic mixtures.


What you will see is something like:


SimSQL> SELECT * FROM word_prob_in_topic[60];

Monte Carlo iteration 0/60

Invoking the optimizer...

Executing the plan. Number of operations to go is...

10->9->8->7 (running Selection as Hadoop job...)

6 (running VGWrapper as Hadoop job...)

5 (running Selection as Hadoop job...)

4 (running VGWrapper as Hadoop job...)

3 (running SelectionChained as Hadoop job...)

2 (running SelectionChained as Hadoop job...)

1 (running FrameOutput as Hadoop job...)

Monte Carlo iteration 1/60

Invoking the optimizer...

Executing the plan. Number of operations to go is...

14->13->12->11 (running Selection as Hadoop job...)

10->9 (running Selection as Hadoop job...)

8->7->6 (running VGWrapper as Hadoop job...)

5 (running Selection as Hadoop job...)

4 (running Aggregate as Hadoop job...)

3 (running VGWrapper as Hadoop job...)

2 (running SelectionChained as Hadoop job...)

1 (running FrameOutput as Hadoop job...)


Be aware that each Monte Carlo iteration will take a few minutes, so the whole thing can take a few hours to run. See our note on SimSQL and Small Data below for a discussion of the running time.
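

Once the simulation finishes, the final state is just a pair of ordinary tables that you can query directly. For example, to look at the learned topic mixture for a single document (the group and document IDs here are placeholders; substitute any document in your data):

SimSQL> SELECT topic_id, probability FROM doc_info[60] WHERE rec_type = 2 AND group_id = 1 AND doc_id = 1;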


We can also use SQL to do a bit of deeper analysis. First, we can compute the importance of each word in each topic by computing, for each (word, topic) pair, the KL divergence between the learned distribution Pr[word_id|topic_id] and the corpus-wide distribution Pr[word_id]:
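

Concretely, the query below treats each (word, topic) pair as a pair of Bernoulli distributions and computes

KL(p || q) = p * (log(p) - log(q)) + (1 - p) * (log(1 - p) - log(1 - q))

where p is topics.probability (the learned Pr[word_id|topic_id]) and q is counts.num_times / tot.tot_cnt (the empirical Pr[word_id] over the whole corpus). The WHERE clause keeps only those pairs where p > q, that is, words that are over-represented in a topic relative to the corpus as a whole.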


SimSQL> CREATE MATERIALIZED VIEW kl (divergence, word_id, topic_id) AS

   > SELECT topics.probability * (log(topics.probability) - 

   >   log(counts.num_times / tot.tot_cnt)) + 

   >   (1.0 - topics.probability) * (log(1.0 - topics.probability) - 

   >   log(1.0 - counts.num_times / tot.tot_cnt)), 

   >   topics.word_id, topics.topic_id

   > FROM

   >   (SELECT word_id, SUM (count_num) AS num_times

   >    FROM word_in_doc

   >    GROUP BY word_id) AS counts,

   >   (SELECT 1.0 * SUM (count_num) AS tot_cnt

   >    FROM word_in_doc) AS tot,

   >   word_prob_in_topic[60] AS topics

   > WHERE counts.word_id = topics.word_id AND 

   >   topics.probability > counts.num_times / tot.tot_cnt;

Invoking the optimizer...

Executing the plan. Number of operations to go is...

5 (running Aggregate as Hadoop job...)

4->3 (running Aggregate as Hadoop job...)

2 (running Join as Hadoop job...)

1 (running FrameOutput as Hadoop job...)

output_table_name

------------------------------------------------------------------

/zzzzzzzjjjjjjjjjjj/kl_3257        


And we can compute all of the important words in each topic:


SimSQL> CREATE MATERIALIZED VIEW best_kl (word, divergence, topic_id) AS

   > SELECT w.word, big.divergence, big.topic_id

   > FROM

   >   (SELECT *

   >    FROM kl 

   >    WHERE divergence > 0.015) AS big, words AS w

   > WHERE big.word_id = w.word_id;

Invoking the optimizer...

Executing the plan. Number of operations to go is...

2 (running Join as Hadoop job...)

1 (running FrameOutput as Hadoop job...)

output_table_name

------------------------------------------------------------------

/zzzzzzzjjjjjjjjjjj/best_kl_3264


The following query will ask for all of the words that are important in topics where the word mideast is also important:


SimSQL> SELECT b.word, b.divergence, b.topic_id FROM best_kl AS b WHERE b.topic_id IN (SELECT topic_id FROM best_kl WHERE word = 'mideast'); 

Invoking the optimizer...

Executing the plan. Number of operations to go is...

1 (running Join as Hadoop job...)

mcdb_b_word_0                 | mcdb_b_divergence_0           | mcdb_b_topic_id_0

-----------------------------------------------------------------------------------------

bony                          | 0.02                          | 61

population                    | 0.02                          | 61

mideast                       | 0.06                          | 61

iranian                       | 0.03                          | 61

soc                           | 0.40                          | 61

ottoman                       | 0.02                          | 61

europe                        | 0.03                          | 61

african                       | 0.02                          | 61

yugoslavia                    | 0.03                          | 61

arab                          | 0.02                          | 61

war                           | 0.02                          | 61

bosnia                        | 0.02                          | 61

of                            | 0.09                          | 61

culture                       | 0.38                          | 61

greece                        | 0.02                          | 61

iran                          | 0.02                          | 61

muslims                       | 0.05                          | 61

occupied                      | 0.02                          | 61

arabic                        | 0.03                          | 61

ethnic                        | 0.02                          | 61

extermination                 | 0.02                          | 61

cantaloupe                    | 0.02                          | 61

army                          | 0.02                          | 61

serbs                         | 0.02                          | 61

pakistan                      | 0.02                          | 61

palestinians                  | 0.02                          | 61

greek                         | 0.05                          | 61

bosna                         | 0.02                          | 61

bony1                         | 0.03                          | 61

mideast                       | 0.06                          | 36

politics                      | 0.17                          | 36

israel                        | 0.09                          | 36

peace                         | 0.03                          | 36

palestinian                   | 0.02                          | 36

that                          | 0.03                          | 36

their                         | 0.05                          | 36

troops                        | 0.02                          | 36

uci                           | 0.04                          | 36

org                           | 0.02                          | 36

soldiers                      | 0.02                          | 36

istanbul                      | 0.02                          | 36

land                          | 0.02                          | 36

are                           | 0.03                          | 36

azerbaijan                    | 0.02                          | 36        


As one would expect, most of these words have to do with the Middle East, or at least with topics that one might expect to show up in an Internet newsgroup post discussing the Middle East. Interestingly, the words with the largest divergence (soc, culture, politics) are parts of the names of newsgroups (talk.politics.mideast, soc.religion.christian) where posts on the topic of the Middle East are likely to be sent. The initial processing of the documents breaks apart the newsgroup names into their constituent parts, which explains why the topic model found that these words co-occur commonly in the same topic.


SimSQL and Small Data: SimSQL is a great choice for very large-scale inference (LDA problems having millions of documents, for example). In fact, we have good experimental evidence that, among the platforms one might choose for large-scale Bayesian inference, SimSQL scales best. However, SimSQL is perhaps not the most natural choice for Bayesian inference on small data sets. SimSQL MCMC simulations are compiled into a series of Java programs that are then compiled and executed as MapReduce jobs on Hadoop. For a complex simulation, performing the various compilations and starting up the sequence of Hadoop MapReduce jobs required to run an MCMC iteration takes a couple of minutes, even before any actual computation is done. On a big problem where an entire MCMC iteration takes 20 minutes, that couple of minutes is not significant. But on a small problem where the computation takes only seconds, that couple of minutes is very annoying!


Collapsed vs. uncollapsed sampling: Finally, we note that in this document, we've considered how to use SimSQL for an uncollapsed Gibbs sampler for the LDA model. In the more standard collapsed sampler, some of the key variables are integrated out, which in theory can speed convergence. However, it is an oft-ignored fact that, in order to achieve parallelism in a distributed setting, all implementations of the collapsed sampler for LDA that we are aware of are not ergodic. This means that they are not actually correct. We've seen no experimental or theoretical evidence that the incorrect, collapsed sampler performs better than the correct, uncollapsed one. In the absence of such evidence, we are partial to the uncollapsed sampler, which is why we've presented it here. That said, SimSQL can easily be used to implement a collapsed sampler, if that's what you prefer!