Data Science Boot Camp Python Activity One: Implementing LDA

Fire up Python. Copy and paste the following program into your favorite text editor (Word is fine):

import numpy as np

import time

# given a vector of probabilities p, this returns index i with probability p[i]

def sampleValue (p):

        return np.flatnonzero (np.random.multinomial (1, p, 1))[0]

# there are 2000 words in the dictionary

alpha = np.full (2000, .1)

# there are 100 topics

beta = np.full (100, .1)

# this gets us the probability of each word happening in each of the 100 topics

wordsInTopic = np.random.dirichlet (alpha, 100)

# wordsInCorpus[i] will give us the count of each word in document i

wordsInCorpus = {}

# generate each doc

for doc in range (0, 50):


        # no words in this doc yet

        wordsInDoc = {}


        # get the topic probabilities for this doc

        topicsInDoc = np.random.dirichlet (beta)


        # generate each of the 1000 words in this document

        for word in range (0, 1000):


                # select the topic and the word

                whichTopic = ???????????????

                whichWord = ???????????????


                # and record the word

                wordsInDoc [whichWord] = wordsInDoc.get (whichWord, 0) + 1


        # now, remember this document

        wordsInCorpus [doc] = ???????????????

This program implements the LDA generative process and uses it to create 50 documents of 1000 words each. The words in each document are stored as a map from wordID to the number of times that the word appears in that document. Thus, wordsInCorpus[i][j] is either the number of times that the jth dictionary word appears in document i, or else it is undefined (if word j does not appear in document i).
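
Before filling in the blanks, it may help to see how the two building blocks behave on a toy problem. The sketch below copies sampleValue from the program above; the toy sizes (3 topics, 5 words), the fixed seed, and the names toyTopicProbs, toyWordProbs, draws, and freqs are assumptions made only for this illustration and are not part of the activity.

```python
import numpy as np

def sampleValue(p):
    # run one multinomial trial and return the index of the outcome that occurred
    return np.flatnonzero(np.random.multinomial(1, p, 1))[0]

np.random.seed(0)  # fixed seed only so that reruns are repeatable

# toy sizes for illustration, not the ones used in the activity
toyTopicProbs = np.random.dirichlet(np.full(3, .1))     # one probability vector
toyWordProbs = np.random.dirichlet(np.full(5, .1), 3)   # one row per topic

# each Dirichlet draw is a probability vector, so it sums to (about) 1
print(toyTopicProbs.sum())
print(toyWordProbs.shape)

# sampleValue should return index i roughly a fraction p[i] of the time
draws = [sampleValue([0.2, 0.5, 0.3]) for _ in range(10000)]
freqs = np.bincount(draws, minlength=3) / 10000.0
print(freqs)
```

The printed frequencies should be close to 0.2, 0.5, and 0.3, though the exact values depend on the random draws.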

Your task now is to replace those question marks with appropriate code, so that you can execute the LDA generative process. The complete code can be accessed here. If you want to know whether you are getting reasonable results, at the Python prompt type wordsInCorpus[4] to display the word counts for document 4 in the corpus (the fifth document, since numbering starts at zero). You should get something like {0: 3, 7: 1, 9: 2, 10: 2, 12: 2,...} though the actual numbers that you see may differ. This means that word zero appears three times in document four, word seven appears once, and so on.
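
Beyond eyeballing one document, a few automatic checks can catch common mistakes. The sketch below is one possible way to do this; checkCorpus is a hypothetical helper that is not part of the activity, and its default sizes simply restate the numbers used above (50 documents, 1000 words each, word IDs 0 through 1999). It is demonstrated on a tiny made-up corpus so that it runs on its own.

```python
def checkCorpus(corpus, numDocs=50, wordsPerDoc=1000, dictSize=2000):
    # hypothetical helper: returns a list of problems found (empty if none)
    problems = []
    if len(corpus) != numDocs:
        problems.append("expected %d documents, found %d" % (numDocs, len(corpus)))
    for doc, counts in corpus.items():
        # the counts in one document should add up to the document length
        total = sum(counts.values())
        if total != wordsPerDoc:
            problems.append("document %d has %d words" % (doc, total))
        # every word ID should be a valid index into the dictionary
        if any(w < 0 or w >= dictSize for w in counts):
            problems.append("document %d has an out-of-range word ID" % doc)
    return problems

# made-up corpus: 2 documents of 4 words each, dictionary of 10 words
toyCorpus = {0: {1: 3, 7: 1}, 1: {0: 2, 9: 2}}
print(checkCorpus(toyCorpus, numDocs=2, wordsPerDoc=4, dictSize=10))  # prints []

# a word that never appears has no entry, so use get with a default of 0
print(toyCorpus[0].get(5, 0))  # prints 0
```

Once your own program has run, calling checkCorpus (wordsInCorpus) with the default sizes should likewise return an empty list.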