Data Science Boot Camp Python Activity One: Implementing LDA


In this activity, you will write a few lines of mathematical/statistical Python code. The first thing that you need to do is start Anaconda. On Windows, open the Start menu and click the Anaconda Navigator desktop app. On Mac, open Launchpad, then click the Anaconda Navigator icon. Once Anaconda Navigator starts, its home screen will list a number of applications that you can run. You will want to run Spyder. Launch Spyder by clicking Spyder’s Launch button.


Once Spyder is running, consider the following almost-complete Python code:


import numpy as np

# this returns index i with probability p[i]
def sampleValue (p):
        return np.flatnonzero (np.random.multinomial (1, p, 1))[0]

# there are 2000 words in the dictionary
alpha = np.full (2000, .1)

# there are 100 topics
beta = np.full (100, .1)

# this gets us the probability of each word happening in each of the 100 topics
wordsInTopic = np.random.dirichlet (alpha, 100)

# wordsInCorpus[i] will give us the number of each word in document i
wordsInCorpus = {}

# generate each doc
for doc in range (0, 50):
        #
        # no words in this doc yet
        wordsInDoc = {}
        #
        # get the topic probabilities for this doc
        topicsInDoc = np.random.dirichlet (beta)
        #
        # generate each of the 1000 words in this document
        for word in range (0, 1000):
                #
                # select the topic and the word
                whichTopic = ???????????????
                whichWord = ???????????????
                #
                # and record the word
                wordsInDoc [whichWord] = wordsInDoc.get (whichWord, 0) + 1
                #
        # now, remember this document
        wordsInCorpus [doc] = ???????????????


This program implements the LDA generative process and uses it to create 50 documents of 1000 words each. The words in each document are stored as a map from wordID to the number of times that the word appears in that document. Thus, wordsInCorpus[i][j] is either the number of times that the jth dictionary word appears in document i, or else it is undefined (if word j does not appear in document i).
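To make the storage format concrete, here is a small sketch of a corpus in the same shape, with made-up counts. The names toyCorpus and countOf are hypothetical and are not part of the activity code. The sketch shows one way to read a count without getting an error when a word never appeared in a document.

```python
# Toy corpus in the same shape as wordsInCorpus:
# document ID -> {word ID -> number of occurrences in that document}.
# The counts below are made up for illustration.
toyCorpus = {
    0: {0: 3, 7: 1, 9: 2},
    1: {2: 5, 7: 4},
}

def countOf(corpus, doc, word):
    # hypothetical helper: a word that never appeared has no entry,
    # so fall back to a count of zero instead of raising KeyError
    return corpus[doc].get(word, 0)

print(countOf(toyCorpus, 0, 7))      # 1
print(countOf(toyCorpus, 1, 0))      # 0
print(sum(toyCorpus[0].values()))    # 6, the total number of words in document 0
```

The same .get (key, 0) idiom is what the activity code uses when it records each generated word.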


Your task is now to replace those question marks with appropriate code, so that you can execute the LDA generative process. To do this, copy and paste this code into the pane labeled temp.py, replace the question marks with code, and press the "play" button (the sideways triangle that is the seventh icon from the left in the toolbar).
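Before filling in the blanks, it may help to try the building blocks on their own. The following is a minimal sketch, not the answer to the activity. It repeats the sampleValue helper from the code above and uses small, made-up sizes (5 entries, 3 draws) so the output is easy to read. The names oneDraw and manyDraws are only for illustration.

```python
import numpy as np

def sampleValue(p):
    # same helper as in the activity code: returns index i with probability p[i]
    return np.flatnonzero(np.random.multinomial(1, p, 1))[0]

# With all of the probability on index 2, the result is always 2.
print(sampleValue(np.array([0.0, 0.0, 1.0])))   # 2

# A single Dirichlet draw is one probability vector whose entries sum to about 1.
oneDraw = np.random.dirichlet(np.full(5, .1))
print(oneDraw.shape)                            # (5,)
print(oneDraw.sum())

# Asking for several draws gives a 2-D array with one probability vector per row.
manyDraws = np.random.dirichlet(np.full(5, .1), 3)
print(manyDraws.shape)                          # (3, 5)
```

In the activity code, topicsInDoc has the first shape (one vector of 100 topic probabilities) and wordsInTopic has the second (100 rows of 2000 word probabilities each).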


Note that there may be some issues with copying and pasting the above code (extra lines and formatting problems may be introduced). If you have a problem, click this link to get a text-based version of the code, and copy and paste that one instead.


One possible answer to this activity can be accessed here. If you want to know if you are getting reasonable results, at the Python prompt (in the IPython console) type wordsInCorpus[4] to display the word counts for document 4 of the corpus (the fifth document, since numbering starts at zero). You should get something like {0: 3, 7: 1, 9: 2, 10: 2, 12: 2,...} though the actual numbers that you see may differ. This means that word zero appears three times in document four, word seven appears once, and so on.
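Beyond looking at one document by eye, a few structural checks can catch common mistakes. The sketch below assumes the sizes used in this activity (50 documents, 1000 words per document, 2000 dictionary words). The function checkCorpus is a hypothetical helper, not part of the activity code, and it is demonstrated here on a tiny made-up corpus.

```python
def checkCorpus(corpus, numDocs=50, wordsPerDoc=1000, vocabSize=2000):
    # hypothetical helper: returns a list of problems found (empty if none)
    problems = []
    if len(corpus) != numDocs:
        problems.append("expected %d documents, found %d" % (numDocs, len(corpus)))
    for doc, counts in corpus.items():
        # every generated word is counted once, so the counts must add up
        total = sum(counts.values())
        if total != wordsPerDoc:
            problems.append("document %s has %d words" % (doc, total))
        # every word ID must be a valid index into the dictionary
        if any(w < 0 or w >= vocabSize for w in counts):
            problems.append("document %s has a word ID out of range" % doc)
    return problems

# Tiny made-up corpus: 2 documents, 3 words each, dictionary of 4 words.
goodToy = {0: {0: 2, 3: 1}, 1: {1: 3}}
print(checkCorpus(goodToy, numDocs=2, wordsPerDoc=3, vocabSize=4))   # []

# The second document here is one word short.
badToy = {0: {0: 2, 3: 1}, 1: {1: 2}}
print(checkCorpus(badToy, numDocs=2, wordsPerDoc=3, vocabSize=4))
```

After your completed program has run, you could paste this function into the console and call checkCorpus (wordsInCorpus). An empty list means that the corpus has the expected shape. It does not prove that the sampling itself is correct.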