Data Science Boot Camp Python Activity Two: Implementing LDA With NumPy Arrays

Python is a scripting language, so it is almost always better to use arrays to implement statistical and numerical computations, keeping the number of explicit Python loops to a minimum. Here is an alternative implementation that makes better use of NumPy arrays:


import numpy as np

# there are 2000 words in the corpus
alpha = np.full (2000, .1)

# there are 100 topics
beta = np.full (100, .1)

# this gets us the probability of each word occurring in each of the 100 topics
wordsInTopic = np.random.dirichlet (alpha, 100)

# wordsInCorpus[i] will give us the vector of word counts in document i
wordsInCorpus = np.zeros ((50, 2000))

# generate each doc
for doc in range (0, 50):
        #
        # get the topic probabilities for this doc
        topicsInDoc = np.random.dirichlet (beta)
        #
        # assign each of the 1000 words in this doc to a topic
        wordsToTopic = ??????????????????????
        #
        # and generate each of the 1000 words
        for topic in range (0, 100):
                wordsFromCurrentTopic = ??????????????????????
                wordsInCorpus[doc] = ???????????????????????
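
Before filling in the blanks, it may help to keep the array shapes in mind. If you run just the setup lines above the loop, a couple of quick checks (an aside, not part of the activity) confirm what each array holds:

# each of the 100 topics is a probability distribution over the 2000 words
print (wordsInTopic.shape)            # (100, 2000)
print (wordsInTopic.sum (axis = 1))   # every row sums to 1.0 (up to floating point)
#
# the corpus starts out as 50 documents by 2000 word counts, all zero
print (wordsInCorpus.shape)           # (50, 2000)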


Your task is to complete the missing code. (If you have copy-and-paste problems with the code above, you can find a text file that is more amenable to copying and pasting here.) The first missing line assigns each of the document's 1000 words to a topic; the result should be an array where entry i is the number of words assigned to topic i. Now consider the last two missing lines. These should first use the current topic to produce a set of words, and then add the words produced by that topic into the current document.
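
One possible completion, sketched here on the assumption that np.random.multinomial is used both to split the 1000 words among the topics and to draw the words themselves, looks like this:

for doc in range (0, 50):
        #
        # get the topic probabilities for this doc
        topicsInDoc = np.random.dirichlet (beta)
        #
        # assign each of the 1000 words in this doc to a topic;
        # wordsToTopic[i] is the number of words drawn from topic i
        wordsToTopic = np.random.multinomial (1000, topicsInDoc)
        #
        # generate the words contributed by each topic and add them to the doc
        for topic in range (0, 100):
                wordsFromCurrentTopic = np.random.multinomial (wordsToTopic[topic], wordsInTopic[topic])
                wordsInCorpus[doc] = wordsInCorpus[doc] + wordsFromCurrentTopic

Everything in this sketch stays at the level of whole-array operations: each multinomial call returns a vector of counts, and adding wordsFromCurrentTopic into wordsInCorpus[doc] accumulates the counts for the current document without any per-word loop.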

The answer can be found here. 

If you are curious as to whether your code is correct, you can type "wordsInCorpus[2,1:10]". This will print out the number of occurrences of words 1 through 9 (inclusive) in document 2. I got something like "array([ 1., 1., 5., 0., 1., 0., 0., 0., 0.])", meaning that words 1 and 2 each appeared once, word 3 appeared five times, and so on; your numbers will differ, since the corpus is generated randomly.
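
Another rough sanity check, assuming each document really does receive 1000 generated words: every row of wordsInCorpus should sum to 1000.

# each document was built from exactly 1000 words,
# so every row of counts should sum to 1000
print (wordsInCorpus.sum (axis = 1))   # expect fifty values of 1000.0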