Data Science Boot Camp Python Activity Two: Implementing LDA With NumPy Arrays
Python is a scripting language, and explicit Python loops are slow, so for statistical and numerical computations it is almost always better to work with arrays and keep the number of loops to a minimum. Here is an alternative implementation that makes better use of NumPy arrays:
import numpy as np
# there are 2000 words in the corpus
alpha = np.full (2000, .1)
# there are 100 topics
beta = np.full (100, .1)
# this gets us the probability of each word happening in each of the 100 topics
wordsInTopic = np.random.dirichlet (alpha, 100)
# wordsInCorpus[i] will give us the vector of words in document i
wordsInCorpus = np.zeros ((50, 2000))
# generate each doc
for doc in range (0, 50):
    #
    # get the topic probabilities for this doc
    topicsInDoc = np.random.dirichlet (beta)
    #
    # assign each of the 1000 words in this doc to a topic
    wordsToTopic = ??????????????????????
    #
    # and generate each of the 1000 words
    for topic in range (0, 100):
        wordsFromCurrentTopic = ??????????????????????
        wordsInCorpus[doc] = ???????????????????????
Your task is to complete the missing code. If you have trouble copying and pasting the code above, you can find a text file more amenable to copy and paste here. The first missing line assigns each of the document's 1000 words to a topic. The result should be an array in which entry i is the number of words assigned to topic i. Now consider the last two missing lines. The first should use the current topic to produce a set of words; the second should take the words produced by the current topic and add them into the current document.
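If the two steps described above are hard to picture, the following toy-sized sketch may help. It assumes that multinomial draws (np.random.multinomial) are a reasonable way to split a word count among categories; that is an assumption of this sketch and may or may not match the linked answer. All the names beginning with "toy", and the sizes numTopics, vocabSize and docLength, are hypothetical and are not part of the activity code.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# toy sizes (hypothetical; far smaller than the activity's 100 topics and 2000 words)
numTopics, vocabSize, docLength = 3, 5, 10

# one row of word probabilities per topic
toyWordsInTopic = np.random.dirichlet(np.full(vocabSize, .1), numTopics)
# topic probabilities for a single document
toyTopicsInDoc = np.random.dirichlet(np.full(numTopics, .1))

# a single multinomial draw splits the document's words among the topics:
# entry i is the number of words assigned to topic i
toyWordsToTopic = np.random.multinomial(docLength, toyTopicsInDoc)

# for each topic, draw that many words from the topic and add the counts to the document
toyDoc = np.zeros(vocabSize)
for topic in range(numTopics):
    toyDoc += np.random.multinomial(toyWordsToTopic[topic], toyWordsInTopic[topic])

print(toyWordsToTopic, toyDoc)
```

Both printed arrays should sum to docLength, since every word is assigned to exactly one topic and then generated exactly once.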
The answer can be found here.
If you are curious as to whether your code is correct, you can type "wordsInCorpus[2,1:10]". This will print out the number of occurrences of words 1 through 9 (inclusive) in document 2. I got something like "array([ 1., 1., 5., 0., 1., 0., 0., 0., 0.])". This means that word one appeared once, word two once, word three five times, and so on. Because the documents are generated randomly, your numbers will differ.
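Since the exact counts are random, a structural check can be more telling than comparing numbers by eye. The sketch below is one possible sanity check, not part of the activity: checkCorpus is a hypothetical helper, and standIn is a made-up corpus used only so that the example runs on its own.

```python
import numpy as np

def checkCorpus(corpus, wordsPerDoc):
    """Return True if every row holds non-negative whole counts that sum to wordsPerDoc."""
    corpus = np.asarray(corpus)
    nonNegative = np.all(corpus >= 0)
    wholeCounts = np.all(corpus == np.round(corpus))
    rightLength = np.all(corpus.sum(axis=1) == wordsPerDoc)
    return bool(nonNegative and wholeCounts and rightLength)

# stand-in corpus (hypothetical): 4 documents of 1000 words over a 20-word vocabulary
np.random.seed(1)
standIn = np.random.multinomial(1000, np.full(20, 1 / 20), size=4).astype(float)
print(checkCorpus(standIn, 1000))
```

With the completed activity code, "checkCorpus(wordsInCorpus, 1000)" should return True if every document received exactly 1000 words.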