Data Science Boot Camp Python Activity Three: Subarray Operators
Consider the following version of the LDA generative process, which records not only the words in each document but also which topic produced each word:
import numpy as np
# there are 2000 words in the corpus
alpha = np.full (2000, .1)
# there are 100 topics
beta = np.full (100, .1)
# this gets us the probability of each word happening in each of the 100 topics
wordsInTopic = np.random.dirichlet (alpha, 100)
# produced [doc, topic, word] gives us the number of times that the given word was
# produced by the given topic in the given doc
produced = np.zeros ((50, 100, 2000))
# generate each doc
for doc in range (0, 50):
    #
    # get the topic probabilities for this doc
    topicsInDoc = np.random.dirichlet (beta)
    #
    # assign each of the 1000 words in this doc to a topic
    wordsToTopic = np.random.multinomial (1000, topicsInDoc)
    #
    # and generate each of the 1000 words
    for topic in range (0, 100):
        produced[doc, topic] = np.random.multinomial (wordsToTopic[topic], wordsInTopic[topic])
In case you have copy/paste problems with the above code, a text version of this code is available here. As described in the comments, produced [doc, topic, word] gives the number of times that the given word was produced by the given topic in the given doc.
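As a warm-up for the indexing involved, here is a minimal sketch of how a [doc, topic, word] count array can be sliced and summed. It rebuilds a much smaller array with the same generative process so that it runs quickly. The sizes (5 docs, 4 topics, 30 words, 200 words per doc), the fixed seed, and all variable names are assumptions made for this illustration and are not part of the activity's code.

```python
import numpy as np

np.random.seed(0)  # fixed seed (an assumption) so the sketch is repeatable

# Hypothetical small sizes, chosen only to keep the example fast.
n_docs, n_topics, n_words, doc_len = 5, 4, 30, 200

# Same generative process as above, on the smaller sizes.
words_in_topic = np.random.dirichlet(np.full(n_words, .1), n_topics)
produced = np.zeros((n_docs, n_topics, n_words))
for doc in range(n_docs):
    topics_in_doc = np.random.dirichlet(np.full(n_topics, .1))
    words_to_topic = np.random.multinomial(doc_len, topics_in_doc)
    for topic in range(n_topics):
        produced[doc, topic] = np.random.multinomial(
            words_to_topic[topic], words_in_topic[topic])

# A single entry: how often word 7 was produced by topic 2 in doc 0.
one_count = produced[0, 2, 7]

# Fixing only the first index gives the (topic, word) table for doc 0.
doc0_table = produced[0]             # shape (n_topics, n_words)

# A ':' keeps a whole axis: word 7 across all docs and all topics.
word7_counts = produced[:, :, 7]     # shape (n_docs, n_topics)

# Summing over axes collapses them: words per doc, and words per topic.
words_per_doc = produced.sum(axis=(1, 2))    # shape (n_docs,)
words_per_topic = produced.sum(axis=(0, 2))  # shape (n_topics,)
```

Since every document is generated with exactly doc_len words, each entry of words_per_doc should equal doc_len, which makes a handy sanity check on the array.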
Given this, here are a number of mini-tasks:
Scroll down for some possible answers.