Data Science Boot Camp Python Activity Three: Subarray Operators 

Consider the following version of the LDA generative process, which records not only the words in each document but also which topic produced each word:


import numpy as np

# there are 2000 distinct words in the vocabulary
alpha = np.full (2000, .1)

# there are 100 topics
beta = np.full (100, .1)

# this gets us the probability of each word happening in each of the 100 topics
wordsInTopic = np.random.dirichlet (alpha, 100)

# produced [doc, topic, word] gives us the number of times that the given word was
# produced by the given topic in the given doc
produced = np.zeros ((50, 100, 2000))

# generate each of the 50 docs
for doc in range (0, 50):
        #
        # get the topic probabilities for this doc
        topicsInDoc = np.random.dirichlet (beta)
        #
        # assign each of the 1000 words in this doc to a topic
        wordsToTopic = np.random.multinomial (1000, topicsInDoc)
        #
        # and generate each of the 1000 words
        for topic in range (0, 100):
                produced[doc, topic] = np.random.multinomial (wordsToTopic[topic], wordsInTopic[topic])


In case you have copy/paste problems with the above code, a text version of this code is available here. As described in the comments, produced [doc, topic, word] gives the number of times that the given word was produced by the given topic in the given doc. 

Given this, here are a number of mini-tasks:

  1. Write a line of code that computes the number of words produced by topic 17 in document 18.
  2. Write a line of code that computes the number of words produced by topics 17 through 45 (inclusive) in document 18.
  3. Write a line of code that computes the number of words in the entire corpus. 
  4. Write a line of code that computes the number of words in the entire corpus produced by topic 17.
  5. Write a line of code that computes the number of words in the entire corpus produced by topic 17 or topic 23.
  6. Write a line of code that computes the number of words in the entire corpus produced by even numbered topics.
  7. Write a line of code that computes the number of each word produced by topic 15.
  8. Write a line of code that computes the topic responsible for the most instances of each word in the corpus.
  9. Write a line of code that, for each topic, computes the maximum number of occurrences (summed over all documents) of any single word that the topic was responsible for.
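
As a warm-up for the tasks above, here is a small sketch of how subarray indexing behaves on a three-dimensional array. The array toy and its sizes (2 docs, 3 topics, 4 words) are made up for illustration and are not part of the activity; the point is only that supplying fewer indices, or a slice, returns a subarray rather than a single number.

```python
import numpy as np

# toy stand-in for produced: 2 docs, 3 topics, 4 words (hypothetical sizes)
toy = np.arange(24).reshape((2, 3, 4))

# three indices pick out a single entry
oneCount = toy[1, 2, 3]

# two indices leave the word axis: the 4 word counts for topic 2 in doc 1
oneTopicInDoc = toy[1, 2]

# one index leaves a topic-by-word table for doc 1
oneDoc = toy[1]

# a colon keeps an axis: topic 2 across both docs
oneTopic = toy[:, 2]

# a slice excludes its end point: topics 0 and 1 only
someTopics = toy[:, 0:2]

print(oneCount, oneTopicInDoc.shape, oneDoc.shape, oneTopic.shape, someTopics.shape)
# prints: 23 (4,) (3, 4) (2, 4) (2, 2, 4)
```

Calling .sum () on any of these subarrays adds up every entry it contains, and .sum (0) adds along the first axis only.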

Scroll down for some possible answers.










  1. produced[18,17,:].sum () # or produced[18,17].sum ()
  2. produced[18,17:46].sum ()
  3. produced.sum () # or produced[:,:,:].sum ()
  4. produced[:,17,:].sum () # or produced[:,17].sum ()
  5. produced[:,np.array([17,23]),:].sum ()
  6. produced[:,np.arange(0,100,2),:].sum ()
  7. produced[:,15,:].sum (0) # or produced.sum (0)[15]
  8. produced.sum (0).argmax (0)
  9. produced[:,np.arange(0,100,1),produced.sum (0).argmax (1)].sum (0) # this works, though it is possible to come up with a much better solution!
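
One candidate for a simpler answer to task 9 is to sum over documents first and then take the max along the word axis, which should give the same result without any index arrays. The sketch below checks this on a small random array that is only a stand-in for produced (the sizes and the seed are assumptions chosen to keep it quick). It also shows a step slice as a possible alternative to np.arange for the even-numbered topics in task 6.

```python
import numpy as np

# small stand-in for produced: 5 docs, 30 topics, 40 words (hypothetical sizes)
rng = np.random.default_rng(0)
produced = rng.integers(0, 5, size=(5, 30, 40))
numTopics = produced.shape[1]

# task 9 as in the answer above: look up, per topic, the word with the
# largest corpus-wide count, then sum that word's count over all docs
viaIndexArrays = produced[:, np.arange(numTopics), produced.sum(0).argmax(1)].sum(0)

# task 9, simpler: sum over docs, then take the max over the word axis
viaMax = produced.sum(0).max(1)

# task 6 as in the answer above, and the same thing with a step slice
evenViaArange = produced[:, np.arange(0, numTopics, 2), :].sum()
evenViaSlice = produced[:, ::2, :].sum()

print(np.array_equal(viaIndexArrays, viaMax), evenViaArange == evenViaSlice)
# prints: True True
```

Against the full-size array from the activity, the same two expressions would read produced.sum (0).max (1) and produced[:,::2,:].sum ().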