Data Science Boot Camp Python Activity Three: Computing the Number of Word Co-Occurrences

Computing the number of documents in which a pair of words co-occurs is an important problem in text analytics. Here we’ll examine the cost/runtime of doing this several different ways in Python.

First, run this code, which will compute the document corpus and store it using Python dictionaries.

Once you have done this, your goal is to complete the following code, which loops through all of the documents and, for each document, counts the number of times that each pair of words co-occurs. Note that when you run this code, it will print out the total execution time. Hint: when you call .get(key, default) to access an entry in a dictionary and that key is not present in the dictionary, the default value is returned instead.
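As a quick illustration of the hint, separate from the exercise skeleton, here is a minimal sketch of counting with .get and a default of 0. The word list and variable names are made up for the example and are not part of the activity's corpus.

```python
# Toy illustration of dict.get with a default; the word list is made up.
words = ["apple", "pear", "apple", "fig", "apple"]

counts = {}
for word in words:
    # If `word` is not yet a key, get() returns 0 instead of raising KeyError.
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'apple': 3, 'pear': 1, 'fig': 1}

# Tuples work as keys in the same way, which is how a pair of words can be counted.
pair_counts = {}
pair = ("apple", "pear")
pair_counts[pair] = pair_counts.get(pair, 0) + 1
```

The same pattern applies whenever a counter may or may not exist yet for a given key.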

import time

start = time.time()

coOccurrences = {}

for doc in range(0, 50):
    for ?????????????:
        coOccurrences[(wordOne, wordTwo)] = ?????????????

end = time.time()

end - start

Note the running time. The answer can be found here.

Next, your task is to write similar code, this time using the NumPy array-based implementation. First, run the array-based code, which will compute the document corpus and store it using NumPy arrays. We will now implement the co-occurrence analysis by looping through all of the documents, taking the outer product of each document with itself, and summing the results. It is important that you cap any counts at one: if wordA appears twice in a document and wordB appears three times, the number of documents in which the pair co-occurred is still one, not six. You can cap the values in a NumPy array using np.clip(array, minToAllow, maxToAllow). Here is a skeleton of the code:

import time
import numpy as np

start = time.time()

coOccurrences = np.zeros ((2000, 2000))

for doc in range(0, 50):
    coOccurInThisDoc = ?????????????
    coOccurrences = ?????????????

end = time.time()

end - start

Note the running time. The answer can be found here.
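The capping rule described above can be checked on a toy example. The sketch below assumes a hypothetical 4-word dictionary and a single made-up document stored as a vector of word counts. It only demonstrates np.outer and np.clip and is not the solution to the skeleton.

```python
import numpy as np

# Hypothetical toy document over a 4-word dictionary: entry i is the number
# of times word i appears in the document.
doc_counts = np.array([2, 0, 3, 1])

# Outer product of the document with itself: entry (i, j) is count_i * count_j.
raw = np.outer(doc_counts, doc_counts)

# Cap every entry at one so that a pair is counted once per document.
capped = np.clip(raw, 0, 1)

print(raw[0, 2], capped[0, 2])  # 6 1
print(capped[1, 2])             # 0, since word 1 does not appear in the document
```

Without the clip, word 0 (two occurrences) and word 2 (three occurrences) would contribute six to the total instead of one.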

Finally, your task is to write one line of code that uses a matrix multiplication to compute the answer. Here is the skeleton of the code:

import time

start = time.time()

res = ?????????????

end = time.time()

end - start

Note the running time once again. Which implementation is fastest? The answer can be found here.