Data Science Boot Camp Spark Activity: Building a kNN Classifier

1) The task here is to build a kNN classifier for the 20 Newsgroups data set. That is, given a query text string, we will find the k documents closest to it, and then use a majority vote among those k documents to guess which class the query belongs to. The 20 categories in the 20 Newsgroups data set correspond to the 20 different newsgroups that the posts were extracted from. They are:

alt.atheism comp.windows.x rec.sport.hockey

soc.religion.christian comp.graphics misc.forsale

sci.crypt talk.politics.guns comp.os.ms-windows.misc

rec.autos sci.electronics talk.politics.mideast

comp.sys.ibm.pc.hardware rec.motorcycles sci.med

talk.politics.misc comp.sys.mac.hardware rec.sport.baseball

sci.space talk.religion.misc

Example: given the text string "god jesus allah" we might hope to return alt.atheism or soc.religion.christian or talk.religion.misc.
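The voting step can be sketched in a few lines of plain Python. This is only an illustration, not the activity code: the function name majority_vote and the toy distances below are made up for this example, and the distances are assumed to have been computed already.

```python
from collections import Counter

def majority_vote(neighbors, k):
    """Return the most common label among the k nearest neighbors.

    neighbors: list of (distance, label) pairs, in any order.
    """
    # Keep the k pairs with the smallest distance to the query.
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # most_common(1) returns a one-element list of (label, count).
    return votes.most_common(1)[0][0]

# Hypothetical distances from a query to six labeled documents.
toy_neighbors = [
    (0.9, "rec.autos"),
    (0.2, "soc.religion.christian"),
    (0.4, "alt.atheism"),
    (0.3, "soc.religion.christian"),
    (0.8, "sci.space"),
    (0.5, "talk.religion.misc"),
]
print(majority_vote(toy_neighbors, 3))  # soc.religion.christian
```

With k = 3 the nearest documents carry two soc.religion.christian labels and one alt.atheism label, so the vote goes to soc.religion.christian.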

You can start by taking a look at the data set. Check out https://s3.amazonaws.com/chrisjermainebucket/comp330_A6/20_news_same_line.txt. There are 19997 lines in this file, each corresponding to a different text document.

2) Start by going to http://cmj4.web.rice.edu/DSDay2/Activity2Out.py. Take a look at this code. It is a bit intricate! It builds a dictionary over the corpus through a series of RDD transformations. There is one tiny bit of code missing; try to fill it in. The answer is in http://cmj4.web.rice.edu/DSDay2/Activity2Answer.py. You might try experimenting a bit with adding a lambda to the top operation so that it selects the few most common dictionary words.
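To get a feel for what a dictionary of the most common words looks like, here is a small plain-Python sketch. It is not the code from Activity2Out.py: the function top_words, the regular-expression tokenizer, and the toy corpus are assumptions made for illustration. The sort key plays roughly the role that a key lambda plays in the RDD top operation.

```python
from collections import Counter
import re

def top_words(documents, n):
    """Return the n most frequent words as a word -> position dictionary."""
    counts = Counter()
    for text in documents:
        # Roughly a flatMap into words followed by a per-word count.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Sort by descending count, breaking ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:n]
    return {word: position for position, (word, _) in enumerate(ranked)}

toy_corpus = [
    "the goalie stopped the puck",
    "the puck hit the post",
    "the rocket reached orbit",
]
print(top_words(toy_corpus, 2))  # {'the': 0, 'puck': 1}
```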

3) Now check out http://cmj4.web.rice.edu/DSDay2/Activity3Out.py. This code adds a bit to the last one. There are four missing transformations. See if you can figure them out. The answer is in http://cmj4.web.rice.edu/DSDay2/Activity3Answer.py. Note that when we print out the top few items in the resulting RDD, the output is not very satisfying: we see each key, but its value is shown only as an opaque pyspark.resultiterable.ResultIterable object rather than as its contents.

4) Now do http://cmj4.web.rice.edu/DSDay2/Activity4Out.py. The answer is in http://cmj4.web.rice.edu/DSDay2/Activity4Answer.py. At this point, the entire input corpus is transformed into a bunch of nice NumPy vectors storing the bag-of-words for each document.
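As a rough picture of what one bag-of-words vector holds, the sketch below counts dictionary words in a single document. It is an illustration only: the function bag_of_words, the tokenizer, and the three-word toy dictionary are assumptions, and a plain Python list stands in for the NumPy vector used in the activity.

```python
import re

def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in text.

    dictionary: word -> position mapping; words outside it are ignored.
    """
    vector = [0] * len(dictionary)
    for word in re.findall(r"[a-z]+", text.lower()):
        position = dictionary.get(word)
        if position is not None:
            vector[position] += 1
    return vector

# Hypothetical three-word dictionary; a real one would be much larger.
toy_dictionary = {"the": 0, "puck": 1, "goalie": 2}
print(bag_of_words("The goalie passed the puck to the defender", toy_dictionary))
# [3, 1, 1]
```

Position i of the vector holds the count of the dictionary word whose position is i, so every document maps to a vector of the same length.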

5) Finally, take a look at http://cmj4.web.rice.edu/DSDay2/Activity5.py. There is no code for you to fill in here, but this has an additional function that processes an input string and performs the kNN classification by (in parallel) searching through all of the documents that have previously been processed, and finding the k closest. Look over this new code.

Then you can run it. Try getPrediction("god jesus allah", 30) and also try getPrediction("how many goals Vancouver score last year?", 30). Do the answers make sense? Make up some other queries and try those as well.
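For intuition about what happens behind such a call, here is a tiny single-machine sketch of the same idea. It is not the code in Activity5.py: the name get_prediction_sketch, the four toy documents, and the choice of a dot product of word counts as the similarity measure are all assumptions made for this example.

```python
from collections import Counter
import heapq
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def get_prediction_sketch(query, k, labeled_docs):
    """Guess a label for query by majority vote among the k most similar docs.

    labeled_docs: list of (label, text) pairs. Similarity is the dot product
    of word-count vectors, an assumption made for this sketch.
    """
    query_counts = Counter(tokenize(query))

    def similarity(doc):
        doc_counts = Counter(tokenize(doc[1]))
        return sum(count * doc_counts[word] for word, count in query_counts.items())

    # Score every document, then keep the k highest-scoring ones.
    nearest = heapq.nlargest(k, labeled_docs, key=similarity)
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

toy_docs = [
    ("rec.sport.hockey", "vancouver scored three goals last night"),
    ("rec.sport.hockey", "the goalie allowed two goals"),
    ("sci.space", "the shuttle launch was delayed last week"),
    ("rec.autos", "my car needs new brakes"),
]
print(get_prediction_sketch("how many goals Vancouver score last year?", 3, toy_docs))
# rec.sport.hockey
```

On this toy corpus the three highest-scoring documents are the two hockey posts and the space post, so the vote returns rec.sport.hockey.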