Data Science Boot Camp Spark Activity Two: Building a kNN Classifier

1) The task here is to build a kNN classifier for the 20 Newsgroups data set. That is, given a query text string, we will find the k documents in the corpus that are closest to it, and then take a majority vote over the classes of those k documents to guess which class the query belongs to. The 20 categories in the 20 Newsgroups data set correspond to the 20 different newsgroups that the posts were extracted from. They are:



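comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware
comp.windows.x rec.autos rec.motorcycles rec.sport.baseball
rec.sport.hockey sci.crypt sci.electronics sci.med
sci.space misc.forsale talk.politics.misc talk.politics.guns
talk.politics.mideast talk.religion.misc alt.atheism soc.religion.christian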

Example: given the text string "god jesus allah" we might hope to return alt.atheism or soc.religion.christian or talk.religion.misc.

You can start by taking a look at the data set. Check out There are 19997 lines in this file, each corresponding to a different text document.
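To make the majority vote in item 1 concrete, here is a minimal sketch in plain Python (no Spark needed). The function name majority_vote and the distances in the example are made up for illustration; they are not taken from the activity code.

```python
from collections import Counter

def majority_vote(neighbors, k):
    """Return the most common label among the k nearest neighbors.

    neighbors: list of (distance, label) pairs, in any order.
    """
    # Keep only the k pairs with the smallest distance
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied count, the label that was counted first (the closer one) wins
    return votes.most_common(1)[0][0]

# Made-up distances, for illustration only
neighbors = [
    (0.9, "sci.crypt"),
    (0.2, "alt.atheism"),
    (0.4, "talk.religion.misc"),
    (0.3, "alt.atheism"),
    (0.8, "sci.space"),
]
print(majority_vote(neighbors, 3))  # alt.atheism
```

With k = 3 the three closest documents are two from alt.atheism and one from talk.religion.misc, so alt.atheism wins the vote.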

2) Start by going to Take a look at this code. It is a bit intricate! It is trying to build a dictionary over the corpus through a series of RDD transformations. There is one tiny bit of code missing… now try to fill it in. The answer is in 
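If the idea of "a dictionary over the corpus" is unclear, the following plain-Python sketch shows one common way to build one: count every word, keep the most frequent ones, and give each a position. This is an assumption about what the dictionary looks like, not the activity's actual code; build_dictionary and vocab_size are hypothetical names. On an RDD, the counting step would be roughly analogous to a flatMap followed by a reduceByKey.

```python
import re
from collections import Counter

def build_dictionary(docs, vocab_size):
    """Map each of the vocab_size most frequent words to a position 0..vocab_size-1."""
    counts = Counter()
    for text in docs:
        # Lowercase the text and keep runs of letters only
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Sort by descending count, breaking ties alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
    return {word: pos for pos, (word, _) in enumerate(ranked[:vocab_size])}

docs = ["the cat sat", "the dog sat", "the end"]
print(build_dictionary(docs, 2))  # {'the': 0, 'sat': 1}
```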

3) Now check out This code adds a bit to the last one. There are four missing transformations. See if you can figure them out.  The answer is in 

4) Now do The answer is in  At this point, the entire input corpus is transformed into a bunch of nice NumPy vectors storing the bag-of-words for each document.
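As a reminder of what a bag-of-words vector is, here is a small sketch. It uses a plain Python list where the activity uses a NumPy vector, and it assumes the dictionary maps each word to a position 0..n-1; the tiny dictionary and the name bag_of_words are hypothetical.

```python
import re

def bag_of_words(text, dictionary):
    """Return a count vector with one slot per dictionary word."""
    vector = [0] * len(dictionary)
    for word in re.findall(r"[a-z]+", text.lower()):
        pos = dictionary.get(word)
        if pos is not None:  # words outside the dictionary are ignored
            vector[pos] += 1
    return vector

dictionary = {"god": 0, "goal": 1, "score": 2}  # hypothetical tiny dictionary
print(bag_of_words("God is great, god is good", dictionary))  # [2, 0, 0]
```

Every document ends up as a vector of the same length, which is what makes it possible to compute distances between documents later.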

5) Finally, take a look at There is no code for you to fill in here, but this has an additional function that processes an input string and performs the kNN classification by (in parallel) searching through all of the documents that have previously been processed, and finding the k closest. Look over this new code. 

Then you can run it. Try getPrediction("god jesus allah", 30) and also try getPrediction("how many goals Vancouver score last year?", 30). Do the answers make sense? Make up some other queries and try those as well.
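The "in parallel" search in item 5 can be pictured as follows: each partition of the data finds its own k closest documents, and only those few candidates are merged into the final answer. The sketch below simulates this with ordinary lists standing in for partitions. It assumes Euclidean distance, which may not be the measure the activity code uses, and k_closest and distance are hypothetical names.

```python
import heapq
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_closest(partitions, query, k):
    """Find the k documents closest to query across several partitions.

    partitions: list of lists of (label, vector) pairs.
    Returns a list of (distance, label) pairs, closest first.
    """
    local_best = []
    for part in partitions:
        # Each partition can be scanned independently of the others
        scored = ((distance(vec, query), label) for label, vec in part)
        local_best.extend(heapq.nsmallest(k, scored))
    # Only k candidates per partition have to be merged
    return heapq.nsmallest(k, local_best)

partitions = [
    [("alt.atheism", [2, 0]), ("rec.sport.hockey", [0, 3])],
    [("talk.religion.misc", [1, 0]), ("sci.space", [0, 0])],
]
result = k_closest(partitions, [2, 0], 2)
print([label for _, label in result])  # ['alt.atheism', 'talk.religion.misc']
```

The labels of the k winners would then go into the majority vote to produce the prediction.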