Data Science Boot Camp Spark Activity: Building a kNN Classifier Using TF * IDF
1) The last classifier we built had some problems (see the poor result on the hockey query).
We will try to fix this by moving to TF * IDF. First, check out http://cmj4.web.rice.edu/DSDay2/Activity6Out.py and take a look at the code. It is exactly the same as the Activity 5 code, except that the body of the buildArray function has been removed. Your task is to rewrite it so that it builds a TF (term frequency) vector rather than a count vector. In a TF vector, each entry is a word's count divided by the total number of words in the document.
After you make the change, try classifying a few strings with the new code to see whether it works better. Before you do, consider exiting the Spark shell, restarting it, and re-executing your code, to make sure that the new function definition is the one actually being used. You can see my answer at http://cmj4.web.rice.edu/DSDay2/Activity6Answer.py.
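As a rough illustration of the idea, and not the posted answer, here is a minimal plain-Python sketch of a TF version of buildArray. The signature is an assumption: it takes a list of dictionary positions (one per word occurrence in the document) and a dictionary size, which may not match how the Activity 5 code passes its data.

```python
def buildArray(listOfIndices, dictSize):
    """Return a TF vector: each word's count divided by the document length.

    listOfIndices and dictSize are assumed inputs for this sketch; the
    activity code may use a different signature or a NumPy array.
    """
    counts = [0.0] * dictSize
    for index in listOfIndices:
        counts[index] += 1.0          # plain count vector, as before
    total = len(listOfIndices)
    if total == 0:
        return counts                 # empty document: avoid dividing by zero
    return [c / total for c in counts]  # normalize counts to frequencies

# A four-word document using dictionary positions 0, 2, 2 and 3.
tf = buildArray([0, 2, 2, 3], 5)
```

The only change from a count vector is the final division, so the entries of a non-empty document sum to 1 and long documents no longer dominate short ones.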
2) A few more changes are needed to move from TF to TF * IDF. Check out http://cmj4.web.rice.edu/DSDay2/Activity7Out.py. This code contains a few lines that you need to change to fully move to TF * IDF. You can find an answer at http://cmj4.web.rice.edu/DSDay2/Activity7Answer.py.
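For orientation, the step from TF to TF * IDF can be sketched in plain Python as follows. This is not the posted answer. It assumes the common definition IDF = log(number of documents / number of documents containing the word), and the helper names computeIdf and toTfIdf are made up for this example.

```python
import math

def computeIdf(tfVectors):
    """Return one IDF weight per dictionary position (hypothetical helper)."""
    numDocs = len(tfVectors)
    dictSize = len(tfVectors[0])
    idf = []
    for j in range(dictSize):
        # Document frequency: how many documents contain word j at all.
        df = sum(1 for vec in tfVectors if vec[j] > 0)
        # Assumption: a word that appears in no document gets weight 0.
        idf.append(math.log(numDocs / df) if df > 0 else 0.0)
    return idf

def toTfIdf(tfVector, idf):
    """Multiply a TF vector element-wise by the IDF weights."""
    return [t * w for t, w in zip(tfVector, idf)]

# Three tiny documents over a three-word dictionary.
docs = [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0], [0.5, 0.0, 0.5]]
idf = computeIdf(docs)
tfidf = [toTfIdf(vec, idf) for vec in docs]
```

Word 0 appears in every document, so its IDF weight is log(1) = 0 and it stops contributing to any comparison between documents. Rarer words keep a positive weight.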
Now you can try out a few queries. You will probably find that the classifier works much better now.
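When trying queries, keep in mind that the query string has to be converted to a TF * IDF vector using the same IDF weights as the training documents. The sketch below shows the kNN voting step on vectors that are already in that form. It assumes squared Euclidean distance and a simple majority vote; the activity code may use a different similarity measure, and the classify function and toy labels are hypothetical.

```python
from collections import Counter

def classify(queryVec, labeledVecs, k):
    """Return the majority label among the k vectors nearest to queryVec.

    labeledVecs is a list of (label, vector) pairs (assumed layout).
    """
    def sqDist(a, b):
        # Squared Euclidean distance; ranks neighbors the same as the true distance.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(labeledVecs, key=lambda pair: sqDist(queryVec, pair[1]))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

# Toy two-word dictionary with made-up labels and weights.
training = [
    ("hockey", [0.9, 0.0]),
    ("hockey", [0.8, 0.1]),
    ("space", [0.0, 0.9]),
    ("space", [0.1, 0.8]),
]
predicted = classify([0.7, 0.0], training, 3)
```

Here the three nearest neighbors are the two "hockey" vectors and one "space" vector, so the vote goes to "hockey".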