Data Science Boot Camp Spark Activity: Building a kNN Classifier Using TF * IDF

1) The last classifier we built had some problems—see the poor result on the hockey query. 

We will try to fix this by moving to TF-IDF. First, take a look at this code. It is exactly the same as the Activity 5 code, but the buildArray function has had its body removed. Your task is to rewrite it so that it builds a TF vector rather than a count vector.
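
To make the difference between a count vector and a TF vector concrete, here is a minimal sketch in plain Python. It is not the activity's answer: the function name build_tf_vector, its parameters, and the assumption that a document arrives as a list of dictionary positions (one entry per word occurrence) are all hypothetical. The idea is only that each count is divided by the total number of words in the document.

```python
def build_tf_vector(word_positions, vocab_size):
    """Return a term-frequency vector of length vocab_size.

    word_positions is assumed to hold one dictionary index per word
    occurrence in the document (a hypothetical input format).
    """
    # Start from an ordinary count vector.
    counts = [0.0] * vocab_size
    for pos in word_positions:
        counts[pos] += 1.0

    # Normalize by document length so the entries sum to 1.
    total = len(word_positions)
    if total == 0:
        return counts  # empty document: leave the vector all zeros
    return [c / total for c in counts]


# A four-word document over a five-word dictionary.
tf = build_tf_vector([0, 2, 2, 3], 5)
print(tf)  # [0.25, 0.0, 0.5, 0.25, 0.0]
```

Because of the normalization, a long document and a short document about the same topic end up with similar vectors, which a raw count vector does not guarantee.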

After you make the change, you can classify a few strings with the new code to see whether it works better. Before you do, consider leaving the Spark shell, restarting it, and re-executing your code, to make sure that the new function is actually used. You can see my answer at

2) A few more changes are needed to move from TF to TF * IDF. Check out this code, which contains a few lines that you need to change to fully move to TF * IDF. You can find an answer at
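
As a rough illustration of what the move from TF to TF * IDF involves, the sketch below computes an IDF weight for each dictionary word and multiplies it into each TF vector. It is plain Python rather than Spark code, the names compute_idf and to_tfidf are hypothetical, and it assumes one common IDF definition, log(number of documents / number of documents containing the word); the activity's own code may use a different variant.

```python
import math


def compute_idf(tf_vectors):
    """Return one IDF weight per dictionary word.

    Assumes IDF = log(N / df), where N is the number of documents and
    df is the number of documents in which the word appears.
    """
    n_docs = len(tf_vectors)
    vocab_size = len(tf_vectors[0])
    idf = []
    for j in range(vocab_size):
        df = sum(1 for v in tf_vectors if v[j] > 0)
        # A word that appears in no document gets weight 0.
        idf.append(math.log(n_docs / df) if df > 0 else 0.0)
    return idf


def to_tfidf(tf_vector, idf):
    """Multiply a TF vector elementwise by the IDF weights."""
    return [t * w for t, w in zip(tf_vector, idf)]


# Three tiny documents over a three-word dictionary.
docs = [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0], [0.5, 0.0, 0.5]]
idf = compute_idf(docs)
tfidf_docs = [to_tfidf(d, idf) for d in docs]
```

In this toy corpus the first word appears in every document, so its IDF is log(3/3) = 0 and it contributes nothing to any TF * IDF vector. That is the intended effect: words that appear everywhere stop dominating the distance computation.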

Now you can try out a few queries. You will probably find that the classifier works much better now.