The Last Exercise: A KNN Classifier

1. Download the zip archive for this exercise from here.

2. Unzip it on your local machine, which will create a KNN directory. Inside is a "KNN.jar" file, which is my implementation of the KNN classifier. FTP the jar over to your master node.

3. Log on to your master node. Presumably you still have the "vectors" file sitting around from your KMeans work. We are going to prepare training data and testing data by holding out 50 newsgroup articles (lines 10000 through 10049 of "vectors") for testing and using all of the remaining lines for training. Type:

tail -n +10000 vectors | head -n 50 > KNNTestingData

head -n 9999 vectors > temp1

tail -n +10050 vectors > temp2 

cat temp1 temp2 > KNNTrainingData

hadoop dfs -mkdir /KNNTraining

hadoop dfs -mkdir /KNNTesting

hadoop dfs -copyFromLocal KNNTestingData /KNNTesting

hadoop dfs -copyFromLocal KNNTrainingData /KNNTraining
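To make the line arithmetic in the head/tail commands explicit, here is a minimal Java sketch of the same split. The class name "HoldoutSplit" and its fields are hypothetical (nothing like it ships in the KNN directory), and it assumes one article per line, as in the "vectors" file. With a first test line of 10000 and a count of 50 it would select the same lines as the commands above; the demo in main uses a 20-line stand-in instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of KNN.jar: mirrors the head/tail split in plain Java.
class HoldoutSplit {
    final List<String> training = new ArrayList<>();
    final List<String> testing = new ArrayList<>();

    // firstTestLine is 1-indexed, like the argument to "tail -n +K".
    HoldoutSplit(List<String> lines, int firstTestLine, int testCount) {
        for (int i = 0; i < lines.size(); i++) {
            int lineNumber = i + 1;
            boolean isTest = lineNumber >= firstTestLine
                    && lineNumber < firstTestLine + testCount;
            // Every line lands in exactly one of the two sets.
            (isTest ? testing : training).add(lines.get(i));
        }
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            lines.add("vector" + i);
        }
        // Scaled-down stand-in for the real split (first test line 10000, 50 test lines).
        HoldoutSplit split = new HoldoutSplit(lines, 8, 3);
        System.out.println("testing: " + split.testing);
        System.out.println("training size: " + split.training.size());
    }
}
```

The point of the two temp files is the same as the single loop here: the testing lines are removed from the training set, so the classifier is never tested on an article it was trained on.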

4. Now you can run my KNN classifier implementation:

hadoop jar KNN.jar KNN /KNNTraining /KNNTesting /KNNOutput 5 6

The "5" is the number of nearest neighbors to use during classification, and the "6" is the number of reducers.
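For intuition about what the number of nearest neighbors controls, here is a small single-machine sketch of KNN classification in plain Java. It is not the code inside KNN.jar: the class and method names are made up for illustration, the vectors are dense arrays, Euclidean distance is an assumption (this handout does not say which metric the jar uses), and the label is chosen by simple majority vote among the k closest training points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; names and distance metric are assumptions, not taken from KNN.jar.
class KnnSketch {
    // One labeled training point.
    static final class Labeled {
        final String label;
        final double[] features;

        Labeled(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the most common label among the k training points closest to the query.
    static String classify(List<Labeled> training, double[] query, int k) {
        List<Labeled> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble((Labeled p) -> distance(p.features, query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Labeled p : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(p.label, 1, Integer::sum);
            // Strict ">" means a tie goes to the label that reached the count first.
            if (best == null || count > votes.get(best)) {
                best = p.label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Labeled> training = Arrays.asList(
                new Labeled("sci.space", new double[] {0, 0}),
                new Labeled("sci.space", new double[] {0, 1}),
                new Labeled("sci.space", new double[] {1, 0}),
                new Labeled("rec.autos", new double[] {5, 5}),
                new Labeled("rec.autos", new double[] {5, 6}));
        System.out.println(classify(training, new double[] {0.5, 0.5}, 3));
    }
}
```

A larger k smooths out the effect of a single mislabeled or unusual neighbor, at the cost of letting more distant points vote.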

6. The output is written to the "/KNNOutput" directory in HDFS. To look at it, type:

hadoop dfs -copyToLocal /KNNOutput .

cd KNNOutput

more part*

Then keep pressing the space bar to page through the output. You will see the name of each file that was classified, along with its classification. The results are likely to be quite good!

7. Now it is your turn to implement this. Back on your home machine, go into the KNN directory and add all of the ".java" files to a project, along with "Hadoop.jar" from your first WordCount exercise. Again, your task is to fill in "" and "". As discussed, a big difference from what we did with KMeans is that the keys and values sent from the mapper to the reducer are now user-defined: both are of type "RecordKey", which holds two things, a String object (the "key") and a Double object (the "distance"). You will see a lot of new files containing code that tells Hadoop how to serialize, deserialize, and compare RecordKey objects, but from a programming standpoint, the new file that you most need to become familiar with is "". If you get stuck, you can look at my "" and "".
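To give a feel for what "serialize, deserialize, and compare" involves, here is a rough, JDK-only sketch of a type shaped like RecordKey. It is a stand-in rather than the class in the KNN directory: the name "RecordKeySketch" is invented, it uses a primitive double where the real class is described as holding a Double, and the ordering (by key, then by ascending distance) is an assumption. The write and readFields methods use the same signatures as Hadoop's Writable interface, but the sketch deliberately does not implement that interface so that it compiles without Hadoop.jar.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative stand-in for RecordKey; the class shipped with the exercise may differ.
class RecordKeySketch implements Comparable<RecordKeySketch> {
    String key = "";
    double distance = 0.0;

    // Hadoop creates key and value objects reflectively, so such types
    // need a no-argument constructor.
    RecordKeySketch() { }

    RecordKeySketch(String key, double distance) {
        this.key = key;
        this.distance = distance;
    }

    // Serialize both fields, in the order readFields expects them.
    void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeDouble(distance);
    }

    // Deserialize into this object, overwriting its current contents.
    void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        distance = in.readDouble();
    }

    // Assumed ordering: by key first, then by ascending distance.
    @Override
    public int compareTo(RecordKeySketch other) {
        int byKey = key.compareTo(other.key);
        return byKey != 0 ? byKey : Double.compare(distance, other.distance);
    }

    // Round trip through a byte array, to check that write and readFields agree.
    static RecordKeySketch roundTrip(RecordKeySketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            RecordKeySketch copy = new RecordKeySketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RecordKeySketch copy = roundTrip(new RecordKeySketch("article-1", 0.25));
        System.out.println(copy.key + " " + copy.distance);
    }
}
```

If the real comparison also puts the distance second, records for the same key would reach a reducer nearest-first, which is convenient when only the closest few are needed; check the provided comparison code rather than relying on this guess.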

Good luck!