Getting Started On WordCount Over MapReduce

1. Download "WordCount.zip" from here onto your machine (not onto your cluster!). 

2. Unzip "WordCount.zip". This will create a directory called "WordCount" that has four files in it: 

(a) "Hadoop.jar". This contains all of the compiled ".java" files for one of the commonly-used Hadoop distributions. If you want to see documentation for any part of the API contained in Hadoop.jar (for example, the "Mapper" class) the best thing to do is to run a Google search using the name of the class for which you want documentation, plus the terms "Hadoop" and "1.0.1" (this is the version number that we are using). 

(b) "WordCount.java". This class is used to set up the MapReduce word count computation.  

(c) "WordCountMapper.java". This is the class that is used to perform the Map part of the MapReduce for the word count computation. This class is nearly empty.

(d) "WordCountReducer.java". This is the class that is used to perform the Reduce part of the MapReduce for the word count computation. This class is also nearly empty.

3. Create a project containing these four files, using your favorite Java development environment. Make sure "Hadoop.jar" is on the project's build path, so that the Hadoop classes resolve when you compile.

4. Edit "WordCountMapper.java" and "WordCountReducer.java", adding the code necessary to make them perform the full word count computation.
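
If you want a starting point, here is a minimal sketch of one common way to fill in the two classes. It assumes the driver uses TextInputFormat, so the mapper's input keys are LongWritable byte offsets into the file and its values are single lines of text; if the provided driver is configured differently, adjust the type parameters to match.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Break the input line into tokens and emit a (word, 1) pair for each one.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

And a matching reducer, which just sums the counts emitted for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the partial counts for this word and emit the total.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}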

5. After you have written your code, compile it, create a ".jar" archive containing the three ".class" files produced by the compilation, and then FTP (or SCP) the ".jar" file over to the master node of your cluster.
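
From inside the "WordCount" directory, the compile/package/copy steps might look like the following (the "classes" directory name and the user and host names in the scp command are placeholders for your own setup):

mkdir classes
javac -classpath Hadoop.jar -d classes WordCount.java WordCountMapper.java WordCountReducer.java
jar cf WordCount.jar -C classes .
scp WordCount.jar yourUserName@yourMasterNode:~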

6. SSH into the master node, and then run your WordCount implementation:

hadoop jar WordCount.jar WordCount -r 16 /txt /out

Here "/txt" is the HDFS directory holding the input text, "/out" is the HDFS directory where the results will be written, and "-r 16" asks the driver for 16 reduce tasks. Note that the output directory must not already exist, or Hadoop will refuse to run the job.

If you get stuck and you absolutely have no idea what is wrong, you can check out my implementations of "WordCountMapper.java" and "WordCountReducer.java".

7. Once you get the WordCount code working, extend the code so that after the WordCount job completes, it asks the user how many of the most frequent words he/she would like to see. The user then enters a number (we'll call it k), and your program fires up a second MapReduce computation that uses the result of the first to compute the k most frequent words in the corpus. One possible shape for that second job is sketched below.
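
This is only one of several reasonable designs, and everything in it (the class names TopKMapper and TopKReducer, and the configuration key "topk.k") is made up for illustration. The idea: each output line of the first job has the form word<TAB>count, so the second job's mapper can swap each pair to (count, word); a custom comparator sorts the counts in descending order, everything is funneled through a single reducer, and that reducer stops writing once it has emitted k words.

// In TopKMapper.java:
public class TopKMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each line of the first job's output is "word<TAB>count"; swap it to (count, word).
    String[] parts = value.toString().split("\t");
    if (parts.length == 2) {
      context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
  }
}

// In TopKReducer.java:
public class TopKReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
  private int k;
  private int emitted = 0;

  @Override
  protected void setup(Context context) {
    // The driver passes k in through the job configuration ("topk.k" is a made-up key).
    k = context.getConfiguration().getInt("topk.k", 10);
  }

  @Override
  protected void reduce(IntWritable count, Iterable<Text> words, Context context)
      throws IOException, InterruptedException {
    // Keys arrive in descending count order, so the first k words we see are the answer.
    for (Text word : words) {
      if (emitted++ >= k) {
        return;
      }
      context.write(word, count);
    }
  }
}

In the second job's driver, the pieces that differ from an ordinary job setup (beyond the usual input/output paths and key/value classes) would be roughly:

job.setMapperClass(TopKMapper.class);
job.setReducerClass(TopKReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(Text.class);
job.setSortComparatorClass(DescendingIntComparator.class);  // sort counts high-to-low
job.setNumReduceTasks(1);                                   // one reducer sees the global order
job.getConfiguration().setInt("topk.k", k);                 // pass the user's k to the reducer

// ...where DescendingIntComparator reverses the usual IntWritable ordering:
public static class DescendingIntComparator extends IntWritable.Comparator {
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }
}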