Getting Started With Amazon EC2 and Hadoop

1. Create a new directory on your machine called "Hadoop"

2. Go to the web address indicated in class, and log in using the given credentials

3. Click EC2 in the upper left corner

4. Click "Launch Instance" to start up a machine that will reside in the cloud

5. First you need to choose a machine image (AMI) to run; this is the software that your machine will boot with. Click "Community AMIs" at the left, type "simsql" into the search box, and select "SimSQL-EC2 - ami-2f023b46".

6. Now you need to choose a machine type to rent. Scroll down and choose c3.2xlarge. 

7. Then at the top of the screen, click “Tag Instance”. In the “Value” field, give your machine a name that is very unlikely to be used by anyone else in the class. Then click Review and Launch.

8. Click Launch. This will give you a chance to create a key pair that will allow you to connect to your machine. Change "Choose an existing key pair" to "Create a new key pair". Choose a name that is unlikely to be picked by anyone else in the class, and then click "Download Key Pair". This will give you a ".pem" file, which contains the private key that will allow you to connect to an Amazon machine. Copy it to your Hadoop directory. Then, after clicking the "I acknowledge…" box, click Launch Instances.

9. Let's take a look at your machine. Click “View Instances”. You will see instances created by other students in the class, as well as your own (look for the machine with the name that you came up with). 

10. Now you are going to connect to the machine that you are renting from Amazon and upload some data using a secure FTP program. The first thing you need to do is to make it possible to have a secure connection, so you need to set up public key encryption. The instructions for the rest of this step depend upon whether you are using Windows or Mac/Linux.

Mac/Linux. In Mac or Linux, from your Hadoop directory, just copy your .pem file (the one that you downloaded when you created a new key pair) to "~/.ssh/id_rsa" (backing the old file up first, if it exists). The following assumes that your .pem file is called MyFirstKeyPair.pem; replace this with the actual name of your file (an alternative that leaves your existing id_rsa untouched is sketched after these commands):

cp ~/.ssh/id_rsa id_rsa.bup

cp MyFirstKeyPair.pem ~/.ssh/id_rsa

chmod 500 ~/.ssh/id_rsa

When you are done with all of this, you can restore your old id_rsa as follows:

cp id_rsa.bup ~/.ssh/id_rsa
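Alternatively, if you would rather not touch ~/.ssh/id_rsa at all, a sketch of an equivalent approach (not required for this lab) is to restrict the permissions on your .pem file and pass it directly to sftp and ssh with the -i flag when you connect in the later steps. The DNS shown here is the example used in step 12; substitute your own file name and your own machine's DNS:

chmod 400 MyFirstKeyPair.pem

sftp -i MyFirstKeyPair.pem ubuntu@ec2-23-22-211-211.compute-1.amazonaws.com

ssh -i MyFirstKeyPair.pem ubuntu@ec2-23-22-211-211.compute-1.amazonaws.com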

Windows. In Windows, first fire up PuTTYgen. Click "Load" and then, in the file type drop-down menu, choose "All Files". Then select "MyFirstKeyPair.pem" (your .pem file will have a different name, depending upon what you called your key pair). Then choose "Save private key" and save your file as "MyFirstKeyPair" in your Hadoop directory (again, use the name that you chose above; PuTTYgen will add a .ppk extension to the file you are saving), answering "Yes" when asked whether to save the file without a passphrase.

11. Download the "HadoopSetupFiles.zip" archive from this link and unzip it into your Hadoop directory.

12. Now, you are going to transfer over all of the files and bring up a command prompt. Using the EC2 console, find the "Public DNS" for the machine that you started up. It will be something like "ec2-23-22-211-211.compute-1.amazonaws.com". This is the machine you are going to upload your files to.  

The instructions for the rest of this step depend upon whether you are using Windows or Mac/Linux.

Mac/Linux. In Mac or Linux, from your Hadoop directory, type the following to fire up the secure FTP program:

sftp ubuntu@ec2-23-22-211-211.compute-1.amazonaws.com

Note: you need to use the DNS of your very own machine, and you might be prompted by sftp to verify that you should actually connect (type "yes"). Also note that you prepend the DNS with the user name "ubuntu". Then type the following to move all of the files that you need over to the master machine: 

put Holmes.txt

put dictionary.txt

put war.txt

put william.txt

put WordCount.jar

exit
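As an aside, if you prefer a single command to the interactive sftp session above, scp can copy all five files in one shot (a sketch; run it from your Hadoop directory and substitute the DNS of your own machine):

scp Holmes.txt dictionary.txt war.txt william.txt WordCount.jar ubuntu@ec2-23-22-211-211.compute-1.amazonaws.com:~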

Now you are going to connect to the master to set it up. Using ssh, connect to the machine and obtain a command prompt. Type:

ssh ubuntu@ec2-23-22-211-211.compute-1.amazonaws.com

Again, use the DNS for your master, not mine!

Windows. Now we'll connect to your master machine to transfer the necessary files over. Fire up WinSCP. Enter the Public DNS address for your master machine (in our example it was "ec2-23-22-211-211.compute-1.amazonaws.com") and the user name "ubuntu", and then select the private key file created using PuTTYgen (this should be "MyFirstKeyPair.ppk" in your Hadoop directory). WinSCP will connect to the master, and you can use WinSCP's graphical user interface to transfer files to it. The following files should be transferred over:

Holmes.txt

dictionary.txt

war.txt

william.txt

WordCount.jar

Next, fire up PuTTY. This will allow you to connect to your Amazon machine via SSH. In the left-hand side of the dialog that comes up, click "Connection", then "SSH", then "Auth", and click "Browse" to select the private key file that you created above using PuTTYgen. Then go back to the "Session" category, enter your master's Public DNS as the host name, and click "Open"; log in as "ubuntu" when prompted.

13. At this point you should be connected to the master via SSH and should have a command prompt on the master. At this prompt, you can start up Hadoop using a script that comes with the SimSQL-EC2 machine image:

ubuntu@ip-10-69-86-199:~$ sudo ./cluster_setup 2

This will create a Hadoop cluster with two slave machines, or workers (these machines will be rented from Amazon automatically by the script). When the script asks for your AWS Access Key ID, type in the value that we gave you. Likewise, do the same for the secret access key. Leave the next two questions blank (just accept the defaults). You must now wait a few minutes for the two machines to be created, and for Hadoop to be started on them.

If you type:

ubuntu@ip-10-69-86-199:~$ ./cluster_display

you should see something like:

Number of slave machines: 2

Instance type: c3.2xlarge

Number of CPUs per node: 8

Est. memory for Hadoop processes: 1671 MB


Master instance ID: i-a8700d01

Master internal hostname: ip-10-69-86-199

Master public hostname: ec2-54-211-85-202.compute-1.amazonaws.com


Ganglia monitor web interface: http://ec2-54-211-85-202.compute-1.amazonaws.com/ganglia/

Hadoop JobTracker web interface: http://ec2-54-211-85-202.compute-1.amazonaws.com:50030/


Slave nodes:


Instance ID | Private IP      | Public DNS

-------------------------------------------------------------------------------

i-fc5b2655  | 10.164.173.173  | ec2-54-159-151-216.compute-1.amazonaws.com

i-c05b2669  | 10.168.74.254   | ec2-54-204-134-193.compute-1.amazonaws.com


(Invoke me with the -v flag to obtain additional information!)
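If you want an additional sanity check that HDFS itself can see both workers, the standard HDFS report command should also work on this image (this is optional; cluster_display is sufficient):

hdfs dfsadmin -report

The report lists the DataNodes that have joined the cluster; you should see two of them, one per slave machine.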

14. Now we will run a Hadoop job that counts the number of distinct words in A Concise Dictionary of Middle English, The Adventures of Sherlock Holmes, War and Peace, and the Complete Works of William Shakespeare. On the master, create a directory for the data and load it into HDFS using:

hdfs dfs -mkdir /txt

hdfs dfs -copyFromLocal *.txt /txt

15. Make sure the data are there:

hdfs dfs -ls /txt

You should see the four .txt files.

16. Now you can run the WordCount job as follows:

hadoop jar WordCount.jar WordCount -r 16 /txt /out

This runs the WordCount class in WordCount.jar using 16 reducers, processing all of the text in /txt and writing the results to /out. Note that you can check the progress of your job on the web (both while it is running and after it completes) at the URL:

http://ec2-23-22-19-188.compute-1.amazonaws.com:50030

Note that you need to use the DNS address of YOUR master, and not mine!
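If you would rather check on the job from the command prompt instead of the web interface, the standard Hadoop job listing should also work (it may print a deprecation warning on newer Hadoop versions):

hadoop job -list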

17. To see the output files, type:

hdfs dfs -ls /out

To copy them to your home directory on the master so you can look at them, type:

hdfs dfs -copyToLocal /out/* .

You can then look at one of the files using a program such as more:

more part-r-00005
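Finally, if WordCount.jar produces the usual one-line-per-word, word-and-count output that word-count jobs typically emit (an assumption; we have not looked inside the jar), you can tally the total number of distinct words across all of the books directly from HDFS:

hdfs dfs -cat /out/part-r-* | wc -l

Each of the 16 reducers writes one part-r-* file, so concatenating all of them and counting lines gives the total number of distinct words.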