Data Science Boot Camp: Creating a Spark Cluster and Running Word Count

1) Go to the designated sign-in page to log onto Amazon’s AWS website.

2) Sign in with the user name and password provided.

3) Next, you will need to create a “key pair” that will allow you to connect securely to the cluster that you create. To do this, under "All services" and then “Compute”, click EC2.

4) Click “key pairs”.

5) Click “Create Key Pair”.

6) Pick a key pair name that is likely unique to you (such as the name of your eldest child, or your last name, so that it is unlikely that any other people here will choose it). Type it in, and click “Create”.

7) This should create a “.pem” file that you can download. You will subsequently use this .pem file to connect securely to a machine over the internet.

8) Now it is time to create your Spark cluster. Again click "Services" at the upper left of the dashboard and under “Analytics” click “EMR”.

9) Click “Create cluster” then click "Go to advanced options".

10) Unclick all applications except for Spark 2.4.3 and Hadoop 2.8.5, which is what we'll be using. Click "Next".

11) Under “instance type”, choose one “m3.xlarge” instance for your Master, and two “m3.xlarge” instances for your Core (this should be the deault). This will give you three machines in the cloud of the type “m3.xlarge” to perform your computations. If you are interested, you can find a list of all instance types at https://aws.amazon.com/ec2/instance-types/. Each m3.xlarge machine has 8 CPU cores and 15GB of RAM, along with two SSD drives. Click "Next".

12) Click "Next". Now you are on the "General Options" screen; choose a cluster name that you will remember and that is not generic. Click "Next".

13) Now you are on "Security Options"; choose the KeyPair that you created. This is important: if you do not do this, you won’t be able to access your cluster.

14) Click “Create Cluster”. Now your machines are being provisioned in the cloud! You will be taken to a page that will display your cluster status. It will take a bit of time for your cluster to come online. You can check the status of your cluster by clicking the little circular update arrow at the top right.

15) One your cluster is up and running, you will want to connect to the master node so that you run run Spark jobs on it. Under “Hardware”, you will see a list of “Instance Groups”… you will have two types of instance groups in your cluster… a “core” group (the workers) and the “master” group, which contains the machine that you will interact with. Click on the ID associated with the “master” group, and you will see a clickable link under “EC2 Instance Id”. This is your master machine. Click on this link, and you will be taken to the EC2 dashboard where it will give you all sorts of info about your master node (if you ever want to get back to your cluster, just click on the cube at the upper left of the console, which gets you back to the main menu; click EMR, and then choose your cluster and you will be back to your cluster once again). The thing that we are really interested in is the public IP. This will be a number such as 54.172.82.0.

16) Now that you have created your cluster, and identified the public IP of your master node, it is time to connect to the node and run a Spark job! To connect to your master node:

Mac/Linux. The following assumes that your .pem file is called MyFirstKeyPair.pem and that it is located in your working directory; replace this with the actual name and location of your file, assuming that you called your key pair something else. Type:

chmod 500 MyFirstKeyPair.pem

Now, you can connect to your master machine (replace “54.172.82.0” with the IP address of your own master machine):

ssh -i "MyFirstKeyPair.pem" hadoop@54.172.82.0

This will give you a Linux prompt; you are connected to your master node.

Windows. In Windows, we’ll assume that you are using the PuTTY suite of tools. First fire up PuTTYgen. Click "Load" and then in the file type drop-down menu, choose "all files". Then select "MyFirstKeyPair.pem" (your .pem file will have a different name, depending upon what you called your key pair). Then choose "save" and save your file as "MyFirstKeyPair" in an appropriate directory, where you can find it (again, use the name that you chose above; PuTTYgen will add a .ppk extension to the file you are saving) and "yes" to choose to save the file without paraphrase.

Next, fire up PuTTY. This will allow you to connect to your Amazon machine via SSH. In the left-hand side of the dialog that comes up, click "Connection" then "ssh" then "auth" and then click on "Browse" to select the private key file that you created above using PuTTYgen. Connect ,

17) Now, whether or not you are using Windows or Mac/Linux, you will have a Linux prompt to your master node. It is time to run a Spark job! At the Linux command prompt, type pyspark. This will open up a Python shell that is connected to your Spark cluster.

18) In the shell, copy and past the following, simple function, that uses Spark to count the top 200 most frequently occurring words in a text file:

def countWords (fileName):

textfile = sc.textFile(fileName)

lines = textfile.flatMap(lambda line: line.split(" "))

counts = lines.map (lambda word: (word, 1))

aggregatedCounts = counts.reduceByKey (lambda a, b: a + b)

return aggregatedCounts.top (200, key=lambda p : p[1])

Now that you’ve got that function all set, check out the following four files, which contain a bunch of text data that I’ve loaded onto Amazon’s S3 cloud storage service:

https://s3.amazonaws.com/chrisjermainebucket/text/Holmes.txt

https://s3.amazonaws.com/chrisjermainebucket/text/dictionary.txt

https://s3.amazonaws.com/chrisjermainebucket/text/war.txt

https://s3.amazonaws.com/chrisjermainebucket/text/william.txt

I’ll let you guess what these files contain! You can count the top 200 words in any of them by simply typing, at the command prompt, for example:

countWords ("s3://chrisjermainebucket/text/Holmes.txt")

Or, if you want to count all of them at the same time:

countWords ("s3://chrisjermainebucket/text/")

19) Type control-d to exit the shell when you are done.

20) Finally, here is a little assignment for you: Change the countWords program so that any words less than length 2 are filtered away (not counted) and so that all words are converted into lower case before counting. Hint: in Python, the length of a word w is computed using len(w), and the function that returns the lower case version of a word is w.lower (). Scroll down to see a possible answer….

def countWords (fileName):

textfile = sc.textFile(fileName)

lines = textfile.flatMap(lambda line: line.split(" "))

lines = lines.filter (lambda word: True if len (word) > 1 else False)

counts = lines.map (lambda word: (word.lower (), 1))

aggregatedCounts = counts.reduceByKey (lambda a, b: a + b)

return aggregatedCounts.top (200, key=lambda p : p[1])