Using Amazon EC2, it is possible to get started with SimSQL in just a few minutes. Below we give a step-by-step guide to getting started with SimSQL on EC2.

Background.

EC2 is a cloud computing service offered by Amazon, which allows you to rent and use virtual computers run by Amazon on demand. You can choose from a library of machine images when you fire up an Amazon virtual machine. We have created a special EC2 machine image that comes pre-loaded with all of the tools that SimSQL needs to run (including Hadoop, SWI-Prolog, GSL, and so on). To get started with SimSQL, all you have to do is rent a virtual machine from Amazon, choose the right machine image, and you are basically ready to go. You can bypass all of the semi-challenging issues associated with installing these tools correctly and getting them to run.

Step-By-Step Guide.

1. Preliminaries.

A. Amazon EC2. If you have not already done so, point your web browser to aws.amazon.com and create an account with Amazon AWS so that you can use EC2. Note that you'll need to enter your credit card information so that you can rent a machine.

B. Amazon Access Key. To get your compute cluster going, you will need your Amazon access key ID and your Amazon secret access key. If you do not know those, go to the AWS Management console. In the upper-right hand corner, click your name and choose "Security Credentials". Choose the "Access Keys" option, and create a new access key. Record both the access key ID and the secret access key.

C. SSH. You will also need to be able to SSH into (connect to) the EC2 machine that you rent from Amazon. To use SSH, you will first need to have an SSH key pair so that you can connect safely to your machine. If you do not have one, you can create one by clicking "Key Pairs" from the Amazon EC2 console. Download the .pem file that you create to your machine (we'll assume that you called it MyFirstKeyPair.pem).

Next you will need to make sure that you can run SSH from your machine.

Mac/Linux. On Mac or Linux, SSH is already installed. In your working directory, just copy MyFirstKeyPair.pem to ~/.ssh/id_rsa (backing the old file up first, if it exists):

cp ~/.ssh/id_rsa id_rsa.bup

cp MyFirstKeyPair.pem ~/.ssh/id_rsa

chmod 600 ~/.ssh/id_rsa

Later, if you want to, you can restore your old id_rsa as follows:

cp id_rsa.bup ~/.ssh/id_rsa
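The back-up-and-copy steps above can also be wrapped in a small helper. This is only a sketch: the function name install_key and its optional second argument (the SSH directory, defaulting to ~/.ssh) are made up for illustration, and unlike the commands above it keeps the backup next to the key as id_rsa.bup rather than in your working directory.

```shell
# Hypothetical helper (not part of SimSQL or AWS): install an EC2 .pem file
# as the default SSH identity, backing up any existing id_rsa first.
# Usage: install_key KEY_FILE [SSH_DIR]
install_key() {
    key_file="$1"
    ssh_dir="${2:-$HOME/.ssh}"

    if [ ! -f "$key_file" ]; then
        echo "install_key: no such key file: $key_file" >&2
        return 1
    fi

    mkdir -p "$ssh_dir" || return 1

    # Keep a copy of the old key so that it can be restored later.
    if [ -e "$ssh_dir/id_rsa" ]; then
        cp -pf "$ssh_dir/id_rsa" "$ssh_dir/id_rsa.bup" || return 1
    fi

    # -f lets cp replace an existing read-only id_rsa.
    cp -f "$key_file" "$ssh_dir/id_rsa" || return 1

    # ssh refuses private keys that other users can read.
    chmod 600 "$ssh_dir/id_rsa"
}
```

You would call it as "install_key MyFirstKeyPair.pem". If you would rather not touch ~/.ssh/id_rsa at all, ssh also accepts the key file directly through its -i option.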

Windows. In Windows, you'll need to download putty.exe and puttygen.exe to your machine (do a Google search and you can find many copies of these executables on the Web). Once you do this, fire up PuTTYgen. Click "Load" and then, in the file type drop-down menu, choose "all files". Then select MyFirstKeyPair.pem. Then choose "save" and save your file as MyFirstKeyPair in an appropriate directory (PuTTYgen will add a .ppk extension to the file you are saving), and choose "yes" to save the file without a passphrase. Now you'll be able to SSH into the Amazon machine that you create via PuTTY.

2. Starting Up an EC2 Machine.

If you've got an EC2 account, you have SSH set up on your Windows/Mac/Linux machine, and you know your Amazon access keys, you are ready to start SimSQL.

First, log onto Amazon. From the EC2 console, click "Launch Instance". You'll want to start up our special EC2 machine image, which comes all ready to run SimSQL. To do this, click "Community AMIs", and search for "SimSQL-EC2" (the search interface is a bit strange; you may need to click "64-bit" first or else the image won't come up).

Choose the SimSQL machine image. You'll need to select a type of machine that you'll rent from Amazon. Each has its own pros and cons. To select a machine type, select "All Instance Types" on the left. Select an Amazon machine type (we typically use m2.4xlarge, which costs around $1.60 per hour, per machine). Click "Review and Launch", then click "Launch", taking care to select the correct key pair after you click "Launch". Click "View Instances" to go to the EC2 dashboard, where you should see that your machine has started up.

Be aware that until you shut this machine down, you are paying for it. When you are done using EC2, you should be super careful to shut down any instances that you don't want, or else you will continue to accrue charges. The $$ can add up fast!
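To get a feel for how quickly the charges accumulate, here is a rough back-of-the-envelope calculation. It is a sketch only: estimate_cost is a made-up helper, and the default rate is simply the approximate $1.60 per hour quoted above for m2.4xlarge, which may not match Amazon's current pricing.

```shell
# Hypothetical helper: estimate an EC2 bill as machines x hours x hourly rate.
# Usage: estimate_cost NUM_MACHINES HOURS [RATE_PER_HOUR]
estimate_cost() {
    machines="$1"
    hours="$2"
    rate="${3:-1.60}"   # approximate m2.4xlarge rate from the text; an assumption
    awk -v m="$machines" -v h="$hours" -v r="$rate" \
        'BEGIN { printf "%.2f\n", m * h * r }'
}

# Two machines left running for a full 24-hour day:
estimate_cost 2 24    # prints 76.80
```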

3. Downloading SimSQL and Starting a Hadoop Cluster.

Select the instance that you started in the EC2 console, and then find its public DNS. This will be something like "ec2-123-45-67-89.compute-1.amazonaws.com". SSH into this machine, using the user name ubuntu (so in Mac/Linux, you'd use the command "ssh ubuntu@ec2-123-45-67-89.compute-1.amazonaws.com"; on Windows you'll use PuTTY to connect).
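If you connect often, it can help to build the SSH command from the public DNS name. The helper below is a hypothetical sketch: ec2_ssh_command is not a standard tool, and the host-name check is just a sanity test based on the example name above. The optional second argument uses ssh's -i option to name the key file explicitly, which is an alternative to copying the key to ~/.ssh/id_rsa.

```shell
# Hypothetical helper: print the ssh command for an EC2 public DNS name.
# Usage: ec2_ssh_command PUBLIC_DNS [KEY_FILE]
ec2_ssh_command() {
    host="$1"
    key_file="${2:-}"

    # Sanity check only; the pattern is an assumption based on the example name.
    case "$host" in
        ec2-*.amazonaws.com) ;;
        *)
            echo "ec2_ssh_command: '$host' does not look like an EC2 public DNS name" >&2
            return 1
            ;;
    esac

    if [ -n "$key_file" ]; then
        # -i selects the private key explicitly instead of relying on ~/.ssh/id_rsa.
        printf 'ssh -i %s ubuntu@%s\n' "$key_file" "$host"
    else
        printf 'ssh ubuntu@%s\n' "$host"
    fi
}
```

For example, "ec2_ssh_command ec2-123-45-67-89.compute-1.amazonaws.com MyFirstKeyPair.pem" prints a command that you can paste into your terminal.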

Once you are on the machine, you need to start up a Hadoop cluster:

ubuntu@ec2-123-45-67-89$ sudo ./cluster_setup 2

(Replace the "2" with the number of machines that you want to rent. Be aware that you will be paying for these machines!) Then you'll see:

AWS Access Key ID [None]: <enter in your access key here>

AWS Secret Access Key [None]: <enter in your secret access key here>

Default region name [us-east-1]: <leave this blank>

Default output format [json]: <leave this blank>

It might take five minutes for the cluster to start up. The script will finish once all of the machines you have asked for are running. It will then exit, telling you the location of your Hadoop JobTracker (you can point your browser there in order to watch what is happening as you run SimSQL computations).

Remember that you are paying for each machine that you start up. So shut them down when you no longer want to pay! You can easily shut down the Hadoop cluster that you just started by typing, from the command line:

ubuntu@ec2-123-45-67-89$ sudo ./cluster_shutdown

This will not, however, shut down the first node that you started via the EC2 console. You will have to do that by going back to the console. Also note that you can show the nodes in your cluster easily:

ubuntu@ec2-123-45-67-89$ sudo ./cluster_display

You are now ready to run SimSQL.

4. Starting Up SimSQL.

Begin by getting the SimSQL code:

ubuntu@ec2-123-45-67-89$ wget http://cmj4.web.rice.edu/SimSQL/v0.3-cdh4.tar.gz

This is version 0.3 of the system, current as of March 31, 2014. Unpack it:

ubuntu@ec2-123-45-67-89$ gunzip v0.3-cdh4.tar.gz

ubuntu@ec2-123-45-67-89$ tar xvf v0.3-cdh4.tar
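The two unpacking commands can also be combined into one, since tar can decompress gzip archives itself when given the z flag. A minimal sketch follows; unpack_release and its optional destination-directory argument are hypothetical names used only for illustration.

```shell
# Hypothetical helper: unpack a .tar.gz release in one step.
# Usage: unpack_release ARCHIVE [DEST_DIR]
unpack_release() {
    archive="$1"
    dest="${2:-.}"

    if [ ! -f "$archive" ]; then
        echo "unpack_release: archive not found: $archive" >&2
        return 1
    fi

    # x = extract, z = decompress gzip, f = read from the named file;
    # -C unpacks into the destination directory.
    tar -xzf "$archive" -C "$dest"
}
```

Running "unpack_release v0.3-cdh4.tar.gz" in your home directory should leave you with the same v0.3-cdh4 directory as the gunzip and tar commands above.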

Now compile SimSQL:

ubuntu@ec2-123-45-67-89$ cd v0.3-cdh4

ubuntu@ec2-123-45-67-89$ ant

A bunch of stuff will happen as SimSQL builds. This will take around a minute. We are now ready to run SimSQL.

ubuntu@ec2-123-45-67-89$ cd

ubuntu@ec2-123-45-67-89$ hadoop jar v0.3-cdh4/simsql.jar

Then you'll see:

Welcome to SimSQL v0.3!

Could not find the configuration directory ".simsql". Please choose one of the following options:

[1] Create a new configuration.

[2] Restore an existing configuration.

[3] Quit.

> 1 <typed by you!>

Creating a new catalog. OK!

Creating a new set of runtime parameter settings. Please choose one of the following options:

[1] Set up these parameters (requires answering some questions).

[2] Use the default factory settings (might not be optimal).

[3] Quit.

(These parameters can be changed later!)

> 2 <typed by you!>

Loading default regular functions........................... OK!

Loading default VG functions............................ OK!

Creating a new physical database.

In what Hadoop directory do you want to store all of your database files? Foo <typed by you!>

Now let's make sure that SimSQL is configured correctly. Type:

SimSQL> show params

parameter name | current value

---------------------------------------------------------------------------

numCPUs | 8

memoryPerCPUInMB | 1000

numIterations | 1

optIterations | 150

optPlans | 1

debug | false

You should set the first two parameters so that they match the machine and cluster type you are using. If you started up a Hadoop cluster with two m2.4xlarge machines, then you have 16 CPUs in all (each instance has 8 CPUs) and around 135GB of RAM. Reserving two GB of RAM for each machine leaves 131GB of RAM to use on those 16 CPUs, or roughly 8.2GB (8200MB) per CPU:

SimSQL> set numCPUs 16 <typed by you!>

SimSQL> set memoryPerCPUInMB 8200 <typed by you!>
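The arithmetic behind these two settings can be sketched as a small shell function, which is handy if you use a different number or type of machines. The name simsql_params and its argument order are hypothetical, and the calculation assumes 1GB = 1000MB. Because shell integer division rounds down, it prints 8187 for the example above, which the text rounds to 8200.

```shell
# Hypothetical helper: derive the two SimSQL settings from the cluster size.
# Usage: simsql_params NUM_MACHINES CPUS_PER_MACHINE TOTAL_RAM_GB RESERVED_GB_PER_MACHINE
simsql_params() {
    machines="$1"
    cpus_per_machine="$2"
    total_ram_gb="$3"
    reserved_gb="$4"

    num_cpus=$(( machines * cpus_per_machine ))
    usable_gb=$(( total_ram_gb - machines * reserved_gb ))
    # Assumes 1 GB = 1000 MB; integer division rounds down.
    mem_per_cpu_mb=$(( usable_gb * 1000 / num_cpus ))

    echo "set numCPUs $num_cpus"
    echo "set memoryPerCPUInMB $mem_per_cpu_mb"
}

# Two m2.4xlarge machines: 8 CPUs each, about 135 GB in total, 2 GB reserved per machine.
simsql_params 2 8 135 2
# prints:
#   set numCPUs 16
#   set memoryPerCPUInMB 8187
```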

At this point, you can begin loading data and running queries.