Using SimSQL as a Database System. 


At its core, SimSQL is an SQL-based parallel relational database system, designed for analytics, that runs on top of Hadoop. In this regard, it is similar to Apache Hive, though SimSQL is more of a "true" RDBMS in that it has a full-fledged compiler and does both logical optimization (re-ordering of joins, for example) and physical optimization (exploiting data sort orders and pipelining, for example).


To demonstrate how SimSQL works, we'll create the "SimSQL Flower Club" database, which stores information about an imaginary club, whose members spend their time trying to find wildflowers. There are three tables in this database: features (which describe the various geographical areas where the club looks for flowers), flowers (which describes all of the flowers that the club's members are interested in seeing) and sightings (which lists all of the times that the club's members have seen sighted flowers). 


Begin by downloading the flowers database.

ubuntu@ec2-123-45-67-89$ wget http://cmj4.web.rice.edu/SimSQL/flowers.tar

ubuntu@ec2-123-45-67-89$ tar xvf flowers.tar


This will give you a bunch of text files that you can load into SimQL. Fire up the system and at the SimSQL prompt, type:


SimSQL> CREATE TABLE features (loc_id INTEGER, location CHAR(30), class CHAR(30), latitude INTEGER, longitude INTEGER, map CHAR(30), elev INTEGER,  PRIMARY KEY (location));


Now, load up the data:


SimSQL> LOAD features FROM features.txt SORT BY location;

Done! Loaded 6811 raw bytes.


You can display the the contents of the table as follows:


SimSQL> display features;

... location        | class           | latitude ...

... ---------------------------------------------...

... KERO-TV         | Tower           | 352714   ...   

... Sawtooth Peak   | Summit          | 354924   ...

... King Solomon... | Ridge           | 352937   ...

... Lone Star Mine  | Mine            | 353024   ...

... Inmans          | Populated Place | 352542   ...

...


Now we'll create and load up the three other tables:


SimSQL> CREATE TABLE flowers (flow_id INTEGER, genus CHAR(30), species CHAR(30), comname CHAR(30), PRIMARY KEY (comname));

SimSQL> LOAD flowers FROM flowers.txt SORT BY comname;

Done! Loaded 4772 raw bytes.


SimSQL> CREATE TABLE sightings (sight_id INTEGER, name CHAR(30), person CHAR(30), location CHAR(30), sighted CHAR(30), PRIMARY KEY (name, person, location, sighted));

SimSQL> LOAD sightings FROM sightings.txt SORT BY name;

Done! Loaded 53474 raw bytes.


SimSQL> CREATE TABLE people (person_id INTEGER, person CHAR(30), PRIMARY KEY (person_id));

SimSQL> LOAD people FROM people.txt SORT BY person;


At this point, we can ask some queries. For example, we compute the total number of flowers that have been sighted at each class of geographical location:


SimSQL> SELECT f.class, COUNT(*) AS number

   > FROM features AS f, sightings AS s

   > WHERE f.location = s.location

   > GROUP BY f.class;

Invoking the optimizer...

Executing the plan. Number of operations to go is 2 1 (running Aggregate as Hadoop job...) 


mcdb_number_2                        | mcdb_f_class_2                       

---------------------------------------------------------------

6                                    | Populated Place                      

103                                  | Locale                               

2                                    | Tower                                

69                                   | Range                                

17                                   | Spring                               

20                                   | Gap                                  

40                                   | Flat                                 

114                                  | Summit                               

5                                    | Ridge                                

50                                   | Mine   


Or, we can ask which map has had the highest number of sightings of flowers in the genus 'Fremontodendron' or the genus 'Zigadenus':


SimSQL> CREATE VIEW sightings_per_map (number, map) AS

   > SELECT COUNT(*), f.map

   > FROM features AS f, sightings AS s, flowers AS fl

   > WHERE f.location = s.location AND fl.comname = s.name AND

   >         (fl.genus = 'Fremontodendron' or 

   >          fl.genus = 'Zigadenus')

   > GROUP BY f.map;


SimSQL> SELECT m.map

   > FROM sightings_per_map AS m

   > WHERE m.number = (SELECT MAX (number) 

                       FROM sightings_per_map);

Invoking the optimizer...

Executing the plan. Number of operations to go is 5 4 3 (running Aggregate as Hadoop job...) 

2 (running Aggregate as Hadoop job...) 

1 (running Join as Hadoop job...) 


mcdb_f_map_12                             

---------------------------------------

Claraville  


Or, we can ask who has seen a flower on every 'Summit' on the 'Sawmill Mountain' map, except for 'Cerro Noroeste':


SimSQL> SELECT p.person

   > FROM people AS p

   > WHERE NOT EXISTS (

   >   SELECT *

   >   FROM features AS f

   >   WHERE f.map = 'Sawmill Mountain' AND f.class = 'Summit' AND 

   >     f.location <> 'Cerro Noroeste' and NOT EXISTS  (

   >     SELECT *

   >     FROM sightings AS s

   >     WHERE s.person = p.person AND 

   >       s.location = f.location));

Invoking the optimizer...

Executing the plan. Number of operations to go is...

5->4 (running Join as Hadoop job...)

3 (running Join as Hadoop job...)

2 (running Join as Hadoop job...)

1 (running Join as Hadoop job...)

mcdb_s_person_3                                                                                                         

------------------------------------------------------------------

Sandra                                                                                                                  

In general, SimSQL supports a pretty good subset of the declarative (non-procedural) part of SQL. The most notable things that are absent at this time are set-based operations (such as UNION) and outer joins.