High-Performance Computing Clusters (HPCC) and Cassandra on OS X

Our new parent company, LexisNexis, has one of the world’s largest public records databases:

“…our comprehensive collection of more than 46 billion records from more than 10,000 diverse sources—including public, private, regulated, and derived data. You get comprehensive information on approximately 269 million individuals and 277 million unique businesses.”

http://www.lexisnexis.com/en-us/products/public-records.page

And they’ve been managing, analyzing and searching this database for decades.  Over that time period, they’ve built up quite an assortment of “Big Data” technologies.  Collectively, LexisNexis refers to those technologies as their High-Performance Computing Cluster (HPCC) platform.

HPCC is entirely open source.

Naturally, we are working through the marriage of HPCC with our real-time data management and analytics stack.  The potential is really exciting.  Specifically, HPCC has sophisticated machine learning and statistics libraries, and a query engine (Roxie) capable of serving up those statistics.

Lo and behold, HPCC can use Cassandra as a backend storage mechanism! (FTW!)

The HPCC platform isn’t technically supported on a Mac, but here is what I did to get it running:

HPCC Install

  • Clone the GitHub repository and its submodules (git submodule update --init --recursive)
  • Pull my patches (https://github.com/hpcc-systems/HPCC-Platform/pull/7166)
  • Install the dependencies using brew
    brew install icu4c
    brew install boost
    brew install libarchive
    brew install bison27
    brew install openldap
    brew install nodejs
  • Make a build directory, and run cmake from there:
    export CC=/usr/bin/clang
    export CXX=/usr/bin/clang++
    cmake ../ -DICU_LIBRARIES=/usr/local/opt/icu4c/lib/libicuuc.dylib -DICU_INCLUDE_DIR=/usr/local/opt/icu4c/include -DLIBARCHIVE_INCLUDE_DIR=/usr/local/opt/libarchive/include -DLIBARCHIVE_LIBRARIES=/usr/local/opt/libarchive/lib/libarchive.dylib -DBOOST_REGEX_LIBRARIES=/usr/local/opt/boost/lib -DBOOST_REGEX_INCLUDE_DIR=/usr/local/opt/boost/include  -DUSE_OPENLDAP=true -DOPENLDAP_INCLUDE_DIR=/usr/local/opt/openldap/include -DOPENLDAP_LIBRARIES=/usr/local/opt/openldap/lib/libldap_r.dylib -DCLIENTTOOLS_ONLY=false -DPLATFORM=true
  • Then, compile and install with (sudo make install)
  • After that, you’ll need to muck with the permissions a bit:
    chmod -R a+rwx /opt/HPCCSystems/
    chmod -R a+rwx /var/lock/HPCCSystems
    chmod -R a+rwx /var/log/HPCCSystems
  • Ordinarily you would run hpcc-init to get the system configured, but that script fails on OS X. I used Linux to generate config files that work and posted them to a repository here: https://github.com/boneill42/hpcc_on_mac
  • Clone this repository and replace /var/lib/HPCCSystems with the content of var_lib_hpccsystems.zip
    sudo rm -fr /var/lib/HPCCSystems
    sudo unzip var_lib_hpccsystems.zip -d /var/lib
    chmod -R a+rwx /var/lib/HPCCSystems
  • Then, from the directory containing the xml files in this repository, you can run:
    • daserver: (Runs the Dali server, which is the persistence mechanism for HPCC)
    • esp: (Runs the ESP server, which is the web services and UI layer for HPCC)
    • eclccserver: (Runs the ECL compile server, which takes the ECL and compiles it down to C++ and then a dynamic library)
    • roxie: (Runs the Roxie server, which is capable of responding to queries)
  • Kick off each one of those, then go to http://localhost:8010 in a browser.  You are ready to run some ECL!
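
The cmake line in the build step above is long enough that a typo is easy to miss. As a rough sketch (not part of the HPCC build system), the same flags can be generated from a single Homebrew prefix. The function name hpcc_cmake_flags and the BREW_OPT variable are made up for illustration; the flags themselves are copied from the command above, and /usr/local/opt is assumed to be where brew links its packages.

```shell
#!/bin/sh
# Sketch: print the cmake flags from the build step, one per line.
# BREW_OPT is a hypothetical variable; /usr/local/opt matches the paths used above.
BREW_OPT="${BREW_OPT:-/usr/local/opt}"

# hpcc_cmake_flags PREFIX: emit each -D flag with PREFIX substituted in.
hpcc_cmake_flags() {
  opt="$1"
  printf '%s\n' \
    "-DICU_LIBRARIES=${opt}/icu4c/lib/libicuuc.dylib" \
    "-DICU_INCLUDE_DIR=${opt}/icu4c/include" \
    "-DLIBARCHIVE_INCLUDE_DIR=${opt}/libarchive/include" \
    "-DLIBARCHIVE_LIBRARIES=${opt}/libarchive/lib/libarchive.dylib" \
    "-DBOOST_REGEX_LIBRARIES=${opt}/boost/lib" \
    "-DBOOST_REGEX_INCLUDE_DIR=${opt}/boost/include" \
    "-DUSE_OPENLDAP=true" \
    "-DOPENLDAP_INCLUDE_DIR=${opt}/openldap/include" \
    "-DOPENLDAP_LIBRARIES=${opt}/openldap/lib/libldap_r.dylib" \
    "-DCLIENTTOOLS_ONLY=false" \
    "-DPLATFORM=true"
}

# Print the flags so they can be reviewed before building.
hpcc_cmake_flags "$BREW_OPT"
```

To build, run cmake ../ $(hpcc_cmake_flags "$BREW_OPT") from the build directory. The unquoted substitution relies on none of the paths containing spaces.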

Running ECL

Like Pig with Hadoop, HPCC runs a DSL called ECL.  More information on ECL can be found here: http://hpccsystems.com/download/docs/learning-ecl

  • As a simple smoke test, go into your HPCC-Platform repository and change to ./testing/regress/ecl.
  • Then, run the following:
    ecl run hello.ecl --target roxie --server=localhost:8010
  • You should see the following:
    <dataset name="Result 1">
    <row><result_1>Hello world</result_1></row>
    </dataset>
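
If you want to script that check rather than eyeball the XML, something like the following should work. It is only a sketch: expect_result is a hypothetical helper name, and it assumes the output keeps the <result_1> element shown above.

```shell
#!/bin/sh
# expect_result VALUE: read `ecl run` output on stdin and succeed only if it
# contains a <result_1> element holding exactly VALUE (hypothetical helper).
expect_result() {
  grep -qF "<result_1>$1</result_1>"
}

# Demo against the output shown above. With HPCC running, pipe the real command instead:
#   ecl run hello.ecl --target roxie --server=localhost:8010 | expect_result 'Hello world'
sample='<dataset name="Result 1">
<row><result_1>Hello world</result_1></row>
</dataset>'

if printf '%s\n' "$sample" | expect_result 'Hello world'; then
  echo "smoke test passed"
else
  echo "smoke test FAILED" >&2
fi
```

The match is a fixed-string grep, so it will not tolerate extra whitespace inside the element.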

Cassandra Plugin

With HPCC up and running, we are ready to have some fun with Cassandra.  HPCC supports plugins, which reside in /opt/HPCC/plugins.  I had to copy those libraries into /opt/HPCCSystems/lib to get HPCC to recognize them.
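
For what it's worth, the copy can be scripted along these lines. This is a sketch, not anything that ships with HPCC: copy_plugins is a made-up helper, the two paths in the usage comment are the ones mentioned above, and it simply copies every regular file it finds, which assumes the plugins directory holds nothing but the plugin libraries.

```shell
#!/bin/sh
# copy_plugins SRC DST: copy every regular file in SRC into DST (hypothetical helper).
copy_plugins() {
  cp_src="$1"
  cp_dst="$2"
  if [ ! -d "$cp_src" ]; then
    echo "no such directory: $cp_src" >&2
    return 1
  fi
  mkdir -p "$cp_dst" || return 1
  for cp_lib in "$cp_src"/*; do
    [ -f "$cp_lib" ] || continue          # skip subdirectories and an unmatched glob
    cp -p "$cp_lib" "$cp_dst"/ || return 1
    echo "copied $(basename "$cp_lib")"
  done
}

# Usage with the paths from the text (likely needs sudo after `sudo make install`):
#   copy_plugins /opt/HPCC/plugins /opt/HPCCSystems/lib
```

Restart the HPCC processes afterwards so they pick up the copied libraries.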

Go back to the testing/regress/ecl directory and have a look at cassandra-simple.ecl. A snippet is shown below:

childrec := RECORD
   string name,
   integer4 value { default(99999) },
   boolean boolval { default(true) },
   real8 r8 {default(99.99)},
   real4 r4 {default(999.99)},
   DATA d {default (D'999999')},
   DECIMAL10_2 ddd {default(9.99)},
   UTF8 u1 {default(U'9999 ß')},
   UNICODE u2 {default(U'9999 ßßßß')},
   STRING a,
   SET OF STRING set1,
   SET OF INTEGER4 list1,
   LINKCOUNTED DICTIONARY(maprec) map1{linkcounted};
END;

init := DATASET([{'name1', 1, true, 1.2, 3.4, D'aa55aa55', 1234567.89, U'Straße', U'Straße','Ascii',['one','two','two','three'],[5,4,4,3],[{'a'=>'apple'},{'b'=>'banana'}]},
                 {'name2', 2, false, 5.6, 7.8, D'00', -1234567.89, U'là', U'là','Ascii', [],[],[]}], childrec);

load(dataset(childrec) values) := EMBED(cassandra : user('boneill'),keyspace('test'),batch('unlogged'))
  INSERT INTO tbl1 (name, value, boolval, r8, r4,d,ddd,u1,u2,a,set1,list1,map1) values (?,?,?,?,?,?,?,?,?,?,?,?,?);
ENDEMBED;

In this example, we define childrec as a RECORD with a set of fields. We then create a DATASET of type childrec. Then we define a function that takes a dataset of type childrec and runs the Cassandra INSERT statement for each record in the dataset.

Start up Cassandra locally: download Cassandra, unzip it, then run bin/cassandra -f (the -f flag keeps it in the foreground).

Once Cassandra is up, simply run the ECL like you did the hello program.

ecl run cassandra-simple.ecl --target roxie --server=localhost:8010

You can then go over to cqlsh and validate that all the data made it into Cassandra:

➜  cassandra  bin/cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh> select * from test.tbl1 limit 5;

 name      | a | boolval | d              | ddd  | list1 | map1 | r4     | r8     | set1 | u1     | u2        | value
-----------+---+---------+----------------+------+-------+------+--------+--------+------+--------+-----------+-------
  name1575 |   |    True | 0x393939393939 | 9.99 |  null | null | 1576.6 |   1575 | null | 9999 ß | 9999 ßßßß |  1575
  name3859 |   |    True | 0x393939393939 | 9.99 |  null | null | 3862.9 |   3859 | null | 9999 ß | 9999 ßßßß |  3859
 name11043 |   |    True | 0x393939393939 | 9.99 |  null | null |  11054 |  11043 | null | 9999 ß | 9999 ßßßß | 11043
  name3215 |   |    True | 0x393939393939 | 9.99 |  null | null | 3218.2 |   3215 | null | 9999 ß | 9999 ßßßß |  3215
  name7608 |   |   False | 0x393939393939 | 9.99 |  null | null | 7615.6 | 7608.1 | null | 9999 ß | 9999 ßßßß |  7608

OK, that should give a little taste of ECL and HPCC. It is a powerful platform.

As always, let me know if you run into any trouble.
