
Connecting Pentaho Data Integration to MapR Using Apache Drill

Pentaho Data Integration (PDI) provides the ETL capabilities that facilitate capturing, cleansing, and storing data in a uniform and consistent format, making it accessible and relevant to end users and IoT technologies.

Apache Drill is a schema-free SQL-on-Hadoop engine that lets you run SQL queries against data sets in a variety of formats, such as JSON, CSV, Parquet, and HBase. Integrating Drill with PDI gives you the flexibility to do serious data integration work on all of that data. The Drill Tutorials pages in MapR’s documentation can help you get familiar with Apache Drill.

Step 1

You’ll need administrator permissions to complete these steps. Make sure that you meet the following software requirements:

  • MapR Converged Data Platform (version 4.x or 5.x)
  • Apache Drill (version 1.6 or later) along with the latest Drill JDBC driver
  • Apache ZooKeeper (running in replicated mode)
  • Pentaho Data Integration (version 5.4 or later)

Before you get started, you should also make sure that the PDI client system can resolve the hostnames of the nodes in your Drill cluster.
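
If you want a quick way to check name resolution before you start, the sketch below is one option. It’s just a standalone check, and the hostnames (mapr1, mapr2, mapr3) are example values; substitute the actual node names from your cluster.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Quick check that the PDI client can resolve the Drill/ZooKeeper node names.
// The hostnames below are examples; substitute your own cluster's node names.
public class ResolveCheck {
    public static void main(String[] args) {
        String[] hosts = {"mapr1", "mapr2", "mapr3"};
        for (String host : hosts) {
            try {
                InetAddress addr = InetAddress.getByName(host);
                System.out.println(host + " resolves to " + addr.getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println(host + " did not resolve -- check DNS or the client's hosts file");
            }
        }
    }
}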

Step 2

The first thing you’ll have to do is get the Drill cluster ID and use it to construct a custom URL string. You’ll use that URL a bit later to make the JDBC connection through PDI.

  1. Getting the Drill cluster ID isn’t too bad: all you have to do is go to the query page in the Drill interface and run an SQL query like this:

select string_val from sys.boot where name = 'drill.exec.cluster-id';

  2. Apache Drill will return your cluster ID. Once you have that, you can build a custom URL that looks something like this:

jdbc:drill:zk=mapr1:5181,mapr2:5181,mapr3:5181/drill/WhateverYourClusterIDIs
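
Before you fire up PDI, it can be worth confirming that the URL works at all. The sketch below is one way to do that with the Drill JDBC driver directly; it assumes the driver jar is on your classpath and reuses the example ZooKeeper hosts and cluster ID from the URL above, so substitute your own values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Standalone smoke test for the custom URL, outside of PDI.
// Assumes the Drill JDBC driver jar is on the classpath; the URL uses the
// example ZooKeeper hosts and cluster ID from this article -- substitute your own.
public class DrillUrlCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:drill:zk=mapr1:5181,mapr2:5181,mapr3:5181/drill/WhateverYourClusterIDIs";
        Class.forName("org.apache.drill.jdbc.Driver");  // the Drill JDBC driver class
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select string_val from sys.boot where name = 'drill.exec.cluster-id'")) {
            while (rs.next()) {
                System.out.println("Connected; cluster ID is " + rs.getString("string_val"));
            }
        }
    }
}

If this connects and prints the same cluster ID you saw in the Drill query page, the URL is ready to use in PDI.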

Step 3

Once you have your custom URL string, follow these steps to create the connection in PDI:

  1. Open PDI, start a new Transformation, then click the View tab on the far left.
  2. Expand Transformation 1, then right-click on Database connections and select New.

  3. Do the following in the Database Connection window:
    1. Name the connection. We are using Drill as the connection name in this example.
    2. Select Generic Database for your database type and Native (JDBC) for your access type.
    3. Under Settings on the right, paste the custom URL string you built in Step 2 into the Custom Connection URL field.
    4. Enter the Custom Driver Class Name for the Drill JDBC driver (org.apache.drill.jdbc.Driver).
    5. Leave the username and password fields empty for now.

  4. Click Test to verify the connection. You should see a connection success window pop up.

Troubleshooting Tips

On the off-chance that your connection test doesn’t work, verify that your custom URL string is correct and make sure the hosts file on the PDI client can resolve the private hostnames of the cluster.

Summary

By the time you get to the end of this process, you should have successfully connected your Pentaho Data Integration client to your MapR cluster using Apache Drill. Have fun with your data!

Sandra Wagner

Sandra Wagner (“the Goddess of Best Practices”) leads a small team at Pentaho, on a mission to translate crazy engineering baffle-speak into actual English words and phrases. She especially loves to delete superfluous content. Sandra has 10 years of experience as a technical writer/editor. Prior to working for Pentaho, she worked for Lockheed Martin (Simulation, Training, & Support) documenting some super cool stuff. She holds a B.A. in Technical Communication from the University of Central Florida.