
Building Apache Zeppelin for MapR using Spark under YARN

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Spark SQL, Scala, Hive, Flink, Kylin and more. Zeppelin enables rapid development of Spark and Hadoop workflows with simple, easy visualizations. The code from Zeppelin can be used in the Zeppelin notebooks or compiled and packaged into complete applications.

As of the current master branch (and release candidate), all the MapR build profiles are included in the Apache Zeppelin repository. Four profiles (mapr3, mapr40, mapr41, and mapr50) build Zeppelin with the appropriate MapR dependencies.

This blog provides instructions for building Zeppelin with the MapR profiles. Support for building the Hive interpreter against MapR is included, but the MapR dependencies are commented out in the Hive interpreter's pom.xml file.

Some assumptions

  1. A cluster with MapR 4.0.x/5.x and Apache Spark (1.2.x, 1.3.x or 1.4.x) running under YARN
  2. The ability to edit a couple of text files
  3. A decent browser
  4. A machine (cluster node or edge node) to run the Zeppelin server on. This machine needs mapr-spark and at least the MapR client installed
  5. A Git client, npm, and Maven 3.x (a quick version check follows this list)
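
A quick way to sanity-check the build tools (a minimal sketch, assuming git, npm, and mvn are on your PATH):

    git --version
    npm --version
    mvn --version

If any of these commands fail, install the missing tool before building.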

What do you need to do?

  1. Make sure you have at least the MapR client and Spark installed on your machine. Test this by running hadoop fs -ls / and launching the Spark shell (for example, for version 1.2.1):
    /opt/mapr/spark/spark-1.2.1/bin/spark-shell
  2. Find a nice directory and clone the repository:
    git clone https://github.com/apache/incubator-zeppelin zeppelin
  3. cd zeppelin
  4. Build it (for MapR 4.0.x):
    mvn clean package -Pbuild-distr -Pmapr40 -Pyarn -Pspark-1.2 -DskipTests

    (for MapR 4.1):

    mvn clean package -Pbuild-distr -Pmapr41 -Pyarn -Pspark-1.3 -DskipTests

    (for MapR 5.x):

    mvn clean package -Pbuild-distr -Pmapr50 -Pyarn -Pspark-1.3 -DskipTests
  5. The build creates a directory called zeppelin-distribution containing a runnable version of Zeppelin and a tar file. The tar file is a complete Zeppelin installation; use it for deployment.
  6. Untar zeppelin-x.x.x-incubating-SNAPSHOT.tar.gz wherever you want to run the Zeppelin server. Everything is local to that machine, so the Zeppelin server does not need to be on a MapR cluster node.
  7. Configuration … assuming you have a working MapR client and Spark installation, there is little to configure. In the zeppelin-x.x.x-incubating-SNAPSHOT/conf directory, you will need to copy zeppelin-env.sh.template to zeppelin-env.sh
  8. Edit zeppelin-env.sh … you need to export two items (a complete sketch of the file follows this list):
    a.

    export HADOOP_CONF_DIR="/opt/mapr/hadoop/hadoop-x.x.x/etc/hadoop"

    (insert the correct Hadoop version and path)
    b.

    export ZEPPELIN_JAVA_OPTS="-Dspark.executor.instances=4 -Dspark.executor.memory=2g"

    The Hadoop conf directory is where yarn-site.xml lives. The Zeppelin Java options set information about your Spark deployment; they are explained in the Spark configuration documentation.
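
Putting step 8 together, a minimal zeppelin-env.sh might look like the following sketch (replace x.x.x with your Hadoop version; the executor count and memory are example values):

    # conf/zeppelin-env.sh -- minimal sketch; adjust for your cluster
    # Directory that holds yarn-site.xml for your MapR Hadoop install
    export HADOOP_CONF_DIR="/opt/mapr/hadoop/hadoop-x.x.x/etc/hadoop"
    # Spark-on-YARN sizing: 4 executors with 2 GB each (example values)
    export ZEPPELIN_JAVA_OPTS="-Dspark.executor.instances=4 -Dspark.executor.memory=2g"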

This should be all you need to do at the command line. To start the Zeppelin server, execute:

bin/zeppelin-daemon.sh start
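
The same script manages the running server; assuming the stock zeppelin-daemon.sh, the usual subcommands are:

    bin/zeppelin-daemon.sh status    # check whether the server is running
    bin/zeppelin-daemon.sh stop      # shut the server down
    bin/zeppelin-daemon.sh restart   # stop, then start again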

Now you need to configure Zeppelin to use your Spark cluster. Point your browser to

http://<zeppelin-server-host>:8080

Click on Interpreter (top of the page), and edit the Spark section:

  • Set master to yarn-client
  • Click Save

You can configure your HiveServer2 on this page as well, if you are using one. Now click on Notebook (top of the page) and select the tutorial.
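Once you run a paragraph in the tutorial, the Spark interpreter should show up as a YARN application. One way to verify this from a machine with a working MapR client (a sketch; the application name may vary with your Zeppelin version):

    yarn application -list
    # look for a RUNNING application of type SPARK that appeared when the
    # notebook paragraph ran; it is typically named something like "Zeppelin"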

NOTES

Be aware of the port number Zeppelin runs on.

  • If you are on a node of a cluster, port 8080 will probably conflict with any number of Hadoop services.
  • In the conf directory (see steps 7 and 8) there is also a zeppelin-site.xml template. Copy it and edit it … the port number is near the top. A sketch follows.
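
For example, to move Zeppelin to a free port (a sketch; zeppelin.server.port is the property name used in the template):

    cd zeppelin-x.x.x-incubating-SNAPSHOT/conf
    cp zeppelin-site.xml.template zeppelin-site.xml
    # edit zeppelin-site.xml and change the value of zeppelin.server.port
    # (near the top) from 8080 to an unused port, e.g. 8180, then restart:
    # bin/zeppelin-daemon.sh restart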