Software Development

Top 10 Questions about Apache Spark on the MapR Data Platform

In the last few weeks, we’ve seen a lot of activity and momentum centered on Apache Spark. At the Spark Summit in San Francisco, we announced an enterprise-grade Spark distribution that runs on the MapR Platform, and we received a lot of interest at this event. Customers are flocking to Spark as their primary compute engine for big data use cases, and we received further proof of this last week when we ran an “Ask Us Anything about Spark” forum in the Converge Community. There were some great discussions that took place, where our Spark experts answered questions from customers and partners. Here is a summary of some of these discussions:

Question 1

How do I write back into MapR Streams using Spark (Java)?

I am now able to read from MapR Streams using Spark. But now I want to write back into them using Spark (and Java). There is barely any documentation available online for Scala, and there isn’t any available for Java. I did find a “sendToKafka” function mentioned in some Scala code, but the same isn’t working for Java (because it writes DStream and I am working with JavaDStream). All I am looking for is a Java doc for MapR Streams and Spark, or just a function that lets me write JavaDStream into MapR Streams, preferably using Java.

Answer 1:

  • There isn’t a direct method to send a complete DStream to Kafka. The design pattern is to call .foreachRDD() on the incoming DStream and then .foreach() over each RDD; you use the MapR Streams (Kafka) API to instantiate a Producer and (typically) call Producer.send() on each record.
  • In Java, you iterate over the records of the DStream, calling Producer.send() for each message (see the sketch after this answer).
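
A minimal Scala sketch of this pattern is shown below, hedged: it uses the plain Kafka producer API rather than the Scala-only sendToKafka helper, and the Java version with JavaDStream is analogous (foreachRDD, then iterate over each partition). The topic name and serializer settings are placeholders.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Sketch: write every record of a DStream[String] to a MapR Streams / Kafka topic.
def writeToStream(stream: DStream[String], topic: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Create one producer per partition, not one per record.
      val props = new Properties()
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      // props.put("bootstrap.servers", "broker:9092") // required for plain Apache Kafka;
      // not needed with the MapR Streams client, where the topic is "/stream-path:topic-name"
      val producer = new KafkaProducer[String, String](props)
      records.foreach(value => producer.send(new ProducerRecord[String, String](topic, value)))
      producer.close()
    }
  }
}
```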

Answer 2: At this time, there is only a Scala producer in the org.apache.spark.streaming.kafka.producer package. (Spark 1.6.1)

Question 2

We are configuring security for Spark 1.5.2, and are facing a few challenges as outlined below:

  1. None of the Spark web UIs are moving from HTTP to HTTPS (e.g., the application UI on port 4040, the Spark History Server, etc.).
  2. While launching Spark SQL and spark-shell, we are facing a number of issues with respect to SQLContext, the Hive metastore, Sentry configuration, etc.

Please provide detailed instructions/steps to be followed while enabling security for Spark. The MapR 5.1 cluster is a 3-node secure cluster with native security enabled for all components. Our Spark cluster is running Spark in YARN mode on MapR 5.1.

Answer: If the cluster has been configured as a secure cluster, there is no additional configuration you must change. The “configure.sh” command you (or the MapR installer) executed during the installation will configure YARN security for you. By extension, Spark executed with YARN will be secure as well.

Question 3

In our application, we have a large amount of “streaming data” (i.e., CSV files arriving at five-minute intervals). We want to store all the data up to some age limit and form real-time RDD views, which the visualization layer will access via Drill.

For the Spark app, what is the best storage method – a Spark DataFrame, MapR-DB table, or just a Parquet file? All are accessible to Spark and Drill, but if we are just doing regular column lookups based on a few sub keys, which one is preferable?

Answer 1: Based on the limited information in this thread, and given that your common lookups are based on a set of known columns, it would appear that you may want to store these as Parquet files. Drill is optimized for reading Parquet. Please share more about your SLAs and whether you plan to keep data in memory (for example, can it be held in memory for an extended period of time?), as that may point to using Spark DataFrames instead.

Answer 2: If you need to update your data, then HBase will be faster for updates. If you mainly want to read your data and not update it, then Parquet is optimized for fast columnar reads.
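
As a rough sketch of the Parquet route (Spark 1.x API), each incoming CSV batch can be landed as Parquet so that both Spark and Drill read the same files. The paths, the "subKey" column, and the use of the spark-csv package below are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
val sqlContext = new SQLContext(sc)

// Read the latest batch of CSV files (assumes the spark-csv package is on the classpath).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/incoming/*.csv")

// Partitioning by a lookup key keeps Drill scans narrow for key-based column lookups.
df.write
  .partitionBy("subKey")
  .parquet("/data/parquet/events")
```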

Question 4

We would like to use Spark from a standalone Java application. The Java application should generate a temporary table and start the Hive Thrift Server. Which classes should we use to connect from the Java application to Spark? SparkLauncher? Is there any other way (other than the SparkLauncher) to succeed without using spark-submit?

Answer: This will very much depend on where the Spark application will be executed. I assume the Spark application will be executed on the MapR cluster. If so, spark-submit is the method to do this for Java applications. Java applications are not as flexible in how they are executed on a Spark cluster.
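
For completeness, the SparkLauncher class mentioned in the question (org.apache.spark.launcher.SparkLauncher, available since Spark 1.4) is essentially a programmatic wrapper that forks a spark-submit process, so it still relies on a local Spark client install. A minimal, hedged Scala sketch with placeholder paths and class names:

```scala
import org.apache.spark.launcher.SparkLauncher

object LaunchSparkApp {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setSparkHome("/opt/mapr/spark/spark-1.6.1")  // placeholder Spark client location
      .setAppResource("/path/to/my-spark-app.jar")  // placeholder application jar
      .setMainClass("com.example.ThriftServerApp")  // placeholder main class
      .setMaster("yarn-client")
      .launch()                                     // forks a spark-submit child process
    process.waitFor()                               // block until the Spark application exits
  }
}
```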

Question 5

We want to load data from an HBase table and convert it into DataFrames to perform aggregations. Since Scala’s case classes have a limit of 22 parameters, how can we create a schema when the number of columns is more than 22? Currently, we have created a Hive external table and query it using HiveContext in order to get a DataFrame. Is there any way to create a DataFrame from an RDD by directly scanning HBase?

Answer: When you have more than 22 columns, you can specify the schema programmatically. A DataFrame can then be created in three steps (see the sketch after the list).

  1. Create an RDD of Rows from the original RDD.
  2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
  3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
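
A minimal Scala sketch of these three steps (Spark 1.x API) follows; the 30-column schema, the comma-delimited input, and the in-memory stand-in for the HBase scan are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sc = new SparkContext(new SparkConf().setAppName("wide-schema").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// Stand-in for the rows scanned from HBase: comma-delimited strings with 30 fields.
val rawRdd = sc.parallelize(Seq.fill(5)((1 to 30).mkString(",")))

// Step 1: create an RDD of Rows from the original RDD.
val rowRdd = rawRdd.map(line => Row.fromSeq(line.split(",").toSeq))

// Step 2: describe the schema with a StructType; the 22-field case-class limit does not apply.
val schema = StructType((1 to 30).map(i => StructField(s"col$i", StringType, nullable = true)))

// Step 3: apply the schema to the RDD of Rows.
val df = sqlContext.createDataFrame(rowRdd, schema)
df.groupBy("col1").count().show()
```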

With the current release, you cannot create a DataFrame by scanning HBase directly; no released modules support this yet.

Question 6

When I do aggregations like sum using DataFrame, I encounter a double precision issue. For example, instead of 913.76, it returns 913.7600000000001, and instead of 6796.25 it returns 6796.249999999995. I am using BigDecimal’s setScale(2, BigDecimal.RoundingMode.HALF_UP) method to round off the value. Is there any way to handle this precision problem without applying another round-off function?

Answer: The loss of precision is related to the conversion between data types in Java. This is a common occurrence, and mathematically rounding the number is the most common solution. I would check the real data type of the field that produced the RDD, and match the DataFrame class exactly in order to reduce precision errors.
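
As a hedged illustration (Spark 1.x API), here are two common ways to keep a two-decimal sum clean: rounding the double-typed aggregate at query time, or casting to a fixed-precision decimal before aggregating so the sum is carried out in exact decimal arithmetic. The column name and sample values are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{round, sum}

val sc = new SparkContext(new SparkConf().setAppName("precision-demo").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Illustrative data; the two amounts sum to 913.76.
val df = sc.parallelize(Seq(("a", 100.05), ("b", 813.71))).toDF("id", "amount")

// Option 1: round the double-typed aggregate at query time.
df.agg(round(sum($"amount"), 2)).show()

// Option 2: cast to a fixed-precision decimal before aggregating,
// so the sum is computed in exact decimal arithmetic.
df.select($"amount".cast("decimal(18,2)").as("amount"))
  .agg(sum($"amount"))
  .show()
```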

Question 7

We have a cluster which was created a few years ago, before Spark became popular and widely used. Now we are facing a problem with limited disk space for the Spark scratch directory (we followed the MapR documentation’s suggestion to keep a small amount of disk space for the OS and give the rest to MapR-FS, which is not so good for Spark). As we have data on MapR-FS now, it’s very slow/expensive to take disks away from MapR-FS for Spark scratch/tmp. Can we use MapR-FS local volumes as scratch space on MapR Community Edition 5.1?

There is a manual on how to configure a scratch directory for Spark Standalone (MapR 5.1 Documentation). Is there any corresponding documentation for Spark on YARN?

Answer: Yes, you can use MapR-FS for Spark local directories, as documented in the 5.1 documentation. In the MapR Community Edition, this will, however, send all the scratch files through a single NFS server instance. The performance with this configuration will not be as good as with local disk directories.

Spark on YARN doesn’t use the Spark local directories; temporary space is handled by YARN. On MapR, the YARN directories are already on MapR-FS.

Question 8

I’d like to know what typical admin tasks related to Spark devops engineers may need to handle, beyond installation and configuration. Some elaboration would be appreciated!

Answer: For CI environments (Jenkins, etc.), Spark uses both Java and Scala, so having the full set of Scala development tools (including the Scala compiler) installed is required. Most Java Spark projects are built with Maven; some Scala projects may also need sbt to build.

Depending on how the developers use Spark in your cluster, you might want to think about placing some limits on how many executors and tasks each developer can execute at a time. If you are running Spark under YARN, you can use YARN queues to assist in resource management on the cluster.

Testing and test cases should probably not be run on the cluster; instead, configure Jenkins to use a “local[2]” Spark master. Once the code is debugged and running on a local Spark instance, you can test at scale on the full cluster.
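
A minimal sketch of that “local[2]” setup follows (the Spark side only; the names are illustrative, and any test framework can wrap this):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ci-smoke-test")
      .setMaster("local[2]")      // local master with two worker threads; no cluster required
    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
      assert(counts("a") == 2L)   // tiny sanity check standing in for a real test assertion
    } finally {
      sc.stop()                   // always release the local context between test runs
    }
  }
}
```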

Question 9

I’m using an edge node (MapR Client) with Spark 1.6.1 using YARN to distribute my tasks to the cluster.  For now, we are running ad hoc analytical tasks as opposed to production repeated tasks. I am using Jupyter to create and execute code, and the Apache Toree kernel to direct it to Spark. I own the config of this, so I get to choose the command to initiate Spark. I do not have direct access to the cluster, or to the MapR client install, so I have to raise a support request to do any work that modifies the MapR or Spark Client install as well as any work on the cluster.

To date, I have not used Spark packages, though I can see increasing opportunities to use things like GraphFrames and spark-csv. I understand that I have to use the --packages switch to use Spark packages. I would hope that I do not have to put in a support call each time I want to use a new package (I don’t have to when I use Python or R!).

Where do the Spark packages have to be physically located?

Answer: There are two basic methods: put them in the classpath of the Spark slaves, or bundle them into a self-contained JAR that is submitted for execution. The classpath does not necessarily have to be in /opt/mapr.

Do the Spark packages have to be on my edge node or on all nodes in the cluster?

The packages must be available to all nodes in the Spark cluster. In all cases of adding packages to a Spark cluster, spark-env.sh can be used to set almost all of the paths required to locate them.

Question 10

For one of my current clients, we are looking to offload multiple PBs worth of data. As we do require fast access to all of that data, we are looking towards an Apache Spark 2.0 implementation. Part of the work of our DWH is to keep a historical record of all records in our 50+ sources. This means that we are building up a rather large ODS with all source data over time (in an “SCD type-II”-like structure). Part of that data is then used as a source for the Data Vault, and eventual reporting.

We have two issues with this:

  1. The ODS accounts for 90% of our size and is several PB (no typo, as in thousands of TB). This is specifically where Spark 2.0 comes in, as we have to match billions of records.
  2. Following the source: we currently have 50+ source systems, with another 40+ planned. We spend most of our time following changes in the sources, even when they are not relevant for the data mart. If we can offload the data from the mainframe to a Hadoop + Spark solution, we can switch from schema-on-write to schema-on-read, which means we’ll only have to spend time on the relevant data.

Answer: We will be releasing a development preview soon. The GA will probably follow a few weeks after the community GA. The primary reason for those additional weeks is interoperability testing with the rest of the Hadoop stack: the Spark 2.0 GA itself ensures that Spark works, but interop testing can reveal issues, so we are taking the extra time to ensure everything works well together.

We trust that this is useful to you as you work with Spark. Don’t hesitate to engage with us on the Converge Community.

Sameer Nori

Sameer Nori is currently a Sr. Product Marketing Manager at MapR Technologies with responsibility for solutions and industry marketing. He has 10+ years of experience in the technology industry in marketing, pre-sales and consulting with domain experience in the business intelligence, analytics and big data markets. Some of the companies he has worked at include SAP Business Objects, MicroStrategy and Jaspersoft.