Persistent Storage for Enterprise-Grade Spark Applications

Sameer Nori, August 2nd, 2016 (Last Updated: August 2nd, 2016)


Apache Spark has become popular and widely used in the big data community. Several factors explain its rapid traction: in-memory processing, support for a range of engines covering use cases such as streaming, machine learning, and SQL, and the ability to develop in multiple languages such as Python and Scala. The interest and momentum around Spark are real. In early June, MapR announced an enterprise-grade Apache Spark distribution. The reason was straightforward: to make it easier for you to adopt Spark as the primary big data compute engine in your data architecture. Does this mean we are moving away from Hadoop/MapReduce and all the associated ecosystem tools? Absolutely not. We are simply giving customers more choice in how they start their big data journey.

MapR has led big data vendors in supporting Apache Spark for over two years, and providing a separate “Spark-only” distribution is the next step. The MapR Platform including Spark is the only reliable and production-ready platform for Spark workloads on-premises and in the cloud. You now get a converged compute and storage engine for batch and real-time processing that helps you build and deploy applications rapidly. A consistent pattern is emerging in the quest for real-time analytics across use cases such as recommendation engines, churn prediction, and IoT applications: MapR Streams delivers the event streams, Spark Streaming runs the streaming analytics, and MapR-DB stores the results.
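The three-stage pattern described above (event stream, streaming analytics, results store) can be illustrated with a minimal, library-free sketch. Everything here is a stand-in rather than a real API: a Python list plays the role of the event stream, `count_by_key` plays the role of a streaming job processing one micro-batch, and a plain dict plays the role of the results table. The event shape `(user_id, item)` is also an assumption made for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_size):
    """Split the incoming events into fixed-size micro-batches."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def count_by_key(batch):
    """Analytics step for one micro-batch: count events per user."""
    counts = defaultdict(int)
    for user_id, _item in batch:
        counts[user_id] += 1
    return counts


def run_pipeline(events, batch_size, results_table):
    """Process each micro-batch and merge its counts into the results table."""
    for batch in micro_batches(events, batch_size):
        for key, n in count_by_key(batch).items():
            results_table[key] = results_table.get(key, 0) + n
    return results_table


# Hypothetical click events: (user_id, item)
events = [("u1", "a"), ("u2", "b"), ("u1", "c"), ("u3", "d"), ("u1", "e")]
table = run_pipeline(events, batch_size=2, results_table={})
```

In a real deployment each stand-in would be replaced by the corresponding service, but the shape of the flow stays the same: ingest, aggregate per batch, then write the running results somewhere durable.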

As you are probably aware, Spark has no persistent data storage capabilities of its own. And despite its renown as a high-speed in-memory engine, it still needs cost-effective data storage for tasks where the data set cannot fit entirely in memory. There are a variety of storage mechanisms that can be used with Spark. I believe the best-suited mechanism is a distributed file system, which makes it easy to store Spark Resilient Distributed Datasets (RDDs).

When you use Spark from other Hadoop vendors, HDFS acts as the storage layer for both Hadoop and Spark data. This typically works well in development and test environments that aren’t tied to stringent SLAs. Most IT managers, however, see challenges with HDFS handling business-critical production workloads: inadequate data protection and disaster recovery capabilities, the need to move data between task-specific clusters, and the lack of true multi-tenant capabilities. The MapR Platform addresses precisely these shortcomings and has been built from the ground up to do so. With its enterprise-grade capabilities, its low TCO from use of commodity hardware, and its flexibility in easily storing a wide variety of data types, the MapR Platform including Spark should be on the short list of any organization investigating its Spark options.
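To make the point about data sets that do not fit in memory concrete, here is a toy sketch of the idea, not Spark's implementation: a store that keeps partitions in memory up to a budget and writes the rest to disk. The class `SpillingStore`, its parameters, and the file naming are all hypothetical. In PySpark itself the comparable behavior is requested with `rdd.persist(StorageLevel.MEMORY_AND_DISK)`, and an RDD is written to a file system with `rdd.saveAsTextFile(path)`.

```python
import os
import pickle
import tempfile


class SpillingStore:
    """Toy partition store: memory first, disk once the budget is used up."""

    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory  # number of partitions kept in RAM
        self.spill_dir = spill_dir          # stand-in for persistent storage
        self.memory = {}
        self.on_disk = {}

    def put(self, partition_id, records):
        # Keep the partition in memory if it is already there or there is room.
        if partition_id in self.memory or len(self.memory) < self.max_in_memory:
            self.memory[partition_id] = records
            return
        # Otherwise spill it to a file in the storage directory.
        path = os.path.join(self.spill_dir, f"part-{partition_id:05d}.pkl")
        with open(path, "wb") as f:
            pickle.dump(records, f)
        self.on_disk[partition_id] = path

    def get(self, partition_id):
        if partition_id in self.memory:
            return self.memory[partition_id]
        with open(self.on_disk[partition_id], "rb") as f:
            return pickle.load(f)


def demo():
    """Store four partitions with room for two in memory, then read all back."""
    with tempfile.TemporaryDirectory() as spill_dir:
        store = SpillingStore(max_in_memory=2, spill_dir=spill_dir)
        for pid in range(4):
            store.put(pid, list(range(pid * 3, pid * 3 + 3)))
        total = sum(sum(store.get(pid)) for pid in range(4))
        return len(store.memory), len(store.on_disk), total
```

The computation still completes when only half the partitions fit in memory, but only because there is somewhere durable to put the rest. The reliability and cost of that storage layer is the question the remainder of this article is concerned with.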

I recommend that you seek input on our Spark distribution, as well as our accompanying technology stack, from independent industry analysts who study the big data technologies on the market. For example, The Evaluator Group, a leading analyst firm, has concluded that the MapR Platform including Spark is the most reliable enterprise-grade Spark platform in the big data market. You can read their findings by downloading the white paper “Persistent Storage for Apache Spark in the Enterprise.” The paper gives an in-depth view of how the MapR Platform meets key requirements for the persistent storage layer for Spark applications.

And if you’re new to Spark and MapR, don’t forget to check out the MapR Converge Community, a public online community that anyone can join. There’s a wealth of information being exchanged in the community that makes it the hub for knowledge about big data use cases, Spark, and other related topics. Also be sure to check out our free On-Demand Training courses that cover Spark.

Reference:

Persistent Storage for Enterprise-Grade Spark Applications from our JCG partner Sameer Nori at the MapR blog.