Theodora Fragkouli

About Theodora Fragkouli

Theodora graduated from the Computer Engineering and Informatics Department at the University of Patras. She also holds a Master’s degree in Economics from the National and Technical University of Athens. During her studies she was involved in a large number of projects ranging from programming and software engineering to telecommunications, hardware design and analysis.

Apache Spark is now a top-level project

The Apache Software Foundation (ASF) happily announced that Apache Spark has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying the project’s stability.

Apache Spark is an open source cluster computing framework for fast and flexible large-scale data analysis. Spark has been the talk of the Big Data town for a while, and 2014 was predicted to be the year of Spark.

According to the Spark website home page, the engine runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. This is why Cloudera has integrated it into its Hadoop distribution, CDH (Cloudera Distribution including Apache Hadoop). Spark’s success rests not only on its speed but also on its rapid evolution since it entered the Apache Incubator in June 2013, with contributions from more than 120 developers at 25 organizations.

Spark’s creators from the University of California, Berkeley, have founded a company called Databricks to commercialize the technology. According to Ion Stoica, CEO at Databricks and Professor at UC Berkeley, the Spark project has made it much easier for organizations to get insights from big data. An open source community has now formed around the project, which can help accelerate the development and adoption of Apache Spark.

One of Spark’s features, according to the article “Apache Spark becomes top-level project”, is that it can run on Hadoop 2.0 YARN. In addition, Shark, its companion project, provides a SQL-on-Hadoop engine that is syntax-compatible with Apache Hive and claims the same 10x/100x performance gains over Hive that Spark claims over raw MapReduce.

Another feature of Spark is that it allows developers to write applications in Java, Python, or Scala. Integrated with Apache Hadoop, Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source.

Yahoo has congratulated Spark on becoming an Apache top-level project, via Andrew Feng, Distinguished Architect at Yahoo. Feng explained how Yahoo has helped in evolving Hadoop and related big-data technologies, including Spark. Yahoo has made significant contributions to the development of Spark, since Apache Hadoop is the foundation of Yahoo’s big-data platform.

Apache Spark software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the project’s day-to-day operations, including community development and product releases. Documentation and ways to become involved with Apache Spark are available on the project’s website.

As far as MapReduce is concerned, Spark seems set to take the reins as the primary processing framework for new Hadoop workloads as MapReduce fades. Spark appears well suited for next-generation big data applications that might require lower-latency queries, real-time processing or iterative computations on the same data. Spark is technically a standalone project, but it was always designed to work with the Hadoop Distributed File System.
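
To illustrate why iterative computations on the same data favor an in-memory engine, here is a minimal sketch in plain Python. It does not use the Spark API. All function names are hypothetical, and a temporary local file stands in for a dataset stored in HDFS. The first loop goes back to disk on every pass, roughly the way a chain of MapReduce jobs would, while the second loads the data once and reuses it from memory.

```python
import os
import tempfile


def write_sample(values):
    """Write one number per line to a temp file and return its path.

    The file is only a stand-in for a dataset stored in HDFS.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        for v in values:
            f.write(f"{v}\n")
    return path


def load_points(path):
    """Read the whole dataset from disk into a list of floats."""
    with open(path) as f:
        return [float(line) for line in f]


def iterate_rereading(path, steps, rate=0.5):
    """Disk-bound style: every iteration reloads its input from storage."""
    estimate, reads = 0.0, 0
    for _ in range(steps):
        points = load_points(path)  # one full scan per iteration
        reads += 1
        mean = sum(points) / len(points)
        estimate += rate * (mean - estimate)
    return estimate, reads


def iterate_cached(path, steps, rate=0.5):
    """In-memory style: load once, then iterate over the cached data."""
    points = load_points(path)  # single scan, kept in memory afterwards
    reads = 1
    estimate = 0.0
    for _ in range(steps):
        mean = sum(points) / len(points)
        estimate += rate * (mean - estimate)
    return estimate, reads
```

Both functions return the same estimate, but the first performs one disk scan per iteration and the second performs exactly one in total. On a real cluster, repeated scans mean repeated trips to distributed storage, which is the cost an in-memory engine aims to avoid.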

However, there’s still a lot of tooling for MapReduce that Spark doesn’t have yet (e.g., Pig and Cascading), and MapReduce is still quite good for certain batch jobs. Cloudera co-founder and Chief Strategy Officer Mike Olson explained that there are a lot of legacy MapReduce workloads that aren’t going anywhere anytime soon even as Spark takes off.

The topic is also on the agenda of the Structure Data conference on March 19-20 in New York, where Ion Stoica will be speaking as part of the Structure Data Awards presentation, and the CEOs of Cloudera, Hortonworks, and Pivotal will talk about the future of big data platforms and how they plan to capitalize on them.




Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
