Tag Archives: Apache Spark

Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 2)

In Part 1 we learned how to test data lineage collection with Spline from a Spark shell. The same can be done in any Scala or Java Spark application: the dependencies used for the Spark shell need to be registered in your build tool of choice (Maven, Gradle, or sbt): groupId: za.co.absa.spline, artifactId: spline-core, version: 0.3.5; groupId: za.co.absa.spline, artifactId: spline-persistence-mongo ...
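In sbt, for instance, the first coordinate from the excerpt could be declared as follows. This is only a sketch based on the coordinates quoted above; the version of spline-persistence-mongo is truncated in the excerpt, so it is left out here:

```scala
// build.sbt — Spline dependency as listed in the excerpt above
libraryDependencies += "za.co.absa.spline" %% "spline-core" % "0.3.5"
// spline-persistence-mongo is also needed; its version is elided in the excerpt
```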


Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 1)

One interesting and promising open-source project that caught my attention lately is Spline, a data lineage tracking and visualization tool for Apache Spark, maintained at Absa. The project consists of two parts: a Scala library that runs on the driver and captures the data lineages by analyzing the Spark execution plans, and a web application that provides a UI to visualize them. ...


Insights from Spark UI

As a continuation of the anatomy-of-apache-spark-job post, I will share how you can use the Spark UI for tuning a job. I will continue with the same example used in the earlier post; the new Spark application will do the following: read New York City parking tickets, aggregate by “Plate ID” and calculate the offence dates, and save the result. The DAG for this code looks like this ...
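The aggregation the job performs (group tickets by “Plate ID” and collect each plate's offence dates) can be sketched in plain Python. This is only an analogy for the Spark job, with made-up field names and sample rows standing in for the real data set:

```python
from collections import defaultdict

# Hypothetical rows standing in for the New York City parking-ticket data
tickets = [
    {"plate_id": "ABC123", "issue_date": "2017-06-14"},
    {"plate_id": "ABC123", "issue_date": "2017-07-02"},
    {"plate_id": "XYZ999", "issue_date": "2017-06-20"},
]

# Group by "Plate ID" and collect the offence dates for each plate
offences_by_plate = defaultdict(list)
for row in tickets:
    offences_by_plate[row["plate_id"]].append(row["issue_date"])

print(dict(offences_by_plate))
```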


Anatomy of Apache Spark Job

Apache Spark is a general-purpose, large-scale data processing framework. Understanding how Spark executes jobs is very important for getting the most out of it. A little recap of Spark's evaluation paradigm: Spark uses lazy evaluation, in which a Spark application does not do anything until the driver calls an “Action”. Lazy evaluation is key to all the runtime/compile-time optimizations Spark can do. Lazy ...
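As an analogy in plain Python (not Spark itself), a generator expression behaves like a transformation: it builds a lazy pipeline, and nothing runs until something plays the role of an action and consumes it:

```python
log = []

def traced_double(x):
    # Record each call so we can observe *when* evaluation happens
    log.append(x)
    return x * 2

# "Transformation": builds a lazy pipeline; nothing has been computed yet
pipeline = (traced_double(x) for x in range(3))
assert log == []  # no work has been done so far

# "Action": consuming the pipeline finally triggers the computation
result = list(pipeline)
print(result)
```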


Custom Logs in Apache Spark

Have you ever felt the frustration of a Spark job that runs for hours and then fails due to an infra issue? You find out about the failure very late and waste a couple of hours on it, and it hurts even more when the Spark UI logs are not available for a postmortem. You are not alone! In this post I will go over how ...


Apache Spark RDD and Java Streams

A few months ago, I was fortunate enough to participate in a few PoCs (proofs of concept) that used Apache Spark. There, I got the chance to use resilient distributed datasets (RDDs for short), transformations, and actions. After a few days, I realized that while Apache Spark and the JDK are very different platforms, there are similarities between RDD transformations and actions, ...
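The shape both APIs share is a chain of lazy transformations ended by an eager terminal operation. A rough plain-Python rendering of that shape (neither Spark nor the JDK, just an analogy):

```python
numbers = [1, 2, 3, 4, 5]

# Like rdd.filter(...).map(...) or stream.filter(...).map(...): a lazy pipeline
squares_of_evens = map(lambda n: n * n,
                       filter(lambda n: n % 2 == 0, numbers))

# Like rdd.reduce(...) or a terminal sum(): this forces the evaluation
total = sum(squares_of_evens)
print(total)  # 2*2 + 4*4 = 20
```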


Monitoring Real-Time Uber Data Using Spark Machine Learning, Streaming, and the Kafka API (Part 2)

This post is the second part in a series where we will build a real-time example for analysis and monitoring of Uber car GPS trip data. If you have not already read the first part of this series, you should read that first. The first post discussed creating a machine learning model using Apache Spark’s K-means algorithm to cluster Uber data based ...
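The K-means algorithm itself is simple enough to sketch in a few lines of plain Python (this is not Spark MLlib, and the point coordinates are made up to stand in for Uber GPS data):

```python
import math

def kmeans(points, centers, iterations=10):
    """A tiny K-means: assign each point to its nearest center, then re-center."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

# Two obvious blobs standing in for pickup coordinates (made-up data)
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
centers = kmeans(pts, [(0.0, 0.0), (1.0, 1.0)])
print(centers)
```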


Apache Spark: A Quick Start With Python

Spark Overview

As per the official website, “Apache Spark is a fast and general engine for large-scale data processing.” It is best used in a clustered environment, where the data processing task or job is split to run on multiple computers or nodes quickly and efficiently. It claims to run programs up to 100 times faster than the Hadoop platform. Spark uses something ...
