Tag Archives: Apache Hadoop

Apache Hadoop Tutorial – The ULTIMATE Guide (PDF Download)

EDITORIAL NOTE: Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed on the fundamental assumption that hardware failures are common and should be handled automatically by the framework. Hadoop has become the de facto tool ...

The Lord of the Things: Spark or Hadoop?

Are people in your data analytics organization contemplating the impending data avalanche from the Internet of Things and thus asking, “Spark or Hadoop?” That’s the wrong question! The Internet of Things (IoT) will generate massive quantities of data. In most cases, these will be streaming data from ubiquitous sensors and devices. Often, we will need to make real-time ...

Mesos and YARN: A tale of two clusters

This is a tale of two siloed clusters. The first cluster is an Apache Hadoop cluster. This is an island whose resources are reserved entirely for Hadoop and its processes. The second cluster is the description I give to all resources that are not a part of the Hadoop cluster. I break them up this way because Hadoop manages its ...

What Are The Advanced Apache Hadoop MapReduce Features?

Overview: Basic MapReduce programming tutorials explain the workflow, but they do not cover what actually happens inside the MapReduce framework. This article explains how data moves through the MapReduce architecture and which API calls do the actual processing. We will also discuss customization techniques and function overriding for application-specific needs. Introduction ...

Tuning Hadoop & Cassandra: Beware of vNodes, Splits and Pages

When running Hadoop jobs against Cassandra, you will want to be careful about a few parameters. Specifically, pay special attention to vNodes, splits and page sizes. vNodes were introduced in Cassandra 1.2. They allow a host to own multiple portions of the token range. This distributes data more evenly, which means nodes can share the burden of a ...

Delta Architectures: Unifying the Lambda Architecture and leveraging Storm from Hadoop/REST

Recently, I’ve been asked by a bunch of people to go into more detail on the Druid/Storm integration that I wrote for our book: Storm Blueprints for Distributed Real-time Computation. Druid is great. Storm is great. And the two together appear to solve the real-time dimensional query/aggregations problem. In fact, it looks like people are taking it mainstream, calling it ...

Running PageRank Hadoop job on AWS Elastic MapReduce

In a previous post I described an example of performing a PageRank calculation with Apache Hadoop, taken from the Mining Massive Datasets course. In that post I took an existing Hadoop job written in Java and modified it somewhat (I added unit tests and made the file paths configurable through a parameter). This post shows how to use this job on ...

Calculate PageRanks with Apache Hadoop

Currently I am following the Coursera course ‘Mining Massive Datasets’. I have been interested in MapReduce and Apache Hadoop for some time, and with this course I hope to gain more insight into when and how MapReduce can help solve real-world business problems (I described another way to do so here). This Coursera course is mainly focusing ...

Hadoop and the OpenDataPlatform

Pivotal, IBM and Hortonworks announced today the “Open Data Platform” (ODP), an attempt to standardize Hadoop. This move seems to be backed by IBM, Teradata and others that appear as sponsors on the initiative site. It has a lot of potential and a few possible downsides. ODP promises standardization – Cloudera’s Mike Olson downplays the importance of this ...

Lambda Architecture for Big Data

An increasing number of systems are being built to handle the Volume, Velocity and Variety of Big Data, in the hope of gaining new insights and making better business decisions. Here, we will look at ways to deal with Big Data’s Volume and Velocity simultaneously, within a single architecture. Volume + Velocity: Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data ...
