Tag Archives: Big Data

What are the 5 Trends for Testing in the Era of Big Data?

In today’s world of data explosion, big data applications and their implementations are growing dramatically. As data is at the heart of any big data application, it is important to understand the characteristics of big data. The three most distinctive characteristics of big data are ‘Volume’, ‘Velocity’ and ‘Variety’. This data comes in different formats from multiple channels. All ...

Running PageRank Hadoop job on AWS Elastic MapReduce

In a previous post I described an example of performing a PageRank calculation, part of the Mining Massive Datasets course, with Apache Hadoop. In that post I took an existing Hadoop job in Java and modified it somewhat (added unit tests and made the file paths configurable via a parameter). This post shows how to use this job on ...

Calculate PageRanks with Apache Hadoop

Currently I am following the Coursera course ‘Mining Massive Datasets’. I have been interested in MapReduce and Apache Hadoop for some time, and with this course I hope to gain more insight into when and how MapReduce can help solve some real-world business problems (I described another way to do so here). This Coursera course is mainly focussing ...

Even Doctors Will Be Data Scientists

We all know how it works. You walk into a doctor’s office complaining about some pain in your leg or otherwise. They take your temperature, get you on the scale, check your blood pressure, and perhaps even get out the rubber hammer. These measurements are simply snapshots at one particular instant in time and may be subject to error. This ...

How to: Refine Hive ZooKeeper Lock Manager Implementation

Hive has been using ZooKeeper as a distributed lock manager to support concurrency in HiveServer2. The ZooKeeper-based lock manager works fine in a small-scale environment. However, as more and more users move to HiveServer2 from HiveServer and start to create a large number of concurrent sessions, problems can arise. The major problem is that the number of open connections between ...

How to Analyze Highly Dynamic Datasets with Apache Drill

Today’s data is dynamic and application-driven. A new era of business applications, driven by industry trends such as web/social/mobile/IoT, is generating datasets with new data types and new data models. These applications are iterative, and the associated data models are typically semi-structured, schema-less and constantly evolving. Semi-structured where an element can be complex/nested, and schema-less with its ...

Hadoop and the OpenDataPlatform

Pivotal, IBM and Hortonworks announced today the “Open Data Platform” (ODP) – an attempt to standardize Hadoop. This move seems to be backed up by IBM, Teradata and others that appear as sponsors on the initiative site. This move has a lot of potential and a few possible downsides. ODP promises standardization – Cloudera’s Mike Olson downplays the importance of this ...

Streaming Big Data: Storm, Spark and Samza

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences. Apache Storm: In Storm, you design a graph of real-time computation called a topology, and feed it to the ...

Lambda Architecture for Big Data

An increasing number of systems are being built to handle the Volume, Velocity and Variety of Big Data, and hopefully help gain new insights and make better business decisions. Here, we will look at ways to deal with Big Data’s Volume and Velocity simultaneously, within a single architecture. Volume + Velocity: Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data ...

Open Source Cloud Formation with Minotaur for Mesos, Kafka and Hadoop

Today I am happy to announce “Minotaur”, our open source, AWS-based infrastructure for managing big data open source projects including (but not limited to) Apache Kafka, Apache Mesos and Cloudera’s Distribution of Hadoop. Minotaur is based on AWS CloudFormation. The following labs are currently supported: Apache Mesos, Apache Kafka, Apache ZooKeeper, Cloudera Hadoop ...
