Home » Tag Archives: MapReduce (page 3)

Tag Archives: MapReduce

MapReduce Questions and Answers Part 2

4 Inverting Indexing for Text Retrieval The chapter contains a lot of details about integer numbers encoding and compression. Since these topics are not directly about MapReduce, I made no questions about them. 4.4 Inverting Indexing: Revised Implementation Explain inverting index retrieval algorithm. You may assume that each document fits into the memory. Assume also then there is a huge ...

Read More »

MapReduce Questions and Answers Part 1

With all the hype and buzz surrounding NoSQL, I decided to have a look at it. I quickly found that there is not one NoSQL I could learn. Rather, there are various different solutions with different purposes and trade offs. Those various solutions tend to have one thing in common: processing of data in NoSQL storage is usually done using ...

Read More »

MapReduce for dummies

Continuing the coverage on Hadoop component, we will go through the MapReduce component. MapReduce is a concept that has been programming model of LISP. But before we jump into MapReduce, lets start with an example to understand how MapReduce works. Given a couple of sentences, write a program that counts the number of words. Now, the traditional thinking when solving ...

Read More »

Joins with Map Reduce

I have been reading on Join implementations available for Hadoop for past few days. In this post I recap some techniques I learnt during the process. The joins can be done at both Map side and Join side according to the nature of data sets of to be joined. Reduce Side Join Let’s take the following tables containing employee and ...

Read More »

Word Count MapReduce with Akka

In my ongoing workings with Akka, i recently wrote an Word count map reduce example. This example implements the Map Reduce model, which is very good fit for a scale out design approach. Flow The client system (FileReadActor) reads a text file and sends each line of text as a message to the ClientActor. The ClientActor has the reference to ...

Read More »

A SMALL cross-section of BIG Data

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set. IDC estimated the ...

Read More »

Big Data analytics with Hive and iReport

Each J.J. Abrams’ TV series Person of Interest episode starts with the following narration from Mr. Finch one of the leading characters: “You are being watched. The government has a secret system–a machine that spies on you every hour of every day. I know because…I built it.” Of course us technical people know better. It would take a huge team ...

Read More »

Hadoop: A Soft Introduction

What is Hadoop: Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFSis a highly fault-tolerant distributed file system and like Hadoop designed to be deployed on low-cost hardware. It provides high throughput access to application data and is ...

Read More »

MapReduce: A Soft Introduction

MapReduce is a parallel programming technique made popular by Google. It is used for processing very large amounts of data. Such processing can be completed in a reasonable amount of time only by distributing the work to multiple machines in parallel. Each machine processes a small subset of the data. MapReduce is a programming model that lets developers focus on ...

Read More »