Home » Software Development » HDFS for dummies

About Munish Gupta

Munish K Gupta is a senior architect working in a leading IT services company. His experience is in building Online Portals, SaaS Platforms, CRM Solutions and Transaction Processing Systems. He is author of the book - Akka Essentials.

HDFS for dummies

Whenever a newbie wants to start learning the Hadoop, the number of elements in a Hadoop stack are mind bogling and at times difficult to comprehend. I am trying to de-crypt the whole stack and help explain the basic pieces in my own way.

Before we start talking about the Hadoop Stack, let us take a step back and try to understand what led to the origins to the Hadoop.

Problem – With the prolification of the internet, the amount of data stored growing up. Lets take an example of a search engine (like Google), that needs to index the large of amount of data that is being generated. The search engine crawls and indexes the data. The index data is stored and retrieval from a single storage device. As the data generated grows, the search index data will keep on increasing.

As the number of queries to access data increase, the current file system I/O becomes inadequate to retrieve large amounts of data simuntaneously. Further, the model of one large single storage starts becoming a bottleneck. To overcome the problem, we move the file system from a single disk storage to a clustered file system. But as the amount of data keeps growing the underlying data that can go one one machine starts to become a bottleneck.

As data reaches TB’s, existing file system based solutions starts faltering. Data access, multiple writers, large file sizes soon become a problem in scaling up the system.

Solution – To overcome the problems, an distributed file system was concieved that provided solution to the above problems. The solution tackled the problem as

  • When dealing with large files, I/O becomes a big bottleneck. So, we divide the files into small blocks and store in multiple machines. [Block Storage]
  • When we need to read the file, the client sends a request to multiple machines, each machine sends a block of file which is then combined together to pierce the whole file.
  • With the advent of block storage, the data access becomes distributed and leads to a faster retrieval/write
  • As the data blocks are stored on multiple machines, it helps in removing single point of failure by having the same block on multiple machines. Meaning, if one machine goes, the client can request the block from another machine.

Now, any solution that implements file storage as blocks needs to have the following characteristics

  • Manage the meta data information – Since the file gets broken into multiple blocks, somebody needs to keep track of no of blocks and storage of these blocks on different machines [NameNode]
  • Manage the stored blocks of data and fulfill the read/write requests [DataNodes]

So, in the context of Hadoop –The NameNode is the arbitrator and repository for all metadata. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. All these component together form the Distributed File System called as HDFS (Hadoop Distributed File System).

Reference – http://hadoop.apache.org/common/docs/current/hdfs_design.html

HDFS has an inbuild redundancy and replication feature that makes sure that any failure of the machine can be dealt without any loss of data. The HDFS balances itself whenever a new data node is added to the cluster or any of the existing datanode fails.
In addition to the distributed file system called HDFS (Hadoop Distributed File System), there are 2 other core components

  • Hadoop Common – are set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
  • Hadoop MapReduce – is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes

So, effectively, when you start working with Hadoop, HDFS and Hadoop MapReduce are the first 2 things you encounter. I will cover MapReduce in subsequent posts.

Reference: HDFS for dummies from our JCG partner Munish K Gupta at the Tech Spot blog.

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*


nine + 8 =

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Want to take your Java Skills to the next level?
Grab our programming books for FREE!
  • Save time by leveraging our field-tested solutions to common problems.
  • The books cover a wide range of topics, from JPA and JUnit, to JMeter and Android.
  • Each book comes as a standalone guide (with source code provided), so that you use it as reference.
Last Step ...

Where should we send the free eBooks?

Good Work!
To download the books, please verify your email address by following the instructions found on the email we just sent you.