What is Big Data – Theory to Implementation

Byron Kiourtzoglou, April 25th, 2013 (Last Updated: May 24th, 2013)

What is Big Data, you may ask, and more importantly, why is it the latest trend in nearly every business domain? Is it just hype, or is it here to stay?

As a matter of fact, “Big Data” is a pretty straightforward term: it is just what it says, a very large data-set. How large? The exact answer is “as large as you can imagine”!

How can this data-set be so massively big? Because the data may come from everywhere and at enormous rates: RFID sensors that gather traffic data, sensors used to gather weather information, GPRS packets from cell phones, posts to social media sites, digital pictures and videos, online purchase transaction records, you name it! Big Data is an enormous data-set that may contain information from every possible source that produces data we are interested in.

Nevertheless, Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make businesses more agile, and to answer questions that were previously considered beyond our reach. That is why Big Data is characterized by four main aspects: Volume, Variety, Velocity, and Veracity (Value), known as “the four Vs of Big Data”. Let’s briefly examine what each of them stands for and what challenges it presents:

Volume

Volume refers to the amount of content a business must be able to capture, store and access. By one widely cited estimate, 90% of the world’s data has been generated in the past two years alone. Organizations today are overwhelmed with volumes of data, easily amassing terabytes, even petabytes, of information of all types, some of which needs to be organized, secured and analyzed.

Variety

80% of the world’s data is semi-structured. Sensors, smart devices and social media generate this data in the form of Web pages, weblog files, social-media forums, audio, video, click streams, e-mails, documents, sensor systems and so on. Traditional analytics solutions work very well with structured information, for example data in a relational database with a well-formed schema. Variety in data types represents a fundamental shift in how data must be stored and analyzed to support today’s decision-making and insight processes. Thus Variety refers to the many types of data that cannot easily be captured and managed in a traditional relational database, but that can be stored and analyzed with Big Data technologies.
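
To make the idea of semi-structured data a little more concrete, here is a minimal Python sketch. It assumes the incoming records arrive as JSON lines; the sample records and all field names (`source`, `temperature_c`, `order_id` and so on) are invented for illustration and do not come from any particular system. The point is only that each record carries its own structure, so no single fixed table schema fits them all.

```python
import json

# Hypothetical sample records: every field name here is invented for illustration.
RAW_RECORDS = [
    '{"source": "sensor", "id": 17, "temperature_c": 21.5}',
    '{"source": "social", "user": "alice", "text": "Stuck in traffic again"}',
    '{"source": "shop", "order_id": 9001, "items": ["book", "pen"], "total": 23.4}',
]


def parse_records(lines):
    """Parse each JSON line into a dict; no shared schema is assumed."""
    return [json.loads(line) for line in lines]


def fields_by_source(records):
    """Collect the set of field names seen for each record source."""
    seen = {}
    for record in records:
        source = record.get("source", "unknown")
        seen.setdefault(source, set()).update(record.keys())
    return seen


records = parse_records(RAW_RECORDS)
schema_summary = fields_by_source(records)
for source, fields in sorted(schema_summary.items()):
    print(source, sorted(fields))
```

Each source ends up with a different set of fields, which is exactly what makes such data awkward to fit into one relational table.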

Velocity

Velocity requires analyzing data in near real time; as the saying goes, “sometimes 2 minutes is too late!”. Gaining a competitive edge means identifying a trend or opportunity minutes or even seconds before your competitor does. Another example is time-sensitive processes such as catching fraud, where information must be analyzed as it streams into your enterprise in order to maximize its value. Time-sensitive data has a very short shelf-life, compelling organizations to analyze it in near real time.
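
As a rough illustration of analyzing data while it streams in, the following Python sketch keeps a sliding time window of events per key and raises a flag when a key becomes unusually active. It is a toy under stated assumptions, not a real fraud detector: the class name `SlidingWindowCounter`, the 60-second window, the threshold of 3 and the card identifiers are all hypothetical, and events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events per key within the last `window_seconds` seconds."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = {}  # key -> deque of timestamps still inside the window

    def observe(self, key, timestamp):
        """Record one event; return True if the key now exceeds the threshold."""
        window = self.events.setdefault(key, deque())
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        return len(window) > self.threshold


# Hypothetical stream of (card, timestamp-in-seconds) events.
stream = [
    ("card-1", 0), ("card-1", 10), ("card-2", 15),
    ("card-1", 20), ("card-1", 25), ("card-1", 200),
]

counter = SlidingWindowCounter(window_seconds=60, threshold=3)
alerts = [(card, ts) for card, ts in stream if counter.observe(card, ts)]
print(alerts)
```

Every event is evaluated the moment it arrives, instead of being stored first and analyzed in a later batch run, which is the essence of the Velocity requirement.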

Veracity (Value)

Acting on data is how we create opportunities and derive value. Data is all about supporting decisions, so when you are looking at decisions that can have a major impact on your business, you are going to want as much information as possible to support your case. Nevertheless, the volume of data alone does not provide enough trust for decision makers to act upon information. The truthfulness and quality of data are the most important frontier to fuel new insights and ideas. Thus establishing trust in Big Data solutions is probably the biggest challenge to overcome in building a solid foundation for successful decision making.

While the existing installed base of business intelligence and data warehouse solutions wasn’t engineered to support the four Vs, Big Data solutions are being developed to address these challenges.

What follows is a brief presentation of the major open-source, Java-based tools available today that support Big Data:

	HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. HDFS is specifically designed for storing vast amounts of data, so it is optimized for storing and accessing a relatively small number of very large files, in contrast to traditional file systems, which are optimized to handle large numbers of relatively small files.
	Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
	Apache HBase is the Hadoop database, a distributed, scalable, big data store. It provides random, realtime read/write access to Big Data and is optimized for hosting very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. At its core, Apache HBase is a distributed, versioned, column-oriented store modeled after Google’s Bigtable, as described in “Bigtable: A Distributed Storage System for Structured Data” by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
	Apache Cassandra is a performant, linearly scalable and highly available database that can run on commodity hardware or cloud infrastructure, making it a strong platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for users and the peace of mind of knowing that you can survive regional outages. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
	Apache Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
	Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig’s language layer currently consists of a textual language called Pig Latin, which is designed with ease of programming, optimization opportunities and extensibility in mind.
	Apache Chukwa is an open source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
	Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example through heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
	Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. In short, Apache ZooKeeper is a high-performance coordination service for distributed applications like those run on a Hadoop cluster.
	Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
	Apache Oozie is a scalable, reliable and extensible workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).
	Apache Mahout is a scalable machine learning and data mining library. Currently Mahout mainly supports four use cases:
		Recommendation mining: takes users’ behavior and from that tries to find items users might like.
		Clustering: takes, for example, text documents and arranges them into groups of topically related documents.
		Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the (hopefully) correct category.
		Frequent itemset mining: takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
	Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes:
		Providing a shared schema and data type mechanism.
		Providing a table abstraction so that users need not be concerned with where or how their data is stored.
		Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
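
Several of the tools above (Hive, Pig, Oozie, Mahout) ultimately build on the map/reduce programming model, so it may help to see that model in miniature. The following is a single-process Python sketch of a word count; it does not use Hadoop or any of its APIs, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are chosen here purely for illustration. In a real cluster the framework would run the map and reduce steps in parallel on many nodes and handle the grouping and fault tolerance itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over the input documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, both steps can be spread across machines without coordination, which is what lets the model scale to very large data-sets.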

That’s it; Big Data, a short theoretical introduction and a compact matrix of implementation approaches focused on overcoming the problems of a new era – the era that forces us to ask bigger questions!

Happy Coding
Byron