Tag Archives: Big Data

How Apache Kafka and MapR Streams Handle Topic Partitions

Streaming data can be used as a long-term auditable history when you choose a messaging system with persistence, but is this approach practical given the cost of storing years of data at scale? The answer is “yes,” particularly because of the way MapR Streams handles topic partitions. Here’s how it works. Streaming Data as a Long ...

The Changing Economics of Big Data

Perhaps you’re old enough to remember when the library was the place we went to learn. We foraged through card catalogs, encyclopedias and the Reader’s Guide to Periodical Literature in hopes that we’d be able to understand what was going on in other people’s minds when they decided what went where. The process was time-consuming, frustrating and often futile. We ...

Distributed Deep Learning with Caffe Using a MapR Cluster

We have experimented with CaffeOnSpark on a 5-node MapR 5.1 cluster running Spark 1.5.2 and will share our experience, difficulties, and solutions in this blog post. Deep Learning and Caffe Deep learning has been getting a lot of attention recently, with AlphaGo beating a top world player at a game that was thought so complicated as to be out of reach of ...

Spark Streaming and Twitter Sentiment Analysis

This blog post is the result of my efforts to show a coworker how to get the insights he needed by using the streaming capabilities and concise API of Apache Spark. In this blog post, you’ll learn how to do some simple yet very interesting analytics that will help you solve real problems by analyzing specific areas of a ...

Key Steps for Removing the Hive Metastore Password from the Hive Configuration

In a typical Hive installation with metadata in a MySQL configuration, a password is configured in a configuration file in clear text. This presents a few risks: 1) Unauthorized access could destroy/modify Hive metadata and disrupt workflows. A malicious user could alter Hive permissions or damage metadata. 2) This password permits hiveserver2-thrift-MySQL communication. To avoid this problem, you should use ...

Spark Data Source API: Extending Our Spark SQL Query Engine

In my last post, Apache Spark as a Distributed SQL Engine, we explained how to use SQL to query data stored in Hadoop. Our engine can read CSV files from a distributed file system, auto-discover the schema from the files, and expose them as tables through the Hive metastore. All this was done to ...

Achieving Sub-Second SQL JOINs and Building a Data Warehouse Using Spark, Cassandra, and FiloDB

Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest open source technologies. He is the creator of the FiloDB open-source distributed analytical database, as well as the Spark Job Server. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including ...
