Getting Started with MapR Streams

Tugdual Grall, March 11th, 2016 (Last Updated: March 11th, 2016)


MapR Streams is a new distributed messaging system for streaming event data at scale, and it’s integrated into the MapR converged platform. MapR Streams uses the Apache Kafka API, so if you’re already familiar with Kafka, you’ll find it particularly easy to get started with MapR Streams.

Although MapR Streams generally uses the Apache Kafka programming model, there are a few key differences. For instance, there is a new kind of object in the MapR file-system called, appropriately enough, a stream. Each stream can handle a huge number of topics, and you can have many streams in a single cluster. Policies such as time-to-live or ACEs (Access Control Expressions) can be set at the stream level for convenient management of many topics together. You can find out more about streaming architectures using Kafka and MapR Streams in the new short book Streaming Architectures: New Designs Using Apache Kafka and MapR Streams, available as a free download from the MapR website.

If you already have Kafka applications, it’s easy to migrate them over to MapR Streams. You can find out more in the MapR documentation at http://maprdocs.mapr.com/51/#MapR_Streams/migrating_kafka_applications_to_mapr_streams.html

In this post, we describe how to take a simple application we originally wrote for Kafka and run it on MapR Streams instead.

Sample Programs

As mentioned above, MapR Streams uses Kafka API 0.9.0, which means it is possible to reuse the same application with minor changes. Before diving into a concrete example, let’s take a look at what has to be changed:

The topic names change from “topic-name” to “/stream-name:topic-name” as MapR organizes the topics in streams for management reasons (security, TTL, etc.).
The producer and consumer configuration parameters that are not used by MapR Streams are automatically ignored, so no change here.
The producer and consumer applications use jars from MapR rather than the Apache Kafka jars.
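To make the first change concrete, here is a minimal sketch of a helper that turns a Kafka-style topic name into the “/stream-name:topic-name” form. The class name StreamTopics and its validation rules (an absolute stream path, no colon in the topic name) are our own assumptions for illustration; they are not part of the MapR or Kafka APIs.

```java
/** Hypothetical helper (not part of any MapR API): builds MapR Streams topic names. */
final class StreamTopics {
    private StreamTopics() {}

    /** Returns "/stream-path:topic-name" for a given stream path and Kafka-style topic name. */
    static String fullName(String streamPath, String topic) {
        // Assumption: streams live at an absolute path, as /sample-stream does in this post.
        if (!streamPath.startsWith("/")) {
            throw new IllegalArgumentException("stream path must be absolute: " + streamPath);
        }
        // Assumption: the colon is reserved as the separator, so it may not appear in the topic.
        if (topic.isEmpty() || topic.contains(":")) {
            throw new IllegalArgumentException("invalid topic name: " + topic);
        }
        return streamPath + ":" + topic;
    }

    public static void main(String[] args) {
        System.out.println(fullName("/sample-stream", "fast-messages"));
    }
}
```

With a helper like this, the topic strings passed to the Kafka producer and consumer calls are the only values that need to change.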

You can find a complete application on the Sample Programs for MapR Streams page. It is a copy, with minor changes, of the Sample Programs for Kafka 0.9 API project. This Kafka project has been documented in this article.

Prerequisites

You will need basic Java programming skills as well as access to:

A running MapR 5.1 Cluster or Sandbox
Apache Maven 3.0 or later
Git to clone the https://github.com/mapr-demos/mapr-streams-sample-programs repository

Running Your First MapR Streams Application

Step 1: Create the stream

A stream is a collection of topics that you can manage together by:

Setting security policies that apply to all topics in that stream
Setting a default number of partitions for each new topic that is created in the stream
Setting a time-to-live for messages in every topic in the stream

You can find more information about MapR Streams concepts in the documentation.

Run the following command, as mapr user, on your MapR cluster:

$ maprcli stream create -path /sample-stream

By default, the produce and consume topic permissions are granted to the creator of the stream, that is, the Unix user you used to run the maprcli command. You can change the permissions by editing the stream. For example, to make all of the topics available to anybody (public permission), you can run the following command:

$ maprcli stream edit -path /sample-stream -produceperm p -consumeperm p -topicperm p
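If you prefer to script these two steps from Java, the commands can be assembled as argument lists and handed to ProcessBuilder. This is only a sketch: the class MaprcliCommands is a hypothetical name, it uses only the maprcli flags shown above, and run() assumes that maprcli is on the PATH of the user executing it.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper that builds the maprcli commands used in this step. */
final class MaprcliCommands {
    private MaprcliCommands() {}

    /** Equivalent of: maprcli stream create -path <path> */
    static List<String> createStream(String path) {
        return Arrays.asList("maprcli", "stream", "create", "-path", path);
    }

    /** Equivalent of the "stream edit" command above, granting public permissions. */
    static List<String> makePublic(String path) {
        return Arrays.asList("maprcli", "stream", "edit", "-path", path,
                "-produceperm", "p", "-consumeperm", "p", "-topicperm", "p");
    }

    /** Runs a command and returns its exit code; assumes maprcli is on the PATH. */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

For example, MaprcliCommands.run(MaprcliCommands.createStream("/sample-stream")) would issue the same command as the one typed at the prompt.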

Step 2: Create the topics

We need two topics for the example program, which can be created using maprcli:

$ maprcli stream topic create -path /sample-stream  -topic fast-messages
$ maprcli stream topic create -path /sample-stream  -topic summary-markers

These topics can be listed using the following command:

$ maprcli stream topic list -path /sample-stream
topic            partitions  logicalsize  consumers  maxlag  physicalsize
fast-messages    1           0            0          0       0
summary-markers  1           0            0          0       0
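If you want to check this listing from a program, the tabular output is easy to parse. The sketch below assumes the layout shown above (a header row followed by whitespace-separated columns, with no spaces inside a value); TopicListParser is a hypothetical class written for this post, not a MapR API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the tabular output of "maprcli stream topic list". */
final class TopicListParser {
    private TopicListParser() {}

    /** Returns one map per topic row, keyed by the column names found in the header row. */
    static List<Map<String, String>> parse(String output) {
        String[] lines = output.trim().split("\\R");
        String[] header = lines[0].trim().split("\\s+");
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split("\\s+");
            Map<String, String> row = new LinkedHashMap<>();
            // Pair each cell with its column name, ignoring any extra cells.
            for (int c = 0; c < header.length && c < cells.length; c++) {
                row.put(header[c], cells[c]);
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Each returned map holds one topic, so rows.get(0).get("partitions") gives the partition count of the first topic listed.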

Note that the program will automatically create the topic if it does not already exist. For your applications you should decide whether it is better to allow programs to automatically create topics simply by virtue of mentioning them or whether it is better to strictly control which topics exist.

Step 3: Compile and package the example programs

Go back to the directory where you have the example programs and build them.

$ cd ..
$ mvn package
...

The project creates a jar with all external dependencies (./target/mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar).

Note that you can build the project with the Apache Kafka dependencies as long as you do not package them into your application when you run and deploy it. This example instead depends on the MapR Streams client, which can be found in the mapr.com Maven repository.

   <repositories>
       <repository>
           <id>mapr-maven</id>
           <url>http://repository.mapr.com/maven</url>
           <releases><enabled>true</enabled></releases>
           <snapshots><enabled>false</enabled></snapshots>
       </repository>
   </repositories>
   ...
       <dependency>
           <groupId>org.apache.kafka</groupId>
           <artifactId>kafka-clients</artifactId>
           <version>0.9.0.0-mapr-1602</version>
           <scope>provided</scope>
       </dependency>
  ...

Step 4: Run the example producer

You can install the MapR Client and run the application locally, or copy the jar file onto your cluster (any node).

$ scp ./target/mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar mapr@<YOUR_MAPR_CLUSTER>:/home/mapr

The producer will send a large number of messages to /sample-stream:fast-messages along with occasional messages to /sample-stream:summary-markers. Since there isn’t any consumer running yet, nobody will receive the messages.

If you compare this with the Kafka example used to build this application, the topic name is the only change to the code.

Any MapR Streams application will need the MapR Client libraries. One way to make these libraries available is to add them to the application classpath using the /opt/mapr/bin/mapr classpath command. For example:

$ java -cp $(mapr classpath):./mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar com.mapr.examples.Run producer
Sent msg number 0
Sent msg number 1000
...
Sent msg number 998000
Sent msg number 999000

The only important difference here between an Apache Kafka application and MapR Streams application is that the client libraries are different. This causes the MapR Producer to connect to the MapR cluster to post the messages, and not to a Kafka broker.
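As a sketch of what this means for configuration, a Kafka-style set of producer properties can be kept as it is. The class name ProducerSettings and the broker address below are hypothetical, and we assume that bootstrap.servers is one of the parameters MapR Streams does not use and therefore ignores.

```java
import java.util.Properties;

/** Hypothetical holder for the producer configuration shared by the Kafka and MapR builds. */
final class ProducerSettings {
    private ProducerSettings() {}

    static Properties kafkaStyle() {
        Properties props = new Properties();
        // Needed by Apache Kafka only; assumed to be unused (and ignored) by MapR Streams.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Standard Kafka 0.9 serializer classes, also shipped in the MapR client jar.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}
```

These properties could then be passed to new KafkaProducer<>(props) with the MapR client libraries on the classpath, and the messages sent to a topic such as /sample-stream:fast-messages.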

Step 5: Start the example consumer

In another window, you can run the consumer using the following command:

$ java -cp $(mapr classpath):./mapr-streams-examples-1.0-SNAPSHOT-jar-with-dependencies.jar com.mapr.examples.Run consumer
1 messages received in period, latency(min, max, avg, 99%) = 20352, 20479, 20416.0, 20479 (ms)
1 messages received overall, latency(min, max, avg, 99%) = 20352, 20479, 20416.0, 20479 (ms)
1000 messages received in period, latency(min, max, avg, 99%) = 19840, 20095, 19968.3, 20095 (ms)
1001 messages received overall, latency(min, max, avg, 99%) = 19840, 20479, 19968.7, 20095 (ms)
...
1000 messages received in period, latency(min, max, avg, 99%) = 12032, 12159, 12119.4, 12159 (ms)
998001 messages received overall, latency(min, max, avg, 99%) = 12032, 20479, 15073.9, 19583 (ms)
1000 messages received in period, latency(min, max, avg, 99%) = 12032, 12095, 12064.0, 12095 (ms)
999001 messages received overall, latency(min, max, avg, 99%) = 12032, 20479, 15070.9, 19583 (ms)

Note the high latency listed in the summaries for the message batches. This is because the consumer wasn’t running when the messages were sent to MapR Streams, so it only receives them long after they were sent.
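To show how a summary line of this kind can be derived, here is a minimal sketch that computes min, max, average, and 99th percentile from a batch of latencies. It assumes each latency is the receive time minus a send timestamp carried in the message; LatencySummary is a hypothetical class, and the sample program may well compute its statistics differently (for example with a histogram), so the numbers will not match the output above exactly.

```java
import java.util.Arrays;

/** Hypothetical summary of a batch of message latencies, in milliseconds. */
final class LatencySummary {
    final long min;
    final long max;
    final double avg;
    final long p99;

    LatencySummary(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        min = sorted[0];
        max = sorted[sorted.length - 1];
        long sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        avg = (double) sum / sorted.length;
        // Nearest-rank percentile: smallest sample with at least 99% of samples at or below it.
        int rank = (int) Math.ceil(0.99 * sorted.length);
        p99 = sorted[Math.max(rank, 1) - 1];
    }

    /** Assumption: the message carries its send time, so latency is receive time minus send time. */
    static long latencyMs(long sentAtMs, long receivedAtMs) {
        return receivedAtMs - sentAtMs;
    }
}
```

Because the latency is measured from the moment a message was sent, a consumer started 20 seconds after the producer will report latencies of roughly 20,000 ms for the first messages it reads.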

Monitoring your topics

At any time you can use the maprcli tool to get some information about the topic. For example:

$ maprcli stream topic info -path /sample-stream -topic fast-messages -json

The -json option is used to get the topic information as a JSON document.

Cleaning up

When you are done playing, you can delete the stream and all associated topics using the following command:

$ maprcli stream delete -path /sample-stream

Conclusion

Using this example built from an Apache Kafka application, you have learned how to write, deploy, and run your first MapR Streams application.

As you can see, the application code is very similar, and only a few changes need to be made (such as changing the topic names). This means you can easily deploy your Kafka applications on MapR and reap the benefits of MapR Streams features such as advanced security, geographically distributed deployment, a very large number of topics, and many more. This also means that you can immediately use all of your Apache Kafka skills on a MapR deployment.

If you have any questions about running a MapR Streams application, please ask them in the comments section below.

Reference:

Getting Started with MapR Streams from our JCG partner Tugdual Grall at the MapR blog.