
How to Replicate Streaming Data Across Data Centers with MapR Streams

In this week’s Whiteboard Walkthrough, Jorge Geronimo, Solutions Architect at MapR, explains how, with a single line of code, you can create a replica of a MapR data stream within the same cluster or to another cluster, even in another part of the world. Jorge also describes multi-master replication for streaming data and how MapR Streams’ unique capability for geo-distributed replication with preserved offsets offers advantages for working with streaming data.

https://youtu.be/qkXIQ1wT-zw

Editor’s Note: In replication of data streams across data centers using MapR Streams, consumers can fail over from one site to another.  In addition, the system will break replication loops. For more detail on these aspects of geo-distributed replication with MapR Streams, see “MapR Streams Under the Hood.”

Additional resources on MapR Streams include:

•    Streaming Architecture Chapter 5
•    “Getting Started with MapR Streams” tutorial and sample programs

The full transcript follows below:

Hi. My name is Jorge Geronimo, and I’m a solution architect with MapR Technologies.

For this Whiteboard Walkthrough, what we’re going to be talking about are applications for MapR Streams replication. Now MapR Streams is a message transport layer within the MapR Converged Data Platform. MapR Streams replication is MapR’s ability to replicate a stream either within the same cluster, maybe for engineering or model training, or to other clusters within the same data center or anywhere else around the world. You may want that for HA and DR situations.

Let’s imagine a use case in which you have a manufacturing plant in Asia. There are sensors on that manufacturing line that are producing messages to a stream called the metrics stream within the MapR cluster. There are also consumers, and these are processes or applications that are reading data off of the metrics stream and doing something with that data.
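To make that concrete, here is a minimal sketch of what one of those sensor producers might look like. MapR Streams is accessed through the standard Kafka producer and consumer APIs, with topics addressed as the stream path plus a topic name; the stream path /apps/metrics and the topic name line2_asia below are illustrative, not taken from the video.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SensorProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // With the MapR Streams client, bootstrap.servers is not needed;
            // the stream is resolved from the topic path itself.
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // MapR Streams topics are addressed as <stream path>:<topic name>.
            // "/apps/metrics" and "line2_asia" are hypothetical names for this sketch.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                "/apps/metrics:line2_asia", "sensor-42", "{\"temperature\": 71.3}");
            producer.send(record);

            producer.close();
        }
    }

The consumers on the other side would use the matching KafkaConsumer API against the same stream path, as in the analyst example later in this walkthrough.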

What is unique to MapR is that with a single line of code, you can create a replica of the metrics stream, again, either within the same cluster, for model training or what have you, or to another data center, maybe in another part of the world, for disaster recovery situations.
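For reference, that single line is typically the maprcli stream replica autosetup command (there is also an equivalent Java admin API). A hedged sketch, assuming a source stream at /apps/metrics and a replica path of /apps/metrics_replica; the exact paths, and the cluster-qualified form used when the replica lives in a remote cluster, will depend on your environment:

    maprcli stream replica autosetup -path /apps/metrics -replica /apps/metrics_replica

Autosetup creates the replica, copies the existing messages, and then keeps the replica in sync as new messages arrive.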

Let’s take it a step further and say that your company has another manufacturing line somewhere in Europe. It has much the same architecture and many of the same components. Now let’s say that your worldwide headquarters is somewhere in North America. Your analysts in North America want to be able to read data off of all the streams in all the manufacturing plants around the world. Now, those analysts could reach out to each of the individual streams to read that data, or, in order to remove the latencies introduced by geography, you can actually consolidate all of those streams into a single metrics stream within the North America MapR cluster. Your analysts will be able to access data from your Asia line as well as from your European line, and it will be as if they’re accessing data on that same cluster, because as part of replication, MapR is unique in that it preserves the message offsets.
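Here is a minimal sketch of what an analyst application at headquarters might look like, again using the standard Kafka consumer API that MapR Streams exposes. The consolidated stream path /apps/metrics_global and the topic names line2_asia and line2_europe are hypothetical:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GlobalMetricsConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("group.id", "hq-analysts");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

            // Both topics were replicated into the consolidated stream with their
            // offsets preserved, so they read exactly as if produced locally.
            consumer.subscribe(Arrays.asList(
                "/apps/metrics_global:line2_asia",
                "/apps/metrics_global:line2_europe"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("topic=%s offset=%d value=%s%n",
                        record.topic(), record.offset(), record.value());
                }
            }
        }
    }

Because the offsets are preserved during replication, a consumer like this can also fail over between the consolidated stream and a source stream without losing its place.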

This is called master-slave replication. It’s one-way: you have an origin or a source, and you have a destination or a replica. There are other situations where you may want to have a stream that’s able to both push messages to another stream and, at the same time, read messages from that stream. This is called multi-master replication. Something to note here is that the topic names within your streams must be unique, so that your message offsets won’t overwrite each other when they’re replicated to North America.
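One way to sketch a multi-master setup, under the assumption that each stream is simply configured as a replica of the other (the stream paths are illustrative, and the exact procedure may differ in your environment):

    # Each stream is both a source and a replica of the other.
    maprcli stream replica autosetup -path /apps/metrics_asia -replica /apps/metrics_europe
    maprcli stream replica autosetup -path /apps/metrics_europe -replica /apps/metrics_asia

As noted in the editor’s note above, MapR breaks replication loops, so messages do not bounce back and forth between the two streams indefinitely.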

What do I mean by that? Let’s suppose that Asia has a manufacturing line called Line 2, and it publishes its messages to a topic called Line 2_Asia. Your European line publishes its messages to a topic called Line 2_Europe. These will each have unique offsets, and when those streams and topics are then replicated to your North America cluster, they will maintain their message offsets and their uniqueness, avoiding a situation in which you could see data loss because messages overwrite each other.

If you want to learn more about MapR Streams, feel free to access the website. If you have any questions or comments, feel free to post them in the section below. Have a great day.

Jorge Geronimo

Jorge is a Solution Architect on the MapR Sales Team and is based in the Bay Area, California. Jorge has over a decade of data technology experience focusing on enabling customers to leverage major database platforms, including Teradata, Vertica, Oracle, and Microsoft SQL Server. His current interests are focused on Apache Kafka, Spark, and Drill.