Software Development

Beyond Real-time Data Applications – Whiteboard Walkthrough

In this week’s Whiteboard Walkthrough, Ellen Friedman, a consultant at MapR, talks about how to design a system to handle real-time applications, but also how to take advantage of streaming data beyond those in-the-moment insights.
https://youtu.be/lvw_Gqs38xo

Here’s the unedited transcription:

Hi, I’m Ellen Friedman. I’m a consultant for MapR and an O’Reilly author. I’m here today to talk to you about how to design a system to handle real-time applications, but also how to take advantage of streaming data beyond those in-the-moment insights.

Let’s start with something simple: the kind of target, the goals, that drive you to this kind of work. People might, for example, be designing some sort of real-time application to do updates to a dashboard, a real-time dashboard. The first place they may start thinking about the system is what sorts of technologies or tools they want to use to design this real-time or near real-time application. There really are a variety of good choices. You might use something like Apache Spark Streaming, Apache Flink, an older tool like Apache Storm, or Apache Apex. There are a number of different good choices that let you design your application within whatever window of latency is appropriate for the ultimate business goal that you have.
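To make that concrete, here’s a minimal sketch of the kind of job such an engine runs, written against the Apache Flink DataStream API of that era. The socket source, the five-second window, and the class and job names are illustrative assumptions, not anything from the talk:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DashboardCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative source: one event type per line on a local socket.
        // In practice this would be fed by the messaging layer described below.
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        // Count events per type over five-second tumbling windows -- the
        // "window of latency" chosen to match what the dashboard needs.
        DataStream<Tuple2<String, Integer>> counts = events
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String eventType) {
                        return Tuple2.of(eventType, 1);
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1);

        counts.print();  // stand-in for pushing updates to a live dashboard
        env.execute("dashboard-counts");
    }
}
```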

The question then comes: how do you deliver data to that system, and how do you do it effectively and safely? Well, people who’ve worked with systems like this before probably realize that their data source (I’ve labeled it as a P, to mean the producer of the data) is in this case some sort of event data: it might be clickstream data, log data, maybe sensor data. There are a number of different sources you might use, but whatever the producer is, it’s probably best if the data doesn’t go directly to the application; instead, use some sort of upstream messaging system. A sort of queue, essentially a safety queue, so that you are streaming the data into that messaging system, which will then deliver it to the real-time application, which becomes the consumer in this system.
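On the producer side, a minimal sketch of what P might look like with the standard Kafka Java client could be something like the following; the topic name "events", the broker address, and the click-event payload are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the broker confirms the write is replicated before
        // acknowledging, so events survive a downstream interruption.
        props.put("acks", "all");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Illustrative click event; a real producer would publish
            // clickstream, log, or sensor records as they occur.
            producer.send(new ProducerRecord<>("events", "user-42", "page-view:/home"));
        }
    }
}
```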

One reason to do this is that if there’s any sort of interruption in the system, you’re catching that data and keeping it available for the application. To do this very effectively, the capabilities of the messaging system really matter. For example, for best results it’s good to use a system that lets you have data coming in from a variety of different producers, but that also decouples the producers from the consumers. The data is made available for use right away, but it doesn’t have to be used right away. In other words, you make that data stream durable for varying amounts of time. That way it’s available for immediate use, and it’s also available if you come back later.
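The consumer side of that same picture, sketched with the Kafka Java client, might look like the poll loop below; the "events" topic and "dashboard" group are again hypothetical. Setting auto.offset.reset to earliest is what lets a consumer that starts late, or restarts after an interruption, pick up whatever the stream has retained:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DashboardConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dashboard");  // this consumer's group
        // Start from the oldest retained message when no offset is stored,
        // so data produced before the app came up is not lost to it.
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    // Feed each event into the real-time application here.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```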

This has a number of advantages. For one thing, you might add a different sort of application, maybe a different low-latency application for a different purpose. You can add different consumers at different times. You can have multiple producers and multiple consumers of that data. So, what sorts of technologies really support these capabilities? We particularly like Apache Kafka. There’s also MapR Streams, which is a messaging system that’s integrated into the MapR Data Platform. It actually uses the Apache Kafka 0.9 API, so there are a number of similarities between those two messaging technologies. They’re both really good for this kind of approach, where you have a durable, replayable stream. You can have multiple producers and consumers of the data. You can add consumers later.
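In Kafka terms, much of that decoupling comes from consumer groups: each group tracks its own read position (offset) in the stream, so consumers never interfere with one another. A small configuration sketch, with hypothetical group names, of how a new consumer could be added months later without touching the existing one:

```java
import java.util.Properties;

public class ConsumerGroups {
    // Shared connection settings; each consumer group differs only in group.id.
    static Properties baseProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // A new group with no stored offsets starts from the oldest
        // retained message -- no coordination with other groups needed.
        props.put("auto.offset.reset", "earliest");
        return props;
    }

    public static void main(String[] args) {
        Properties dashboard = baseProps();
        dashboard.put("group.id", "dashboard");        // existing consumer

        Properties reporting = baseProps();
        reporting.put("group.id", "weekly-reporting"); // added months later

        // Each Properties set would be passed to its own KafkaConsumer, as in
        // the poll loop above; both groups read the same stream independently.
    }
}
```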

Now let’s think beyond this initial way of looking at it, where you’re trying to get some sort of real-time or near real-time insights, to be able to use this streaming data in an even broader way. For example, you might want to have another class of consumers here, which might be something like a database. It might be a search document. You’re basically using it to do updates that capture a picture of the current status of whatever this data is, the current state of affairs. That might be something like, say with regard to your bank, maybe your own bank accounts. It’s quite a different thing to ask, “What is the current status of your bank accounts?” Maybe the balances for your checking, for your savings? What interest is paid? That’s a different thing than looking at the sequence of events, the sequence of deposits and withdrawals that leads up to that current state of affairs.
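As a sketch of that second class of consumer, here is a hypothetical balance updater that folds a stream of transaction events into current per-account state. The "transactions" topic and the signed-amount event format are assumptions for illustration, and an in-memory map stands in for the database or search document:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BalanceUpdater {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "balance-db");  // independent of the dashboard group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Stand-in for a real database or search index:
        // the current balance for each account.
        Map<String, Long> balances = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("transactions"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(100)) {
                    // Assumed event format: key = account id, value = signed
                    // amount in cents ("5000" deposit, "-2500" withdrawal).
                    long amount = Long.parseLong(record.value());
                    balances.merge(record.key(), amount, Long::sum);
                }
                // The map now reflects the *current* state of affairs, while
                // the stream itself still holds the full sequence of events.
            }
        }
    }
}
```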

This is a different way to use that same streaming data. There are situations, again, maybe thinking in terms of a bank, where you need that replayable, auditable history. Maybe you’re trying to do some sort of an audit. In that situation you actually do want to be able to say what each event is, when it happened, and what the sequence of events was. Again, with a messaging system such as Apache Kafka or MapR Streams, you’re able to do that with that same streaming queue, the same stream of event data. You’re able to have one class of consumers using that data for in-the-moment insights in real-time applications, others using it for things like updates to a database or a search document, and others that want to go back and do some sort of forensic analysis on an auditable log, basically a long-term history. It’s actually really good to support consumers in all of these classes of use cases off of the same data stream.
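With the Kafka Java client, replaying that history for an audit can be done by assigning the topic’s partitions directly and seeking back to the beginning of what’s retained. This sketch assumes a reasonably recent client (the Collection-based seekToBeginning signature) and the same hypothetical "transactions" topic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class AuditReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("enable.auto.commit", "false");  // read-only pass; commit nothing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions of the topic manually and rewind to the
            // oldest retained offset to replay the full event history.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("transactions")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            boolean caughtUp = false;
            while (!caughtUp) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                caughtUp = records.isEmpty();  // crude stop condition for a sketch
                for (ConsumerRecord<String, String> record : records) {
                    // Each record carries its offset: what happened and in
                    // what sequence -- the auditable history.
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```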

To recap the capabilities we’ve talked about that support this kind of design: you want the messaging system to support multiple producers and consumers, but to do that in a way where they are decoupled, which gives you enormous flexibility and agility. You can add new consumers later, each one not dependent on the others. As we said, in order to do that it’s essential for that message stream to be durable. It needs to be something that you can persist, and you’d like that to be configurable. You can persist it for a short time, or if you want to keep it really long-term, you might set that time-to-live essentially to infinity, and you have that long-term auditable history.
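In Kafka, that time-to-live is the topic’s retention setting, and retention.ms of -1 keeps messages indefinitely. A sketch of creating such a topic with the AdminClient (which arrived in Kafka releases later than the 0.9 API mentioned above); the topic name, partition count, and replication factor are illustrative:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2 -- illustrative values.
            NewTopic topic = new NewTopic("transactions", 3, (short) 2);

            Map<String, String> configs = new HashMap<>();
            // retention.ms = -1: never expire messages, i.e. a time-to-live
            // set "essentially to infinity" for a long-term auditable history.
            configs.put("retention.ms", "-1");
            topic.configs(configs);

            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```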

In today’s world it obviously needs to be massively scalable, too. With systems like Apache Kafka and MapR Streams, the good news is you actually are able to have both high performance, very high throughput, and very low latency; unlike older systems where that’s a trade-off, you’re able to do both. These are both excellent technologies to support that kind of system. This is a way that you can build your real-time applications but go way beyond that, to a much broader view of how to use streaming data. This is the sort of situation that is leading people to think in terms of a stream-first architecture.

Thank you very much for watching. If you’d like to know more about these topics, there’s a link to a book I co-authored with Ted Dunning called Streaming Architecture that’s available for free download. There’s also a place where you can ask questions or suggest a topic for other Whiteboard Walkthroughs. Let us hear from you in the comments section. Thank you very much!

Reference: Beyond Real-time Data Applications – Whiteboard Walkthrough from our JCG partner Ellen Friedman at the MapR blog.

Ellen Friedman

She is a consultant and commentator on big data topics. Active in open source, she is a committer on the Apache Drill and Apache Mahout projects and co-author of many books on working with data in the Hadoop ecosystem. She has a PhD in biochemistry, years of experience as a research scientist, and has written about a wide range of technical topics, including biology, oceanography, and the genetics of learning and memory.