Software Development

Streaming Data: How to Move from State to Flow – Whiteboard Walkthrough (Part 2)

 

In this week’s Whiteboard Walkthrough Part II, Ted Dunning, Chief Application Architect at MapR, talks about the design freedom gained by adopting a micro-services architecture based on streaming data. When you move – one step at a time – from an old style architecture that suffers from too much dependence on a shared global state database to a stream-based flow architecture, the isolation between micro-services results in reduced strain on the original database, improved flexibility and often speed.

https://youtu.be/_xqK0Es9zP4

If you would like to know more about building a stream-based architecture, read about MapR Streams as part of the MapR Converged Platform or see the book ‘Streaming Architecture’.

Watch Part I here.

Here’s the unedited transcription:

Hey, there. I’d like to talk today about moving from a state style to a flow style. I’m Ted Dunning, chief application architect at MapR. This is all about a revolution, a re-platforming of computing that’s happening right now. It’s really facilitating moving from old style, siloed systems to new style, micro-service, big data sorts of systems.

Now we talked in another video about the difference between flow and state, but suffice it to say that with state, we tend to share something like a shared database. The shared database has expensive operations like transactions, which allow us to simplify the problem of interacting systems with shared state, but that ultimately is becoming too expensive to scale out to the scale that many, many businesses need to run at.

So we’re going to move to flow instead. It’s the new style in which we put business events in a queue of some sort, and services update their own local state. We avoid having any global state other than the queues and topics that these messages go into. In Kafka, that would be a topic in a broker cluster. In MapR Streams, it would be a topic within a stream. There can of course be many streams in MapR. In the old style, we would have processes A, B, and C all talking to a shared database. Inevitably, what happens, as you might guess and possibly from your own experience, is this becomes red hot.

Now it might become hot because ther are so many processes accessing it, but it might also become hot in a more human way. Because this is a shared resource, and because it’s a shared database, if A wants to make a change, B and C have to agree that the change is okay. If we have 10 people in this team, 10 people in this team, and 10 people in this team, ultimately we have to get agreement between 30 people. That’s really hard. So we wind up having these big meetings. We wind up having lots of engineering change control that’s involving lots of people, and that slows down progress enormously. Ultimately, people start just trying to make do, or they start designing databases with 500 dummy columns that aren’t in use yet just so they can try to avoid talking to each other too much and having to make changes. Well, that’s really, really unpleasant.

I don’t know if you’ve ever worked on a system like that. Both the overload problem and the committee problem (the social aspect of the problem) are very, very hard to solve. Ultimately, large organizations tend to move very slowly because of this sort of issue.

We can migrate from this to a flow-based architecture in some cases, but not in all cases. If we can abstract some of the changes happening to the database in terms of business events—business events that can always be applied. Possibly they’re applied, a failure is noticed, and a correction event is sent, but that these are actual events that will always succeed in some sense; they can always be applied to all of the local state involved. If, suppose, we want to isolate C away from these others as a first step in migration, what we could change all of these arrows to unidirectional and have all of the updates go in the form of business events.

The way that looks now is A, B, and C are sending business events to a queue of some kind, to a Kafka topic or a MapR topic within a MapR Stream. These events are updating the database. This is the old shared database. It was very hot. It’s now much less hot because C isn’t using it so much any more. C now has its own private database which stays nice and cool because it’s totally private, totally specialized to C’s particular needs. It might even be in a memory database, or it might become a NoSQL database. You can change technologies, you can change structure, and you can change design. If there are many columns that C doesn’t even need, the database can get smaller. That makes it faster, and so on.

So business events go out, they’re applied in parallel to these different databases. We do lose a little bit of synchronization, but if we can use the ordering provided by the queue so that A, who is making a change, says, “I have been to this state on my database, and I’m making a change relative to that.” Then C can make sense of those changes. It will make sure that it waits until those updates up to that point have been applied to its local database.

So the trick here is to get a business event that can be applied in parallel. At that point, what we can say is that C, instead of an old-style micro-service, has become an isolated micr-service; it moved from megaservice to microservice.

We now have this abstraction boundary around it that nobody can see inside of. Whatever external interactions that C is specified to have based on its contracts, it will still have, but all updates to its local state must come from the queue, and all updates it makes to the shared sense of the world must go to that queue in the form of business events. These have to be higher level, of course, but that means that the internals are now hidden away, and the team that runs C is free to further devolve it into micro-services or to change the structure or change the way that things work. All kinds of freedoms now happen because C has been isolated. This is the first step in migrating to a flow style of computing, which is necessary for a very large scale.

A synonym for flow is micro-service architecture.This is the first step. We’ve taken a large, interacting ball of string, and we’ve pulled one thread out of it and made one micro-service. It’s our first step in a long journey. It takes a while for this all to happen, but here’s your first step. It’s an exciting step and an exciting journey. Thanks very much

Ted Dunning

Ted Dunning is Chief Application Architect at MapR Technologies and committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects​. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation​.​ Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock) and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button