Systems that Run Forever Self-heal and Scale

I recently saw a great presentation by Joe Armstrong called “Systems that run forever self-heal and scale”. Joe Armstrong is the inventor of Erlang, and he does mention Erlang quite a lot, but the principles are universal and apply to other languages and tools.

The talk is well worth watching, but here are a few quick notes for the busy reader or my future self.

General remarks

 

  • If you want to run forever, you have to have more than one instance of everything. If anything is unique, then as soon as that service or machine goes down, your system goes down. This may be due to an unplanned outage or a routine software update. Obvious, but still pretty hard to achieve.
  • There are two ways to design systems: scaling up or scaling down. If you want a system for 1,000 users, you can start with a design for 10 users and expand it, or start with a design for 1,000,000 users and scale it down. You will get a different design for your 1,000 users depending on where you start.
  • The hardest part is distributing data in a consistent, durable manner. Don’t even try to do it yourself; use known algorithms, libraries and products. Data is sacred, pay attention to it. Web services and such frameworks? Whatever, anyone can write those.
  • Distributing computations is much easier. They can be performed anywhere and resumed or retried after a failure. There are some more hints in the talk on how to do it.
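To make the last point concrete, here is a minimal sketch in Java (not from the talk) of why a side-effect-free computation is easy to retry: if an attempt fails, you simply run it again, possibly somewhere else. The class name `RetrySketch`, the `retry` helper and the simulated failure are all made up for illustration.

```java
import java.util.function.Supplier;

final class RetrySketch {
    /**
     * Runs a side-effect-free task, retrying it up to maxAttempts times.
     * Hypothetical helper: a real system would also pick another node,
     * back off between attempts, and so on.
     */
    static <T> T retry(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();      // success: hand back the result
            } catch (RuntimeException e) {
                last = e;               // remember the failure and try again
            }
        }
        throw last;                     // every attempt failed: give up loudly
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated computation that fails on its first two attempts.
        Integer result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated node failure");
            }
            return 6 * 7;
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

This only works because the task has no side effects. Retrying something that writes data needs the consistency machinery mentioned in the previous point.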

Six rules of a reliable system

  1. Isolation – when one process crashes, it should not crash others. This naturally leads to better fault tolerance, scalability, reliability, testability and comprehensibility. It also makes code upgrades much easier.
  2. Concurrency – pretty obvious: you need more than one computer to make a non-stop system, and that automatically means they will operate concurrently and be distributed.
  3. Failure detection – you can’t fix it if you can’t detect it. It has to work across machine and process boundaries, because an entire machine or process can fail. You can’t heal yourself when you have a heart attack; it has to be an external force. It implies asynchronous communication and a message-driven model. Interesting idea: supervision trees, with supervisors on the higher levels of the tree and workers in the leaves.
  4. Fault identification – when it fails, you also need to know why it failed.
  5. Live code upgrade – an obvious must-have for zero downtime. Once you start the system, never stop it.
  6. Stable storage – store things forever in multiple copies, distributed across many machines and places. With proper stable storage you don’t need backups. Snapshots, yes, but not backups.

Others: fail fast, fail early, let it crash. Don’t swallow errors, and don’t continue unless you really know what you’re doing. It is better to crash and let a higher-level process decide how to deal with the illegal state.
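The combination of “let it crash”, failure detection and fault identification can be sketched in plain Java. This is only an illustration under my own assumptions, not Erlang's supervisor behaviour and not code from the talk: the `SupervisorSketch` class, its `supervise` method and the restart limit are hypothetical. The worker does not catch its own errors. A separate supervisor notices the crash, records why it happened, and decides whether to restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SupervisorSketch {
    /** Cause of every crash the supervisor observed (fault identification). */
    final List<String> crashLog = new ArrayList<>();

    /**
     * Runs the worker on its own thread and restarts it after each crash,
     * giving up after maxRestarts restarts.
     * Returns true if the worker eventually finished normally.
     */
    boolean supervise(Runnable worker, int maxRestarts) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int restarts = 0; restarts <= maxRestarts; restarts++) {
                Future<?> run = pool.submit(worker);
                try {
                    run.get();      // failure detection happens outside the worker
                    return true;
                } catch (ExecutionException crash) {
                    // The worker crashed: record why, then loop to restart it.
                    crashLog.add(String.valueOf(crash.getCause()));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return false;           // too many crashes: escalate to the caller
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] runs = {0};
        // The worker fails fast on illegal state instead of limping along.
        Runnable flaky = () -> {
            if (++runs[0] < 3) {
                throw new IllegalStateException("illegal state on run " + runs[0]);
            }
        };
        SupervisorSketch supervisor = new SupervisorSketch();
        boolean finished = supervisor.supervise(flaky, 5);
        System.out.println("finished=" + finished + ", crashes=" + supervisor.crashLog);
    }
}
```

Returning `false` once the restart limit is reached is the hook for a supervision tree: a supervisor one level up could treat that as a crash of this whole subtree.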

Actor model in Erlang

We’re used to two notions of running things concurrently: processes and threads. The difference? Processes are isolated and live in separate memory, so one can’t corrupt another. Threads share memory, so they can.

Answer from Erlang: Actors. They are isolated processes, but they’re not the heavy operating system processes. They all live in the Erlang VM, rely on it for scheduling etc. They’re very light and you can easily run thousands of them on a computer.
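To give a rough feel for the idea outside Erlang, here is a toy actor in Java. It is a sketch under my own assumptions and not how the Erlang VM works: `ActorSketch`, `Actor`, `send` and `runDemo` are invented names. Each actor owns a mailbox and private state, handles one message at a time, and borrows a thread from a small shared pool only while it has work to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class ActorSketch {
    /** A toy actor: a mailbox plus a handler that sees one message at a time. */
    static final class Actor<M> {
        private final ConcurrentLinkedQueue<M> mailbox = new ConcurrentLinkedQueue<>();
        private final AtomicBoolean scheduled = new AtomicBoolean(false);
        private final ExecutorService pool;
        private final Consumer<M> handler;

        Actor(ExecutorService pool, Consumer<M> handler) {
            this.pool = pool;
            this.handler = handler;
        }

        /** Asynchronous send: enqueue the message and return immediately. */
        void send(M message) {
            mailbox.add(message);
            trySchedule();
        }

        private void trySchedule() {
            // At most one drain task per actor runs at any time.
            if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
                pool.execute(this::drain);
            }
        }

        private void drain() {
            try {
                M message;
                while ((message = mailbox.poll()) != null) {
                    handler.accept(message);
                }
            } finally {
                scheduled.set(false);
                trySchedule(); // pick up messages that arrived while finishing
            }
        }
    }

    /** Sends messagesPerActor messages to each of actorCount actors; returns the sum of their counters. */
    static long runDemo(int actorCount, int messagesPerActor) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // far fewer threads than actors
        CountDownLatch done = new CountDownLatch(actorCount * messagesPerActor);
        long[] counters = new long[actorCount]; // slot i is touched only by actor i
        try {
            List<Actor<Integer>> actors = new ArrayList<>();
            for (int i = 0; i < actorCount; i++) {
                final int id = i;
                actors.add(new Actor<Integer>(pool, amount -> {
                    counters[id] += amount;
                    done.countDown();
                }));
            }
            for (int m = 0; m < messagesPerActor; m++) {
                for (Actor<Integer> actor : actors) {
                    actor.send(1);
                }
            }
            done.await(); // wait until every message has been handled
            long total = 0;
            for (long c : counters) {
                total += c;
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("messages handled: " + runDemo(10_000, 10));
    }
}
```

The demo runs 10,000 actors on four threads, which is the sense in which actors are “light”. Unlike Erlang processes, these toy actors still share one heap, so the isolation here is by convention only.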

Conclusion

Much of this is very natural in functional programming. Perhaps that’s what makes functional programming so popular nowadays: in this paradigm it’s much easier to write reliable, fault-tolerant, scalable, comprehensible systems.
 

Reference: Systems that Run Forever Self-heal and Scale from our JCG partner Konrad Garus at the Squirrel’s blog.


One Response to "Systems that Run Forever Self-heal and Scale"

  1. A while back I wrote an article on process supervision and how best to overcome shortcomings in the approach taken by Erlang and other Java frameworks which look to emulate it.

    http://www.jinspired.com/site/going-beyond-actor-process-supervision-with-simulation-signaling

    “This all sounds good until you realize that supervision in Erlang, and much like other actor systems including Scala/Akka, is largely focused on process lifecycle management and not the internal workings of the worker process (or actor). The supervisor does not actually monitor, or is aware of, the software execution behavior of the worker process other than whether it has terminated and the cause of such termination. It is not at all like a metering supervisor.

    Simz on the other hand pairs a simulation runtime with the application that is for all intents and purposes the application under supervision. The supervisor is the worker in terms of metered execution and resource consumption tracking. The simulated runtime assumes the behavior patterns of the real application runtime and thus supervision is self reflective. Instead of asking the question “What is the worker process doing?” it asks “What am I doing?”. In this context you can think of Simz as the “brain in a vat”. Its perception of reality is via the metering feed. Here the worker takes the role of the body of which we can have many over time.”

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd.