Back at university, when I dealt with much low-level problem solving and very basic libraries and constructs, I learned to pay attention to what can possibly go wrong. A lot. Implementing reliable, hang-proof communication over plain sockets? I remember it today, a trivial loop of “core logic” and a ton of guards around it.
Now I suspect I am not the only person who got used to all the convenient higher-level abstractions so much that he began to forget this approach. Thing is, real software is a little bit more complex, and the fact that our libraries kind of deal with most low-level problems for us doesn’t mean there are no reasons to fail.
As I’m reading “Release It!” by Michael T. Nygard, I keep nodding in agreement: Been there, done this, suffered that. I’ve just started, but it’s already shown quite a few interesting examples of failure and error handling.
Michael describes a spectacular outage of an airline software. Its experienced designers expected many kinds of failures and avoided many obvious issues. There was a nice layered architecture, with proper redundancy on every level from clients and terminals, through servers, through database. All was well, yet on a routine maintenance in database the entire system just hung. It did not kill anyone, but delayed flights and serious financial losses have an impact too.
The root cause turned out to be one swallowed exception on servers talking to the database, thrown by JDBC driver when the virtual IP of the database server was remapped. If you don’t have proper handling for such situations, one such leakage can lock the entire server as all of its threads wait for the connection or for each other. Since there were no proper timeouts anywhere in the server or above, eventually everything hung.
Now it’s easy to say: It’s obvious, thou shalt not swallow exceptions, you moron, and walk on. Or is it?
The thing is, an unexpected or improperly handled error can always happen. In hardware. Or a third party component. Or core library of your programming language. Or even you or your colleague can screw up and fail to predict something. It. Just. Happens.
Let’s take a look at two examples from real life.
Everyone gets in the car thinking: I’m an awesome driver, accidents happen but not to me. Yet somehow we are grateful for having airbags, carefully designed crumple zones, and all kinds of automatic systems that prevent or mitigate effects of accidents.
If you were offered two cars at the same cost, which would you choose? One is in pimp-my-ride style with extremely comfortable seats, sat TV, bright pink wheels and whatever unessential features. But it breaks down every so often based on its mood or the moon cycle, and would certainly kill you if you hit a hedgehog. The other is just comfortable enough, completely boring, no cool features to show off at all. But it will serve you 500,000 kilometers without a single breakdown and save your life when you hit a tree. Obvious, right?
Another example. My brother-in-law happens to be a construction manager at a pretty big power plant. He recently took me on a trip and explained some basics on how it works, and one thing really struck me.
The power station consists of a dozen separate generation units and is designed to survive all kinds of failures. I was impressed, and still am, that in power plant business it’s normal to say stuff like: If this block goes dark, this and this happens, that one takes over, whatever. No big deal. Let’s put it in a perspective. A damn complicated piece of engineering that can detect any potentially dangerous conditions, alarm, shut down and fail over just like that. From small and trivial things like variations in pressure or temperature, through conditions that could blow the whole thing up. And it is so reliable that when people talk about such conditions, rare and severe as they are, they say it in the same tone as “in case of rain the picnic will be held at Ms. Johnson’s”.
In his “After the Disaster” post, Uncle Bob asked: “How many times per day do you put your life in the hands of an ‘if’ statement written by some twenty-two year old at three in the morning, while strung out on vodka and redbull?”
I wish it was a rhetorical question.
We are pressed hard to focus on adding shiny new features, as fast as possible. That’s what makes our bosses and their bosses shine and what brings money to the table. But not only them, even we (the developers) naturally take most pride in all those features and find them the most exciting part of our work.
Remember that we’re here to serve. While pumping out features is fun, remember that those people simply rely on you. Even if you don’t directly cause death or injury, your outages can still affects lives. Think more like a car or power station designer, your position is really closer to theirs than to a lone hippie who’s building a little wobbly shack for himself.
When an outage happens and also causes financial loss, you will be to blame. If that reasoning does not work, do it for yourself – pay attention now to avoid pain in future, be it regular panic calls at 3 AM or your boss yelling at you.
Michael T. Nygard ends that airline example with a very valuable advice. Obvious as it may seem, it feels different if you realize it and engrave it deep in your mind. Expect failure everywhere, and plan for it. Even if your tools handle some failures, they can’t do everything for you. Even if you have at least two of each thing (no single point of failure), you can still suffer from bad design. Be paranoid. Place crumple zones on every integration point with other systems, and even different components of your system, in order to prevent cracks from propagating. Optimistic monoliths fail hard.
Want something more concrete? Go read “Release It!”, it’s full of great and concrete examples. There’s a reason why it fits in a book and not in a blog post.