When Clouds Clear

Kristoffer Sjogren, September 10th, 2011 (Last Updated: October 21st, 2012)


When I first heard about Cloud Computing a few years back I was sceptical, not really paying attention to it and thinking that this was the next IT bubble just begging to pop. But as the somewhat hysterical hype rapidly increased in the industry, it was inevitable that I would eventually have to confront it and learn more about it. I admit to being hypnotised at first, but after some thinking and research a sobering relief settled in.

So, is the idea of Cloud Computing innovative, and if so, is it adding value in revolutionizing IT? Let’s analyze some of the fundamental arguments of Cloud Computing for a second.

On-demand rental

Pay-as-you-go service has the potential for businesses to catch the long tail, and it reduces the need for a large and risky up-front customer investment. But this is hardly something new. I could rent rack space for my silly little application back in university!

It also sounds strange to me that large corporations would suddenly put their business-critical data “out there somewhere” in the public. It’s a matter of trust: can Amazon convince you that they will honor their SLA and that disk I/O isn’t dodging your transactions? There are of course no guarantees; failures will happen, skies will fall and data will get lost. Knowing this, I don’t think banks would feel at ease sending their SWIFT messages through the public cloud, where a single message could contain a 10 million dollar transaction (or more).

Yet public renting has its value for volatile resources that are required and discarded on short notice, for example stateless compute-intensive operations, offline data processing or caches. It may even be suitable for testing whether your distributed algorithm scales, or as a load generator for simulating the Slashdot effect.

Multi-tenancy and elasticity

Both these concerns have a noticeable thing in common: resources are intended to be allocated (and de-allocated) on the fly without human intervention. In Cloud Computing, these resources are usually provided by massive data centers that can be seen as a virtualized pool of computing resources shared between heterogeneous applications.

But pooling of resources is also nothing new under the sun. I would probably risk my employment by designing a system that required manual intervention to expand and shrink JDBC connection pools along with load. Similarly, monitoring and HTTP have been around for quite a while, don’t you think?

Furthermore, not every application is suitable for multi-tenancy. There certainly are security implications and laws governing the handling of sensitive data, depending on the level in the architecture at which the cut is made.

The potential of Cloud Computing

I don’t see the idea of Cloud Computing as mind-blowingly innovative, neither economically nor technically. We have seen it all before: wolves in sheep’s clothing, a marketing hype campaign. But considering the momentum in the industry at the moment, Cloud Computing excites me because it has the potential to elevate the way we think about management of computers and software, if we steer it in the right direction.

Allow me to elaborate.

I find it surprising that systems today are still deployed over-dimensioned, often based on premature predictions about the business. This leads to inefficiencies and higher costs, and leaves no way of scaling down and reusing investments for future business opportunities.

No, it is about time we let go of the illusion that humans control computers, because we cannot; reliability would not be an issue if we *really* could. It is actually these rather bizarre beliefs that hold us back and make computers unreliable and inefficient.

However, I do believe this dependency can be broken and that computers can be set free if we are willing to raise the level of abstraction. To illustrate this I’m going to describe the general problem of managing computers by comparing two fundamentally different approaches, push and pull management.

Push

This is the traditional model where administrators command a network of computers to materialize a certain service. This of course implies that the administrator needs to understand the exact implementation details of how the service should work, and it also forces them to keep track of where the service runs and whether it reliably fulfils the SLA.

But as mentioned earlier, humans trying to control computers have implications for security and reliability…

The human factor.
Humans make mistakes and are especially bad at reproducing administrative changes, which means state will diverge until the point where system integrity is plagued with unintentional inconsistencies, making system behavior unpredictable and unreliable.
Keep in mind that a distributed enterprise system usually is a very sensitive and complex piece of work. Computers can turn small human mistakes into huge disasters and a lot of problems in the field are a direct consequence of human errors. Stress, timing, scale and complexity directly affect security and reliability.
There are many studies on this subject that acknowledge these problems, and the conclusions should be familiar to anyone with operational experience. This particular study is old, but the symptoms described are far too common, even to this date.
“Despite the huge contribution of operator error to service failure, operator error is almost completely overlooked in designing high-dependability systems and the tools used to monitor and control them. This oversight is particularly problematic because as our data shows, operator error is the most difficult component failure to mask through traditional techniques. Industry has paid a great deal of attention to the end-user experience, but has neglected tools and systems used by operators for configuration, monitoring, diagnosis, and repair.”
Source: Why do Internet services fail, and what can be done about it?
Computers are highly unreliable and unstable.
Computers are constantly bombarded by the harsh realities of the real world and bad things will happen in this environment. Overload, memory leaks, barrier or timing conditions, denial-of-service attacks, break-ins, outages, cascading failures, misconfiguration, network failures, viruses, malfunctioning hardware; there is an almost unlimited number of things that can and will go wrong.
Not only are we as humans a threat to computers, but the environment we put computers in makes them even more unreliable. Having a single authority responsible for the availability of many makes fault tolerance very difficult. Think about it: how do you manage something that isn’t there or doesn’t respond to your commands? Hmmm...
Lack of Integrity.
From a security perspective, command and control (the attempt to dominate) is suspiciously similar to an attack. Being commanded with no opportunity to disagree or disobey can therefore be perceived as a security breach that violates integrity. It does not matter whether the commands have good intentions; the behavior of the system can no longer be predicted.

Pull

The reverse of push management is pull management, where computers come and go as they please, offering particular services. An administrator will discover these services and be able to understand how they work because services are small and have self-describing contracts that describe their attributes.

Service contracts are also asynchronous, meaning that instructions are never handed over directly, but stored in an intermediary. Services pull from this intermediary and apply settings at their own convenience.

This forces services to be more intelligent, but at the same time it opens up some interesting possibilities.
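
To make the intermediary idea a bit more concrete, here is a minimal sketch in Python. The names `Intermediary`, `Service`, `publish` and `pull` are made up for illustration and do not refer to any real product; the versioning scheme is simply one assumed way for a service to tell whether it has something new to apply.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class Intermediary:
    """Hypothetical settings store sitting between administrator and services."""
    _settings: dict = field(default_factory=dict)
    _version: int = 0

    def publish(self, settings: dict) -> int:
        # The administrator only stores settings here; nothing is sent to a service.
        self._settings = dict(settings)
        self._version += 1
        return self._version

    def fetch(self) -> Tuple[int, dict]:
        return self._version, dict(self._settings)


@dataclass
class Service:
    name: str
    applied_version: int = 0
    settings: dict = field(default_factory=dict)

    def pull(self, store: Intermediary) -> bool:
        """Fetch and apply newer settings; return True if anything changed."""
        version, settings = store.fetch()
        if version <= self.applied_version:
            return False
        self.settings = settings
        self.applied_version = version
        return True


store = Intermediary()
cache = Service("cache")
store.publish({"max_connections": 50})
store.publish({"max_connections": 80})  # published while the service was away
changed = cache.pull(store)             # the service catches up when it is ready
```

Note that the administrator never addresses `cache` directly, so publishing works even if the service is down at that moment; it simply applies the latest settings on its next pull.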

Abstraction.
The administrator does not need to know how the actual service is implemented, since the published contracts hide those complex details. This makes management much simpler and more productive, and it can proceed even when computers are down.
Secure and Reliable.
Because information is pulled, the vulnerability surface can be reduced to an absolute minimum through hardening. Nobody should ever need to access the computers (other than through service interfaces), so SSH can actually be disabled!
Another advantage of having communication governed by a contract provided by the service itself is that cooperation gets a lot more reliable, since the administrator cannot violate the contract.
Efficient.
The number of computers that form a specific service can grow and shrink freely. Indeed, this elasticity can be based on very complex decentralized peer-to-peer algorithms.
Push management gets more complicated as the system grows in size, while the complexity of pull management stays fairly constant.
Self-Healing.
Because the service has been hardened so that only the service itself has access to itself, any anomaly the service notices about itself can be considered a security breach. When this happens the computer can perform repairs on itself, maybe even resetting itself to the last known good state.
Computers can even supervise each other, making sure that the service, as a whole, is highly available and meets the SLA. If members drop, new ones can be spawned/cloned, maybe even in a different geographical region if the failure was disastrous. Unresponsive or suspicious members can be subject to STONITH (“shoot the other node in the head”).
Self-configurable.
When the administrator makes new software available, services perform their own maintenance and upgrades by modifying themselves in a rolling fashion, while maintaining high availability.
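
As a rough illustration of upgrading “in a rolling fashion”, the sketch below simulates a group of members that upgrade one at a time and refuse to proceed if doing so would leave too few peers serving. The function name `rolling_upgrade`, the `min_available` threshold and the dictionary layout are assumptions made for this example; a real self-managing service would coordinate this among its members rather than in one loop.

```python
def rolling_upgrade(members, new_version, min_available):
    """Upgrade members one at a time, never dropping below min_available.

    members: dict of member name -> {"version": str, "up": bool}
    Returns the member names in the order they were upgraded.
    """
    upgraded = []
    for name, state in members.items():
        if state["version"] == new_version:
            continue  # already on the target version
        available = sum(1 for m in members.values() if m["up"])
        # Taking this member down must still leave enough peers serving.
        if available - (1 if state["up"] else 0) < min_available:
            raise RuntimeError(f"cannot upgrade {name}: availability too low")
        state["up"] = False              # drain and stop
        state["version"] = new_version   # install the new software
        state["up"] = True               # restart on the new version
        upgraded.append(name)
    return upgraded


members = {
    "a": {"version": "1.0", "up": True},
    "b": {"version": "1.0", "up": True},
    "c": {"version": "1.0", "up": True},
}
order = rolling_upgrade(members, "1.1", min_available=2)
```

With three members and `min_available=2` the upgrade goes through one member at a time; asking for `min_available=3` would raise instead, because no member could be taken down without violating the availability target.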

A very fundamental and important difference to understand between push and pull management is that pull does not convey commands, but a “destination”. This means that instead of pushing the exact procedures for reaching an end-state, we simply declare the end-state and let the service figure out the procedures for getting there by itself!
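
The difference between conveying commands and conveying a destination can be sketched in a few lines of Python. Everything here is hypothetical (the `plan` and `converge` functions and the setting names are invented for the example): the administrator only declares `desired`, and the service works out for itself which steps lead there from its `current` state.

```python
def plan(current: dict, desired: dict) -> list:
    """Derive the steps that move `current` to `desired`."""
    steps = []
    for key in sorted(desired):
        if key not in current:
            steps.append(("add", key, desired[key]))
        elif current[key] != desired[key]:
            steps.append(("change", key, desired[key]))
    for key in sorted(current):
        if key not in desired:
            steps.append(("remove", key, None))
    return steps


def converge(current: dict, desired: dict) -> dict:
    """Apply the planned steps and return the resulting state."""
    state = dict(current)
    for action, key, value in plan(current, desired):
        if action == "remove":
            del state[key]
        else:
            state[key] = value
    return state


current = {"replicas": 2, "version": "1.0", "debug": True}
desired = {"replicas": 4, "version": "1.0", "tls": True}
steps = plan(current, desired)
new_state = converge(current, desired)
```

A useful property of this style is that it is idempotent: once the service has reached the declared end-state, planning again yields no steps, so the same declaration can be pulled and applied over and over without harm.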

I think it is great to discover that all this has a lot in common with Configuration Management, a subject very dear to my heart!

The DevOps and Continuous Delivery movements do a fantastic job of helping developers realize how computers and software work in their natural environment, production. Eat your own dog food, so to speak.

However, I believe we can do more, simplify, improve. Humans should not do work that computers can do better! I understand that letting go of control to self-managing software hits a mental barrier. But what are we afraid of? I’m pretty sure that you trust the software that guides pilots to fly your airplane safely (unless you are like me, shit scared of flying)?

We should not ignore self-managing software – it has far too much potential to make our lives incredibly secure, reliable and efficient.

And no matter how the future turns out, I believe that Configuration Management always will be a critical cornerstone in any successful attempt to grow a dynamic service-oriented ecosystem living in symbiosis, where existing services are easily composed into new services ready to harvest unforeseen business opportunities!

Reference: When Clouds Clear from our JCG partner Kristoffer Sjögren at the Deep Hacks blog.
