How should a system respond when under sustained load? Should it keep accepting requests until its response times follow the deadly hockey stick, followed by a crash? All too often this is what happens unless a system is designed to cope with the case of more requests arriving than is it capable of processing. If we are seeing a sustained arrival rate of requests, greater than our system is capable of processing, then something has to give. Having the entire system degrade is not the ideal service we want to give our customers. A better approach would be to process transactions at our systems maximum possible throughput rate, while maintaining a good response time, and rejecting requests above this arrival rate.
Let’s consider a small art gallery as an metaphor. In this gallery the typical viewer spends on average 20 minutes browsing, and the gallery can hold a maximum of 30 viewers. If more than 30 viewers occupy the gallery at the same time then customers become unhappy because they cannot have a clear view of the paintings. If this happens they are unlikely to purchase or return. To keep our viewers happy it is better to recommend that some viewers visit the café a few doors down and come back when the gallery is less busy. This way the viewers in the gallery get to see all the paintings without other viewers in the way, and in the meantime those we cannot accommodate enjoy a coffee. If we apply Little’s Law we cannot have customers arriving at more than 90 per hour, otherwise the maximum capacity is exceeded. If between 9:00-10:00 they are arriving at 100 per hour, then I’m sure the café down the road will appreciate the extra 10 customers.
Within our systems the available capacity is generally a function of the size of our thread pools and time to process individual transactions. These thread pools are usually fronted by queues to handle bursts of traffic above our maximum arrival rate. If the queues are unbounded, and we have a sustained arrival rate above the maximum capacity, then the queues will grow unchecked. As the queues grow they increasingly add latency beyond acceptable response times, and eventually they will consume all memory causing our systems to fail. Would it not be better to send the overflow of requests to the café while still serving everyone else at the maximum possible rate? We can do this by designing our systems to apply “Back Pressure”.
Separation of concerns encourages good systems design at all levels. I like to layer a design so that the gateways to third parties are separated from the main transaction services. This can be achieved by having gateways responsible for protocol translation and border security only. A typical gateway could be a web container running Servlets. Gateways accept customer requests, apply appropriate security, and translate the channel protocols for forwarding to the transaction service hosting the domain model. The transaction service may use a durable store if transactions need to be preserved. For example, the state of a chat server domain model may not require preservation, whereas a model for financial transactions must be kept for many years for compliance and business reasons.
Figure 1. above is a simplified view of the typical request flow in many systems. Pools of threads in a gateway accept user requests and forward them to a transaction service. Let’s assume we have asynchronous transaction services fronted by an input and output queues, or similar FIFO structures. If we want the system to meet a response time quality-of-service (QOS) guarantee, then we need to consider the three following variables:
- The time taken for individual transactions on a thread
- The number of threads in a pool that can execute transactions in parallel
- The length of the input queue to set the maximum acceptable latency
max latency = (transaction time / number of threads) * queue length
queue length = max latency / (transaction time / number of threads)
By allowing the queue to be unbounded the latency will continue to increase. So if we want to set a maximum response time then we need to limit the queue length.
By bounding the input queue we block the thread receiving network packets which will apply back pressure up stream. If the network protocol is TCP, similar back pressure is applied via the filling of network buffers, on the sender. This process can repeat all the way back via the gateway to the customer. For each service we need to configure the queues so that they do their part in achieving the required quality-of-service for the end-to-end customer experience.
One of the biggest wins I often find is to improve the time taken to process individual transaction latency. This helps in the best and worst case scenarios.
Worst Case Scenario
Let’s say the queue is unbounded and the system is under sustained heavy load. Things can begin to go wrong very quickly in subtle ways before memory is exhausted. What do you think will happen when the queue is larger than the processor cache? The consumer threads will be suffering cache misses just at the time when they are struggling to keep up, thus compounding the problem. This can cause a system to get into trouble very quickly and eventually crash. Under Linux this is particularly nasty because malloc, or one of its friends, will succeed because Linux allows “ Over Commit” by default, then later at the point of using that memory, the OOM Killer will start shooting processes. When the OS starts shooting processes, you just know things are not going to end well!
What About Synchronous Designs?
You may say that with synchronous designs there are no queues. Well not such obvious ones. If you have a thread pool then it will have a lock, or semaphore, wait queues to assign threads. If you are crazy enough to allocate a new thread on every request, then once you are over the huge cost of thread creation, your thread is in the run queue for a processor to execute. Also, these queues involve context switches and condition variables which greatly increase the costs. You just cannot run away from queues, they are everywhere! Best to embrace them and design for the quality-of-service your system needs to deliver to its customers. If we must have queues, then design for them, and maybe choose some nice lock-free ones with great performance.
When we need to support synchronous protocols like REST then use back pressure, signalled by our full incoming queue at the gateway, to send a meaningful “server busy” message such as the HTTP 503 status code. The customer can then interpret this as time for a coffee and cake at the café down the road.
Subtleties To Watch Out For…
You need to consider the whole end-to-end service. What if a client is very slow at consuming data from your system? It could tie up a thread in the gateway taking it out of action. Now you have less threads working the queue so the response time will be increasing. Queues and threads need to be monitored, and appropriate action needs to be taken when thresholds are crossed. For example, when a queue is 70% full, maybe an alert should be raised so an investigation can take place? Also, transaction times need to be sampled to ensure they are in the expected range.
If we do not consider how our systems will behave when under heavy load then they will most likely seriously degrade at best, and at worst crash. When they crash this way, we get to find out if there are any really evil data corruption bugs lurking in those dark places. Applying back pressure is one effective technique for coping with sustained high-load, such that maximum throughput can be delivered without degrading system performance for the already accepted requests and transactions.