We may choose to use a queue to link parts of our process together. There are a few reasons for that:
- Delay non-urgent processing until later
- Attempt to guarantee the execution of an action that may fail
- Scale the processing of a source of work by having multiple executors consuming the queue
- Decouple an action from the event that triggers it
There are many more reasons why we may also do this, and queues are a standard part of many systems. Turning key things in the system into events and processing events is a standard pattern.
With SQS as the queue of choice, in this article, I’d like to cover some of the challenges raised by using this pattern. We have to design to handle them.
- Error handling
- Dead letter queue and replay
The biggest design consideration with queues is the contract regarding guaranteed delivery.
We would love guaranteed delivery to mean Exactly once and in perfect order but that’s seldom the actual guarantee. The guarantee of perfect order would slow the system down to a single thread. The guarantee of exactly once is very hard to guarantee too without some very strong de-duplication methods.
Where we need perfect order we can use a single FIFO queue. Where we need perfect order by group (i.e. play user transactions in their correct order, but don’t worry about order between users) then we can use a FIFO queue with multiple message groups.
Where we don’t need perfect order, we can use the simplest form of queue and expect that generally later events happen after earlier ones… generally.
However, the exactly once delivery is still hard to get right. With SQS we have at least once delivery, with a side order of probably once.
So we have to decide what to do about potential duplicates:
- Not care – allow an event/action to happen a couple of times over every so often
- Idempotence – ensure that the second time an action happens, it just reinforces the first – this is fine when the action is some sort of data storing command
- Explicit down-stream deduplication techniques – maybe looking at a message origin ID (a guid in the message) and deduping according to that ID to avoid repeating an action
One thing that can exacerbate duplication is the retrying of apparently failed message processing. We really cannot lose a message while processing, which means we need to ensure it is really processed. This can be achieved with two levels of retries:
- Retry a failed operation as it’s happening – using frameworks like Spring Retry or Resilience4j
- Ensure a message that wasn’t successfully processed isn’t deleted from the queue so it retries when it becomes visible again
In some ways, having a retry mechanism built in to queues is another advantage of using them.
For efficiency we may prefer to read from the queue in batches. We may also prefer to write in batches. This comes with some potential down sides.
When writing in batches, there’s a risk of unflushed messages sitting in the batch waiting to be sent, and then failing to be sent long after the producer of the message has moved on to something else. We may prefer to design this out, but this can add complexity to our systems.
When processing a whole batch of messages, there’s a risk that the failure of some messages in the batch can affect the other messages, potentially causing excessive retries, or false-positive failures. This is exacerbated if there’s a “poison pill” message.
When processing in batches, we may need to be more brutal with certain sorts of error.
In three previous projects, I’ve either created, discovered or co-created the following error handling classification:
- Transient error – the error happened this time, but there’s no guarantee that this message will fail next time
- Fatal error – under no circumstances could this message ever be processed
If we have that classification, then we can bin messages that are simply impossible. This can be good for the whole batch they live in, and it can also avoid queues getting stuck owing to broken messages.
However, one then must ask – where do messages go when they die?
Dead letter queue and replay
When a message goes wrong, it can either explicitly or automatically be sent to a dead letter queue. What then?
There’s an argument for explicitly sending messages to the dead letter queue, tagged with the error that their reason for either failing “fatally” or for failing through too many retries. In this latter case, we’d allow automatic replay of the message by SQS up to nearly the maximum number of attempts, but then recognise that we’re about to process the last failure of the message and explicitly move it to the dead letter queue, tagged with error metadata, instead of letting SQS do it implicitly.
In the case of manually moving a fatal message to the dead letter queue, we can tag it with the reason.
However, once the message is in the dead letter queue, what do we do with it? We can have alarms showing that this has happened, but where does the message go next?
One option is to build a queue copier.
The queue copier is essentially a replay tool that can take the dead letter messages and reattempt them on the original queue.
Alternatively, we might download from all our dead letter queues and drop the messages into somewhere longterm, like S3. Perhaps we’d also build a mechanism to retransmit the messages to the original queue again.
The problem with a system built around guaranteed delivery is that you start feeling very guilty when some messages just refuse to be processed!