Handling “Failure at scale” in Azure Functions triggered by IoT Hub.

Or how to tackle exceptions and re-run failed messages.

Stas (Stanislav) Lebedenko · ITNEXT

One of the benefits of a serverless solution is “performance at scale.” The flip side is that you can get “failure at scale” when something goes wrong. Thus it’s crucial to introduce error handling early in your Azure Functions project.

TL;DR: I will explain how to handle exceptions and errors in Azure Functions triggered by Azure IoT Hub, how to set up proper notifications, and how to re-run failed messages.

The problem

With serverless, things can get really messy, really fast. In a matter of minutes, 5,000 messages can trigger an equal number of exceptions. In my case, it was a PoC project receiving an unexpected message format from IoT devices, and the project’s primary need was to develop in an agile way and introduce changes on the fly.

Besides essential error handling, I needed to be notified and be able to re-run erroneous messages on demand.

The solution setup:

  • An Azure Storage Queue.
  • An Azure Function with an input trigger for the IoT Hub (Event Hub) messages endpoint and an output binding to the Storage Queue.
  • A notification alert triggered by the queue transaction count over a 30-minute window.
  • An Azure Function to replay the messages stored in the Storage Queue.

This solution proved to be robust enough for production usage, so I decided to share it with the online community.

The dead-letter queue or the poison queue?

Terminology matters. So what’s the difference between a “dead-letter queue” and a “poison queue”? Let’s take two quotes from the Microsoft docs.

The purpose of the dead-letter queue (DLQ) is to hold messages that cannot be delivered to any receiver, or messages that could not be processed. (From the Azure Service Bus queues documentation.)

And regarding the poison queue.

A poison message is a message that has exceeded the maximum number of delivery attempts to the application. This situation can arise when a queue-based application cannot process a message because of errors. To meet reliability demands, a queued application receives messages under a transaction. (From the Azure Storage Queue documentation.)

Unfortunately, neither concept applies out of the box to Functions triggered by IoT Hub or Event Hubs: the Event Hub-compatible endpoint is a streaming log, not a queue, so there is no built-in dead-letter or poison queue, and a custom solution is needed.

Error handling via Azure Storage Queue.

There are two rival queue services in Azure: Azure Storage Queue and Service Bus Queue. Microsoft publishes an in-depth comparison of the two that is worth reading before you choose.

To keep things simple, I picked the Azure Storage Queue as a good-enough solution. It is a lightweight service for storing large numbers of messages, easy to use via an Azure Functions output binding or plain HTTPS, and it provides an extra failsafe in the form of a built-in poison queue.

Please note: Storage Queue messages cannot exceed 64 KB, while Azure IoT Hub device-to-cloud messages can be up to 256 KB. The default message time-to-live is 7 days, but you can change this setting.

Here are a few action steps.

  • Create an Azure Storage account and queue via the Azure CLI.
  • Add an output binding to the IoT Hub-triggered Azure Function.
  • Create an Azure Function with a Storage Queue trigger and keep it disabled.
  • Create an alert to notify you about new messages in the Storage Queue.

Azure infrastructure and Functions.

So, let’s start with the Azure Storage Queue, set up via the Azure CLI. I selected a general-purpose v2 account on the hot tier with Standard_LRS, because the dead-letter queue should be empty most of the time.
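
The original CLI snippet is not embedded here, so below is a minimal sketch of what the setup could look like; the resource group, account, and queue names are placeholders, not the actual project values.

    # Placeholder names; adjust the resource group, account and queue to your project.
    az group create --name iot-poc-rg --location westeurope

    # General-purpose v2 account, hot access tier, locally redundant storage.
    az storage account create \
      --name iotpocerrors \
      --resource-group iot-poc-rg \
      --kind StorageV2 \
      --sku Standard_LRS \
      --access-tier Hot

    # The queue that will hold failed IoT Hub messages.
    az storage queue create \
      --name iot-errors \
      --account-name iotpocerrors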

You can view the messages in the Storage Queue via the Azure portal.

The next step is to adjust the Azure Function. The output binding differs from the MS docs sample because of the async code inside, so in case of an error you add the failed message to an IAsyncCollector instead of returning it from the function.

Function code with the IoT Hub trigger and an output binding to the Storage Queue.
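
Since the original gist is not included above, here is a minimal sketch of such a function. The queue name, connection setting names, and the stubbed processing logic are assumptions for illustration, not the exact project code.

    using System;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.Azure.EventHubs;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;
    using Newtonsoft.Json.Linq;
    // The IoT Hub trigger is the Event Hub trigger under an alias, as in the VS template.
    using IoTHubTrigger = Microsoft.Azure.WebJobs.EventHubTriggerAttribute;

    public static class IotHubMessageProcessor
    {
        [FunctionName("IotHubMessageProcessor")]
        public static async Task Run(
            // Reads device-to-cloud messages from the built-in Event Hub-compatible endpoint.
            [IoTHubTrigger("messages/events", Connection = "IoTHubConnection")] EventData message,
            // Output binding: failed messages are written to the Storage Queue.
            [Queue("iot-errors", Connection = "StorageConnection")] IAsyncCollector<string> errorQueue,
            ILogger log)
        {
            var payload = Encoding.UTF8.GetString(message.Body.Array);

            try
            {
                // Stand-in for the real parsing and processing logic.
                var telemetry = JObject.Parse(payload);
                await ProcessAsync(telemetry);
            }
            catch (Exception ex)
            {
                log.LogError(ex, "Processing failed, storing the raw message for replay");
                // Because the method is async, the message is added to the
                // IAsyncCollector rather than returned via an out parameter.
                await errorQueue.AddAsync(payload);
            }
        }

        // Hypothetical processing stub.
        private static Task ProcessAsync(JObject telemetry) => Task.CompletedTask;
    }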

And the replay function with an Azure Storage Queue trigger.

Function code with the Azure Storage Queue trigger.
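
Again, the gist is not embedded here; this sketch reuses the same assumed names and keeps the function switched off with the [Disable] attribute until a replay is actually needed (disabling it in the portal works as well).

    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;
    using Newtonsoft.Json.Linq;

    public static class ErrorQueueReplay
    {
        [FunctionName("ErrorQueueReplay")]
        [Disable] // keep the replay function off; enable it on demand
        public static async Task Run(
            [QueueTrigger("iot-errors", Connection = "StorageConnection")] string failedMessage,
            ILogger log)
        {
            log.LogInformation("Replaying failed message: {message}", failedMessage);

            // Re-use the same processing logic as the IoT Hub-triggered function.
            var telemetry = JObject.Parse(failedMessage);
            await ProcessAsync(telemetry);
        }

        // Hypothetical processing stub.
        private static Task ProcessAsync(JObject telemetry) => Task.CompletedTask;
    }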

If a function triggered by a QueueTrigger fails, the Azure Functions runtime automatically retries that function five times for that specific queue message. If the message still fails every time, it is moved to a “poison” queue whose name is the original queue name plus a “-poison” suffix.
https://<storage account>.queue.core.windows.net/<queue>

Notification setup

The storage alert is configured with the same steps as any other Azure Monitor metric alert.

  • Open the Storage Account and check the Diagnostic settings; enable minute-level metrics if you need a faster alert response.
  • Select the Alerts section and add a new alert rule.
  • Create a condition for a transaction count greater than zero over a 30-minute window.
  • Add an action group with email or any other notification type.

Screenshots: the diagnostic settings of the Storage Account, the Alerts section with the Transactions metric selected, the action group with an email subscription, and the configured alert.
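
If you prefer scripting over clicking through the portal, roughly the same alert can be created via the Azure CLI. The alert name, action group name, and the 15-minute evaluation frequency below are assumptions of this sketch.

    # Fire when any transactions hit the storage account within a 30-minute window.
    storage_id=$(az storage account show \
      --name iotpocerrors \
      --resource-group iot-poc-rg \
      --query id --output tsv)

    az monitor metrics alert create \
      --name failed-iot-messages \
      --resource-group iot-poc-rg \
      --scopes "$storage_id" \
      --condition "total Transactions > 0" \
      --window-size 30m \
      --evaluation-frequency 15m \
      --action my-action-group \
      --description "New messages landed in the error queue"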

It’s a nice touch that Azure shows an estimated monthly cost for the alert. You can also use this setup to debug complicated issues with Azure Functions; for example, getting the payload of a failed POST request out of Application Insights can be tricky. I hope you will find this solution useful.

That’s it, thanks for reading. Cheers!
