Auto-Heal Azure VM Scale Set Instances

Published in ITNEXT, Jul 10, 2020

In a recent engagement, I was asked to find a way to auto-heal instances in a scale set should the application become unhealthy. Having done quite a bit of Kubernetes lately, I was already familiar with the concept of auto-healing; I just had to find a solution for Scale Sets. Luckily, Microsoft introduced a new feature in April 2020 called VMSS Automatic Instance Repairs.

Before getting into the implementation, let’s go over what auto-healing is about. An auto-healing solution is made up of two key components:

A probing mechanism to check the application health
A recovery mechanism to bring the application to a healthy state

Probing Mechanism

To probe an application, a health probe is defined to check the application at regular intervals. L4 and L7 probes are possible.

An L4 (layer 4) probe checks that the application’s TCP port is open. An open port, however, does not necessarily mean the application behind it is healthy.
An L7 (layer 7) probe checks a URL path over HTTP or HTTPS. The crudest option would be to request an index.html, but that would not validate whether the application is processing requests. The preferred option is an application health endpoint that validates multiple aspects of the application. The probe expects to receive an HTTP 200 when the application is healthy; any other result marks the instance unhealthy.
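To make the health endpoint idea concrete, here is a minimal sketch using only the Python standard library. The /health path, port 8080, and the two checks are placeholders I made up for illustration; a real application would plug in its own checks. The endpoint returns 200 only when every check passes, and 503 otherwise.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks():
    """Return a name -> bool mapping of dependency checks.

    These are hypothetical placeholders; a real application would test
    its database connection, message queue, disk space, and so on.
    """
    return {
        "database": True,    # placeholder result
        "disk_space": True,  # placeholder result
    }


def health_response(checks):
    """Map check results to an HTTP status code and a JSON body."""
    healthy = all(checks.values())
    status = 200 if healthy else 503
    body = json.dumps({"healthy": healthy, "checks": checks}).encode("utf-8")
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":  # hypothetical probe path
            self.send_error(404)
            return
        status, body = health_response(run_checks())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="0.0.0.0", port=8080):
    """Start the endpoint; binding to 0.0.0.0 listens on all interfaces."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint. Keeping the status decision in health_response, separate from the HTTP handler, makes it easy to unit test without opening a socket.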

Probing can be performed either by a load balancer health probe or by the VMSS Application Health Extension. The load balancer probe is the preferred option, for reasons explained under Limitations below.

Recovery Mechanism

In pre-cloud days, instances were managed as pets and an operator would attempt to bring the instance back online. This approach was labor-intensive and required different procedures for each application.

Instead, by taking a cattle approach, one can think of instances as disposable and replaceable. The VMSS approach terminates the unhealthy instance and spawns a new one. All scripts and extensions that ran on the original instance will run again when the new node is instantiated (the same process as autoscaling), so keep those steps short, as they add latency to the return to a healthy state.

Limitations

Grace Period: Whenever a scale set is modified (adding nodes, changing the configuration), the auto-healing process enters a grace period of at least 30 minutes before it can take healing actions. This gives the application a chance to stabilize. In Kubernetes parlance, it is a very crude readinessProbe. I suspect Microsoft will enhance this to match Kubernetes semantics.
Load Balancer HTTPS Probe: LB HTTPS probes do not support self-signed certs. Application Gateway probes do support those certificates, through a trusted root cert (AppGw v2) or auth certs (AppGw v1), but they are not a valid source of probes for automatic repairs (yet).
Application Health Extension: The extension runs within the VMSS instance and performs the health checks against localhost/127.0.0.1. Therefore, if your application is bound to a specific interface, the probe will not be able to reach the endpoint and will fail the instance. The fix is to have your application bind to 0.0.0.0.
Application Health Extension: Unlike the external load balancer probe, the probes managed by the extension do not let you set the probe interval or the number of failed probes before an instance is marked unhealthy. I expect Microsoft to enhance this to align with the external probe behavior.
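To illustrate what a "number of failed probes" setting buys you, here is a small sketch of the bookkeeping an external probe might do: an instance is only marked unhealthy after several consecutive failures, so a single slow response does not trigger a repair. The ProbeTracker class and its threshold are my own illustration, not Azure's implementation, and treating one successful probe as full recovery is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Track consecutive probe failures for a single instance."""

    unhealthy_threshold: int = 2  # assumed value, for illustration only
    consecutive_failures: int = 0
    healthy: bool = True

    def record(self, status_code):
        """Record one probe result and return the current health state.

        Pass None for a probe that timed out or could not connect.
        """
        if status_code == 200:
            # Assumption: a single success resets the failure count.
            self.consecutive_failures = 0
            self.healthy = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With a threshold of 3, two failures followed by a success leave the instance healthy; only three failures in a row flip it to unhealthy.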

Configuration

To configure auto-healing, enable it by adding this block under the properties section of the scale set in the ARM template:

"automaticRepairsPolicy": {
    "enabled": true,
    "gracePeriod": "PT30M"
}
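The gracePeriod value is an ISO 8601 duration, so PT30M means 30 minutes. As a rough sketch, a pre-deployment check could parse the value and compare it against the 30 minute minimum mentioned above. The helper names are mine, the parser only handles the simple PT#H#M#S form, and the minimum is a parameter because the allowed range may differ from what I describe here.

```python
import re

# Matches simple time-only ISO 8601 durations such as PT30M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def grace_period_minutes(value):
    """Convert a PT#H#M#S duration string to minutes."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return hours * 60 + minutes + seconds / 60


def check_grace_period(value, minimum_minutes=30):
    """Return True if the duration meets the assumed minimum."""
    return grace_period_minutes(value) >= minimum_minutes
```

For example, check_grace_period("PT30M") passes, while check_grace_period("PT10M") fails against the default minimum.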

Application Health Option

Under the extensionProfile section, add the configuration for the health extension:

"extensionProfile": {
    "extensions": [
        {
            "name": "[variables('applicationHealthExtensionName')]",
            "properties": {
                "autoUpgradeMinorVersion": true,
                "publisher": "Microsoft.ManagedServices",
                "type": "ApplicationHealthLinux",
                "typeHandlerVersion": "1.0",
                "settings": {
                    "protocol": "[parameters('healthProbeProtocol')]",
                    "port": "[parameters('healthProbePort')]"
                }
            }
        }
    ]
}

Load Balancer Option

For this option, create the load balancer separately and pass the probe resource ID into the networkProfile.healthProbe.id field of the scale set. This configuration is much simpler.

"networkProfile": {
    "healthProbe": {
        "id": "[variables('probeID')]"
    },

Resources

Microsoft documentation for VMSS Automatic instance repairs: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-instance-repairs