Automating Oracle DB client failover using Azure Function App

Published in

ITNEXT

9 min readSep 15, 2022

Introduction and Problem Statement

Oracle Data Guard is one of the ways you can ensure High Availability of the oracle databases. Oracle Data Guard maintains the standby databases as copies of the primary database. Then, if the primary database becomes unavailable because of a planned or unplanned outage, Oracle Data Guard can switch any standby database to the Primary role, minimizing the downtime associated with the outage.

Since this blog is not about Oracle Data Guard, if you want to learn more please refer to the documentation for more info.

To understand the concept at a high level, you can refer to the GIF below which depicts a simple Oracle database High Availability configuration using Oracle Data Guard.

Figure1 Oracle Database Failover using Data Guard

Few takeaways from this GIF in below few points, so that it makes more sense :

Availability Zone 1 has an AKS instance that connects to an Oracle database.
Oracle Database High availability is ensured by Oracle Data Guard.
This is a simple configuration where the Oracle database has two instances (on separate Azure Virtual Machines), one being the primary database instance (in Availability Zone 1) and the other being the standby database instance(in Availability Zone 2).
There is a third Virtual Machine that has the oracle client software installed and assumes the role of FSFO observer for the Oracle Data Guard configuration. The observer is required only if Fast Start Failover is configured
All Database clients (AKS Pods or other clients) use db.mydomain.local CNAME (in Private DNS Zone) in their connection strings and dbservice as the service name.
db.mydomain.local CNAME must always resolve to the Virtual Machine FQDN where Primary Instance is running.
Fast Start Failover is configured at Oracle Data Guard, so when Primary Database becomes unavailable because of any reason (vm1.mydomain.local crashes in this case), Database fails over to standby Database Instance and Data Guard converts Standby Instance to Primary Instance.
All old client connections got disconnected. Now Client failover must take place so that clients can start connecting to the database again. And this is the problem statement for this article.

Oracle’s solutions for client failover

Oracle provides many solutions for client failover and details are in the documentation.

Configuring Fast Connection Failover for JDBC Clients (updates to application config, needs ONS)
Configuring Application Continuity (Oracle Clusterware is a requirement, extra licensing is needed)
Configuring Fast Connection Failover for OCI Clients (Oracle Clusterware or Oracle restart, extra licensing needed)
Configuring Oracle RAC Databases for Failover (Oracle RAC must be installed, extra licensing needed)
Configuring the Oracle Data Guard Environment (Oracle Clusterware must be installed, extra licensing needed)

These options either need you to install ONS, install Oracle RAC, Oracle Clusterware, or Oracle restart (Oracle Clusterware on a single node) or configure certain parameters on the application side. Oracle RAC solves most of the problems related to High Availability but it costs extra plus you need to maintain another Oracle Home and for setup such as you saw above, it is not an ideal solution.

You don’t want to depend on application configuration for client failover nor do you want to add additional components to the stack.

Using CNAME switchover for client failover

In Figure 1, you noticed that clients used the CNAME in the database connection string. So once the Failover happens, CNAME db.mydomain.local must point to the FQDN of the Virtual machine hosting the new Primary Instance (Old Standby) i.e. vm2.mydomain.local, so that clients can start connecting to the database again.

In Azure, we can use the command below to do this.

az account set --subscription "Subscription_of_PRIVATE_DNS_ZONE"
az network private-dns record-set cname set-record -g ${v_resourceGroup_pz} -z ${v_DnsZone}  -n ${v_primaryCname} -c "${v_standbyNode}.${v_DnsZone}";

Besides that “dbservice” database service must start on the new primary database. We can create a trigger to start “dbservice” whenever the database instance starts with the database role as ‘PRIMARY’.

CREATE OR REPLACE TRIGGER startDgServices after startup on database
DECLARE  
db_role VARCHAR(30);  
db_open_mode VARCHAR(30);
BEGIN  
    SELECT DATABASE_ROLE, OPEN_MODE INTO db_role, db_open_mode FROM V$DATABASE;
    IF db_role = 'PRIMARY' THEN 
      DBMS_SERVICE.START_SERVICE('DBSERVICE');
    END IF;
END;
/

Once the Failover will occur service DBSERVICE will start automatically when the database instance starts as “PRIMARY”.

This approach is OK but it has a major drawback, it requires you to manually update CNAME in the Private DNS Zone.

You solve this by automating the CNAME update whenever a Database switchover or failover occurs.

Automating CNAME update using Azure Function App

You will create a function app that executes the PowerShell cmdlets to update CNAME in the Private DNS Zone. Function App will be triggered using HTTP trigger from the VM where Primary Instance runs when CNAME points to an FQDN not same as the Hosting VM. You can use a script to execute the HTTP trigger using curl. This is discussed in detail in the “Create a trigger Script” section.

Create Function App

Select appropriate values for the function app

Pick Appropriate Plan, you can pick “functions premium” which supports private endpoints, Virtual Network integration, custom domain, and Traffic manager

Navigate to Identity, for system assigned identity Toggle the status to on, and click on Save.
Navigate to “functions” and click on create

Select “HTTP Trigger” as the template. Specify an appropriate name for the function and click on create.

Click on “code + Test”, replace the existing code for run.ps1 with the below piece of code and click on “Save”

using namespace System.Net# Input bindings are passed in via param block.
param($Request, $TriggerMetadata)# Write to the Azure Functions log stream.
Write-Host "PowerShell HTTP trigger function processed a request."#Set the Variables
$subscriptionId =$env:platformSubscriptionId
$tenantId = $env:tenantId
$dbname = $env:dbnamedgbppr
$dnsrgname = $env:dnsrgname
$dnszone = $env:dnszone#Switch Subscription
Set-AzContext -Subscription $subscriptionId  -Tenant $tenantId
# Interact with query parameters or the body of the request.
$name = $Request.Query.Name
$newprimcname = $Request.Body.Newprimcnameif (-not $name) {
    $name = $Request.Body.Name
}if ($name) {
    Write-Host "Running the DNS switchover for PPR Digitalbank Database"
    $rs = Get-AzPrivateDnsRecordSet -name "$dbname-prim" -RecordType CNAME -ZoneName "$dnszone" -ResourceGroupName "$dnsrgname"
    Remove-AzPrivateDnsRecordConfig -RecordSet $rs -Cname $rs.Records.cname | Set-AzPrivateDnsRecordSet
    Get-AzPrivateDnsRecordSet -name "$dbname-prim" -RecordType CNAME -ZoneName  "$dnszone" -ResourceGroupName "$dnsrgname"| Add-AzPrivateDnsRecordConfig -Cname $newprimcname | Set-AzPrivateDnsRecordSet
    $body1 = "Hello, $name. This HTTP triggered function executed successfully. CNAME $dbname-prim now  switched \n"
}# Associate values to output bindings by calling 'Push-OutputBinding'.
Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode = [HttpStatusCode]::OK
    Body = $body1 
})

Navigate to “configuration” , under application settings add the env variables “platformSubscriptionId”, ”tenantId”, “dbnamedgbppr“, “dnsrgname” ,”dnszone”
Navigate to App Service Editor and modify requirements.psd1 to match below

@{
    # For the latest supported version, go to 'https://www.powershellgallery.com/packages/Az'. 
    # To use the Az module in your function app, please uncomment the line below.
    # 'Az' = '8.*'
    'Az' = '7.*'
}

Authorizing Function App to update Private DNS Zone

Navigate to your Private DNS Zone on the Azure portal
Navigate to IAM
Click Add and Add role assignment
Select “Private DNS Zone Contributor”
Assign access to Managed Identity and Select members

Click on Review and Assign

Create a trigger Script

You can write a trigger script using bash or Python or which language you like.
Script will run at certain intervals on the virtual machine hosting the Oracle Database Standby Instance.
The script should fetch the status of the database instance running on the VM and if the OPEN_MODE and database_role for the instance are found to be “READ WRITE“ and “PRIMARY” respectively then validate the DNS resolution for CNAME.
If DB CNAME (db.mydomain.local in the above example) points to the FQDN of the hosting VM then no change is required, else update the DB CNAME to match the FQDN of the hosting VM where the Database is running as Primary. This will be done by executing an HTTP trigger for the function App we created earlier.
Navigate to Your function App> functions> your_function_for_cname_switching on Azure Portal.
Copy the URL from “Get Function URL”

Figure 6

You can use curl utility to execute HTTP trigger for function App like below

curl -X POST <replace with the URL you copied above(including "code" part) >  \
        -H 'Content-Type: application/json' \
        -d '{"name":"dnscnameswitcher","newprimcname":"<FQDN of VM Hosting new Primary>"}'

Make sure to update the value for FQDN of the new Primary VM.
In our current Project, I used ansible to templatize this script, you can feel free to reach us for the code.
You can schedule the script to run on the standby node at intervals. You can decide on the intervals as per needs.

The above Diagram shows the execution flow of the script without authentication

Why do we run the script only on the Standby Node?

If we run the script on the primary and the standby nodes, in the case of split-brain where Primary is disconnected from Observer and Standby and failover takes place, both instances may become Primary at the same time and it may result in data loss.
If failover or switchover occurs, the script needs to be rescheduled on the new standby node. This can be achieved by ansible easily. It may require manual intervention.

Preliminary security measures

Create Private Endpoint

Navigate to networking, Add a private endpoint for your function app

This will assign a private IP to the function app from the subnet that you selected when you created the private endpoint
Now when you will make an HTTP call to the function, Private IP will be used and the request won’t be routed over the internet.

Add Access Restrictions

Navigate to networking, click on access restrictions
Click on Add Rule
Give an appropriate name to the rule, Action as “Allow”
The source should be a Virtual Network and subnet of which your VMs are part.
This will restrict all incoming traffic except the calls from the VMs which are part of the allowed subnet.

Add Authentication

Add the authentication provider

Navigate to “Authentication” in the function App
Add Identity Provider and Select “Microsoft”
Add Microsoft (AAD) identity provider with existing app registration details. Check the documentation for details.
Add the appid and secret of app registration, and also add this secret in a Key vault.
Add the appid in the “Allowed token audience” as well.
Restrict the access of the Function App to “require authentication”
Unauthenticated requests will receive “403 Unauthorized Request” error.

Changes in how you call the HTTP trigger

When using authentication, you can use curl utility to execute HTTP trigger for function App like below

# get secret for app registration used for authentication
az login --identity > /dev/null
client_secret_sp=`az keyvault secret show --vault-name <vault_name> -n <appregistration_secret_name_in_KV> --query "value" -o tsv# get access token to be used in curl to authenticate the client in function App
ACCESS_TOKEN=$(curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
https://login.microsoftonline.com/f1821f56-8d20-4902-bf65-bc7502e367e7/oauth2/token \
-d 'client_id=<AppID_of_appregistration>' \
-d 'grant_type=client_credentials' \
-d 'client_secret='$client_secret_sp \
-d 'resource=<AppID_of_appregistration>' | jq -r .access_token)#making the HTTP trigger request
 curl -X POST <replace with the URL you copied above(including "code" part) >  \
   -H 'Content-Type: application/json' \
   -d '{"name":"dnscnameswitcher","newprimcname":"<FQDN of VM Hosting new Primary>"}'\
   -H "Authorization: Bearer ${ACCESS_TOKEN}"

You can get the App registration secret from the Key vault where you saved it, make sure the access policy exists which allows the System Identities of the VMs hosting the database Instances to “get” secrets from the key vault.
The below diagram shows the execution flow of the script with authentication

At present, this allows any client application in your Azure AD tenant to request an access token and authenticate to the function app. If you also want to enforce authorization to allow only certain client applications, you must perform some additional configuration.

Use Key Vault

Store the function key in a Key vault
Store app registration secret in a Key vault
Don’t store any secret info in the script and retrieve all in the script.

Conclusion

This solution works fine in case of Database failover or switchover. Any suggestions for improvement are welcome.

Thanks to Emmanuel Eyoma for helping me in fixing some issues with the function app.

Please read my other articles as well and share your feedback. If you like the content shared please like, comment and subscribe for new articles.