PBX Platforms - Using Amazon AWS Route 53 for SIP Server Survivability via a Single FQDN
In redundant SIP server environments where the endpoints cannot define a Secondary/Backup SIP server, as is currently the case with Zulu 3 and may be the case with other endpoints (phones, softphones, gateways, etc.), it is convenient to have a DNS service that quickly and automatically switches the IP address a single FQDN resolves to. This makes automatic failover between the two redundant SIP servers possible.
This guide uses two PBXact v14 appliances with Zulu 3 as the endpoint, but the same approach can be applied to other SIP servers, such as SBCs, and to other endpoints, such as SIP softphones, that cannot define a Secondary/Backup SIP server.
Pre-Requisites
Each SIP server must be reachable through its own public IP address.
For automatic syncing of configurations and records (CDR, logs, etc.) between the two PBX servers, a Warm Spare backup and restore job should be properly configured first.
Limitations
Only one health check, that is, the method AWS Route 53 uses to determine server availability, can be set up at a time (i.e. only one port, either the AMI port 5038 or the Zulu port 8002, can be monitored to determine whether the PBX service is available). This leads to the following possible complications:
When the Asterisk process (TCP 5038) is monitored by AWS Route 53 to check PBX availability (recommended, as the Asterisk service tends to fail more often than the Zulu server):
If only Asterisk (TCP 5038) fails, but not Zulu (TCP 8002), AWS will switch the IP address the FQDN resolves to. However, Zulu will not fail over automatically, because it does not detect any problem registering with the current server. Restarting the Zulu application triggers the failover.
If only Zulu (TCP 8002) fails, but not Asterisk (TCP 5038), AWS will NOT switch the IP address the FQDN resolves to. Zulu will be disconnected, as expected. Troubleshooting is needed to determine why the Zulu server failed and to bring it back up.
When the Zulu process (TCP 8002) is monitored by AWS Route 53 to check PBX availability:
If only Asterisk (TCP 5038) fails, but not Zulu (TCP 8002), AWS will NOT switch the IP address the FQDN resolves to. Zulu will not be able to make or receive calls, as expected. Troubleshooting is needed to determine why Asterisk failed and to bring it back up.
If only Zulu (TCP 8002) fails, but not Asterisk (TCP 5038), AWS will switch the IP address the FQDN resolves to. Zulu will fail over automatically (Note: restarting the Zulu application may still be needed to authenticate correctly on the new server).
Zulu 3 currently offers no way to set its SIP Registration Expiry timer; it is fixed at 600 seconds (10 minutes). The application therefore has to be restarted once the main server is back online in order to fail back to it.
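The monitoring trade-off described in the limitations above can be summarised as a small decision table. The following is a minimal sketch in POSIX shell; the function name failover_outcome and its result strings are illustrative only and are not part of PBXact, Zulu, or AWS.

```shell
#!/bin/sh
# Hypothetical helper: models the limitation table above. Not a PBXact or AWS tool.
# Usage: failover_outcome <monitored_port> <asterisk_state> <zulu_state>
#   monitored_port: 5038 (Asterisk AMI) or 8002 (Zulu)
#   *_state:        up | down
failover_outcome() {
    monitored=$1 asterisk=$2 zulu=$3

    # Route 53 only sees the service behind the monitored port.
    case "$monitored" in
        5038) watched=$asterisk ;;
        8002) watched=$zulu ;;
        *) echo "unknown port: $monitored" >&2; return 2 ;;
    esac

    if [ "$watched" = "down" ]; then
        # The health check fails, so the FQDN is switched to the secondary server.
        echo "dns-switch"
    elif [ "$asterisk" = "down" ] || [ "$zulu" = "down" ]; then
        # The unmonitored service failed: no DNS switch, manual troubleshooting needed.
        echo "no-switch-troubleshoot"
    else
        echo "healthy"
    fi
}
```

For example, failover_outcome 5038 up down prints no-switch-troubleshoot, which corresponds to the case where only Zulu fails while Asterisk is the monitored service.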
Step-by-step guide
Configurations on Amazon AWS Route 53:
Register or Transfer a Domain
Registered domains > Register Domain
Once the Domain is registered, AWS Route 53 will automatically create a Hosted Zone for it.
Create the Health Check for AWS Route 53 to know when to deem a server as unhealthy and proceed to switch the IP address the Domain (FQDN) resolves to.
Health checks > Create health check
The "Request interval" and "Failure threshold" parameters inside the "Advanced configuration" section allow us to control how 'quickly' would a failing SIP server be detected, which in turn makes Amazon AWS Route 53 proceed to switch the IP address the Domain (FQDN) resolves to
Optionally, you can create an alarm to get notified via email when the server becomes unhealthy (ie. reaching TCP port 5038 is not possible)
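As a rough guide to how these two parameters interact, the detection time is approximately the request interval multiplied by the failure threshold (Route 53 accepts an interval of 10 or 30 seconds and a threshold of 1 to 10). The sketch below estimates that time and prints an equivalent health check definition for the AWS CLI. The function names and the 203.0.113.10 address are placeholders, and the estimate ignores DNS caching on the client side.

```shell
#!/bin/sh
# Rough estimate of how long Route 53 needs to declare an endpoint unhealthy:
# request interval (seconds) multiplied by the failure threshold.
# DNS caching (record TTL) adds to this and is not modelled here.
detection_seconds() {
    interval=$1 threshold=$2
    echo $((interval * threshold))
}

# Prints a HealthCheckConfig JSON document for a TCP health check.
# Arguments: <ip_address> <port> <request_interval> <failure_threshold>
health_check_config() {
    cat <<EOF
{
  "IPAddress": "$1",
  "Port": $2,
  "Type": "TCP",
  "RequestInterval": $3,
  "FailureThreshold": $4
}
EOF
}
```

With an interval of 30 seconds and a threshold of 3, detection_seconds 30 3 prints 90. The output of health_check_config 203.0.113.10 5038 30 3 could be saved to a file and passed to aws route53 create-health-check --caller-reference <unique-string> --health-check-config file://health-check.json as an alternative to the console steps.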
Assign your SIP servers, along with the Health check, to your Hosted Zone
Add the main SIP server associating it with the Health check previously created
Add the secondary SIP server. This one is not associated with any Health check.
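For reference, the two records can also be expressed as a Route 53 change batch. The sketch below assumes the records use the Failover routing policy (the guide does not name the policy explicitly); the function name, the set identifiers "main" and "secondary", and all argument values are placeholders.

```shell
#!/bin/sh
# Prints a Route 53 change batch with a PRIMARY and a SECONDARY failover A record.
# Only the PRIMARY record references the health check, as in the steps above.
# Arguments: <fqdn> <main_ip> <secondary_ip> <health_check_id> <ttl_seconds>
failover_change_batch() {
    cat <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "A",
        "SetIdentifier": "main",
        "Failover": "PRIMARY",
        "TTL": $5,
        "ResourceRecords": [{ "Value": "$2" }],
        "HealthCheckId": "$4"
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": $5,
        "ResourceRecords": [{ "Value": "$3" }]
      }
    }
  ]
}
EOF
}
```

The output could be saved to a file and applied with aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://failover.json. A short TTL keeps clients from caching the old address for long after a switch.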
Configurations on the main SIP server:
Open the firewall to the specific IP addresses Amazon AWS Route 53 uses for health checking the SIP server. On a PBXact server, this can be done by running the following command in the Linux console as the root user:
fwconsole firewall add trusted 15.177.0.0/18 54.183.255.128/26 54.228.16.0/26 54.232.40.64/26 54.241.32.64/26 54.243.31.192/26 54.244.52.192/26 54.245.168.0/26 54.248.220.0/26 54.250.253.192/26 54.251.31.128/26 54.252.79.128/26 54.252.254.192/26 54.255.254.192/26 107.23.255.0/26 176.34.159.192/26 177.71.207.128/26
For the current list of IP addresses Amazon AWS Route 53 uses for its Health checks, search the ip-ranges.json file for "ROUTE53_HEALTHCHECKS" (see "AWS IP address ranges" in the Amazon Virtual Private Cloud documentation). Then restart the firewall:
fwconsole firewall stop
fwconsole firewall start
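Because the published ranges change over time, the list for the command above can be regenerated from a downloaded copy of ip-ranges.json. This is a minimal sketch that assumes jq is installed; the function names are illustrative, and the generated command is only printed, not executed.

```shell
#!/bin/sh
# Prints the IPv4 prefixes tagged ROUTE53_HEALTHCHECKS from a local copy of
# ip-ranges.json, one per line. Assumes jq is installed.
# Argument: <path to ip-ranges.json>
route53_healthcheck_ranges() {
    jq -r '.prefixes[] | select(.service == "ROUTE53_HEALTHCHECKS") | .ip_prefix' "$1"
}

# Builds (but does not run) the fwconsole command for those prefixes,
# so it can be reviewed before being executed as root.
trusted_command() {
    printf 'fwconsole firewall add trusted'
    route53_healthcheck_ranges "$1" | while read -r prefix; do
        printf ' %s' "$prefix"
    done
    printf '\n'
}
```

A typical use would be to download the file with curl -s https://ip-ranges.amazonaws.com/ip-ranges.json -o ip-ranges.json and then run trusted_command ip-ranges.json to review the resulting command.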
Conclusion
The configuration is now complete. To test the failover, force the main SIP server to go down. On a PBXact server, for example, this can be done by running the following command in the Linux console as the root user:
fwconsole stop
After approximately 1-2 minutes (this can be changed by adjusting the "Request interval" and "Failure threshold" parameters in the "Advanced configuration" section when creating the Health check, as described above), the endpoints previously registered to the main SIP server will start failing over to the secondary SIP server. Note that you may also need to lower the SIP Registration Expiry timer in your SIP endpoints, for example to 30 seconds.
Amazon AWS Route 53 will automatically revert the IP address the Domain (FQDN) resolves to back to the main SIP server as soon as its Health check detects that the main server is back online.
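While testing, it can help to watch which server the FQDN currently resolves to. The following sketch assumes dig is installed; the function names, the example FQDN, and the IP addresses are placeholders.

```shell
#!/bin/sh
# Resolves the A record(s) of a name. Assumes dig is installed.
resolve_a() {
    dig +short "$1" A
}

# Reports which server the FQDN currently points at.
# Arguments: <fqdn> <main_ip> <secondary_ip>
active_server() {
    current=$(resolve_a "$1" | head -n 1)
    case "$current" in
        "$2") echo "main" ;;
        "$3") echo "secondary" ;;
        *)    echo "unknown" ;;
    esac
}
```

Running active_server sip.example.com 203.0.113.10 203.0.113.20 every few seconds after stopping the main server should show the result change from main to secondary, and back to main once the server is started again. Local resolver caching may delay the change seen from a given machine.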
Related articles
Warm Spare