FreePBX HA-Status and Errors
Understanding Errors
If the heartbeat system detects that a core service such as Asterisk, Apache, or MySQL is down, it will attempt to bring the service back up. There are two possible outcomes: either heartbeat restarts the service on the same node, or it moves the service to the other node.
For example, if the Asterisk service has stopped running on the master node, here are the possible outcomes:
Heartbeat is able to restart Asterisk after detecting it is down on the node.
Heartbeat is not able to restart Asterisk after detecting it is down on the node, so it moves the FreePBX Service group (Asterisk, Apache, and iSymphony) to the other node.
If a service fails a second time on the same node and you have not cleared the first error, the system will move the service to the other node. After encountering an error, it's important that you log into your High Availability GUI and clear any outstanding errors.
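The restart-or-move behavior described above can be sketched as a small model. This is only an illustration, not the heartbeat implementation: the `ServiceState` class, its method names, and the `MAX_FAILURES` value of 1 are assumptions inferred from the two-failure behavior described in this guide.

```python
from dataclasses import dataclass, field

# Assumption: one uncleared failure per node is tolerated before a move.
MAX_FAILURES = 1


@dataclass
class ServiceState:
    """Hypothetical model of a service and its uncleared failures per node."""
    name: str
    node: str
    failures: dict = field(default_factory=dict)

    def on_failure(self, other_node: str) -> str:
        """Record a failure on the current node and return the action taken."""
        count = self.failures.get(self.node, 0) + 1
        self.failures[self.node] = count
        if count <= MAX_FAILURES:
            # First uncleared failure: restart in place.
            return f"restart {self.name} on {self.node}"
        # Too many uncleared failures on this node: move to the other node.
        self.node = other_node
        return f"move {self.name} to {other_node}"

    def clear_errors(self) -> None:
        """Model the effect of clicking Clear Error in the GUI."""
        self.failures.clear()


svc = ServiceState("asterisk", "freepbx-a")
print(svc.on_failure("freepbx-b"))  # restart asterisk on freepbx-a
print(svc.on_failure("freepbx-b"))  # move asterisk to freepbx-b
```

In this model, calling `clear_errors()` between the two failures resets the count, so the second failure results in another restart on the same node instead of a move.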
Viewing and Clearing Errors
On your main node (Floating IP Address), click on the High Availability module under Settings.
Go to the Status page.
If you have any errors, they will be displayed along with a Clear Error button.
If the failed service has not yet exceeded the maximum number of failures allowed, it will be displayed with a yellow background.
If the failed service has exceeded the maximum number of failures allowed, it will be displayed with a red background.
Click the Clear Error button to clear the error.
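The color rule on the Status page can be expressed as a short function. This is a sketch for illustration only: the function name `status_background` and the default limit of 1 are assumptions, not part of the High Availability module.

```python
def status_background(failures: int, max_failures: int = 1) -> str:
    """Return the background color the Status page would use for a service.

    Assumption: max_failures defaults to 1, matching the example in this
    guide where the second uncleared failure turns the error red.
    """
    if failures <= 0:
        return "none"      # no error is displayed
    if failures <= max_failures:
        return "yellow"    # failed, but the limit is not yet exceeded
    return "red"           # the limit has been exceeded


for count in (0, 1, 2):
    print(count, status_background(count))
```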
Example
For demonstration purposes, we will deliberately stop asterisk on our master node and walk you through the error display and clearing process.
We have gone to the Status page in the High Availability module.
Our example below shows that freepbx-a and freepbx-b are both online. All services show as primary on freepbx-a, and there are no errors.
We will now stop asterisk on freepbx-a so that the heartbeat can detect asterisk is down, bring it back up, and display an error to us. We can see that Asterisk failed on freepbx-a but the system started the service back up on freepbx-a. The error message informs us that if asterisk fails again, it will be blocked from being restarted on freepbx-a until we clear these errors, which means it would move to freepbx-b after another failure.
We will now stop asterisk on freepbx-a a second time. The error is now shown in red to inform us that asterisk has exceeded the number of allowed failures. We can also see that both asterisk and httpd are now using freepbx-b as the active node, but mysql is still using freepbx-a, since mysql did not crash.
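The resulting placement, where asterisk and httpd move together while mysql stays put, can be reproduced with a small sketch. The `GROUPS` mapping and the `move_group_of` helper are hypothetical names for illustration; the grouping only covers the services shown in this example.

```python
# Assumption: asterisk and httpd belong to one group and move together,
# while mysql is managed on its own, as the example status output suggests.
GROUPS = {
    "freepbx": ["asterisk", "httpd"],
    "mysql": ["mysql"],
}


def move_group_of(service: str, placement: dict, target: str) -> dict:
    """Return a new placement with the failed service's whole group on target."""
    new_placement = dict(placement)
    for members in GROUPS.values():
        if service in members:
            for member in members:
                new_placement[member] = target
    return new_placement


before = {"asterisk": "freepbx-a", "httpd": "freepbx-a", "mysql": "freepbx-a"}
after = move_group_of("asterisk", before, "freepbx-b")
print(after)
# {'asterisk': 'freepbx-b', 'httpd': 'freepbx-b', 'mysql': 'freepbx-a'}
```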
If we had cleared the error before the second failure, the system would have restarted asterisk on freepbx-a instead of moving it to freepbx-b. Make sure to clear errors when you see them. If the same error happens multiple times, that indicates a bigger issue somewhere in your setup.
If you would like to force asterisk back to freepbx-a now, you can follow the guide on how to take a node offline from the GUI. Alternatively, you can reboot freepbx-b, which will cause the system to move asterisk back to freepbx-a.