Upgrading an HA System from 6.12.65 to 10.13.66
Due to various kernel changes, this upgrade process may result in an unexpected restart of Asterisk. There will also be a short outage as you move the services between nodes. Please be prepared for this, and schedule an outage window as necessary.
Requirements
FreePBX 13
Upgrading is only possible when you are running FreePBX 13. Before attempting to upgrade to Distro 6.6, ensure you are running the latest FreePBX 13 version and the latest versions of its associated modules.
Outage Window
Whilst there is no danger of data loss, both nodes will require a reboot. This means that there will need to be an outage window where you can swap nodes.
Overview
Before you begin, you must ensure your cluster is in maintenance mode! When your cluster is in maintenance mode, running the command 'pcs status' will show every resource with the suffix "(unmanaged)". More information is on the page Setting the cluster to maintenance mode.
The cluster is upgraded in an A-then-B fashion. This means that the cluster infrastructure on the active node is upgraded first, and then the standby node.
There is a lot of information on this page, and there are some potential errors that you may encounter as you proceed, so please take care to follow the script below precisely.
The tasks, in order, are as follows:
Run a cluster health check
Disable Fencing
Put the node that is not processing calls explicitly into standby mode
Put the entire cluster into maintenance mode
Upgrade the cluster software on the active node
Start pacemaker service on active node
Take the cluster out of maintenance mode
Upgrade the cluster software on the standby node
Reboot the standby node
Ensure replication is valid
Verify cluster integrity
Reboot the active node to validate failover
Estimated Timeframes
An upgrade of both nodes should not take more than an hour.
Outage Windows
There should be two planned outage windows.
The first is a potential outage when returning the cluster from maintenance mode. This failure hasn't been reproduced, but it is theoretically possible if Asterisk returns an unusual error. Ensuring that Asterisk is up and running, and processing calls, before returning the cluster from maintenance mode removes this possibility.
The second, and primary, planned outage window should be approximately 5 minutes, or however long it takes Asterisk to start up. This is when all the services are failed over from the active node to the standby node, before rebooting the previously active node.
Before you begin
Before making any changes, run a cluster health check. Any errors it finds will be fixed automatically, or you will be given instructions on how to fix them.
Upgrade Process
First, disable fencing if it is enabled:
[root@freepbx-b ~]# pcs property set stonith-enabled=false
And verify that fencing is disabled with:
[root@freepbx-b ~]# pcs property
Cluster Properties:
cluster-infrastructure: cman
cluster-recheck-interval: 5m
dc-version: 1.1.11-97629de
default-action-timeout: 30
last-lrm-refresh: 1470610819
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false <--- here
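To check a single property without reading the whole listing, the output can be filtered. The following is a minimal sketch, not part of the upgrade tooling: pcs_property_value is a hypothetical helper name, and it assumes the "name: value" layout shown in the listing above.

```shell
# Hypothetical helper: print the value of one cluster property.
# Reads 'pcs property' output on stdin; assumes "name: value" lines.
pcs_property_value() {
    awk -v name="$1" '
        { sub(/^[ \t]+/, "") }          # drop leading indentation
        index($0, name ":") == 1 {      # line starts with "name:"
            sub(/^[^:]*:[ \t]*/, "")    # drop the "name: " prefix
            print $1                    # first word of the value
            exit
        }'
}
```

For example, 'pcs property | pcs_property_value stonith-enabled' should print 'false' once fencing is disabled.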
You can re-enable fencing at any time by browsing to the Fencing menu item in the High Availability module, so take care not to re-enable it accidentally until the upgrade is finished.
When you have organised your outage window, ensure that at least one node is in standby (We'll be assuming your currently active machine is FreePBX-A):
[root@freepbx-a ~]# pcs cluster standby freepbx-b
[root@freepbx-a ~]#
If any services were running on freepbx-b, they will be moved across to freepbx-a (or vice versa, if you are running services on -b, and are setting -a to standby). Verify that this is complete by running 'pcs status'. All services should say 'Started freepbx-a'. After you have verified this, you must put the cluster into maintenance mode.
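The two steps above can be sketched as follows. This is a sketch under stated assumptions, not the exact commands from the original page: the function names are hypothetical, all_started_on assumes each resource line of 'pcs status' contains the word 'Started' followed by the node name, and maintenance mode is set through the pcs cluster property 'maintenance-mode'.

```shell
# Hypothetical helper: succeed only if at least one resource is reported
# as "Started" and every such line names the given node.
# Reads 'pcs status' output on stdin.
all_started_on() {
    awk -v node="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "Started") { seen++; if ($(i + 1) != node) bad++ }
        }
        END { exit !(seen > 0 && bad == 0) }'
}

# Hypothetical wrapper: put the whole cluster into maintenance mode.
enter_maintenance_mode() {
    pcs property set maintenance-mode=true
}
```

A typical use would be 'pcs status | all_started_on freepbx-a && enter_maintenance_mode'.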
After you issue this command, the cluster immediately stops managing and monitoring resources. Nothing will be restarted or moved if it fails. (Note that we remove this setting after upgrading the active node, please don't forget to do that!).
It is imperative that you put the cluster into maintenance mode! Failure to do so will lead to failures that are extremely difficult to resolve, and may cause an extended outage. If you are unsure of how to verify this, please read FreePBX HA-Setting the cluster to maintenance mode.
You start by upgrading the distro on the node that is processing calls. This does NOT automatically restart the Asterisk process, it only restarts the Cluster Management software. (If your cluster is not in maintenance mode, the secondary node will attempt to take over the cluster services! Read the previous paragraph about maintenance mode!).
This will normally take about 20 minutes. This process will not cause an outage.
Your upgrade may hang! The 'Cleanup' part of the upgrade may stall because of a bug in the cluster services that is fixed in the latest version of Pacemaker. If your system does not seem to be proceeding through the 'Cleanup' phase, please read the Stalled Upgrade page to unblock it.
When it is finished you will see something like this
DO NOT RESTART YOUR MACHINE! At this point, the cluster services are not running on this machine, and need to be restarted. You can verify this with the following commands:
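The original commands are not reproduced here, so the following is only a sketch of one way to check and restart the cluster services. It assumes that 'pcs status' exits with a non-zero status while the cluster is not running on this node, and that Pacemaker is started through the SysV 'service pacemaker start' script; start_cluster_services is a hypothetical name.

```shell
# Hypothetical helper: start Pacemaker only if the cluster is not already
# running on this node, then show the resulting cluster status.
start_cluster_services() {
    if pcs status >/dev/null 2>&1; then
        echo "Cluster services are already running on this node"
    else
        # Assumption: the init script is named 'pacemaker'.
        service pacemaker start && pcs status
    fi
}
```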
You will notice that everything should look exactly the same as it did prior to the upgrade. At this point you can now take the cluster out of maintenance mode.
WARNING! It is at this point that the cluster may determine that asterisk needs to be restarted.
There are cases where an upgrade does not create the /usr/sbin/fwconsole symlink, so confirm at this point that it exists. If it is missing, running "amportal chown" will ensure the symlink is created in /usr/sbin/.
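A minimal sketch of that check follows. The function name and its optional path argument are hypothetical and exist only to make the check easy to try; the path and the "amportal chown" command are the ones named above.

```shell
# Hypothetical helper: show the fwconsole symlink if it exists,
# otherwise run 'amportal chown' so that it is created.
ensure_fwconsole_symlink() {
    fwconsole_link="${1:-/usr/sbin/fwconsole}"
    if [ -L "$fwconsole_link" ]; then
        ls -l "$fwconsole_link"
    else
        amportal chown
    fi
}
```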
You can take the cluster out of maintenance mode now:
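One way to do this, sketched under assumptions: the helper name is hypothetical, maintenance mode is cleared through the pcs cluster property 'maintenance-mode', and 'asterisk -rx' is used as a quick check that Asterisk is answering before the cluster resumes managing it.

```shell
# Hypothetical helper: leave maintenance mode only if Asterisk responds.
leave_maintenance_mode() {
    if asterisk -rx 'core show uptime' >/dev/null 2>&1; then
        pcs property set maintenance-mode=false
    else
        echo "Asterisk is not responding; leaving maintenance mode enabled" >&2
        return 1
    fi
}
```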
Under NO CIRCUMSTANCES should you take the other node out of standby at this point!
At this time, you can run 'pcs status' and all the services will appear to be running and valid. However, due to version changes, it is not possible for the standby node to take control of the cluster services, and any attempt to do so will cause a catastrophic failure.
Finally, because the upgrade script thinks it failed, you should manually update the version number on this machine
This ensures that the upgrade system knows which track you are on.
Switch to the other node
You must now proceed to upgrading the standby node. This upgrade will take slightly longer than the upgrade on the active node. An average system should complete the upgrade in around 25 minutes. Note that a number of errors and warnings about Asterisk will be shown as part of the upgrade process. This is of no concern, and should be expected.
Your upgrade may hang! The 'Cleanup' part of the upgrade may stall because of a bug in the cluster services that is fixed in the latest version of Pacemaker. If your system does not seem to be proceeding through the 'Cleanup' phase, please read the Stalled Upgrade page to unblock it.
When the upgrade of the standby node is completed, it will appear as if it has encountered a fatal error. This is normal, as this machine is not in control of the cluster.
Because the upgrade script thinks it failed, you should manually update the version number on this machine
When you have seen these errors and updated the version, you must reboot the standby node.
When the standby node has rebooted, you can now take it out of standby mode and verify that it has rejoined the cluster successfully.
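From the command line this can be sketched as below. The helper name is hypothetical, and the node name is passed as an argument (freepbx-b under the assumption made earlier on this page); 'pcs cluster unstandby' is the counterpart of the 'pcs cluster standby' command used above.

```shell
# Hypothetical helper: take a node out of standby, then show the
# cluster status so that the rejoin can be checked by eye.
rejoin_node() {
    pcs cluster unstandby "$1" && pcs status
}
```

For example: 'rejoin_node freepbx-b'.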
You should now validate the cluster configuration in FreePBX HA again. If any errors are detected, it will fix them, or tell you how to fix the ones it cannot fix itself.
If all tests pass, you should now set the currently active node to standby, in preparation for rebooting it.
WARNING: THIS WILL CAUSE AN OUTAGE.
Simply click on the 'Standby' button in FreePBX HA. This will move all the services across to the other node.
When the services have moved across, run a cluster check AGAIN. This ensures that all software is up to date on both machines.
You can now reboot the original node and, once it has rebooted, return it from standby mode. Your cluster version upgrade is now complete. Any further upgrades can be performed through the System Admin module, as per normal.
Re-enable Fencing if necessary by browsing to the High Availability module, and clicking the Fencing menu item. Verify that fencing is enabled with:
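A minimal sketch of that check, assuming the same 'pcs property' listing format shown near the start of the upgrade process. The function name is hypothetical. Note the assumption that the property is listed explicitly: if 'stonith-enabled' does not appear in the listing at all, this check reports failure even though the cluster default may still apply.

```shell
# Hypothetical helper: succeed if 'pcs property' output on stdin
# lists stonith-enabled as true.
fencing_enabled() {
    grep -Eq '^[[:space:]]*stonith-enabled:[[:space:]]*true'
}
```

For example: 'pcs property | fencing_enabled && echo "Fencing is enabled"'.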