Upgrading an HA System from 6.12.65 to 10.13.66

Due to various kernel changes, this upgrade process may result in an unexpected restart of Asterisk. There will also be a short outage as you move the services between nodes. Please be prepared for this, and schedule an outage window as necessary.

 

 

 

Requirements

FreePBX 13

Upgrading is only possible when you are running FreePBX 13. Before attempting to upgrade to Distro 6.6, ensure you are running the latest FreePBX 13 version and the latest versions of all associated modules.

Outage Window

Whilst there is no danger of data loss, both nodes will require a reboot. This means that there will need to be an outage window where you can swap nodes. 

Overview

Before you begin upgrading, you must ensure your cluster is in maintenance mode! When your cluster is in maintenance mode, running the command 'pcs status' shows every resource with the suffix "(unmanaged)". More information is on the page Setting the cluster to maintenance mode.

The cluster is upgraded in an A-then-B fashion. This means that the cluster infrastructure on the active node is upgraded first, and then the standby node.

There is a lot of information on this page. You may encounter some errors as you proceed, so please take care to follow the script below precisely.

 

The overview of the tasks is as follows:

  1. Run a cluster health check

  2. Disable Fencing

  3. Put the node that is not processing calls explicitly into standby mode

  4. Put the entire cluster into maintenance mode

  5. Upgrade the cluster software on the active node

  6. Start pacemaker service on active node

  7. Take the cluster out of maintenance mode

  8. Upgrade the cluster software on the standby node

  9. Reboot the standby node

  10. Ensure replication is valid

  11. Verify cluster integrity

  12. Reboot the active node to validate failover

Estimated Timeframes

An upgrade of both nodes should not take more than an hour.

Outage Windows

There should be two planned outage windows.

  1. A potential outage when returning the cluster from maintenance mode. This failure hasn't been duplicated, but it is theoretically possible if Asterisk returns an unusual error. Ensuring that Asterisk is up and running, and processing calls, before returning the cluster from maintenance mode will remove this possibility.

  2. The primary planned outage window should be approximately 5 minutes, or however long it takes Asterisk to start up. This is when all the services are failed over from the active node to the standby node, before the previously active node is rebooted.

Before you begin

Before making any changes, run a cluster health check. If there are any errors, they will either be fixed automatically or you will be given instructions on how to fix them.


 

Upgrade Process

First, disable fencing, if it is enabled, using:

[root@freepbx-b ~]# pcs property set stonith-enabled=false

And verify that fencing is disabled with:

[root@freepbx-b ~]# pcs property
Cluster Properties:
 cluster-infrastructure: cman
 cluster-recheck-interval: 5m
 dc-version: 1.1.11-97629de
 default-action-timeout: 30
 last-lrm-refresh: 1470610819
 maintenance-mode: false
 no-quorum-policy: ignore
 stonith-enabled: false <--- here
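The same check can be scripted rather than read by eye. The following is a minimal sketch and not part of the original procedure: the helper name `fencing_disabled` is hypothetical, and it assumes that `pcs property` prints each property as a `name: value` line, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `pcs property` output on stdin and succeeds
# only if a line reads "stonith-enabled: false".
fencing_disabled() {
    # Allow leading whitespace and anything after the value (such as spaces).
    grep -Eq '^[[:space:]]*stonith-enabled:[[:space:]]*false([[:space:]]|$)'
}

# Usage on a cluster node (not run here):
#   pcs property | fencing_disabled && echo "fencing is disabled"
```

Reading from stdin keeps the helper independent of the cluster, so it can be tried against saved output before being used on a live node.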

Fencing can be re-enabled at any time simply by browsing to the Fencing menu item in the High Availability module, so take care not to re-enable it accidentally until the upgrade is finished.

When you have organised your outage window, ensure that at least one node is in standby (we'll assume that your currently active machine is FreePBX-A):

[root@freepbx-a ~]# pcs cluster standby freepbx-b
[root@freepbx-a ~]#

If any services were running on freepbx-b, they will be moved across to freepbx-a (or vice versa, if you are running services on -b, and are setting -a to standby).  Verify that this is complete by running 'pcs status'. All services should say 'Started freepbx-a'.  After you have verified this, you must put the cluster into maintenance mode.
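As an illustration of that verification, the sketch below checks that every "Started" line in the status output names the expected node. The helper name `all_started_on` is hypothetical, and the sketch assumes resource lines of the form `... Started freepbx-a`; compare it with the real `pcs status` output on your system before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reads `pcs status` output on stdin and succeeds only
# if at least one "Started" line exists and every one names the given node.
# Assumes resource lines look like "... Started freepbx-a".
all_started_on() {
    node="$1"
    awk -v node="$node" '
        /Started/ {
            seen++
            # Compare the word that follows "Started" with the expected node.
            for (i = 1; i < NF; i++)
                if ($i == "Started" && $(i + 1) != node) bad++
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}

# Usage on a cluster node (not run here):
#   pcs status | all_started_on freepbx-a && echo "all services on freepbx-a"
```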

Once the cluster is in maintenance mode, it immediately stops managing and monitoring resources. Nothing will be restarted or moved if it fails. (Note that we remove this setting after upgrading the active node. Please don't forget to do that!)

It is imperative that you put the cluster into maintenance mode! Failure to do so will lead to failures that are extremely difficult to resolve, and may cause an extended outage. If you are unsure of how to verify this, please read FreePBX HA-Setting the cluster to maintenance mode.
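The maintenance mode command is not reproduced on this page. As a sketch only: pcs exposes maintenance mode as the `maintenance-mode` cluster property (it appears in the `pcs property` output earlier), so the example assumes `pcs property set maintenance-mode=true`. The helper `all_unmanaged` is hypothetical. It applies the "(unmanaged)" suffix check to "Started" lines only, and assumes the suffix is the last thing on each line. Confirm both the command and the output format against the linked maintenance mode page.

```shell
#!/bin/sh
# Assumed command (not shown on this page):
#   pcs property set maintenance-mode=true
#
# Hypothetical helper: reads `pcs status` output on stdin and prints every
# "Started" line that does not end in "(unmanaged)". Succeeds only if at
# least one "Started" line was seen and none of them lacked the suffix.
# Assumes resource lines look like "... Started freepbx-a (unmanaged)".
all_unmanaged() {
    awk '
        /Started/ {
            seen++
            if ($0 !~ /\(unmanaged\)[ \t]*$/) { print; bad++ }
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}

# Usage on a cluster node (not run here):
#   pcs status | all_unmanaged || echo "cluster is NOT fully unmanaged"
```

Any line the helper prints is a resource the cluster still manages, which is the condition the warning above is about.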

You start by upgrading the distro on the node that is processing calls. This does NOT automatically restart the Asterisk process, it only restarts the Cluster Management software. (If your cluster is not in maintenance mode, the secondary node will attempt to take over the cluster services! Read the previous paragraph about maintenance mode!). 

This will normally take about 20 minutes. This process will not cause an outage.

It's possible that the 'Cleanup' part of the upgrade may hang. This is due to a (fixed) bug in the Cluster services. If your system does not seem to be proceeding in the 'Cleanup' phase, please read the Stalled Upgrade information.

 

Your upgrade may hang! It is possible that your upgrade may hang on or around this point:

This is due to a bug that is fixed in the latest version of Pacemaker. Please read the Stalled Upgrade page to unblock it.

When it is finished you will see something like this

DO NOT RESTART YOUR MACHINE! At this point, the cluster services are not running on this machine, and need to be restarted. You can verify this with the following commands:

You will notice that everything should look exactly the same as it did prior to the upgrade. At this point you can now take the cluster out of maintenance mode.

WARNING! It is at this point that the cluster may determine that Asterisk needs to be restarted.

There are cases where an upgrade does not create the /usr/sbin/fwconsole symlink, so at this point you want to confirm that with:

Running "amportal chown" will ensure the symlink is created in /usr/sbin/.
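One way to confirm the symlink is a plain shell test. This is a minimal sketch rather than the check from the original page: the helper name `check_fwconsole_link` is hypothetical, and only the path `/usr/sbin/fwconsole` and the `amportal chown` remedy come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: reports whether the fwconsole symlink exists.
# The path defaults to /usr/sbin/fwconsole, as named in the text.
check_fwconsole_link() {
    link="${1:-/usr/sbin/fwconsole}"
    if [ -L "$link" ]; then
        echo "ok: $link is a symlink"
        return 0
    fi
    echo "missing: $link"
    return 1
}

# Usage on the upgraded node (not run here):
#   check_fwconsole_link || amportal chown
```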

You can take the cluster out of maintenance mode now:

Under NO CIRCUMSTANCES should you take the other node out of standby at this point! 

At this time, you can run 'pcs status' and all the services will appear to be running and valid. However, due to version changes, it is not possible for the standby node to take control of the cluster services, and if it attempts to do so, it will cause a catastrophic failure.

Finally, because the upgrade script thinks it failed, you should manually update the version number on this machine

This ensures that the upgrade system knows which track you are on.

Switch to the other node

You must now proceed to upgrading the standby node. This upgrade will take slightly longer than the upgrade on the active node. An average system should complete the upgrade in around 25 minutes. Note that a number of errors and warnings about Asterisk will be shown as part of the upgrade process. This is of no concern, and should be expected.

It's possible that the 'Cleanup' part of the upgrade may hang. This is due to a (fixed) bug in the Cluster services. If your system does not seem to be proceeding in the 'Cleanup' phase, please read the Stalled Upgrade information.

 

Your upgrade may hang! It is possible that your upgrade may hang on or around this point:

This is due to a bug that is fixed in the latest version of Pacemaker. Please read the Stalled Upgrade page to unblock it.

When the upgrade of the standby node is completed,  it will appear as if it has encountered a fatal error. This is normal, as this machine is not in control of the cluster. 

Because the upgrade script thinks it failed, you should manually update the version number on this machine

When you see these errors, and after you have updated the version, you must reboot the standby node.

When the standby node has rebooted, you can now take it out of standby mode and verify that it has rejoined the cluster successfully.
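The command for leaving standby is not reproduced on this page. As a sketch only, and assuming the same pcs release that provides `pcs cluster standby` earlier, the counterpart would be `pcs cluster unstandby freepbx-b`. The hypothetical helper `node_online` then checks that the node has rejoined, assuming `pcs status` lists active nodes on a line of the form `Online: [ freepbx-a freepbx-b ]`.

```shell
#!/bin/sh
# Assumed command (not shown on this page):
#   pcs cluster unstandby freepbx-b
#
# Hypothetical helper: reads `pcs status` output on stdin and succeeds only
# if the named node appears on the "Online: [ ... ]" line.
node_online() {
    node="$1"
    awk -v node="$node" '
        $1 == "Online:" {
            for (i = 2; i <= NF; i++)
                if ($i == node) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage on a cluster node (not run here):
#   pcs status | node_online freepbx-b && echo "freepbx-b has rejoined"
```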

You should now validate the cluster configuration in FreePBX HA again. If any errors are detected, it will either fix them or, if it cannot, tell you how to fix them.


 

If all tests pass, you should now set the currently active node to standby, in preparation for rebooting it.

WARNING: THIS WILL CAUSE AN OUTAGE.

Simply click on the 'Standby' button in FreePBX HA. This will move all the services across to the other node.

 

When the services have moved across, run a cluster check AGAIN. This ensures that all software is up to date on both machines. 

 

You can now reboot the original node and, when it has rebooted, return it from standby mode. Your cluster version upgrade is now complete. Any further upgrades can be performed through the System Admin module, as per normal.

Re-enable Fencing if necessary by browsing to the High Availability module, and clicking the Fencing menu item. Verify that fencing is enabled with:

 

 
