IPMI Failure

If an IPMI device fails at the same time as the currently active host, the surviving node has no way to know what has happened (it actually believes that it is the one that has fallen off the network), and it will not restart cluster resources.

The most common occurrence of this is when the active node loses power instantly, which disables its onboard IPMI.

If this happens, logging into the remaining node (freepbx-b in this example) and running `pcs status` will show output similar to the following:

[root@freepbx-b ~]# pcs status
Cluster name: freepbx-ha
Last updated: Wed Jul 27 14:21:33 2016
Last change: Wed Jul 27 14:20:38 2016
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured
22 Resources configured

Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]

Note that freepbx-a is marked as UNCLEAN. This means the cluster does not know the state of that node and cannot manage any resources assigned to it.

Full list of resources:

 spare_ip       (ocf::heartbeat:IPaddr2):       Started freepbx-a
 floating_ip    (ocf::heartbeat:IPaddr2):       Started freepbx-a
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 spare_fs       (ocf::heartbeat:Filesystem):    Started freepbx-a
 Resource Group: mysql
     mysql_fs           (ocf::heartbeat:Filesystem):    Started freepbx-a
     mysql_ip           (ocf::heartbeat:IPaddr2):       Started freepbx-a
     mysql_service      (ocf::heartbeat:mysql):         Started freepbx-a
 Resource Group: asterisk
     asterisk_fs        (ocf::heartbeat:Filesystem):    Started freepbx-a
     asterisk_ip        (ocf::heartbeat:IPaddr2):       Started freepbx-a
     asterisk_service   (ocf::heartbeat:freepbx):       Started freepbx-a
 Resource Group: httpd
     httpd_fs           (ocf::heartbeat:Filesystem):    Started freepbx-a
     httpd_ip           (ocf::heartbeat:IPaddr2):       Started freepbx-a
     httpd_service      (ocf::heartbeat:apache):        Started freepbx-a
 fence_a        (stonith:fence_vmware): Started freepbx-b
 fence_b        (stonith:fence_vmware): Started freepbx-a

Unfortunately, this means that every resource in the cluster is now unmanageable.

To resolve this, you need to tell the cluster that fencing is disabled:

[root@freepbx-b ~]# pcs property set stonith-enabled=false
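If you want to confirm the property took effect before continuing, you can query it again. This verification step is an extra suggestion rather than part of the original procedure, and the exact subcommand depends on your pcs version (newer releases use `pcs property config`):

[root@freepbx-b ~]# pcs property show stonith-enabled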

This will allow the cluster to start failing over to freepbx-b, but the failover will not complete on its own. Running `pcs status` at this point will show a partially started cluster.

In that output you will see that only the mysql resource group has started. You now need to tell the cluster that the remaining services are OK to start, using the `pcs resource cleanup` command.
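For example, running the cleanup on the surviving node with no arguments clears the failure state for every resource at once; this is just one way to invoke it, and you can also pass a specific resource name instead:

[root@freepbx-b ~]# pcs resource cleanup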

After waiting a few seconds, you should see the remaining services starting up again.

The final thing to do is to remove the fencing agent for the failed node, if that node is not coming back. In our example it was freepbx-a that failed, so we need to remove its fencing agent.
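As a sketch of one way to do this (assuming the fence resource for freepbx-a is still named fence_a, as shown in the status output above), the stonith resource can be deleted with pcs:

[root@freepbx-b ~]# pcs stonith delete fence_a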

You can now replace the freepbx-a machine and re-configure fencing once the replacement is in place.
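Remember that fencing was disabled cluster-wide earlier in this procedure, so once the replacement node and its fencing agent have been configured you will likely want to turn it back on; for example:

[root@freepbx-b ~]# pcs property set stonith-enabled=true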
