IPMI Failure
If an IPMI device fails at the same time as the currently active host, there is no way for the other node to know what has happened (it actually believes that it is the one that has fallen off the network), and it will not restart the cluster resources.
The most common cause of this is the active node losing power suddenly, which also takes down its onboard IPMI.
If this is the case, logging into the remaining node (freepbx-b in this example) and running `pcs status` will produce output like the following:
```
[root@freepbx-b ~]# pcs status
Cluster name: freepbx-ha
Last updated: Wed Jul 27 14:21:33 2016
Last change: Wed Jul 27 14:20:38 2016
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured
22 Resources configured

Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]
```
Note that freepbx-a is marked as UNCLEAN. This means that the cluster is unaware of its state and is unable to manage any resources that are assigned to it.
```
Full list of resources:

 spare_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
 floating_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 spare_fs (ocf::heartbeat:Filesystem): Started freepbx-a
 Resource Group: mysql
     mysql_fs (ocf::heartbeat:Filesystem): Started freepbx-a
     mysql_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
     mysql_service (ocf::heartbeat:mysql): Started freepbx-a
 Resource Group: asterisk
     asterisk_fs (ocf::heartbeat:Filesystem): Started freepbx-a
     asterisk_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
     asterisk_service (ocf::heartbeat:freepbx): Started freepbx-a
 Resource Group: httpd
     httpd_fs (ocf::heartbeat:Filesystem): Started freepbx-a
     httpd_ip (ocf::heartbeat:IPaddr2): Started freepbx-a
     httpd_service (ocf::heartbeat:apache): Started freepbx-a
 fence_a (stonith:fence_vmware): Started freepbx-b
 fence_b (stonith:fence_vmware): Started freepbx-a
```
Unfortunately, this means that, in this situation, every resource in the cluster is now unmanageable.
To resolve this problem, you need to tell the cluster that fencing is disabled:
```
[root@freepbx-b ~]# pcs property set stonith-enabled=false
```
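If you want to confirm the change took effect before continuing, the cluster properties can be listed. This is an optional check, not a step from the original procedure:

```
# Optional: confirm that stonith-enabled is now false
[root@freepbx-b ~]# pcs property list
```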
This will allow the cluster to start failing over to freepbx-b, but the failover will not complete on its own. Running `pcs status` now will show a partly started cluster, with only the mysql resource group running. You now need to tell the cluster that the remaining services are OK to start, using the `pcs resource cleanup` command shown below.
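Running the cleanup with no arguments clears the failure state for every resource, after which Pacemaker will try to start them on freepbx-b. A minimal sketch of that step (you can also pass a single resource name from the status output above, for example `asterisk`, to clean up just that group):

```
# Clear the stale/failed state for all resources so Pacemaker retries them
[root@freepbx-b ~]# pcs resource cleanup
```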
After waiting a few seconds, you should see the services starting up again in the `pcs status` output.
The final thing to do is to remove the fencing agent for the failed node, if it is not coming back. In our example it was freepbx-a that failed, so we need to remove that agent, as shown below.
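Assuming the fence resource for the failed node is named fence_a, as in the status output above, it can be removed with the standard stonith delete command:

```
# Remove the fence resource that pointed at the failed freepbx-a node
[root@freepbx-b ~]# pcs stonith delete fence_a
```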
You can now replace the freepbx-a machine and re-configure fencing once the replacement is in place.
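Once the replacement hardware (and its IPMI/fence device) is back in service, fencing has to be switched back on. The sketch below assumes a fence_vmware agent as in the status output; the agent options (address, credentials, VM name) are placeholders for your environment, not values from this document:

```
# Re-create the fence resource for the rebuilt node
# (ipaddr/login/passwd/port are placeholders for your VMware details)
[root@freepbx-b ~]# pcs stonith create fence_a fence_vmware \
      ipaddr=vcenter.example.com login=fenceuser passwd=secret \
      port=freepbx-a pcmk_host_list=freepbx-a

# Re-enable fencing, reverting the property changed earlier
[root@freepbx-b ~]# pcs property set stonith-enabled=true
```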