If an IPMI device fails at the same time as the currently active host, the surviving node has no way to confirm what has happened (it may even believe that it is the one that has fallen off the network), and it will not restart cluster resources.

The most common cause of this is the active node losing power instantly, which also disables its onboard IPMI.
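If you suspect this has happened, you can check from the surviving node whether the peer's IPMI interface is still reachable. The address and credentials below are placeholders; substitute the BMC address and login configured for freepbx-a:

```shell
# Query the failed node's IPMI/BMC over the LAN interface.
# 192.168.1.50, admin and secret are placeholders -- use the BMC
# address and credentials configured for freepbx-a.
ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret power status
# A healthy BMC replies "Chassis Power is on" (or "off"). If the BMC
# lost power along with the host, the command times out with an error.
```

A timeout here, combined with the node showing as UNCLEAN below, is a strong sign that both the host and its IPMI went down together.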

If this is the case, log into the remaining node (freepbx-b in this example) and run `pcs status`. Output like the following indicates that this has happened:

[root@freepbx-b ~]# pcs status
Cluster name: freepbx-ha
Last updated: Wed Jul 27 14:21:33 2016
Last change: Wed Jul 27 14:20:38 2016
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured
22 Resources configured

Node freepbx-a: UNCLEAN (offline)
Online: [ freepbx-b ]

Note that freepbx-a is marked as UNCLEAN. This means that the cluster does not know its state, and is unable to manage any resources assigned to it.

Full list of resources:
 spare_ip       (ocf::heartbeat:IPaddr2):       Started freepbx-a
 floating_ip    (ocf::heartbeat:IPaddr2):       Started freepbx-a
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-a ]
     Slaves: [ freepbx-b ]
 spare_fs       (ocf::heartbeat:Filesystem):    Started freepbx-a
 Resource Group: mysql
     mysql_fs   (ocf::heartbeat:Filesystem):    Started freepbx-a
     mysql_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-a
     mysql_service      (ocf::heartbeat:mysql): Started freepbx-a
 Resource Group: asterisk
     asterisk_fs        (ocf::heartbeat:Filesystem):    Started freepbx-a
     asterisk_ip        (ocf::heartbeat:IPaddr2):       Started freepbx-a
     asterisk_service   (ocf::heartbeat:freepbx):       Started freepbx-a
 Resource Group: httpd
     httpd_fs   (ocf::heartbeat:Filesystem):    Started freepbx-a
     httpd_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-a
     httpd_service      (ocf::heartbeat:apache):        Started freepbx-a
 fence_a        (stonith:fence_vmware): Started freepbx-b
 fence_b        (stonith:fence_vmware): Started freepbx-a

Unfortunately, this means that every resource in the cluster is now unmanageable.

To resolve this problem, tell the cluster that fencing is disabled:

[root@freepbx-b ~]# pcs property set stonith-enabled=false

This allows the cluster to begin failing over to freepbx-b, but the failover will not complete on its own. Running `pcs status` now shows a partly started cluster:

Cluster name: freepbx-ha
Last updated: Wed Jul 27 14:24:53 2016
Last change: Wed Jul 27 14:24:39 2016
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured
22 Resources configured

Online: [ freepbx-b ]
OFFLINE: [ freepbx-a ]
Full list of resources:
 spare_ip       (ocf::heartbeat:IPaddr2):       Started freepbx-b
 floating_ip    (ocf::heartbeat:IPaddr2):       Started freepbx-b
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 spare_fs       (ocf::heartbeat:Filesystem):    Started freepbx-b
 Resource Group: mysql
     mysql_fs   (ocf::heartbeat:Filesystem):    Started freepbx-b
     mysql_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-b
     mysql_service      (ocf::heartbeat:mysql): Started freepbx-b
 Resource Group: asterisk
     asterisk_fs        (ocf::heartbeat:Filesystem):    Stopped
     asterisk_ip        (ocf::heartbeat:IPaddr2):       Stopped
     asterisk_service   (ocf::heartbeat:freepbx):       Stopped
 Resource Group: httpd
     httpd_fs   (ocf::heartbeat:Filesystem):    Stopped
     httpd_ip   (ocf::heartbeat:IPaddr2):       Stopped
     httpd_service      (ocf::heartbeat:apache):        Stopped
 fence_a        (stonith:fence_vmware): Stopped
 fence_b        (stonith:fence_vmware): FAILED freepbx-b
Failed actions:
    fence_a_start_0 on freepbx-b 'unknown error' (1): call=273, status=Error, last-rc-change='Wed Jul 27 14:21:50 2016', queued=0ms, exec=3328ms
    fence_b_start_0 on freepbx-b 'unknown error' (1): call=281, status=Error, last-rc-change='Wed Jul 27 14:24:39 2016', queued=0ms, exec=1837ms

You can see that the mysql group has started, but the asterisk and httpd groups have not. You now need to tell the cluster that the remaining services are OK to start, using the `pcs resource cleanup` command:

[root@freepbx-b ~]# pcs resource cleanup httpd
Resource: httpd successfully cleaned up
[root@freepbx-b ~]# pcs resource cleanup asterisk
Resource: asterisk successfully cleaned up
[root@freepbx-b ~]#

After waiting a few seconds, you should see the services starting up again:

Cluster name: freepbx-ha
Last updated: Wed Jul 27 14:26:49 2016
Last change: Wed Jul 27 14:26:43 2016
Stack: cman
Current DC: freepbx-b - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured
22 Resources configured

Online: [ freepbx-b ]
OFFLINE: [ freepbx-a ]
Full list of resources:
 spare_ip       (ocf::heartbeat:IPaddr2):       Started freepbx-b
 floating_ip    (ocf::heartbeat:IPaddr2):       Started freepbx-b
 Master/Slave Set: ms-asterisk [drbd_asterisk]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 Master/Slave Set: ms-mysql [drbd_mysql]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 Master/Slave Set: ms-httpd [drbd_httpd]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 Master/Slave Set: ms-spare [drbd_spare]
     Masters: [ freepbx-b ]
     Stopped: [ freepbx-a ]
 spare_fs       (ocf::heartbeat:Filesystem):    Started freepbx-b
 Resource Group: mysql
     mysql_fs   (ocf::heartbeat:Filesystem):    Started freepbx-b
     mysql_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-b
     mysql_service      (ocf::heartbeat:mysql): Started freepbx-b
 Resource Group: asterisk
     asterisk_fs        (ocf::heartbeat:Filesystem):    Started freepbx-b
     asterisk_ip        (ocf::heartbeat:IPaddr2):       Started freepbx-b
     asterisk_service   (ocf::heartbeat:freepbx):       Stopped
 Resource Group: httpd
     httpd_fs   (ocf::heartbeat:Filesystem):    Started freepbx-b
     httpd_ip   (ocf::heartbeat:IPaddr2):       Started freepbx-b
     httpd_service      (ocf::heartbeat:apache):        Started freepbx-b
 fence_a        (stonith:fence_vmware): Stopped
 fence_b        (stonith:fence_vmware): Stopped
Failed actions:
    fence_a_start_0 on freepbx-b 'unknown error' (1): call=273, status=Error, last-rc-change='Wed Jul 27 14:21:50 2016', queued=0ms, exec=3328ms
    fence_b_start_0 on freepbx-b 'unknown error' (1): call=281, status=Error, last-rc-change='Wed Jul 27 14:24:39 2016', queued=0ms, exec=1837ms

The final step is to remove the fencing agent for the failed node, if it is not coming back. In this example it was freepbx-a that failed, so we need to remove its agent:

[root@freepbx-b ~]# pcs stonith delete fence_a
Removing Constraint - location-fence_a-freepbx-b-INFINITY
Deleting Resource - fence_a
[root@freepbx-b ~]#

You can now replace the freepbx-a machine, and re-configure fencing once the replacement is in service.
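Once the replacement node is built and its IPMI is reachable again, fencing can be re-created and re-enabled. This is only a sketch: the fence_vmware options below (vCenter address, credentials, VM name) are placeholders, and you should match whatever parameters were used when fence_a was originally created:

```shell
# Re-create the fencing agent for the replacement node. All option
# values here are placeholders for your environment.
pcs stonith create fence_a fence_vmware ipaddr=vcenter.example.com \
    login=fenceuser passwd=secret port=freepbx-a

# Restore the location constraint so fence_a runs on the other node,
# matching the constraint that was removed with the agent.
pcs constraint location fence_a prefers freepbx-b

# Re-enable fencing cluster-wide now that both agents exist again.
pcs property set stonith-enabled=true
```

After this, `pcs status` should show both fence agents started and stonith active again.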