<div dir="ltr"><div><div><div>I have a two-node CentOS 6.4 based cluster, using pacemaker 1.1.8 with a cman backend running primarily libvirt controlled kvm VMs. For the VMs, I am using clvm volumes for the virtual hard drives and a single gfs2 volume for shared storage of the config files for the VMs and other shared data. For fencing, I use ipmi and a apc master switch to provide redundant fencing. There are location constraints that do not allow the fencing resources run on their own node. I am *not* using sbd or any other software based fencing device.<br>
I had a very bizarre situation this morning -- I had one of the nodes powered off. Then the other self-fenced. I thought that was impossible.

Excerpts from the logs:

Mar 28 13:10:01 virtualhost2 stonith-ng[4223]: notice: remote_op_done: Operation reboot of virtualhost2.delta-co.gov by virtualhost1.delta-co.gov for crmd.4430@virtualhost1.delta-co.gov.fc5638ad: Timer expired

[...]
Virtualhost1 was offline, so I expected that line.
[...]
Mar 28 13:13:30 virtualhost2 pengine[4226]: notice: unpack_rsc_op: Preventing p_ns2 from re-starting on virtualhost2.delta-co.gov: operation monitor failed 'not installed' (rc=5)

[...]
If I had a brief interruption of my gfs2 volume, would that show up? And would it be the cause of a fencing operation?
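(For context, my understanding is that rc=5 is OCF_ERR_INSTALLED, which an agent returns when its binaries or config look missing. Here is a hypothetical sketch of the kind of check that could fire if the agent's config file lives on the gfs2 mount -- this is not the actual p_ns2 agent:)

    # hypothetical sketch, not the real agent
    if [ ! -r "$OCF_RESKEY_config" ]; then
        # the config sits on the gfs2 volume; if the mount blinks,
        # the agent can't tell "missing" from "temporarily unreachable"
        exit 5    # OCF_ERR_INSTALLED -> reported as 'not installed'
    fi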
[...]

Mar 28 13:13:30 virtualhost2 pengine[4226]: warning: pe_fence_node: Node virtualhost2.delta-co.gov will be fenced to recover from resource failure(s)
Mar 28 13:13:30 virtualhost2 pengine[4226]: warning: stage6: Scheduling Node virtualhost2.delta-co.gov for STONITH

[...]
Why is it still trying to fence, if all of the fencing resources are offline?
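(For what it's worth, I believe stonith_admin can show what stonith-ng currently has registered -- something like the commands below, though I have not captured that output from the moment in question:)

    # I believe these list registered devices / devices able to fence a node:
    stonith_admin -L                             # all registered devices
    stonith_admin -l virtualhost2.delta-co.gov   # devices that can fence this node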
[...]

Mar 28 13:13:30 virtualhost2 crmd[4227]: notice: te_fence_node: Executing reboot fencing operation (43) on virtualhost2.delta-co.gov (timeout=60000)

Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: handle_request: Client crmd.4227.9fdec3bd wants to fence (reboot) 'virtualhost2.delta-co.gov' with device '(any)'
[...]
What does that mean? In crmd.4227.9fdec3bd, I figure 4227 is a process ID (it matches crmd's PID above), but I don't know what the next number is.
[...]

Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: error: check_alternate_host: No alternate host available to handle complex self fencing request
[...]
Where did that come from?
[...]

Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: check_alternate_host: Peer[1] virtualhost1.delta-co.gov
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: check_alternate_host: Peer[2] virtualhost2.delta-co.gov
Mar 28 13:13:30 virtualhost2 stonith-ng[4223]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for virtualhost2.delta-co.gov: 648ca743-6cda-4c81-9250-21c9109a51b9 (0)
[...]
The next logs are the reboot logs.