[ClusterLabs] Stonith two-node cluster shot each other
Daniel Ragle
daniel at Biblestuph.com
Tue Dec 4 12:48:30 EST 2018
I *think* the two nodes of my cluster shot each other in the head this
weekend and I can't figure out why.
Looking at corosync.log on node1 I see this:
[143747] node1.mydomain.com corosync notice [TOTEM ] A processor failed,
forming new configuration.
[143747] node1.mydomain.com corosync notice [TOTEM ] A new membership
(192.168.10.25:236) was formed. Members joined: 2 left: 2
[143747] node1.mydomain.com corosync notice [TOTEM ] Failed to receive
the leave message. failed: 2
[143747] node1.mydomain.com corosync notice [TOTEM ] Retransmit List: 1
Dec 01 07:03:50 [143768] node1.mydomain.com crmd: info:
pcmk_cpg_membership: Node 2 left group crmd (peer=node2.mydomain.com,
counter=1.0)
Dec 01 07:03:50 [143766] node1.mydomain.com attrd: info:
pcmk_cpg_membership: Node 2 left group attrd (peer=node2.mydomain.com,
counter=1.0)
Dec 01 07:03:50 [143764] node1.mydomain.com stonith-ng: info:
pcmk_cpg_membership: Node 2 left group stonith-ng
(peer=node2.mydomain.com, counter=1.0)
Dec 01 07:03:50 [143762] node1.mydomain.com pacemakerd: info:
pcmk_cpg_membership: Node 2 left group pacemakerd
(peer=node2.mydomain.com, counter=1.0)
That was followed by a whole slew of messages generally saying node2 was
dead/could not be reached, culminating in:
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: notice:
initiate_remote_stonith_op: Requesting peer fencing (reboot) of
node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
process_remote_stonith_query: Query result 1 of 2 from
node1.mydomain.com for node2.mydomain.com/reboot (1 devices)
a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
call_remote_stonith: Total timeout set to 300 for peer's fencing of
node2.mydomain.com for
stonith-api.139901|id=a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
call_remote_stonith: Requesting that 'node1.mydomain.com' perform op
'node2.mydomain.com reboot' for stonith-api.139901 (360s, 0s)
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
process_remote_stonith_query: Query result 2 of 2 from
node2.mydomain.com for node2.mydomain.com/reboot (1 devices)
a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
stonith_fence_get_devices_cb: Found 1 matching devices for
'node2.mydomain.com'
Dec 01 07:04:21 [143768] node1.mydomain.com crmd: info:
crm_update_peer_expected: handle_request: Node node2.mydomain.com[2]
- expected state is now down (was member)
Dec 01 07:04:21 [143766] node1.mydomain.com attrd: info:
attrd_peer_update: Setting shutdown[node2.mydomain.com]: (null) ->
1543665861 from node2.mydomain.com
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: Diff: --- 0.188.66 2
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: Diff: +++ 0.188.67 (null)
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: + /cib: @num_updates=67
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: ++ /cib/status/node_state[@id='2']:
<transient_attributes id="2"/>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: ++
<instance_attributes id="status-2">
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: ++ <nvpair
id="status-2-shutdown" name="shutdown" value="1543665861"/>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: ++
</instance_attributes>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_perform_op: ++
</transient_attributes>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
cib_process_request: Completed cib_modify operation for section status:
OK (rc=0, origin=node2.mydomain.com/attrd/6, version=0.188.67)
And on node2 I see this:
[50215] node2.mydomain.com corosync notice [TOTEM ] A new membership
(192.168.10.25:228) was formed. Members
[50215] node2.mydomain.com corosync notice [TOTEM ] A new membership
(192.168.10.25:236) was formed. Members joined: 1 left: 1
[50215] node2.mydomain.com corosync notice [TOTEM ] Failed to receive
the leave message. failed: 1
Dec 01 07:03:50 [50224] node2.mydomain.com cib: info:
pcmk_cpg_membership: Node 1 left group cib (peer=node1.mydomain.com,
counter=2.0)
Dec 01 07:03:50 [50224] node2.mydomain.com cib: info:
crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] -
corosync-cpg is now offline
Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info:
pcmk_cpg_membership: Node 1 left group crmd (peer=node1.mydomain.com,
counter=2.0)
Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info:
crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] -
corosync-cpg is now offline
Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info:
peer_update_callback: Client node1.mydomain.com/peer now has status
[offline] (DC=true, changed=4000000)
And then later:
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice:
handle_request: Client stonith-api.170881.b598a6f3 wants to fence
(reboot) '1' with device '(any)'
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice:
initiate_remote_stonith_op: Requesting peer fencing (reboot) of
node1.mydomain.com | id=2b08eff2-1555-46fa-8a88-fe500f3fca87 state=0
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
process_remote_stonith_query: Query result 1 of 2 from
node1.mydomain.com for node1.mydomain.com/reboot (1 devices)
2b08eff2-1555-46fa-8a88-fe500f3fca87
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
process_remote_stonith_query: Query result 2 of 2 from
node2.mydomain.com for node1.mydomain.com/reboot (1 devices)
2b08eff2-1555-46fa-8a88-fe500f3fca87
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
call_remote_stonith: Total timeout set to 300 for peer's fencing of
node1.mydomain.com for
stonith-api.170881|id=2b08eff2-1555-46fa-8a88-fe500f3fca87
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
call_remote_stonith: Requesting that 'node2.mydomain.com' perform op
'node1.mydomain.com reboot' for stonith-api.170881 (360s, 0s)
Dec 01 07:04:21 [50225] node2.mydomain.com stonith-ng: info:
stonith_fence_get_devices_cb: Found 1 matching devices for
'node1.mydomain.com'
What is wrong with my config that they would want to kill each other?
Shouldn't one always survive?
# pcs stonith show --full
Resource: FenceNode2 (class=stonith type=fence_ipmilan)
Attributes: hexadecimal_kg=<KEY> ipaddr=192.168.10.29 lanplus=1
login=ipmiUser method=onoff passwd=<BLAH> power_timeout=30 power_wait=4
Operations: monitor interval=60s (FenceNode2-monitor-interval-60s)
Resource: FenceNode1 (class=stonith type=fence_ipmilan)
Attributes: hexadecimal_kg=<KEY> ipaddr=192.168.100.28 lanplus=1
login=ipmiUser method=onoff passwd=<BLAH> power_timeout=30 power_wait=4
Operations: monitor interval=60s (FenceNode1-monitor-interval-60s)
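For reference, I set those devices up with commands roughly like the
following (reconstructed from memory, so the exact invocation may have
differed slightly):
# pcs stonith create FenceNode2 fence_ipmilan ipaddr=192.168.10.29 lanplus=1 \
      login=ipmiUser passwd=<BLAH> hexadecimal_kg=<KEY> method=onoff \
      power_timeout=30 power_wait=4 op monitor interval=60s
# pcs stonith create FenceNode1 fence_ipmilan ipaddr=192.168.100.28 lanplus=1 \
      login=ipmiUser passwd=<BLAH> hexadecimal_kg=<KEY> method=onoff \
      power_timeout=30 power_wait=4 op monitor interval=60s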
The corresponding constraints:
Resource: FenceNode1
Disabled on: node1.mydomain.com (score:-INFINITY)
Resource: FenceNode2
Disabled on: node2.mydomain.com (score:-INFINITY)
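(If I remember right, those constraints were created with something along
the lines of:
# pcs constraint location FenceNode1 avoids node1.mydomain.com=INFINITY
# pcs constraint location FenceNode2 avoids node2.mydomain.com=INFINITY
i.e. so that neither node is allowed to run the device that fences itself.)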
And corosync.conf:
# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: MyCluster
    secauth: off
    transport: udp
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        broadcast: no
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: node1.mydomain.com
        nodeid: 1
    }
    node {
        ring0_addr: node2.mydomain.com
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
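One thing I wasn't sure about: should one of the fence devices be given a
head start so the two nodes can't fence each other at the same instant?
I've seen the delay parameter mentioned for cases like this, e.g.
something like:
# pcs stonith update FenceNode1 delay=15
but I don't know whether that (or pcmk_delay_max / pcmk_delay_base) is the
recommended way to avoid this kind of race, so pointers welcome.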
TIA,
Dan