[ClusterLabs] Stonith two-node cluster shot each other
Digimer
lists at alteeve.ca
Tue Dec 4 13:32:46 EST 2018
You need to set a fence delay for the node you want to win in a case
like this. Say, for example, node 1 is hosting services; then you want
to add 'delay="15"' to the stonith resource that fences node 1
(FenceNode1 in your config below).
This way, when both nodes try to fence each other at the same time,
node 2 looks up how to fence node 1, sees the delay, and pauses for 15
seconds. Node 1 looks up how to fence node 2, sees no delay, and fences
immediately. Node 1 lives, node 2 gets fenced.
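With your setup, something like this should do it (a sketch using the
resource name from your 'pcs stonith show' output; fence_ipmilan
accepts a 'delay' parameter):

# pcs stonith update FenceNode1 delay=15

If you'd rather not touch the agent's own options, setting
pcmk_delay_base on the stonith resource should give you the same static
delay.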
digimer
On 2018-12-04 12:48 p.m., Daniel Ragle wrote:
> I *think* the two nodes of my cluster shot each other in the head this
> weekend and I can't figure out why.
>
> Looking at corosync.log on node1 I see this:
>
> [143747] node1.mydomain.com corosync notice [TOTEM ] A processor failed,
> forming new configuration.
> [143747] node1.mydomain.com corosync notice [TOTEM ] A new membership
> (192.168.10.25:236) was formed. Members joined: 2 left: 2
> [143747] node1.mydomain.com corosync notice [TOTEM ] Failed to receive
> the leave message. failed: 2
> [143747] node1.mydomain.com corosync notice [TOTEM ] Retransmit List: 1
> Dec 01 07:03:50 [143768] node1.mydomain.com crmd: info:
> pcmk_cpg_membership: Node 2 left group crmd (peer=node2.mydomain.com,
> counter=1.0)
> Dec 01 07:03:50 [143766] node1.mydomain.com attrd: info:
> pcmk_cpg_membership: Node 2 left group attrd (peer=node2.mydomain.com,
> counter=1.0)
> Dec 01 07:03:50 [143764] node1.mydomain.com stonith-ng: info:
> pcmk_cpg_membership: Node 2 left group stonith-ng
> (peer=node2.vselect.com, counter=1.0)
> Dec 01 07:03:50 [143762] node1.mydomain.com pacemakerd: info:
> pcmk_cpg_membership: Node 2 left group pacemakerd
> (peer=node2.vselect.com, counter=1.0)
>
> Followed by a whole slew of messages generally saying node2 was
> dead/could not be reached, culminating in:
>
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: notice:
> initiate_remote_stonith_op: Requesting peer fencing (reboot) of
> node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> process_remote_stonith_query: Query result 1 of 2 from
> node1.mydomain.com for node2.mydomain.com/reboot (1 devices)
> a041d1df-e857-4815-91db-00f448106a33
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> call_remote_stonith: Total timeout set to 300 for peer's fencing of
> node2.mydomain.com for
> stonith-api.139901|id=a041d1df-e857-4815-91db-00f448106a33
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> call_remote_stonith: Requesting that 'node1.mydomain.com' perform op
> 'node2.mydomain.com reboot' for stonith-api.139901 (360s, 0s)
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> process_remote_stonith_query: Query result 2 of 2 from
> node2.mydomain.com for node2.mydomain.com/reboot (1 devices)
> a041d1df-e857-4815-91db-00f448106a33
> Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info:
> stonith_fence_get_devices_cb: Found 1 matching devices for
> 'node2.mydomain.com'
> Dec 01 07:04:21 [143768] node1.mydomain.com crmd: info:
> crm_update_peer_expected: handle_request: Node node2.mydomain.com[2]
> - expected state is now down (was member)
> Dec 01 07:04:21 [143766] node1.mydomain.com attrd: info:
> attrd_peer_update: Setting shutdown[node2.mydomain.com]: (null) ->
> 1543665861 from node2.mydomain.com
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: Diff: --- 0.188.66 2
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: Diff: +++ 0.188.67 (null)
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: + /cib: @num_updates=67
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='2']:
> <transient_attributes id="2"/>
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: ++ <instance_attributes id="status-2">
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: ++ <nvpair
> id="status-2-shutdown" name="shutdown" value="1543665861"/>
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: ++ </instance_attributes>
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_perform_op: ++ </transient_attributes>
> Dec 01 07:04:21 [143763] node1.mydomain.com cib: info:
> cib_process_request: Completed cib_modify operation for section status:
> OK (rc=0, origin=node2.mydomain.com/attrd/6, version=0.188.67)
>
> And on node2 I see this:
>
> [50215] node2.mydomain.com corosync notice [TOTEM ] A new membership
> (192.168.10.25:228) was formed. Members
> [50215] node2.mydomain.com corosync notice [TOTEM ] A new membership
> (192.168.10.25:236) was formed. Members joined: 1 left: 1
> [50215] node2.mydomain.com corosync notice [TOTEM ] Failed to receive
> the leave message. failed: 1
> Dec 01 07:03:50 [50224] node2.mydomain.com cib: info:
> pcmk_cpg_membership: Node 1 left group cib (peer=node1.mydomain.com,
> counter=2.0)
> Dec 01 07:03:50 [50224] node2.mydomain.com cib: info:
> crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] -
> corosync-cpg is now offline
> Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info:
> pcmk_cpg_membership: Node 1 left group crmd (peer=node1.mydomain.com,
> counter=2.0)
> Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info:
> crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] -
> corosync-cpg is now offline
> Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info:
> peer_update_callback: Client node1.mydomain.com/peer now has status
> [offline] (DC=true, changed=4000000)
>
> and then later
>
> Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice:
> handle_request: Client stonith-api.170881.b598a6f3 wants to fence
> (reboot) '1' with device '(any)'
> Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice:
> initiate_remote_stonith_op: Requesting peer fencing (reboot) of
> node1.mydomain.com | id=2b08eff2-1555-46fa-8a88-fe500f3fca87 state=0
> Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
> process_remote_stonith_query: Query result 1 of 2 from
> node1.mydomain.com for node1.mydomain.com/reboot (1 devices)
> 2b08eff2-1555-46fa-8a88-fe500f3fca87
> Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
> process_remote_stonith_query: Query result 2 of 2 from
> node2.mydomain.com for node1.mydomain.com/reboot (1 devices)
> 2b08eff2-1555-46fa-8a88-fe500f3fca87
> Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
> call_remote_stonith: Total timeout set to 300 for peer's fencing of
> node1.mydomain.com for
> stonith-api.170881|id=2b08eff2-1555-46fa-8a88-fe500f3fca87
> Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info:
> call_remote_stonith: Requesting that 'node2.mydomain.com' perform op
> 'node1.mydomain.com reboot' for stonith-api.170881 (360s, 0s)
> Dec 01 07:04:21 [50225] node2.mydomain.com stonith-ng: info:
> stonith_fence_get_devices_cb: Found 1 matching devices for
> 'node1.mydomain.com'
>
> What is wrong with my config that they would want to kill each other?
> Shouldn't one always survive?
>
> # pcs stonith show --full
> Resource: FenceNode2 (class=stonith type=fence_ipmilan)
> Attributes: hexadecimal_kg=<KEY> ipaddr=192.168.10.29 lanplus=1
> login=ipmiUser method=onoff passwd=<BLAH> power_timeout=30 power_wait=4
> Operations: monitor interval=60s (FenceNode2-monitor-interval-60s)
> Resource: FenceNode1 (class=stonith type=fence_ipmilan)
> Attributes: hexadecimal_kg=<KEY> ipaddr=192.168.100.28 lanplus=1
> login=ipmiUser method=onoff passwd=<BLAH> power_timeout=30 power_wait=4
> Operations: monitor interval=60s (FenceNode1-monitor-interval-60s)
>
> The corresponding constraints:
>
> Resource: FenceNode1
>   Disabled on: node1.mydomain.com (score:-INFINITY)
> Resource: FenceNode2
>   Disabled on: node2.mydomain.com (score:-INFINITY)
>
> And corosync.conf:
>
> # cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     cluster_name: MyCluster
>     secauth: off
>     transport: udp
>
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.10.0
>         broadcast: no
>         mcastport: 5405
>         ttl: 1
>     }
> }
>
> nodelist {
>     node {
>         ring0_addr: node1.vselect.com
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: node2.vselect.com
>         nodeid: 2
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
>
> TIA,
>
> Dan
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould