[ClusterLabs] Why Do Nodes Leave the Cluster?
Strahil Nikolov
hunter86_bg at yahoo.com
Wed Feb 5 14:59:02 EST 2020
On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>On 05.02.2020 20:55, Eric Robinson wrote:
>> The two servers 001db01a and 001db01b were up and responsive. Neither had been rebooted and neither was under heavy load. There's no indication in the logs of loss of network connectivity. Any ideas on why both nodes seem to think the other one is at fault?
>
>The very fact that the nodes lost connection to each other *is* an indication of network problems. Your logs start too late, after the problem had already happened.
>
>>
>> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option at this time.)
>>
>> Log from 001db01a:
>>
>> Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
>> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b attributes for peer loss
>> Feb 5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> Feb 5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
>> Feb 5 08:01:03 001db01a corosync[1306]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> Feb 5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM Quorum: Ignore
>>
>> From 001db01b:
>>
>> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC (001db01a) is dead
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
>> Feb 5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
>> Feb 5 08:01:03 001db01b corosync[1455]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_NOT_DC -> S_ELECTION
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a attributes for peer loss
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer 001db01a
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_ELECTION -> S_INTEGRATION
>> Feb 5 08:01:03 001db01b cib[1688]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b cib[1688]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Feb 5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
>> Feb 5 08:01:03 001db01b pengine[1692]: notice: On loss of CCM Quorum: Ignore
>>
>>
>> -Eric
>>
>>
>>
Hi Eric,
Do you use 2 corosync rings (routed via separate switches)?
If not, you can easily set them up without downtime.
Also, are you using multicast or unicast?
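For reference, a minimal sketch of what a two-ring, unicast (udpu) corosync.conf might look like on corosync 2.x. The 10.51.14.x addresses are taken from the logs above; the 192.168.99.x addresses, node IDs and cluster name are made up for illustration, and the exact layout depends on your distribution and corosync version:

totem {
    version: 2
    cluster_name: db01cluster       # hypothetical name
    transport: udpu                 # unicast UDP instead of multicast
    rrp_mode: passive               # fall back to ring 1 only when ring 0 fails
}

nodelist {
    node {
        ring0_addr: 10.51.14.33     # existing cluster network (from the logs)
        ring1_addr: 192.168.99.33   # hypothetical second NIC on a separate switch
        nodeid: 1
    }
    node {
        ring0_addr: 10.51.14.34
        ring1_addr: 192.168.99.34   # hypothetical second NIC
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

Afterwards, 'corosync-cfgtool -s' on each node should show both rings active with no faults.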
If a 3rd node is not an option, you can check whether your version supports 'qdevice', which can sit on a separate network and requires very few resources - a simple VM will be enough.
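As a rough sketch of a qdevice setup, assuming a pcs-based stack (e.g. CentOS/RHEL 7) and a hypothetical arbiter VM named 'qnetd-host' (exact commands vary by pcs/corosync version):

# on the arbiter VM (outside the cluster)
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start

# on both cluster nodes
yum install corosync-qdevice

# on one cluster node (the arbiter may need to be authenticated first,
# e.g. 'pcs cluster auth qnetd-host' on older pcs versions)
pcs quorum device add model net host=qnetd-host algorithm=ffsplit
pcs quorum status

With the 'ffsplit' algorithm the qdevice votes for exactly one partition in a 50/50 split, so the surviving node keeps quorum instead of both nodes carrying on alone.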
Best Regards,
Strahil Nikolov