[ClusterLabs] Why Do Nodes Leave the Cluster?
Strahil Nikolov
hunter86_bg at yahoo.com
Wed Feb 5 14:59:02 EST 2020
On February 5, 2020 8:14:06 PM GMT+02:00, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>On 05.02.2020 20:55, Eric Robinson wrote:
>> The two servers 001db01a and 001db01b were up and responsive. Neither had been rebooted and neither was under heavy load. There's no indication in the logs of loss of network connectivity. Any ideas on why both nodes seem to think the other one is at fault?
>
>The very fact that the nodes lost connection to each other *is* an indication of network problems. Your logs start too late, after the problem had already happened.
>
>>
>> (Yes, it's a 2-node cluster without quorum. A 3-node cluster is not an option at this time.)
>>
>> Log from 001db01a:
>>
>> Feb 5 08:01:02 001db01a corosync[1306]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] A new membership (10.51.14.33:960) was formed. Members left: 2
>> Feb 5 08:01:03 001db01a corosync[1306]: [TOTEM ] Failed to receive the leave message. failed: 2
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Removing all 001db01b attributes for peer loss
>> Feb 5 08:01:03 001db01a cib[1522]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a cib[1522]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a attrd[1525]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> Feb 5 08:01:03 001db01a corosync[1306]: [QUORUM] Members[1]: 1
>> Feb 5 08:01:03 001db01a corosync[1306]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 5 08:01:03 001db01a stonith-ng[1523]: notice: Purged 1 peer with id=2 and/or uname=001db01b from the membership cache
>> Feb 5 08:01:03 001db01a pacemakerd[1491]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Node 001db01b state is now lost
>> Feb 5 08:01:03 001db01a crmd[1527]: warning: No reason to expect node 2 to be down
>> Feb 5 08:01:03 001db01a crmd[1527]: notice: Stonith/shutdown of 001db01b not matched
>> Feb 5 08:01:03 001db01a pengine[1526]: notice: On loss of CCM Quorum: Ignore
>>
>> From 001db01b:
>>
>> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] A new membership (10.51.14.34:960) was formed. Members left: 1
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: Our peer on the DC (001db01a) is dead
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b corosync[1455]: [TOTEM ] Failed to receive the leave message. failed: 1
>> Feb 5 08:01:03 001db01b corosync[1455]: [QUORUM] Members[1]: 2
>> Feb 5 08:01:03 001db01b corosync[1455]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b pacemakerd[1678]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_NOT_DC -> S_ELECTION
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Removing all 001db01a attributes for peer loss
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Lost attribute writer 001db01a
>> Feb 5 08:01:03 001db01b attrd[1691]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b crmd[1693]: notice: State transition S_ELECTION -> S_INTEGRATION
>> Feb 5 08:01:03 001db01b cib[1688]: notice: Node 001db01a state is now lost
>> Feb 5 08:01:03 001db01b cib[1688]: notice: Purged 1 peer with id=1 and/or uname=001db01a from the membership cache
>> Feb 5 08:01:03 001db01b stonith-ng[1689]: notice: [cib_diff_notify] Patch aborted: Application of an update diff failed (-206)
>> Feb 5 08:01:03 001db01b crmd[1693]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
>> Feb 5 08:01:03 001db01b pengine[1692]: notice: On loss of CCM Quorum: Ignore
>>
>>
>> -Eric
>>
>>
>>
Hi Eric,
Do you use 2 corosync rings (routed via separate switches)?
If not, you can easily set them up without downtime.
Also, are you using multicast or unicast?
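For reference, a minimal sketch of what a two-ring, unicast (udpu) corosync.conf might look like on corosync 2.x. The 10.51.14.x addresses are taken from the logs above; the 192.168.99.x addresses, node IDs and cluster name are made up for illustration, and the exact layout depends on your distribution and corosync version:

totem {
    version: 2
    cluster_name: db01cluster       # hypothetical name
    transport: udpu                 # unicast UDP instead of multicast
    rrp_mode: passive               # fall back to ring 1 only when ring 0 fails
}

nodelist {
    node {
        ring0_addr: 10.51.14.33     # existing cluster network (from the logs)
        ring1_addr: 192.168.99.33   # hypothetical second NIC on a separate switch
        nodeid: 1
    }
    node {
        ring0_addr: 10.51.14.34
        ring1_addr: 192.168.99.34   # hypothetical second NIC
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

Afterwards, 'corosync-cfgtool -s' on each node should show both rings active with no faults.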
If a 3rd node is not an option, you can check whether your version supports 'qdevice', which can sit on a separate network and requires very few resources - a simple VM will be enough.
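As a rough sketch of a qdevice setup, assuming a pcs-based stack (e.g. CentOS/RHEL 7) and a hypothetical arbiter VM named 'qnetd-host' (exact commands vary by pcs/corosync version):

# on the arbiter VM (outside the cluster)
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start

# on both cluster nodes
yum install corosync-qdevice

# on one cluster node (the arbiter may need to be authenticated first,
# e.g. 'pcs cluster auth qnetd-host' on older pcs versions)
pcs quorum device add model net host=qnetd-host algorithm=ffsplit
pcs quorum status

With the 'ffsplit' algorithm the qdevice votes for exactly one partition in a 50/50 split, so the surviving node keeps quorum instead of both nodes carrying on alone.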
Best Regards,
Strahil Nikolov