[ClusterLabs] New user needs some help stabilizing the cluster

Strahil Nikolov hunter86_bg at yahoo.com
Wed Jun 10 13:55:41 EDT 2020


What are your corosync.conf timeouts (especially token & consensus)?
Last time I did a live migration of a RHEL 7 node with the default values, the cluster fenced it, so I set the token to 10s and also raised the consensus (check 'man corosync.conf') above the default.
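
For reference, the relevant totem settings look roughly like this (a sketch; these are the values that worked for me, not a universal recommendation):

    totem {
        version: 2
        # Token timeout in ms; the default is 1000, and corosync 3 adds
        # per-node scaling on top via token_coefficient.
        token: 10000
        # Consensus must be larger than token; if unset it defaults to
        # 1.2 * token (12000 here), so raise it explicitly when in doubt.
        consensus: 15000
    }

After changing corosync.conf on all nodes, restart the cluster stack and verify the runtime values with 'corosync-cmapctl | grep totem'.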

Also, start your investigation from the virtualization layer, as a lot of backups run during the night. Last week I got a cluster node fenced because it failed to respond for 40s. Thankfully that was just a QA cluster, so it wasn't a big deal.

The most common reasons for a VM to fail to respond are (a few quick checks follow the list):
- CPU starvation due to high CPU utilisation on the host
- I/O issues causing the VM to pause
- Lots of backups eating the bandwidth on any of the hypervisors or on a switch between them (if you have a single heartbeat network)
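
A few quick checks from inside the guest (a sketch; the corosync log path is the usual RHEL location and may differ on your setup):

    # Scheduling pauses reported by corosync itself - a clear sign of
    # host-side starvation
    grep "not scheduled" /var/log/cluster/corosync.log

    # CPU steal time: a persistently high 'st' column means the host is
    # starving the VM of CPU
    vmstat 5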

With RHEL 8, corosync allows using more than 2 heartbeat rings and brings new features like SCTP support.
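
A second heartbeat link with knet looks roughly like this in corosync.conf (a sketch; the addresses below are placeholders for your two networks):

    nodelist {
        node {
            name: srv1
            nodeid: 1
            ring0_addr: 192.168.1.11
            ring1_addr: 10.0.0.11
        }
        node {
            name: srv2
            nodeid: 2
            ring0_addr: 192.168.1.12
            ring1_addr: 10.0.0.12
        }
    }

With knet (the default transport in corosync 3), each ringX_addr becomes a separate link and corosync fails over between the links automatically.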

P.S.: You can use a second fencing mechanism like 'sbd', a.k.a. "poison pill"; just make the vmdk shared & independent. This way your cluster can operate even when vCenter is unreachable for any reason.
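
The setup is roughly like this (a sketch; it assumes the shared disk shows up as /dev/sdX on both nodes and that a watchdog module such as softdog is loaded - check 'pcs stonith sbd' in the pcs man page for the exact syntax of your version):

    # Run from one node, with the cluster stopped
    pcs cluster stop --all
    pcs stonith sbd device setup device=/dev/sdX
    pcs stonith sbd enable device=/dev/sdX
    pcs cluster start --all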

Best Regards,
Strahil Nikolov


On 10 June 2020 at 20:06:28 GMT+03:00, Howard <hmoneta at gmail.com> wrote:
>Good morning. Thanks for reading. We have a requirement to provide high
>availability for PostgreSQL 10. I have built a two-node cluster with a
>quorum device as the third vote, all running on RHEL 8.
>
>Here are the versions installed:
>[postgres@srv2 cluster]$ rpm -qa | grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>corosync-3.0.2-3.el8_1.1.x86_64
>corosync-qdevice-3.0.0-2.el8.x86_64
>corosync-qnetd-3.0.0-2.el8.x86_64
>corosynclib-3.0.2-3.el8_1.1.x86_64
>fence-agents-vmware-soap-4.2.1-41.el8.noarch
>pacemaker-2.0.2-3.el8_1.2.x86_64
>pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>pcs-0.10.2-4.el8.x86_64
>resource-agents-paf-2.3.0-1.noarch
>
>These are VMware VMs, so I configured the cluster to use the ESX host as
>the fencing device using fence_vmware_soap.
>
>Throughout each day things generally work very well. The cluster remains
>online and healthy. Unfortunately, when I check pcs status in the
>mornings, I see that all kinds of things went wrong overnight. It is hard
>to pinpoint what the issue is because there is so much information being
>written to pacemaker.log, and I end up scrolling through pages and pages
>of informational log entries trying to find the lines that pertain to the
>issue. Is there a way to separate the logs out to make them easier to
>scroll through? Or maybe a list of keywords to grep for?
>
>It is clearly indicating that the server lost contact with the other node
>and also with the quorum device. Is there a way to make this configuration
>more robust or able to recover from a connectivity blip?
>
>Here are the pacemaker and corosync logs for this morning's failures:
>pacemaker.log
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld [10579] (pcmk_quorum_notification) warning: Quorum lost | membership=952 members=1
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1 will be fenced: peer is no longer part of the cluster
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (determine_online_status) warning: Node srv1 is unclean
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (custom_action) warning: Action pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1 for STONITH
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2 pacemaker-schedulerd[10578] (pcmk__log_transition_summary) warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-34.bz2
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1 (op=join_offer)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 3 (src=307)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (destroy_action) warning: Cancelling timer for action 2 (src=308)
>/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld [10579] (do_log) warning: Input I_RELEASE_DC received in state S_RELEASE_DC from do_election_count_vote
>/var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]: Jun 10 00:07:19 WARNING: No secondary connected to the master
>/var/log/pacemaker/pacemaker.log:Sent 5 probes (5 broadcast(s))
>/var/log/pacemaker/pacemaker.log:Received 0 response(s)
>
>corosync.log
>Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 12922 ms
>Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] A processor failed, forming new configuration.
>Jun 10 00:06:41 [10558] srv2 corosync info    [VOTEQ ] lost contact with quorum device Qdevice
>Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] link: host: 1 link: 0 is down
>Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
>Jun 10 00:06:41 [10558] srv2 corosync warning [KNET  ] host: host: 1 has no active links
>Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] rx: host: 1 link: 0 is up
>Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
>Jun 10 00:06:42 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] A new membership (2:952) was formed. Members left: 1
>Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
>Jun 10 00:06:42 [10558] srv2 corosync warning [CPG   ] downlist left_list: 1 received
>Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
>Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>Jun 10 00:06:42 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
>Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
>Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>Jun 10 00:06:43 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:43 [10558] srv2 corosync notice  [TOTEM ] A new membership (1:960) was formed. Members joined: 1
>Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
>Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
>Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
>Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
>Jun 10 00:06:45 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 1747.0415 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:06:45 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:45 [10558] srv2 corosync notice  [TOTEM ] A new membership (1:964) was formed. Members
>Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
>Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
>Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
>Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
>Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 750 ms
>Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] link: host: 1 link: 0 is down
>Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
>Jun 10 00:06:52 [10558] srv2 corosync warning [KNET  ] host: host: 1 has no active links
>Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] A processor failed, forming new configuration.
>Jun 10 00:06:53 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] A new membership (2:968) was formed. Members left: 1
>Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
>Jun 10 00:06:53 [10558] srv2 corosync warning [CPG   ] downlist left_list: 1 received
>Jun 10 00:07:17 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>Jun 10 00:07:17 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
>Jun 10 00:08:56 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 750 ms
>Jun 10 00:09:04 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 4477.0459 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:09:13 [10558] srv2 corosync warning [MAIN  ] Corosync main process was not scheduled for 5302.9785 ms (threshold is 800.0000 ms). Consider token timeout increase.
>Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] Token has not been received in 5295 ms
>Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A processor failed, forming new configuration.
>Jun 10 00:09:13 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
>Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A new membership (2:972) was formed. Members
>Jun 10 00:09:13 [10558] srv2 corosync warning [CPG   ] downlist left_list: 0 received
>Jun 10 00:09:13 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>Jun 10 00:09:13 [10558] srv2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
>
>Thanks,
>Howard

