[ClusterLabs] New user needs some help stabilizing the cluster
Jan Friesse
jfriesse at redhat.com
Thu Jun 11 03:36:34 EDT 2020
Howard,
> Good morning. Thanks for reading. We have a requirement to provide high
> availability for PostgreSQL 10. I have built a two-node cluster with a
> quorum device as the third vote, all running on RHEL 8.
>
> Here are the versions installed:
> [postgres at srv2 cluster]$ rpm -qa|grep
> "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
> corosync-3.0.2-3.el8_1.1.x86_64
> corosync-qdevice-3.0.0-2.el8.x86_64
> corosync-qnetd-3.0.0-2.el8.x86_64
> corosynclib-3.0.2-3.el8_1.1.x86_64
> fence-agents-vmware-soap-4.2.1-41.el8.noarch
> pacemaker-2.0.2-3.el8_1.2.x86_64
> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
> pcs-0.10.2-4.el8.x86_64
> resource-agents-paf-2.3.0-1.noarch
>
> These are VMware VMs, so I configured the cluster to use the ESX host as
> the fencing device via fence_vmware_soap.
>
> Throughout each day things generally work very well. The cluster remains
> online and healthy. Unfortunately, when I check pcs status in the mornings,
> I see that all kinds of things went wrong overnight. It is hard to
> pinpoint what the issue is because there is so much information written to
> the pacemaker.log; I end up scrolling through pages and pages of
> informational entries trying to find the lines that pertain to the issue.
> Is there a way to separate the logs out to make them easier to scroll
> through? Or maybe a list of keywords to grep for?
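As for filtering: there is no official keyword list, but dropping the
info-level noise usually surfaces the relevant lines. A minimal sketch,
assuming the default pacemaker log format (severity tags such as "warning:"
and "error:" after the process name, as in the lines you posted):

  # show only warnings and worse from the pacemaker log
  grep -E 'warning:|error:|crit:' /var/log/pacemaker/pacemaker.log

For this particular problem, though, the answer is already in what you
posted.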
The most important info is the following line:
> Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main
> process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
There are more of these, so you can either make sure the VM is not paused
for such a long time, or increase the token timeout so corosync is able to
handle such a pause.
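If you go the token route, a minimal sketch of the change in
/etc/corosync/corosync.conf (the 20000 ms value is only an example; the
logged 800 ms threshold is 80% of the default 1000 ms token, so riding out
a 13-second pause needs a token well above 16000 ms):

  totem {
      ...
      # maximum time without a token before a membership change is declared;
      # must comfortably exceed the longest VM pause you expect to survive
      token: 20000
  }

Then sync the file to the other node (for example with pcs cluster sync) and
ask corosync to reload it with corosync-cfgtool -R. Keep in mind that a
larger token also delays detection of real node failures.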
Regards,
Honza
>
> The logs clearly indicate that the server lost contact with the other node
> and also with the quorum device. Is there a way to make this configuration
> more robust, or able to recover from a connectivity blip?
>
> Here are the pacemaker and corosync logs for this morning's failures:
> pacemaker.log
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
> [10573] (pcmk_quorum_notification) warning: Quorum lost |
> membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld
> [10579] (pcmk_quorum_notification) warning: Quorum lost |
> membership=952 members=1
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (pe_fence_node) warning: Cluster node srv1
> will be fenced: peer is no longer part of the cluster
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (determine_online_status) warning: Node
> srv1 is unclean
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (custom_action) warning: Action
> pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (stage6) warning: Scheduling Node srv1
> for STONITH
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
> pacemaker-schedulerd[10578] (pcmk__log_transition_summary) warning:
> Calculated transition 2 (with warnings), saving inputs in
> /var/lib/pacemaker/pengine/pe-warn-34.bz2
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
> [10579] (crmd_ha_msg_filter) warning: Another DC detected: srv1
> (op=join_offer)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
> [10579] (destroy_action) warning: Cancelling timer for action 3
> (src=307)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
> [10579] (destroy_action) warning: Cancelling timer for action 2
> (src=308)
> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
> [10579] (do_log) warning: Input I_RELEASE_DC received in state
> S_RELEASE_DC from do_election_count_vote
> /var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]: Jun 10
> 00:07:19 WARNING: No secondary connected to the master
> /var/log/pacemaker/pacemaker.log:Sent 5 probes (5 broadcast(s))
> /var/log/pacemaker/pacemaker.log:Received 0 response(s)
>
> corosync.log
> Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN ] Corosync main
> process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jun 10 00:06:41 [10558] srv2 corosync notice [TOTEM ] Token has not been
> received in 12922 ms
> Jun 10 00:06:41 [10558] srv2 corosync notice [TOTEM ] A processor failed,
> forming new configuration.
> Jun 10 00:06:41 [10558] srv2 corosync info [VOTEQ ] lost contact with
> quorum device Qdevice
> Jun 10 00:06:41 [10558] srv2 corosync info [KNET ] link: host: 1 link:
> 0 is down
> Jun 10 00:06:41 [10558] srv2 corosync info [KNET ] host: host: 1
> (passive) best link: 0 (pri: 1)
> Jun 10 00:06:41 [10558] srv2 corosync warning [KNET ] host: host: 1 has no
> active links
> Jun 10 00:06:42 [10558] srv2 corosync info [KNET ] rx: host: 1 link: 0
> is up
> Jun 10 00:06:42 [10558] srv2 corosync info [KNET ] host: host: 1
> (passive) best link: 0 (pri: 1)
> Jun 10 00:06:42 [10558] srv2 corosync info [VOTEQ ] waiting for quorum
> device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:42 [10558] srv2 corosync notice [TOTEM ] A new membership
> (2:952) was formed. Members left: 1
> Jun 10 00:06:42 [10558] srv2 corosync notice [TOTEM ] Failed to receive
> the leave message. failed: 1
> Jun 10 00:06:42 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 1 received
> Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] This node is within
> the non-primary component and will NOT provide any services.
> Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
> Jun 10 00:06:42 [10558] srv2 corosync notice [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] This node is within
> the primary component and will provide service.
> Jun 10 00:06:42 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
> Jun 10 00:06:43 [10558] srv2 corosync info [VOTEQ ] waiting for quorum
> device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:43 [10558] srv2 corosync notice [TOTEM ] A new membership
> (1:960) was formed. Members joined: 1
> Jun 10 00:06:43 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 0 received
> Jun 10 00:06:43 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 0 received
> Jun 10 00:06:45 [10558] srv2 corosync notice [QUORUM] Members[2]: 1 2
> Jun 10 00:06:45 [10558] srv2 corosync notice [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 10 00:06:45 [10558] srv2 corosync warning [MAIN ] Corosync main
> process was not scheduled for 1747.0415 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jun 10 00:06:45 [10558] srv2 corosync info [VOTEQ ] waiting for quorum
> device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:45 [10558] srv2 corosync notice [TOTEM ] A new membership
> (1:964) was formed. Members
> Jun 10 00:06:45 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 0 received
> Jun 10 00:06:45 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 0 received
> Jun 10 00:06:45 [10558] srv2 corosync notice [QUORUM] Members[2]: 1 2
> Jun 10 00:06:45 [10558] srv2 corosync notice [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 10 00:06:52 [10558] srv2 corosync notice [TOTEM ] Token has not been
> received in 750 ms
> Jun 10 00:06:52 [10558] srv2 corosync info [KNET ] link: host: 1 link:
> 0 is down
> Jun 10 00:06:52 [10558] srv2 corosync info [KNET ] host: host: 1
> (passive) best link: 0 (pri: 1)
> Jun 10 00:06:52 [10558] srv2 corosync warning [KNET ] host: host: 1 has no
> active links
> Jun 10 00:06:52 [10558] srv2 corosync notice [TOTEM ] A processor failed,
> forming new configuration.
> Jun 10 00:06:53 [10558] srv2 corosync info [VOTEQ ] waiting for quorum
> device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:06:53 [10558] srv2 corosync notice [TOTEM ] A new membership
> (2:968) was formed. Members left: 1
> Jun 10 00:06:53 [10558] srv2 corosync notice [TOTEM ] Failed to receive
> the leave message. failed: 1
> Jun 10 00:06:53 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 1 received
> Jun 10 00:07:17 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
> Jun 10 00:07:17 [10558] srv2 corosync notice [MAIN ] Completed service
> synchronization, ready to provide service.
> Jun 10 00:08:56 [10558] srv2 corosync notice [TOTEM ] Token has not been
> received in 750 ms
> Jun 10 00:09:04 [10558] srv2 corosync warning [MAIN ] Corosync main
> process was not scheduled for 4477.0459 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jun 10 00:09:13 [10558] srv2 corosync warning [MAIN ] Corosync main
> process was not scheduled for 5302.9785 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jun 10 00:09:13 [10558] srv2 corosync notice [TOTEM ] Token has not been
> received in 5295 ms
> Jun 10 00:09:13 [10558] srv2 corosync notice [TOTEM ] A processor failed,
> forming new configuration.
> Jun 10 00:09:13 [10558] srv2 corosync info [VOTEQ ] waiting for quorum
> device Qdevice poll (but maximum for 30000 ms)
> Jun 10 00:09:13 [10558] srv2 corosync notice [TOTEM ] A new membership
> (2:972) was formed. Members
> Jun 10 00:09:13 [10558] srv2 corosync warning [CPG ] downlist left_list:
> 0 received
> Jun 10 00:09:13 [10558] srv2 corosync notice [QUORUM] Members[1]: 2
> Jun 10 00:09:13 [10558] srv2 corosync notice [MAIN ] Completed service
> synchronization, ready to provide service.
>
> Thanks,
> Howard