[ClusterLabs] New user needs some help stabilizing the cluster

Howard hmoneta at gmail.com
Wed Jun 10 13:06:28 EDT 2020


Good morning.  Thanks for reading.  We have a requirement to provide high
availability for PostgreSQL 10.  I have built a two-node cluster with a
quorum device as the third vote, all running on RHEL 8.

Here are the versions installed:
[postgres@srv2 cluster]$ rpm -qa | grep "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
corosync-3.0.2-3.el8_1.1.x86_64
corosync-qdevice-3.0.0-2.el8.x86_64
corosync-qnetd-3.0.0-2.el8.x86_64
corosynclib-3.0.2-3.el8_1.1.x86_64
fence-agents-vmware-soap-4.2.1-41.el8.noarch
pacemaker-2.0.2-3.el8_1.2.x86_64
pacemaker-cli-2.0.2-3.el8_1.2.x86_64
pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-libs-2.0.2-3.el8_1.2.x86_64
pacemaker-schemas-2.0.2-3.el8_1.2.noarch
pcs-0.10.2-4.el8.x86_64
resource-agents-paf-2.3.0-1.noarch

These are VMware VMs, so I configured the cluster to use the ESX host as the
fencing device using fence_vmware_soap.

During the day things generally work very well: the cluster remains online
and healthy. Unfortunately, when I check pcs status in the mornings, I see
that several things went wrong overnight.  It is hard to pinpoint the issue
because so much information is written to pacemaker.log; I end up scrolling
through pages and pages of informational log entries trying to find the
lines that pertain to the problem.  Is there a way to separate the logs out
to make them easier to scroll through? Or maybe a list of keywords to grep
for?
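
To illustrate the kind of filtering I have in mind, here is a minimal
Python sketch that keeps only entries at warning severity or above.  It
assumes each entry sits on a single line with the severity word directly
after the "(function)" field, as in the pacemaker.log excerpts below.  The
name filter_by_severity and the regular expression are my own illustration,
not something shipped with Pacemaker, and the "info" sample line is made up.

```python
import re

# Assumption: the severity word follows the "(function_name)" field,
# e.g. "... (pcmk_quorum_notification)       warning: Quorum lost ...".
SEVERITY_RE = re.compile(r"\)\s+(crit|error|warning|notice|info|debug|trace):")


def filter_by_severity(lines, keep=("crit", "error", "warning")):
    """Yield only the lines whose severity field is listed in `keep`.

    Lines that do not follow the assumed layout (for example output
    written by resource agents) are skipped, not guessed at.
    """
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match and match.group(1) in keep:
            yield line


sample = [
    # Taken from the log excerpt in this message, joined onto one line.
    "Jun 10 00:06:42 srv2 pacemakerd [10573] (pcmk_quorum_notification)"
    "       warning: Quorum lost | membership=952 members=1",
    # Hypothetical informational entry, included only for contrast.
    "Jun 10 00:06:42 srv2 pacemakerd [10573] (example_function)"
    "       info: routine message",
]

for entry in filter_by_severity(sample):
    print(entry)
```

Passing an open file object for /var/log/pacemaker/pacemaker.log instead of
the sample list would apply the same filter to the real log.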

The logs clearly indicate that the server lost contact with the other node
and also with the quorum device. Is there a way to make this configuration
more robust, or able to recover from a connectivity blip?

Here are the pacemaker and corosync logs for this morning's failures:
pacemaker.log
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
 [10573] (pcmk_quorum_notification)       warning: Quorum lost |
membership=952 members=1
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld
 [10579] (pcmk_quorum_notification)       warning: Quorum lost |
membership=952 members=1
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (pe_fence_node)  warning: Cluster node srv1
will be fenced: peer is no longer part of the cluster
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (determine_online_status)        warning: Node
srv1 is unclean
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_demote_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsqld:1_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (custom_action)  warning: Action
pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (stage6)         warning: Scheduling Node srv1
for STONITH
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
pacemaker-schedulerd[10578] (pcmk__log_transition_summary)   warning:
Calculated transition 2 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-34.bz2
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (crmd_ha_msg_filter)     warning: Another DC detected: srv1
(op=join_offer)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (destroy_action)         warning: Cancelling timer for action 3
(src=307)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (destroy_action)         warning: Cancelling timer for action 2
(src=308)
/var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
 [10579] (do_log)         warning: Input I_RELEASE_DC received in state
S_RELEASE_DC from do_election_count_vote
/var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]:      Jun 10
00:07:19  WARNING: No secondary connected to the master
/var/log/pacemaker/pacemaker.log:Sent 5 probes (5 broadcast(s))
/var/log/pacemaker/pacemaker.log:Received 0 response(s)

corosync.log
Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main
process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms).
Consider token timeout increase.
Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] Token has not been
received in 12922 ms
Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] A processor failed,
forming new configuration.
Jun 10 00:06:41 [10558] srv2 corosync info    [VOTEQ ] lost contact with
quorum device Qdevice
Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] link: host: 1 link:
0 is down
Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] host: host: 1
(passive) best link: 0 (pri: 1)
Jun 10 00:06:41 [10558] srv2 corosync warning [KNET  ] host: host: 1 has no
active links
Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] rx: host: 1 link: 0
is up
Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] host: host: 1
(passive) best link: 0 (pri: 1)
Jun 10 00:06:42 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
device Qdevice poll (but maximum for 30000 ms)
Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] A new membership
(2:952) was formed. Members left: 1
Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] Failed to receive
the leave message. failed: 1
Jun 10 00:06:42 [10558] srv2 corosync warning [CPG   ] downlist left_list:
1 received
Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within
the non-primary component and will NOT provide any services.
Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
Jun 10 00:06:42 [10558] srv2 corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within
the primary component and will provide service.
Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
Jun 10 00:06:43 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
device Qdevice poll (but maximum for 30000 ms)
Jun 10 00:06:43 [10558] srv2 corosync notice  [TOTEM ] A new membership
(1:960) was formed. Members joined: 1
Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list:
0 received
Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list:
0 received
Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Jun 10 00:06:45 [10558] srv2 corosync warning [MAIN  ] Corosync main
process was not scheduled for 1747.0415 ms (threshold is 800.0000 ms).
Consider token timeout increase.
Jun 10 00:06:45 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
device Qdevice poll (but maximum for 30000 ms)
Jun 10 00:06:45 [10558] srv2 corosync notice  [TOTEM ] A new membership
(1:964) was formed. Members
Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list:
0 received
Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list:
0 received
Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] Token has not been
received in 750 ms
Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] link: host: 1 link:
0 is down
Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] host: host: 1
(passive) best link: 0 (pri: 1)
Jun 10 00:06:52 [10558] srv2 corosync warning [KNET  ] host: host: 1 has no
active links
Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] A processor failed,
forming new configuration.
Jun 10 00:06:53 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
device Qdevice poll (but maximum for 30000 ms)
Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] A new membership
(2:968) was formed. Members left: 1
Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] Failed to receive
the leave message. failed: 1
Jun 10 00:06:53 [10558] srv2 corosync warning [CPG   ] downlist left_list:
1 received
Jun 10 00:07:17 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
Jun 10 00:07:17 [10558] srv2 corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Jun 10 00:08:56 [10558] srv2 corosync notice  [TOTEM ] Token has not been
received in 750 ms
Jun 10 00:09:04 [10558] srv2 corosync warning [MAIN  ] Corosync main
process was not scheduled for 4477.0459 ms (threshold is 800.0000 ms).
Consider token timeout increase.
Jun 10 00:09:13 [10558] srv2 corosync warning [MAIN  ] Corosync main
process was not scheduled for 5302.9785 ms (threshold is 800.0000 ms).
Consider token timeout increase.
Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] Token has not been
received in 5295 ms
Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A processor failed,
forming new configuration.
Jun 10 00:09:13 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
device Qdevice poll (but maximum for 30000 ms)
Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A new membership
(2:972) was formed. Members
Jun 10 00:09:13 [10558] srv2 corosync warning [CPG   ] downlist left_list:
0 received
Jun 10 00:09:13 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
Jun 10 00:09:13 [10558] srv2 corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.

Thanks,
Howard