[ClusterLabs] Antw: [EXT] Re: New user needs some help stabilizing the cluster

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Jun 12 03:27:07 EDT 2020


>>> Howard <hmoneta at gmail.com> wrote on 10.06.2020 at 22:14 in message
<12644_1591820112_5EE13F4F_12644_922_1_CAO51vj4cR4cXfr_wuy+4J3a8PCeJVZ99+O7iyUX7
rCkBZc00Q at mail.gmail.com>:
> Hi everyone.  As a follow-up, I found that the VMs were having snapshot
> backups taken at the time of the disconnects, which I think freezes IO.
> We'll be addressing that.  Is there anything else in the log that can be
> improved?

Hi!

That seems to be a big problem with VMware: we run an Icinga2 server on VMware, and from time to time it reports odd things like "Remote Icinga instance '...' is not connected to '...'" even though the host and network are perfectly up. The reason is probably that the VM is actually not running (not being scheduled) for a significant amount of time, and when it is running again, it reports timeouts. It could also be that the VM is running but its clock suddenly jumps forward; I'm not completely sure, but in effect it doesn't seem to make much of a difference.
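
In your corosync log below, corosync itself suggests raising the token timeout ("Consider token timeout increase"); that gives the membership layer more tolerance for such freezes before it declares the peer dead. A minimal sketch of what that could look like in /etc/corosync/corosync.conf (the 10000 ms value is only an illustration, not a recommendation; the file must stay identical on both nodes, and corosync has to re-read it afterwards, e.g. with corosync-cfgtool -R):

totem {
    # ... keep the existing settings (version, cluster_name, crypto, ...) ...
    # token timeout in milliseconds; illustrative value only
    token: 10000
}

The trade-off is that a larger token timeout also delays detection of a node that has really failed.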

Regards,
Ulrich

> 
> Thanks,
> Howard
> 
> On Wed, Jun 10, 2020 at 10:06 AM Howard <hmoneta at gmail.com> wrote:
> 
>> Good morning.  Thanks for reading.  We have a requirement to provide high
>> availability for PostgreSQL 10.  I have built a two node cluster with a
>> quorum device as the third vote, all running on RHEL 8.
>>
>> Here are the versions installed:
>> [postgres at srv2 cluster]$ rpm -qa|grep
>> "pacemaker\|pcs\|corosync\|fence-agents-vmware-soap\|paf"
>> corosync-3.0.2-3.el8_1.1.x86_64
>> corosync-qdevice-3.0.0-2.el8.x86_64
>> corosync-qnetd-3.0.0-2.el8.x86_64
>> corosynclib-3.0.2-3.el8_1.1.x86_64
>> fence-agents-vmware-soap-4.2.1-41.el8.noarch
>> pacemaker-2.0.2-3.el8_1.2.x86_64
>> pacemaker-cli-2.0.2-3.el8_1.2.x86_64
>> pacemaker-cluster-libs-2.0.2-3.el8_1.2.x86_64
>> pacemaker-libs-2.0.2-3.el8_1.2.x86_64
>> pacemaker-schemas-2.0.2-3.el8_1.2.noarch
>> pcs-0.10.2-4.el8.x86_64
>> resource-agents-paf-2.3.0-1.noarch
>>
>> These are VMware VMs, so I configured the cluster to use the ESX host as
>> the fencing device via fence_vmware_soap.
>>
>> Throughout each day things generally work very well.  The cluster remains
>> online and healthy. Unfortunately, when I check pcs status in the mornings,
>> I see that all kinds of things went wrong overnight.  It is hard to
>> pinpoint what the issue is because there is so much information being
>> written to pacemaker.log, and I end up scrolling through pages and pages
>> of informational entries trying to find the lines that pertain to the
>> issue.  Is there a way to separate the logs out to make them easier to
>> scroll through? Or maybe a list of keywords to grep for?
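
[A rough filter that usually narrows this down is to grep for the higher-severity tags plus fencing/quorum keywords; the pattern below is only a guess at useful keywords, not an official list:

  grep -E "warning|error|crit|fence|Quorum|unclean" /var/log/pacemaker/pacemaker.log
]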
>>
>> The log clearly indicates that the server lost contact with the other node
>> and also with the quorum device. Is there a way to make this configuration
>> more robust, or able to recover from a connectivity blip?
>>
>> Here are the pacemaker and corosync logs for this morning's failures:
>> pacemaker.log
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemakerd
>>  [10573] (pcmk_quorum_notification)       warning: Quorum lost |
>> membership=952 members=1
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:42 srv2 pacemaker-controld
>>  [10579] (pcmk_quorum_notification)       warning: Quorum lost |
>> membership=952 members=1
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (pe_fence_node)  warning: Cluster node srv1
>> will be fenced: peer is no longer part of the cluster
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (determine_online_status)        warning: Node
>> srv1 is unclean
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_demote_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsqld:1_stop_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (custom_action)  warning: Action
>> pgsql-master-ip_stop_0 on srv1 is unrunnable (offline)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (stage6)         warning: Scheduling Node srv1
>> for STONITH
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:43 srv2
>> pacemaker-schedulerd[10578] (pcmk__log_transition_summary)   warning:
>> Calculated transition 2 (with warnings), saving inputs in
>> /var/lib/pacemaker/pengine/pe-warn-34.bz2
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
>>  [10579] (crmd_ha_msg_filter)     warning: Another DC detected: srv1
>> (op=join_offer)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
>>  [10579] (destroy_action)         warning: Cancelling timer for action 3
>> (src=307)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
>>  [10579] (destroy_action)         warning: Cancelling timer for action 2
>> (src=308)
>> /var/log/pacemaker/pacemaker.log:Jun 10 00:06:45 srv2 pacemaker-controld
>>  [10579] (do_log)         warning: Input I_RELEASE_DC received in state
>> S_RELEASE_DC from do_election_count_vote
>> /var/log/pacemaker/pacemaker.log:pgsqlms(pgsqld)[1164379]:      Jun 10
>> 00:07:19  WARNING: No secondary connected to the master
>> /var/log/pacemaker/pacemaker.log:Sent 5 probes (5 broadcast(s))
>> /var/log/pacemaker/pacemaker.log:Received 0 response(s)
>>
>> corosync.log
>> Jun 10 00:06:41 [10558] srv2 corosync warning [MAIN  ] Corosync main
>> process was not scheduled for 13006.0615 ms (threshold is 800.0000 ms).
>> Consider token timeout increase.
>> Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] Token has not been
>> received in 12922 ms
>> Jun 10 00:06:41 [10558] srv2 corosync notice  [TOTEM ] A processor failed,
>> forming new configuration.
>> Jun 10 00:06:41 [10558] srv2 corosync info    [VOTEQ ] lost contact with
>> quorum device Qdevice
>> Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] link: host: 1 link:
>> 0 is down
>> Jun 10 00:06:41 [10558] srv2 corosync info    [KNET  ] host: host: 1
>> (passive) best link: 0 (pri: 1)
>> Jun 10 00:06:41 [10558] srv2 corosync warning [KNET  ] host: host: 1 has
>> no active links
>> Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] rx: host: 1 link: 0
>> is up
>> Jun 10 00:06:42 [10558] srv2 corosync info    [KNET  ] host: host: 1
>> (passive) best link: 0 (pri: 1)
>> Jun 10 00:06:42 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
>> device Qdevice poll (but maximum for 30000 ms)
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] A new membership
>> (2:952) was formed. Members left: 1
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [TOTEM ] Failed to receive
>> the leave message. failed: 1
>> Jun 10 00:06:42 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 1 received
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within
>> the non-primary component and will NOT provide any services.
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] This node is within
>> the primary component and will provide service.
>> Jun 10 00:06:42 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>> Jun 10 00:06:43 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
>> device Qdevice poll (but maximum for 30000 ms)
>> Jun 10 00:06:43 [10558] srv2 corosync notice  [TOTEM ] A new membership
>> (1:960) was formed. Members joined: 1
>> Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 0 received
>> Jun 10 00:06:43 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 0 received
>> Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
>> Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Jun 10 00:06:45 [10558] srv2 corosync warning [MAIN  ] Corosync main
>> process was not scheduled for 1747.0415 ms (threshold is 800.0000 ms).
>> Consider token timeout increase.
>> Jun 10 00:06:45 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
>> device Qdevice poll (but maximum for 30000 ms)
>> Jun 10 00:06:45 [10558] srv2 corosync notice  [TOTEM ] A new membership
>> (1:964) was formed. Members
>> Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 0 received
>> Jun 10 00:06:45 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 0 received
>> Jun 10 00:06:45 [10558] srv2 corosync notice  [QUORUM] Members[2]: 1 2
>> Jun 10 00:06:45 [10558] srv2 corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] Token has not been
>> received in 750 ms
>> Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] link: host: 1 link:
>> 0 is down
>> Jun 10 00:06:52 [10558] srv2 corosync info    [KNET  ] host: host: 1
>> (passive) best link: 0 (pri: 1)
>> Jun 10 00:06:52 [10558] srv2 corosync warning [KNET  ] host: host: 1 has
>> no active links
>> Jun 10 00:06:52 [10558] srv2 corosync notice  [TOTEM ] A processor failed,
>> forming new configuration.
>> Jun 10 00:06:53 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
>> device Qdevice poll (but maximum for 30000 ms)
>> Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] A new membership
>> (2:968) was formed. Members left: 1
>> Jun 10 00:06:53 [10558] srv2 corosync notice  [TOTEM ] Failed to receive
>> the leave message. failed: 1
>> Jun 10 00:06:53 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 1 received
>> Jun 10 00:07:17 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>> Jun 10 00:07:17 [10558] srv2 corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Jun 10 00:08:56 [10558] srv2 corosync notice  [TOTEM ] Token has not been
>> received in 750 ms
>> Jun 10 00:09:04 [10558] srv2 corosync warning [MAIN  ] Corosync main
>> process was not scheduled for 4477.0459 ms (threshold is 800.0000 ms).
>> Consider token timeout increase.
>> Jun 10 00:09:13 [10558] srv2 corosync warning [MAIN  ] Corosync main
>> process was not scheduled for 5302.9785 ms (threshold is 800.0000 ms).
>> Consider token timeout increase.
>> Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] Token has not been
>> received in 5295 ms
>> Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A processor failed,
>> forming new configuration.
>> Jun 10 00:09:13 [10558] srv2 corosync info    [VOTEQ ] waiting for quorum
>> device Qdevice poll (but maximum for 30000 ms)
>> Jun 10 00:09:13 [10558] srv2 corosync notice  [TOTEM ] A new membership
>> (2:972) was formed. Members
>> Jun 10 00:09:13 [10558] srv2 corosync warning [CPG   ] downlist left_list:
>> 0 received
>> Jun 10 00:09:13 [10558] srv2 corosync notice  [QUORUM] Members[1]: 2
>> Jun 10 00:09:13 [10558] srv2 corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>>
>> Thanks,
>> Howard
>>
>>




