[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Jul 14 01:58:14 EDT 2021
>>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on 13.07.2021 at 13:42
in message
<CAG=zYNPzXNFEMxR1dZdzhKT122=_mrgQELk8RNYpXgH7Y291Wg at mail.gmail.com>:
> Hi guys,
> I'm back with some PAF Postgres cluster problems.
> Tonight the cluster fenced the master node and promoted the PAF resource
> on a new node.
> Everything went fine, except that I really don't know why it happened.
> This morning I noticed the old master had been fenced by sbd and a new
> master had been promoted; this happened tonight at 00.40.XX.
> Filtering the logs, I can't find any reason why the old master was
> fenced or why the promotion of the new master started (which seems to
> have gone perfectly). At this point I'm a bit lost, because none of us
> is able to find the real reason.
> The cluster worked flawlessly for days with no issues, until now.
> It is crucial for me to understand why this switchover occurred.
>
> I attached the current status, configuration, and logs.
> In the old master node's log I can't find any reason;
> on the new master the only entries are the fencing and the promotion.
>
>
> PS:
> could this be the reason for the fencing?
First, I think your timeouts are rather aggressive. I hope there are no virtual machines involved.
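If you want to double-check the effective values, something like the following should show them (the SBD device path is only a placeholder, use your own):

  corosync-cmapctl | grep -i totem.token      # effective totem token timeout (ms)
  sbd -d /dev/disk/by-id/<your-sbd-device> dump   # watchdog/msgwait timeouts stored on the SBD device
  crm_attribute --type crm_config --name stonith-timeout --query   # cluster-wide stonith-timeout, if set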
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1
This may be a networking problem, or the other node died for some unknown reason.
That is the reason for the fencing.
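To tell a network drop from a sudden node death, I would compare the corosync view on both nodes and the logs around 00:40; something like:

  corosync-cfgtool -s      # ring/link status as corosync sees it
  corosync-quorumtool -s   # current membership and quorum
  journalctl -u corosync -u pacemaker --since "2021-07-13 00:35" --until "2021-07-13 00:45"

(assuming journald still has the logs from that night)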
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean
You said there is no reason for fencing, but here it is!
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
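You can also ask the cluster what it recorded about the fencing event itself; assuming your pacemaker version keeps a fencing history, something like:

  stonith_admin --history ltaoperdbs02   # fencing events recorded for that node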
The fencing timing is also quite aggressive IMHO.
Could it be that a command saturated the network?
Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
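If sysstat is running you could also check whether a NIC was saturated at that time; the interface name and the sa file for the 13th are just assumptions here:

  sar -n DEV -f /var/log/sa/sa13 -s 00:30:00 -e 00:45:00 | grep -E "IFACE|eth0"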
Regards,
Ulrich
>
> grep -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
>
> Any thought or help is really appreciated.
>
> Damiano