[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Jul 14 01:58:14 EDT 2021
>>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on 13.07.2021 at 13:42
in message
<CAG=zYNPzXNFEMxR1dZdzhKT122=_mrgQELk8RNYpXgH7Y291Wg at mail.gmail.com>:
> Hi guys,
> I'm back with some PAF Postgres cluster problems.
> Tonight the cluster fenced the master node and promoted the PAF resource
> on a new node.
> Everything went fine, except that I really don't know why it happened.
> This morning I noticed the old master had been fenced by sbd and a new
> master had been promoted; this happened tonight at 00.40.XX.
> Filtering the logs, I can't find any reason why the old master was
> fenced or why the promotion of the new master started (which seems to
> have gone perfectly). At this point I'm a bit lost, because none of us
> is able to find the real reason.
> The cluster worked flawlessly for days with no issues, until now.
> It is crucial for me to understand why this switchover occurred.
>
> I attached the current status, configuration, and logs.
> In the old master node's log I can't find any reason;
> on the new master the only entries are the fencing and the promotion.
>
>
> PS:
> could this be the reason for the fencing?
First, I think your timeouts are rather aggressive. I hope there are no virtual machines involved.
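If you want to double-check the effective values, something like the following should show them (the SBD device path is only a placeholder, use your own):

  corosync-cmapctl | grep -i totem.token      # effective totem token timeout (ms)
  sbd -d /dev/disk/by-id/<your-sbd-device> dump   # watchdog/msgwait timeouts stored on the SBD device
  crm_attribute --type crm_config --name stonith-timeout --query   # cluster-wide stonith-timeout, if set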
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1
This may be a networking problem, or the other node died for some unknown reason.
That is the reason for the fencing.
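To tell a network drop from a sudden node death, I would compare the corosync view on both nodes and the logs around 00:40; something like:

  corosync-cfgtool -s      # ring/link status as corosync sees it
  corosync-quorumtool -s   # current membership and quorum
  journalctl -u corosync -u pacemaker --since "2021-07-13 00:35" --until "2021-07-13 00:45"

(assuming journald still has the logs from that night)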
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]: notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean
You said there is no reason for fencing, but here it is!
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: notice: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
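You can also ask the cluster what it recorded about the fencing event itself; assuming your pacemaker version keeps a fencing history, something like:

  stonith_admin --history ltaoperdbs02   # fencing events recorded for that node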
The fencing timing is also quite aggressive IMHO.
Could it be that a command saturated the network?
Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
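If sysstat is running you could also check whether a NIC was saturated at that time; the interface name and the sa file for the 13th are just assumptions here:

  sar -n DEV -f /var/log/sa/sa13 -s 00:30:00 -e 00:45:00 | grep -E "IFACE|eth0"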
Regards,
Ulrich
>
> grep -e sbd /var/log/messages
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 4)
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
>
> Any thought or help is really appreciated.
>
> Damiano