[ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

damiano giuliani damianogiuliani87 at gmail.com
Tue Jul 13 16:09:40 EDT 2021


Hi Klaus, thanks for helping. I'm quite lost because I can't find the
cause.
I attached the corosync logs of all three nodes, hoping you can spot
something I can't see. I really appreciate the effort.
The old master's log seems to be cut off at 00:38, so nothing interesting
there. The new master and the third slave logged what happened, but I
can't figure out why the old master was lost.

Something interesting could be the stonith logs of the new master and the
third slave:

NEW MASTER:
grep stonith-ng /var/log/messages
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02
state is now lost
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer
with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client
crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device
'(any)'
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer
fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find
anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for
ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing
(reboot) by ltaoperdbs02 for
crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation
'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for
crmd.228700@ltaoperdbs03.f5d882d5: OK

THIRD SLAVE:
grep stonith-ng /var/log/messages
Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02
state is now lost
Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer with
id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot'
targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5:
OK
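
From those lines it looks like the cluster couldn't reach ltaoperdbs02 with
any fence device, so it fell back to watchdog self-fencing after the 10s
stonith-watchdog-timeout. Once the old master is reachable again, a rough
way to check whether sbd itself fired the watchdog could be something like
this (just a sketch; it assumes sbd runs as a systemd unit and that the
journal is kept across reboots):

journalctl -u sbd -b -1 --no-pager                      # sbd messages from the previous boot
journalctl -u sbd -b | grep -i -e watchdog -e realtime  # did sbd open the watchdog / get RT scheduling at startup?
grep -e sbd -e watchdog /var/log/messages               # fallback via syslog if the journal is not persistent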

I really appreciate the help and would like to hear what you think about it.

PS: the stonith-watchdog-timeout should be set to 10s (pcs property set
stonith-watchdog-timeout=10s). Would you suggest a different setting?
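
For reference, a quick way to cross-check the two timeouts could be
something like this (a sketch; file locations and pcs syntax may differ a
bit between versions; the usual rule of thumb is stonith-watchdog-timeout
at roughly twice SBD_WATCHDOG_TIMEOUT):

grep -E '^SBD_WATCHDOG_TIMEOUT' /etc/sysconfig/sbd   # sbd's own timeout, default 5s
pcs property show stonith-watchdog-timeout           # cluster-side value, should be ~2x the above
pcs stonith sbd status                               # on recent pcs, shows whether sbd is enabled/running per node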

On Tue, 13 Jul 2021 at 14:29, Klaus Wenninger <
kwenning at redhat.com> wrote:

>
>
> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
> damianogiuliani87 at gmail.com> wrote:
>
>> Hi guys,
>> I'm back with some PAF Postgres cluster problems.
>> Tonight the cluster fenced the master node and promoted the PAF resource
>> to a new node.
>> Everything went fine, except that I really don't know why it happened.
>> This morning I noticed the old master had been fenced by sbd and a new
>> master had been promoted; this happened tonight at 00:40:XX.
>> Filtering the logs, I can't find any reason for the old master being
>> fenced, nor for the start of the promotion of the new master (which seems
>> to have gone perfectly). I'm a bit lost, because none of us is able to
>> work out the real reason.
>> The cluster had worked flawlessly for days with no issues, until now.
>> It's crucial for me to understand why this switchover occurred.
>>
>> I attached the current status, configuration and logs.
>> In the old master's log I can't find any reason;
>> on the new master the only things logged are the fencing and the promotion.
>>
>>
>> PS:
>> could this be the reason for the fencing?
>>
>> grep -e sbd /var/log/messages
>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:
>> Servant pcmk is outdated (age: 4)
>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child:
>> Servant pcmk is healthy (age: 0)
>>
> That was yesterday afternoon, not 00:40 this morning.
> With the watchdog-timeout set to 5s this may have been tight, though.
> Maybe check your other nodes for similar warnings - or check the
> compressed warnings.
> Maybe you can also check the journal of sbd after startup to see whether
> it managed to run with realtime scheduling.
> Is this a bare-metal-setup or running on some hypervisor?
> Unfortunately I'm not familiar enough with Postgres to tell whether there
> is anything interesting in the last messages shown before the suspected
> watchdog reboot.
> Was there any administrative work done by ltauser before the reboot? If
> so, what?
>
> Regards,
> Klaus
>
>
>>
>> Any thought and help is really appreciated.
>>
>> Damiano
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log-20210713-third slave.gz
Type: application/x-gzip
Size: 35070 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20210713/264185bd/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log-20210713-new master.gz
Type: application/x-gzip
Size: 41168 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20210713/264185bd/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log-20210713-old master.gz
Type: application/x-gzip
Size: 85010 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20210713/264185bd/attachment-0005.bin>

