[ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

Wed Jul 14 00:40:08 EDT 2021

On 13.07.2021 23:09, damiano giuliani wrote:
> Hi Klaus, thanks for helping. I'm quite lost because I can't find the
> cause.
> I attached the corosync logs of all three nodes, hoping you can find
> and point out something I can't see. I really appreciate the effort.
> The old master's log seems to be cut off at 00:38, so nothing interesting there.
> The new master and the third slave logged what happened, but I can't
> figure out why the old master was lost.
> 

The reason it was lost is most likely outside of pacemaker. You need to
check other logs on the node that was lost, for example the BMC if this
is bare metal, or the hypervisor if it is a virtualized system.

All that these logs say is that ltaoperdbs02 was lost from the point of
view of the two other nodes. It happened at the same time on both (around
Jul 13 00:40), which suggests ltaoperdbs02 did indeed have some problem.
Whether it was a software crash, a hardware failure, or a network outage
cannot be determined from these logs.
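
To make that comparison concrete, here is a minimal Python sketch (not part
of any cluster tooling) that takes the "state is now lost" lines quoted below
and checks that both surviving nodes recorded the loss at the same second.
The helper names (parse, lost_times) are hypothetical, the year is an
assumption because syslog timestamps omit it, and the sample lines are the
quoted ones with the mail wrapping undone.

```python
import re
from datetime import datetime

# Sample lines taken from the stonith-ng excerpts quoted in this thread,
# with the mail client's line wrapping undone.
LOG_LINES = [
    "Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02 state is now lost",
    "Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02 state is now lost",
]

# timestamp, reporting host, daemon[pid]:, severity:, message
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) (\S+) \S+:\s+\w+: (.*)$")


def parse(line, year=2021):
    """Return (timestamp, host, message), or None if the line does not match.

    The year is an assumption: classic syslog timestamps do not carry one.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp, host, message = m.groups()
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def lost_times(lines, node):
    """Map each reporting host to the first time it logged `node` as lost."""
    seen = {}
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, host, message = parsed
        if message == f"Node {node} state is now lost":
            seen.setdefault(host, when)
    return seen


if __name__ == "__main__":
    for host, when in sorted(lost_times(LOG_LINES, "ltaoperdbs02").items()):
        print(host, when.isoformat())
```

Identical timestamps on both survivors are all the paragraph above relies on;
the sketch says nothing about why the node went away.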

> Something interesting could be the stonith logs of the new master and the
> third slave:
> 
> NEW MASTER:
> grep stonith-ng /var/log/messages
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02
> state is now lost
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer
> with id=1 and/or uname=ltaoperdbs02 from the membership cache
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client
> crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device
> '(any)'
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer
> fencing (reboot) targeting ltaoperdbs02
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find
> anyone to fence (reboot) ltaoperdbs02 with any device
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for
> ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing
> (reboot) by ltaoperdbs02 for
> crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation
> 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for
> crmd.228700@ltaoperdbs03.f5d882d5: OK
> 
> THIRD SLAVE:
> grep stonith-ng /var/log/messages
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02
> state is now lost
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer with
> id=1 and/or uname=ltaoperdbs02 from the membership cache
> Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot'
> targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5:
> OK
> 
> I really appreciate the help and would like to know what you think about it.
> 
> PS: the stonith timeout should be set to 10s (pcs property set
> stonith-watchdog-timeout=10s). Do you suggest a different setting?
> 
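
On the timeout: the excerpt above shows stonith-ng waiting 10s for the
self-fence and then assuming it complete, which is consistent with
stonith-watchdog-timeout=10s being in effect. A small Python sketch that
measures that interval from the log follows. The helper names (stamp,
self_fence_wait) are hypothetical, the year is assumed because syslog
timestamps omit it, and the sample lines are the quoted ones with the mail
wrapping undone.

```python
import re
from datetime import datetime

# Sample lines taken from the NEW MASTER excerpt quoted above, unwrapped.
LOG_LINES = [
    "Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for "
    "ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5",
    "Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing "
    "(reboot) by ltaoperdbs02 for "
    "crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete",
]

STAMP_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")


def stamp(line, year=2021):
    """Parse the leading syslog timestamp; the year is an assumption."""
    m = STAMP_RE.match(line)
    if m is None:
        return None
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")


def self_fence_wait(lines, node):
    """Seconds between 'Waiting ... to self-fence' and 'assumed complete'.

    Returns None if either message is missing for `node`.
    """
    start = end = None
    for line in lines:
        when = stamp(line)
        if when is None:
            continue
        if f"for {node} to self-fence" in line:
            start = when
        elif f"Self-fencing (reboot) by {node}" in line and "assumed complete" in line:
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    print(self_fence_wait(LOG_LINES, "ltaoperdbs02"))
```

This only confirms what the survivors waited for; it does not show whether the
watchdog on ltaoperdbs02 actually fired.
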
> On Tue, Jul 13, 2021 at 14:29, Klaus Wenninger <
> kwenning at redhat.com> wrote:
> 
>>
>>
>> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
>> damianogiuliani87 at gmail.com> wrote:
>>
>>> Hi guys,
>>> I'm back with some PAF postgres cluster problems.
>>> Last night the cluster fenced the master node and promoted the PAF resource
>>> on a new node.
>>> Everything went fine, except that I really don't know why it happened.
>>> This morning I noticed the old master had been fenced by sbd and a new
>>> master had been promoted; this happened last night at 00:40:XX.
>>> Filtering the logs, I can't find any reason why the old master was
>>> fenced and the promotion of the new master started (which seems to have
>>> gone perfectly). At this point I'm a bit lost because none of us is able
>>> to find the real reason.
>>> The cluster worked flawlessly for days with no issues, until now.
>>> It is crucial for me to understand why this switch occurred.
>>>
>>> I attached the current status, configuration, and logs.
>>> In the old master node's log I can't find any reason;
>>> on the new master the only thing is the fencing and the promotion.
>>>
>>>
>>> PS:
>>> Could this be the reason for the fencing?
>>>
>>> grep  -e sbd /var/log/messages
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:
>>> Servant pcmk is outdated (age: 4)
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child:
>>> Servant pcmk is healthy (age: 0)
>>>
>> That was yesterday afternoon and not 0:40 today in the morning.
>> With the watchdog-timeout set to 5s this may have been tight though.
>> Maybe check your other nodes for similar warnings - or check the
>> compressed warnings.
>> Maybe you can as well check the journal of sbd after start to see if it
>> managed to run rt-scheduled.
>> Is this a bare-metal-setup or running on some hypervisor?
>> Unfortunately I'm not enough into postgres to tell if there is anything
>> interesting about the last
>> messages shown before the suspected watchdog-reboot.
>> Was there some administrative stuff done by ltauser before the reboot? If
>> yes what?
>>
>> Regards,
>> Klaus
>>
>>
>>>
>>> Any thoughts and help are really appreciated.
>>>
>>> Damiano
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>>