[ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

Klaus Wenninger kwenning at redhat.com
Wed Jul 14 04:08:35 EDT 2021


On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <arvidjaar at gmail.com>
wrote:

> On 13.07.2021 23:09, damiano giuliani wrote:
> > Hi Klaus, thanks for helping; I'm quite lost because I can't find the
> > cause. I attached the corosync logs of all three nodes, hoping you can
> > spot something I can't see. I really appreciate the effort.
> > The old master's log seems cut off at 00:38, so nothing interesting there.
> > The new master and the third slave logged what happened, but I can't
> > figure out why the old master was lost.
> >
>
> The reason it was lost is most likely outside of pacemaker. You need to
> check other logs on the node that was lost, may be BMC if this is bare
> metal or hypervisor if it is virtualized system.
>
> All that these logs say is that ltaoperdbs02 was lost from the point of
> view of two other nodes. It happened at the same time (around Jul 13
> 00:40) which suggests ltaoperdbs02 had some problem indeed. Whether it
> was software crash, hardware failure or network outage cannot be
> determined from these logs.
>
What speaks against a pure network outage is that we don't see
the corosync membership messages on the node that died.
Of course it is possible that the log wasn't flushed out before the
reboot, but usually I'd expect there to be enough time for that.
If something kept corosync or sbd from being scheduled, that would
explain why we don't see messages from those instances.
That is why I asked you to check whether, in this setup, corosync and
sbd are able to switch to realtime scheduling.
But of course this is all speculation; from what we know it could be
anything from an administrative hard shutdown via some BMC to
whatever else.
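For reference, here is roughly how I'd check that. This is only a sketch
assuming a systemd-based setup with the usual sysconfig path; adjust unit,
process, and file names to your distribution:

```shell
# Show scheduling class (CLS) and realtime priority of the cluster
# daemons: TS = normal time-sharing, RR/FF = realtime scheduling.
ps -eo pid,cls,rtprio,comm | grep -E 'corosync|sbd' || true

# Query a single process directly, if it is running:
pidof corosync >/dev/null 2>&1 && chrt -p "$(pidof corosync | awk '{print $1}')"

# Look for scheduling/priority messages in sbd's journal after startup:
journalctl -u sbd --no-pager 2>/dev/null | grep -iE 'sched|realtime|priority' || true

# Compare sbd's own watchdog timeout (path may differ per distribution)
# with the cluster-wide stonith-watchdog-timeout, commonly about twice it:
grep -E '^SBD_WATCHDOG_TIMEOUT' /etc/sysconfig/sbd 2>/dev/null || true
command -v pcs >/dev/null 2>&1 && pcs property show stonith-watchdog-timeout || true
```

TS in the CLS column would mean the daemon is running time-shared rather
than realtime, which would match the scheduling theory above.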

>
> > Something interesting could be the stonith logs of the new master and
> > the third slave:
> >
> > NEW MASTER:
> > grep stonith-ng /var/log/messages
> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02 state is now lost
> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
> >
> > THIRD SLAVE:
> > grep stonith-ng /var/log/messages
> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02 state is now lost
> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> > Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
> >
> > I really appreciate the help; let me know what you think about it.
> >
> > PS: the stonith-watchdog-timeout is set to 10s (pcs property set
> > stonith-watchdog-timeout=10s). Would you suggest a different setting?
> >
> > Il giorno mar 13 lug 2021 alle ore 14:29 Klaus Wenninger <
> > kwenning at redhat.com> ha scritto:
> >
> >>
> >>
> >> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
> >> damianogiuliani87 at gmail.com> wrote:
> >>
> >>> Hi guys,
> >>> I'm back with some PAF Postgres cluster problems.
> >>> Tonight the cluster fenced the master node and promoted the PAF
> >>> resource on a new node.
> >>> Everything went fine, except that I really don't know why it happened.
> >>> This morning I noticed the old master had been fenced by sbd and a new
> >>> master promoted; this happened tonight at 00:40.XX.
> >>> Filtering the logs, I can't find any reason for the old master being
> >>> fenced, nor for the start of the promotion of the new master (which
> >>> seems to have gone perfectly). I'm a bit lost, because none of us is
> >>> able to determine the real reason.
> >>> The cluster worked flawlessly for days with no issues, until now.
> >>> It is crucial for me to understand why this switchover occurred.
> >>>
> >>> I attached the current status, configuration, and logs.
> >>> In the old master's log I can't find any reason;
> >>> on the new master the only things logged are the fencing and the promotion.
> >>>
> >>>
> >>> PS:
> >>> could this be the reason for the fencing?
> >>>
> >>> grep  -e sbd /var/log/messages
> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:
> >>> Servant pcmk is outdated (age: 4)
> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child:
> >>> Servant pcmk is healthy (age: 0)
> >>>
> >> That was yesterday afternoon, not 00:40 this morning.
> >> With the watchdog-timeout set to 5s this may have been tight, though.
> >> Maybe check your other nodes for similar warnings - or check the
> >> compressed warnings.
> >> You could also check the journal of sbd after startup to see whether it
> >> managed to switch to realtime scheduling.
> >> Is this a bare-metal setup or running on some hypervisor?
> >> Unfortunately I'm not enough into Postgres to tell whether there is
> >> anything interesting about the last messages shown before the suspected
> >> watchdog reboot.
> >> Was there any administrative work done by ltauser before the reboot?
> >> If yes, what?
> >>
> >> Regards,
> >> Klaus
> >>
> >>
> >>>
> >>> Any thoughts and help are really appreciated.
> >>>
> >>> Damiano
> >>> _______________________________________________
> >>> Manage your subscription:
> >>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> ClusterLabs home: https://www.clusterlabs.org/
> >>>