[ClusterLabs] Antw: [EXT] Re: unexpected fenced node and promotion of the new master PAF ‑ postgres

Klaus Wenninger kwenning at redhat.com
Wed Jul 14 10:25:34 EDT 2021


On Wed, Jul 14, 2021 at 3:28 PM Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on 14.07.2021 at
> 12:49 in message
> <CAG=zYNOjRmKC5az8nz2r82CRabJ3Z+GEnuW_8dE3UJFu1hD1hA at mail.gmail.com>:
> > Hi guys, thanks for helping,
> >
> > it can be quite hard to troubleshoot unexpected failures, especially if
> > they are not easily tracked in the pacemaker / system logs.
> > all servers are bare metal; I requested the BMC logs hoping they contain
> > some information.
> > you guys said the sbd timeouts are too tight; can you explain that and
> > suggest a valid configuration?
>
> You must answer these questions for yourself:
> * What is the maximum read/write delay for your sbd device that still means
> the storage is working? Before assuming something like 1s also think of
> firmware updates, bad disk sectors, etc.
>
stonith-watchdog-timeout is set and there is no 'Servant starting for device'
log entry - so I guess there is no poison-pill fencing here, just watchdog-only
sbd.
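(A quick way to confirm that, assuming the usual EL config location: something
like

  grep -E 'SBD_DEVICE|SBD_WATCHDOG' /etc/sysconfig/sbd

should show SBD_DEVICE= empty or unset for a watchdog-only setup, i.e. no shared
block device and therefore no poison pill, while SBD_WATCHDOG_DEV and
SBD_WATCHDOG_TIMEOUT describe the watchdog side.)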

> * Then configure the sbd parameters accordingly
> * Finally configure the stonith timeout to be not less than the time sbd
> needs in the worst case to bring the machine down. If the cluster starts
> recovering while the other node is not down yet, you may get data corruption
> or other failures.
>
yep - 2 * watchdog-timeout should be a good pick in this case
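As a minimal sketch of that relation - the 5s/10s values below are just the
common defaults and may not match this cluster:

  # /etc/sysconfig/sbd (per node)
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5

  # cluster-wide property, roughly 2 * SBD_WATCHDOG_TIMEOUT
  pcs property set stonith-watchdog-timeout=10s

and the stonith/recovery timeouts should not be shorter than that window.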

>
> >
> > PS: yesterday I resynced the old master (as a slave) and rejoined it into
> > the cluster.
> > I found the following errors in /var/log/messages regarding sbd:
> >
> >  grep -r sbd messages
> > Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> > Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185352]:  notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185362]:      pcmk:   notice: servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185363]:   cluster:   notice: servant_cluster: Monitoring unknown cluster health
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:  notice: inquisitor_child: Servant cluster is healthy (age: 0)
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:  notice: watchdog_init: Using watchdog device '/dev/watchdog'
> > Jul 13 20:42:19 ltaoperdbs02 sbd[185357]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Jul 13 20:53:57 ltaoperdbs02 sbd[188919]:    info: main: Verbose mode enabled.
> > Jul 13 20:53:57 ltaoperdbs02 sbd[188919]:    info: main: Watchdog enabled.
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189176]:  notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189178]:      pcmk:   notice: servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]:   error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)
>
> Maybe also debug the watchdog device.
>
>
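Regarding the 'Device or resource busy (16)' errors above and below: the kernel
only allows a single opener of a watchdog device, so they most likely just mean
that another sbd process (probably the instance systemd started at 20:42)
already holds it. To see what is going on, something along these lines should
work - the exact tools available depend on your packages:

  wdctl /dev/watchdog                  # driver, timeout and status of the device
  lsof /dev/watchdog /dev/watchdog0    # which process currently has it open
  sbd query-watchdog                   # watchdog devices sbd itself can discover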
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 189178) has terminated
> > Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for cluster (pid: 189179) has terminated
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:  notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:   error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog0': Device or resource busy (16)
> > Jul 13 20:55:30 ltaoperdbs02 sbd[189484]:   error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)
> >
> > if I check the systemctl status of sbd:
> >
> > systemctl status sbd.service
> > ● sbd.service - Shared-storage based fencing daemon
> >    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago
> >      Docs: man:sbd(8)
> >   Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
> >  Main PID: 185357 (sbd)
> >    CGroup: /system.slice/sbd.service
> >            ├─185357 sbd: inquisitor
> >            ├─185362 sbd: watcher: Pacemaker
> >            └─185363 sbd: watcher: Cluster
> >
> > Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based fencing daemon...
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185352]:   notice: main: Doing flush + writing 'b' to sysrq on timeout
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185362]:       pcmk:   notice: servant_pcmk: Monitoring Pacemaker health
> > Jul 13 20:42:14 ltaoperdbs02 sbd[185363]:    cluster:   notice: servant_cluster: Monitoring unknown cluster health
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:   notice: inquisitor_child: Servant cluster is healthy (age: 0)
> > Jul 13 20:42:15 ltaoperdbs02 sbd[185357]:   notice: watchdog_init: Using watchdog device '/dev/watchdog'
> > Jul 13 20:42:15 ltaoperdbs02 systemd[1]: Started Shared-storage based fencing daemon.
> > Jul 13 20:42:19 ltaoperdbs02 sbd[185357]:   notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> >
> > this is happening on all 3 nodes, any thoughts?
>
> Bad watchdog?
>
> >
> > Thanks for helping, have a good day
> >
> > Damiano
> >
> >
> > On Wed, Jul 14, 2021 at 10:08 AM Klaus Wenninger <
> > kwenning at redhat.com> wrote:
> >
> >>
> >>
> >> On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <arvidjaar at gmail.com>
> >> wrote:
> >>
> >>> On 13.07.2021 23:09, damiano giuliani wrote:
> >>> > Hi Klaus, thanks for helping, I'm quite lost because I can't find out
> >>> > the cause.
> >>> > I attached the corosync logs of all three nodes hoping you guys can find
> >>> > something I can't see; I really appreciate the effort.
> >>> > The old master's log seems cut off at 00:38, so nothing interesting
> >>> > there.
> >>> > The new master and the third slave logged what happened, but I can't
> >>> > figure out why the old master was lost.
> >>> >
> >>>
> >>> The reason it was lost is most likely outside of pacemaker. You need to
> >>> check other logs on the node that was lost, maybe the BMC if this is bare
> >>> metal or the hypervisor if it is a virtualized system.
> >>>
> >>> All that these logs say is that ltaoperdbs02 was lost from the point of
> >>> view of the two other nodes. It happened at the same time (around Jul 13
> >>> 00:40), which suggests ltaoperdbs02 indeed had some problem. Whether it
> >>> was a software crash, hardware failure or network outage cannot be
> >>> determined from these logs.
> >>>
> >> What speaks against a pure network outage is that we don't see the
> >> corosync membership messages on the node that died.
> >> Of course it is possible that the log wasn't flushed out before the reboot,
> >> but usually I'd expect that there would be enough time.
> >> If something kept corosync or sbd from being scheduled, that would explain
> >> why we don't see messages from these instances.
> >> And that was why I was asking to check whether corosync and sbd are able to
> >> switch to rt-scheduling in this setup.
> >> But of course that is all speculation, and from what we know it could be
> >> anything from an administrative hard shutdown via some BMC to whatever.
> >>
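For that rt-scheduling question, a quick check on a running node could look like
this (just a sketch; 'RR'/'FF' with a non-zero RTPRIO means realtime scheduling):

  ps -eo pid,class,rtprio,comm | grep -E 'sbd|corosync'
  chrt -p $(pidof corosync)            # same information for a single pid

If corosync or sbd show class 'TS' and RTPRIO '-', they did not manage to switch
to realtime scheduling.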
> >>>
> >>> > something interesting could be the stonith logs of the new master and
> >>> > the third slave:
> >>> >
> >>> > NEW MASTER:
> >>> > grep stonith-ng /var/log/messages
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02 state is now lost
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
> >>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
> >>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
> >>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700 at ltaoperdbs03.f5d882d5: OK
> >>> >
> >>> > THIRD SLAVE:
> >>> > grep stonith-ng /var/log/messages
> >>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02 state is now lost
> >>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
> >>> > Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700 at ltaoperdbs03.f5d882d5: OK
> >>> >
> >>> > I really appreciate the help and would like to hear what you think
> >>> > about it.
> >>> >
> >>> > PS: the stonith-watchdog-timeout is set to 10s (pcs property set
> >>> > stonith-watchdog-timeout=10s); do you suggest a different setting?
> >>> >
> >>> > On Tue, Jul 13, 2021 at 2:29 PM Klaus Wenninger <
> >>> > kwenning at redhat.com> wrote:
> >>> >
> >>> >>
> >>> >>
> >>> >> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <
> >>> >> damianogiuliani87 at gmail.com> wrote:
> >>> >>
> >>> >>> Hi guys,
> >>> >>> I'm back with some PAF Postgres cluster problems.
> >>> >>> Tonight the cluster fenced the master node and promoted the PAF
> >>> >>> resource on a new node.
> >>> >>> Everything went fine, except that I really don't know why it happened.
> >>> >>> This morning I noticed the old master had been fenced by sbd and a new
> >>> >>> master had been promoted; this happened tonight at 00.40.XX.
> >>> >>> Filtering the logs I can't find any reason why the old master was
> >>> >>> fenced, nor for the start of the promotion of the new master (which
> >>> >>> seems to have gone perfectly). At this point I'm a bit lost, because
> >>> >>> none of us is able to find the real reason.
> >>> >>> The cluster worked flawlessly for days with no issues, until now.
> >>> >>> It is crucial for me to understand why this switch occurred.
> >>> >>>
> >>> >>> I attached the current status, configuration and logs.
> >>> >>> In the old master's log I can't find any reason; on the new master the
> >>> >>> only things are the fencing and the promotion.
> >>> >>>
> >>> >>>
> >>> >>> PS:
> >>> >>> could this be the reason for the fencing?
> >>> >>>
> >>> >>> grep -e sbd /var/log/messages
> >>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
> >>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)
> >>> >>>
> >>> >> That was yesterday afternoon and not 0:40 today in the morning.
> >>> >> With the watchdog-timeout set to 5s this may have been tight, though.
> >>> >> Maybe check your other nodes for similar warnings - or check the
> >>> >> compressed warnings.
> >>> >> Maybe you can also check the journal of sbd after startup to see if it
> >>> >> managed to run rt-scheduled.
> >>> >> Is this a bare-metal setup, or is it running on some hypervisor?
> >>> >> Unfortunately I'm not enough into Postgres to tell if there is anything
> >>> >> interesting about the last messages shown before the suspected
> >>> >> watchdog-reboot.
> >>> >> Was there some administrative work done by ltauser before the reboot?
> >>> >> If yes, what?
> >>> >>
> >>> >> Regards,
> >>> >> Klaus
> >>> >>
> >>> >>
> >>> >>>
> >>> >>> Any thought and help is really appreciated.
> >>> >>>
> >>> >>> Damiano
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>