<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 14, 2021 at 3:28 PM Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>>> damiano giuliani <<a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote on 14.07.2021 at 12:49<br>
in message<br>
<CAG=<a href="mailto:zYNOjRmKC5az8nz2r82CRabJ3Z%2BGEnuW_8dE3UJFu1hD1hA@mail.gmail.com" target="_blank">zYNOjRmKC5az8nz2r82CRabJ3Z+GEnuW_8dE3UJFu1hD1hA@mail.gmail.com</a>>:<br>
> Hi guys, thanks for helping,<br>
> <br>
> it could be quite hard to troubleshoot unexpected failures, especially if<br>
> they are not easily tracked in the pacemaker / system logs.<br>
> all servers are bare metal; I requested the BMC logs hoping they contain<br>
> some information.<br>
> you guys said the sbd timing is too tight - can you explain and suggest a<br>
> valid configuration?<br>
<br>
You must answer these questions for yourself:<br>
* What is the maximum read/write delay for your sbd device that still means<br>
the storage is working? Before assuming something like 1s also think of<br>
firmware updates, bad disk sectors, etc.<br></blockquote><div>stonith-watchdog-timeout is set and there is no 'Servant starting for device' log - so no poison-pill fencing, I guess </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
* Then configure the sbd parameters accordingly<br>
* Finally configure the stonith timeout to be not less than the time sbd needs<br>
in worst case to down the machine. If the cluster starts recovering while the<br>
other node is not down already, you may have data corruption or other<br>
failures.<br></blockquote><div>yep - 2 * watchdog-timeout should be a good pick in this case </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
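The sizing rule above can be sketched as shell arithmetic. This is a minimal sketch, not a verified configuration for this cluster: the 10s value is an example (raised from the tight 5s default mentioned later in the thread), and the parameter names come from sbd(8) / /etc/sysconfig/sbd and pcs.

```shell
# Minimal sketch of the timeout arithmetic, assuming watchdog-only sbd.
# SBD_WATCHDOG_TIMEOUT lives in /etc/sysconfig/sbd; 10 is an example value,
# chosen with firmware updates / bad sectors / scheduling stalls in mind.
SBD_WATCHDOG_TIMEOUT=10
# The cluster must wait at least as long as self-fencing can take in the
# worst case; 2 * watchdog-timeout, as suggested above, leaves headroom.
STONITH_WATCHDOG_TIMEOUT=$((2 * SBD_WATCHDOG_TIMEOUT))
echo "pcs property set stonith-watchdog-timeout=${STONITH_WATCHDOG_TIMEOUT}s"
```

If the cluster starts recovery before this interval has elapsed, the old node may still be alive, which is exactly the data-corruption scenario described above.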
<br>
> <br>
> PS: yesterday I resynced the old master (to slave) and rejoined it into<br>
> the cluster.<br>
> I found the following errors in /var/log/messages about sbd:<br>
> <br>
> grep -r sbd messages<br>
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant<br>
> pcmk is outdated (age: 4)<br>
> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant<br>
> pcmk is healthy (age: 0)<br>
> Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush +<br>
> writing 'b' to sysrq on timeout<br>
> Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice:<br>
> servant_pcmk: Monitoring Pacemaker health<br>
> Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice:<br>
> servant_cluster: Monitoring unknown cluster health<br>
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child:<br>
> Servant cluster is healthy (age: 0)<br>
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using<br>
> watchdog device '/dev/watchdog'<br>
> Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child:<br>
> Servant pcmk is healthy (age: 0)<br>
> Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Verbose mode<br>
> enabled.<br>
> Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Watchdog enabled.<br>
> Jul 13 20:54:28 ltaoperdbs02 sbd[189176]: notice: main: Doing flush +<br>
> writing 'b' to sysrq on timeout<br>
> Jul 13 20:54:28 ltaoperdbs02 sbd[189178]: pcmk: notice:<br>
> servant_pcmk: Monitoring Pacemaker health<br>
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: notice: inquisitor_child:<br>
> Servant pcmk is healthy (age: 0)<br>
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: error: watchdog_init_fd: Cannot<br>
> open watchdog device '/dev/watchdog': Device or resource busy (16)<br>
<br>
Maybe also debug the watchdog device.<br>
<br>
<br>
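One way to follow that suggestion, sketched below. The kernel allows only a single opener of a watchdog device, so the 'Device or resource busy (16)' errors above usually mean a previous sbd inquisitor, or another watchdog consumer (e.g. systemd with RuntimeWatchdogSec set), still holds it. wdctl is from util-linux and fuser from psmisc; adjust device names to match your system.

```shell
# Sketch: find out why sbd got 'Device or resource busy (16)' on /dev/watchdog.
# A watchdog device accepts only one opener at a time.
wdctl /dev/watchdog0                    # driver identity, timeout, and flags
fuser -v /dev/watchdog /dev/watchdog0   # which PID currently holds the device
ps -o pid,ppid,cmd -C sbd               # any stale inquisitor from an earlier start?
```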
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid:<br>
> Servant for pcmk (pid: 189178) has terminated<br>
> Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid:<br>
> Servant for cluster (pid: 189179) has terminated<br>
> Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: notice: main: Doing flush +<br>
> writing 'b' to sysrq on timeout<br>
> Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot<br>
> open watchdog device '/dev/watchdog0': Device or resource busy (16)<br>
> Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot<br>
> open watchdog device '/dev/watchdog': Device or resource busy (16)<br>
> <br>
> if I check the systemctl status of sbd:<br>
> <br>
> systemctl status sbd.service<br>
> ● sbd.service - Shared-storage based fencing daemon<br>
> Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor<br>
> preset: disabled)<br>
> Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago<br>
> Docs: man:sbd(8)<br>
> Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid<br>
> watch (code=exited, status=0/SUCCESS)<br>
> Main PID: 185357 (sbd)<br>
> CGroup: /system.slice/sbd.service<br>
> ├─185357 sbd: inquisitor<br>
> ├─185362 sbd: watcher: Pacemaker<br>
> └─185363 sbd: watcher: Cluster<br>
> <br>
> Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based<br>
> fencing daemon...<br>
> Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush +<br>
> writing 'b' to sysrq on timeout<br>
> Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice:<br>
> servant_pcmk: Monitoring Pacemaker health<br>
> Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice:<br>
> servant_cluster: Monitoring unknown cluster health<br>
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child:<br>
> Servant cluster is healthy (age: 0)<br>
> Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using<br>
> watchdog device '/dev/watchdog'<br>
> Jul 13 20:42:15 ltaoperdbs02 systemd[1]: Started Shared-storage based<br>
> fencing daemon.<br>
> Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child:<br>
> Servant pcmk is healthy (age: 0)<br>
> <br>
> this is happening on all 3 nodes, any thoughts?<br>
<br>
Bad watchdog? <br>
<br>
> <br>
> Thanks for helping, have a good day<br>
> <br>
> Damiano<br>
> <br>
> <br>
> On Wed, Jul 14, 2021 at 10:08 AM Klaus Wenninger <<br>
> <a href="mailto:kwenning@redhat.com" target="_blank">kwenning@redhat.com</a>> wrote:<br>
> <br>
>><br>
>><br>
>> On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a>><br>
>> wrote:<br>
>><br>
>>> On 13.07.2021 23:09, damiano giuliani wrote:<br>
>>> > Hi Klaus, thanks for helping. I'm quite lost because I can't find the<br>
>>> > cause.<br>
>>> > I attached the corosync logs of all three nodes, hoping you guys can<br>
>>> > find and point out something I can't see. I really appreciate the effort.<br>
>>> > The old master's log seems cut off at 00:38, so nothing interesting there.<br>
>>> > The new master and the third slave logged what happened, but I can't<br>
>>> > figure out why the old master was lost.<br>
>>> ><br>
>>><br>
>>> The reason it was lost is most likely outside of pacemaker. You need to<br>
>>> check other logs on the node that was lost - maybe the BMC if this is bare<br>
>>> metal, or the hypervisor if it is a virtualized system.<br>
>>><br>
>>> All that these logs say is that ltaoperdbs02 was lost from the point of<br>
>>> view of two other nodes. It happened at the same time (around Jul 13<br>
>>> 00:40) which suggests ltaoperdbs02 had some problem indeed. Whether it<br>
>>> was software crash, hardware failure or network outage cannot be<br>
>>> determined from these logs.<br>
>>><br>
>>> What speaks against a pure network-outage is that we don't see<br>
>> the corosync membership messages on the node that died.<br>
>> Of course it is possible that the log wasn't flushed out before reboot<br>
>> but usually I'd expect that there would be enough time.<br>
>> If something kept corosync or sbd from being scheduled that would<br>
>> explain why we don't see messages from these instances.<br>
>> And that was why I was asking to check if in the setup corosync and<br>
>> sbd are able to switch to rt-scheduling.<br>
>> But of course that is all speculation, and from what we know it could<br>
>> be anything from an administrative hard shutdown via some BMC to<br>
>> whatever else.<br>
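The rt-scheduling check Klaus asks about can be done roughly as below. This is a sketch: chrt is from util-linux and pgrep from procps, and the expectation of a realtime policy (SCHED_RR or SCHED_FIFO rather than SCHED_OTHER) is an assumption about a default corosync/sbd setup.

```shell
# Sketch: verify corosync and sbd actually obtained realtime scheduling.
# A policy of SCHED_OTHER here would support the theory that they were
# not scheduled in time to report membership loss before the reboot.
for pid in $(pgrep -x 'corosync|sbd'); do
    chrt -p "$pid"    # prints the current scheduling policy and rt priority
done
```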
>><br>
>>><br>
>>> > something interesting could be the stonith logs of the new master and<br>
>>> > the third slave:<br>
>>> ><br>
>>> > NEW MASTER:<br>
>>> > grep stonith-ng /var/log/messages<br>
>>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node<br>
>>> ltaoperdbs02<br>
>>> > state is now lost<br>
>>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer<br>
>>> > with id=1 and/or uname=ltaoperdbs02 from the membership cache<br>
>>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client<br>
>>> > crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device<br>
>>> > '(any)'<br>
>>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting<br>
>>> peer<br>
>>> > fencing (reboot) targeting ltaoperdbs02<br>
>>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find<br>
>>> > anyone to fence (reboot) ltaoperdbs02 with any device<br>
>>> > Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s<br>
>>> for<br>
>>> > ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5<br>
>>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing<br>
>>> > (reboot) by ltaoperdbs02 for<br>
>>> > crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete<br>
>>> > Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation<br>
>>> > 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for<br>
>>> > crmd.228700@ltaoperdbs03.f5d882d5: OK<br>
>>> ><br>
>>> > THIRD SLAVE:<br>
>>> > grep stonith-ng /var/log/messages<br>
>>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Node<br>
>>> ltaoperdbs02<br>
>>> > state is now lost<br>
>>> > Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Purged 1 peer<br>
>>> with<br>
>>> > id=1 and/or uname=ltaoperdbs02 from the membership cache<br>
>>> > Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]: notice: Operation<br>
>>> 'reboot'<br>
>>> > targeting ltaoperdbs02 on ltaoperdbs03 for<br>
>>> crmd.228700@ltaoperdbs03.f5d882d5:<br>
>>> > OK<br>
>>> ><br>
>>> > i really appreciate the help and what you think about it.<br>
>>> ><br>
>>> > PS: the stonith-watchdog-timeout is set to 10s (pcs property set<br>
>>> > stonith-watchdog-timeout=10s); do you suggest a different setting?<br>
>>> ><br>
>>> > Il giorno mar 13 lug 2021 alle ore 14:29 Klaus Wenninger <<br>
>>> > <a href="mailto:kwenning@redhat.com" target="_blank">kwenning@redhat.com</a>> ha scritto:<br>
>>> ><br>
>>> >><br>
>>> >><br>
>>> >> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <<br>
>>> >> <a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote:<br>
>>> >><br>
>>> >>> Hi guys,<br>
>>> >>> I'm back with some PAF Postgres cluster problems.<br>
>>> >>> Tonight the cluster fenced the master node and promoted the PAF<br>
>>> >>> resource to a new node.<br>
>>> >>> Everything went fine, except that I really don't know why it happened.<br>
>>> >>> This morning I noticed the old master had been fenced by sbd and a new<br>
>>> >>> master promoted; this happened tonight at 00:40.XX.<br>
>>> >>> Filtering the logs, I can't find any reason for the old master being<br>
>>> >>> fenced, nor for the start of the promotion of the new master (which<br>
>>> >>> seems to have gone perfectly). I'm a bit lost, because none of us is<br>
>>> >>> able to get at the real reason.<br>
>>> >>> The cluster worked flawlessly for days with no issues, till now.<br>
>>> >>> It is crucial for me to understand why this switch occurred.<br>
>>> >>><br>
>>> >>> I attached the current status, configuration, and logs.<br>
>>> >>> In the old master node's log I can't find any reason;<br>
>>> >>> on the new master the only things are the fencing and the promotion.<br>
>>> >>><br>
>>> >>><br>
>>> >>> PS:<br>
>>> >>> could be this the reason of fencing?<br>
>>> >>><br>
>>> >>> grep -e sbd /var/log/messages<br>
>>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:<br>
>>> >>> Servant pcmk is outdated (age: 4)<br>
>>> >>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child:<br>
>>> >>> Servant pcmk is healthy (age: 0)<br>
>>> >>><br>
>>> >> That was yesterday afternoon, not 00:40 this morning.<br>
>>> >> With the watchdog-timeout set to 5s this may have been tight though.<br>
>>> >> Maybe check your other nodes for similar warnings - or check the<br>
>>> >> compressed warnings.<br>
>>> >> Maybe you can also check the journal of sbd after start to see if it<br>
>>> >> managed to switch to rt-scheduling.<br>
>>> >> Is this a bare-metal-setup or running on some hypervisor?<br>
>>> >> Unfortunately I'm not familiar enough with Postgres to tell if there<br>
>>> >> is anything interesting about the last messages shown before the<br>
>>> >> suspected watchdog-reboot.<br>
>>> >> Was there some administrative work done by ltauser before the reboot?<br>
>>> >> If yes, what?<br>
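Scanning the other nodes for the servant-latency warnings Klaus mentions could look like this. A sketch to run on each node: the journalctl flags are standard systemd, and the grep pattern is modeled on the 'Servant pcmk is outdated (age: 4)' message quoted earlier.

```shell
# Sketch: hunt for sbd servant-latency warnings across the journal.
journalctl -u sbd.service --since "2021-07-11" | grep -E 'outdated \(age: [0-9]+\)'
```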
>>> >><br>
>>> >> Regards,<br>
>>> >> Klaus<br>
>>> >><br>
>>> >><br>
>>> >>><br>
>>> >>> Any though and help is really appreciate.<br>
>>> >>><br>
>>> >>> Damiano<br>
>>> >>> _______________________________________________<br>
>>> >>> Manage your subscription:<br>
>>> >>> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a> <br>
>>> >>><br>
>>> >>> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a> <br>
>>> >>><br>
>>> >><br>
>>> ><br>
>>> ><br>
>>> ><br>
>>><br>
>>><br>
>><br>
<br>
<br>
<br>
</blockquote></div></div>