<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 14, 2021 at 12:50 PM damiano giuliani <<a href="mailto:damianogiuliani87@gmail.com">damianogiuliani87@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi guys, thanks for helping,<div><br></div><div>could be quite hard troubleshooting unexpected fails expecially if they are not easily tracked on the pacemaker / system logs.</div><div>all servers are baremetal , i requested the BMC logs hoping there are some informations.</div><div>you guys said the sbd is too tight, can you explain me and suggest a valid configuration?</div></div></blockquote><div><br></div><div>There is no one-fits-all configuration. If you are experiencing issues that sbd isn't able to timely</div><div>trigger the hardware-watchdog you can consider setting the watchdog-timeout value to a highter</div><div>number and consequently stonith-watchdog-timeout to about double that time.</div><div>But you should try to understand why your watchdog triggers and there aren't things systematically</div><div>going wrong - like e.g. sbd or corosync not being able to switch to rt-scheduling. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>ps: yesterday i resyc the old master (to slave) and rejoined into the cluster.</div><div>i found the following error into the var/log/messages about the sbd</div><div><br></div><div> grep -r sbd messages<br>Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)<br>Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)<br>Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush + writing 'b' to sysrq on timeout<br>Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health<br>Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice: servant_cluster: Monitoring unknown cluster health<br>Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant cluster is healthy (age: 0)<br>Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using watchdog device '/dev/watchdog'<br>Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)<br>Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Verbose mode enabled.<br>Jul 13 20:53:57 ltaoperdbs02 sbd[188919]: info: main: Watchdog enabled.<br>Jul 13 20:54:28 ltaoperdbs02 sbd[189176]: notice: main: Doing flush + writing 'b' to sysrq on timeout<br>Jul 13 20:54:28 ltaoperdbs02 sbd[189178]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health<br>Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)<br>Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)<br>Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for pcmk (pid: 189178) has terminated<br>Jul 13 20:54:28 ltaoperdbs02 sbd[189177]: warning: cleanup_servant_by_pid: Servant for cluster (pid: 189179) has terminated<br>Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: notice: main: Doing flush + writing 'b' to sysrq on timeout<br>Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog0': Device or resource busy (16)<br>Jul 13 20:55:30 ltaoperdbs02 sbd[189484]: error: watchdog_init_fd: Cannot open watchdog device '/dev/watchdog': Device or resource busy (16)<br></div><div><br></div></div></blockquote><div>There is something strange going on so that sbd isn't able to open the watchdog-device.</div><div>Check that there is nobody else sitting on the watchdog-device - like systemd, watchdogd, with - iirc compile-time -</div><div>configuration corosync, ... Tools like 'lsof' may be helpful for that if you catch the system in that state.</div><div>I'm guessing it doesn't always happen because that should actually prevent a successful startup of</div><div>sbd and thus systemd shouldn't bring up pacemaker.</div><div>On the other hand competing for /dev/watchdog shouldn't introduce unexpected watchdog-reboots</div><div>as sbd will either fail opening the device and not come up thus or open the device and keep it open</div><div>for the time being so that nobody else is able to open it.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div>if i check the systemctl status sbd:</div><div><br></div><div>systemctl status sbd.service<br>● sbd.service - Shared-storage based fencing daemon<br> Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)<br> Active: active (running) since Tue 2021-07-13 20:42:15 UTC; 13h ago<br> Docs: man:sbd(8)<br> Process: 185352 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)<br> Main PID: 185357 (sbd)<br> CGroup: /system.slice/sbd.service<br> ├─185357 sbd: inquisitor<br> ├─185362 sbd: watcher: Pacemaker<br> └─185363 sbd: watcher: Cluster<br><br>Jul 13 20:42:14 ltaoperdbs02 systemd[1]: Starting Shared-storage based fencing daemon...<br>Jul 13 20:42:14 ltaoperdbs02 sbd[185352]: notice: main: Doing flush + writing 'b' to sysrq on timeout<br>Jul 13 20:42:14 ltaoperdbs02 sbd[185362]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health<br>Jul 13 20:42:14 ltaoperdbs02 sbd[185363]: cluster: notice: servant_cluster: Monitoring unknown cluster health<br>Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant cluster is healthy (age: 0)<br>Jul 13 20:42:15 ltaoperdbs02 sbd[185357]: notice: watchdog_init: Using watchdog device '/dev/watchdog'<br>Jul 13 20:42:15 ltaoperdbs02 systemd[1]: Started Shared-storage based fencing daemon.<br>Jul 13 20:42:19 ltaoperdbs02 sbd[185357]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)<br></div><div><br></div></div></blockquote><div>So at least for sbd there don't seem to be systematic issues switching to rt-scheduling</div><div>as we would see it moaning in the logs above.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div>this is happening to all 3 nodes, any toughts?</div><div><br></div><div>Thanks for helping, have as good day</div><div><br></div><div>Damiano</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Il giorno mer 14 lug 2021 alle ore 10:08 Klaus Wenninger <<a href="mailto:kwenning@redhat.com" target="_blank">kwenning@redhat.com</a>> ha scritto:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 13.07.2021 23:09, damiano giuliani wrote:<br>
> Hi Klaus, thanks for helping, im quite lost because cant find out the<br>
> causes.<br>
> i attached the corosync logs of all three nodes hoping you guys can find<br>
> and hint me something i cant see. i really appreciate the effort.<br>
> the old master log seems cutted at 00:38. so nothing interessing.<br>
> the new master and the third slave logged what its happened. but i cant<br>
> figure out the cause the old master went lost.<br>
> <br>
<br>
The reason it was lost is most likely outside of pacemaker. You need to<br>
check other logs on the node that was lost, may be BMC if this is bare<br>
metal or hypervisor if it is virtualized system.<br>
<br>
All that these logs say is that ltaoperdbs02 was lost from the point of<br>
view of two other nodes. It happened at the same time (around Jul 13<br>
00:40) which suggests ltaoperdbs02 had some problem indeed. Whether it<br>
was software crash, hardware failure or network outage cannot be<br>
determined from these logs.<br>
<br></blockquote><div>What speaks against a pure network-outage is that we don't see</div><div>the corosync memberhip messages on the node that died.</div><div>Of course it is possible that the log wasn't flushed out before reboot</div><div>but usually I'd expect that there would be enough time.</div><div>If something kept corosync or sbd from being scheduled that would</div><div>explain why we don't see messages from these instances.</div><div>And that was why I was asking to check if in the setup corosync and</div><div>sbd are able to switch to rt-scheduling.</div><div>But of course that is all speculations and from what we know it can</div><div>be merely anything from an administrative hard shutdown via</div><div>some BMC to whatever. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> something interessing could be the stonith logs of the new master and the<br>
> third slave:<br>
> <br>
> NEW MASTER:<br>
> grep stonith-ng /var/log/messages<br>
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Node ltaoperdbs02<br>
> state is now lost<br>
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Purged 1 peer<br>
> with id=1 and/or uname=ltaoperdbs02 from the membership cache<br>
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Client<br>
> crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device<br>
> '(any)'<br>
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Requesting peer<br>
> fencing (reboot) targeting ltaoperdbs02<br>
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Couldn't find<br>
> anyone to fence (reboot) ltaoperdbs02 with any device<br>
> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]: notice: Waiting 10s for<br>
> ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5<br>
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Self-fencing<br>
> (reboot) by ltaoperdbs02 for<br>
> crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete<br>
> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]: notice: Operation<br>
> 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for<br>
> crmd.228700@ltaoperdbs03.f5d882d5: OK<br>
> <br>
> THIRD SLAVE:<br>
> grep stonith-ng /var/log/messages<br>
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Node ltaoperdbs02<br>
> state is now lost<br>
> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]: notice: Purged 1 peer with<br>
> id=1 and/or uname=ltaoperdbs02 from the membership cache<br>
> Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]: notice: Operation 'reboot'<br>
> targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5:<br>
> OK<br>
> <br>
> i really appreciate the help and what you think about it.<br>
> <br>
> PS the stonith should be set to 10s (pcs property set<br>
> stonith-watchdog-timeout=10s) are u suggest different setting?<br>
> <br>
> Il giorno mar 13 lug 2021 alle ore 14:29 Klaus Wenninger <<br>
> <a href="mailto:kwenning@redhat.com" target="_blank">kwenning@redhat.com</a>> ha scritto:<br>
> <br>
>><br>
>><br>
>> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <<br>
>> <a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote:<br>
>><br>
>>> Hi guys,<br>
>>> im back with some PAF postgres cluster problems.<br>
>>> tonight the cluster fenced the master node and promote the PAF resource<br>
>>> to a new node.<br>
>>> everything went fine, unless i really dont know why.<br>
>>> so this morning i noticed the old master was fenced by sbd and a new<br>
>>> master was promoted, this happen tonight at 00.40.XX.<br>
>>> filtering the logs i cant find out the any reasons why the old master was<br>
>>> fenced and the start of promotion of the new master (which seems went<br>
>>> perfectly), at certain point, im a bit lost cuz non of us can is able to<br>
>>> get the real reason.<br>
>>> the cluster worked flawessy for days with no issues, till now.<br>
>>> crucial for me uderstand why this switch occured.<br>
>>><br>
>>> a attached the current status and configuration and logs.<br>
>>> on the old master node log cant find any reasons<br>
>>> on the new master the only thing is the fencing and the promotion.<br>
>>><br>
>>><br>
>>> PS:<br>
>>> could be this the reason of fencing?<br>
>>><br>
>>> grep -e sbd /var/log/messages<br>
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:<br>
>>> Servant pcmk is outdated (age: 4)<br>
>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: notice: inquisitor_child:<br>
>>> Servant pcmk is healthy (age: 0)<br>
>>><br>
>> That was yesterday afternoon and not 0:40 today in the morning.<br>
>> With the watchdog-timeout set to 5s this may have been tight though.<br>
>> Maybe check your other nodes for similar warnings - or check the<br>
>> compressed warnings.<br>
>> Maybe you can as well check the journal of sbd after start to see if it<br>
>> managed to run rt-scheduled.<br>
>> Is this a bare-metal-setup or running on some hypervisor?<br>
>> Unfortunately I'm not enough into postgres to tell if there is anything<br>
>> interesting about the last<br>
>> messages shown before the suspected watchdog-reboot.<br>
>> Was there some administrative stuff done by ltauser before the reboot? If<br>
>> yes what?<br>
>><br>
>> Regards,<br>
>> Klaus<br>
>><br>
>><br>
>>><br>
>>> Any though and help is really appreciate.<br>
>>><br>
>>> Damiano<br>
>>> _______________________________________________<br>
>>> Manage your subscription:<br>
>>> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
>>><br>
>>> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
>>><br>
>> _______________________________________________<br>
>> Manage your subscription:<br>
>> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
>><br>
>> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
>><br>
> <br>
> <br>
> _______________________________________________<br>
> Manage your subscription:<br>
> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
> <br>
> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
> <br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
<br>
</blockquote></div></div>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
</blockquote></div>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
</blockquote></div></div>