<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 14, 2021 at 6:40 AM Andrei Borzenkov <<a href="mailto:arvidjaar@gmail.com">arvidjaar@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 13.07.2021 23:09, damiano giuliani wrote:<br>

> Hi Klaus, thanks for helping. I'm quite lost because I can't find out the<br>

> cause.<br>

> I attached the corosync logs of all three nodes, hoping you guys can find<br>

> and point out something I can't see. I really appreciate the effort.<br>

> The old master's log seems to be cut off at 00:38, so nothing interesting there.<br>

> The new master and the third slave logged what happened, but I can't<br>

> figure out why the old master was lost.<br>

> <br>

<br>

The reason it was lost is most likely outside of pacemaker. You need to<br>

check other logs on the node that was lost, maybe the BMC if this is bare<br>

metal or the hypervisor if it is a virtualized system.<br>

<br>

All that these logs say is that ltaoperdbs02 was lost from the point of<br>

view of two other nodes. It happened at the same time (around Jul 13<br>

00:40) which suggests ltaoperdbs02 did indeed have some problem. Whether it<br>

was a software crash, a hardware failure or a network outage cannot be<br>

determined from these logs.<br>

<br></blockquote><div>What speaks against a pure network outage is that we don't see</div><div>the corosync membership messages on the node that died.</div><div>Of course it is possible that the log wasn't flushed out before the reboot,</div><div>but usually I'd expect there to be enough time.</div><div>If something kept corosync or sbd from being scheduled, that would</div><div>explain why we don't see messages from these instances.</div><div>That is why I was asking you to check whether corosync and</div><div>sbd are able to switch to rt-scheduling in this setup.</div><div>But of course that is all speculation, and from what we know it could</div><div>be nearly anything, from an administrative hard shutdown via</div><div>some BMC to whatever else. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> Something interesting could be the stonith logs of the new master and the<br>

> third slave:<br>

> <br>

> NEW MASTER:<br>

> grep stonith-ng /var/log/messages<br>

> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02<br>

> state is now lost<br>

> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer<br>

> with id=1 and/or uname=ltaoperdbs02 from the membership cache<br>

> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client<br>

> crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device<br>

> '(any)'<br>

> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer<br>

> fencing (reboot) targeting ltaoperdbs02<br>

> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find<br>

> anyone to fence (reboot) ltaoperdbs02 with any device<br>

> Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for<br>

> ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5<br>

> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing<br>

> (reboot) by ltaoperdbs02 for<br>

> crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete<br>

> Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation<br>

> 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for<br>

> crmd.228700@ltaoperdbs03.f5d882d5: OK<br>

> <br>

> THIRD SLAVE:<br>

> grep stonith-ng /var/log/messages<br>

> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02<br>

> state is now lost<br>

> Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer with<br>

> id=1 and/or uname=ltaoperdbs02 from the membership cache<br>

> Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot'<br>

> targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5:<br>

> OK<br>

> <br>

> I really appreciate the help and would like to know what you think about it.<br>

> <br>

> PS: the stonith watchdog timeout should be set to 10s (pcs property set<br>

> stonith-watchdog-timeout=10s). Would you suggest a different setting?<br>

> <br>

> On Tue, Jul 13, 2021 at 14:29, Klaus Wenninger <<br>

> <a href="mailto:kwenning@redhat.com" target="_blank">kwenning@redhat.com</a>> wrote:<br>

> <br>

>><br>

>><br>

>> On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <<br>

>> <a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote:<br>

>><br>

>>> Hi guys,<br>

>>> I'm back with some PAF postgres cluster problems.<br>

>>> Last night the cluster fenced the master node and promoted the PAF resource<br>

>>> to a new node.<br>

>>> Everything went fine, except that I really don't know why it happened.<br>

>>> This morning I noticed the old master had been fenced by sbd and a new<br>

>>> master had been promoted; this happened last night at 00:40:XX.<br>

>>> Filtering the logs, I can't find any reason why the old master was<br>

>>> fenced and the promotion of the new master started (which seems to have gone<br>

>>> perfectly). At this point I'm a bit lost because none of us is able to<br>

>>> find the real reason.<br>

>>> The cluster worked flawlessly for days with no issues, until now.<br>

>>> It is crucial for me to understand why this switch occurred.<br>

>>><br>

>>> I attached the current status, configuration and logs.<br>

>>> In the old master node's log I can't find any reason.<br>

>>> On the new master the only thing is the fencing and the promotion.<br>

>>><br>

>>><br>

>>> PS:<br>

>>> Could this be the reason for the fencing?<br>

>>><br>

>>> grep  -e sbd /var/log/messages<br>

>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child:<br>

>>> Servant pcmk is outdated (age: 4)<br>

>>> Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child:<br>

>>> Servant pcmk is healthy (age: 0)<br>

>>><br>

>> That was yesterday afternoon and not 0:40 today in the morning.<br>

>> With the watchdog-timeout set to 5s this may have been tight though.<br>

>> Maybe check your other nodes for similar warnings - or check the<br>

>> compressed warnings.<br>

>> Maybe you can as well check the journal of sbd after start to see if it<br>

>> managed to run rt-scheduled.<br>

>> Is this a bare-metal-setup or running on some hypervisor?<br>

>> Unfortunately I'm not enough into postgres to tell if there is anything<br>

>> interesting about the last<br>

>> messages shown before the suspected watchdog-reboot.<br>

>> Was there some administrative stuff done by ltauser before the reboot? If<br>

>> yes what?<br>

>><br>

>> Regards,<br>

>> Klaus<br>

>><br>

>><br>

>>><br>

>>> Any thoughts and help are really appreciated.<br>

>>><br>

>>> Damiano<br>

>>> _______________________________________________<br>

>>> Manage your subscription:<br>

>>> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>

>>><br>

>>> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>

>>><br>


>><br>

> <br>

> <br>


> <br>

<br>


<br>

</blockquote></div></div>
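<div>Regarding the rt-scheduling check discussed in this thread: a minimal sketch of one way to look at it, assuming util-linux's chrt and pidof are available on the nodes. The function names are made up for illustration and are not part of sbd or corosync.</div>

```shell
# Hypothetical helpers; assumes util-linux `chrt` and `pidof` are installed.

# Succeeds if the given policy name denotes a realtime scheduling class.
is_rt_policy() {
  case "$1" in
    SCHED_FIFO*|SCHED_RR*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads `chrt -p PID` output on stdin and prints the policy name.
policy_from_chrt() {
  awk -F': ' '/scheduling policy/ { print $2; exit }'
}

# Prints one line per running process with the given name.
report_rt() {
  for pid in $(pidof "$1"); do
    policy=$(chrt -p "$pid" | policy_from_chrt)
    if is_rt_policy "$policy"; then
      echo "$1[$pid]: realtime ($policy)"
    else
      echo "$1[$pid]: NOT realtime ($policy)"
    fi
  done
}
```

<div>Running report_rt corosync and report_rt sbd as root on each node should list SCHED_RR or SCHED_FIFO for every process if the daemons managed to switch to realtime scheduling; anything else would be worth a closer look.</div>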