<div dir="ltr">Hi Klaus, thanks for helping. I'm quite lost because I can't find the cause.<div>I attached the corosync logs of all three nodes, hoping you guys can spot something I can't see and give me a hint. I really appreciate the effort.</div><div>The old master's log seems to be cut off at 00:38, so there is nothing interesting in it.</div><div>The new master and the third slave logged what happened, but I can't figure out why the old master was lost.</div><div><br></div><div>Something interesting could be the stonith logs of the new master and the third slave:</div><div><br></div><div>NEW MASTER:</div><div>grep stonith-ng /var/log/messages<br>Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02 state is now lost<br>Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache<br>Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'<br>Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer fencing (reboot) targeting ltaoperdbs02<br>Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device<br>Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5<br>Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete<br>Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK<br></div><div><br></div><div>THIRD SLAVE: </div><div>grep stonith-ng /var/log/messages<br>Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Node ltaoperdbs02 state is now lost<br>Jul 13 00:40:37 ltaoperdbs04 stonith-ng[77928]:  notice: Purged 1 peer 
with id=1 and/or uname=ltaoperdbs02 from the membership cache<br>Jul 13 00:40:47 ltaoperdbs04 stonith-ng[77928]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK<br></div><div><br></div><div>I really appreciate the help and would like to hear what you think about it.<br></div><div><br></div><div>PS: the stonith watchdog timeout should be set to 10s (pcs property set stonith-watchdog-timeout=10s). Would you suggest a different setting?</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 13, 2021 at 14:29 Klaus Wenninger <<a href="mailto:kwenning@redhat.com">kwenning@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 13, 2021 at 1:43 PM damiano giuliani <<a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi guys,<div>I'm back with some PAF postgres cluster problems.</div><div>Last night the cluster fenced the master node and promoted the PAF resource on a new node.</div><div>Everything went fine, except that I really don't know why it happened.</div><div>This morning I noticed the old master had been fenced by sbd and a new master had been promoted; this happened last night at 00.40.XX.</div><div>Filtering the logs, I can't find any reason why the old master was fenced and the promotion of the new master started (the promotion itself seems to have gone perfectly). At this point I'm a bit lost, because none of us is able to find the real reason.</div><div>The cluster worked flawlessly for days with no issues, until now.</div><div>It is crucial for me to understand why this switch occurred.</div><div><br></div><div>I attached the current status and 
configuration and logs.</div><div>In the old master node's log I can't find any reason.</div><div>On the new master the only things logged are the fencing and the promotion.</div><div><br></div><div><br>PS:</div><div>Could this be the reason for the fencing?</div><div><br></div><div>grep  -e sbd /var/log/messages<br>Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)<br>Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)<br></div></div></blockquote><div>That was yesterday afternoon, not 0:40 this morning.</div><div>With the watchdog-timeout set to 5s this may have been tight though.</div><div>Maybe check your other nodes for similar warnings - or check the compressed warnings.</div><div>Maybe you can also check the journal of sbd after start to see whether it managed to run rt-scheduled.</div><div>Is this a bare-metal setup or is it running on some hypervisor?</div><div>Unfortunately I'm not familiar enough with postgres to tell if there is anything interesting about the last</div><div>messages shown before the suspected watchdog-reboot.</div><div>Was there some administrative stuff done by ltauser before the reboot? If yes, what?</div><div><br></div><div>Regards,</div><div>Klaus</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div><br></div><div>Any thoughts and help are really appreciated.<br></div><div><br></div><div>Damiano</div></div>

_______________________________________________<br>

Manage your subscription:<br>

<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>

<br>

ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>

</blockquote></div></div>


</blockquote></div>