<div dir="ltr">Hi Guys,<div><br></div><div>Thanks for the support, I was really hoping you were not on holiday yet!</div><div><br></div><div>The replication is async. Looking at the Postgres logs, it seems some updates failed because no master was available.</div><div>I don't expect resource problems (I'm investigating anyway): the nodes have 200 GB RAM, 80 CPUs and a lot of free disk space.</div><div><br></div><div>How do you suggest I find out why the monitor timed out?</div><div><br></div><div>Many thanks for your support.</div><div><br></div><div>Pepe</div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 30 Jun 2021 at 14:17, Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de" target="_blank">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>>> damiano giuliani <<a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote on 30.06.2021 at 13:44<br>
in message<br>
<CAG=zYNNe=<a href="mailto:azZaLEhe3JzKaHnSEv88Nr%2ByEo0m06hLjL4L11PCA@mail.gmail.com" target="_blank">azZaLEhe3JzKaHnSEv88Nr+yEo0m06hLjL4L11PCA@mail.gmail.com</a>>:<br>
> Hi Guys,<br>
> <br>
> Sorry to bother you; unfortunately I was called in for an issue with a<br>
> cluster I built months ago, which was fully functional until last Saturday.<br>
> <br>
> It looks like some applications lost their connection to the master, losing some<br>
> updates/inserts.<br>
> <br>
> I found the cause in the logs: the psqld-monitor timed out after<br>
> 10000ms and the master resource was demoted; the instance was stopped and then<br>
> promoted to master again, causing a few seconds of downtime (no master<br>
> during the described process).<br>
<br>
Well, I think YOU have to find out why the monitor timed out. Maybe the disks being used were too busy, maybe the memory was tight, ...<br>
WE don't know.<br>
<br>
> <br>
> I noticed a repeated message:<br>
> Update score of "ltaoperdbsXX" from 990 to 1000 because of a change in the<br>
> replication lag<br>
> Could this be some kind of network lag?<br>
> <br>
> The network should be 10 Gbps, carrying both corosync and production traffic,<br>
> with network bonding on all of the nodes.<br>
> PAF version resource-agents-paf-2.3.0-1.rhel7.noarch<br>
> Postgres psql (13.1)<br>
> pacemaker-1.1.23-1.el7.x86_64<br>
> pcs-0.9.169-3.el7.centos.x86_64<br>
> <br>
> I attached the log, which could be useful to dig further.<br>
> If someone could point me in the right direction, it would be really appreciated.<br>
> <br>
> thanks for the support<br>
> Pepe<br>
<br>
<br>
<br>
<br>
</blockquote></div>