<div dir="ltr">Hi guys,<div><br></div><div>Thanks for the support, I really hoped you were not on holiday yet!</div><div><br></div><div>The replication is async. Looking at the Postgres logs, it seems some updates failed because no master was available.</div><div>I don't expect resource problems (I'm investigating anyway): the nodes have 200 GB of RAM, 80 CPUs and a lot of free disk space.</div><div><br></div><div>How do you suggest I find out why the monitor timed out?</div><div><br></div><div>Many thanks for your support.</div><div><br></div><div>Pepe</div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 30 Jun 2021 at 14:17, Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de" target="_blank">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>>> damiano giuliani <<a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote on 30.06.2021 at 13:44<br>

in message<br>

<CAG=zYNNe=<a href="mailto:azZaLEhe3JzKaHnSEv88Nr%2ByEo0m06hLjL4L11PCA@mail.gmail.com" target="_blank">azZaLEhe3JzKaHnSEv88Nr+yEo0m06hLjL4L11PCA@mail.gmail.com</a>>:<br>

> Hi Guys,<br>

> <br>

> Sorry for bothering you. Unfortunately I was called in for an issue with a<br>

> cluster I built months ago, which was fully functional until last Saturday.<br>

> <br>

> It looks like some applications lost their connection to the master, losing some<br>

> updates/inserts.<br>

> <br>

> I found the cause in the logs: the psqld-monitor timed out after<br>

> 10000 ms and the master resource was demoted, the instance stopped and then<br>

> promoted to master again, causing a few seconds of service disruption (no master<br>

> during the described process).<br>

<br>

Well, I think YOU have to find out why the monitor timed out. Maybe the disks being used were too busy, maybe the memory was tight, ...<br>

WE don't know.<br>

<br>

> <br>

> I also noticed a repeated message:<br>

> Update score of "ltaoperdbsXX" from 990 to 1000 because of a change in the<br>

> replication lag<br>

> Could this be some kind of network lag?<br>

> <br>

> The network should be 10 Gb/s, and both corosync and production traffic run over it.<br>

> Network bonding is configured on all of the nodes.<br>

> PAF version resource-agents-paf-2.3.0-1.rhel7.noarch<br>

> Postgres psql (13.1)<br>

> pacemaker-1.1.23-1.el7.x86_64<br>

> pcs-0.9.169-3.el7.centos.x86_64<br>

> <br>

> I attached the log, which could be useful for digging further.<br>

> If someone could point me in the right direction, it would be really appreciated.<br>

> <br>

> thanks for the support<br>

> Pepe<br>

</blockquote></div>