<div dir="ltr"><div dir="ltr">On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com">jgdr@dalibo.com</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It seems to me the problem comes from here:<br>

<br>

  Jul 03 19:31:38 [30151] <a href="http://ph-sql-03.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-03.prod.ams.i.rdmedia.com</a>       crmd:   notice:<br>

    te_rsc_command:     Initiating notify operation<br>

    pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67<br>

  Jul 03 19:32:38 [30148] <a href="http://ph-sql-03.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-03.prod.ams.i.rdmedia.com</a>       lrmd:  warning:<br>

    operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms<br>

<br>

and here:<br>

<br>

  Jul 03 19:31:38 [11914] <a href="http://ph-sql-05.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-05.prod.ams.i.rdmedia.com</a>       lrmd:     info:<br>

    log_execute:        executing - rsc:pgsqld action:notify call_id:38<br>

    pgsqlms(pgsqld)[20881]:     <br>

  Jul 03 19:32:38 [11914] <a href="http://ph-sql-05.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-05.prod.ams.i.rdmedia.com</a>       lrmd:<br>

    warning: operation_finished:        pgsqld_notify_0:20881 - timed out after<br>

    60000ms<br>

<br>

The pgsql election occurs during the pre-promote action where all remaining<br>

nodes set their LSN location. During the promote action, the designed primary<br>

checks its LSN location is the highest one. If it is not, it just cancel the<br>

promotion so the next round will elect the best one.<br>

<br>

Back to your issue. According to the logs, both standby nodes timed out during<br>

the pre-promote action. No LSN location has been set accros the whole cluster.<br>

I couldn't see any messages from the attrd daemon related to the lsn_location<br>

attribute or other cleanup actions.<br>

<br>

I couldn't even find the INFO message from pgsqlms giving its current status<br>

before actually setting it ("Current node TL#LSN: %s"). But this message<br>

appears soon after a...checkpoint. See:<br>

<a href="https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017" rel="noreferrer" target="_blank">https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017</a><br>

<br>

Maybe you have very long checkpoints on both nodes that timed out the<br>

pre-promote action? Do you have some PostgreSQL logs showing some useful info<br>

around this timelapse?<br></blockquote><div><br></div><div>Yes I have logs that show at what time checkpoints were taken. They're attached. Currently checkpoint_timeout is set to 15 minutes. I already lowered it from the previous value of 60 minutes. I could experiment with even lower values for checkpoint_timeout as it seems the problem is that forced checkpoints are taking too long (more than 60s). Perhaps I should also increase the notify timeout?</div><div><br></div><div>Thank you very much for your time.</div></div></div>