[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

Fri Jul 5 11:09:50 EDT 2019

On Thu, 4 Jul 2019 11:38:05 +0200
Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:

> Hello,
> 
> Yesterday, my three node cluster (CentOS 7, PostgreSQL with the PAF
> resource agent) went down. For an as of yet unknown reason, the master
> (ph-sql-04) did not report to the rest of the cluster and was fenced. (I'll
> take the advice given earlier now to setup an rsyslog server...).
> Unfortunately, the cluster failed to promote on of the slaves (ph-sql-03)
> so that node was fenced too. Then quorum was lost and the stop action for
> the pgsqld resource on the last node (ph-sql-05) was executed and although
> it timed out (see my earlier post on this list) the PostgreSQL daemon was
> eventually stopped, leaving all nodes down.
> 
> The error message on ph-sql-03 was:
> 
> pgsqlms(pgsqld)[5006]: Jul 03 19:32:38  ERROR: Can not get current node LSN
> location
> Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com       lrmd:
> notice: operation_finished: pgsqld_promote_0:5006:stderr [
> ocf-exit-reason:Can not get current node LSN location ]
> Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com       lrmd:
> info: log_finished: finished - rsc:pgsqld action:promote call_id:87
> pid:5006 exit-code:1 exec-time:237ms queue-time:0ms
> Jul 03 19:32:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com       crmd:
> notice: process_lrm_event: Result of promote operation for pgsqld on
> ph-sql-03: 1 (unknown error) | call=87 key=pgsqld_promote_0 confirmed=true
> cib-update=8309
> Jul 03 19:32:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com       crmd:
> notice: process_lrm_event: ph-sql-03-pgsqld_promote_0:87 [
> ocf-exit-reason:Can not get current node LSN location\n ]
> 
> I've seen some PAF Github issues that mention this error, but not sure they
> apply to my situation. Is this a bug or is there something wrong with my
> setup?

It seems to me the problem comes from here:

  Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com       crmd:   notice:
    te_rsc_command:	Initiating notify operation
    pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67
  Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com       lrmd:  warning:
    operation_finished:	pgsqld_notify_0:30939 - timed out after 60000ms

and here:

  Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com       lrmd:     info:
    log_execute:	executing - rsc:pgsqld action:notify call_id:38
    pgsqlms(pgsqld)[20881]:	
  Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com       lrmd:
    warning: operation_finished:	pgsqld_notify_0:20881 - timed out after
    60000ms

The pgsql election occurs during the pre-promote action where all remaining
nodes set their LSN location. During the promote action, the designed primary
checks its LSN location is the highest one. If it is not, it just cancel the
promotion so the next round will elect the best one.

Back to your issue. According to the logs, both standby nodes timed out during
the pre-promote action. No LSN location has been set accros the whole cluster.
I couldn't see any messages from the attrd daemon related to the lsn_location
attribute or other cleanup actions.

I couldn't even find the INFO message from pgsqlms giving its current status
before actually setting it ("Current node TL#LSN: %s"). But this message
appears soon after a...checkpoint. See:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017

Maybe you have very long checkpoints on both nodes that timed out the
pre-promote action? Do you have some PostgreSQL logs showing some useful info
around this timelapse?