[ClusterLabs] why and when a call of crm_attribute can be delayed ?
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Apr 25 10:46:34 UTC 2016
Sorry all, I forgot the attachment in the first mail.
Regards,
Le Mon, 25 Apr 2016 12:02:57 +0200,
Jehan-Guillaume de Rorthais <jgdr at dalibo.com> a écrit :
> Hi all,
>
> I am facing a strange issue with attrd while doing some testing on a three
> node cluster with the pgsqlms RA [1].
>
> pgsqld is my pgsqlms resource in the cluster. pgsql-ha is the master/slave
> setup on top of pgsqld.
>
> Before triggering a failure, here was the situation:
>
> * centos1: pgsql-ha slave
> * centos2: pgsql-ha slave
> * centos3: pgsql-ha master
>
> Then we triggered a failure: the node centos3 has been kill using
>
> echo c > /proc/sysrq-trigger
>
> In this situation, PEngine provide a transition where :
>
> * centos3 is fenced
> * pgsql-ha on centos2 is promoted
>
> During the pre-promote notify action in the pgsqlms RA, each remaining slave
> are setting a node attribute called lsn_location, see:
>
> https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1504
>
> crm_attribute -l reboot -t status --node "$nodename" \
> --name lsn_location --update "$node_lsn"
>
> During the promotion action in the pgsqlms RA, the RA check the lsn_location
> of the all the nodes to make sure the local one is higher or equal to all
> others. See:
>
> https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1292
>
> This is where we face a attrd behavior we don't understand.
>
> Despite we can see in the log the RA was able to set its local
> "lsn_location", during the promotion action, the RA was unable to read its
> local lsn_location":
>
> pgsqlms(pgsqld)[9003]: 2016/04/22_14:46:16
> INFO: pgsql_notify: promoting instance on node "centos2"
>
> pgsqlms(pgsqld)[9003]: 2016/04/22_14:46:16
> INFO: pgsql_notify: current node LSN: 0/1EE24000
>
> [...]
>
> pgsqlms(pgsqld)[9023]: 2016/04/22_14:46:16
> CRIT: pgsql_promote: can not get current node LSN location
>
> Apr 22 14:46:16 [5864] centos2 lrmd:
> notice: operation_finished: pgsqld_promote_0:9023:stderr
> [ Error performing operation: No such device or address ]
>
> Apr 22 14:46:16 [5864] centos2 lrmd:
> info: log_finished: finished - rsc:pgsqld
> action:promote call_id:211 pid:9023 exit-code:1 exec-time:107ms
> queue-time:0ms
>
> The error comes from:
>
> https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1320
>
> **After** this error, we can see in the log file attrd set the "lsn_location"
> of centos2:
>
> Apr 22 14:46:16 [5865] centos2
> attrd: info: attrd_peer_update:
> Setting lsn_location[centos2]: (null) -> 0/1EE24000 from centos2
>
> Apr 22 14:46:16 [5865] centos2
> attrd: info: write_attribute:
> Write out of 'lsn_location' delayed: update 189 in progress
>
>
> As I understand it, the call of crm_attribute during pre-promote notification
> has been taken into account AFTER the "promote" action, leading to this error.
> Am I right?
>
> Why and how this could happen? Could it comes from the dampen parameter? We
> did not set any dampen anywhere, is there a default value in the cluster
> setup? Could we avoid this behavior?
>
> Please, find in attachment a tarball with :
> * all cluster logfiles from the three nodes
> * the content of /var/lib/pacemaker from the three nodes:
> * CIBs
> * PEngine transitions
>
>
> Regards,
>
> [1] https://github.com/dalibo/PAF
--
Jehan-Guillaume de Rorthais
Dalibo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debug-attrd-issue.tar.bz2
Type: application/x-bzip
Size: 1022229 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160425/b41e6ad1/attachment-0004.bin>
More information about the Users
mailing list