[ClusterLabs] why and when a call of crm_attribute can be delayed ?
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Fri May 6 22:27:04 UTC 2016
On Wed, 4 May 2016 09:55:34 -0500,
Ken Gaillot <kgaillot at redhat.com> wrote:
> On 04/25/2016 05:02 AM, Jehan-Guillaume de Rorthais wrote:
> > Hi all,
> >
> > I am facing a strange issue with attrd while doing some testing on a three
> > node cluster with the pgsqlms RA [1].
> >
> > pgsqld is my pgsqlms resource in the cluster. pgsql-ha is the master/slave
> > setup on top of pgsqld.
> >
> > Before triggering a failure, here was the situation:
> >
> > * centos1: pgsql-ha slave
> > * centos2: pgsql-ha slave
> > * centos3: pgsql-ha master
> >
> > Then we triggered a failure: the node centos3 has been kill using
> >
> > echo c > /proc/sysrq-trigger
> >
> > In this situation, the PEngine provides a transition where:
> >
> > * centos3 is fenced
> > * pgsql-ha on centos2 is promoted
> >
> > During the pre-promote notify action in the pgsqlms RA, each remaining
> > slave sets a node attribute called lsn_location, see:
> >
> > https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1504
> >
> > crm_attribute -l reboot -t status --node "$nodename" \
> > --name lsn_location --update "$node_lsn"
> >
> > During the promotion action in the pgsqlms RA, the RA checks the
> > lsn_location of all the nodes to make sure the local one is higher than
> > or equal to all the others. See:
> >
> > https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1292
> >
> > This is where we face an attrd behavior we don't understand.
> >
> > Although we can see in the log that the RA was able to set its local
> > "lsn_location", during the promotion action the RA was unable to read
> > its local "lsn_location":
> >
> > pgsqlms(pgsqld)[9003]: 2016/04/22_14:46:16
> > INFO: pgsql_notify: promoting instance on node "centos2"
> >
> > pgsqlms(pgsqld)[9003]: 2016/04/22_14:46:16
> > INFO: pgsql_notify: current node LSN: 0/1EE24000
> >
> > [...]
> >
> > pgsqlms(pgsqld)[9023]: 2016/04/22_14:46:16
> > CRIT: pgsql_promote: can not get current node LSN location
> >
> > Apr 22 14:46:16 [5864] centos2 lrmd:
> > notice: operation_finished: pgsqld_promote_0:9023:stderr
> > [ Error performing operation: No such device or address ]
> >
> > Apr 22 14:46:16 [5864] centos2 lrmd:
> > info: log_finished: finished - rsc:pgsqld
> > action:promote call_id:211 pid:9023 exit-code:1 exec-time:107ms
> > queue-time:0ms
> >
> > The error comes from:
> >
> > https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1320
> >
> > **After** this error, we can see in the log file attrd set the
> > "lsn_location" of centos2:
> >
> > Apr 22 14:46:16 [5865] centos2
> > attrd: info: attrd_peer_update:
> > Setting lsn_location[centos2]: (null) -> 0/1EE24000 from centos2
> >
> > Apr 22 14:46:16 [5865] centos2
> > attrd: info: write_attribute:
> > Write out of 'lsn_location' delayed: update 189 in progress
> >
> >
> > As I understand it, the call of crm_attribute during the pre-promote
> > notification was taken into account AFTER the "promote" action,
> > leading to this error. Am I right?
> >
> > Why and how could this happen? Could it come from the dampen parameter?
> > We did not set any dampen anywhere; is there a default value in the
> > cluster setup? Could we avoid this behavior?
>
> Unfortunately, that is expected. Both the cluster's call of the RA's
> notify action, and the RA's call of crm_attribute, are asynchronous. So
> there is no guarantee that anything done by the pre-promote notify will
> be complete (or synchronized across other cluster nodes) by the time the
> promote action is called.
Ok, thank you for this explanation. It helps.
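To make sure I understand the race, the sequence below is roughly what
happens (a sketch only, reusing the crm_attribute calls quoted above; the
--query/--quiet options are from crm_attribute's own help):

  # pre-promote notify (asynchronous; may still be in flight when promote starts)
  crm_attribute -l reboot -t status --node "$nodename" \
      --name lsn_location --update "$node_lsn"

  # promote action, a few milliseconds later on the same node
  crm_attribute -l reboot -t status --node "$nodename" \
      --name lsn_location --query --quiet
  # -> "Error performing operation: No such device or address"
  #    when the value has not been committed yet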
> There would be no point in the pre-promote notify waiting for the
> attribute value to be retrievable, because the cluster isn't going to
> wait for the pre-promote notify to finish before calling promote.
Oh, this is surprising. I thought the pseudo action
"*_confirmed-pre_notify_demote_0" in the transition graph was a wait for each
resource clone's return code before going on with the transition. The graph is
confusing: if the cluster isn't going to wait for the pre-promote notify to
finish before calling promote, I suppose some arrows should point directly
from the start (or post-start-notify?) action to the promote action,
shouldn't they?
This is quite worrying, as our RA relies a lot on notifications. For instance,
we try to recover a PostgreSQL instance during pre-start or pre-demote if we
detect a recover action...
> Maybe someone else can come up with a better idea, but I'm thinking
> maybe the attribute could be set as timestamp:lsn, and the promote
> action could poll attrd repeatedly (for a small duration lower than the
> typical promote timeout) until it gets lsn's with a recent timestamp
> from all nodes. One error condition to handle would be if one of the
> other slaves happens to fail or be unresponsive at that time.
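Something along those lines could look roughly like this (illustrative only:
the "<epoch>:<lsn>" format, the node list and the timeout are made up, and
$expected_nodes / $transition_start are hypothetical variables):

  # pre-promote notify: store "<epoch>:<lsn>" instead of the bare LSN
  crm_attribute -l reboot -t status --node "$nodename" \
      --name lsn_location --update "$(date +%s):${node_lsn}"

  # promote: poll until every expected node reports a recent enough value
  deadline=$(( $(date +%s) + 30 ))   # keep this well below the promote timeout
  for node in $expected_nodes; do
      until value=$(crm_attribute -l reboot -t status --node "$node" \
                        --name lsn_location --query --quiet 2>/dev/null) \
            && [ "${value%%:*}" -ge "$transition_start" ] 2>/dev/null; do
          [ "$(date +%s)" -ge "$deadline" ] && exit 1   # slave failed/unresponsive
          sleep 1
      done
  done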
We are now using "attrd_updater --private" because calling crm_attribute was
updating the CIB, breaking the transition and thus changing the notify
variables (where we detect recover actions). I suppose it is still
asynchronous; we will have to deal with this.
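The calls look roughly like this (sketched from memory; check attrd_updater(8)
for the exact option names on your Pacemaker version):

  # pre-promote notify: set a private attribute, kept in attrd only (no CIB write)
  attrd_updater --name lsn_location --update "$node_lsn" --private

  # promote: read it back; still asynchronous, so the same caveat applies
  attrd_updater --name lsn_location --query --node "$nodename"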
Thank you,
--
Jehan-Guillaume de Rorthais
Dalibo