[ClusterLabs] why and when a call of crm_attribute can be delayed ?

Mon Apr 25 06:02:57 EDT 2016

Hi all,

I am facing a strange issue with attrd while doing some testing on a three node
cluster with the pgsqlms RA [1].

pgsqld is my pgsqlms resource in the cluster. pgsql-ha is the master/slave
setup on top of pgsqld.

Before triggering a failure, here was the situation:

  * centos1: pgsql-ha slave
  * centos2: pgsql-ha slave
  * centos3: pgsql-ha master

Then we triggered a failure: the node centos3 has been kill using 

  echo c > /proc/sysrq-trigger

In this situation, PEngine provide a transition where :

  * centos3 is fenced 
  * pgsql-ha on centos2 is promoted

During the pre-promote notify action in the pgsqlms RA, each remaining slave are
setting a node attribute called lsn_location, see: 

  https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1504

  crm_attribute -l reboot -t status --node "$nodename" \
                --name lsn_location --update "$node_lsn"

During the promotion action in the pgsqlms RA, the RA check the lsn_location of
the all the nodes to make sure the local one is higher or equal to all others.
See:

  https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1292

This is where we face a attrd behavior we don't understand.

Despite we can see in the log the RA was able to set its local
"lsn_location", during the promotion action, the RA was unable to read its
local lsn_location":

  pgsqlms(pgsqld)[9003]:  2016/04/22_14:46:16  
    INFO: pgsql_notify: promoting instance on node "centos2" 

  pgsqlms(pgsqld)[9003]:  2016/04/22_14:46:16  
    INFO: pgsql_notify: current node LSN: 0/1EE24000 

  [...]

  pgsqlms(pgsqld)[9023]:  2016/04/22_14:46:16
    CRIT: pgsql_promote: can not get current node LSN location

  Apr 22 14:46:16 [5864] centos2       lrmd:
    notice: operation_finished: pgsqld_promote_0:9023:stderr 
    [ Error performing operation: No such device or address ] 

  Apr 22 14:46:16 [5864] centos2       lrmd:     
    info: log_finished:      finished - rsc:pgsqld
    action:promote call_id:211 pid:9023 exit-code:1 exec-time:107ms
    queue-time:0ms

The error comes from:

  https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1320

**After** this error, we can see in the log file attrd set the "lsn_location" of
centos2:

  Apr 22 14:46:16 [5865] centos2
    attrd:     info: attrd_peer_update:
    Setting lsn_location[centos2]: (null) -> 0/1EE24000 from centos2 

  Apr 22 14:46:16 [5865] centos2
    attrd:     info: write_attribute:   
    Write out of 'lsn_location' delayed:    update 189 in progress

As I understand it, the call of crm_attribute during pre-promote notification
has been taken into account AFTER the "promote" action, leading to this error.
Am I right?

Why and how this could happen? Could it comes from the dampen parameter? We did
not set any dampen anywhere, is there a default value in the cluster setup?
Could we avoid this behavior?

Please, find in attachment a tarball with :
  * all cluster logfiles from the three nodes
  * the content of /var/lib/pacemaker from the three nodes:
    * CIBs
    * PEngine transitions

Regards,

[1] https://github.com/dalibo/PAF
-- 
Jehan-Guillaume de Rorthais
Dalibo