[ClusterLabs] Detected action arrived really late
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Apr 9 17:32:57 EDT 2018
Hi all,
Two weeks ago, I faced an issue in a cluster using the PAF resource agent
(multi-state RA for PostgreSQL) where CRMd was receiving results from
actions executed in the previous transition.
This behavior triggered an unexpected result during a simple test:
* migration threshold: 5
* pgsqld failcount: 0
* kill pgsqld master
* expected result: recovery of master pgsqld on its current node
Final result:
* migration to another node
* 15 results from notify actions in previous transition were receive "really
late" (sometime 2 transitions later!)
* 11 transitions has been computed:
* 1 because of the failcount setting asynchronously
* 8 because of the "really late" events
* 1 because PAF was lost and triggered an election instead of recovering
the master
* 15 seconds from error detection to last action of the last transition
Please, find attached the log files of this event. The following command is a
good sum up of this situation:
grep -nE ' late|Calculated' pacemaker.late.log
This comes from a Pacemaker 1.1.13 on SLES 12 SP1. Servers are VM managed by
vmWare 5.5.
I suspect I have a similar scenario on the following issue using 1.1.16:
https://github.com/ClusterLabs/PAF/issues/131
In this issue, the attribute set during notify action arrive so lately, PAF
only find part of them and consider it as a failure.
I looked at the changelog and I couldn't find any messages related to such
behaviors :/
My questions:
* what trigger such behavior?
* is it possible to avoid them with some better settings or hardware?
* when setting a private attribute, PAF is checking continuously for
attrd to reply the same value before going on with other actions. I hoped
that would be enough to consider the whole cluster share the same value for
the attribute, but it doesn't seems enough (see github issue #131). Is it
possible to wait until the whole cluster nodes have the same value
propagated everywhere?
Thanks in advance for your help!
Bonus: Should it help, here is my analysis of the logfile. Please, crop it while
answering if not needed!
L1-3
LRMd/CRMd failure detection on slesHAtest1
L.38-42
pengine plans a Recover of pgsqld:0 on slesHAtest1
L.43,45,46,50
CRMd starts the transition execution and launch the pre notify actions
L.51-60
CIB update the failcount (count + timestamp) to the CIB
L.66-67
CRMd abort the current transition because of the failcount change
L.102-106
pengine produce the same transition
L.107-109,112 and L.119,125
CRMd starts the transition execution and launch the pre notify actions and
receive the results
L.126
CRMd starts the demote action (and cancel the monitor on pgsqld first)
L.137-138
CRMd abort the transition because it receives the results of the
notify action of the previous transition
L.205-209
Because the demote already occurred, pengine plans a slave recover+promote on
slesHAtest1
L.210-229
CRMd starts the transition, do the notifications, do the stop
L.234-235
CRMd abort the transition because it receives the results of the
notify action of the previous transition
L.298-302
Because the stop occurred, pengine plans to start pgsqld on slesHAtest1 and
promote it on slesHAtest2
L.306,312-313
CRMd do the notify and start action
L.318-319
CRMd abort the transition because it receives the results of the
notify action of the previous transition
L.336-337
Because the start occurred, the post notify actions are triggered
L381-385
Because the previous transition was aborted, pengine change its mind and
plans to promote pgsqld on slesHAtest1
L.401-402
CRMd abort the transition because it receives the results of the
notify action of the previous transition
L.436-440
pengine plans the same transition (promote on slesHAtest1)
L.448-449
CRMd abort the transition because it receives the results of the
notify action of the previous transition
L.488-492
pengine plans the same transition (promote on slesHAtest1)
L.512
CRMd launch the pre notify promote, then promote pgsqld on slesHAtest1
L.520-522
The attributes set during the pre-notify action hit attrd AFTER the promote
action is initiated. Note that PAF is waiting for each attribute to be
visible from the node itself before going on.
L.525-526,529-530
CRMd abort the transition because it receives the results of the
notify action of the previous transition (two of them!)
L.531-535
Some more parameter are deleted/set from attrd. At this point, I'm not sure
what transition was setting this values. They seems to collide...
L.538-542
CRMd abort the transition because it receives the results of the
notify action of the previous transition (another two of them!)
L.543-547
Attrd keeps deleting/setting parameters coming from pre-notify actions
L.551
LRMd promote local pgsqld on slesHAtest1 (CRMd initiated it on L.512). L.551
This promotion fails because PAF is not able anymore to detect this was a
originally master recovery and trigger a promotion where the second node
wins.
L.566
Because Pacemaker ignore the notify action return value, PAF rset a
higher master score on the second node and raise an error to force the new
promotion.
L.653-658
pengine plans to recover Master on slesHAtest1, demote it, and promote
pgsqld on slesHAtest2
L.695-696
CRMd initiate the first demote, on slesHAtest1, then abort the transition
because of old events again:
L767-769
pengine plans to recover pgsqld on both node and promote it on
slesHAtest2
L.800-804
As usual, CRMd initiate the stop actions on both nodes (L.772,775) but
decide to fails the transition because of old events.
L.863-867
pengine plans to start pgsqld on both node and promote it on slesHAtest2
This one succeed \o/
Cheers,
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker.late.log
Type: text/x-log
Size: 175738 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180409/a0076982/attachment-0001.bin>
More information about the Users
mailing list