[ClusterLabs Developers] Pacemaker issues found while testing a big setup

Thu May 26 20:17:54 UTC 2016

Hi all,

Here is a list of issues found while testing a setup with 2 cluster
nodes, 8 remote nodes and around 450 resources. I hope it is useful
for some polishing before the 1.1.15 release. The Pacemaker version
tested is quite close to 1.1.15-rc1.

* templates are not supported for ocf:pacemaker:remote
* fencing events may be lost due to long transition run time (already
discussed)
* cib becomes unresponsive when uploading many changes, which leads to
sbd fencing (if sbd is enabled)
* node-action-limit seems to work on a per-cluster-node basis, so it
limits the total number of operations run on all remote nodes connected
through a given cluster node
* changing many node attributes during a transition run may lead to
a transition recalculation storm (found with a resource agent which
changes dozens of attributes)
* "notice: Relying on watchdog integration for fencing" - this message
probably needs to be reworded or downgraded
* applying a big enough CIB diff results in monitor failures - CPU
hog? CIB hang?
* crmd[9834]:     crit: GLib: g_hash_table_lookup: assertion 'hash_table
!= NULL' failed - hope to catch this again next week, as the coredump
was lost
* pacemaker loses a resource's exit from a pending state
(Starting/Stopping/Migrating): the change is visible in the logs of the
local node (or of the node whose crmd manages the given remote node) but
is not propagated to the CIB
* crmd crash discovered after moving the DC node to standby:
   segfault in crmd's remote-related code (lrmd client) - hope to catch
this again next week
* failcounts for resources on remote nodes are not properly cleaned up
(possibly related to pending states being enabled?)
* many "warning: No reason to expect node XXX to be down" when deleting 
attributes on remote nodes
* "error: Query resulted in an error: Timer expired" when adding 
attributes on remote nodes
* the same error when uploading a CIB patch
* attrd[23798]:   notice: Update error (unknown peer uuid, retry will be 
attempted once uuid is discovered): <node>[<attribute>]=(null) failed 
(host=0x2921ae0) - needs to be reinvestigated

If there is any interest in additional information, I can gather it next
week when I have access to the hardware again.

Hope this could be useful,

Vladislav