[ClusterLabs Developers] Pacemaker issues found while testing a big setup
Vladislav Bogdanov
bubble at hoster-ok.com
Thu May 26 20:17:54 UTC 2016
Hi all,
Here is a list of issues found while testing a setup with 2 cluster
nodes, 8 remote nodes and around 450 resources. I hope it will be
useful for some polishing before the 1.1.15 release. The Pacemaker
version tested is quite close to 1.1.15-rc1.
* templates are not supported for ocf:pacemaker:remote
* fencing events may be lost due to long transition run time (already
discussed)
* cib becomes unresponsive when uploading many changes, which leads to
sbd fencing (if sbd is enabled)
* node-action-limit seems to work on a per-cluster-node basis, so it
limits the total number of operations run on all remote nodes connected
through a given cluster node
* changing many node attributes during the transition run may lead to
a transition-recalculation storm (found with a resource agent which
changes dozens of attributes)
* notice: Relying on watchdog integration for fencing - this message
probably needs to be reworded/downgraded
* application of a big enough CIB diff results in monitor failures - CPU
hog? CIB hang?
* crmd[9834]: crit: GLib: g_hash_table_lookup: assertion 'hash_table
!= NULL' failed - hope to catch this again next week, as the core dump
was lost
* pacemaker loses a resource's exit from a pending state
(Starting/Stopping/Migrating): the change is visible in the logs of the
local node (or of the node whose crmd manages the given remote node) but
is not propagated to the CIB
* crmd crash discovered after moving the DC node to standby:
segfault in crmd's remote-related code (lrmd client) - hope to catch
this again next week
* failcounts for resources on remote nodes are not properly cleaned up
(possibly related to pending states being enabled?)
* many "warning: No reason to expect node XXX to be down" when deleting
attributes on remote nodes
* "error: Query resulted in an error: Timer expired" when adding
attributes on remote nodes
* the same error when uploading a CIB patch
* attrd[23798]: notice: Update error (unknown peer uuid, retry will be
attempted once uuid is discovered): <node>[<attribute>]=(null) failed
(host=0x2921ae0) - needs to be reinvestigated
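
To illustrate the node-action-limit observation above, here is a toy Python model. It is not Pacemaker code: the function name, the node names, the split of remote-node connections between the two cluster nodes, the number of pending operations and the limit value are all made up for the example. It only shows how the number of operations allowed to start differs depending on whether the limit is counted per cluster node or per remote node.

```python
from collections import Counter

def runnable_now(pending, limit, key):
    """Return the operations that may start now when at most `limit`
    operations may be in flight per throttling key."""
    in_flight = Counter()
    started = []
    for op in pending:
        k = key(op)
        if in_flight[k] < limit:
            in_flight[k] += 1
            started.append(op)
    return started

# Hypothetical layout: 8 remote nodes, 4 connected through each cluster node.
remotes = [f"remote{i}" for i in range(1, 9)]
host_of = {r: ("node1" if i < 4 else "node2") for i, r in enumerate(remotes)}

# Three pending operations per remote node (made-up workload).
pending = [(r, f"op{j}") for r in remotes for j in range(3)]

LIMIT = 2  # made-up value, not a Pacemaker default

# Limit counted per cluster node hosting the remote connection (observed).
per_cluster_node = runnable_now(pending, LIMIT, key=lambda op: host_of[op[0]])
# Limit counted per remote node (what one might expect instead).
per_remote_node = runnable_now(pending, LIMIT, key=lambda op: op[0])

print(len(per_cluster_node), len(per_remote_node))
```

In this model the per-cluster-node count lets far fewer operations start at once than a per-remote-node count would, which matches the throttling seen in the test setup.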
If there is any interest in additional information, I can gather it next
week when I have access to the hardware again.
Hope this could be useful,
Vladislav