[ClusterLabs Developers] Pacemaker issues found while testing a big setup

Ken Gaillot kgaillot at redhat.com
Thu May 26 18:14:40 EDT 2016


On 05/26/2016 03:17 PM, Vladislav Bogdanov wrote:
> Hi all,
> 
> here is a list of issues found during testing of a setup with 2 cluster
> nodes, 8 remote nodes, and around 450 resources. I hope it is useful
> for some polishing before the 1.1.15 release. The Pacemaker version is
> quite close to 1.1.15-rc1.

Thanks, this is useful.

> * templates are not supported for ocf:pacemaker:remote (see the
> configuration sketch below)
> * fencing events may be lost due to long transition run time (already
> discussed)
> * the cib becomes unresponsive when uploading many changes, which
> leads to sbd fencing (if sbd is enabled)
> * node-action-limit seems to work on a per-cluster-node basis, so it
> limits the number of operations run on all remote nodes connected
> through a given cluster node
> * changing many node attributes during the transition run may lead to
> a transition-recalculation storm (found with a resource agent that
> changes dozens of attributes)
> * notice: Relying on watchdog integration for fencing - this probably
> needs to be reworded/downgraded

FYI there was a regression introduced in 1.1.14 that resulted in
have-watchdog always being true (and the above message being printed)
regardless of whether sbd was actually running. That has been fixed and
the fix will be in 1.1.15rc3 (which I intend to release tomorrow).
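
For reference, the template item in the list above refers to configurations
along the following lines, where a resource template would carry the
ocf:pacemaker:remote settings shared by the connection primitives. This is
only a sketch; the IDs and the port value are illustrative, not taken from
the original setup:

    <resources>
      <template id="remote-common" class="ocf" provider="pacemaker"
                type="remote">
        <instance_attributes id="remote-common-attrs">
          <nvpair id="remote-common-port" name="port" value="3121"/>
        </instance_attributes>
      </template>
      <!-- a remote connection primitive reusing the template -->
      <primitive id="remote-node-1" template="remote-common"/>
    </resources>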

> * application of a big enough CIB diff results in monitor failures - CPU
> hog? CIB hang?
> * crmd[9834]:     crit: GLib: g_hash_table_lookup: assertion 'hash_table
> != NULL' failed - hope to catch this again next week as coredump is lost
> * pacemaker loses a resource's exit from a pending state
> (Starting/Stopping/Migrating): the change is visible in the logs of the
> local node (or of the crmd managing a given remote node) but is not
> propagated to the CIB
> * crmd crash discovered after moving the DC node to standby:
>   segfault in crmd's remote-related code (lrmd client) - hope to catch
> this again next week
> * failcounts for resources on remote nodes are not properly cleaned up
> (possibly related to pending states being enabled?)
> * many "warning: No reason to expect node XXX to be down" when deleting
> attributes on remote nodes
> * "error: Query resulted in an error: Timer expired" when adding
> attributes on remote nodes
> * the same when uploading CIB patch
> * attrd[23798]:   notice: Update error (unknown peer uuid, retry will be
> attempted once uuid is discovered): <node>[<attribute>]=(null) failed
> (host=0x2921ae0) - needs to be reinvestigated

The above could be related to a bug introduced after 1.1.14, having to
do with reusing node IDs when removing/adding nodes. It is now fixed,
and the fix will be in 1.1.15rc3.
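
For context, the attribute operations on remote nodes mentioned above are
typically done with attrd_updater or crm_attribute; a minimal sketch of the
kind of commands involved (attribute and node names are illustrative, not
taken from the report):

    # transient attribute on a remote node, handled by attrd
    attrd_updater --name my-attr --update 1 --node remote-node-1
    attrd_updater --name my-attr --delete --node remote-node-1

    # permanent attribute, which goes through a CIB update/query
    crm_attribute --node remote-node-1 --name my-attr --update 1 \
        --lifetime forever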

> If there is any interest in additional information, I can gather it
> next week when I have access to the hardware again.
> 
> Hope this could be useful,
> 
> Vladislav





