[Pacemaker] Patches: RFC before pull request

Lars Ellenberg lars.ellenberg at linbit.com
Tue Dec 9 14:33:29 UTC 2014


Andrew,
All,

Please have a look at the patches I queued up here:
https://github.com/lge/pacemaker/commits/for-beekhof

Most (but not all) are specific to the heartbeat cluster stack.

Thanks,
	Lars

A few comments here:

-----

This changes crm_mon output, and it also changes the logging
wherever native_print is invoked:

    Low: native_print: report target-role as well

    This is for the "Why does my resource not start?" guys who
    forgot to remove the limiting target-role setting.

    Report target role (unless "Started", which is the default anyways),
    if it limits our abilities (Slave, Stopped),
    or if it differs from the current status.
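
A minimal sketch of one way to read the rule above. The enum and the
function name are hypothetical stand-ins for illustration, not the
identifiers used in the patch:

```c
#include <stdbool.h>

/* Hypothetical role values, for illustration only. */
enum role { ROLE_UNKNOWN, ROLE_STOPPED, ROLE_STARTED, ROLE_SLAVE, ROLE_MASTER };

/* Decide whether the target role is worth printing next to the resource. */
static bool should_report_target_role(enum role target, enum role current)
{
    /* Unset, or "Started" (the default): nothing interesting to report. */
    if (target == ROLE_UNKNOWN || target == ROLE_STARTED) {
        return false;
    }
    /* A limiting target role is always reported. */
    if (target == ROLE_SLAVE || target == ROLE_STOPPED) {
        return true;
    }
    /* Otherwise report only if it differs from the current status. */
    return target != current;
}
```

Under this reading, a resource that is stopped because target-role is
"Stopped" still has the target role reported, which is the case the
commit message is aimed at.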

-----

Heartbeat specific:

    Low: allow heartbeat to spawn the pengine itself, and tell crmd about it

    Heartbeat 3.0.6 now may spawn the pengine directly, and will announce
    this in the environment -- I introduced the setting "crmd_spawns_pengine".

    This improves shutdown behavior.  Otherwise I regularly find an orphaned
    pengine process after pacemaker shutdown.

-----

Heartbeat specific, as a consequence of the fix below:

    Low: add debugging aid to help spot missing set_msg_callback()s on heartbeat

    In ha_msg_dispatch(), change from rcvmsg() to readmsg().
    rcvmsg() is internally simply a wrapper around readmsg(),
    which silently deletes messages without matching callback.

    Use readmsg() directly here. It will only return unprocessed (by
    callbacks) messages, so log a warning, notice or debug message
    depending on message header information, and ha_msg_del() it ourselves.

-----

Heartbeat specific bug fix:

    High: fix stonith ignoring its own messages on heartbeat

    Since the introduction of the additional F_TYPE messages
    T_STONITH_NOTIFY and T_STONITH_TIMEOUT_VALUE, and their use as message
    types in global heartbeat cluster messages, stonith-ng was broken on the
    heartbeat cluster stack.

    When delegation was made the default, and the result could only be
    reaped by listening for the T_STONITH_NOTIFY message, no-one (but
    stonithd itself) would ever notice successful completion,
    and stonith would be re-issued forever.

    Registering callbacks for these F_TYPE fixes these hung stonith and
    stonith_admin operations on the heartbeat cluster stack.

-----

Heartbeat specific:

    Medium: fix tracking of peer client process status on heartbeat

    Don't optimistically assume that peer client processes are alive,
    or that a node that can talk to us is in fact a member of the same
    ccm partition.

    Whenever ccm tells us about a new membership, *ask* for peer client
    process status.

-----

This oneliner may well be relevant for corosync CPG as well,
possibly one of the reasons the pcmk_cpg_membership() has this funny
"appears to be online even though we think it is dead" block?

    fix crm_update_peer_proc to NOT ignore flags if partially set

    The "set_bit()" function used here actually deals with masks, not bit numbers.
    The "flag" argument should in fact be plural: flags.

    These proc flag bits are not always set one at a time,
    but for example as "crm_proc_crmd | crm_proc_cpg",
    and not necessarily cleared with the same combination.

    Ignoring to-be-set flags just because *some* of the flag bits are
    already set is clearly a bug, and may be the reason for stale process
    cache information.
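
To illustrate the kind of bug described, here is a self-contained
sketch. The flag values and function names are hypothetical, and the
"buggy" variant is an assumption about the general shape of the
problem, not a copy of the code in crm_update_peer_proc:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proc flag bits; values are illustrative. */
enum {
    PROC_CRMD = 0x0200,
    PROC_CPG  = 0x0400
};

/* Buggy variant: any overlap counts as "already set", so the update is skipped. */
static bool set_flags_buggy(uint32_t *current, uint32_t flags)
{
    if (*current & flags) {
        return false;           /* some bit already set: whole update ignored */
    }
    *current |= flags;
    return true;
}

/* Fixed variant: skip only if *all* requested bits are already set. */
static bool set_flags_fixed(uint32_t *current, uint32_t flags)
{
    if ((*current & flags) == flags) {
        return false;           /* nothing new to set */
    }
    *current |= flags;
    return true;
}
```

Starting from PROC_CRMD and setting "PROC_CRMD | PROC_CPG", the buggy
variant leaves PROC_CPG unset, while the fixed variant sets it.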

-----

Heartbeat specific:

    Medium: map heartbeat JOIN/LEAVE status to ONLINE/OFFLINE

    The rest of the code deals in "online" and "offline",
    not "join" and "leave". Need to map these states,
    or the rest of the code won't work properly.
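
A minimal sketch of such a mapping, assuming the states arrive as the
plain strings named above. The function name is hypothetical, and the
patch itself may use symbolic constants rather than literals:

```c
#include <string.h>

/* Translate heartbeat client status into the terms the rest of the code uses. */
static const char *map_client_status(const char *status)
{
    if (status == NULL) {
        return NULL;
    }
    if (strcmp(status, "join") == 0) {
        return "online";
    }
    if (strcmp(status, "leave") == 0) {
        return "offline";
    }
    return status;              /* anything else is passed through unchanged */
}
```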

-----

Generic: if shutdown is requested before the stonith connection was ever
established (due to other problems), insisting on retrying the stonith
connection confused the shutdown.

    Medium: don't trigger a stonith_reconnect if no longer required

    Get rid of some spurious error messages, and speed up shutdown,
    even if the connection to the stonith daemon failed.

-----

Non-functional change, just for readability:

    Low: use CRM_NODE_MEMBER, not CRM_NODE_ACTIVE

    ACTIVE is defined to be MEMBER anyways:
    include/crm/cluster.h:#define CRM_NODE_ACTIVE    CRM_NODE_MEMBER

    Don't confuse the reader of the code
    by implying it was something different.

-----

Heartbeat specific, packaging only:

    Low: heartbeat 3.0.6 knows how to find the daemons; drop compat symlinks




