Lars Ellenberg lars.ellenberg at linbit.com
Tue Dec 9 09:33:29 EST 2014


Please have a look at the patches I queued up here:

Most (not all) are specific to the heartbeat cluster stack.


A few comments here:


This effectively changes crm_mon output,
but also changes logging where this method is invoked:

    Low: native_print: report target-role as well

    This is for the "Why does my resource not start?" guys who
    forgot to remove the limiting target-role setting.

    Report target role (unless "Started", which is the default anyways),
    if it limits our abilities (Slave, Stopped),
    or if it differs from the current status.
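
As a rough illustration of that rule, here is a minimal sketch of the
decision. It is one possible reading of the commit message, not the actual
native_print() code; the function name should_report_target_role() and the
use of plain role strings are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of Pacemaker: decide whether the
 * target-role is worth printing next to the resource status. */
bool should_report_target_role(const char *target_role,
                               const char *current_role)
{
    assert(current_role != NULL);

    /* Unset or "Started" is the default, nothing to report. */
    if (target_role == NULL || strcmp(target_role, "Started") == 0) {
        return false;
    }
    /* These roles limit what the resource is allowed to do. */
    if (strcmp(target_role, "Slave") == 0
        || strcmp(target_role, "Stopped") == 0) {
        return true;
    }
    /* Otherwise report only if it differs from the current status. */
    return strcmp(target_role, current_role) != 0;
}
```

With this reading, a resource with target-role "Stopped" is always flagged,
which is the case the "Why does my resource not start?" question is about.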


Heartbeat specific:

    Low: allow heartbeat to spawn the pengine itself, and tell crmd about it

    Heartbeat 3.0.6 now may spawn the pengine directly, and will announce
    this in the environment -- I introduced the setting "crmd_spawns_pengine".

    This improves shutdown behavior.  Otherwise I regularly find an orphaned
    pengine process after pacemaker shutdown.


Heartbeat specific, as a consequence of the fix below:

    Low: add debugging aid to help spot missing set_msg_callback()s on heartbeat

    In ha_msg_dispatch(), change from rcvmsg() to readmsg().
    rcvmsg() is internally simply a wrapper around readmsg(),
    which silently deletes messages without matching callback.

    Use readmsg() directly here. It will only return unprocessed (by
    callbacks) messages, so log a warning, notice or debug message
    depending on message header information, and ha_msg_del() it ourselves.


Heartbeat specific bug fix:

    High: fix stonith ignoring its own messages on heartbeat

    Since the introduction of the additional F_TYPE messages
    T_STONITH_NOTIFY and T_STONITH_TIMEOUT_VALUE, and their use as message
    types in global heartbeat cluster messages, stonith-ng was broken on the
    heartbeat cluster stack.

    When delegation was made the default, and the result could only be
    reaped by listening for the T_STONITH_NOTIFY message, no-one (but
    stonithd itself) would ever notice successful completion,
    and stonith would be re-issued forever.

    Registering callbacks for these F_TYPE fixes these hung stonith and
    stonith_admin operations on the heartbeat cluster stack.


Heartbeat specific:

    Medium: fix tracking of peer client process status on heartbeat

    Don't optimistically assume that peer client processes are alive,
    or that a node that can talk to us is in fact member of the same
    ccm partition.

    Whenever ccm tells us about a new membership, *ask* for peer client
    process status.


This oneliner may well be relevant for corosync CPG as well,
possibly one of the reasons the pcmk_cpg_membership() has this funny
"appears to be online even though we think it is dead" block?

    fix crm_update_peer_proc to NOT ignore flags if partially set

    The "set_bit()" function used here actually deals with masks, not bit numbers.
    The "flag" argument should in fact be plural: flags.

    These proc flag bits are not always set one at a time,
    but for example as "crm_proc_crmd | crm_proc_cpg",
    and not necessarily cleared with the same combination.

    Ignoring to-be-set flags just because *some* of the flag bits are
    already set is clearly a bug, and may be the reason for stale process
    cache information.


Heartbeat specific:

    Medium: map heartbeat JOIN/LEAVE status to ONLINE/OFFLINE

    The rest of the code deals in "online" and "offline",
    not "join" and "leave". Need to map these states,
    or the rest of the code won't work properly.
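
A minimal sketch of such a mapping, assuming the states are passed around as
the plain strings named above. The function name map_hb_client_status() is
hypothetical and this is not the code from the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: translate heartbeat's "join"/"leave" into the
 * "online"/"offline" vocabulary used by the rest of the code.
 * Any other status string is passed through unchanged. */
const char *map_hb_client_status(const char *hb_status)
{
    assert(hb_status != NULL);

    if (strcmp(hb_status, "join") == 0) {
        return "online";
    }
    if (strcmp(hb_status, "leave") == 0) {
        return "offline";
    }
    return hb_status;
}
```

Passing unknown values through unchanged is a choice made for this sketch,
so that a status already expressed as "online" or "offline" stays intact.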


Generic: if shutdown is requested before the stonith connection was ever
established (due to other problems), insisting on retrying the stonith
connection confused the shutdown.

    Medium: don't trigger a stonith_reconnect if no longer required

    Get rid of some spurious error messages, and speed up shutdown,
    even if the connection to the stonith daemon failed.


Non-functional change, just for readability:


    ACTIVE is defined to be MEMBER anyways:
    include/crm/cluster.h:#define CRM_NODE_ACTIVE    CRM_NODE_MEMBER

    Don't confuse the reader of the code
    by implying it was something different.


Heartbeat specific, packaging only:

    Low: heartbeat 3.0.6 knows how to find the daemons; drop compat symlinks

