[ClusterLabs] controlling cluster behavior on startup

Mon Jan 29 13:05:05 EST 2024

Hi,

I have configured clusters of node pairs, so each cluster has 2 nodes.  The cluster members are statically defined in corosync.conf before corosync or pacemaker is started, and quorum {two_node: 1} is set.

When both nodes are powered off and I power them on, they do not start pacemaker at exactly the same time.  The time difference may be a few minutes depending on other factors outside the nodes.

My goals are (I call the first node to start pacemaker "node1"):
1) I want to control how long pacemaker on node1 waits before fencing node2 if node2 does not start pacemaker.
2) If node2 starts pacemaker while node1 is part-way through that waiting period, so that they detect each other, I would like them to proceed immediately to probing resource state and starting any resources that are down, rather than waiting until the end of that "grace period".

From the documentation, it looks like dc-deadtime controls #1, and #2 is the expected normal behavior.  However, I'm seeing fence actions before dc-deadtime has passed.

Am I misunderstanding Pacemaker's expected behavior and/or how dc-deadtime should be used?

One possibly unusual aspect of this cluster is that these two nodes are stateless - they PXE boot from an image on another server - and I build the cluster configuration at boot time with a series of pcs commands, because the nodes have no local storage for this purpose.  The commands are:

['pcs', 'cluster', 'start']
['pcs', 'property', 'set', 'stonith-action=off']
['pcs', 'property', 'set', 'cluster-recheck-interval=60']
['pcs', 'property', 'set', 'start-failure-is-fatal=false']
['pcs', 'property', 'set', 'dc-deadtime=300']
['pcs', 'stonith', 'create', 'fence_gopher11', 'fence_powerman', 'ip=192.168.64.65', 'pcmk_host_check=static-list', 'pcmk_host_list=gopher11,gopher12']
['pcs', 'stonith', 'create', 'fence_gopher12', 'fence_powerman', 'ip=192.168.64.65', 'pcmk_host_check=static-list', 'pcmk_host_list=gopher11,gopher12']
['pcs', 'resource', 'create', 'gopher11_zpool', 'ocf:llnl:zpool', 'import_options="-f -N -d /dev/disk/by-vdev"', 'pool=gopher11', 'op', 'start', 'timeout=805']
...
['pcs', 'property', 'set', 'no-quorum-policy=ignore']

I could, instead, generate a CIB so that when Pacemaker is started, it has a full config.  Is that better?
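
For illustration, a minimal sketch of a related middle ground: applying the same pcs commands to an offline CIB file and pushing the result once, so the running cluster sees a single configuration change. This is not the same as pre-seeding the CIB before Pacemaker starts. The helper name batch_to_cib_file and the path /tmp/cib.xml are hypothetical, only a few of the commands above are repeated, and 'pcs cluster start' is left out because it is not a CIB change. The sketch only builds the argv lists; it does not run them.

```python
# Hypothetical scratch path (assumption; any writable location such as a tmpfs would do).
CIB_PATH = "/tmp/cib.xml"

# A few of the CIB-modifying commands from the list above, copied verbatim.
COMMANDS = [
    ["pcs", "property", "set", "stonith-action=off"],
    ["pcs", "property", "set", "dc-deadtime=300"],
    ["pcs", "property", "set", "no-quorum-policy=ignore"],
]


def batch_to_cib_file(commands, cib_path):
    """Return argv lists that edit an offline CIB file and push it once.

    Hypothetical helper: it rewrites each 'pcs ...' command to 'pcs -f FILE ...'
    and brackets them with 'pcs cluster cib FILE' and 'pcs cluster cib-push FILE'.
    """
    # Dump the current CIB from the running cluster into the file.
    batched = [["pcs", "cluster", "cib", cib_path]]
    for argv in commands:
        if not argv or argv[0] != "pcs":
            raise ValueError(f"not a pcs command: {argv!r}")
        # '-f FILE' makes pcs operate on the file instead of the live CIB.
        batched.append(["pcs", "-f", cib_path] + list(argv[1:]))
    # Push all accumulated changes to the live cluster in one step.
    batched.append(["pcs", "cluster", "cib-push", cib_path])
    return batched


if __name__ == "__main__":
    for argv in batch_to_cib_file(COMMANDS, CIB_PATH):
        print(argv)
```

Each returned list could be passed to subprocess.run(argv, check=True) in order, after 'pcs cluster start'. Whether a single push changes the fencing behavior described here is not something this sketch demonstrates.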

thanks,
Olaf

=== corosync.conf:
totem {
    version: 2
    cluster_name: gopher11
    secauth: off
    transport: udpu
}
nodelist {
    node {
        ring0_addr: gopher11
        name: gopher11
        nodeid: 1
    }
    node {
        ring0_addr: gopher12
        name: gopher12
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}

=== Log excerpt

Here's an excerpt from the Pacemaker logs that reflects what I'm seeing.  These are from gopher12, the node that came up first.  The other node, which is not yet up, is gopher11.

Jan 25 17:55:38 gopher12 pacemakerd          [116033] (main)    notice: Starting Pacemaker 2.1.7-1.t4 | build=2.1.7 features:agent-manpages ascii-docs compat-2.0 corosync-ge-2 default-concurrent-fencing generated-manpages monotonic nagios ncurses remote systemd
Jan 25 17:55:39 gopher12 pacemaker-controld  [116040] (peer_update_callback)    info: Cluster node gopher12 is now member (was in unknown state)
Jan 25 17:55:43 gopher12 pacemaker-based     [116035] (cib_perform_op)  info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:  <nvpair id="cib-bootstrap-options-dc-deadtime" name="dc-deadtime" value="300"/>
Jan 25 17:56:00 gopher12 pacemaker-controld  [116040] (crm_timer_popped)        info: Election Trigger just popped | input=I_DC_TIMEOUT time=300000ms
Jan 25 17:56:01 gopher12 pacemaker-based     [116035] (cib_perform_op)  info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']:  <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
Jan 25 17:56:01 gopher12 pacemaker-controld  [116040] (abort_transition_graph)  info: Transition 0 aborted by cib-bootstrap-options-no-quorum-policy doing create no-quorum-policy=ignore: Configuration change | cib=0.26.0 source=te_update_diff_v2:464 path=/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options'] complete=true
Jan 25 17:56:01 gopher12 pacemaker-controld  [116040] (controld_execute_fence_action)   notice: Requesting fencing (off) targeting node gopher11 | action=11 timeout=60
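
To make the timing in the excerpt concrete, here is a small sketch that computes the intervals between the logged events and compares them with the configured dc-deadtime of 300 seconds. The timestamps are copied from the lines above. The year is not in the log, so 2024 is assumed, and the event labels are made up for this illustration.

```python
from datetime import datetime

# Timestamps copied from the log excerpt; the keys are illustrative labels.
EVENTS = {
    "pacemaker_start": "Jan 25 17:55:38",
    "dc_deadtime_set": "Jan 25 17:55:43",
    "election_trigger_popped": "Jan 25 17:56:00",
    "fencing_requested": "Jan 25 17:56:01",
}

# Value set with 'pcs property set dc-deadtime=300', in seconds.
DC_DEADTIME_SECONDS = 300


def parse(stamp, year=2024):
    """Parse a syslog-style timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start, end):
    """Seconds between two labelled events from EVENTS."""
    delta = parse(EVENTS[end]) - parse(EVENTS[start])
    return int(delta.total_seconds())


if __name__ == "__main__":
    for start, end in [
        ("pacemaker_start", "fencing_requested"),
        ("dc_deadtime_set", "election_trigger_popped"),
        ("dc_deadtime_set", "fencing_requested"),
    ]:
        secs = elapsed_seconds(start, end)
        print(f"{start} -> {end}: {secs}s (dc-deadtime is {DC_DEADTIME_SECONDS}s)")
```

By this arithmetic, fencing is requested 23 seconds after Pacemaker starts and 18 seconds after dc-deadtime is set, well short of 300 seconds, even though the election trigger line reports time=300000ms.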