[ClusterLabs] Pacemaker problems with pingd

Janusz Jaskiewicz janusz.jaskiewicz at gmail.com
Tue Aug 3 15:17:13 EDT 2021


Hello.

Please forgive the length of this email, but I wanted to provide as much
detail as possible.

I'm trying to set up a two-node cluster for my service.
I have a problem with the scenario where the network between the two nodes
breaks and they can no longer see each other.
This causes split-brain.
I know that the proper way of handling this would be to employ STONITH, but
that is not feasible for me right now (I don't have the necessary hardware
support, and I don't want to introduce another point of failure with
shared-storage-based STONITH).

To work around the split-brain scenario I introduced pingd into my
cluster, which in theory should do what I need:
pingd pings a network device, so when the NIC is broken on one of my nodes,
that node should not run the resources, because pingd fails for it.

The pingd resource is configured to update the node attribute 'pingd'
(interval: 5s, dampen: 3s, multiplier: 1000).
Based on the value of 'pingd' I have a location constraint which sets the
score for the resource DimProdClusterIP to -INFINITY when 'pingd' is not 1000.
All other resources are colocated with DimProdClusterIP, and
DimProdClusterIP starts before all of them.
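
For reference, the ping clone and the pingd-based location constraint were
created with pcs roughly as follows (reconstructed from the configuration
shown further down, so the exact command lines are approximate):

  pcs resource create ping ocf:pacemaker:ping host_list=193.30.22.33 \
      dampen=3s multiplier=1000 op monitor interval=5s timeout=4s clone
  pcs constraint location DimProdClusterIP rule score=-INFINITY pingd ne 1000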

Based on that setup I would expect that, with the resources running on
dimprod01, disconnecting dimprod02 from the network would keep the resources
from starting on dimprod02.
Unfortunately I see that, after a token interval + consensus interval, the
resources are brought up on dimprod02 for a moment and then stopped again.
This is undesirable, as it causes DRBD split-brain inconsistency, and the
cluster IP may also be taken over by the disconnected node.

I tried to debug it, but I can't figure out why it doesn't work.
I would appreciate any help/pointers.
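
(One more data point I can provide on request: the scheduler's allocation
scores as reported by crm_simulate, part of the Pacemaker CLI tools, e.g.

  crm_simulate --live-check --show-scores

if that output would be useful to anyone.)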


Following are some details of my setup and a snippet of the Pacemaker logs
with comments:

Setup details:

pcs status:
Cluster name: dimprodcluster
Cluster Summary:
  * Stack: corosync
  * Current DC: dimprod02 (version 2.0.5-9.el8_4.1-ba59be7122) - partition
with quorum
  * Last updated: Tue Aug  3 08:20:32 2021
  * Last change:  Mon Aug  2 18:24:39 2021 by root via cibadmin on dimprod01
  * 2 nodes configured
  * 8 resource instances configured

Node List:
  * Online: [ dimprod01 dimprod02 ]

Full List of Resources:
  * DimProdClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
  * WyrDimProdServer (systemd:wyr-dim): Started dimprod01
  * Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData]
(promotable):
    * Masters: [ dimprod01 ]
    * Slaves: [ dimprod02 ]
  * WyrDimProdFS (ocf::heartbeat:Filesystem): Started dimprod01
  * DimTestClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
  * Clone Set: ping-clone [ping]:
    * Started: [ dimprod01 dimprod02 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


pcs constraint
Location Constraints:
  Resource: DimProdClusterIP
    Constraint: location-DimProdClusterIP
      Rule: score=-INFINITY
        Expression: pingd ne 1000
Ordering Constraints:
  start DimProdClusterIP then promote WyrDimProdServerData-clone
(kind:Mandatory)
  promote WyrDimProdServerData-clone then start WyrDimProdFS
(kind:Mandatory)
  start WyrDimProdFS then start WyrDimProdServer (kind:Mandatory)
  start WyrDimProdServer then start DimTestClusterIP (kind:Mandatory)
Colocation Constraints:
  WyrDimProdServer with DimProdClusterIP (score:INFINITY)
  DimTestClusterIP with DimProdClusterIP (score:INFINITY)
  WyrDimProdServerData-clone with DimProdClusterIP (score:INFINITY)
(with-rsc-role:Master)
  WyrDimProdFS with DimProdClusterIP (score:INFINITY)
Ticket Constraints:


pcs resource config ping
 Resource: ping (class=ocf provider=pacemaker type=ping)
  Attributes: dampen=3s host_list=193.30.22.33 multiplier=1000
  Operations: monitor interval=5s timeout=4s (ping-monitor-interval-5s)
              start interval=0s timeout=60s (ping-start-interval-0s)
              stop interval=0s timeout=5s (ping-stop-interval-0s)



cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: dimprodcluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    token: 10000
    interface {
        knet_ping_interval: 1000
        knet_ping_timeout: 1000
    }
}

nodelist {
    node {
        ring0_addr: dimprod01
        name: dimprod01
        nodeid: 1
    }

    node {
        ring0_addr: dimprod02
        name: dimprod02
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
    debug: on
}
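
(For the timing mentioned above: with token: 10000 and no explicit consensus
setting, the consensus timeout should default to 1.2 * token according to
corosync.conf(5), i.e. roughly 10 s + 12 s = 22 s before the membership change
is declared. Assuming corosync 3.x, the effective runtime values can be
verified with:

  corosync-cmapctl | grep -E 'runtime\.config\.totem\.(token|consensus)'

though the key names may vary slightly between versions.)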



Logs:
When the network is connected, 'pingd' takes the value 1000:

Aug 03 08:23:01 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_client_update) debug: Broadcasting pingd[dimprod02]=1000 (writer)
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater       [3369856]
(pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=1000
for dimprod02: OK (0)
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater       [3369856]
(crm_xml_cleanup) info: Cleaning up memory from libxml2
Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater       [3369856]
(crm_exit) info: Exiting attrd_updater | with status 0
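
(As a sanity check, the attribute value can also be queried directly with the
same attrd_updater tool that appears in these log lines:

  attrd_updater --query --name pingd --node dimprod02
)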

When the network is down, the ping agent updates 'pingd' to 0:

Aug 03 08:23:09 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_client_update) debug: Broadcasting pingd[dimprod02]=0 (writer)
Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater       [3370053]
(pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=0
for dimprod02: OK (0)
Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater       [3370053]
(crm_xml_cleanup) info: Cleaning up memory from libxml2
Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater       [3370053]
(crm_exit) info: Exiting attrd_updater | with status 0

And again:

Aug 03 08:23:17 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_client_update) debug: Broadcasting pingd[dimprod02]=0 (writer)
Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater       [3370109]
(pcmk__node_attr_request) debug: Asked pacemaker-attrd to update pingd=0
for dimprod02: OK (0)
Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater       [3370109]
(crm_xml_cleanup) info: Cleaning up memory from libxml2
Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater       [3370109]
(crm_exit) info: Exiting attrd_updater | with status 0

Then the node realizes it is not connected to the other node:

Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(pcmk_cpg_membership) info: Group attrd event 8: dimprod01 (node 1 pid
2118843) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(pcmk_cpg_membership) info: Group cib event 8: dimprod01 (node 1 pid
2118840) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] -
corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] -
corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk_cpg_membership) info: Group crmd event 8: dimprod01 (node 1 pid
2118845) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(pcmk_cpg_membership) info: Group stonith-ng event 8: dimprod01 (node 1 pid
2118841) left via cluster exit
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] -
corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(crm_update_peer_state_iter) notice: Node dimprod01 state is now lost |
nodeid=1 previous=member source=crm_update_peer_proc
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(crm_update_peer_proc) info: pcmk_cpg_membership: Node dimprod01[1] -
corosync-cpg is now offline
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(peer_update_callback) info: Node dimprod01 is no longer a peer | DC=true
old=0x4000000 new=0x0000000
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_remove) notice: Removing all dimprod01 attributes for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(crm_update_peer_state_iter) notice: Node dimprod01 state is now lost |
nodeid=1 previous=member source=crm_update_peer_proc
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_remove) debug: Removed #attrd-protocol[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(crm_reap_dead_member) info: Removing node with name dimprod01 and id 1
from membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_remove) debug: Removed master-WyrDimProdServerData[dimprod01]
for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(crm_update_peer_state_iter) notice: Node dimprod01 state is now lost |
nodeid=1 previous=member source=crm_update_peer_proc
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(reap_crm_member) notice: Purged 1 peer with id=1 and/or uname=dimprod01
from the membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_remove) debug: Removed
last-failure-WyrDimProdFS#start_0[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(controld_delete_node_state) info: Deleting transient attributes for node
dimprod01 (via CIB call 466) |
xpath=//node_state[@uname='dimprod01']/transient_attributes
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(pcmk_cpg_membership) info: Group cib event 8: dimprod02 (node 2 pid
2827043) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_remove) debug: Removed
fail-count-WyrDimProdFS#start_0[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(st_peer_update_callback) debug: Broadcasting our uname because of node 1
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_remove) debug: Removed pingd[dimprod01] for peer loss
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(crm_reap_dead_member) info: Removing node with name dimprod01 and id 1
from membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(crm_reap_dead_member) info: Removing node with name dimprod01 and id 1
from membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(reap_crm_member) notice: Purged 1 peer with id=1 and/or uname=dimprod01
from the membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(pcmk_cpg_membership) info: Group attrd event 8: dimprod02 (node 2 pid
2827046) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(match_down_event) debug: No reason to expect node 1 to be down
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(reap_crm_member) notice: Purged 1 peer with id=1 and/or uname=dimprod01
from the membership cache
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced    [2827044]
(pcmk_cpg_membership) info: Group stonith-ng event 8: dimprod02 (node 2 pid
2827044) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(peer_update_callback) warning: Stonith/shutdown of node dimprod01 was not
expected
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(abort_transition_graph) info: Transition 99 aborted: Node failure |
source=peer_update_callback:280 complete=true
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(cib_process_request) info: Forwarding cib_delete operation for section
//node_state[@uname='dimprod01']/transient_attributes to all
(origin=local/crmd/466)
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk_cpg_membership) info: Group crmd event 8: dimprod02 (node 2 pid
2827048) is member
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__set_flags_as) debug: FSA action flags 0x2000000000000 (new_actions)
for controller set by s_crmd_fsa:198
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(s_crmd_fsa) debug: Processing I_PE_CALC: [ state=S_IDLE
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based     [2827043]
(cib_process_request) info: Forwarding cib_modify operation for section
status to all (origin=local/crmd/467)
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(do_state_transition) notice: State transition S_IDLE -> S_POLICY_ENGINE |
input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__set_flags_as) debug: FSA action flags 0x00000020
(A_INTEGRATE_TIMER_STOP) for controller set by do_state_transition:559
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__set_flags_as) debug: FSA action flags 0x00000080
(A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__set_flags_as) debug: FSA action flags 0x00000200 (A_DC_TIMER_STOP)
for controller set by do_state_transition:569
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(check_join_counts) debug: Sole active cluster node is fully joined
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__clear_flags_as) debug: FSA action flags 0x00000200 (an_action) for
controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__clear_flags_as) debug: FSA action flags 0x00000020 (an_action) for
controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__clear_flags_as) debug: FSA action flags 0x00000080 (an_action) for
controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk__clear_flags_as) debug: FSA action flags 0x2000000000000 (an_action)
for controller cleared by do_fsa_action:108
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(do_pe_invoke) debug: Query 468: Requesting the current CIB: S_POLICY_ENGINE
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk_quorum_notification) info: Quorum retained | membership=1140 members=1
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd     [2827046]
(attrd_peer_update) notice: Setting pingd[dimprod02]: 1000 -> 0 | from
dimprod02
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(pcmk_quorum_notification) debug: Member[0] 2
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(crm_update_peer_state_iter) notice: Node dimprod01 state is now lost |
nodeid=1 previous=member source=crm_reap_unseen_nodes
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(peer_update_callback) info: Cluster node dimprod01 is now lost (was member)
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(match_down_event) debug: No reason to expect node 1 to be down
Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld  [2827048]
(peer_update_callback) warning: Stonith/shutdown of node dimprod01 was not
expected

And then the node decides to allocate the resources to itself and start
them, apparently ignoring the location constraint:

Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: STONITH timeout: 60000
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: STONITH of failed nodes is disabled
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: Concurrent fencing is enabled
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: Stop all active resources: false
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: Cluster is symmetric - resources can run anywhere by
default
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: On loss of quorum: Stop ALL resources
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_config) debug: Node scores: 'red' = -INFINITY, 'yellow' = 0,
'green' = 0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(determine_online_status) info: Node dimprod02 is online
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_find_resource) debug: Internally renamed WyrDimProdServerData on
dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_find_resource) debug: Internally renamed ping on dimprod02 to ping:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: DimProdClusterIP (ocf::heartbeat:IPaddr2): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: WyrDimProdServer (systemd:wyr-dim): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: WyrDimProdServerData-clone
[WyrDimProdServerData] (promotable): Slaves: [ dimprod02 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: WyrDimProdServerData-clone
[WyrDimProdServerData] (promotable): Stopped: [ dimprod01 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: WyrDimProdFS (ocf::heartbeat:Filesystem): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: DimTestClusterIP (ocf::heartbeat:IPaddr2): Stopped
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: ping-clone [ping]: Started: [ dimprod02 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: ping-clone [ping]: Stopped: [ dimprod01 ]
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource WyrDimProdServerData:0:
preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource ping:0: preferring current
location (node=dimprod02, weight=100)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to DimProdClusterIP
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to WyrDimProdServer
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocating up to 2 WyrDimProdServerData-clone
instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource WyrDimProdServerData:1
are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for
WyrDimProdServerData:1
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource WyrDimProdServerData:1 cannot run
anywhere
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocated 1 WyrDimProdServerData-clone
instances of a possible 2
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) debug: WyrDimProdServerData:0 promotion score:
1000
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) info: Promoting WyrDimProdServerData:0 (Slave
dimprod02)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) debug: WyrDimProdServerData:1 promotion score:
1000
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) info: WyrDimProdServerData-clone: Promoted 1
instances of a possible 1 to master
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to WyrDimProdFS
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to DimTestClusterIP
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocating up to 2 ping-clone instances to a
possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to ping:0
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource ping:1 are unavailable,
unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for ping:1
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource ping:1 cannot run anywhere
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocated 1 ping-clone instances of a possible
2
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(RecurringOp) info:  Start recurring monitor (30s) for DimProdClusterIP on
dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(RecurringOp) info:  Start recurring monitor (60s) for WyrDimProdServer on
dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(RecurringOp) info: Cancelling action WyrDimProdServerData:0_monitor_60000
(Slave vs. Master)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(create_promotable_actions) debug: Creating actions for
WyrDimProdServerData-clone
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(RecurringOp) info: Cancelling action WyrDimProdServerData:0_monitor_60000
(Slave vs. Master)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(RecurringOp) info:  Start recurring monitor (20s) for WyrDimProdFS on
dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(RecurringOp) info:  Start recurring monitor (30s) for DimTestClusterIP on
dimprod02
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Start      DimProdClusterIP           (
    dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Start      WyrDimProdServer           (
    dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Promote    WyrDimProdServerData:0     ( Slave ->
Master dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   WyrDimProdServerData:1 (Stopped)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Start      WyrDimProdFS               (
    dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Start      DimTestClusterIP           (
    dimprod02 )
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   ping:0 (Started dimprod02)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   ping:1 (Stopped)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(action2xml) debug: Using anonymous clone name WyrDimProdServerData for
WyrDimProdServerData:0 (aka. WyrDimProdServerData)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(action2xml) debug: Using anonymous clone name WyrDimProdServerData for
WyrDimProdServerData:0 (aka. WyrDimProdServerData)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(action2xml) debug: Using anonymous clone name WyrDimProdServerData for
WyrDimProdServerData:0 (aka. WyrDimProdServerData)
Aug 03 08:23:24 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(action2xml) debug: Using anonymous clone name WyrDimProdServerData for
WyrDimProdServerData:0 (aka. WyrDimProdServerData)

Once the resources are started, the node decides that they cannot be
allocated after all and stops them:

Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(determine_online_status) info: Node dimprod02 is online
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_find_resource) debug: Internally renamed WyrDimProdServerData on
dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(unpack_find_resource) debug: Internally renamed ping on dimprod02 to ping:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: DimProdClusterIP (ocf::heartbeat:IPaddr2): Started
dimprod02
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: WyrDimProdServer (systemd:wyr-dim): Started dimprod02
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: WyrDimProdServerData-clone
[WyrDimProdServerData] (promotable): Masters: [ dimprod02 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: WyrDimProdServerData-clone
[WyrDimProdServerData] (promotable): Stopped: [ dimprod01 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: WyrDimProdFS (ocf::heartbeat:Filesystem): Started
dimprod02
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: DimTestClusterIP (ocf::heartbeat:IPaddr2): Stopped
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: ping-clone [ping]: Started: [ dimprod02 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(log_list_item) info: Clone Set: ping-clone [ping]: Stopped: [ dimprod01 ]
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource DimProdClusterIP: preferring
current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource WyrDimProdServer: preferring
current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource WyrDimProdServerData:0:
preferring current location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource WyrDimProdFS: preferring current
location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(common_apply_stickiness) debug: Resource ping:0: preferring current
location (node=dimprod02, weight=100)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional
scores from WyrDimProdServerData-clone
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional
scores from WyrDimProdFS
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional
scores from WyrDimProdServer
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_merge_weights) info: DimProdClusterIP: Rolling back optional
scores from DimTestClusterIP
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource DimProdClusterIP are
unavailable, unclean or shutting down (dimprod02: 1, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for DimProdClusterIP
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Processing DimProdClusterIP_monitor_30000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource DimProdClusterIP cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource WyrDimProdServer are
unavailable, unclean or shutting down (dimprod02: 1, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for WyrDimProdServer
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource WyrDimProdServer cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocating up to 2 WyrDimProdServerData-clone
instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to WyrDimProdServerData:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource WyrDimProdServerData:1
are unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for
WyrDimProdServerData:1
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource WyrDimProdServerData:1 cannot run
anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocated 1 WyrDimProdServerData-clone
instances of a possible 2
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(filter_colocation_constraint) error: WyrDimProdServerData:0 must be
colocated with DimProdClusterIP but is not (dimprod02 vs. unallocated)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) debug: WyrDimProdServerData:0 promotion score:
10000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) info: Promoting WyrDimProdServerData:0 (Master
dimprod02)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) debug: WyrDimProdServerData:1 promotion score:
10000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__set_instance_roles) info: WyrDimProdServerData-clone: Promoted 1
instances of a possible 1 to master
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource WyrDimProdFS are
unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for WyrDimProdFS
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Processing WyrDimProdFS_monitor_20000
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource WyrDimProdFS cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource DimTestClusterIP are
unavailable, unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for DimTestClusterIP
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource DimTestClusterIP cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocating up to 2 ping-clone instances to a
possible 1 nodes (at most 1 per host, 2 optimal)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Assigning dimprod02 to ping:0
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: All nodes for resource ping:1 are unavailable,
unclean or shutting down (dimprod01: 0, -1000000)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(native_assign_node) debug: Could not allocate a node for ping:1
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(pcmk__native_allocate) info: Resource ping:1 cannot run anywhere
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(distribute_children) debug: Allocated 1 ping-clone instances of a possible
2
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(create_promotable_actions) debug: Creating actions for
WyrDimProdServerData-clone
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Stop       DimProdClusterIP           (
    dimprod02 )   due to node availability
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Stop       WyrDimProdServer           (
    dimprod02 )   due to node availability
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   WyrDimProdServerData:0 (Master dimprod02)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   WyrDimProdServerData:1 (Stopped)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogAction) notice:  * Stop       WyrDimProdFS               (
    dimprod02 )   due to node availability
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   DimTestClusterIP (Stopped)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   ping:0 (Started dimprod02)
Aug 03 08:23:27 dimprod02.my.clustertest.com pacemaker-schedulerd[2827047]
(LogActions) info: Leave   ping:1 (Stopped)

So the final result is OK; I would just like to avoid starting the
resources on the disconnected node in the first place.
I don't know how to debug this any further.
I would appreciate any help.
If it's helpful, I can provide the full debug log and more configuration
details.

Regards,
Janusz.