[ClusterLabs] Pacemaker problems with pingd

Klaus Wenninger kwenning at redhat.com
Thu Aug 5 08:01:56 EDT 2021


On Wed, Aug 4, 2021 at 5:30 PM Janusz Jaskiewicz <
janusz.jaskiewicz at gmail.com> wrote:

> Hello.
>
> Please forgive the length of this email, but I wanted to provide as many
> details as possible.
>
> I'm trying to set up a two-node cluster for my service.
> I have a problem with the scenario where the network between the two nodes
> breaks and they can no longer see each other.
> This causes split-brain.
> I know that the proper way of handling this would be to employ STONITH,
> but that is not feasible for me now (I don't have the necessary hardware
> support, and I don't want to introduce another point of failure with
> shared-storage-based STONITH).
>
> In order to work around the split-brain scenario I introduced pingd to my
> cluster, which in theory should do what I expect.
> pingd pings a network device, so when the NIC is broken on one of my
> nodes, that node should not run the resources, because pingd would fail
> for it.
>
As we've discussed in multiple previous threads on this list, there are
lots of failure scenarios where the cluster nodes can't see each other but
both can still ping something else on the network.
Important cases your approach wouldn't cover are also those where a node is
only partially alive: corosync membership is lost, yet the node is no
longer able to stop its resources properly.
It is therefore highly recommended that all setups relying on some kind of
self-fencing, or on bringing down resources within some timeout, be
guarded by a (hardware) watchdog.
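For illustration, on distributions that ship pcs, such a watchdog guard is
typically provided by SBD. A rough, untested sketch (the watchdog device
path and timeout are placeholders, and pcs option names can vary between
versions):

```shell
# List watchdog devices visible on this node (one must exist on every node)
pcs stonith sbd watchdog list

# SBD can only be enabled while the cluster is stopped
pcs cluster stop --all
pcs stonith sbd enable --watchdog=/dev/watchdog SBD_WATCHDOG_TIMEOUT=10
pcs cluster start --all

# Verify SBD is active on all nodes
pcs stonith sbd status
```

In diskless mode you would additionally set the stonith-watchdog-timeout
cluster property, and - as discussed below - diskless SBD still needs a
reliable quorum source.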
Previously you were probably referring to SBD, which implements such a
watchdog-guarded approach. As you've probably figured out, you can't use
SBD directly in a 2-node setup without a shared disk: pure watchdog
fencing needs a quorum decision made by at least 3 instances. If you don't
want a full-blown 3rd node, you can consider qdevice - a single instance
can be used by multiple 2-node clusters for quorum evaluation. Otherwise
you can use SBD with a shared disk.
You are right that both a shared disk and any kind of 3rd node are an
additional point of failure. The important distinction is that in both
cases we are talking about a point of failure, but not a single point of
failure - its failing would not necessarily force the services to be shut
down.
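As a rough sketch of the qdevice option (the hostname is a placeholder,
and the syntax below is from pcs 0.10 on RHEL/CentOS 8, so it may differ
elsewhere):

```shell
# On the third machine that will host the quorum device:
dnf install corosync-qnetd pcs
pcs qdevice setup model net --enable --start

# On one node of the existing 2-node cluster:
pcs host auth qnetd-host.example.com
pcs quorum device add model net host=qnetd-host.example.com algorithm=ffsplit
pcs quorum status
```

The ffsplit algorithm makes the qnetd instance grant its vote to exactly
one half of a split 2-node cluster, which is what stands in for the
missing third quorum vote here.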

Klaus

>
> The pingd resource is configured to update the value of the node
> attribute 'pingd' (interval: 5s, dampen: 3s, multiplier: 1000).
> Based on the value of 'pingd' I have a location constraint that sets the
> score to -INFINITY for resource DimProdClusterIP when 'pingd' is not 1000.
> All other resources are colocated with DimProdClusterIP, and
> DimProdClusterIP should start before all other resources.
>
> Based on that setup I would expect that when the resources run on
> dimprod01 and I disconnect dimprod02 from the network, the resources will
> not start on dimprod02.
> Unfortunately I see that after a token interval + consensus interval my
> resources are brought up for a moment and then go down again.
> This is undesirable, as it causes DRBD split-brain inconsistency, and the
> cluster IP may also be taken over by the node that is down.
>
> I tried to debug it, but I can't figure out why it doesn't work.
> I would appreciate any help/pointers.
>
>
> Following are some details of my setup and snippet of pacemaker logs with
> comments:
>
> Setup details:
>
> pcs status:
> Cluster name: dimprodcluster
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: dimprod02 (version 2.0.5-9.el8_4.1-ba59be7122) - partition
> with quorum
>   * Last updated: Tue Aug  3 08:20:32 2021
>   * Last change:  Mon Aug  2 18:24:39 2021 by root via cibadmin on
> dimprod01
>   * 2 nodes configured
>   * 8 resource instances configured
>
> Node List:
>   * Online: [ dimprod01 dimprod02 ]
>
> Full List of Resources:
>   * DimProdClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
>   * WyrDimProdServer (systemd:wyr-dim): Started dimprod01
>   * Clone Set: WyrDimProdServerData-clone [WyrDimProdServerData]
> (promotable):
>     * Masters: [ dimprod01 ]
>     * Slaves: [ dimprod02 ]
>   * WyrDimProdFS (ocf::heartbeat:Filesystem): Started dimprod01
>   * DimTestClusterIP (ocf::heartbeat:IPaddr2): Started dimprod01
>   * Clone Set: ping-clone [ping]:
>     * Started: [ dimprod01 dimprod02 ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
>
> pcs constraint
> Location Constraints:
>   Resource: DimProdClusterIP
>     Constraint: location-DimProdClusterIP
>       Rule: score=-INFINITY
>         Expression: pingd ne 1000
> Ordering Constraints:
>   start DimProdClusterIP then promote WyrDimProdServerData-clone
> (kind:Mandatory)
>   promote WyrDimProdServerData-clone then start WyrDimProdFS
> (kind:Mandatory)
>   start WyrDimProdFS then start WyrDimProdServer (kind:Mandatory)
>   start WyrDimProdServer then start DimTestClusterIP (kind:Mandatory)
> Colocation Constraints:
>   WyrDimProdServer with DimProdClusterIP (score:INFINITY)
>   DimTestClusterIP with DimProdClusterIP (score:INFINITY)
>   WyrDimProdServerData-clone with DimProdClusterIP (score:INFINITY)
> (with-rsc-role:Master)
>   WyrDimProdFS with DimProdClusterIP (score:INFINITY)
> Ticket Constraints:
>
>
> pcs resource config ping
>  Resource: ping (class=ocf provider=pacemaker type=ping)
>   Attributes: dampen=3s host_list=193.30.22.33 multiplier=1000
>   Operations: monitor interval=5s timeout=4s (ping-monitor-interval-5s)
>               start interval=0s timeout=60s (ping-start-interval-0s)
>               stop interval=0s timeout=5s (ping-stop-interval-0s)
>
>
>
> cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     cluster_name: dimprodcluster
>     transport: knet
>     crypto_cipher: aes256
>     crypto_hash: sha256
>     token: 10000
>     interface {
>         knet_ping_interval: 1000
>         knet_ping_timeout: 1000
>     }
> }
>
> nodelist {
>     node {
>         ring0_addr: dimprod01
>         name: dimprod01
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: dimprod02
>         name: dimprod02
>         nodeid: 2
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
>     timestamp: on
>     debug: on
> }
>
>
>
> Logs:
> When the network is connected, 'pingd' takes the value 1000:
>
> Aug 03 08:23:01 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_client_update) debug: Broadcasting pingd[dimprod02]=1000
> (writer)
> Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater
> [3369856] (pcmk__node_attr_request) debug: Asked pacemaker-attrd to update
> pingd=1000 for dimprod02: OK (0)
> Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater
> [3369856] (crm_xml_cleanup) info: Cleaning up memory from libxml2
> Aug 03 08:23:01 dimprod02.my.clustertest.com attrd_updater
> [3369856] (crm_exit) info: Exiting attrd_updater | with status 0
>
> When the network is down, 'pingd' is updated to 0:
>
> Aug 03 08:23:09 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_client_update) debug: Broadcasting pingd[dimprod02]=0
> (writer)
> Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater
> [3370053] (pcmk__node_attr_request) debug: Asked pacemaker-attrd to update
> pingd=0 for dimprod02: OK (0)
> Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater
> [3370053] (crm_xml_cleanup) info: Cleaning up memory from libxml2
> Aug 03 08:23:09 dimprod02.my.clustertest.com attrd_updater
> [3370053] (crm_exit) info: Exiting attrd_updater | with status 0
>
> And again:
>
> Aug 03 08:23:17 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_client_update) debug: Broadcasting pingd[dimprod02]=0
> (writer)
> Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater
> [3370109] (pcmk__node_attr_request) debug: Asked pacemaker-attrd to update
> pingd=0 for dimprod02: OK (0)
> Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater
> [3370109] (crm_xml_cleanup) info: Cleaning up memory from libxml2
> Aug 03 08:23:17 dimprod02.my.clustertest.com attrd_updater
> [3370109] (crm_exit) info: Exiting attrd_updater | with status 0
>
> Then the node realizes it is not connected to the other node:
>
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (pcmk_cpg_membership) info: Group attrd event 8: dimprod01 (node
> 1 pid 2118843) left via cluster exit
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (pcmk_cpg_membership) info: Group cib event 8: dimprod01 (node 1
> pid 2118840) left via cluster exit
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (crm_update_peer_proc) info: pcmk_cpg_membership: Node
> dimprod01[1] - corosync-cpg is now offline
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (crm_update_peer_proc) info: pcmk_cpg_membership: Node
> dimprod01[1] - corosync-cpg is now offline
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk_cpg_membership) info: Group crmd event 8: dimprod01 (node
> 1 pid 2118845) left via cluster exit
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (pcmk_cpg_membership) info: Group stonith-ng event 8: dimprod01
> (node 1 pid 2118841) left via cluster exit
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (crm_update_peer_proc) info: pcmk_cpg_membership: Node
> dimprod01[1] - corosync-cpg is now offline
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (crm_update_peer_state_iter) notice: Node dimprod01 state is now
> lost | nodeid=1 previous=member source=crm_update_peer_proc
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (crm_update_peer_proc) info: pcmk_cpg_membership: Node
> dimprod01[1] - corosync-cpg is now offline
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (peer_update_callback) info: Node dimprod01 is no longer a peer
> | DC=true old=0x4000000 new=0x0000000
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_remove) notice: Removing all dimprod01 attributes for
> peer loss
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (crm_update_peer_state_iter) notice: Node dimprod01 state is now
> lost | nodeid=1 previous=member source=crm_update_peer_proc
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_remove) debug: Removed #attrd-protocol[dimprod01] for
> peer loss
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (crm_reap_dead_member) info: Removing node with name dimprod01
> and id 1 from membership cache
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_remove) debug: Removed
> master-WyrDimProdServerData[dimprod01] for peer loss
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (crm_update_peer_state_iter) notice: Node dimprod01 state is now
> lost | nodeid=1 previous=member source=crm_update_peer_proc
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (reap_crm_member) notice: Purged 1 peer with id=1 and/or
> uname=dimprod01 from the membership cache
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_remove) debug: Removed
> last-failure-WyrDimProdFS#start_0[dimprod01] for peer loss
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (controld_delete_node_state) info: Deleting transient attributes
> for node dimprod01 (via CIB call 466) |
> xpath=//node_state[@uname='dimprod01']/transient_attributes
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (pcmk_cpg_membership) info: Group cib event 8: dimprod02 (node 2
> pid 2827043) is member
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_remove) debug: Removed
> fail-count-WyrDimProdFS#start_0[dimprod01] for peer loss
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (st_peer_update_callback) debug: Broadcasting our uname because
> of node 1
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_remove) debug: Removed pingd[dimprod01] for peer loss
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (crm_reap_dead_member) info: Removing node with name dimprod01
> and id 1 from membership cache
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (crm_reap_dead_member) info: Removing node with name dimprod01
> and id 1 from membership cache
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (reap_crm_member) notice: Purged 1 peer with id=1 and/or
> uname=dimprod01 from the membership cache
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (pcmk_cpg_membership) info: Group attrd event 8: dimprod02 (node
> 2 pid 2827046) is member
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (match_down_event) debug: No reason to expect node 1 to be down
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (reap_crm_member) notice: Purged 1 peer with id=1 and/or
> uname=dimprod01 from the membership cache
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-fenced
>  [2827044] (pcmk_cpg_membership) info: Group stonith-ng event 8: dimprod02
> (node 2 pid 2827044) is member
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (peer_update_callback) warning: Stonith/shutdown of node
> dimprod01 was not expected
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (abort_transition_graph) info: Transition 99 aborted: Node
> failure | source=peer_update_callback:280 complete=true
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (cib_process_request) info: Forwarding cib_delete operation for
> section //node_state[@uname='dimprod01']/transient_attributes to all
> (origin=local/crmd/466)
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk_cpg_membership) info: Group crmd event 8: dimprod02 (node
> 2 pid 2827048) is member
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x2000000000000
> (new_actions) for controller set by s_crmd_fsa:198
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (s_crmd_fsa) debug: Processing I_PE_CALC: [ state=S_IDLE
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-based
> [2827043] (cib_process_request) info: Forwarding cib_modify operation for
> section status to all (origin=local/crmd/467)
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (do_state_transition) notice: State transition S_IDLE ->
> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x00000020
> (A_INTEGRATE_TIMER_STOP) for controller set by do_state_transition:559
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x00000080
> (A_FINALIZE_TIMER_STOP) for controller set by do_state_transition:565
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__set_flags_as) debug: FSA action flags 0x00000200
> (A_DC_TIMER_STOP) for controller set by do_state_transition:569
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (check_join_counts) debug: Sole active cluster node is fully
> joined
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x00000200
> (an_action) for controller cleared by do_fsa_action:108
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x00000020
> (an_action) for controller cleared by do_fsa_action:108
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x00000080
> (an_action) for controller cleared by do_fsa_action:108
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk__clear_flags_as) debug: FSA action flags 0x2000000000000
> (an_action) for controller cleared by do_fsa_action:108
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (do_pe_invoke) debug: Query 468: Requesting the current CIB:
> S_POLICY_ENGINE
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk_quorum_notification) info: Quorum retained |
> membership=1140 members=1
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-attrd
> [2827046] (attrd_peer_update) notice: Setting pingd[dimprod02]: 1000 -> 0 |
> from dimprod02
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (pcmk_quorum_notification) debug: Member[0] 2
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (crm_update_peer_state_iter) notice: Node dimprod01 state is now
> lost | nodeid=1 previous=member source=crm_reap_unseen_nodes
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (peer_update_callback) info: Cluster node dimprod01 is now lost
> (was member)
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (match_down_event) debug: No reason to expect node 1 to be down
> Aug 03 08:23:23 dimprod02.my.clustertest.com pacemaker-controld
>  [2827048] (peer_update_callback) warning: Stonith/shutdown of node
> dimprod01 was not expected
>
> And then the node decides to allocate the resources to itself and start
> them, ignoring the location constraint:
>
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: STONITH timeout: 60000
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: STONITH of failed
> nodes is disabled
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: Concurrent fencing is
> enabled
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: Stop all active
> resources: false
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: Cluster is symmetric -
> resources can run anywhere by default
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: On loss of quorum:
> Stop ALL resources
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_config) debug: Node scores: 'red' =
> -INFINITY, 'yellow' = 0, 'green' = 0
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (determine_online_status) info: Node
> dimprod02 is online
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally
> renamed WyrDimProdServerData on dimprod02 to WyrDimProdServerData:0
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally
> renamed ping on dimprod02 to ping:0
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: DimProdClusterIP
> (ocf::heartbeat:IPaddr2): Stopped
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdServer
> (systemd:wyr-dim): Stopped
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set:
> WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Slaves: [
> dimprod02 ]
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set:
> WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Stopped: [
> dimprod01 ]
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdFS
> (ocf::heartbeat:Filesystem): Stopped
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: DimTestClusterIP
> (ocf::heartbeat:IPaddr2): Stopped
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone
> [ping]: Started: [ dimprod02 ]
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone
> [ping]: Stopped: [ dimprod01 ]
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> WyrDimProdServerData:0: preferring current location (node=dimprod02,
> weight=100)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> ping:0: preferring current location (node=dimprod02, weight=100)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to DimProdClusterIP
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to WyrDimProdServer
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to
> 2 WyrDimProdServerData-clone instances to a possible 1 nodes (at most 1 per
> host, 2 optimal)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to WyrDimProdServerData:0
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource WyrDimProdServerData:1 are unavailable, unclean or shutting down
> (dimprod01: 0, -1000000)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for WyrDimProdServerData:1
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource
> WyrDimProdServerData:1 cannot run anywhere
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1
> WyrDimProdServerData-clone instances of a possible 2
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug:
> WyrDimProdServerData:0 promotion score: 1000
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info: Promoting
> WyrDimProdServerData:0 (Slave dimprod02)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug:
> WyrDimProdServerData:1 promotion score: 1000
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info:
> WyrDimProdServerData-clone: Promoted 1 instances of a possible 1 to master
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to WyrDimProdFS
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to DimTestClusterIP
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to
> 2 ping-clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to ping:0
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource ping:1 are unavailable, unclean or shutting down (dimprod01: 0,
> -1000000)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for ping:1
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource ping:1
> cannot run anywhere
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1
> ping-clone instances of a possible 2
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (RecurringOp) info:  Start recurring monitor
> (30s) for DimProdClusterIP on dimprod02
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (RecurringOp) info:  Start recurring monitor
> (60s) for WyrDimProdServer on dimprod02
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (RecurringOp) info: Cancelling action
> WyrDimProdServerData:0_monitor_60000 (Slave vs. Master)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (create_promotable_actions) debug: Creating
> actions for WyrDimProdServerData-clone
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (RecurringOp) info: Cancelling action
> WyrDimProdServerData:0_monitor_60000 (Slave vs. Master)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (RecurringOp) info:  Start recurring monitor
> (20s) for WyrDimProdFS on dimprod02
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (RecurringOp) info:  Start recurring monitor
> (30s) for DimTestClusterIP on dimprod02
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Start
>  DimProdClusterIP           (                 dimprod02 )
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Start
>  WyrDimProdServer           (                 dimprod02 )
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Promote
>  WyrDimProdServerData:0     ( Slave -> Master dimprod02 )
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave
> WyrDimProdServerData:1 (Stopped)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Start
>  WyrDimProdFS               (                 dimprod02 )
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Start
>  DimTestClusterIP           (                 dimprod02 )
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave   ping:0 (Started
> dimprod02)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave   ping:1 (Stopped)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone
> name WyrDimProdServerData for WyrDimProdServerData:0 (aka.
> WyrDimProdServerData)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone
> name WyrDimProdServerData for WyrDimProdServerData:0 (aka.
> WyrDimProdServerData)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone
> name WyrDimProdServerData for WyrDimProdServerData:0 (aka.
> WyrDimProdServerData)
> Aug 03 08:23:24 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (action2xml) debug: Using anonymous clone
> name WyrDimProdServerData for WyrDimProdServerData:0 (aka.
> WyrDimProdServerData)
>
> Once the resources are started, the node decides that they cannot be
> allocated after all and stops them:
>
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (determine_online_status) info: Node
> dimprod02 is online
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally
> renamed WyrDimProdServerData on dimprod02 to WyrDimProdServerData:0
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (unpack_find_resource) debug: Internally
> renamed ping on dimprod02 to ping:0
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: DimProdClusterIP
> (ocf::heartbeat:IPaddr2): Started dimprod02
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdServer
> (systemd:wyr-dim): Started dimprod02
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set:
> WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Masters: [
> dimprod02 ]
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set:
> WyrDimProdServerData-clone [WyrDimProdServerData] (promotable): Stopped: [
> dimprod01 ]
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: WyrDimProdFS
> (ocf::heartbeat:Filesystem): Started dimprod02
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: DimTestClusterIP
> (ocf::heartbeat:IPaddr2): Stopped
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone
> [ping]: Started: [ dimprod02 ]
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (log_list_item) info: Clone Set: ping-clone
> [ping]: Stopped: [ dimprod01 ]
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> DimProdClusterIP: preferring current location (node=dimprod02, weight=100)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> WyrDimProdServer: preferring current location (node=dimprod02, weight=100)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> WyrDimProdServerData:0: preferring current location (node=dimprod02,
> weight=100)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> WyrDimProdFS: preferring current location (node=dimprod02, weight=100)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (common_apply_stickiness) debug: Resource
> ping:0: preferring current location (node=dimprod02, weight=100)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info:
> DimProdClusterIP: Rolling back optional scores from
> WyrDimProdServerData-clone
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info:
> DimProdClusterIP: Rolling back optional scores from WyrDimProdFS
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info:
> DimProdClusterIP: Rolling back optional scores from WyrDimProdServer
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_merge_weights) info:
> DimProdClusterIP: Rolling back optional scores from DimTestClusterIP
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource DimProdClusterIP are unavailable, unclean or shutting down
> (dimprod02: 1, -1000000)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for DimProdClusterIP
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Processing
> DimProdClusterIP_monitor_30000
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource
> DimProdClusterIP cannot run anywhere
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource WyrDimProdServer are unavailable, unclean or shutting down
> (dimprod02: 1, -1000000)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for WyrDimProdServer
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource
> WyrDimProdServer cannot run anywhere
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to
> 2 WyrDimProdServerData-clone instances to a possible 1 nodes (at most 1 per
> host, 2 optimal)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to WyrDimProdServerData:0
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource WyrDimProdServerData:1 are unavailable, unclean or shutting down
> (dimprod01: 0, -1000000)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for WyrDimProdServerData:1
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource
> WyrDimProdServerData:1 cannot run anywhere
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1
> WyrDimProdServerData-clone instances of a possible 2
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (filter_colocation_constraint) error:
> WyrDimProdServerData:0 must be colocated with DimProdClusterIP but is not
> (dimprod02 vs. unallocated)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug:
> WyrDimProdServerData:0 promotion score: 10000
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info: Promoting
> WyrDimProdServerData:0 (Master dimprod02)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) debug:
> WyrDimProdServerData:1 promotion score: 10000
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__set_instance_roles) info:
> WyrDimProdServerData-clone: Promoted 1 instances of a possible 1 to master
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource WyrDimProdFS are unavailable, unclean or shutting down (dimprod01:
> 0, -1000000)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for WyrDimProdFS
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Processing
> WyrDimProdFS_monitor_20000
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource
> WyrDimProdFS cannot run anywhere
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource DimTestClusterIP are unavailable, unclean or shutting down
> (dimprod01: 0, -1000000)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for DimTestClusterIP
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource
> DimTestClusterIP cannot run anywhere
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocating up to
> 2 ping-clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Assigning
> dimprod02 to ping:0
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: All nodes for
> resource ping:1 are unavailable, unclean or shutting down (dimprod01: 0,
> -1000000)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (native_assign_node) debug: Could not
> allocate a node for ping:1
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (pcmk__native_allocate) info: Resource ping:1
> cannot run anywhere
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (distribute_children) debug: Allocated 1
> ping-clone instances of a possible 2
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (create_promotable_actions) debug: Creating
> actions for WyrDimProdServerData-clone
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Stop
> DimProdClusterIP           (                 dimprod02 )   due to node
> availability
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Stop
> WyrDimProdServer           (                 dimprod02 )   due to node
> availability
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave
> WyrDimProdServerData:0 (Master dimprod02)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave
> WyrDimProdServerData:1 (Stopped)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogAction) notice:  * Stop
> WyrDimProdFS               (                 dimprod02 )   due to node
> availability
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave   DimTestClusterIP
> (Stopped)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave   ping:0 (Started
> dimprod02)
> Aug 03 08:23:27 dimprod02.my.clustertest.com
> pacemaker-schedulerd[2827047] (LogActions) info: Leave   ping:1 (Stopped)
>
> So the final result is OK; I would just like to avoid the resources being
> started on the disconnected node in the first place.
> I have no idea how to debug this further.
> I would appreciate any help.
> If it's helpful, I can provide the full debug log and more configuration
> details.
>
> Regards,
> Janusz.
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
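For reference, the "-1000000" scores in the log above are what a ping-based location rule produces on a node whose connectivity attribute is gone. A configuration of roughly the following shape is the usual way to set this up (a sketch in pcs syntax; the resource name DimProdClusterIP is taken from the log above, while the ping target, multiplier, and dampen values are placeholder assumptions):

```shell
# Sketch only: assumes ocf:pacemaker:ping writes its score into the
# node attribute "pingd" (the agent's default attribute name).

# Clone a ping resource so every node maintains the pingd attribute:
pcs resource create ping ocf:pacemaker:ping \
    host_list=192.168.1.1 multiplier=1000 dampen=5s \
    op monitor interval=10s clone

# Keep DimProdClusterIP off any node where pingd is missing or zero:
pcs constraint location DimProdClusterIP rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```

The per-node value of the attribute can be inspected with `crm_mon -A1` (which lists node attributes), which helps confirm whether the rule is evaluating against the value you expect. Note that, as discussed earlier in the thread, this only mitigates some failure modes; it is not a substitute for fencing.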