[ClusterLabs] Stop action called after reboot and on all nodes

Klaus Wenninger kwenning at redhat.com
Tue Jul 30 03:56:09 EDT 2019


On 7/30/19 9:32 AM, Thomas Singleton wrote:
> Dear all
>
> With the following cluster setup (full config at the end of this
> message): two nodes and one spare, and two resources. The test
> resource is a simple binary with a sleep loop, and its heartbeat
> script logs each action it receives.
>
>
> When one node is rebooted, two things happen that we are trying to
> understand:
>
> - When the first node is taken offline by a reboot command, the
> resource is started on the spare as expected. When the node is back
> online, the resource on the spare is stopped, again as expected, BUT
> the resource on the second node also receives a stop/start command.
>
> - When the first node is taken offline by a reboot and then comes
> back online, the resource is stopped on the spare, BUT on the first
> node the action called in the heartbeat script is the stop action
> rather than the start action one would expect (thus preventing the
> resource from coming back online).
>
> How can we explain these two actions?
Have you checked that your resource agent (the heartbeat script)
answers monitors/probes correctly in every state of your
resource?
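As a point of reference, here is a minimal sketch of the monitor
logic an OCF agent needs; the pidfile path and function name are
made up for illustration, not taken from your agent:

    # source the OCF shell helpers for the return-code variables
    . ${OCF_FUNCTIONS_DIR:-/usr/lib/ocf/lib/heartbeat}/ocf-shellfuncs

    PIDFILE="/var/run/TEST_HB.pid"   # hypothetical pidfile location

    test_hb_monitor() {
        # report "running" only if the daemon process really exists
        if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
            return $OCF_SUCCESS      # 0: resource is active
        fi
        # any code other than 0 or 7 (e.g. $OCF_ERR_GENERIC) means
        # "failed" and makes pacemaker schedule stop/start recovery
        return $OCF_NOT_RUNNING      # 7: resource is cleanly stopped
    }

When a node rejoins the cluster, pacemaker probes (runs a one-shot
monitor for) every resource on it. If such a probe reports "running"
or "failed" where pacemaker expects "not running", it schedules
exactly the kind of stop you are seeing. You can also exercise the
agent outside the cluster with ocf-tester from the resource-agents
package, e.g.

    ocf-tester -n TEST_HBNode1 /usr/lib/ocf/resource.d/heartbeat/TEST_HB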
I've also seen that you are running without fencing, which will
probably not give you the desired behavior. Since Pacemaker
behaves differently with fencing disabled, it usually makes
sense to involve some kind of fencing even at early test stages.
If you have nothing at hand that could serve as a fencing
device, you might consider sbd watchdog fencing.
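If you want to try that, a rough sketch with pcs (RHEL 7 style
syntax; the watchdog device path and timeout are examples to be
adapted to your hardware):

    # on every node: load a watchdog driver; softdog is fine for
    # testing when no hardware watchdog is present
    modprobe softdog

    # enable sbd in watchdog-only mode (no shared block device)
    pcs stonith sbd enable --watchdog=/dev/watchdog

    # let pacemaker assume a lost node has self-fenced after this
    # timeout, and switch stonith back on (it is currently
    # disabled in your config)
    pcs property set stonith-watchdog-timeout=10s
    pcs property set stonith-enabled=true

Note that enabling sbd typically requires restarting the cluster
before it takes effect.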

Klaus
>
>
> Thank you for your input
>
>
>
>
>
> Cluster config 
>
> # pcs config
> Cluster Name: cluster1
> Corosync Nodes:
>  node1 node2 nodespare
> Pacemaker Nodes:
>  node1 node2 nodespare
>
> Resources:
>  Resource: TEST_HBNode1 (class=ocf provider=heartbeat type=TEST_HB)
>   Meta Attrs: priority=50 
>   Utilization: cpu=1 memory=1000
>   Operations: monitor interval=10 timeout=20 (TEST_HBNode1-monitor-interval-10)
>               start interval=0s timeout=120 (TEST_HBNode1-start-interval-0s)
>               stop interval=0s timeout=120 (TEST_HBNode1-stop-interval-0s)
>  Resource: TEST_HBNode2 (class=ocf provider=heartbeat type=TEST_HB)
>   Meta Attrs: priority=100 
>   Utilization: cpu=1 memory=1000
>   Operations: monitor interval=10 timeout=20 (TEST_HBNode2-monitor-interval-10)
>               start interval=0s timeout=120 (TEST_HBNode2-start-interval-0s)
>               stop interval=0s timeout=120 (TEST_HBNode2-stop-interval-0s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
>   Resource: TEST_HBNode1
>     Enabled on: node1 (score:50) (id:location-TEST_HBNode1-node1-50)
>     Enabled on: nodespare (score:30) (id:location-TEST_HBNode1-nodespare-30)
>     Disabled on: node2 (score:-INFINITY) (id:location-TEST_HBNode1-node2--INFINITY)
>   Resource: TEST_HBNode2
>     Enabled on: node2 (score:100) (id:location-TEST_HBNode2-node2-100)
>     Enabled on: nodespare (score:80) (id:location-TEST_HBNode2-nodespare-80)
>     Disabled on: node1 (score:-INFINITY) (id:location-TEST_HBNode2-node1--INFINITY)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  resource-stickiness: 0
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: cluster1
>  dc-version: 1.1.19-8.el7_6.4-c3c624ea3d
>  have-watchdog: false
>  placement-strategy: utilization
>  stonith-enabled: false
>  symmetric-cluster: false
>
> Quorum:
>   Options:
>
>  


