[ClusterLabs] Pacemaker multi-state resource stop not running although "pcs status" indicates "Stopped"

Sat Aug 14 11:34:28 EDT 2021

On 13.08.2021 22:46, ChittaNagaraj, Raghav wrote:
> Hello Team,
> 
> Hope you are doing well.
> 
> We are running into an issue with multi-state resources: when the corosync process is killed on a node, the resource's stop function is not run on that node, yet the resource fails over and is started on another node in the cluster.
> 
> Note: in the output below, the actual resource names and hostnames have been changed from the original.
> 
> Snippet of pcs status before corosync is killed:
> 
> $ hostname
> pace_node_a
> 
> Snippet of "pcs status":
> colocated-resource (ocf::xxx:colocated-resource):  Started pace_node_a
> Master/Slave Set: main-multi-state-resource [main-multi]
>      Masters: [ pace_node_a ]
>      Stopped: [ pace_node_b ]
> 
> We then killed the corosync process with kill -9 on "pace_node_a".
> 
> Resulting snippet of "pcs status":
> 
> colocated-resource (ocf::xxx:colocated-resource):  Started pace_node_b
> Master/Slave Set: main-multi-state-resource [main-multi]
>      Stopped: [ pace_node_a ]
>      Masters: [ pace_node_b ]
> 
> As you can see, pcs status indicates that "main-multi-state-resource" stopped on "pace_node_a", where corosync was killed, and started on "pace_node_b". Although this status is as expected, the underlying resource managed by "main-multi-state-resource" was never actually stopped on "pace_node_a".

When you kill corosync, the node becomes isolated from the cluster and
should have been fenced, at which point it does not matter whether its
resources had been stopped or not. Besides, when you kill corosync, the
pacemaker processes on that node are terminated as well, so nothing is
left there to initiate a stop of any resource.

> Also, there were no logs from crmd or other components indicating that a stop was even attempted on "pace_node_a". Interestingly, crmd logs indicated that the colocated resource "colocated-resource" was being stopped, and there is evidence that the resource it manages actually stopped.

Well, you did not show any of that evidence, so it is hard to comment.

> 
> Is this a known issue?
>

There is no issue. Most likely stonith is not enabled, so pacemaker
simply assumes the node is offline once communication with it is lost.
When a node is offline, nothing can be active on it by definition.
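
To check that guess, something along these lines could be run on a
surviving node. This is only a sketch: the function name fencing_state is
made up for illustration, and the "value=..." output format it matches is
what crm_attribute usually prints, which may differ between versions.

```shell
#!/bin/sh
# Sketch: report whether fencing (stonith) is enabled cluster-wide.
# fencing_state is a hypothetical helper name, not part of pacemaker.
fencing_state() {
    # crm_attribute typically prints something like:
    #   scope=crm_config  name=stonith-enabled value=false
    # It exits non-zero if the property was never set or the CIB is unreachable.
    out=$(crm_attribute --type crm_config --name stonith-enabled --query 2>/dev/null) || {
        echo "stonith-enabled is not set explicitly, or the cluster could not be queried"
        return 0
    }
    case "$out" in
        *value=false*) echo "fencing is DISABLED: a lost node is simply assumed offline" ;;
        *value=true*)  echo "fencing is enabled" ;;
        *)             echo "unrecognized output: $out" ;;
    esac
}

fencing_state
```

If the property turns out to be unset, pacemaker's documented default is
true; "crm_verify --live-check --verbose" should then complain if no
fencing devices are actually configured.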

> Please let us know if any additional information is needed.
> 

There is no information so far. You need to show the actual configuration
of your cluster and those resources, as well as the logs from the DC,
starting from the moment corosync was killed until the resources were
migrated.
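
One possible way to gather all of that in a single pass is sketched below.
The function name collect_evidence, the start time and the output
directory are placeholders of mine, not values from your report; adjust
them to the moment just before corosync was killed. crm_report usually
needs to run as root and collects from the other nodes over ssh.

```shell
#!/bin/sh
# Sketch: collect the DC name, the live configuration and a log report.
# collect_evidence is a hypothetical helper; $1 is a start time just before
# corosync was killed, $2 is an output directory (both are placeholders).
collect_evidence() {
    from=$1
    dest=$2
    mkdir -p "$dest" || return 1

    # One-shot cluster status; the "Current DC" line names the DC node.
    crm_mon -1 | grep "Current DC" > "$dest/dc.txt"

    # Dump the live CIB (cluster and resource configuration) as XML.
    cibadmin --query > "$dest/cib.xml"

    # Archive logs and cluster state from the start time until now.
    crm_report --from "$from" "$dest/report"
}

# Example call with a placeholder time and directory:
# collect_evidence "2021-08-13 22:00:00" /tmp/corosync-kill
```

The output of "pcs config" is also fine in place of the raw CIB if it is
easier to read.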