[ClusterLabs] colocation/order for cloned resource + group being ignored

Mon Apr 11 12:53:39 EDT 2022

On 11.04.2022 19:02, Salatiel Filho wrote:
> Hi, I am deploying pacemaker + drbd to provide high-availability
> storage, and during troubleshooting tests I got a strange
> behaviour where the colocation constraint for the remaining resources
> and the cloned group appears to be simply ignored.
> 
> These are the constraints I have:
> Location Constraints:
> Ordering Constraints:
>   start DRBDData-clone then start nfs (kind:Mandatory)
> Colocation Constraints:
>   nfs with DRBDData-clone (score:INFINITY)
> Ticket Constraints:
> 
> 
> The environment: I have a two node cluster with a remote quorum
> device. The test was to stop the quorum device and afterwards stop the
> node currently running all the services ( node1 ).
> The expected behaviour would be that the remaining node would not be
> able to do anything ( partition without-quorum ) until it gets quorum.
> This is the output of pcs status on node2 after powering off the quorum
> device and node1.
> 
> Some resources have been removed from the output to make this email cleaner.
> 
> Cluster name: storage-drbd
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: node2 (version 2.1.0-8.el8-7c3f660707) - partition
> WITHOUT quorum
>   * Last updated: Mon Apr 11 12:28:06 2022
>   * Last change:  Mon Apr 11 12:26:10 2022 by root via cibadmin on node2
>   * 2 nodes configured
>   * 11 resource instances configured
> 
> Node List:
>   * Node node1: UNCLEAN (offline)
>   * Online: [ node2 ]
> 
> Full List of Resources:
>   * fence-node1  (stonith:fence_vmware_rest):     Started node2
>   * fence-node2  (stonith:fence_vmware_rest):     Started node1 (UNCLEAN)
>   * Clone Set: DRBDData-clone [DRBDData] (promotable):
>     * DRBDData  (ocf::linbit:drbd):     Master node1 (UNCLEAN)
>     * Slaves: [ node2 ]
>   * Resource Group: nfs:
>     * vip_nfs   (ocf::heartbeat:IPaddr2):        Started node1 (UNCLEAN)
>     * drbd_fs   (ocf::heartbeat:Filesystem):     Started node1 (UNCLEAN)
>     * nfsd    (ocf::heartbeat:nfsserver):     Started node1 (UNCLEAN)
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> 
> 
> 
> 
> 
> As expected, node2 is without quorum and waiting. The problem
> happened when I turned node1 back on. Quorum was re-established, but
> the drbd master started on node1 while the nfs group started on node2,
> even though I have both a start order and a colocation constraint to
> make the cloned resource and the NFS group run on the same node.
> 

No, you do not.

> 
> 
> Cluster name: storage-drbd
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: node2 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
>   * Last updated: Mon Apr 11 12:29:08 2022
>   * Last change:  Mon Apr 11 12:26:10 2022 by root via cibadmin on node2
>   * 2 nodes configured
>   * 11 resource instances configured
> 
> Node List:
>   * Online: [ node1 node2 ]
> 
> Full List of Resources:
>   * fence-node1  (stonith:fence_vmware_rest):     Started node2
>   * fence-node2  (stonith:fence_vmware_rest):     Started node1
>   * Clone Set: DRBDData-clone [DRBDData] (promotable):
>     * Masters: [ node2 ]
>     * Slaves: [ node1 ]
>   * Resource Group: nfs:
>     * vip_nfs   (ocf::heartbeat:IPaddr2):        Started node1
>     * drbd_fs   (ocf::heartbeat:Filesystem):     FAILED node1
>     * nfsd    (ocf::heartbeat:nfsserver):     Stopped
> 
> Failed Resource Actions:
>   * drbd_fs_start_0 on node1 'error' (1): call=90, status='complete',
> exitreason='Couldn't mount device [/dev/drbd0] as /exports/drbd0',
> last-rc-change='2022-04-11 12:29:05 -03:00', queued=0ms, exec=2567ms
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> 
> 
> 
> 
> Can anyone explain to me why the constraints are being ignored?
> 

Your order/colocation is against the start of the clone resource, not against
its master role. If you need to order/colocate a resource against the master,
you need to say so explicitly. Colocating/ordering against "start" is satisfied
as soon as the cloned resource is started as a slave, before it gets promoted.
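
As a rough sketch of what "explicitly" could look like with the pcs shipped
on el8 (0.10.x syntax assumed; newer pcs releases use Promoted/Unpromoted
instead of master/slave), using the resource names from the status output
above. The constraint ids are not shown in the original mail, so
<constraint-id> is a placeholder you would look up yourself:

```shell
# List the existing constraints with their ids (pcs 0.10 syntax assumed).
pcs constraint --full

# Optionally drop the old start-based constraints; <constraint-id> is a
# placeholder for the ids printed by the command above.
# pcs constraint remove <constraint-id>

# Keep the nfs group on the node where DRBDData-clone is in the master role.
pcs constraint colocation add nfs with master DRBDData-clone INFINITY

# Start the nfs group only after DRBDData-clone has been promoted.
pcs constraint order promote DRBDData-clone then start nfs
```

With these two constraints the Filesystem resource in the group should only
be started on a node where /dev/drbd0 is already primary.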