[ClusterLabs] colocation/order for cloned resource + group being ignored

Salatiel Filho salatiel.filho at gmail.com
Mon Apr 11 12:02:02 EDT 2022


Hi, I am deploying Pacemaker + DRBD to provide high-availability
storage, and during troubleshooting tests I ran into strange behaviour
where the colocation constraint between the cloned resource and the
remaining resources appears to be simply ignored.

These are the constraints I have:
Location Constraints:
Ordering Constraints:
  start DRBDData-clone then start nfs (kind:Mandatory)
Colocation Constraints:
  nfs with DRBDData-clone (score:INFINITY)
Ticket Constraints:

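For context, the two constraints above were created with pcs commands
roughly like these (reconstructed from the listing, so the exact
invocation may have differed slightly):

  pcs constraint order start DRBDData-clone then start nfs
  pcs constraint colocation add nfs with DRBDData-clone INFINITY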

The environment: I have a two-node cluster with a remote quorum
device. The test was to stop the quorum device and afterwards stop the
node currently running all the services (node1). The expected
behaviour would be that the remaining node would not be able to do
anything (partition without quorum) until it regains quorum.
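
For reference, the remote quorum device is a corosync-qdevice/qnetd
setup, added roughly like this (I am quoting the command from memory):

  # "qnetd-host" below is a placeholder for the real qnetd hostname
  pcs quorum device add model net host=qnetd-host algorithm=ffsplit
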
This is the output of pcs status on node2 after powering off the
quorum device and node1.

Some resources have been removed from the output to make this email cleaner.

Cluster name: storage-drbd
Cluster Summary:
  * Stack: corosync
  * Current DC: node2 (version 2.1.0-8.el8-7c3f660707) - partition
WITHOUT quorum
  * Last updated: Mon Apr 11 12:28:06 2022
  * Last change:  Mon Apr 11 12:26:10 2022 by root via cibadmin on node2
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Node node1: UNCLEAN (offline)
  * Online: [ node2 ]

Full List of Resources:
  * fence-node1  (stonith:fence_vmware_rest):     Started node2
  * fence-node2  (stonith:fence_vmware_rest):     Started node1 (UNCLEAN)
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * DRBDData  (ocf::linbit:drbd):     Master node1 (UNCLEAN)
    * Slaves: [ node2 ]
  * Resource Group: nfs:
    * vip_nfs   (ocf::heartbeat:IPaddr2):        Started node1 (UNCLEAN)
    * drbd_fs   (ocf::heartbeat:Filesystem):     Started node1 (UNCLEAN)
    * nfsd    (ocf::heartbeat:nfsserver):     Started node1 (UNCLEAN)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

As expected, node2 is without quorum and waiting. The problem
happened when I turned node1 back on. Quorum was re-established, but
the DRBD master was promoted on node2 while the nfs group started on
node1, even though I have both a start order and a colocation
constraint that should make the cloned resource and the NFS group run
on the same node.
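
For reference, the DRBD clone itself was created roughly like this
(quoted from memory, with the operation settings omitted):

  # drbd_resource=r0 is a placeholder for the actual DRBD resource name
  pcs resource create DRBDData ocf:linbit:drbd drbd_resource=r0 \
      promotable promoted-max=1 promoted-node-max=1 clone-max=2 \
      clone-node-max=1 notify=true

Below is the pcs status output on node2 after node1 came back and the
cluster regained quorum.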



Cluster name: storage-drbd
Cluster Summary:
  * Stack: corosync
  * Current DC: node2 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
  * Last updated: Mon Apr 11 12:29:08 2022
  * Last change:  Mon Apr 11 12:26:10 2022 by root via cibadmin on node2
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ node1 node2 ]

Full List of Resources:
  * fence-node1  (stonith:fence_vmware_rest):     Started node2
  * fence-node2  (stonith:fence_vmware_rest):     Started node1
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
  * Resource Group: nfs:
    * vip_nfs   (ocf::heartbeat:IPaddr2):        Started node1
    * drbd_fs   (ocf::heartbeat:Filesystem):     FAILED node1
    * nfsd    (ocf::heartbeat:nfsserver):     Stopped

Failed Resource Actions:
  * drbd_fs_start_0 on node1 'error' (1): call=90, status='complete',
exitreason='Couldn't mount device [/dev/drbd0] as /exports/drbd0',
last-rc-change='2022-04-11 12:29:05 -03:00', queued=0ms, exec=2567ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
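
For completeness, the filesystem resource that failed is defined
roughly like this (fstype and the exact options are from memory, so
treat them as approximate):

  # fstype=xfs is from memory; device/directory match the error above
  pcs resource create drbd_fs ocf:heartbeat:Filesystem \
      device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs --group nfs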

Can anyone explain to me why the constraints are being ignored?

Running on AlmaLinux 8.5 + pcs-0.10.10-4.el8_5.1.alma.x86_64



Thanks!

Kind regards,
Salatiel

