[ClusterLabs] DRBD demote/promote not called - Why? How to fix?

Sat Nov 12 14:00:09 UTC 2016

On 11/10/2016 19:37, Ken Gaillot wrote:
> On 11/09/2016 12:27 PM, CART Andreas wrote:
>> I again started with all resources located at ventsi-clst1 and issued a
>> 'pcs resource move DRBD_global_clst' (the resource next collocated next
>> to the DRBDClone).
>> 
>> With that I end up with all primitive resources stopped and the
>> DRBDClone resource still being master at ventsi-clst1.
>> 
>> Transition Summary:
>> * Start   IPaddrNFS    (ventsi-clst2-sync)
>> * Start   NFSServer    (ventsi-clst2-sync)
>> * Demote  DRBD:0       (Master -> Slave ventsi-clst1-sync)    <=== this demote never happens
>> * Promote DRBD:1       (Slave -> Master ventsi-clst2-sync)
>> * Start   DRBD_global_clst     (ventsi-clst2-sync)
>> * Start   NFS_global_clst      (ventsi-clst1-sync)
>> * Start   BIND_global_clst     (ventsi-clst2-sync)
>
> Strangely, this sequence appears to be ignoring the constraint "start
> DRBD_global_clst then start IPaddrNFS".
>
> Can you open a bug report at http://bugs.clusterlabs.org/ and attach the
> CIB (or pe-input file) in use at this time?
>
> For testing purposes, you may want to try replacing the "start
> DRBD_global_clst then start IPaddrNFS" constraint with "promote
> DRBDClone then start IPaddrNFS" to see whether that makes a difference.

I reproduced the problem in a test environment and can now hopefully provide some more information.

The problem appears to be the same: the resources are stopped, but the master resource is not demoted (in time).
This time, however, I noticed that the cluster cleans up the problem by itself after 15 minutes.
The original transition again contained too few actions. (This time not only the demote is missing, but also the stop actions for other resources.)
(Additionally, I had some files open on the mounted filesystems, so the unmount did not succeed immediately and took some time to disconnect.)

I saw exactly the same behavior when attempting to move the NFS server back (without any open files).
This time I ran 'pcs resource cleanup' afterwards, which immediately resolved things to the correct state.
(Just to note: in contrast, if I run 'pcs resource clear DRBD_global_clst' in the problematic interim state, everything returns to the original state, i.e. as if no 'move' command had been applied.)

Furthermore, I tried adding order constraints for the complete "stop" chain, but unfortunately this didn't help either.

In another attempt I added colocation constraints to make every other individual resource depend on the master role of the DRBD clone, again without any change in behavior.
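
For reference, a minimal sketch of what such colocation constraints might look like. It only prints 'pcs constraint colocation add' commands and does not run them. The resource names are taken from the transition summary quoted above. The INFINITY score and the helper name colocation_commands are assumptions for illustration, not the exact constraints used in the test environment.

```python
# Resource names as they appear in the transition summary quoted in this thread.
RESOURCES = [
    "IPaddrNFS",
    "NFSServer",
    "DRBD_global_clst",
    "NFS_global_clst",
    "BIND_global_clst",
]


def colocation_commands(clone, resources, score="INFINITY"):
    """Return one 'pcs constraint colocation add' command per resource.

    Each command ties a resource to the node where `clone` holds the
    master role. The score is an assumption; adjust it as needed.
    """
    return [
        f"pcs constraint colocation add {res} with master {clone} {score}"
        for res in resources
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in colocation_commands("DRBDClone", RESOURCES):
        print(cmd)
```

Printing the commands first makes it possible to review them before applying anything to a live cluster.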

All resources move immediately and successfully if I delete the two filesystem resources on top of the NFS server and then move the NFS server.
(But the DRBD master is still not demoted if I try to move the filesystem on top of it, which sits below the IP address and the NFS server.)

If I try to move either of the two filesystem resources on top of the NFS server, only that filesystem is stopped and no other resource.
Even stranger, in this case 'crm_simulate -Ls' does not show any missing actions in the transition summary. This state does not resolve to the intended state even after 15 minutes.
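
Checking a transition summary for missing actions can be done by hand, but a small script makes it repeatable. The following is a rough sketch that assumes the summary has the line format quoted earlier in this thread ("* Action  Resource  (details)"). Output from other Pacemaker versions may look different. The sample text is copied from that quoted summary, and the function names are made up for illustration.

```python
import re

# Sample copied from the transition summary quoted earlier in this thread.
SUMMARY = """\
Transition Summary:
 * Start   IPaddrNFS    (ventsi-clst2-sync)
 * Start   NFSServer    (ventsi-clst2-sync)
 * Demote  DRBD:0       (Master -> Slave ventsi-clst1-sync)
 * Promote DRBD:1       (Slave -> Master ventsi-clst2-sync)
 * Start   DRBD_global_clst     (ventsi-clst2-sync)
 * Start   NFS_global_clst      (ventsi-clst1-sync)
 * Start   BIND_global_clst     (ventsi-clst2-sync)
"""

# Assumed line format: "* <Action> <Resource> (<details>)".
ACTION_LINE = re.compile(r"^\s*\*\s+(\w+)\s+(\S+)\s+\((.*)\)\s*$")


def planned_actions(text):
    """Return (action, resource) pairs listed after 'Transition Summary:'."""
    actions = []
    in_summary = False
    for line in text.splitlines():
        if line.strip() == "Transition Summary:":
            in_summary = True
            continue
        if not in_summary:
            continue
        match = ACTION_LINE.match(line)
        if match:
            actions.append((match.group(1), match.group(2)))
    return actions


def missing_actions(text, expected):
    """Return the expected (action, resource) pairs absent from the summary."""
    planned = set(planned_actions(text))
    return [pair for pair in expected if pair not in planned]


if __name__ == "__main__":
    for action, resource in planned_actions(SUMMARY):
        print(action, resource)
```

To use it against a live cluster, the output of 'crm_simulate -Ls' could be passed to planned_actions() in place of SUMMARY, and the list of expected actions compared with missing_actions().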

So finally I reported a bug for this behavior: "Bug 5305 - part of the resource chain not being considered (in time)"

Kind regards
Andi