[ClusterLabs] HA problem: No live migration when setting node on standby

Andrei Borzenkov arvidjaar at gmail.com
Thu Apr 13 15:24:38 EDT 2023


On 12.04.2023 15:44, Philip Schiller wrote:
> Here is also some additional information for a failover when setting the node to standby.
> 
> Apr 12 12:40:28 s1 pacemaker-controld[1611990]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: On loss of quorum: Ignore
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       sto-ipmi-s0                (                        s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-zfs-drbd_storage:0     (                        s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-pluto:0           (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-poserver:0        (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-webserver:0       (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-dhcp:0            (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-wawi:0            (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-wawius:0          (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-saturn:0          (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-openvpn:0         (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-asterisk:0        (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-alarmanlage:0     (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-jabber:0          (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Stop       pri-drbd-TESTOPTIXXX:0     (               Promoted s1 )  due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Move       pri-vm-jabber              (                  s1 -> s0 )  due to unrunnable mas-drbd-jabber demote
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]:  notice: Actions: Move       pri-vm-alarmanlage         (                  s1 -> s0 )  due to unrunnable mas-drbd-alarmanlage demote

I had the same "unrunnable demote" yesterday when I tried to reproduce 
it, but I cannot reproduce it anymore. After some CIB modifications it 
works as expected.
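
(For reference, a transition like the one below can be replayed offline 
with crm_simulate from a saved scheduler input; the pe-input file name 
here is only a placeholder:

    # replay a saved scheduler input and show the resulting cluster state
    crm_simulate --simulate --show-scores \
        --xml-file /var/lib/pacemaker/pengine/pe-input-123.bz2
)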

Using the original execution date of: 2023-04-13 18:35:11Z
Current cluster status:
   * Node List:
     * Node ha1: standby (with active resources)
     * Online: [ ha2 qnetd ]

   * Full List of Resources:
     * dummy_stonith	(stonith:external/_dummy):	 Started ha1
     * Clone Set: cl-zfs_drbd_storage [zfs_drbd_storage]:
       * Started: [ ha1 ha2 ]
     * Clone Set: ms-drbd_fs [drbd_fs] (promotable):
       * Masters: [ ha1 ha2 ]
     * just_vm	(ocf::pacemaker:Dummy):	 Started ha2
     * drbd_vm	(ocf::pacemaker:Dummy):	 Started ha1

Transition Summary:
   * Move       dummy_stonith          ( ha1 -> ha2 )
   * Stop       zfs_drbd_storage:0     (        ha1 )  due to node availability
   * Stop       drbd_fs:0              ( Master ha1 )  due to node availability
   * Migrate    drbd_vm                ( ha1 -> ha2 )

Executing Cluster Transition:
   * Resource action: dummy_stonith   stop on ha1
   * Pseudo action:   ms-drbd_fs_demote_0
   * Resource action: drbd_vm         migrate_to on ha1
   * Resource action: dummy_stonith   start on ha2
   * Resource action: drbd_fs         demote on ha1
   * Pseudo action:   ms-drbd_fs_demoted_0
   * Pseudo action:   ms-drbd_fs_stop_0
   * Resource action: drbd_vm         migrate_from on ha2
   * Resource action: drbd_vm         stop on ha1
   * Resource action: dummy_stonith   monitor=3600000 on ha2
   * Pseudo action:   cl-zfs_drbd_storage_stop_0
   * Resource action: drbd_fs         stop on ha1
   * Pseudo action:   ms-drbd_fs_stopped_0
   * Pseudo action:   drbd_vm_start_0
   * Resource action: zfs_drbd_storage stop on ha1
   * Pseudo action:   cl-zfs_drbd_storage_stopped_0
   * Resource action: drbd_vm         monitor=10000 on ha2
Using the original execution date of: 2023-04-13 18:35:11Z

Revised Cluster Status:
   * Node List:
     * Node ha1: standby
     * Online: [ ha2 qnetd ]

   * Full List of Resources:
     * dummy_stonith	(stonith:external/_dummy):	 Started ha2
     * Clone Set: cl-zfs_drbd_storage [zfs_drbd_storage]:
       * Started: [ ha2 ]
       * Stopped: [ ha1 qnetd ]
     * Clone Set: ms-drbd_fs [drbd_fs] (promotable):
       * Masters: [ ha2 ]
       * Stopped: [ ha1 qnetd ]
     * just_vm	(ocf::pacemaker:Dummy):	 Started ha2
     * drbd_vm	(ocf::pacemaker:Dummy):	 Started ha2

where ordering constraints are

order drbd_fs_after_zfs_drbd_storage Mandatory: cl-zfs_drbd_storage ms-drbd_fs:promote
order drbd_vm_after_drbd_fs Mandatory: ms-drbd_fs:promote drbd_vm
order just_vm_after_zfs_drbd_storage Mandatory: cl-zfs_drbd_storage just_vm

The "just_vm" was added to test behavior of ordering resource against 
normal, non promotable, clone.
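
(The test resources are plain ocf:pacemaker:Dummy primitives; roughly 
something like this, with allow-migrate set so that drbd_vm can be 
live-migrated - the operation values are my assumption:

    crm configure primitive just_vm ocf:pacemaker:Dummy \
        op monitor interval=10s
    crm configure primitive drbd_vm ocf:pacemaker:Dummy \
        meta allow-migrate=true op monitor interval=10s
)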

OK, I compared the CIBs, and the difference is that the non-working case 
has an explicit "start" action in the order constraint, i.e.

order drbd_vm_after_drbd_fs Mandatory: ms-drbd_fs:promote drbd_vm:start

After I added it back, I got the same unrunnable "demote" action.
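
(In CIB XML this is just an explicit then-action on the rsc_order 
element; a sketch of the two variants as I understand them:

    <!-- demote becomes unrunnable, drbd_vm is moved (stop/start) -->
    <rsc_order id="drbd_vm_after_drbd_fs" kind="Mandatory"
               first="ms-drbd_fs" first-action="promote"
               then="drbd_vm" then-action="start"/>

    <!-- no explicit then-action, live migration works -->
    <rsc_order id="drbd_vm_after_drbd_fs" kind="Mandatory"
               first="ms-drbd_fs" first-action="promote"
               then="drbd_vm"/>
)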

Transition Summary:
   * Stop       zfs_drbd_storage:0     (        ha1 )  due to node availability
   * Stop       drbd_fs:0              ( Master ha1 )  due to node availability
   * Migrate    just_vm                ( ha1 -> ha2 )
   * Move       drbd_vm                ( ha1 -> ha2 )  due to unrunnable ms-drbd_fs demote

I was sure that "start" is the default anyway. Go figure ...
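
(So if you hit this, dropping the explicit :start again seems to be the 
workaround; a sketch with crmsh, using the constraint ID from above:

    crm configure show drbd_vm_after_drbd_fs
    crm configure delete drbd_vm_after_drbd_fs
    crm configure order drbd_vm_after_drbd_fs Mandatory: \
        ms-drbd_fs:promote drbd_vm
)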

