[ClusterLabs] HA problem: No live migration when setting node on standby
Andrei Borzenkov
arvidjaar at gmail.com
Thu Apr 13 15:24:38 EDT 2023
On 12.04.2023 15:44, Philip Schiller wrote:
> Here is also some additional information for a failover when setting the node to standby.
>
> Apr 12 12:40:28 s1 pacemaker-controld[1611990]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: On loss of quorum: Ignore
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop sto-ipmi-s0 ( s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-zfs-drbd_storage:0 ( s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-pluto:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-poserver:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-webserver:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-dhcp:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-wawi:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-wawius:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-saturn:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-openvpn:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-asterisk:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-alarmanlage:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-jabber:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Stop pri-drbd-TESTOPTIXXX:0 ( Promoted s1 ) due to node availability
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Move pri-vm-jabber ( s1 -> s0 ) due to unrunnable mas-drbd-jabber demote
> Apr 12 12:40:28 s1 pacemaker-schedulerd[1611989]: notice: Actions: Move pri-vm-alarmanlage ( s1 -> s0 ) due to unrunnable mas-drbd-alarmanlage demote
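(For context, "setting the node standby" means putting s1 into standby; I assume that was done with something like the following, the exact tool being a guess on my part:)
crm node standby s1
# or with pcs:
pcs node standby s1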
I had the same "unrunnable demote" yesterday when I tried to reproduce
it, but I cannot reproduce it anymore. After some CIB modifications it
works as expected.
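(The simulation output below comes from replaying the saved scheduler input with crm_simulate; roughly the following invocation, where the pe-input file name is just a placeholder for whichever file the logs point at:)
crm_simulate --simulate --xml-file=/var/lib/pacemaker/pengine/pe-input-NNN.bz2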
Using the original execution date of: 2023-04-13 18:35:11Z
Current cluster status:
* Node List:
* Node ha1: standby (with active resources)
* Online: [ ha2 qnetd ]
* Full List of Resources:
* dummy_stonith (stonith:external/_dummy): Started ha1
* Clone Set: cl-zfs_drbd_storage [zfs_drbd_storage]:
* Started: [ ha1 ha2 ]
* Clone Set: ms-drbd_fs [drbd_fs] (promotable):
* Masters: [ ha1 ha2 ]
* just_vm (ocf::pacemaker:Dummy): Started ha2
* drbd_vm (ocf::pacemaker:Dummy): Started ha1
Transition Summary:
* Move dummy_stonith ( ha1 -> ha2 )
* Stop zfs_drbd_storage:0 ( ha1 ) due to node availability
* Stop drbd_fs:0 ( Master ha1 ) due to node availability
* Migrate drbd_vm ( ha1 -> ha2 )
Executing Cluster Transition:
* Resource action: dummy_stonith stop on ha1
* Pseudo action: ms-drbd_fs_demote_0
* Resource action: drbd_vm migrate_to on ha1
* Resource action: dummy_stonith start on ha2
* Resource action: drbd_fs demote on ha1
* Pseudo action: ms-drbd_fs_demoted_0
* Pseudo action: ms-drbd_fs_stop_0
* Resource action: drbd_vm migrate_from on ha2
* Resource action: drbd_vm stop on ha1
* Resource action: dummy_stonith monitor=3600000 on ha2
* Pseudo action: cl-zfs_drbd_storage_stop_0
* Resource action: drbd_fs stop on ha1
* Pseudo action: ms-drbd_fs_stopped_0
* Pseudo action: drbd_vm_start_0
* Resource action: zfs_drbd_storage stop on ha1
* Pseudo action: cl-zfs_drbd_storage_stopped_0
* Resource action: drbd_vm monitor=10000 on ha2
Using the original execution date of: 2023-04-13 18:35:11Z
Revised Cluster Status:
* Node List:
* Node ha1: standby
* Online: [ ha2 qnetd ]
* Full List of Resources:
* dummy_stonith (stonith:external/_dummy): Started ha2
* Clone Set: cl-zfs_drbd_storage [zfs_drbd_storage]:
* Started: [ ha2 ]
* Stopped: [ ha1 qnetd ]
* Clone Set: ms-drbd_fs [drbd_fs] (promotable):
* Masters: [ ha2 ]
* Stopped: [ ha1 qnetd ]
* just_vm (ocf::pacemaker:Dummy): Started ha2
* drbd_vm (ocf::pacemaker:Dummy): Started ha2
where the ordering constraints are:
order drbd_fs_after_zfs_drbd_storage Mandatory: cl-zfs_drbd_storage ms-drbd_fs:promote
order drbd_vm_after_drbd_fs Mandatory: ms-drbd_fs:promote drbd_vm
order just_vm_after_zfs_drbd_storage Mandatory: cl-zfs_drbd_storage just_vm
The "just_vm" was added to test behavior of ordering resource against
normal, non promotable, clone.
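(If it matters for reproducing this: both "VMs" are plain ocf:pacemaker:Dummy primitives, roughly as below; crmsh syntax from memory, so treat it as a sketch rather than the exact configuration:)
primitive just_vm ocf:pacemaker:Dummy \
        op monitor interval=10s \
        meta allow-migrate=true
primitive drbd_vm ocf:pacemaker:Dummy \
        op monitor interval=10s \
        meta allow-migrate=true
# allow-migrate=true is what makes the scheduler consider migrate_to/migrate_from
# instead of a plain stop/start when the resource has to move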
OK, I compared the CIBs and the difference is that the non-working case has an explicit "start" action in the order constraint, i.e.
order drbd_vm_after_drbd_fs Mandatory: ms-drbd_fs:promote drbd_vm:start
After I added it back, I get the same unrunnable "demote" action:
Transition Summary:
* Stop zfs_drbd_storage:0 ( ha1 ) due to node availability
* Stop drbd_fs:0 ( Master ha1 ) due to node availability
* Migrate just_vm ( ha1 -> ha2 )
* Move drbd_vm ( ha1 -> ha2 ) due to unrunnable ms-drbd_fs demote
I was sure that "start" is the default anyway. Go figure ...
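In case someone wants to check the same thing on their own cluster, dumping the constraints section and redefining the order without the explicit action is enough; something like the following (crmsh/cibadmin invocations from memory, so verify before running them against a live CIB):
# show the rsc_order XML the shell actually generated
cibadmin -Q -o constraints
# recreate the constraint without the explicit :start on drbd_vm
crm configure delete drbd_vm_after_drbd_fs
crm configure order drbd_vm_after_drbd_fs Mandatory: ms-drbd_fs:promote drbd_vm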