[ClusterLabs] (Live) Migration failure results in a stop operation

Ken Gaillot kgaillot at redhat.com
Tue Feb 20 10:38:46 EST 2018


On Tue, 2018-02-20 at 02:13 -0500, Digimer wrote:
> On 2018-02-20 12:07 AM, Digimer wrote:
> > Hi all,
> > 
> >   Is there a way to tell pacemaker that, if a migration operation
> > fails, to just leave the service on the host node? The service being
> > hosted is a VM, and a migration failure that triggers a shutdown and
> > reboot is very disruptive. I'd rather just leave it alone (and let a
> > human fix the underlying problem).
> > 
> > Thanks!
> > 
> 
> I should mention: I tried setting 'on-fail' for the 'migrate_to' and
> 'migrate_from' operations:
> 
> pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
>     meta allow-migrate="true" op monitor interval="60" \
>     op stop on-fail="block" op migrate_to on-fail="ignore" \
>     op migrate_from on-fail="ignore" \

I think you want "block" (don't take any further action) rather than
"ignore" (proceed as if the action succeeded).

With "ignore", you should see log messages like "Pretending the failure
of ... succeeded". "ignore" is rarely useful, mainly when debugging a
resource agent that is wrongly returning an error.
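
As a rough sketch of that suggestion (not tested against this cluster), the
quoted command with "block" substituted for "ignore" on the two migration
operations might look like the following. The create_srv01_c7 wrapper and the
PCS variable are hypothetical names added for illustration, so that the
command can be printed before it is run for real; the resource options
themselves are copied from the quoted command.

```shell
#!/bin/sh
# PCS is a hypothetical override hook, not a pcs feature: it defaults to the
# real pcs binary, and PCS=echo prints the command instead of running it.
PCS="${PCS:-pcs}"

# Hypothetical wrapper around the quoted "pcs resource create" command, with
# on-fail="block" (take no further action) on migrate_to and migrate_from
# in place of on-fail="ignore" (proceed as if the action succeeded).
create_srv01_c7() {
    "$PCS" resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
        op monitor interval="60" \
        op stop on-fail="block" \
        op migrate_to on-fail="block" \
        op migrate_from on-fail="block" \
        meta allow-migrate="true" failure-timeout="75"
}
```

Setting PCS=echo and then calling create_srv01_c7 prints the arguments for
review. Since srv01-c7 already exists in the configuration quoted below, it
would need to be removed first or changed in place (for example with
"pcs resource update") rather than created again.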

>     meta allow-migrate="true" failure-timeout="75"
> 
> ==== [root at m3-a02n01 ~]# pcs config
> Cluster Name: m3-anvil-02
> Corosync Nodes:
>  m3-a02n01.alteeve.com m3-a02n02.alteeve.com
> Pacemaker Nodes:
>  m3-a02n01.alteeve.com m3-a02n02.alteeve.com
> 
> Resources:
>  Clone: hypervisor-clone
>   Meta Attrs: clone-max=2 notify=false
>   Resource: hypervisor (class=systemd type=libvirtd)
>    Operations: monitor interval=60 (hypervisor-monitor-interval-60)
>                start interval=0s timeout=100 (hypervisor-start-interval-0s)
>                stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
>  Resource: srv01-c7 (class=ocf provider=alteeve type=server)
>   Attributes: name=srv01-c7
>   Meta Attrs: allow-migrate=true failure-timeout=75
>   Operations: migrate_from interval=0s on-fail=ignore (srv01-c7-migrate_from-interval-0s)
>               migrate_to interval=0s on-fail=ignore (srv01-c7-migrate_to-interval-0s)
>               monitor interval=60 (srv01-c7-monitor-interval-60)
>               start interval=0s timeout=30 (srv01-c7-start-interval-0s)
>               stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)
> 
> Stonith Devices:
>  Resource: virsh_node1 (class=stonith type=fence_virsh)
>   Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
>   Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
>  Resource: virsh_node2 (class=stonith type=fence_virsh)
>   Attributes: ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
>   Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
> Fencing Levels:
> 
> Location Constraints:
>   Resource: srv01-c7
>     Enabled on: m3-a02n02.alteeve.com (score:50) (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
> Ordering Constraints:
> Colocation Constraints:
> Ticket Constraints:
> 
> Alerts:
>  No alerts defined
> 
> Resources Defaults:
>  No defaults set
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: m3-anvil-02
>  dc-version: 1.1.16-12.el7_4.7-94ff4df
>  have-watchdog: false
>  last-lrm-refresh: 1518584295
> 
> Quorum:
>   Options:
> ====
> 
> When I tried to migrate (with the RA set to fail on purpose), I got:
> 
> ==== Node 1
> Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]:   notice: Result of migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown error)
> Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167; ocf:alteeve:server invoked.
> Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360; Command line switch: [stop] -> [#!SET!#]
> ====
> 
> ==== Node 2
> Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:  * Migrate    srv01-c7        ( m3-a02n01.alteeve.com -> m3-a02n02.alteeve.com )
> Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice: Calculated transition 756, saving inputs in /var/lib/pacemaker/pengine/pe-input-172.bz2
> Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating migrate_from operation srv01-c7_migrate_from_0 locally on m3-a02n02.alteeve.com
> Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3396]: 167; ocf:alteeve:server invoked.
> Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3398]: 1360; Command line switch: [migrate_from] -> [#!SET!#]
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Result of migrate_from operation for srv01-c7 on m3-a02n02.alteeve.com: 1 (unknown error)
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
> Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating stop operation srv01-c7_stop_0 on m3-a02n01.alteeve.com
> ====
> 
> Thoughts?
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


