[ClusterLabs] (Live) Migration failure results in a stop operation
Digimer
lists at alteeve.ca
Tue Feb 20 02:13:28 EST 2018
On 2018-02-20 12:07 AM, Digimer wrote:
> Hi all,
>
> Is there a way to tell pacemaker that, if a migration operation fails,
> to just leave the service on the host node? The service being hosted is
> a VM, and a migration failure that triggers a shutdown and reboot is
> very disruptive. I'd rather just leave it alone (and let a human fix the
> underlying problem).
>
> Thanks!
>
I should mention: I tried setting 'on-fail' for the 'migrate_to' and
'migrate_from' operations:
pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
    op monitor interval="60" \
    op stop on-fail="block" \
    op migrate_to on-fail="ignore" \
    op migrate_from on-fail="ignore" \
    meta allow-migrate="true" failure-timeout="75"
==== [root@m3-a02n01 ~]# pcs config
Cluster Name: m3-anvil-02
Corosync Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com
Pacemaker Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com
Resources:
 Clone: hypervisor-clone
  Meta Attrs: clone-max=2 notify=false
  Resource: hypervisor (class=systemd type=libvirtd)
   Operations: monitor interval=60 (hypervisor-monitor-interval-60)
               start interval=0s timeout=100 (hypervisor-start-interval-0s)
               stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
 Resource: srv01-c7 (class=ocf provider=alteeve type=server)
  Attributes: name=srv01-c7
  Meta Attrs: allow-migrate=true failure-timeout=75
  Operations: migrate_from interval=0s on-fail=ignore (srv01-c7-migrate_from-interval-0s)
              migrate_to interval=0s on-fail=ignore (srv01-c7-migrate_to-interval-0s)
              monitor interval=60 (srv01-c7-monitor-interval-60)
              start interval=0s timeout=30 (srv01-c7-start-interval-0s)
              stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)
Stonith Devices:
 Resource: virsh_node1 (class=stonith type=fence_virsh)
  Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
  Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
 Resource: virsh_node2 (class=stonith type=fence_virsh)
  Attributes: ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
  Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
Fencing Levels:
Location Constraints:
  Resource: srv01-c7
    Enabled on: m3-a02n02.alteeve.com (score:50) (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:
Alerts:
 No alerts defined
Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: m3-anvil-02
 dc-version: 1.1.16-12.el7_4.7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1518584295
Quorum:
  Options:
====
When I tried to migrate (with the RA set to fail on purpose), I got:
==== Node 1
Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]: notice: Result of migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown error)
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167; ocf:alteeve:server invoked.
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360; Command line switch: [stop] -> [#!SET!#]
====
==== Node 2
Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]: notice: * Migrate srv01-c7 ( m3-a02n01.alteeve.com -> m3-a02n02.alteeve.com )
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]: notice: Calculated transition 756, saving inputs in /var/lib/pacemaker/pengine/pe-input-172.bz2
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating migrate_from operation srv01-c7_migrate_from_0 locally on m3-a02n02.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3396]: 167; ocf:alteeve:server invoked.
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3398]: 1360; Command line switch: [migrate_from] -> [#!SET!#]
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: notice: Result of migrate_from operation for srv01-c7 on m3-a02n02.alteeve.com: 1 (unknown error)
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]: notice: Initiating stop operation srv01-c7_stop_0 on m3-a02n01.alteeve.com
====
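The ordering that matters in the logs above (migrate_to fails on the source, migrate_from is still attempted on the target and fails, then a stop is initiated on the source) is easier to see with the other records filtered out. Below is a minimal sketch of a shell helper for that. `summarize_actions` is a hypothetical name made up for this post, not part of pcs or Pacemaker. It assumes each syslog record sits on a single line, as in the raw log rather than a wrapped copy, and that every record starts with the usual `Mon DD HH:MM:SS host` syslog prefix.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of pcs or Pacemaker): reads syslog text on
# stdin and prints "HH:MM:SS message" for the crmd records about one resource.
# Assumption: one record per line, in the "Mon DD HH:MM:SS host prog[pid]: ..." format.
summarize_actions() {
    local rsc="$1"   # resource name, e.g. srv01-c7

    # 1. Keep only crmd records that initiate an action, report a result,
    #    or report a failed action.
    # 2. Keep only those that mention the resource (fixed-string match).
    # 3. Strip the date, host and "crmd[pid]: level: " prefix, keeping the time.
    # 4. Collapse adjacent identical lines (each failure warning is repeated
    #    in the excerpt above).
    grep -E 'crmd\[[0-9]+]: (notice|warning): (Initiating|Result of|Action [0-9]+)' \
        | grep -F -- "$rsc" \
        | sed -E 's/^[A-Z][a-z]{2} +[0-9]+ ([0-9:]{8}) [^ ]+ crmd\[[0-9]+]: (notice|warning): /\1 /' \
        | uniq
}
```

After sourcing the function on the node that logged the "Initiating" records (node 2 in this excerpt), something like `summarize_actions srv01-c7 < /var/log/messages` should print just the migrate_to, migrate_from and stop steps, one line each.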
Thoughts?
--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould