[ClusterLabs] (Live) Migration failure results in a stop operation

Tue Feb 20 02:13:28 EST 2018

On 2018-02-20 12:07 AM, Digimer wrote:
> Hi all,
> 
>   Is there a way to tell pacemaker that, if a migration operation fails,
> to just leave the service on the host node? The service being hosted is
> a VM and a migration failure that triggers a shut down and reboot is
> very disruptive. I'd rather just leave it alone (and let a human fix the
> underlying problem).
> 
> Thanks!
> 

I should mention that I tried setting 'on-fail' for the 'migrate_to' and
'migrate_from' operations:

pcs resource create srv01-c7 ocf:alteeve:server name="srv01-c7" \
    meta allow-migrate="true" op monitor interval="60" \
    op stop on-fail="block" op migrate_to on-fail="ignore" \
    op migrate_from on-fail="ignore" \
    meta allow-migrate="true" failure-timeout="75"

==== [root@m3-a02n01 ~]# pcs config
Cluster Name: m3-anvil-02
Corosync Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com
Pacemaker Nodes:
 m3-a02n01.alteeve.com m3-a02n02.alteeve.com

Resources:
 Clone: hypervisor-clone
  Meta Attrs: clone-max=2 notify=false
  Resource: hypervisor (class=systemd type=libvirtd)
   Operations: monitor interval=60 (hypervisor-monitor-interval-60)
               start interval=0s timeout=100 (hypervisor-start-interval-0s)
               stop interval=0s timeout=100 (hypervisor-stop-interval-0s)
 Resource: srv01-c7 (class=ocf provider=alteeve type=server)
  Attributes: name=srv01-c7
  Meta Attrs: allow-migrate=true failure-timeout=75
  Operations: migrate_from interval=0s on-fail=ignore (srv01-c7-migrate_from-interval-0s)
              migrate_to interval=0s on-fail=ignore (srv01-c7-migrate_to-interval-0s)
              monitor interval=60 (srv01-c7-monitor-interval-60)
              start interval=0s timeout=30 (srv01-c7-start-interval-0s)
              stop interval=0s on-fail=block (srv01-c7-stop-interval-0s)

Stonith Devices:
 Resource: virsh_node1 (class=stonith type=fence_virsh)
  Attributes: delay=15 ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n01.alteeve.com port=m3-a02n01
  Operations: monitor interval=60 (virsh_node1-monitor-interval-60)
 Resource: virsh_node2 (class=stonith type=fence_virsh)
  Attributes: ipaddr=10.255.255.250 login=root passwd="secret" pcmk_host_list=m3-a02n02.alteeve.com port=m3-a02n02
  Operations: monitor interval=60 (virsh_node2-monitor-interval-60)
Fencing Levels:

Location Constraints:
  Resource: srv01-c7
    Enabled on: m3-a02n02.alteeve.com (score:50) (id:location-srv01-c7-m3-a02n02.alteeve.com-50)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: m3-anvil-02
 dc-version: 1.1.16-12.el7_4.7-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1518584295

Quorum:
  Options:
====

When I tried to migrate (with the RA set to fail on purpose), I got:

==== Node 1
Feb 20 07:06:40 m3-a02n01.alteeve.com crmd[1865]:   notice: Result of migrate_to operation for srv01-c7 on m3-a02n01.alteeve.com: 1 (unknown error)
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3440]: 167; ocf:alteeve:server invoked.
Feb 20 07:06:40 m3-a02n01.alteeve.com ocf:alteeve:server[3442]: 1360; Command line switch: [stop] -> [#!SET!#]
====

==== Node 2
Feb 20 07:05:37 m3-a02n02.alteeve.com crmd[2394]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice:  * Migrate    srv01-c7        ( m3-a02n01.alteeve.com -> m3-a02n02.alteeve.com )
Feb 20 07:06:33 m3-a02n02.alteeve.com pengine[2393]:   notice: Calculated transition 756, saving inputs in /var/lib/pacemaker/pengine/pe-input-172.bz2
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating migrate_from operation srv01-c7_migrate_from_0 locally on m3-a02n02.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3396]: 167; ocf:alteeve:server invoked.
Feb 20 07:06:34 m3-a02n02.alteeve.com ocf:alteeve:server[3398]: 1360; Command line switch: [migrate_from] -> [#!SET!#]
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Result of migrate_from operation for srv01-c7 on m3-a02n02.alteeve.com: 1 (unknown error)
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating stop operation srv01-c7_stop_0 on m3-a02n01.alteeve.com
====
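
To make the order of events in the node 2 log easier to follow, here is a rough sketch of a filter that keeps only the operations crmd initiated and the actions it reported as failed. The function name summarize_ops is made up for illustration, and it assumes each log entry sits on a single line (as in the journal, not wrapped by a mail client) and that the crmd message wording matches what is shown above.

```shell
#!/bin/sh
# Hypothetical helper: reduce crmd log lines to the sequence of
# initiated operations and failed actions, in the order logged.
# Assumes one log entry per line and the message wording shown above.
summarize_ops() {
    # -o prints only the matched text; uniq drops the adjacent
    # duplicate "failed" warnings that crmd logs twice.
    grep -oE 'Initiating [a-z_]+ operation [^ ]+|Action [0-9]+ \([^)]+\) on [^ ]+ failed' | uniq
}

# Sample input: the crmd lines from node 2, unwrapped.
summarize_ops <<'EOF'
Feb 20 07:06:33 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating migrate_to operation srv01-c7_migrate_to_0 on m3-a02n01.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating migrate_from operation srv01-c7_migrate_from_0 locally on m3-a02n02.alteeve.com
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:  warning: Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed (target: 0 vs. rc: 1): Error
Feb 20 07:06:34 m3-a02n02.alteeve.com crmd[2394]:   notice: Initiating stop operation srv01-c7_stop_0 on m3-a02n01.alteeve.com
EOF
# Prints:
#   Initiating migrate_to operation srv01-c7_migrate_to_0
#   Action 22 (srv01-c7_migrate_to_0) on m3-a02n01.alteeve.com failed
#   Initiating migrate_from operation srv01-c7_migrate_from_0
#   Action 23 (srv01-c7_migrate_from_0) on m3-a02n02.alteeve.com failed
#   Initiating stop operation srv01-c7_stop_0
```

Read that way, the sequence is: migrate_to fails on the source, migrate_from is still attempted on the target and also fails, and the stop on the source follows, despite on-fail="ignore" being set on both migrate operations.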

Thoughts?

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould