[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Thu Jun 23 15:26:03 UTC 2016

Adam Spiers <aspiers at suse.com> wrote:
> As per the FIXME, one remaining problem is dealing with this kind of
> scenario:
> 
>   - Cloud operator notices SMART warnings on the compute node
>     which are not yet causing hard failures but signify that the
>     hard disk might die soon.
> 
>   - Operator manually runs "nova service-disable" with the intention
>     of doing some maintenance soon, i.e. live-migrating instances away
>     and replacing the dying hard disk.
> 
>   - Before the operator gracefully shuts down nova-compute, an I/O
>     error from the disk causes nova-compute to fail.
> 
>   - Pacemaker invokes the monitor action which spots the failure.
> 
>   - Pacemaker invokes the stop action which runs service-disable.
> 
>   - Pacemaker attempts to restart nova-compute by invoking the start
>     action.  Since the disk failure is currently intermittent, we
>     get (un)lucky and nova-compute starts fine.
> 
>     Then it calls service-enable - BAD!  This is now overriding the
>     cloud operator's manual request for the service to be disabled.
>     If we're really unlucky, nova-scheduler will now start up new VMs
>     on the node, even though the hard disk is dying.
> 
> However I can't see a way to defend against this :-/

OK, I think I figured this out.  The answer is not to use
service-disable at all, but to use force_down in the same way we
already use it during fencing.  Since force_down is tracked separately
from the enabled/disabled state, this means we don't mess with the
intentions of the cloud operator which were manually specified via
service-disable.
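
To make the difference concrete, here is a rough Python sketch of the
two approaches.  It is only a toy model, not the actual RA or the nova
API: the ComputeService class, the four stop/start helpers, recover()
and schedulable() are all hypothetical names made up for illustration,
and the model assumes the scheduler skips a service if either flag is
set.

```python
from dataclasses import dataclass


@dataclass
class ComputeService:
    """Toy model (hypothetical) of two independent per-service flags."""
    disabled: bool = False     # operator intent, set via "nova service-disable"
    forced_down: bool = False  # force_down flag, as already used during fencing


# Old approach: the RA's stop/start actions toggle the operator-owned flag.
def stop_with_service_disable(svc):
    svc.disabled = True


def start_with_service_enable(svc):
    svc.disabled = False  # overrides whatever the operator had requested


# Proposed approach: the RA only ever touches forced_down.
def stop_with_force_down(svc):
    svc.forced_down = True


def start_with_force_down(svc):
    svc.forced_down = False


def schedulable(svc):
    # Assumption of this model: either flag keeps new VMs off the node.
    return not svc.disabled and not svc.forced_down


def recover(svc, stop, start):
    """Simulate Pacemaker recovering a failed resource: stop, then start."""
    stop(svc)
    start(svc)
    return svc


# The operator has already run service-disable before the failure occurs.
old = recover(ComputeService(disabled=True),
              stop_with_service_disable, start_with_service_enable)
new = recover(ComputeService(disabled=True),
              stop_with_force_down, start_with_force_down)

print("service-disable approach, schedulable after recovery:", schedulable(old))
print("force_down approach, schedulable after recovery:", schedulable(new))
```

In this model the service-disable approach leaves the node schedulable
again after recovery, whereas the force_down approach leaves the
operator's disabled flag exactly as it was.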

I asked on #openstack-nova and got confirmation that this made sense.
Hooray!  Dare I suggest we are finally coming close to a consensus?