[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?
Andrew Beekhof
abeekhof at redhat.com
Thu Jun 23 23:22:48 UTC 2016
On Fri, Jun 24, 2016 at 1:26 AM, Adam Spiers <aspiers at suse.com> wrote:
> Adam Spiers <aspiers at suse.com> wrote:
>> As per the FIXME, one remaining problem is dealing with this kind of
>> scenario:
>>
>> - Cloud operator notices SMART warnings on the compute node
>> which are not yet causing hard failures but signify that the
>> hard disk might die soon.
>>
>> - Operator manually runs "nova service-disable" with the intention
>> of doing some maintenance soon, i.e. live-migrating instances away
>> and replacing the dying hard disk.
>>
>> - Before the operator gracefully shuts down nova-compute, an I/O
>> error from the disk causes nova-compute to fail.
>>
>> - Pacemaker invokes the monitor action which spots the failure.
>>
>> - Pacemaker invokes the stop action which runs service-disable.
>>
>> - Pacemaker attempts to restart nova-compute by invoking the start
>> action. Since the disk failure is currently intermittent, we
>> get (un)lucky and nova-compute starts fine.
>>
>> Then the start action calls service-enable - BAD! This is now overriding the
>> cloud operator's manual request for the service to be disabled.
>> If we're really unlucky, nova-scheduler will now start up new VMs
>> on the node, even though the hard disk is dying.
>>
>> However I can't see a way to defend against this :-/
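
To make the failure mode concrete, here is a rough sketch of the stop/start
behaviour described above, assuming the agent drives nova through
python-novaclient. The credentials and function names are illustrative,
not the actual resource agent code:

    from novaclient import client

    # Illustrative credentials; a real agent would take these from its
    # resource parameters or a config file.
    nova = client.Client("2", "user", "password", "project",
                         "http://keystone:5000/v2.0")

    def stop_nova_compute(host):
        # On a detected failure, the stop action disables the compute
        # service so the scheduler stops placing new instances there.
        nova.services.disable(host, "nova-compute")

    def start_nova_compute(host):
        # The start action then re-enables it -- which is exactly the
        # problem: it silently undoes an operator's earlier manual
        # "nova service-disable".
        nova.services.enable(host, "nova-compute")
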
>
> OK, I think I figured this out. The answer is not to use
> service-disable at all, but to use force_down in the same way we
> already use it during fencing. This means we don't mess with the
> intentions of the cloud operator which were manually specified via
> service-disable.
>
> I asked on #openstack-nova and got confirmation that this made sense.
> Hooray! Dare I suggest we are finally coming close to a consensus?
I'm sure we can find more to argue over if we put our minds to it :-)
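
For reference, a minimal sketch of the force_down-based approach discussed
above, again assuming python-novaclient (the os-services force-down call
needs compute API microversion 2.11 or later; names and credentials are
illustrative):

    from novaclient import client

    # Microversion 2.11 added the force-down call to os-services.
    nova = client.Client("2.11", "user", "password", "project",
                         "http://keystone:5000/v3")

    def on_failure(host):
        # Mark the compute service as forced down without touching its
        # enabled/disabled flag, mirroring what is already done during
        # fencing.
        nova.services.force_down(host, "nova-compute", True)

    def on_successful_start(host):
        # Clear the forced-down flag; an operator's earlier manual
        # "nova service-disable" is left untouched.
        nova.services.force_down(host, "nova-compute", False)
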