[ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
Digimer
lists at alteeve.ca
Mon Sep 19 15:07:41 EDT 2016
On 19/09/16 02:39 PM, Digimer wrote:
> On 19/09/16 02:30 PM, Jan Pokorný wrote:
>> On 18/09/16 15:37 -0400, Digimer wrote:
>>> If, for example, a server's definition file is corrupted while the
>>> server is running, rgmanager will put the server into a 'failed' state.
>>> That's fine and fair.
>>
>> Please, be more precise. Is it "vm" resource agent that you are talking
>> about, hence server is the particular virtual machine to be managed?
>> Is the agent in the role of a service (defined at a top-level) or
>> a standard resource (without special treatment, possibly with
>> dependent services further in the group)?
>
> In 'clustat', vm:foo reports 'failed' after vm.sh runs a status check and
> gets a bad return (for example, because a typo in the foo.xml file broke
> the XML).
>
> I'm not sure if that answers your question, sorry.
>
>>> The problem is that, once the file is fixed, there appears to be no
>>> way to go failed -> started without disabling (and thus powering off)
>>> the VM. This is troublesome because it forces an interruption when the
>>> service could have been placed under resource management without a reboot.
>>>
>>> For example, doing 'clusvcadm -e <server>' when the service was
>>> 'disabled' (say because of a manual boot of the server), rgmanager
>>> detects that the server is running fine and simply marks the server as
>>> 'started'. Is there no way to do something similar to go 'failed' ->
>>> 'started' without the 'disable' step?
>>
>> In case it's a VM as a service, this could possibly be "exploited"
>> (never tested that, though):
>>
>> # MANWIDTH=72 man rgmanager | col -b \
>> | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}'
>>> VIRTUAL MACHINE FEATURES
>>> Apart from what is noted in the VM resource agent, rgman-
>>> ager provides a few convenience features when dealing
>>> with virtual machines.
>>> * it will use live migration when transferring a virtual
>>> machine to a more-preferred host in the cluster as a
>>> consequence of failover domain operation
>>> * it will search the other instances of rgmanager in the
>>> cluster in the case that a user accidentally moves a
>>> virtual machine using other management tools
>>> * unlike services, adding a virtual machine to rgman-
>>> ager’s configuration will not cause the virtual machine
>>> to be restarted
>>> * removing a virtual machine from rgmanager’s
>>> configuration will leave the virtual machine running.
>>
>> (see the last two items).
>
> So a possible "recover" would be to remove the VM from rgmanager, then
> add it back? I can see that working, but it seems heavy handed. :)
>
>>> I tried freezing the service, no luck. I also tried coalescing via
>>> '-c', but that didn't help either.
>>
>> Any path from "failed" in the resource (group) life-cycle goes either
>> through "disabled" or "stopped" if I am not mistaken, so would rather
>> experiment with adding a new service and dropping the old one per
>> the above description as a possible workaround (perhaps in the reverse
>> order so as to retain the same name for the service, indeed unless
>> rgmanager would actively prevent that anyway -- no idea).
>
> This is my understanding as well, yes (that failed must go through
> 'disabled' or 'stopped').
>
> I'll try the remove/re-add option and report back.
OK, that didn't work.
I corrupted the XML definition to cause rgmanager to report the server as
'failed', removed it from rgmanager (clustat no longer reported it at
all), then re-added it. When it came back, it was still listed as 'failed'.
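To make the life-cycle point concrete, here is a toy model in Python. The
transition table is an assumption drawn only from what was said in this
thread (that 'failed' can only be left via 'disabled' or 'stopped'); it is
not taken from rgmanager's source, and all names in it are hypothetical.

```python
from collections import deque

# Assumed transition table, based only on the behaviour described in this
# thread. It is NOT derived from rgmanager's source code.
TRANSITIONS = {
    "started": {"failed", "disabled", "stopped"},
    "failed": {"disabled", "stopped"},   # no direct edge to "started"
    "disabled": {"started"},
    "stopped": {"started"},
}


def paths(src, dst):
    """Return every cycle-free path from src to dst, shortest first."""
    found = []
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        for nxt in sorted(TRANSITIONS.get(path[-1], ())):
            if nxt in path:
                continue  # skip states already visited on this path
            if nxt == dst:
                found.append(path + [nxt])
            else:
                queue.append(path + [nxt])
    return found


recovery_paths = paths("failed", "started")
for p in recovery_paths:
    print(" -> ".join(p))
```

Under that assumed table, every route from 'failed' back to 'started'
passes through 'disabled' or 'stopped', which matches what I'm seeing.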
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?