[ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
Digimer
lists at alteeve.ca
Mon Sep 19 15:16:48 EDT 2016
On 19/09/16 03:13 PM, Digimer wrote:
> On 19/09/16 03:07 PM, Digimer wrote:
>> On 19/09/16 02:39 PM, Digimer wrote:
>>> On 19/09/16 02:30 PM, Jan Pokorný wrote:
>>>> On 18/09/16 15:37 -0400, Digimer wrote:
>>>>> If, for example, a server's definition file is corrupted while the
>>>>> server is running, rgmanager will put the server into a 'failed' state.
>>>>> That's fine and fair.
>>>>
>>>> Please be more precise. Is it the "vm" resource agent that you are
>>>> talking about, i.e. the server is a particular virtual machine to be
>>>> managed? Is the agent in the role of a service (defined at the top
>>>> level) or a standard resource (without special treatment, possibly with
>>>> dependent services further down in the group)?
>>>
>>> In 'clustat', vm:foo reports 'failed' after vm.sh's status check returns
>>> an error (because the foo.xml file was corrupted, for example by a typo
>>> that breaks the XML).
>>>
>>> I'm not sure if that answers your question, sorry.
>>>
>>>>> The problem is that, once the file is fixed, there appears to be no
>>>>> way to go failed -> started without disabling (and thus powering off)
>>>>> the VM. This is troublesome because it forces an interruption when the
>>>>> service could have been placed under resource management without a reboot.
>>>>>
>>>>> For example, when 'clusvcadm -e <server>' is run on a service that is
>>>>> 'disabled' (say, because the server was booted manually), rgmanager
>>>>> detects that the server is running fine and simply marks it as
>>>>> 'started'. Is there no way to do something similar to go 'failed' ->
>>>>> 'started' without the 'disable' step?
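(For clarity, the 'disabled' case above is just this, with a hypothetical
server named 'foo' that is already running outside rgmanager's control:

# clusvcadm -e vm:foo

rgmanager notices the guest is already up and simply marks it 'started'
without touching it.)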
>>>>
>>>> In case it's a VM as a service, this could possibly be "exploited"
>>>> (never tested that, though):
>>>>
>>>> # MANWIDTH=72 man rgmanager | col -b \
>>>> | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}'
>>>>> VIRTUAL MACHINE FEATURES
>>>>> Apart from what is noted in the VM resource agent, rgman-
>>>>> ager provides a few convenience features when dealing
>>>>> with virtual machines.
>>>>> * it will use live migration when transferring a virtual
>>>>> machine to a more-preferred host in the cluster as a
>>>>> consequence of failover domain operation
>>>>> * it will search the other instances of rgmanager in the
>>>>> cluster in the case that a user accidentally moves a
>>>>> virtual machine using other management tools
>>>>> * unlike services, adding a virtual machine to rgman-
>>>>> ager’s configuration will not cause the virtual machine
>>>>> to be restarted
>>>>> * removing a virtual machine from rgmanager’s
>>>>> configuration will leave the virtual machine running.
>>>>
>>>> (see the last two items).
>>>
>>> So a possible "recover" would be to remove the VM from rgmanager, then
>>> add it back? I can see that working, but it seems heavy-handed. :)
>>>
>>>>> I tried freezing the service, no luck. I also tried convalescing via
>>>>> '-c', but that didn't help either.
>>>>
>>>> Any path from "failed" in the resource (group) life-cycle goes either
>>>> through "disabled" or "stopped" if I am not mistaken, so would rather
>>>> experiment with adding a new service and dropping the old one per
>>>> the above description as a possible workaround (perhaps in the reverse
>>>> order so as to retain the same name for the service, indeed unless
>>>> rgmanager would actively prevent that anyway -- no idea).
>>>
>>> This is my understanding as well, yes (that failed must go through
>>> 'disabled' or 'stopped').
>>>
>>> I'll try the remove/re-add option and report back.
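(For the record, by "remove/re-add" I mean dropping the <vm .../> entry from
cluster.conf and then putting it back. Roughly, as a sketch only and with a
hypothetical vm named 'foo':

  1. Edit /etc/cluster/cluster.conf, delete the <vm name="foo" .../> line
     and bump config_version by one.
  2. Check the config and push it to the other nodes:

# ccs_config_validate
# cman_tool version -r

  3. Confirm with 'clustat' that vm:foo is gone, then restore the <vm .../>
     line, bump config_version again and run 'cman_tool version -r' once
     more.)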
>>
>> OK, didn't work.
>>
>> I corrupted the XML definition to cause rgmanager to report it as
>> 'failed', removed it from rgmanager (clustat no longer reported it at
>> all), re-added it and when it came back, it was still listed as 'failed'.
>
> Ha!
>
> So, since it was still flagged as 'failed', I called '-d' to disable it
> (after adding it back to rgmanager) and it went 'disabled' WITHOUT
> stopping the server. When I called '-e' on node 2 (the server was on
> node 1), it started on node 1 properly and returned to a 'started' state
> without restarting.
>
> I wonder if I could call disable directly from the other node...
So yes, I can.
If I call -d on a node that ISN'T the host, it flags the server as
stopped without actually shutting it down. Then I can call '-e' and
bring it back up fine.
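To spell out the sequence for the archives (hypothetical server 'foo'
running on node1, both commands run from node2):

# clusvcadm -d vm:foo
# clusvcadm -e vm:foo

The '-d' marks vm:foo 'disabled' without shutting the guest down (because
node2 isn't hosting it), and the '-e' finds the guest already running on
node1 and marks it 'started' again. (I believe '-m node1' could be added to
the enable call to be explicit about the member.)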
This feels like I am exploiting a bug though... I wonder if there is a
more "proper" way to recover the server?
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?