[ClusterLabs] monitor timed out with unknown error

Sun May 5 14:53:08 EDT 2019

05.05.2019 21:43, Arkadiy Kulev wrote:
> Is there a way to get Pacemaker to retry stopping the resource
> if the stop fails?
> 

Not at the Pacemaker level. You would need to modify the resource agent
to retry the operation.
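
As a rough sketch of what such a change could look like, assuming a
custom agent written in POSIX shell: the helper below runs a stop command
up to a fixed number of times and maps the result to the OCF return codes.
The name retry_stop and its arguments are made up for illustration; they
are not part of Pacemaker or the resource-agents package.

```shell
#!/bin/sh
# Hypothetical helper for a custom resource agent (illustrative only).
# OCF return codes as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# retry_stop ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_stop() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return "$OCF_SUCCESS"
        fi
        # Wait only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return "$OCF_ERR_GENERIC"
}
```

In the agent's stop action you would then call something like
"retry_stop 3 5 my_stop_command", where my_stop_command is a placeholder
for whatever the agent does today. Keep the total retry time well below
the stop timeout configured for the resource, otherwise Pacemaker will
time the operation out before the retries finish.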

> Sincerely,
> Ark.
> 
> eth at ethaniel.com
> 
> 
> On Sun, May 5, 2019 at 11:05 PM Andrei Borzenkov <arvidjaar at gmail.com>
> wrote:
> 
>> 05.05.2019 18:43, Arkadiy Kulev wrote:
>>> Dear Andrei,
>>>
>>> I'm sorry for the screenshot, this is the only thing that I have left
>>> after the crash.
>>>
>>
>> What crash do you mean? All nodes appear up and running, you are able to
>> execute commands, I do not see anything crashed.
>>
>>> What would the best course of action be in this situation?
>>
>> Configure STONITH. It is mandatory so that Pacemaker can resolve
>> situations like this one, among others.
>>
>> For now, assuming the node problems are over, you should be able to
>> clean up the resource state (crm_resource --cleanup). Restarting
>> Pacemaker on all nodes would also work.
>>
>>> We don't have a STONITH device. But the local network is still up (both
>>> nodes see each other).
>>>
>>> Also, what does "(blocked)" mean?
>>>
>>
>> It means that Pacemaker cannot perform any action on this resource due
>> to failed prerequisites. In this case the failed prerequisite was a
>> successful stop of the resource.
>>
>>> Sincerely,
>>> Ark.
>>>
>>> eth at ethaniel.com
>>>
>>>
>>> On Sun, May 5, 2019 at 9:46 PM Andrei Borzenkov <arvidjaar at gmail.com>
>>> wrote:
>>>
>>>> 05.05.2019 16:14, Arkadiy Kulev wrote:
>>>>> Hello!
>>>>>
>>>>> I run pacemaker on 2 active/active hosts which balance the load of 2
>>>>> public IP addresses.
>>>>> A few days ago we ran a very CPU/network intensive process on one of
>>>>> the 2 hosts and Pacemaker failed.
>>>>>
>>>>> I've attached a screenshot of the terminal to this email.
>>>>>
>>>>> The "Failed Actions" shows that the IPaddr2 "monitor_30000" failed with
>>>>> "unknown error" and a status of "Timed Out" (queue=0ms exec=0ms). The
>>>>> /etc/init.d LSB script (mycluster) failed as well (and was set to blocked).
>>>>>
>>>>> This completely stalled Pacemaker and the second host didn't take over
>>>>> the IP address and gateway settings.
>>>>>
>>>>> Any ideas would be appreciated.
>>>>>
>>>>
>>>> The stop operation failed and you have no STONITH, so Pacemaker cannot
>>>> continue and is stuck.
>>>>
>>>>
>>>>>
>>>>> [image: Screen Shot 2019-04-30 at 12.36.34.png]
>>>>>
>>>>
>>>>
>>>> Images are hard to reply to, consume excessive space and cannot be
>>>> viewed using text-only clients. There is no reason to send an image
>>>> when you can just copy and paste several lines of text.
>>>> _______________________________________________
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/