[ClusterLabs] monitor timed out with unknown error

Mon May 6 01:30:35 EDT 2019

Andrei,

I just went through the docs
(https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html)
and it says that the option "*failure-timeout*" is responsible for retrying
a failed resource.

*"If STONITH is not enabled, then the cluster has no way to continue and
will not try to start the resource elsewhere, but will try to stop it again
after the failure timeout."*

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html
says that "failure-timeout" is disabled by default:

*How many seconds to wait before acting as if the failure had not occurred,
and potentially allowing the resource back to the node on which it failed.
A value of 0 indicates that this feature is disabled.*
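
A minimal sketch of how this meta-attribute could be set from the shell
with crm_resource. The function name set_failure_timeout, the resource ID
"ClusterIP" and the value "120s" are hypothetical placeholders and do not
come from the configuration discussed in this thread. Whether a non-zero
failure-timeout is enough to recover from a failed stop without STONITH
should be checked against the documentation quoted above.

```shell
#!/bin/sh
# Sketch: set the failure-timeout meta-attribute on a single resource.
# The function name, "ClusterIP" and "120s" are hypothetical placeholders.
set_failure_timeout() {
    rsc="$1"       # resource ID as shown by crm_mon
    timeout="$2"   # for example 120s; 0 disables the feature
    crm_resource --resource "$rsc" --meta \
        --set-parameter failure-timeout --parameter-value "$timeout"
}

# Usage (on a cluster node):
#   set_failure_timeout ClusterIP 120s
```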

Sincerely,
Ark.

eth at ethaniel.com

On Mon, May 6, 2019 at 1:53 AM Andrei Borzenkov <arvidjaar at gmail.com> wrote:

> 05.05.2019 21:43, Arkadiy Kulev wrote:
> > Is there a way to get Pacemaker to retry stopping the resource
> > if the stop failed?
> >
>
> Not at the Pacemaker level. You would need to modify the resource agent
> to retry the operation.
>
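The suggestion above, retrying inside the resource agent, might look roughly
like the following sketch. The helper name stop_with_retry, the attempt count
and the delay are hypothetical, and do_stop stands in for whatever the
existing stop action of the script already does.

```shell
#!/bin/sh
# Sketch: retry a stop command a few times before reporting failure.
# stop_with_retry, the attempt count and the delay are hypothetical.
stop_with_retry() {
    attempts="$1"   # maximum number of tries
    delay="$2"      # seconds to sleep between tries
    shift 2         # remaining arguments are the stop command itself
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0    # stop succeeded
        fi
        # Sleep only if another attempt will follow.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1            # every attempt failed; report an error
}

# Usage inside the script's stop action:
#   stop_with_retry 3 5 do_stop || exit 1
```

All attempts together have to finish within the timeout configured for the
stop operation, otherwise the stop is still reported as timed out.
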
> > Sincerely,
> > Ark.
> >
> > eth at ethaniel.com
> >
> >
> > On Sun, May 5, 2019 at 11:05 PM Andrei Borzenkov <arvidjaar at gmail.com>
> > wrote:
> >
> >> 05.05.2019 18:43, Arkadiy Kulev wrote:
> >>> Dear Andrei,
> >>>
> >>> I'm sorry for the screenshot, this is the only thing that I have left
> >>> after the crash.
> >>>
> >>
> >> What crash do you mean? All nodes appear up and running, you are able to
> >> execute commands, I do not see anything crashed.
> >>
> >>> What would the best course of action be in this situation?
> >>
> >> Configure STONITH. It is mandatory so that Pacemaker can resolve
> >> situations like this one, among others.
> >>
> >> For now, assuming the node problems are over, you should be able to
> >> clean up the resource state (crm_resource --cleanup). Restarting
> >> Pacemaker on all nodes would also work.
> >>
> >>> We don't have a STONITH device. But the local network is still up (both
> >>> nodes see each other).
> >>>
> >>> Also, what does "(blocked)" mean?
> >>>
> >>
> >> It means that Pacemaker cannot perform any action on this resource
> >> because a prerequisite failed. In this case the failed prerequisite was
> >> a successful stop of the resource.
> >>
> >>> Sincerely,
> >>> Ark.
> >>>
> >>> eth at ethaniel.com
> >>>
> >>>
> >>> On Sun, May 5, 2019 at 9:46 PM Andrei Borzenkov <arvidjaar at gmail.com>
> >>> wrote:
> >>>
> >>>> 05.05.2019 16:14, Arkadiy Kulev wrote:
> >>>>> Hello!
> >>>>>
> >>>>> I run pacemaker on 2 active/active hosts which balance the load of 2
> >>>>> public IP addresses.
> >>>>> A few days ago we ran a very CPU/network intensive process on one of
> >>>>> the 2 hosts and Pacemaker failed.
> >>>>>
> >>>>> I've attached a screenshot of the terminal to this email.
> >>>>>
> >>>>> The "Failed Actions" section shows that the IPaddr2 "monitor_30000"
> >>>>> failed with "unknown error" and a status of "Timed Out" (queue=0ms
> >>>>> exec=0ms). The /etc/init.d LSB script (mycluster) failed as well (and
> >>>>> was set to blocked).
> >>>>>
> >>>>> This completely stalled Pacemaker and the second host didn't take over
> >>>>> the IP address and gateway settings.
> >>>>>
> >>>>> Any ideas would be appreciated.
> >>>>>
> >>>>
> >>>> The stop operation failed and you have no STONITH, so Pacemaker cannot
> >>>> continue and is stuck.
> >>>>
> >>>>
> >>>>>
> >>>>> [image: Screen Shot 2019-04-30 at 12.36.34.png]
> >>>>>
> >>>>
> >>>>
> >>>> Images are hard to reply to, consume excessive space and cannot be
> >>>> viewed using text-only clients. There is no reason to send an image
> >>>> when you can just copy and paste several lines of text.
> >>>> _______________________________________________
> >>>> Manage your subscription:
> >>>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>>
> >>>> ClusterLabs home: https://www.clusterlabs.org/