[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
Klaus Wenninger
kwenning at redhat.com
Wed Oct 19 14:06:20 EDT 2016
On 10/14/2016 11:21 AM, renayama19661014 at ybb.ne.jp wrote:
> Hi Klaus,
> Hi All,
>
> I tried a prototype of a watchdog using the WD service.
> - https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
>
> Please comment.
Thank you, Hideo, for providing the prototype.
I added the patch to my build and it seems to
be working as expected.
A few thoughts triggered by this approach:
- We have to alert the corosync people: in
a chat, Jan Friesse pointed out to me that
the wd-service was planned to be removed
for corosync 3.x.
This is especially delicate because the binding is
very loose: as is, the patch builds against a corosync
with the wd-service disabled without any complaints...
- As of now, if you enable the wd-service in the
corosync build, it is on by default and would
presumably be hogging the watchdog.
(There is apparently a pull request that makes
it default to off.)
- With my thoughts earlier in the thread about
adding an API to sbd, I was also trying to
target closer observation of pacemaker_remoted
(remote nodes don't have corosync running).
I guess it would be possible to run corosync
with a static config as a single-node cluster
bound to localhost for that purpose.
I read the thread about corosync-remote, and
if that happens it might make the special handling
for pacemaker-remote obsolete anyway...
- To enable the approach to live alongside
sbd, it would be possible to make sbd use
the corosync API for watchdog purposes as well,
instead of opening the watchdog directly.
This shouldn't be a big deal for sbd used to
observe a pacemaker node, as the cluster-watcher
(the part of sbd that sends cpg-pings to corosync)
already builds against corosync.
The blockdevice part of sbd being basically
generic, it might be an issue there though.
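
To make the relay idea from earlier in the thread more concrete (an API
where an application creates a watchdog identified by a name and a
timeout), here is a minimal Python sketch. It is not sbd or corosync
code: the class WatchdogRelay, its methods and the injected clock are
all hypothetical names chosen for illustration. The relay keeps feeding
the hardware watchdog only while every registered client has fed its own
named watchdog in time, so a frozen crmd would let the hardware
watchdog expire.

```python
import time


class WatchdogRelay:
    """Multiplex several named software watchdogs onto one hardware watchdog.

    Hypothetical illustration only; not the API of sbd or corosync.
    """

    def __init__(self, feed_hardware, clock=time.monotonic):
        self._feed_hardware = feed_hardware  # callable that pets the real device
        self._clock = clock                  # injectable for deterministic tests
        self._clients = {}                   # name -> (timeout, absolute deadline)

    def register(self, name, timeout):
        """Create a named watchdog that must be fed every `timeout` seconds."""
        self._clients[name] = (timeout, self._clock() + timeout)

    def feed(self, name):
        """Called by the client (e.g. crmd) to push its deadline forward."""
        timeout, _ = self._clients[name]
        self._clients[name] = (timeout, self._clock() + timeout)

    def expired(self):
        """Return the names of all clients that missed their deadline."""
        now = self._clock()
        return sorted(n for n, (_, deadline) in self._clients.items()
                      if deadline < now)

    def tick(self):
        """Feed the hardware watchdog only if no client is late."""
        late = self.expired()
        if not late:
            self._feed_hardware()
        return late


# Simulated run with a fake clock and a fake hardware watchdog.
now = [0.0]
hardware_feeds = []
relay = WatchdogRelay(lambda: hardware_feeds.append(now[0]),
                      clock=lambda: now[0])
relay.register("crmd", timeout=5)

now[0] = 3.0
relay.feed("crmd")      # crmd is alive, deadline moves to t=8
print(relay.tick())     # no late clients, hardware watchdog is fed

now[0] = 9.0            # crmd got SIGSTOP and never fed again
print(relay.tick())     # crmd is reported late, hardware is NOT fed
```

In a real setup feed_hardware would write to the watchdog device, and
not feeding it is what eventually resets the node.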
Regards,
Klaus
>
>
> Best Regards,
> Hideo Yamauchi.
>
>
> ----- Original Message -----
>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>> To: "users at clusterlabs.org" <users at clusterlabs.org>
>> Cc:
>> Date: 2016/10/11, Tue 17:58
>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>
>> Hi Klaus,
>>
>> Thank you for your comment.
>>
>> I am preparing a prototype patch that uses the WD service.
>>
>> Please wait a little.
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>>
>>
>> ----- Original Message -----
>>> From: Klaus Wenninger <kwenning at redhat.com>
>>> To: users at clusterlabs.org
>>> Cc:
>>> Date: 2016/10/10, Mon 21:03
>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>> On 10/07/2016 11:10 PM, renayama19661014 at ybb.ne.jp wrote:
>>>> Hi All,
>>>>
>>>> Our users may not necessarily use sbd.
>>>>
>>>> I confirmed that the WD service of corosync can be used as one
>>>> method that does not require sbd.
>>>> Pacemaker can monitor its own processes through the WD service using CMAP
>>>> and trigger the watchdog.
>>>
>>> Have to have a look at that...
>>> But if we establish some in-between-layer in pacemaker we could have this
>>> as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>> a watchdog-device directly, ...
>>>
>>>>
>>>> We can prepare a patch for pacemaker.
>>> Always helpful to discuss/clarify an idea once some code is available ...
>>>
>>>> Has the discussion about using the WD service already ended?
>>> Not from my pov. Just a day off ;-)
>>>
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>>> To: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>; users at clusterlabs.org
>>>>> Cc:
>>>>> Date: 2016/10/7, Fri 17:47
>>>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>> On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>>>>>>> Klaus Wenninger <kwenning at redhat.com> wrote on 06.10.2016 at 18:03
>>>>>> in message <3980cfdd-ebd9-1597-f6bd-a1ca808f7688 at redhat.com>:
>>>>>>> On 10/05/2016 04:22 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>>>> If a user uses sbd, can the cluster avoid the problem of a SIGSTOP of crmd?
>>>>>>>>>
>>>>>>>>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>> will reboot the node (unless the watchdog fails).
>>>>>>>> Thank you for your comment.
>>>>>>>>
>>>>>>>> We are examining a watchdog for crmd, too.
>>>>>>>> I will comment again once the examination has advanced.
>>>>>>> I was thinking of doing a small test implementation going
>>>>>>> a little in the direction Lars Ellenberg had been pointing out.
>>>>>>> A couple of thoughts I had so far:
>>>>>>>
>>>>>>> - add an API (via DBus or libqb - favoring libqb atm) to sbd that
>>>>>>> an application can use to create a watchdog within sbd
>>>>>> Why does it have to be done within sbd?
>>>>> Not necessarily; it could be spun off into its own project, or
>>>>> something that already exists could be used.
>>>>> I remember having added a dbus-interface to
>>>>> https://sourceforge.net/projects/watchdog/ for a project once.
>>>>> If you have a suggestion I'm open.
>>>>> Starting from sbd would have the advantage of a smooth start:
>>>>>
>>>>> - cluster/pacemaker-watcher are there already and can
>>>>> be replaced/moved over time
>>>>> - the lifecycle of the daemon (when started/stopped) is
>>>>> already something that is in the code and in people's minds
>>>>>>> - parameters for the first are a name and a timeout
>>>>>>>
>>>>>>> - first use-case would be crmd observation
>>>>>>>
>>>>>>> - later on we could think of removing pacemaker dependencies
>>>>>>> from sbd by moving the actual implementation of
>>>>>>> pacemaker-watcher and probably cluster-watcher as well
>>>>>>> into pacemaker - using the new API
>>>>>>>
>>>>>>> - this of course creates sbd dependency within pacemaker so
>>>>>>> that it would make sense to offer a simpler and self-contained
>>>>>>> implementation within pacemaker as an alternative
>>>>>> I think the watchdog interface is so simple that you don't need a relay
>>>>>> for it. The only limit I can imagine is the number of watchdogs available on
>>>>>> some specific hardware.
>>>>> That is the point ;-)
>>>>>>> thus it would be favorable to have the dependency
>>>>>>> within a non-compulsory pacemaker-rpm so that
>>>>>>> we can offer an alternative that doesn't use sbd
>>>>>>> at maybe the cost of being less reliable or one
>>>>>>> that owns a hardware-watchdog by itself for systems
>>>>>>> where this is still unused.
>>>>>>>
>>>>>>> - e.g. via some kind of plugin (Andrew forgive me -
>>>>>>> no pils ;-) )
>>>>>>> - or via an additional daemon
>>>>>>>
>>>>>>> What did you have in mind?
>>>>>>> Maybe it makes sense to synchronize...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Klaus
>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Hideo Yamauchi.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>>>>>> To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>>>>>> Cc:
>>>>>>>>> Date: 2016/10/5, Wed 23:08
>>>>>>>>> Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>>>>>>> <renayama19661014 at ybb.ne.jp> wrote on 21.09.2016 at 11:52
>>>>>>>>> in message <876439.61305.qm at web200311.mail.ssk.yahoo.co.jp>:
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Was a final conclusion reached about this problem?
>>>>>>>>>> If a user uses sbd, can the cluster avoid the problem of a SIGSTOP of crmd?
>>>>>>>>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>> will reboot the node (unless the watchdog fails).
>>>>>>>>>
>>>>>>>>>> We are interested in this problem, too.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>>
>>>>>>>>>> Hideo Yamauchi.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>>>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>> Bugs: http://bugs.clusterlabs.org