[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
Jan Friesse
jfriesse at redhat.com
Thu Oct 20 06:46:22 UTC 2016
>
> On 10/14/2016 11:21 AM, renayama19661014 at ybb.ne.jp wrote:
>> Hi Klaus,
>> Hi All,
>>
>> I tried a prototype of a watchdog using the WD service.
>> - https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
>>
>> Please comment.
> Thank you Hideo for providing the prototype.
> Added the patch to my build and it seems to
> be working as expected.
>
> A few thoughts triggered by this approach:
>
> - we have to alert the corosync-people as in
> a chat with Jan Friesse he pointed me to the
> fact that for corosync 3.x the wd-service was
> planned to be removed
Actually I didn't express myself correctly. What I wanted to say was
"I'm considering the idea of removing it", simply because it's disabled
downstream.
BUT keep in mind that removing functionality means first asking the
community whether somebody is actively using it.
And because there are active users and a future use case, removing wd is
not an option.
>
> especially delicate as the binding is very loose
> so that - as is - it builds against a corosync with
> disabled wd-service without any complaints...
>
> - as of now if you enable wd-service in the
> corosync-build it is on by default and would
> be hogging the watchdog presumably
> (there is obviously a pull request that makes
> it default to off)
>
> - with my thoughts about adding an API to
> sbd previously in the thread I was trying to
> target closer observation of pacemaker_remoted
> as well (remote-nodes don't have corosync
> running)
>
> I guess it would be possible to run corosync
> with a static config as single-node cluster
> bound to localhost for that purpose.
>
> I read the thread about corosync-remote and
> that happening might make the special-handling
> for pacemaker-remote obsolete anyway ...
>
> - to enable the approach to live alongside
> sbd it would be possible to make sbd use
> the corosync-API as well for watchdog purposes
> instead of opening the watchdog directly
>
> This shouldn't be a big deal for sbd used to
> observe a pacemaker-node as cluster-watcher
> (the part of sbd that sends cpg-pings to corosync)
> already builds against corosync.
> The blockdevice-part of sbd being basically
> generic it might be an issue though.
>
> Regards,
> Klaus
>
>>
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> ----- Original Message -----
>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>> To: "users at clusterlabs.org" <users at clusterlabs.org>
>>> Cc:
>>> Date: 2016/10/11, Tue 17:58
>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>
>>> Hi Klaus,
>>>
>>> Thank you for the comment.
>>>
>>> I am making a prototype patch that uses the WD service.
>>>
>>> Please wait a little.
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>> To: users at clusterlabs.org
>>>> Cc:
>>>> Date: 2016/10/10, Mon 21:03
>>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>> On 10/07/2016 11:10 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>> Hi All,
>>>>>
>>>>> Our users may not necessarily use sbd.
>>>>>
>>>>> I confirmed that the WD service of corosync can be used as one
>>>>> method that does not require sbd.
>>>>> Pacemaker can watch the pacemaker processes through the WD service using CMAP
>>>>> and so carry out watchdog monitoring.
>>>>
>>>> Have to have a look at that...
>>>> But if we establish some in-between-layer in pacemaker we could have this
>>>> as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>>> a watchdog-device directly, ...
>>>>
>>>>>
>>>>> We can set up a patch of pacemaker.
>>>> Always helpful to discuss/clarify an idea once some code is available ...
>>>>
>>>>> Has the discussion about using the WD service already ended?
>>>> Not from my pov. Just a day off ;-)
>>>>
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>>>> To: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>; users at clusterlabs.org
>>>>>> Cc:
>>>>>> Date: 2016/10/7, Fri 17:47
>>>>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>> On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>>>>>>>> Klaus Wenninger <kwenning at redhat.com> wrote on 06.10.2016 at 18:03 in
>>>>>>> message <3980cfdd-ebd9-1597-f6bd-a1ca808f7688 at redhat.com>:
>>>>>>>> On 10/05/2016 04:22 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>>>> If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
>>>>>>>>>>
>>>>>>>>>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>> will reboot the node (unless the watchdog fails).
>>>>>>>>> Thank you for the comment.
>>>>>>>>>
>>>>>>>>> We are examining a watchdog for crmd, too.
>>>>>>>>> I will comment again once the examination has advanced.
>>>>>>>> Was thinking of doing a small test implementation going
>>>>>>>> a little in the direction Lars Ellenberg had been pointing out.
>>>>>>>> a couple of thoughts I had so far:
>>>>>>>>
>>>>>>>> - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>>>>>>> an application can use to create a watchdog within sbd
>>>>>>> Why has it to be done within sbd?
>>>>>> Not necessarily, could be spawned out as well into an own project or
>>>>>> something already existent could be taken.
>>>>>> Remember to have added a dbus-interface to
>>>>>> https://sourceforge.net/projects/watchdog/ for a project once.
>>>>>> If you have a suggestion I'm open.
>>>>>> Going off sbd would have the advantage of a smooth start:
>>>>>>
>>>>>> - cluster/pacemaker-watcher are there already and can
>>>>>> be replaced/moved over time
>>>>>> - the lifecycle of the daemon (when started/stopped) is
>>>>>> already something that is in the code and in the people's minds
>>>>>>>> - parameters for the first are a name and a timeout
>>>>>>>>
>>>>>>>> - first use-case would be crmd observation
>>>>>>>>
>>>>>>>> - later on we could think of removing pacemaker dependencies
>>>>>>>> from sbd by moving the actual implementation of
>>>>>>>> pacemaker-watcher and probably cluster-watcher as well
>>>>>>>> into pacemaker - using the new API
>>>>>>>>
>>>>>>>> - this of course creates sbd dependency within pacemaker so
>>>>>>>> that it would make sense to offer a simpler and self-contained
>>>>>>>> implementation within pacemaker as an alternative
>>>>>>> I think the watchdog interface is so simple that you don't need a relay
>>>>>>> for it. The only limit I can imagine is the number of watchdogs available of
>>>>>>> some specific hardware.
>>>>>> That is the point ;-)
>>>>>>>> thus it would be favorable to have the dependency
>>>>>>>> within a non-compulsory pacemaker-rpm so that
>>>>>>>> we can offer an alternative that doesn't use sbd
>>>>>>>> at maybe the cost of being less reliable or one
>>>>>>>> that owns a hardware-watchdog by itself for systems
>>>>>>>> where this is still unused.
>>>>>>>>
>>>>>>>> - e.g. via some kind of plugin (Andrew forgive me - no pils ;-) )
>>>>>>>> - or via an additional daemon
>>>>>>>>
>>>>>>>> What did you have in mind?
>>>>>>>> Maybe it makes sense to synchronize...
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Klaus
>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Hideo Yamauchi.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>>>>>>> To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>>>>>>> Cc:
>>>>>>>>>> Date: 2016/10/5, Wed 23:08
>>>>>>>>>> Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>>>>>>>> <renayama19661014 at ybb.ne.jp> wrote on 21.09.2016 at 11:52
>>>>>>>>>> in message
>>>>>>>>>> <876439.61305.qm at web200311.mail.ssk.yahoo.co.jp>:
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Was the final conclusion given about this problem?
>>>>>>>>>>> If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
>>>>>>>>>> As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>> will reboot the node (unless the watchdog fails).
>>>>>>>>>>
>>>>>>>>>>> We are interested in this problem, too.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>>
>>>>>>>>>>> Hideo Yamauchi.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>>>>>>> http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org