[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

Jan Friesse jfriesse at redhat.com
Thu Oct 20 08:46:22 CEST 2016


>
> On 10/14/2016 11:21 AM, renayama19661014 at ybb.ne.jp wrote:
>> Hi Klaus,
>> Hi All,
>>
>> I tried a prototype of a watchdog using the WD service.
>>   - https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
>>
>> Please comment.
> Thank you Hideo for providing the prototype.
> Added the patch to my build and it seems to
> be working as expected.
>
> A few thoughts triggered by this approach:
>
> - we have to alert the corosync-people as in
>    a chat with Jan Friesse he pointed me to the
>    fact that for corosync 3.x the wd-service was
>    planned to be removed

Actually I didn't express myself correctly. What I wanted to say was
"I'm considering the idea of removing it", simply because it's disabled
downstream.

BUT keep in mind that removing functionality means first asking the
community whether somebody is actively using it.

And because there are active users and a future use case, removing wd is
not an option.


>
>    especially delicate as the binding is very loose
>    so that - as is - it builds against a corosync with
>    disabled wd-service without any complaints...
>
> - as of now if you enable wd-service in the
>    corosync-build it is on by default and would
>    be hogging the watchdog presumably
>    (there is obviously a pull request that makes
>    it default to off)
>
> - with my thoughts about adding an API to
>    sbd previously in the thread I was trying to
>    target closer observation of pacemaker_remoted
>    as well (remote-nodes don't have corosync
>    running)
>
>    I guess it would be possible to run corosync
>    with a static config as single-node cluster
>    bound to localhost for that purpose.
>
>    I read the thread about corosync-remote and
>    that happening might make the special-handling
>    for pacemaker-remote obsolete anyway ...
>
> - to enable the approach to live alongside
>    sbd it would be possible to make sbd use
>    the corosync-API as well for watchdog purposes
>    instead of opening the watchdog directly
>
>    This shouldn't be a big deal for sbd used to
>    observe a pacemaker-node, as the cluster-watcher
>    (the part of sbd that sends cpg-pings to corosync)
>    already builds against corosync.
>    Since the blockdevice-part of sbd is basically
>    generic, it might be an issue there though.
>
> Regards,
> Klaus
>
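
To make the idea discussed in the quoted thread more concrete (a watchdog that an application such as crmd registers with "a name and a timeout" and then has to keep feeding), here is a minimal sketch in Python. It is not the sbd API, the corosync wd-service, or Hideo's patch; the class name `WatchdogRegistry` and its methods are hypothetical and only illustrate the bookkeeping such a layer would need.

```python
import time


class WatchdogRegistry:
    """Hypothetical registry of named software watchdogs (illustration only)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the logic can be exercised without sleeping.
        self._clock = clock
        self._entries = {}  # name -> (timeout in seconds, time of last feed)

    def register(self, name, timeout):
        """Create a watchdog identified by name that expires after timeout seconds."""
        if timeout <= 0:
            raise ValueError("timeout must be positive")
        self._entries[name] = (timeout, self._clock())

    def feed(self, name):
        """Record a sign of life from the named client."""
        timeout, _ = self._entries[name]
        self._entries[name] = (timeout, self._clock())

    def expired(self):
        """Return the sorted names of all clients that missed their timeout."""
        now = self._clock()
        return sorted(
            name
            for name, (timeout, last_fed) in self._entries.items()
            if now - last_fed > timeout
        )
```

A daemon that owns the hardware watchdog would presumably keep feeding the device only while `expired()` returns an empty list. A crmd that received SIGSTOP would stop calling `feed("crmd")`, its entry would expire, and the hardware watchdog would then be left to reboot the node.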
>>
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> ----- Original Message -----
>>> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>> To: "users at clusterlabs.org" <users at clusterlabs.org>
>>> Cc:
>>> Date: 2016/10/11, Tue 17:58
>>> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>
>>> Hi Klaus,
>>>
>>> Thank you for the comment.
>>>
>>> I am making a patch that is a prototype using the WD service.
>>>
>>> Please wait a little.
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>>>   From: Klaus Wenninger <kwenning at redhat.com>
>>>>   To: users at clusterlabs.org
>>>>   Cc:
>>>>   Date: 2016/10/10, Mon 21:03
>>>>   Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>   On 10/07/2016 11:10 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>    Hi All,
>>>>>
>>>>>    Our users may not necessarily use sbd.
>>>>>
>>>>>    I confirmed that using the WD service of corosync is one
>>>>>    way to do without sbd.
>>>>>    Pacemaker can have its processes watched by the WD service via CMAP
>>>>>    and so trigger the watchdog.
>>>>
>>>>   Have to have a look at that...
>>>>   But if we establish some in-between-layer in pacemaker we could have this
>>>>   as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>>>   a watchdog-device directly, ...
>>>>
>>>>>
>>>>>    We can set up a patch of pacemaker.
>>>>   Always helpful to discuss/clarify an idea once some code is available ...
>>>>
>>>>>    Has the discussion about using the WD service already ended?
>>>>   Not from my pov. Just a day off ;-)
>>>>
>>>>>
>>>>>    Best Regards,
>>>>>    Hideo Yamauchi.
>>>>>
>>>>>
>>>>>    ----- Original Message -----
>>>>>>    From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>    To: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>; users at clusterlabs.org
>>>>>>    Cc:
>>>>>>    Date: 2016/10/7, Fri 17:47
>>>>>>    Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>    On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>>>>>>>>     Klaus Wenninger <kwenning at redhat.com> wrote on 06.10.2016 at 18:03 in
>>>>>>>     message <3980cfdd-ebd9-1597-f6bd-a1ca808f7688 at redhat.com>:
>>>>>>>>     On 10/05/2016 04:22 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>>     Hi All,
>>>>>>>>>
>>>>>>>>>>>     If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
>>>>>>>>>>
>>>>>>>>>>     As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>>     will reboot the node (unless the watchdog fails).
>>>>>>>>>     Thank you for the comment.
>>>>>>>>>
>>>>>>>>>     We are examining a watchdog for crmd, too.
>>>>>>>>>     I will comment again once the examination has advanced.
>>>>>>>>     Was thinking of doing a small test implementation going
>>>>>>>>     a little in the direction Lars Ellenberg had been pointing out.
>>>>>>>>     a couple of thoughts I had so far:
>>>>>>>>
>>>>>>>>     - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>>>>>>>       an application can use to create a watchdog within sbd
>>>>>>>     Why has it to be done within sbd?
>>>>>>    Not necessarily, it could as well be spun off into its own project or
>>>>>>    something already existent could be taken.
>>>>>>    I remember having added a dbus-interface to
>>>>>>    https://sourceforge.net/projects/watchdog/ for a project once.
>>>>>>    If you have a suggestion I'm open.
>>>>>>    Going off sbd would have the advantage of a smooth start:
>>>>>>
>>>>>>    - cluster/pacemaker-watcher are there already and can
>>>>>>      be replaced/moved over time
>>>>>>    - the lifecycle of the daemon (when started/stopped) is
>>>>>>      already something that is in the code and in the people's minds
>>>>>>>>     - parameters for the first are a name and a timeout
>>>>>>>>
>>>>>>>>     - first use-case would be crmd observation
>>>>>>>>
>>>>>>>>     - later on we could think of removing pacemaker dependencies
>>>>>>>>       from sbd by moving the actual implementation of
>>>>>>>>       pacemaker-watcher and probably cluster-watcher as well
>>>>>>>>       into pacemaker - using the new API
>>>>>>>>
>>>>>>>>     - this of course creates sbd dependency within pacemaker so
>>>>>>>>       that it would make sense to offer a simpler and self-contained
>>>>>>>>       implementation within pacemaker as an alternative
>>>>>>>     I think the watchdog interface is so simple that you don't need a relay
>>>>>>>     for it. The only limit I can imagine is the number of watchdogs available on
>>>>>>>     some specific hardware.
>>>>>>    That is the point ;-)
>>>>>>>>       thus it would be favorable to have the dependency
>>>>>>>>       within a non-compulsory pacemaker-rpm so that
>>>>>>>>       we can offer an alternative that doesn't use sbd
>>>>>>>>       at maybe the cost of being less reliable or one
>>>>>>>>       that owns a hardware-watchdog by itself for systems
>>>>>>>>       where this is still unused.
>>>>>>>>
>>>>>>>>       - e.g. via some kind of plugin (Andrew forgive me - no pils ;-) )
>>>>>>>>       - or via an additional daemon
>>>>>>>>
>>>>>>>>     What did you have in mind?
>>>>>>>>     Maybe it makes sense to synchronize...
>>>>>>>>
>>>>>>>>     Regards,
>>>>>>>>     Klaus
>>>>>>>>
>>>>>>>>>     Best Regards,
>>>>>>>>>     Hideo Yamauchi.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     ----- Original Message -----
>>>>>>>>>>     From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>>>>>>>     To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>>>>>>>     Cc:
>>>>>>>>>>     Date: 2016/10/5, Wed 23:08
>>>>>>>>>>     Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen,
>>>>>>>>>>     cluster decisions are delayed infinitely
>>>>>>>>>>>>>      <renayama19661014 at ybb.ne.jp> wrote on 21.09.2016 at 11:52
>>>>>>>>>>     in message <876439.61305.qm at web200311.mail.ssk.yahoo.co.jp>:
>>>>>>>>>>>      Hi All,
>>>>>>>>>>>
>>>>>>>>>>>      Was the final conclusion given about this problem?
>>>>>>>>>>>      If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
>>>>>>>>>>     As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>>     will reboot the node (unless the watchdog fails).
>>>>>>>>>>
>>>>>>>>>>>      We are interested in this problem, too.
>>>>>>>>>>>
>>>>>>>>>>>      Best Regards,
>>>>>>>>>>>
>>>>>>>>>>>      Hideo Yamauchi.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>> _______________________________________________
>>>>>>>>>>>      Users mailing list: Users at clusterlabs.org
>>>>>>>>>>>     http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>>      Project Home: http://www.clusterlabs.org
>>>>>>>>>>>      Getting started:
>>>>>>    http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>      Bugs: http://bugs.clusterlabs.org
>



