[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Thu Oct 20 12:08:35 CEST 2016


Hi Klaus,
Hi Jan,

Thank you for your comments.

I will wait a little longer for further comments.
We will discuss this matter next week.

Best Regards,
Hideo Yamauchi.


----- Original Message -----
> From: Jan Friesse <jfriesse at redhat.com>
> To: kwenning at redhat.com; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc: 
> Date: 2016/10/20, Thu 15:46
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
> 
>> 
>>  On 10/14/2016 11:21 AM, renayama19661014 at ybb.ne.jp wrote:
>>>  Hi Klaus,
>>>  Hi All,
>>> 
>>>  I tried a prototype of a watchdog using the WD service.
>>>    - https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
>>> 
>>>  Please comment.
>>  Thank you Hideo for providing the prototype.
>>  Added the patch to my build and it seems to
>>  be working as expected.
>> 
>>  A few thoughts triggered by this approach:
>> 
>>  - we have to alert the corosync-people as in
>>     a chat with Jan Friesse he pointed me to the
>>     fact that for corosync 3.x the wd-service was
>>     planned to be removed
> 
> Actually, I didn't express myself correctly. What I wanted to say was
> "I'm considering the idea of removing it", simply because it's
> disabled downstream.
> 
> BUT keep in mind that removing functionality means first asking the
> community whether somebody is actively using it.
> 
> And because there are active users and a future use case, removing wd is
> not an option.
> 
> 
>> 
>>     especially delicate as the binding is very loose
>>     so that - as is - it builds against a corosync with
>>     disabled wd-service without any complaints...
>> 
>>  - as of now if you enable wd-service in the
>>     corosync-build it is on by default and would
>>     be hogging the watchdog presumably
>>     (there is obviously a pull request that makes
>>     it default to off)
>> 
>>  - with my thoughts about adding an API to
>>     sbd previously in the thread I was trying to
>>     target closer observation of pacemaker_remoted
>>     as well (remote-nodes don't have corosync
>>     running)
>> 
>>     I guess it would be possible to run corosync
>>     with a static config as single-node cluster
>>     bound to localhost for that purpose.
>> 
>>     I read the thread about corosync-remote and
>>     that happening might make the special-handling
>>     for pacemaker-remote obsolete anyway ...
>> 
>>  - to enable the approach to live alongside
>>     sbd it would be possible to make sbd use
>>     the corosync-API as well for watchdog purposes
>>     instead of opening the watchdog directly
>> 
>>     This shouldn't be a big deal for sbd used to
>>     observe a pacemaker-node as cluster-watcher
>>     (the part of sbd that sends cpg-pings to corosync)
>>     already builds against corosync.
>>     The blockdevice-part of sbd being basically
>>     generic it might be an issue though.
>> 
>>  Regards,
>>  Klaus
>> 
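For readers unfamiliar with what "opening the watchdog directly" involves, here is a minimal Python sketch of the Linux watchdog character-device convention: any write to the device feeds it, and writing the character `V` just before closing asks the driver to disarm the timer (the "magic close" feature, where the driver supports it and is not built with nowayout). The class name `WatchdogFeeder` and the `path` parameter are illustrative assumptions and are not taken from sbd, corosync or pacemaker; the path is a parameter so the sketch can be tried against an ordinary file instead of a real device.

```python
import os


class WatchdogFeeder:
    """Illustrative helper (hypothetical name) that feeds a watchdog device."""

    def __init__(self, path="/dev/watchdog"):
        # Opening a real watchdog device arms it: from here on the node
        # resets unless the device keeps being written to in time.
        self._fd = os.open(path, os.O_WRONLY)

    def feed(self):
        # Any write counts as a keepalive for a Linux watchdog device.
        os.write(self._fd, b"\0")

    def close(self, disarm=True):
        if disarm:
            # "Magic close": a 'V' written right before close asks the
            # driver to stop the timer (if the driver supports it).
            os.write(self._fd, b"V")
        os.close(self._fd)
```

A daemon would call `feed()` from its main loop. If the process is frozen, for example by SIGSTOP, the writes stop and the node is reset once the timeout expires. Since normally only one process can hold the device, this is also why a service that opens it by default ends up hogging the watchdog.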
>>> 
>>> 
>>>  Best Regards,
>>>  Hideo Yamauchi.
>>> 
>>> 
>>>  ----- Original Message -----
>>>>  From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>  To: "users at clusterlabs.org" <users at clusterlabs.org>
>>>>  Cc:
>>>>  Date: 2016/10/11, Tue 17:58
>>>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>> 
>>>>  Hi Klaus,
>>>> 
>>>>  Thank you for your comment.
>>>> 
>>>>  I am making a patch, a prototype that uses the WD service.
>>>> 
>>>>  Please wait a little.
>>>> 
>>>>  Best Regards,
>>>>  Hideo Yamauchi.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>  ----- Original Message -----
>>>>>    From: Klaus Wenninger <kwenning at redhat.com>
>>>>>    To: users at clusterlabs.org
>>>>>    Cc:
>>>>>    Date: 2016/10/10, Mon 21:03
>>>>>    Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>    On 10/07/2016 11:10 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>     Hi All,
>>>>>> 
>>>>>>     Our users may not necessarily use sbd.
>>>>>> 
>>>>>>     I confirmed that the WD service of corosync can be used as one
>>>>>>     method that does not require sbd.
>>>>>>     Pacemaker can have the WD service watch the pacemaker processes
>>>>>>     through CMAP and trigger the watchdog.
>>>>> 
>>>>>    Have to have a look at that...
>>>>>    But if we establish some in-between-layer in pacemaker we could have this
>>>>>    as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>>>>    a watchdog-device directly, ...
>>>>> 
>>>>>> 
>>>>>>     We can prepare a patch for pacemaker.
>>>>>    Always helpful to discuss/clarify an idea once some code is available ...
>>>>> 
>>>>>>     Has the discussion about using the WD service already concluded?
>>>>>    Not from my pov. Just a day off ;-)
>>>>> 
>>>>>> 
>>>>>>     Best Regards,
>>>>>>     Hideo Yamauchi.
>>>>>> 
>>>>>> 
>>>>>>     ----- Original Message -----
>>>>>>>     From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>>     To: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>; users at clusterlabs.org
>>>>>>>     Cc:
>>>>>>>     Date: 2016/10/7, Fri 17:47
>>>>>>>     Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>>     On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>>>>>>      Klaus Wenninger <kwenning at redhat.com> wrote on 06.10.2016 at 18:03 in
>>>>>>>>      message <3980cfdd-ebd9-1597-f6bd-a1ca808f7688 at redhat.com>:
>>>>>>>>>      On 10/05/2016 04:22 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>>>      Hi All,
>>>>>>>>>> 
>>>>>>>>>>>>      If a user uses sbd, can the cluster avoid the problem of a SIGSTOP of crmd?
>>>>>>>>>>> 
>>>>>>>>>>>      As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>>>      will reboot the node (unless the watchdog fails).
>>>>>>>>>>      Thank you for your comment.
>>>>>>>>>> 
>>>>>>>>>>      We are examining a watchdog for crmd, too.
>>>>>>>>>>      I will comment again once the examination has progressed.
>>>>>>>>>      Was thinking of doing a small test implementation going
>>>>>>>>>      a little in the direction Lars Ellenberg had been pointing out.
>>>>>>>>>      a couple of thoughts I had so far:
>>>>>>>>> 
>>>>>>>>>      - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>>>>>>>>        an application can use to create a watchdog within sbd
>>>>>>>>      Why has it to be done within sbd?
>>>>>>>     Not necessarily, could be spawned out as well into an own project or
>>>>>>>     something already existent could be taken.
>>>>>>>     Remember to have added a dbus-interface to
>>>>>>>     https://sourceforge.net/projects/watchdog/ for a project once.
>>>>>>>     If you have a suggestion I'm open.
>>>>>>>     Going off sbd would have the advantage of a smooth start:
>>>>>>> 
>>>>>>>     - cluster/pacemaker-watcher are there already and can
>>>>>>>       be replaced/moved over time
>>>>>>>     - the lifecycle of the daemon (when started/stopped) is
>>>>>>>       already something that is in the code and in the people's minds
>>>>>>>>>      - parameters for the first are a name and a timeout
>>>>>>>>> 
>>>>>>>>>      - first use-case would be crmd observation
>>>>>>>>> 
>>>>>>>>>      - later on we could think of removing pacemaker dependencies
>>>>>>>>>        from sbd by moving the actual implementation of
>>>>>>>>>        pacemaker-watcher and probably cluster-watcher as well
>>>>>>>>>        into pacemaker - using the new API
>>>>>>>>> 
>>>>>>>>>      - this of course creates sbd dependency within pacemaker so
>>>>>>>>>        that it would make sense to offer a simpler and self-contained
>>>>>>>>>        implementation within pacemaker as an alternative
>>>>>>>>      I think the watchdog interface is so simple that you don't need a relay
>>>>>>>>      for it. The only limit I can imagine is the number of watchdogs available of
>>>>>>>>      some specific hardware.
>>>>>>>     That is the point ;-)
>>>>>>>>>        thus it would be favorable to have the dependency
>>>>>>>>>        within a non-compulsory pacemaker-rpm so that
>>>>>>>>>        we can offer an alternative that doesn't use sbd
>>>>>>>>>        at maybe the cost of being less reliable or one
>>>>>>>>>        that owns a hardware-watchdog by itself for systems
>>>>>>>>>        where this is still unused.
>>>>>>>>> 
>>>>>>>>>        - e.g. via some kind of plugin (Andrew forgive me - no pils ;-) )
>>>>>>>>>        - or via an additional daemon
>>>>>>>>> 
>>>>>>>>>      What did you have in mind?
>>>>>>>>>      Maybe it makes sense to synchronize...
>>>>>>>>> 
>>>>>>>>>      Regards,
>>>>>>>>>      Klaus
>>>>>>>>> 
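The API idea in the list above (an application creates a watchdog from a name and a timeout, and a supervising daemon reacts when it stops being fed) can be sketched as follows. This is only an illustration in Python under stated assumptions: the names `WatchdogRegistry`, `register`, `feed` and `expired` are hypothetical and do not correspond to an existing sbd, libqb or pacemaker interface, and the clock is injectable purely to make the sketch testable.

```python
import time


class WatchdogRegistry:
    """Hypothetical in-daemon registry of named software watchdogs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._timeouts = {}   # name -> timeout in seconds
        self._last_fed = {}   # name -> time of the last feed

    def register(self, name, timeout):
        # Creating a watchdog takes just a name and a timeout.
        if timeout <= 0:
            raise ValueError("timeout must be positive")
        self._timeouts[name] = timeout
        self._last_fed[name] = self._clock()

    def feed(self, name):
        # Called periodically by the observed application (e.g. crmd).
        if name not in self._timeouts:
            raise KeyError(name)
        self._last_fed[name] = self._clock()

    def expired(self):
        # Names whose owner has not fed them within their timeout.
        now = self._clock()
        return sorted(
            name for name, timeout in self._timeouts.items()
            if now - self._last_fed[name] > timeout
        )
```

In such a design the supervising daemon would keep feeding the hardware watchdog only while `expired()` returns an empty list, so a frozen client eventually leads to a node reset.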
>>>>>>>>>>      Best Regards,
>>>>>>>>>>      Hideo Yamauchi.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>      ----- Original Message -----
>>>>>>>>>>>      From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>>>>>>>>      To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>>>>>>>>      Cc:
>>>>>>>>>>>      Date: 2016/10/5, Wed 23:08
>>>>>>>>>>>      Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>>>>>>      <renayama19661014 at ybb.ne.jp> wrote on 21.09.2016 at 11:52
>>>>>>>>>>>      in message <876439.61305.qm at web200311.mail.ssk.yahoo.co.jp>:
>>>>>>>>>>>>       Hi All,
>>>>>>>>>>>> 
>>>>>>>>>>>>       Has a final conclusion been reached about this problem?
>>>>>>>>>>>>       If a user uses sbd, can the cluster avoid the problem of a SIGSTOP of crmd?
>>>>>>>>>>>      As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>>>      will reboot the node (unless the watchdog fails).
>>>>>>>>>>> 
>>>>>>>>>>>>       We are interested in this problem, too.
>>>>>>>>>>>> 
>>>>>>>>>>>>       Best Regards,
>>>>>>>>>>>> 
>>>>>>>>>>>>       Hideo Yamauchi.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>       _______________________________________________
>>>>>>>>>>>>       Users mailing list: Users at clusterlabs.org
>>>>>>>>>>>>       http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>>>       Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>       Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>       Bugs: http://bugs.clusterlabs.org


