[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

Wed Oct 26 08:46:59 UTC 2016

Hi Klaus,
Hi Jan,
Hi All,

Our team discussed the watchdog approach that uses the corosync WD service.

1) The WD service will not be removed from corosync.
2) On pacemaker_remote nodes, it can be used by starting corosync bound to localhost.
3) We need to consider contention for the watchdog device.
4) Because we are also considering the case where sbd is not used, I do not plan, for the moment, to add an interface similar to the corosync API to sbd.

Users would choose between the method using sbd and the method using the WD service.
Having two methods may cause confusion, but the second one has value for users who do not use sbd.
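
Both methods rest on the same basic idea raised earlier in this thread: the observed process (for example crmd) has to feed a watchdog regularly, and a watchdog that is not fed within its timeout expires so that the node can be reset. The following is only a minimal Python sketch of that idea, using the "name and timeout" parameters mentioned in the thread. It is not the prototype patch and not the corosync WD or sbd API; the names NamedWatchdog, expired_names and FakeClock are hypothetical and exist only for this illustration.

```python
import time


class NamedWatchdog:
    """Toy software watchdog identified by a name and a timeout in seconds.

    Hypothetical illustration only; not part of pacemaker, corosync or sbd.
    """

    def __init__(self, name, timeout, clock=time.monotonic):
        self.name = name
        self.timeout = timeout
        self._clock = clock            # injectable so the logic can be tested without sleeping
        self._last_fed = clock()

    def feed(self):
        """Called periodically by the observed process while it is healthy."""
        self._last_fed = self._clock()

    def expired(self):
        """True once no feed() has arrived within the timeout."""
        return self._clock() - self._last_fed > self.timeout


def expired_names(watchdogs):
    """Return the names of all watchdogs that have expired."""
    return [wd.name for wd in watchdogs if wd.expired()]


class FakeClock:
    """Manually advanced clock used only for the demonstration below."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


if __name__ == "__main__":
    clock = FakeClock()
    crmd_wd = NamedWatchdog("crmd", timeout=10, clock=clock)

    clock.now = 5
    crmd_wd.feed()                     # the process is alive and feeds the watchdog
    clock.now = 14
    print(expired_names([crmd_wd]))    # still within the timeout

    clock.now = 16                     # the process is frozen (e.g. SIGSTOP) and stops feeding
    print(expired_names([crmd_wd]))    # the watchdog has expired
```

A frozen process cannot call feed(), so the expiry is detected from outside the process. In a real setup the reaction to an expired watchdog would be a node reset by the hardware or software watchdog device, not a printed list.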

We would like to include the WD-service-based watchdog in Pacemaker.
I intend to prepare an official patch.

What do you think?

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc: 
> Date: 2016/10/20, Thu 19:08
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
> 
> Hi Klaus,
> Hi Jan,
> 
> Thank you for your comments.
> 
> I will wait a little longer for other comments.
> We will discuss this matter next week.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> ----- Original Message -----
>>  From: Jan Friesse <jfriesse at redhat.com>
>>  To: kwenning at redhat.com; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>  Cc: 
>>  Date: 2016/10/20, Thu 15:46
>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>> 
>>> 
>>>   On 10/14/2016 11:21 AM, renayama19661014 at ybb.ne.jp wrote:
>>>>   Hi Klaus,
>>>>   Hi All,
>>>> 
>>>>   I tried prototype of watchdog using WD service.
>>>>     - https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
>>>> 
>>>>   Please comment.
>>>   Thank you Hideo for providing the prototype.
>>>   Added the patch to my build and it seems to
>>>   be working as expected.
>>> 
>>>   A few thoughts triggered by this approach:
>>> 
>>>   - we have to alert the corosync-people as in
>>>      a chat with Jan Friesse he pointed me to the
>>>      fact that for corosync 3.x the wd-service was
>>>      planned to be removed
>> 
>>  Actually I didn't express myself correctly. What I wanted to say was 
>>  "I'm considering idea of removing it", simply because it's disabled in 
>>  downstream.
>> 
>>  BUT keep in mind that removing functionality = ask community to find out 
>>  if there is not somebody actively using it.
>> 
>>  And because there are active users and a future use case, removing wd is 
>>  not an option.
>> 
>> 
>>> 
>>>      especially delicate as the binding is very loose
>>>      so that - as is - it builds against a corosync with
>>>      disabled wd-service without any complaints...
>>> 
>>>   - as of now if you enable wd-service in the
>>>      corosync-build it is on by default and would
>>>      be hogging the watchdog presumably
>>>      (there is obviously a pull request that makes
>>>      it default to off)
>>> 
>>>   - with my thoughts about adding an API to
>>>      sbd previously in the thread I was trying to
>>>      target closer observation of pacemaker_remoted
>>>      as well (remote-nodes don't have corosync
>>>      running)
>>> 
>>>      I guess it would be possible to run corosync
>>>      with a static config as single-node cluster
>>>      bound to localhost for that purpose.
>>> 
>>>      I read the thread about corosync-remote and
>>>      that happening might make the special-handling
>>>      for pacemaker-remote obsolete anyway ...
>>> 
>>>   - to enable the approach to live alongside
>>>      sbd it would be possible to make sbd use
>>>      the corosync-API as well for watchdog purposes
>>>      instead of opening the watchdog directly
>>> 
>>>      This shouldn't be a big deal for sbd used to
>>>      observe a pacemaker-node as cluster-watcher
>>>      (the part of sbd that sends cpg-pings to corosync)
>>>      already builds against corosync.
>>>      The blockdevice-part of sbd being basically
>>>      generic it might be an issue though.
>>> 
>>>   Regards,
>>>   Klaus
>>> 
>>>> 
>>>> 
>>>>   Best Regards,
>>>>   Hideo Yamauchi.
>>>> 
>>>> 
>>>>   ----- Original Message -----
>>>>>   From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>>>>   To: "users at clusterlabs.org" <users at clusterlabs.org>
>>>>>   Cc:
>>>>>   Date: 2016/10/11, Tue 17:58
>>>>>   Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>> 
>>>>>   Hi Klaus,
>>>>> 
>>>>>   Thank you for comment.
>>>>> 
>>>>>   I make the patch which is prototype using WD service.
>>>>> 
>>>>>   Please wait a little.
>>>>> 
>>>>>   Best Regards,
>>>>>   Hideo Yamauchi.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>   ----- Original Message -----
>>>>>>     From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>     To: users at clusterlabs.org
>>>>>>     Cc:
>>>>>>     Date: 2016/10/10, Mon 21:03
>>>>>>     Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>     On 10/07/2016 11:10 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>      Hi All,
>>>>>>> 
>>>>>>>      Our user may not necessarily use sbd.
>>>>>>> 
>>>>>>>      I confirmed that there was a method using WD service of corosync as one
>>>>>>>      method not to use sbd.
>>>>>>>      Pacemaker watches the process of pacemaker by WD service using CMAP and can
>>>>>>>      carry out watchdog.
>>>>>> 
>>>>>>     Have to have a look at that...
>>>>>>     But if we establish some in-between-layer in pacemaker we could have this
>>>>>>     as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>>>>>     a watchdog-device directly, ...
>>>>>> 
>>>>>>> 
>>>>>>>      We can set up a patch of pacemaker.
>>>>>>     Always helpful to discuss/clarify an idea once some code is available ...
>>>>>> 
>>>>>>>      Was the discussion of using WD service over so far?
>>>>>>     Not from my pov. Just a day off ;-)
>>>>>> 
>>>>>>> 
>>>>>>>      Best Regard,
>>>>>>>      Hideo Yamauchi.
>>>>>>> 
>>>>>>> 
>>>>>>>      ----- Original Message -----
>>>>>>>>      From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>>>      To: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>; users at clusterlabs.org
>>>>>>>>      Cc:
>>>>>>>>      Date: 2016/10/7, Fri 17:47
>>>>>>>>      Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>>>      On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>>>>>>>       Klaus Wenninger <kwenning at redhat.com> wrote on 06.10.2016 at 18:03 in
>>>>>>>>>       message <3980cfdd-ebd9-1597-f6bd-a1ca808f7688 at redhat.com>:
>>>>>>>>>>       On 10/05/2016 04:22 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>>>>       Hi All,
>>>>>>>>>>> 
>>>>>>>>>>>>>       If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
>>>>>>>>>>>> 
>>>>>>>>>>>>       As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>>>>       will reboot the node (unless the watchdog fails).
>>>>>>>>>>>       Thank you for comment.
>>>>>>>>>>> 
>>>>>>>>>>>       We examine watchdog of crmd, too.
>>>>>>>>>>>       In addition, I comment after examination advanced.
>>>>>>>>>>       Was thinking of doing a small test implementation going
>>>>>>>>>>       a little in the direction Lars Ellenberg had been pointing out.
>>>>>>>>>>       a couple of thoughts I had so far:
>>>>>>>>>> 
>>>>>>>>>>       - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>>>>>>>>>         an application can use to create a watchdog within sbd
>>>>>>>>>       Why has it to be done within sbd?
>>>>>>>>      Not necessarily, could be spawned out as well into an own project or
>>>>>>>>      something already existent could be taken.
>>>>>>>>      Remember to have added a dbus-interface to
>>>>>>>>      https://sourceforge.net/projects/watchdog/ for a project once.
>>>>>>>>      If you have a suggestion I'm open.
>>>>>>>>      Going off sbd would have the advantage of a smooth start:
>>>>>>>> 
>>>>>>>>      - cluster/pacemaker-watcher are there already and can
>>>>>>>>        be replaced/moved over time
>>>>>>>>      - the lifecycle of the daemon (when started/stopped) is
>>>>>>>>        already something that is in the code and in the people's minds
>>>>>>>>>>       - parameters for the first are a name and a timeout
>>>>>>>>>> 
>>>>>>>>>>       - first use-case would be crmd observation
>>>>>>>>>> 
>>>>>>>>>>       - later on we could think of removing pacemaker dependencies
>>>>>>>>>>         from sbd by moving the actual implementation of
>>>>>>>>>>         pacemaker-watcher and probably cluster-watcher as well
>>>>>>>>>>         into pacemaker - using the new API
>>>>>>>>>> 
>>>>>>>>>>       - this of course creates sbd dependency within pacemaker so
>>>>>>>>>>         that it would make sense to offer a simpler and self-contained
>>>>>>>>>>         implementation within pacemaker as an alternative
>>>>>>>>>       I think the watchdog interface is so simple that you don't need a relay
>>>>>>>>>       for it. The only limit I can imagine is the number of watchdogs available of
>>>>>>>>>       some specific hardware.
>>>>>>>>      That is the point ;-)
>>>>>>>>>>         thus it would be favorable to have the dependency
>>>>>>>>>>         within a non-compulsory pacemaker-rpm so that
>>>>>>>>>>         we can offer an alternative that doesn't use sbd
>>>>>>>>>>         at maybe the cost of being less reliable or one
>>>>>>>>>>         that owns a hardware-watchdog by itself for systems
>>>>>>>>>>         where this is still unused.
>>>>>>>>>> 
>>>>>>>>>>         - e.g. via some kind of plugin (Andrew forgive me - no pils ;-) )
>>>>>>>>>>         - or via an additional daemon
>>>>>>>>>> 
>>>>>>>>>>       What did you have in mind?
>>>>>>>>>>       Maybe it makes sense to synchronize...
>>>>>>>>>> 
>>>>>>>>>>       Regards,
>>>>>>>>>>       Klaus
>>>>>>>>>> 
>>>>>>>>>>>       Best Regards,
>>>>>>>>>>>       Hideo Yamauchi.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>       ----- Original Message -----
>>>>>>>>>>>>       From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
>>>>>>>>>>>>       To: users at clusterlabs.org; renayama19661014 at ybb.ne.jp
>>>>>>>>>>>>       Cc:
>>>>>>>>>>>>       Date: 2016/10/5, Wed 23:08
>>>>>>>>>>>>       Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
>>>>>>>>>>>>       <renayama19661014 at ybb.ne.jp> wrote on 21.09.2016 at 11:52
>>>>>>>>>>>>       in message <876439.61305.qm at web200311.mail.ssk.yahoo.co.jp>:
>>>>>>>>>>>>>        Hi All,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>        Was the final conclusion given about this problem?
>>>>>>>>>>>>>        If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
>>>>>>>>>>>>       As pointed out earlier, maybe crmd should feed a watchdog. Then stopping crmd
>>>>>>>>>>>>       will reboot the node (unless the watchdog fails).
>>>>>>>>>>>> 
>>>>>>>>>>>>>        We are interested in this problem, too.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>        Best Regards,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>        Hideo Yamauchi.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>        _______________________________________________
>>>>>>>>>>>>>        Users mailing list: Users at clusterlabs.org
>>>>>>>>>>>>>        http://clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>>>>        Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>        Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>        Bugs: http://bugs.clusterlabs.org