[Pacemaker] [Problem or Enhancement] When attrd reboots, a fail count is initialized.

renayama19661014 at ybb.ne.jp renayama19661014 at ybb.ne.jp
Mon Oct 4 10:32:53 EDT 2010


Hi Andrew,

Thank you for your comment.

> > Is such a change to attrd and crmd difficult?
> 
> I don't think so.
> But it's not a huge priority, because I've never heard of attrd actually crashing.
> 
> So while I agree that it's theoretically a problem, in practice no one
> is going to hit this in production.
> Even if they were unlucky enough to see it, at worst the resource is
> able to run on the node again - which doesn't seem that bad for an HA
> cluster :-)


All right.

For now, I will register this problem in Bugzilla as an enhancement request.
I will wait a little while to see whether other users offer their opinions.
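
By the way, to make the idea discussed below a little more concrete (crmd keeping a
hashtable of the fail-count values it has set, and re-sending them when it detects
that attrd went away and came back), a rough sketch could look like the following.
This is only a sketch of the concept; the failcount_* functions and
attrd_update_failcount() are invented names for illustration, not the real crmd code.

#include <glib.h>
#include <stdio.h>

/* Hypothetical stand-in for the real crmd -> attrd update call. */
static void
attrd_update_failcount(const char *rsc, const char *value)
{
    printf("attrd <- fail-count-%s = %s\n", rsc, value);
}

/* crmd-side cache of the fail-count values it has asked attrd to store. */
static GHashTable *failcount_cache = NULL;

static void
failcount_cache_init(void)
{
    failcount_cache = g_hash_table_new_full(g_str_hash, g_str_equal,
                                            g_free, g_free);
}

/* Remember the value every time crmd tells attrd to change a fail-count. */
static void
failcount_record(const char *rsc, const char *value)
{
    g_hash_table_insert(failcount_cache, g_strdup(rsc), g_strdup(value));
    attrd_update_failcount(rsc, value);
}

static void
resend_one(gpointer key, gpointer value, gpointer user_data)
{
    attrd_update_failcount(key, value);
}

/* Called when crmd notices that its attrd connection was re-established. */
static void
failcount_resend_all(void)
{
    g_hash_table_foreach(failcount_cache, resend_one, NULL);
}

int
main(void)
{
    failcount_cache_init();
    failcount_record("UmIPaddr", "2");
    /* ... attrd is killed and respawned, crmd detects the reconnection ... */
    failcount_resend_all();
    return 0;
}

As Andrew notes below, crmd could fill this cache by reading the current values
from attrd at its own startup, so the cache only has to do something when attrd
alone goes down.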

Thanks,
Hideo Yamauchi.

--- Andrew Beekhof <andrew at beekhof.net> wrote:

> On Fri, Oct 1, 2010 at 4:00 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> > Hi Andrew,
> >
> > Thank you for your comment.
> >
> >> During crmd startup, one could read all the values from attrd into the
> >> hashtable.
> >> So the hashtable would only do something if only attrd went down.
> >
> > If attrd communicates with crmd at startup and reads the data from the hash table,
> > the problem seems solvable.
> >
> > Is such a change to attrd and crmd difficult?
> 
> I don't think so.
> But it's not a huge priority, because I've never heard of attrd actually crashing.
> 
> So while I agree that it's theoretically a problem, in practice no one
> is going to hit this in production.
> Even if they were unlucky enough to see it, at worst the resource is
> able to run on the node again - which doesn't seem that bad for an HA
> cluster :-)
> 
> >
> >
> >> I mean: did you see this behavior in a production system, or only
> >> during testing when you manually killed attrd?
> >
> > We kill the process manually as one of our tests of process failures.
> > Our users care very much about the behavior when a process fails.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- Andrew Beekhof <andrew at beekhof.net> wrote:
> >
> >> On Wed, Sep 29, 2010 at 3:59 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> >> > Hi Andrew,
> >> >
> >> > Thank you for your comment.
> >> >
> >> >> The problem here is that attrd is supposed to be the authoritative
> >> >> source for this sort of data.
> >> >
> >> > Yes. I understand.
> >> >
> >> >> Additionally, you don't always want attrd reading from the status
> >> >> section - like after the cluster restarts.
> >> >
> >> > The problem also seems solvable if attrd retrieves the status section from the
> >> > cib after it has restarted.
> >> > That is what I meant by "method 2":
> >> >> > method 2) When attrd starts, it communicates with the cib and receives the fail-count.
> >> >
> >> >> For failcount, the crmd could keep a hashtable of the current values
> >> >> which it could re-send to attrd if it detects a disconnection.
> >> >> But that might not be a generic-enough solution.
> >> >
> >> > If crmd can maintain it in a hash table, that may be a good idea.
> >> > However, I have a feeling that the same problem occurs when crmd itself fails and
> >> > is restarted.
> >>
> >> During crmd startup, one could read all the values from attrd into the
> >> hashtable.
> >> So the hashtable would only do something if only attrd went down.
> >>
> >> >
> >> >> The chance that attrd dies _and_ there were relevant values for
> >> >> fail-count is pretty remote though... is this a real problem you've
> >> >> experienced or a theoretical one?
> >> >
> >> > I did not understand your meaning well.
> >> > Do you mean that a fail-count for attrd exists on the other node?
> >>
> >> I mean: did you see this behavior in a production system, or only
> >> during testing when you manually killed attrd?
> >>
> >> >
> >> > Best Regards,
> >> > Hideo Yamauchi.
> >> >
> >> > --- Andrew Beekhof <andrew at beekhof.net> wrote:
> >> >
> >> >> On Mon, Sep 27, 2010 at 7:26 AM, <renayama19661014 at ybb.ne.jp> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > When I investigated another problem, I discovered this phenomenon.
> >> >> > If attrd fails but is not restarted, the problem does not occur.
> >> >> >
> >> >> > Step 1) After startup, cause a monitor error on UmIPaddr twice.
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: UMgroup01
> >> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
> >> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >    UmIPaddr: migration-threshold=10 fail-count=2
> >> >> >
> >> >> > Step 2) Kill attrd; attrd is restarted.
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: UMgroup01
> >> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
> >> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >    UmIPaddr: migration-threshold=10 fail-count=2
> >> >> >
> >> >> > Step 3) Cause a monitor error on UmIPaddr again.
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: UMgroup01
> >> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
> >> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >    UmIPaddr: migration-threshold=10 fail-count=1 -----> the fail-count has started over from the beginning.
> >> >> >
> >> >> > The problem is that attrd loses the fail-count when it restarts (its hash table is lost).
> >> >> > It is a serious problem that the failure count is reset.
> >> >> >
> >> >> > I think the following methods are possible.
> >> >> >
> >> >> > method 1) attrd maintains the fail-count in a file under /var/run and refers to it after a restart.
> >> >> >
> >> >> > method 2) When attrd starts, it communicates with the cib and receives the fail-count.
> >> >> >
> >> >> > Is there a better method?
> >> >> >
> >> >> > Please think about the solution of this problem.
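
To make method 1 above a little more concrete: attrd could write its fail-count
entries to a small state file and read them back when it restarts. The following
is only a rough sketch; the file path, format, and function names are invented for
illustration and are not actual attrd code. (Method 2 would instead re-read the
values that already exist in the cib status section, for example what
"cibadmin -Q -o status" shows.)

#include <stdio.h>

/* Hypothetical state file; a real implementation would need locking,
 * an atomic rewrite, and cleanup on a clean shutdown. */
#define FAILCOUNT_STATE_FILE "/var/run/attrd-failcounts"

/* Append one "resource value" pair per line whenever a fail-count changes. */
static void
failcount_save(const char *rsc, const char *value)
{
    FILE *fp = fopen(FAILCOUNT_STATE_FILE, "a");

    if (fp != NULL) {
        fprintf(fp, "%s %s\n", rsc, value);
        fclose(fp);
    }
}

/* On attrd startup, read the file back and rebuild the in-memory table
 * through the supplied callback; if the callback inserts into a hash
 * table, later lines simply overwrite earlier ones. */
static void
failcount_load(void (*restore)(const char *rsc, const char *value))
{
    char rsc[256], value[64];
    FILE *fp = fopen(FAILCOUNT_STATE_FILE, "r");

    if (fp == NULL) {
        return;  /* nothing has been saved yet */
    }
    while (fscanf(fp, "%255s %63s", rsc, value) == 2) {
        restore(rsc, value);
    }
    fclose(fp);
}

static void
restore_print(const char *rsc, const char *value)
{
    printf("restored fail-count-%s = %s\n", rsc, value);
}

int
main(void)
{
    failcount_save("UmIPaddr", "2");
    failcount_load(restore_print);  /* what a freshly restarted attrd would do */
    return 0;
}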
> >> >>
> >> >> Hmmmm... a tricky one.
> >> >>
> >> >> The problem here is that attrd is supposed to be the authoritative
> >> >> source for this sort of data.
> >> >> Additionally, you don't always want attrd reading from the status
> >> >> section - like after the cluster restarts.
> >> >>
> >> >> For failcount, the crmd could keep a hashtable of the current values
> >> >> which it could re-send to attrd if it detects a disconnection.
> >> >> But that might not be a generic-enough solution.
> >> >>
> >> >> The chance that attrd dies _and_ there were relevant values for
> >> >> fail-count is pretty remote though... is this a real problem you've
> >> >> experienced or a theoretical one?
> >> >>
> >>
> >
> >
> 
=== The remainder of the message has been omitted ===




