[Pacemaker] [Problem or Enhancement] When attrd reboots, the fail count is reset.

Andrew Beekhof andrew at beekhof.net
Fri Oct 1 07:13:30 EDT 2010


On Fri, Oct 1, 2010 at 4:00 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi Andrew,
>
> Thank you for comment.
>
>> During crmd startup, one could read all the values from attrd into the
>> hashtable.
>> So the hashtable would only do something if only attrd went down.
>
> If attrd and crmd communicate at startup and the hash table data is read back, the problem
> seems solvable.
>
> Would this change to attrd and crmd be difficult?

I don't think so.
But it's not a huge priority, because I've never heard of attrd actually crashing.

So while I agree that it's theoretically a problem, in practice no one
is going to hit this in production.
Even if someone were unlucky enough to see it, at worst the resource is
able to run on the node again - which doesn't seem that bad for an HA
cluster :-)
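
A minimal sketch of the crmd-side cache idea mentioned above, assuming glib's GHashTable (which Pacemaker already uses); the function names and attrd_send_update() are hypothetical placeholders, not existing crmd/attrd API:

#include <glib.h>

/* Hypothetical crmd-side cache of fail-count values, keyed by
 * "attribute-name/node".  The idea: populate it at crmd startup by
 * reading the current values back from attrd, update it alongside every
 * value crmd sends to attrd, and replay it into attrd whenever crmd
 * notices the attrd connection went away and came back. */
static GHashTable *failcount_cache = NULL;

/* Placeholder for whatever IPC call would push one value to attrd;
 * not a real Pacemaker API - here it only logs what would be sent. */
static void attrd_send_update(const char *name, const char *node,
                              const char *value)
{
    g_print("would tell attrd: %s on %s = %s\n", name, node, value);
}

static void failcount_cache_init(void)
{
    failcount_cache = g_hash_table_new_full(g_str_hash, g_str_equal,
                                            g_free, g_free);
}

/* Remember the value we last told attrd about. */
static void failcount_cache_store(const char *name, const char *node,
                                  const char *value)
{
    g_hash_table_replace(failcount_cache,
                         g_strdup_printf("%s/%s", name, node),
                         g_strdup(value));
}

/* On reconnection to a restarted attrd, replay everything we know. */
static void replay_one(gpointer key, gpointer value, gpointer unused)
{
    char **parts = g_strsplit((const char *) key, "/", 2);
    attrd_send_update(parts[0], parts[1], (const char *) value);
    g_strfreev(parts);
}

static void failcount_cache_replay(void)
{
    g_hash_table_foreach(failcount_cache, replay_one, NULL);
}

int main(void)
{
    failcount_cache_init();
    failcount_cache_store("fail-count-UmIPaddr", "srv01", "2");
    failcount_cache_replay();   /* pretend attrd just came back */
    return 0;
}

In the real daemon, crmd would call something like failcount_cache_store() alongside every update it sends to attrd, and failcount_cache_replay() from whatever callback notices the attrd connection has been re-established; the cache itself would be primed at crmd startup by reading the current values back from attrd, so attrd stays the authoritative source.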

>
>
>> I mean: did you see this behavior in a production system, or only
>> during testing when you manually killed attrd?
>
> We kill processes manually as one of our tests of process failure.
> Our users care very much about how the cluster behaves after a process failure.
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof <andrew at beekhof.net> wrote:
>
>> On Wed, Sep 29, 2010 at 3:59 AM,  <renayama19661014 at ybb.ne.jp> wrote:
>> > Hi Andrew,
>> >
>> > Thank you for comment.
>> >
>> >> The problem here is that attrd is supposed to be the authoritative
>> >> source for this sort of data.
>> >
>> > Yes. I understand.
>> >
>> >> Additionally, you don't always want attrd reading from the status
>> >> section - like after the cluster restarts.
>> >
>> > The problem also seems solvable by retrieving the status section from the cib after attrd
>> > reboots.
>> > That is what my suggested "method 2" means:
>> >> > method 2) When attrd starts, it communicates with the cib and receives the fail-count.
>> >
>> >> For failcount, the crmd could keep a hashtable of the current values
>> >> which it could re-send to attrd if it detects a disconnection.
>> >> But that might not be a generic-enough solution.
>> >
>> > If crmd can maintain it in a hash table, that may be a good idea.
>> > However, I have a feeling the same problem occurs when crmd itself fails and is rebooted.
>>
>> During crmd startup, one could read all the values from attrd into the
>> hashtable.
>> So the hashtable would only do something if only attrd went down.
>>
>> >
>> >> The chance that attrd dies _and_ there were relevant values for
>> >> fail-count is pretty remote though... is this a real problem you've
>> >> experienced or a theoretical one?
>> >
>> > I did not understand your meaning well.
>> > Do you mean that attrd's fail-count exists on the other node?
>>
>> I mean: did you see this behavior in a production system, or only
>> during testing when you manually killed attrd?
>>
>> >
>> > Best Regards,
>> > Hideo Yamauchi.
>> >
>> > --- Andrew Beekhof <andrew at beekhof.net> wrote:
>> >
>> >> On Mon, Sep 27, 2010 at 7:26 AM, <renayama19661014 at ybb.ne.jp> wrote:
>> >> > Hi,
>> >> >
>> >> > When I investigated another problem, I discovered this phenomenon.
>> >> > If the attrd process fails and does not restart, the problem does not occur.
>> >> >
>> >> > Step 1) After startup, cause a monitor error in UmIPaddr twice.
>> >> >
>> >> > Online: [ srv01 srv02 ]
>> >> >
>> >> >  Resource Group: UMgroup01
>> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
>> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
>> >> >
>> >> > Migration summary:
>> >> > * Node srv02:
>> >> > * Node srv01:
>> >> >    UmIPaddr: migration-threshold=10 fail-count=2
>> >> >
>> >> > Step 2) Kill attrd, and attrd reboots.
>> >> >
>> >> > Online: [ srv01 srv02 ]
>> >> >
>> >> >  Resource Group: UMgroup01
>> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
>> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
>> >> >
>> >> > Migration summary:
>> >> > * Node srv02:
>> >> > * Node srv01:
>> >> >    UmIPaddr: migration-threshold=10 fail-count=2
>> >> >
>> >> > Step 3) Cause a monitor error in UmIPaddr.
>> >> >
>> >> > Online: [ srv01 srv02 ]
>> >> >
>> >> >  Resource Group: UMgroup01
>> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
>> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
>> >> >
>> >> > Migration summary:
>> >> > * Node srv02:
>> >> > * Node srv01:
>> >> >    UmIPaddr: migration-threshold=10 fail-count=1 -----> The fail-count has started over from the beginning.
>> >> >
>> >> > The problem is that attrd loses the fail-count when it reboots (its hash table is lost).
>> >> > It is a serious problem that the failure count is reset.
>> >> >
>> >> > I can think of the following methods.
>> >> >
>> >> > method 1) attrd maintains the fail-count in a file under "/var/run" and refers to it on restart.
>> >> >
>> >> > method 2) When attrd starts, it communicates with the cib and receives the fail-count (a sketch follows below).
>> >> >
>> >> > Is there a better method?
>> >> >
>> >> > Please think about the solution of this problem.
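
As an illustration of what "method 2" would need to read back, here is a minimal standalone sketch (not actual attrd code) that pulls the current fail-count-* attributes out of the CIB status section. It assumes the output of "cibadmin -Q -o status" and simply walks the XML for nvpair entries whose name starts with "fail-count-":

/* Compile with: gcc -o failcount_scan failcount_scan.c $(xml2-config --cflags --libs) */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Recursively look for <nvpair name="fail-count-..." value="..."/> entries
 * anywhere under the status section and print them. */
static void scan(xmlNode *node)
{
    for (xmlNode *cur = node; cur != NULL; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE
            && xmlStrcmp(cur->name, (const xmlChar *) "nvpair") == 0) {
            xmlChar *name = xmlGetProp(cur, (const xmlChar *) "name");
            xmlChar *value = xmlGetProp(cur, (const xmlChar *) "value");
            if (name && strncmp((const char *) name, "fail-count-", 11) == 0) {
                printf("%s = %s\n", (const char *) name,
                       value ? (const char *) value : "");
            }
            xmlFree(name);
            xmlFree(value);
        }
        scan(cur->children);
    }
}

int main(void)
{
    /* "cibadmin -Q -o status" dumps the live status section as XML. */
    FILE *pipe = popen("cibadmin -Q -o status", "r");
    if (pipe == NULL) {
        perror("popen");
        return 1;
    }

    xmlDocPtr doc = xmlReadFd(fileno(pipe), NULL, NULL, 0);
    pclose(pipe);
    if (doc == NULL) {
        fprintf(stderr, "could not parse CIB status section\n");
        return 1;
    }

    scan(xmlDocGetRootElement(doc));
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}

A restarted attrd would do the equivalent internally (via the cib API rather than popen), absorbing these values back into its hash table before answering any queries.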
>> >>
>> >> Hmmmm... a tricky one.
>> >>
>> >> The problem here is that attrd is supposed to be the authoritative
>> >> source for this sort of data.
>> >> Additionally, you don't always want attrd reading from the status
>> >> section - like after the cluster restarts.
>> >>
>> >> For failcount, the crmd could keep a hashtable of the current values
>> >> which it could re-send to attrd if it detects a disconnection.
>> >> But that might not be a generic-enough solution.
>> >>
>> >> The chance that attrd dies _and_ there were relevant values for
>> >> fail-count is pretty remote though... is this a real problem you've
>> >> experienced or a theoretical one?
>> >>
>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>



