[Pacemaker] [Problem] The attrd does not sometimes stop.

Wed Jan 11 20:54:03 EST 2012

Hi Lars,
Hi Dejan,

I captured an ltrace file when the problem occurred, and I have attached it.

I am continuing the investigation with gdb.

If you have any suggestions, please let me know.

Best Regards,
Hideo Yamauchi.

--- On Tue, 2012/1/10, renayama19661014 at ybb.ne.jp <renayama19661014 at ybb.ne.jp> wrote:

> Hi Lars,
> 
> I am attaching the strace output captured when the problem recurred at the end of last year.
> To confirm, I used cluster-glue with your patch applied.
> 
> The trace was taken from attrd with strace -p, right before I stopped Heartbeat.
> 
> In the end attrd did catch SIGTERM, but it did not stop.
> attrd stopped only afterwards, when I sent SIGKILL.
> 
>  * From now on I will also collect further information, such as ltrace output.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> --- On Thu, 2012/1/5, renayama19661014 at ybb.ne.jp <renayama19661014 at ybb.ne.jp> wrote:
> 
> > Hi Lars,
> > 
> > > If you are able to reproduce,
> > > you could try to find out what exactly attrd is doing.
> > > 
> > > various ways to try to do that:
> > > cat /proc/<pid-of-attrd>/stack   # if your platform supports that
> > > strace it,
> > > ltrace it,
> > > attach with gdb and provide a stack trace, or even start to single step it,
> > > cause attrd to core dump, and analyse the core.
> > 
> > All right.
> > I will investigate the cause further.
> > 
> > Please give me a little more time for the investigation.
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > --- On Fri, 2011/12/30, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> > 
> > > On Thu, Dec 22, 2011 at 09:54:47AM +0900, renayama19661014 at ybb.ne.jp wrote:
> > > > Hi Dejan,
> > > > Hi Lars,
> > > > 
> > > > In our environment, the problem recurred even with Lars's patch applied.
> > > > After the problem occurred, I sent a TERM signal, but attrd does not seem to
> > > > receive TERM at all.
> > > 
> > > If you are able to reproduce,
> > > you could try to find out what exactly attrd is doing.
> > > 
> > > various ways to try to do that:
> > > cat /proc/<pid-of-attrd>/stack   # if your platform supports that
> > > strace it,
> > > ltrace it,
> > > attach with gdb and provide a stack trace, or even start to single step it,
> > > cause attrd to core dump, and analyse the core.
> > > 
> > > > The patch needs to be reconsidered in order to solve the problem.
> > > > 
> > > > 
> > > > Best Regards,
> > > > Hideo Yamauchi.
> > > > 
> > > > 
> > > > --- On Tue, 2011/11/15, renayama19661014 at ybb.ne.jp <renayama19661014 at ybb.ne.jp> wrote:
> > > > 
> > > > > Hi Dejan,
> > > > > Hi Lars,
> > > > > 
> > > > > I understand.
> > > > > I will test the patch in our environment.
> > > > > 
> > > > > To Alan: Will you try the patch as well?
> > > > > 
> > > > > Best Regards,
> > > > > Hideo Yamauchi.
> > > > > 
> > > > > --- On Tue, 2011/11/15, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > On Mon, Nov 14, 2011 at 01:17:37PM +0100, Lars Ellenberg wrote:
> > > > > > > On Mon, Nov 14, 2011 at 11:58:09AM +1100, Andrew Beekhof wrote:
> > > > > > > > On Mon, Nov 7, 2011 at 8:39 AM, Lars Ellenberg
> > > > > > > > <lars.ellenberg at linbit.com> wrote:
> > > > > > > > > On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> > > > > > > > >> On Tue, Oct 18, 2011 at 12:19 PM,  <renayama19661014 at ybb.ne.jp> wrote:
> > > > > > > > >> > Hi,
> > > > > > > > >> >
> > > > > > > > > >> > attrd sometimes fails to stop.
> > > > > > > > > >> >
> > > > > > > > > >> > Step 1. Start a cluster with 2 nodes.
> > > > > > > > > >> > Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> > > > > > > > > >> > Step 3. Stop the second node a short while later (/etc/init.d/heartbeat stop).
> > > > > > > > > >> >
> > > > > > > > > >> > attrd catches the TERM signal, but does not stop.
> > > > > > > > >>
> > > > > > > > >> There's no evidence that it actually catches it, only that it is sent.
> > > > > > > > >> I've seen it before but never figured out why it occurs.
> > > > > > > > >
> > > > > > > > > I had it once tracked down almost to where it occurs, but then got distracted.
> > > > > > > > > Yes the signal was delivered.
> > > > > > > > >
> > > > > > > > > I *think* it had to do with attrd doing a blocking read,
> > > > > > > > > or looping in some internal message delivery function too often.
> > > > > > > > >
> > > > > > > > > I had a quick look at the code again now, to try and remember,
> > > > > > > > > but I'm not sure.
> > > > > > > > >
> > > > > > > > > > It *may* be that, because
> > > > > > > > > xmlfromIPC(IPC_Channel * ch, int timeout) calls
> > > > > > > > >    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);
> > > > > > > > >
> > > > > > > > > And MSG_ALLOWINTR will cause msgfromIPC_ll() to
> > > > > > > > >        IPC_INTR:
> > > > > > > > >                if ( allow_intr){
> > > > > > > > >                        goto startwait;
> > > > > > > > >
> > > > > > > > > > Depending on the frequency of delivered signals, it may cause this goto
> > > > > > > > > startwait loop to never exit, because the timeout always starts again
> > > > > > > > > from the full passed in timeout.
> > > > > > > > >
> > > > > > > > > > If only one signal is delivered, it may still take 120 seconds
> > > > > > > > > (MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
> > > > > > > > > handler only raises a flag for the next mainloop iteration.
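The flag-only handler described above can be sketched as follows. This is a minimal illustration in plain POSIX C, not the actual attrd or cluster-glue code; the names shutdown_requested, on_term, install_term_handler and shutdown_was_requested are hypothetical. It shows why a caught signal has no visible effect until control returns to the main loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical flag. volatile sig_atomic_t can be written safely from a
 * signal handler and read from normal code. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler does no real work: it only records that SIGTERM arrived. */
static void on_term(int signo)
{
    (void)signo;
    shutdown_requested = 1;
}

/* Install the handler for SIGTERM. Returns 0 on success, -1 on error. */
int install_term_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;   /* no SA_RESTART: blocking calls may fail with EINTR */
    return sigaction(SIGTERM, &sa, NULL);
}

/* Meant to be polled once per main loop iteration. */
int shutdown_was_requested(void)
{
    return shutdown_requested != 0;
}
```

For example, after install_term_handler() succeeds and the process executes raise(SIGTERM), assert(shutdown_was_requested()) holds. If the main loop is stuck inside a long blocking wait, though, nothing ever looks at the flag, which would match the behaviour reported in this thread.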
> > > > > > > > >
> > > > > > > > > If a (non-fatal) signal is delivered every few seconds,
> > > > > > > > > then the goto loop will never timeout.
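The arithmetic behind this hypothesis can be checked with a small simulation. The sketch below uses simulated time only (no real signals, no sleeping) and is not the msgfromIPC_ll() code; the function name and the 5-second signal period are assumptions chosen for illustration, and only the 120-second timeout comes from the text above.

```c
#include <assert.h>

/*
 * Simulated wait that restarts from the FULL timeout whenever it is
 * interrupted. Signals are assumed to arrive at every multiple of
 * signal_period_ms. Returns the simulated time (ms) at which the wait
 * gives up with a timeout, or -1 if it is still waiting at horizon_ms.
 */
long wait_restarting_full_timeout(long timeout_ms, long signal_period_ms,
                                  long horizon_ms)
{
    long now = 0;

    assert(timeout_ms > 0 && signal_period_ms > 0);
    while (now < horizon_ms) {
        /* "startwait": the deadline is recomputed from the full timeout */
        long deadline = now + timeout_ms;
        /* next periodic signal strictly after 'now' */
        long next_signal = (now / signal_period_ms + 1) * signal_period_ms;

        if (next_signal >= deadline)
            return deadline;   /* the timeout expires before any signal */
        now = next_signal;     /* interrupted: jump back to startwait */
    }
    return -1;
}
```

With a 120-second timeout and a signal every 5 seconds, wait_restarting_full_timeout(120000, 5000, 3600000) returns -1: after a simulated hour the wait still has not timed out. If signals arrive less often than the timeout, for example every 200 seconds, the same call returns 120000.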
> > > > > > > > >
> > > > > > > > > Please someone check this for plausibility ;-)
> > > > > > > > 
> > > > > > > > Most plausible explanation I've heard so far... still odd that only
> > > > > > > > attrd is affected.
> > > > > > > > So what do we do about it?
> > > > > > > 
> > > > > > > Reproduce, and confirm that this is what people are seeing.
> > > > > > > 
> > > > > > > Make attrd non-blocking?
> > > > > > > 
> > > > > > > Fix the ipc layer to not restart the full timeout,
> > > > > > > but only the remaining partial time?
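The second option, waiting only for the remaining time after an interruption, could look roughly like this. It is a generic POSIX sketch built on poll() and clock_gettime(CLOCK_MONOTONIC), not the cluster-glue patch discussed in this thread (its contents are not shown here); now_ms, wait_readable_deadline and demo_pipe are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in milliseconds; unaffected by wall-clock changes. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait until fd is readable or until timeout_ms has elapsed IN TOTAL.
 * Returns 1 if fd has an event, 0 on timeout, -1 on an error other
 * than EINTR.
 */
int wait_readable_deadline(int fd, int timeout_ms)
{
    long long deadline;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;   /* computed once, never reset */
    for (;;) {
        long long remaining = deadline - now_ms();
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int rc;

        if (remaining <= 0)
            return 0;                   /* total budget used up */
        rc = poll(&pfd, 1, (int)remaining);
        if (rc >= 0)
            return rc;                  /* 1 = event on fd, 0 = timeout */
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: loop and wait only for what is left. */
    }
}

/* Usage example: an empty pipe times out, a pipe with data does not.
 * Returns 0 if both observations hold, -1 otherwise. */
int demo_pipe(void)
{
    int fds[2];
    int ok;

    if (pipe(fds) != 0)
        return -1;
    ok = (wait_readable_deadline(fds[0], 50) == 0);
    if (write(fds[1], "x", 1) != 1)
        ok = 0;
    ok = ok && (wait_readable_deadline(fds[0], 50) == 1);
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

Because the deadline is fixed before the loop, frequent signals can only shorten each individual poll() call; they cannot extend the total wait beyond timeout_ms. A caller that must react to the signal immediately could return on EINTR instead of looping.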
> > > > > > 
> > > > > > Lars and I made a quick patch for cluster-glue (attached).
> > > > > > Hideo-san, is there a way for you to verify if it helps? The
> > > > > > patch is not perfect and under unfavourable circumstances it may
> > > > > > still take a long time for the caller to exit, but it'd be good
> > > > > > to know if this is the right spot.
> > > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Dejan
> > > > > > 
> > > > > > > -- 
> > > > > > > : Lars Ellenberg
> > > > > > > : LINBIT | Your Way to High Availability
> > > > > > > : DRBD/HA support and consulting http://www.linbit.com
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > > > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > > > > > 
> > > > > > > Project Home: http://www.clusterlabs.org
> > > > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > > > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > > > > >
> > > > > 
> > > > 
> > > > _______________________________________________
> > > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > > 
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org
> > > 
> > > -- 
> > > : Lars Ellenberg
> > > : LINBIT | Your Way to High Availability
> > > : DRBD/HA support and consulting http://www.linbit.com
> > > 
> > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > > 
> > 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ltrace.20120112-patch
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120112/ad43b48e/attachment-0003.ksh>