[Pacemaker] [Problem] The attrd does not sometimes stop.

Sun Nov 6 21:39:35 UTC 2011

On Thu, Nov 03, 2011 at 01:49:46AM +1100, Andrew Beekhof wrote:
> On Tue, Oct 18, 2011 at 12:19 PM,  <renayama19661014 at ybb.ne.jp> wrote:
> > Hi,
> >
> > We sometimes fail in a stop of attrd.
> >
> > Step1. start a cluster in 2 nodes
> > Step2. stop the first node.(/etc/init.d/heartbeat stop.)
> > Step3. stop the second node after time passed a little.(/etc/init.d/heartbeat
> > stop.)
> >
> > The attrd catches the TERM signal, but does not stop.
> 
> There's no evidence that it actually catches it, only that it is sent.
> I've seen it before but never figured out why it occurs.

I once had it tracked down almost to where it occurs, but then got distracted.
Yes, the signal was delivered.

I *think* it had to do with attrd doing a blocking read,
or looping in some internal message delivery function too often.

I had a quick look at the code again now, to try and remember,
but I'm not sure.

It *may* be that, because
xmlfromIPC(IPC_Channel * ch, int timeout) calls
    msg = msgfromIPC_timeout(ch, MSG_ALLOWINTR, timeout, &ipc_rc);

And MSG_ALLOWINTR will cause msgfromIPC_ll() to restart its wait
whenever it is interrupted:
    IPC_INTR:
        if ( allow_intr){
            goto startwait;

Depending on the frequency of delivered signals, this may cause the goto
startwait loop to never exit, because the timeout always starts again
from the full passed-in timeout.

If only one signal is delivered, it may still take 120 seconds
(MAX_IPC_DELAY from crm.h) to be actually processed, as the signal
handler only raises a flag for the next mainloop iteration.
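
For readers unfamiliar with the pattern, a handler that "only raises a
flag" looks roughly like the sketch below. This is not attrd's actual
code: on_term, shutdown_requested, install_term_handler and
mainloop_should_exit are made-up names, and plain signal() is used only
to keep the example short.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flag; sig_atomic_t is the type that is safe to write
 * from a signal handler. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler does nothing except record that the signal arrived. */
static void on_term(int signo)
{
    (void)signo;
    shutdown_requested = 1;
}

/* Install the handler for SIGTERM; returns true on success. */
static bool install_term_handler(void)
{
    return signal(SIGTERM, on_term) != SIG_ERR;
}

/* The main loop is expected to call this once per iteration.
 * If the loop is stuck in a long blocking call, this check is not
 * reached and the shutdown request is not acted on until it returns. */
static bool mainloop_should_exit(void)
{
    return shutdown_requested != 0;
}
```

The point is that delivery of the signal and acting on it are separate
steps: the flag is set immediately, but nothing happens until the main
loop gets around to checking it.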

If a (non-fatal) signal is delivered every few seconds,
then the goto loop will never time out.
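
The suspected behaviour can be illustrated with a small simulation. This
is only a sketch under stated assumptions: sim_wait is a hypothetical
stand-in for the blocking IPC wait (it is not a heartbeat function),
time is a simulated millisecond counter, and signals arrive at a fixed
period. wait_restarting mimics the "goto startwait" behaviour described
above; wait_with_deadline shows one possible alternative that tracks an
absolute deadline, which is not confirmed to be how the real code should
be fixed.

```c
#include <assert.h>

/* Simulated world: a clock and a periodic, non-fatal signal. */
struct sim {
    long now_ms;            /* simulated current time */
    long signal_period_ms;  /* a signal every N ms; 0 means no signals */
};

enum wait_rc { WAIT_TIMEOUT, WAIT_INTR };

/* Hypothetical blocking wait: returns WAIT_INTR if a signal arrives
 * before timeout_ms has elapsed, otherwise WAIT_TIMEOUT. */
static enum wait_rc sim_wait(struct sim *s, long timeout_ms)
{
    assert(timeout_ms >= 0);
    if (s->signal_period_ms > 0) {
        long next = (s->now_ms / s->signal_period_ms + 1) * s->signal_period_ms;
        if (next - s->now_ms < timeout_ms) {
            s->now_ms = next;
            return WAIT_INTR;
        }
    }
    s->now_ms += timeout_ms;
    return WAIT_TIMEOUT;
}

/* Restart with the full timeout after every interruption.
 * Returns elapsed ms, or -1 if it still has not timed out after
 * max_iter attempts (the cap only keeps the demo finite). */
static long wait_restarting(struct sim *s, long timeout_ms, int max_iter)
{
    long start = s->now_ms;
    for (int i = 0; i < max_iter; i++) {
        if (sim_wait(s, timeout_ms) == WAIT_TIMEOUT)
            return s->now_ms - start;
    }
    return -1;
}

/* Alternative: compute the deadline once and wait only for the
 * remaining time after each interruption. Returns elapsed ms. */
static long wait_with_deadline(struct sim *s, long timeout_ms)
{
    long start = s->now_ms;
    long deadline = start + timeout_ms;
    while (s->now_ms < deadline)
        (void)sim_wait(s, deadline - s->now_ms);
    return s->now_ms - start;
}
```

With a 120000 ms timeout and a signal every 5000 ms, wait_restarting
never reaches its timeout in this model, while wait_with_deadline
returns after exactly 120000 ms.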

Please someone check this for plausibility ;-)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com