[Pacemaker] [Problem] attrd sometimes does not stop.

renayama19661014 at ybb.ne.jp
Thu Oct 20 21:30:59 EDT 2011


Hi Alan,

Thank you for your comment.

We are also trying to reproduce the problem and will send a report.
However, it has not recurred so far.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2011/10/20, Alan Robertson <alanr at unix.sh> wrote:

> Hi,
> 
> I've seen a very similar problem in a recent release.  In fact, I'm in the process of reproducing it so that it can be properly logged and so on.  When I get the right data for the bug report, I'll attach it to the bug.
> 
> FWIW: I'm pretty sure that the signal was properly received by attrd.  I haven't looked at the attrd code, but my guess is that either it didn't issue the correct function call for exiting from mainloop - or that the mainloop code didn't actually exit.  FWIW - it probably doesn't matter at all what the priority for signal handling is - since attrd consumes nearly no CPU.  Too bad it doesn't log receiving the signal or beginning the process of exiting...
> 
> Another random thought - I suppose attrd could be clobbering some memory which mainloop needs to properly process an exit.  Doesn't seem likely - but neither of the above options seems very likely either.
> 
> 
> ----------------------------
> An historical note on an early bug that had similar symptoms (but affected every process - not just attrd).
> 
> First - what caused such a problem (a very long time ago):
>     There is a window between checking for signals and going to sleep in the poll call
>         during which a signal might be ignored for a while.
> 
>     The glib mainloop code has three entry points called on each iteration of the loop:
>             prepare, check, dispatch.
> 
> There is a poll call which occurs between the prepare and check steps.  If a signal comes in after the prepare call returns, but before the code goes to sleep in the poll system call, it will be ignored until
> the poll system call returns.  It will get caught on the next iteration of the loop.
> 
> The fix was fairly simple - the signal handling code instructs the mainloop infrastructure to call poll with an argument which prevents it from staying asleep longer than a second.
> 
> Then the code processes the signal correctly.
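
The window and the fix described above can be illustrated with a small sketch. This is a toy model in Python, not the heartbeat or glib code: `SignalSource`, `run_once`, `latency` and the `in_window` hook are hypothetical names, the signal is simulated by a plain function call placed between prepare and poll, and the timings are scaled down (400 ms stands in for a long sleep, a 50 ms cap stands in for the one-second cap).

```python
import os
import select
import time


class SignalSource:
    """Toy event source with the prepare/check/dispatch shape."""

    def __init__(self):
        self.pending = False        # set by the (simulated) signal handler
        self.dispatched_at = None   # monotonic time at which dispatch ran

    def handler(self):
        # Stand-in for a real signal handler: it only records the signal.
        self.pending = True

    def prepare(self):
        return self.pending

    def check(self):
        return self.pending

    def dispatch(self):
        self.pending = False
        self.dispatched_at = time.monotonic()


def run_once(source, poller, wanted_timeout_ms, cap_ms=None, in_window=None):
    """One loop iteration: prepare, poll, check, dispatch.

    in_window is a hypothetical hook called after prepare() returns and
    before poll() goes to sleep, i.e. inside the race window.
    """
    ready = source.prepare()
    timeout = 0 if ready else wanted_timeout_ms
    if cap_ms is not None:
        timeout = min(timeout, cap_ms)  # the fix: never sleep longer than the cap
    if in_window is not None:
        in_window()                     # signal lands here, too late for prepare()
    poller.poll(timeout)                # no fd is ready, so this sleeps the full timeout
    if source.check():
        source.dispatch()
        return True
    return False


def latency(cap_ms):
    """Seconds between the simulated signal arriving and its dispatch."""
    r, w = os.pipe()                    # a descriptor that never becomes readable
    poller = select.poll()
    poller.register(r, select.POLLIN)
    src = SignalSource()
    arrived = []

    def arrive():
        arrived.append(time.monotonic())
        src.handler()

    try:
        run_once(src, poller, wanted_timeout_ms=400, cap_ms=cap_ms, in_window=arrive)
    finally:
        os.close(r)
        os.close(w)
    return src.dispatched_at - arrived[0]
```

Under these assumptions, `latency(None)` waits for roughly the whole 400 ms sleep before the signal is dispatched, while `latency(50)` is bounded by the cap. The cap does not close the window; it only limits how long a signal that falls into it can go unnoticed.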
> 
> 
> On 10/17/2011 07:19 PM, renayama19661014 at ybb.ne.jp wrote:
> > Hi,
> > 
> > Stopping attrd sometimes fails.
> > 
> > Step1. Start a cluster with 2 nodes.
> > Step2. Stop the first node (/etc/init.d/heartbeat stop).
> > Step3. Stop the second node after a short time has passed (/etc/init.d/heartbeat stop).
> > 
> > The attrd catches the TERM signal, but does not stop.
> > 
> > (snip)
> > Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to
> > 12238 is not connected
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
> > Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> > Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to
> > crmd failed: reply failed
> > Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
> > /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> > Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations
> > (4123.00us average, 0% utilization) in the last 10min
> > Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
> > channel took 1010 ms (>  100 ms)
> > Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
> > channel took 1010 ms (>  100 ms)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
> > being called (GSource: 0xd28010)
> > Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431583547 should have started at 431583444
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before
> > being called (GSource: 0xd27dd0)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431584254 should have started at 431584151
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
> > being called (GSource: 0xd28010)
> > Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431584254 should have started at 431584151
> > Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on
> > write child took 1010 ms (>  100 ms)
> > Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
> > Heartbeat API channel took 1010 ms (>  100 ms)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before
> > being called (GSource: 0xd27dd0)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431607988 should have started at 431607885
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> > Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
> > being called (GSource: 0xd28010)
> > Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> > started at 431607988 should have started at 431607885
> > (snip)
> > 
> > We are trying to reproduce the problem, but it rarely recurs.
> > 
> > The same problem was reported in the following email.
> > However, that discussion ended without reaching a conclusion.
> > 
> >   * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
> > 
> > The problem occurred with the following combination of versions.
> >   * pacemaker-1.0.11
> >   * resource-agents-3.9.2
> >   * cluster-glue-1.0.7
> >   * heartbeat-3.0.5
> > 
> > I have filed this in Bugzilla.
> >   * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> 
> 
> --     Alan Robertson<alanr at unix.sh>
> 
> "Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce
> 



