[Pacemaker] [Problem] The attrd sometimes does not stop.

Alan Robertson alanr at unix.sh
Wed Oct 19 16:46:46 EDT 2011


Hi,

I've seen a very similar problem in a recent release.  In fact, I'm in 
the process of reproducing it so that it can be properly logged and so 
on.  When I get the right data for the bug report, I'll attach it to the 
bug.

FWIW: I'm pretty sure that the signal was properly received by attrd.  I 
haven't looked at the attrd code, but my guess is that either it didn't 
issue the correct function call for exiting from mainloop - or that the 
mainloop code didn't actually exit.  It probably doesn't matter at all 
what the priority for signal handling is, since attrd consumes almost 
no CPU.  Too bad it doesn't log receiving the signal or beginning the 
process of exiting...
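
To make the kind of logging I mean concrete, here is a rough sketch in 
Python (not the attrd code, which I haven't read).  The class and method 
names are made up for illustration.  The handler only records the signal, 
and the loop logs each stage: signal noticed, quit requested, loop exited.  
With those three lines in the log, it would be obvious which stage a 
hung process never reached.

```python
import logging
import signal

log = logging.getLogger("shutdown-sketch")  # hypothetical logger name


class ShutdownTracer:
    """Toy main loop that logs each stage of a signal-driven exit."""

    def __init__(self):
        self.signal_seen = None      # set by the handler
        self.quit_requested = False  # set by the loop once it notices

    def on_signal(self, signum, frame):
        # Stage 1: the signal reached the process. Only record it here.
        self.signal_seen = signum

    def install(self, signum=signal.SIGTERM):
        signal.signal(signum, self.on_signal)

    def run(self, step):
        """Call step() repeatedly until a signal is noticed; return the count."""
        iterations = 0
        while True:
            if self.signal_seen is not None and not self.quit_requested:
                # Stage 2: the loop noticed the signal and asks itself to quit.
                log.info("received signal %d, beginning shutdown",
                         self.signal_seen)
                self.quit_requested = True
            if self.quit_requested:
                break
            step()
            iterations += 1
        # Stage 3: the loop really did exit.
        log.info("main loop exited after %d iterations", iterations)
        return iterations
```

A real daemon would call install() once at startup and pass its normal 
per-iteration work as step.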

Another random thought - I suppose attrd could be clobbering some memory 
which mainloop needs to properly process an exit.  That doesn't seem 
likely - but neither of the above options seems very likely either.


----------------------------
An historical note on an early bug that had similar symptoms (but 
affected every process - not just attrd).

First - what caused such a problem (a very long time ago):
     There is a window between checking for signals and going to 
sleep in the poll call, such that a signal arriving in that window 
might be ignored for a while.

     The glib mainloop code has three entry points for each event 
source, which are called on each iteration of the loop:
             prepare, check, dispatch.

There is a poll call which occurs between the prepare and check steps.  
If a signal comes in after the prepare call returns, but before the code 
goes to sleep in the poll system call, it will be ignored until
the poll system call returns.  It will get caught on the next iteration 
of the loop.

The fix was fairly simple - the signal handling code instructs the 
mainloop infrastructure to call poll with an argument which prevents it 
from staying asleep longer than a second.

Then the code processes the signal correctly.
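
A minimal sketch of that idea, in Python rather than the actual glib/C 
code, and with made-up names (BoundedPollLoop, clamp, max_sleep_ms are 
all hypothetical).  It assumes the handler only sets a flag, and that the 
loop never lets poll sleep longer than a fixed cap, so a signal which 
lands in the window is noticed within about a second.

```python
import select
import signal


class BoundedPollLoop:
    """Toy prepare/poll/check loop whose sleep is capped (illustrative only)."""

    def __init__(self, max_sleep_ms=1000):
        self.max_sleep_ms = max_sleep_ms  # cap from the fix: about a second
        self.pending = False              # set by the handler, read by the loop
        self.poller = select.poll()       # no fds registered: poll() just sleeps

    def handler(self, signum, frame):
        # Only record the signal; the loop acts on it later.
        self.pending = True

    def install(self, signum=signal.SIGTERM):
        signal.signal(signum, self.handler)

    def clamp(self, wanted_ms):
        # None or a negative value means "sleep forever" to poll(); never allow it.
        if wanted_ms is None or wanted_ms < 0:
            return self.max_sleep_ms
        return min(wanted_ms, self.max_sleep_ms)

    def run(self, wanted_ms):
        """Loop until a signal has been recorded; return the number of polls."""
        polls = 0
        while not self.pending:  # prepare/check: look at the flag
            # A signal landing right here, after the check but before poll(),
            # is the window described above. The clamp bounds how long it waits.
            self.poller.poll(self.clamp(wanted_ms))
            polls += 1
        return polls
```

Usage would be install() followed by run(60000): even though the caller 
asked for a 60 second sleep, each poll returns after at most one second 
and the flag is rechecked.  Note that Python (3.5 and later) restarts an 
interrupted poll after running the handler, so in this sketch the flag is 
only seen when poll returns.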


On 10/17/2011 07:19 PM, renayama19661014 at ybb.ne.jp wrote:
> Hi,
>
> attrd sometimes fails to stop.
>
> Step 1. Start a two-node cluster.
> Step 2. Stop the first node (/etc/init.d/heartbeat stop).
> Step 3. Stop the second node a little later (/etc/init.d/heartbeat stop).
>
> attrd catches the TERM signal, but does not stop.
>
> (snip)
> Oct  5 02:37:38 hpdb0201 crmd: [12238]: info: do_exit: [crmd] stopped (0)
> Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_ipc_message: IPC Channel to
> 12238 is not connected
> Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: send_via_callback_channel:
> Delivery of reply to client 12238/0dbc9e28-d90d-4335-b9c4-9dd3fcb38163 failed
> Oct  5 02:37:38 hpdb0201 cib: [12234]: WARN: do_local_notify: A-Sync reply to
> crmd failed: reply failed
> Oct  5 02:37:38 hpdb0201 heartbeat: [12223]: info: killing
> /usr/lib64/heartbeat/attrd process group 12237 with signal 15
> Oct  5 02:47:03 hpdb0201 cib: [12234]: info: cib_stats: Processed 97 operations
> (4123.00us average, 0% utilization) in the last 10min
> Oct  5 07:15:25 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
> channel took 1010 ms (>  100 ms)
> Oct  5 07:15:26 hpdb0201 ccm: [12233]: WARN: G_CH_check_int: working on IPC
> channel took 1010 ms (>  100 ms)
> Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
> being called (GSource: 0xd28010)
> Oct  5 07:15:37 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> started at 431583547 should have started at 431583444
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before
> being called (GSource: 0xd27dd0)
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> started at 431584254 should have started at 431584151
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
> being called (GSource: 0xd28010)
> Oct  5 07:15:44 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> started at 431584254 should have started at 431584151
> Oct  5 07:16:59 hpdb0201 heartbeat: [12223]: WARN: G_CH_check_int: working on
> write child took 1010 ms (>  100 ms)
> Oct  5 07:17:14 hpdb0201 stonithd: [12236]: WARN: G_CH_check_int: working on
> Heartbeat API channel took 1010 ms (>  100 ms)
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 1030 ms (>  1010 ms) before
> being called (GSource: 0xd27dd0)
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> started at 431607988 should have started at 431607885
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: WARN: Gmain_timeout_dispatch:
> Dispatch function for check for signals was delayed 1030 ms (>  1010 ms) before
> being called (GSource: 0xd28010)
> Oct  5 07:19:41 hpdb0201 heartbeat: [12223]: info: Gmain_timeout_dispatch:
> started at 431607988 should have started at 431607885
> (snip)
>
> We have tried to reproduce the problem, but it rarely recurs.
>
> The same problem is reported in the following email thread.
> However, that discussion stopped partway without a resolution.
>
>   * http://www.gossamer-threads.com/lists/linuxha/pacemaker/62147
>
> The problem occurred with the following combination.
>   * pacemaker-1.0.11
>   * resource-agents-3.9.2
>   * cluster-glue-1.0.7
>   * heartbeat-3.0.5
>
> I registered this problem in Bugzilla.
>   * http://bugs.clusterlabs.org/show_bug.cgi?id=5004
>
> Best Regards,
> Hideo Yamauchi.


-- 
     Alan Robertson<alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce



