[Pacemaker] [Problem] The attrd does not sometimes stop.

renayama19661014 at ybb.ne.jp
Sun Jan 15 19:23:11 EST 2012


Hi Lars,

Thank you for your comments and suggestion.

> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1
> 
> Note the -1 (infinite timeout!)
> 
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file descriptors.
> 
> 
> I suggest this:
> 
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
> 
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
> 
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>      crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +        *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +        *timeout = 5000; /* arbitrary */
>  
>      return trig->trigger;
>  }
> 
> 
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.
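
For readers following the quoted patch, here is a simplified model in plain C of how a GLib-style main loop turns the per-source prepare() results into the single timeout it hands to poll(). This is only a sketch under assumptions, not GLib or Pacemaker code: `struct prepare_result`, `combine_poll_timeout()` and `patched_trigger_prepare()` are hypothetical names made up for the illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for what one source's prepare() reports. */
struct prepare_result {
    int ready;      /* non-zero: the source wants to be dispatched now */
    int timeout_ms; /* -1: no limit requested, >= 0: longest allowed wait */
};

/*
 * Combine all prepare() results into one poll() timeout:
 *   - any ready source forces 0 (do not block at all),
 *   - otherwise the smallest non-negative request wins,
 *   - if nobody asked for a limit, the result is -1 (block forever).
 */
int combine_poll_timeout(const struct prepare_result *results, size_t n)
{
    int timeout = -1;

    for (size_t i = 0; i < n; i++) {
        if (results[i].ready)
            return 0;
        if (results[i].timeout_ms >= 0 &&
            (timeout < 0 || results[i].timeout_ms < timeout))
            timeout = results[i].timeout_ms;
    }
    return timeout;
}

/* Model of the patched crm_trigger_prepare(): never leave the timeout at -1. */
struct prepare_result patched_trigger_prepare(int trigger_set)
{
    struct prepare_result r;

    r.ready = trigger_set;
    r.timeout_ms = trigger_set ? 0 : 5000; /* same arbitrary 5 s as the patch */
    return r;
}
```

In this model, a set of idle sources that all leave the timeout at -1 yields -1, which is the value seen in the last poll() of the strace. As soon as one source reports 5000, the loop wakes up at least every five seconds, so a trigger that was set just after prepare() ran is noticed on the next iteration.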

I will apply the correction you suggested and continue investigating the problem.

I will report back if I learn anything new.

Best Regards,
Hideo Yamauchi.

--- On Sat, 2012/1/14, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661014 at ybb.ne.jp wrote:
> > Hi Lars,
> > 
> > I am attaching the strace output from when the problem reappeared at the end of last year.
> > For confirmation, I used a glue build with your patch applied.
> > 
> > I captured it by running strace -p on attrd right before stopping Heartbeat.
> > 
> > In the end attrd received SIGTERM, but it did not stop.
> > attrd stopped only after I sent SIGKILL.
> 
> The strace reveals something interesting:
> 
> This poll looks like the mainloop poll,
> but some ->prepare() has modified the timeout to be 0,
> so we proceed directly to ->check() and then ->dispatch().
> 
> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
> 
> > times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
> > recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily unavailable)
> ...
> > recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be restarted)
> > --- SIGTERM (Terminated) @ 0 (0) ---
> > sigreturn()                             = ? (mask now [])
> 
> Ok. signal received, trigger set.
> Still finishing this mainloop iteration, though.
> 
> These recv(),poll() look like invocations of G_CH_prepare_int().
> Does not matter much, though.
> 
> > recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
> > recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
> 
> > times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634
> 
> Now we proceed to the next mainloop poll:
> 
> > poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1
> 
> Note the -1 (infinite timeout!)
> 
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file descriptors.
> 
> 
> I suggest this:
> 
> crm_trigger_prepare should set *timeout = 0, if trigger is set.
> 
> Also think about this race: crm_trigger_prepare was already
> called, only then the signal came in...
> 
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>      crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +        *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +        *timeout = 5000; /* arbitrary */
>  
>      return trig->trigger;
>  }
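
The finite timeout in the patch above only bounds how long the race (signal delivered after prepare() has already run) can delay processing. A different, commonly used way to close that window is the self-pipe trick, where the signal handler writes a byte to a pipe that the poll() call watches. This is not part of the suggested patch and not how Pacemaker is known to do it; the sketch below is a standalone illustration, and `self_pipe_install()` and `self_pipe_wait()` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wake_pipe[2] = { -1, -1 };

/* Signal handler: only async-signal-safe calls, errno preserved. */
static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char) signo;

    if (write(wake_pipe[1], &byte, 1) < 0) {
        /* Pipe full: a wakeup is already pending, nothing more to do. */
    }
    errno = saved_errno;
}

/* Create the non-blocking pipe and route signo to it. Returns 0 on success. */
int self_pipe_install(int signo)
{
    struct sigaction sa;

    if (pipe(wake_pipe) != 0)
        return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(wake_pipe[i], F_GETFL);
        if (flags < 0 || fcntl(wake_pipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/*
 * Wait up to timeout_ms (-1 = forever) for a signal wakeup.
 * Returns 1 if a signal was pending or arrived, 0 on timeout, -1 on error.
 */
int self_pipe_wait(int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_pipe[0], .events = POLLIN };
    unsigned char buf[16];
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0)
        return rc;
    while (read(wake_pipe[0], buf, sizeof buf) > 0) {
        /* Drain every pending wakeup byte. */
    }
    return 1;
}
```

Because the byte stays in the pipe until it is read, a signal that arrives before poll() is entered still makes the read end readable, so even a wait with an infinite timeout returns immediately.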
> 
> 
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> 
> 



