[Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

Fri Jan 10 10:23:20 EST 2014

----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Friday, January 10, 2014 5:23:04 AM
> Subject: Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
> 
> 2014/1/9 Andrew Beekhof <andrew at beekhof.net>:
> >
> > On 8 Jan 2014, at 9:15 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com>
> > wrote:
> >
> >> 2014/1/8 Andrew Beekhof <andrew at beekhof.net>:
> >>>
> >>> On 18 Dec 2013, at 9:50 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com>
> >>> wrote:
> >>>
> >>>> Hi David,
> >>>>
> >>>> 2013/12/18 David Vossel <dvossel at redhat.com>:
> >>>>>
> >>>>> That's a really weird one... I don't see how it is possible for op->id
> >>>>> to be NULL there.   You might need to give valgrind a shot to detect
> >>>>> whatever is really going on here.
> >>>>>
> >>>>> -- Vossel
> >>>>>
> >>>> Thank you for the advice. I will try it.
> >>>
> >>> Any update on this?
> >>>
> >>
> >> We are still investigating the cause. It was not reproduced when I
> >> ran under valgrind.
> >> However, it was reproduced in RC3.
> >
> > So it happened with RC3 - valgrind, but not with RC3 + valgrind?
> > That's concerning.
> >
> > Nothing in the valgrind output?
> >
> 
> The cause was found.
> 
> 230 gboolean
> 231 operation_finalize(svc_action_t * op)
> 232 {
> 233     int recurring = 0;
> 234
> 235     if (op->interval) {
> 236         if (op->cancel) {
> 237             op->status = PCMK_LRM_OP_CANCELLED;
> 238             cancel_recurring_action(op);
> 239         } else {
> 240             recurring = 1;
> 241             op->opaque->repeat_timer = g_timeout_add(op->interval,
> 242                                                      recurring_action_timer, (void *)op);
> 243         }
> 244     }
> 245
> 246     if (op->opaque->callback) {
> 247         op->opaque->callback(op);
> 248     }
> 249
> 250     op->pid = 0;
> 251
> 252     if (!recurring) {
> 253         /*
> 254          * If this is a recurring action, do not free explicitly.
> 255          * It will get freed whenever the action gets cancelled.
> 256          */
> 257         services_action_free(op);
> 258         return TRUE;
> 259     }
> 260     return FALSE;
> 261 }
> 
> When op->id is not 0, op is not removed from the hash table in the
> cancel_recurring_action function (l.238).
> However, op is freed in the services_action_free function (l.257), so
> the freed data remains in the hash table.
> A later g_hash_table_lookup call may then look up the freed data.
> 
> Therefore, for the case where g_hash_table_replace was called (in the
> services_action_async function), I added a change so that
> g_hash_table_remove is always called.
> So far, the segfault has not happened again.

Awesome, thanks for tracking this down.  I created a modified version of your patch and put it up for review as a pacemaker pull request.
https://github.com/ClusterLabs/pacemaker/pull/408

-- Vossel
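
The failure mode described in the quoted message (a freed action still referenced from a lookup table) can be illustrated with a minimal sketch. This is not the Pacemaker code and not the patch from the pull request: action_t, registry_replace, registry_lookup, registry_remove, action_new and action_free are all hypothetical names, and a small fixed-size array stands in for the GLib hash table. The only point it makes is that the free path drops the table entry before releasing the memory, so a later lookup cannot return a dangling pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ACTIONS 16

/* Hypothetical stand-in for svc_action_t; only the fields this sketch needs. */
typedef struct action {
    char *id;
    int interval;
} action_t;

/* Toy registry standing in for the hash table of recurring actions. */
static action_t *registry[MAX_ACTIONS];

/* Return the registered action with this id, or NULL if there is none. */
action_t *registry_lookup(const char *id)
{
    for (int i = 0; i < MAX_ACTIONS; i++) {
        if (registry[i] != NULL && strcmp(registry[i]->id, id) == 0) {
            return registry[i];
        }
    }
    return NULL;
}

/* Drop the entry that points at op, if any. Returns 1 if an entry was removed. */
int registry_remove(const action_t *op)
{
    for (int i = 0; i < MAX_ACTIONS; i++) {
        if (registry[i] == op) {
            registry[i] = NULL;
            return 1;
        }
    }
    return 0;
}

/*
 * Register op, replacing any entry with the same id. The replaced action is
 * not freed here; it stays owned by the caller. Returns 0 on success and -1
 * if the registry is full.
 */
int registry_replace(action_t *op)
{
    int free_slot = -1;

    assert(op != NULL && op->id != NULL);
    for (int i = 0; i < MAX_ACTIONS; i++) {
        if (registry[i] == NULL) {
            if (free_slot < 0) {
                free_slot = i;
            }
        } else if (strcmp(registry[i]->id, op->id) == 0) {
            registry[i] = op;
            return 0;
        }
    }
    if (free_slot < 0) {
        return -1;
    }
    registry[free_slot] = op;
    return 0;
}

/* Allocate an action with a private copy of id. Returns NULL on failure. */
action_t *action_new(const char *id, int interval)
{
    action_t *op = calloc(1, sizeof(*op));

    if (op == NULL) {
        return NULL;
    }
    op->id = malloc(strlen(id) + 1);
    if (op->id == NULL) {
        free(op);
        return NULL;
    }
    strcpy(op->id, id);
    op->interval = interval;
    return op;
}

/* Free op, first removing any registry entry that still points at it. */
void action_free(action_t *op)
{
    if (op == NULL) {
        return;
    }
    registry_remove(op);   /* without this step the registry keeps a dangling pointer */
    free(op->id);
    free(op);
}
```

Under these assumptions, registering an action with registry_replace and then calling action_free leaves registry_lookup returning NULL for that id. If the registry_remove call were omitted from action_free, the same lookup would hand back freed memory, which is the kind of access that can crash only intermittently.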