[ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed
Christine Caulfield
ccaulfie at redhat.com
Wed Jan 27 11:16:54 UTC 2021
OK. There are definitely race conditions in the timer handling of libqb
- at least according to helgrind which found 2 rather easily, so I
suspect that is what is happening here.
I'm preparing a patch to fix them at the moment. The patch will go into
the version 2 stream. Depending on which distribution you're using you
might need to ask for a backport of that patch to v1 or, if you're using
sources, ask me very nicely for a backport to upstream ;-)
Chrissie
On 27/01/2021 08:07, Christine Caulfield wrote:
> Hi,
>
> I've never seen anything like that before and, as ou say, no real work
> ha sbeen don in that are since 1.0.2 so I doubt it's a fixed bug.
>
> Although libqb claims thread-safety I wouldn't rule out a threading bug
> with you adding timers from another thread. Looking at the test suite
> there is actually very little testing of libqbs thread-safety so that's
> where I would start looking for a start.
>
> I'll have a play with it today, but you are able to come up with a
> reproducer for us that would be very helpful.
>
> Chrissie
>
> On 25/01/2021 05:29, Miranda, Russel wrote:
>> Hello ClusterLabs Developers,
>>
>> I'm a developer working on an application that is using libqb (1.0.2)
>> for its main event loop, and after running for a few hours, I'm
>> encountering this assertion failure:
>>
>> loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state ==
>> QB_POLL_ENTRY_ACTIVE' failed.
>>
>> The assertion causes a SIGABRT, and generates a core file. The stack
>> trace is at the point of the failure is:
>>
>> #0 0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
>> #1 0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
>> #2 0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
>> #3 0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
>> #4 0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) at
>> loop_timerlist.c:67
>> #5 0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at
>> ../include/tlist.h:210
>> #6 expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) at
>> loop_timerlist.c:78
>> #7 0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) at
>> loop.c:175 ...
>>
>> Going back up to the point of the assertion (make_job_from_tmo), here
>> is what the "data" parameter looks like:
>>
>> (gdb) p *((struct qb_loop_timer*)data)
>> $1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source =
>> 0x5dffe80, user_data = 0x7ff3c9144f60 <sessions+6240>, type =
>> QB_LOOP_FD},
>> dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p =
>> QB_LOOP_MED, timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY,
>> check = 978999470, install_pos = 11}
>>
>> Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTY
>> when the assertion fails.
>>
>> The timer that appears to have expired was added a few seconds earlier
>> at the point when a message was sent (via socket) to an external
>> entity. Typically, when that entity responds, the timer is stopped
>> (via qb_loop_timer_del) to prevent expiration. If timer expiration
>> occurs before the response is received, the registered dispatch
>> function is called to handle failure of the external entity to respond.
>>
>> That this timer would expire and call the dispatch_fn
>> (sessionTimeoutHandler(void*)) passing in the value of user_data
>> (0x7ff3c9144f60) at this point in the operation of the program is
>> entirely plausible / expected, and I have validated that user_data
>> points to a valid object of the appropriate type as well - so all of
>> the caller-settable data looks correct to me in the timer entry. While
>> I cannot completely exclude the possibility that something stepped on
>> the t->state value, it seems unlikely that it would have affected only
>> that one object property in the timer entry.
>>
>> It appears to me that the timer has expired, but the timer list entry
>> has the wrong state. I'm confused to how that might occur. I've been
>> searching through the API documentation and online through forums, and
>> I can't find anyone else reporting this particular assertion failure
>> within "make_job_from_tmo", or even suggesting what we might be doing
>> wrong to cause it. Has anyone seen this before?
>>
>> Some additional notes:
>>
>> * There is another thread active that can call qb_loop_timer_add to
>> add a timer to the main event loop (it is not the timer that has
>> expired here), but it could have been adding a timer at the same time
>> this timer expired.
>> My understanding is that (excluding the IPC calls), libqb is
>> thread-safe, so it should be safe to call the qb_loop_timer_add() API
>> from another thread. qb_loop_timer_add() is the only libqb API called
>> from the other thread.
>> * In some cases we are adding timers that expire immediately (that is,
>> with a delay as short as 0ns). That is not the case with this
>> particular timer, it has a delay that is several seconds long.
>> The qb_loop_timer_add() API doesn't indicate a minimum for the
>> delay, so I am not aware of this being an incorrect usage - though we
>> could probably switch to calling the qb_loop_job_add API instead if
>> that is an incorrect use of the API.
>>
>> None of the libqb releases since 1.0.2 appear to have any changes
>> introduced specifically to address a problem like this - though I do
>> see there were some minor updates made to the loop_timerlist.c code to
>> silence warnings. If there is a known problem in this area in 1.0.2, I
>> will try see how hard it is going to be to step up to the newer version.
>>
>> I'm hoping someone has seen this problem before, or can suggest
>> something that we might be doing wrong in the use of the API that
>> could lead to this state.
>>
>> Sincere thanks for any suggestions,
>>
>> Russ
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/developers
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/developers
>
> ClusterLabs home: https://www.clusterlabs.org/
>
More information about the Developers
mailing list