[ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed

Christine Caulfield ccaulfie at redhat.com
Wed Jan 27 06:16:54 EST 2021


OK. There are definitely race conditions in the timer handling of libqb 
- at least according to helgrind which found 2 rather easily, so I 
suspect that is what is happening here.

I'm preparing a patch to fix them at the moment. The patch will go into 
the version 2 stream. Depending on which distribution you're using you 
might need to ask for a backport of that patch to v1 or, if you're using 
sources, ask me very nicely for a backport to upstream ;-)

Chrissie

On 27/01/2021 08:07, Christine Caulfield wrote:
> Hi,
> 
> I've never seen anything like that before and, as ou say, no real work 
> ha sbeen don in that are since 1.0.2 so I doubt it's a fixed bug.
> 
> Although libqb claims thread-safety I wouldn't rule out a threading bug 
> with you adding timers from another thread.  Looking at the test suite 
> there is actually very little testing of libqbs thread-safety so that's 
> where I would start looking for a start.
> 
> I'll have a play with it today, but you are able to come up with a 
> reproducer for us that would be very helpful.
> 
> Chrissie
> 
> On 25/01/2021 05:29, Miranda, Russel wrote:
>> Hello ClusterLabs Developers,
>>
>> I'm a developer working on an application that is using libqb (1.0.2) 
>> for its main event loop, and after running for a few hours, I'm 
>> encountering this assertion failure:
>>
>> loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state == 
>> QB_POLL_ENTRY_ACTIVE' failed.
>>
>> The assertion causes a SIGABRT, and generates a core file. The stack 
>> trace is at the point of the failure is:
>>
>> #0  0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
>> #1  0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
>> #2  0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
>> #3  0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
>> #4  0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) at 
>> loop_timerlist.c:67
>> #5  0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at 
>> ../include/tlist.h:210
>> #6  expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) at 
>> loop_timerlist.c:78
>> #7  0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) at 
>> loop.c:175 ...
>>
>> Going back up to the point of the assertion (make_job_from_tmo), here 
>> is what the "data" parameter looks like:
>>
>> (gdb) p *((struct qb_loop_timer*)data)
>> $1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source = 
>> 0x5dffe80, user_data = 0x7ff3c9144f60 <sessions+6240>, type = 
>> QB_LOOP_FD},
>>    dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p = 
>> QB_LOOP_MED, timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY, 
>> check = 978999470, install_pos = 11}
>>
>> Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTY 
>> when the assertion fails.
>>
>> The timer that appears to have expired was added a few seconds earlier 
>> at the point when a message was sent (via socket) to an external 
>> entity. Typically, when that entity responds, the timer is stopped 
>> (via qb_loop_timer_del) to prevent expiration. If timer expiration 
>> occurs before the response is received, the registered dispatch 
>> function is called to handle failure of the external entity to respond.
>>
>> That this timer would expire and call the dispatch_fn 
>> (sessionTimeoutHandler(void*)) passing in the value of user_data 
>> (0x7ff3c9144f60) at this point in the operation of the program is 
>> entirely plausible / expected, and I have validated that user_data 
>> points to a valid object of the appropriate type as well - so all of 
>> the caller-settable data looks correct to me in the timer entry. While 
>> I cannot completely exclude the possibility that something stepped on 
>> the t->state value, it seems unlikely that it would have affected only 
>> that one object property in the timer entry.
>>
>> It appears to me that the timer has expired, but the timer list entry 
>> has the wrong state. I'm confused to how that might occur. I've been 
>> searching through the API documentation and online through forums, and 
>> I can't find anyone else reporting this particular assertion failure 
>> within "make_job_from_tmo", or even suggesting what we might be doing 
>> wrong to cause it. Has anyone seen this before?
>>
>> Some additional notes:
>>
>> * There is another thread active that can call qb_loop_timer_add to 
>> add a timer to the main event loop (it is not the timer that has 
>> expired here), but it could have been adding a timer at the same time 
>> this timer expired.
>>    My understanding is that (excluding the IPC calls), libqb is 
>> thread-safe, so it should be safe to call the qb_loop_timer_add() API 
>> from another thread. qb_loop_timer_add() is the only libqb API called 
>> from the other thread.
>> * In some cases we are adding timers that expire immediately (that is, 
>> with a delay as short as 0ns). That is not the case with this 
>> particular timer, it has a delay that is several seconds long.
>>    The qb_loop_timer_add() API doesn't indicate a minimum for the 
>> delay, so I am not aware of this being an incorrect usage - though we 
>> could probably switch to calling the qb_loop_job_add API instead if 
>> that is an incorrect use of the API.
>>
>> None of the libqb releases since 1.0.2 appear to have any changes 
>> introduced specifically to address a problem like this - though I do 
>> see there were some minor updates made to the loop_timerlist.c code to 
>> silence warnings. If there is a known problem in this area in 1.0.2, I 
>> will try see how hard it is going to be to step up to the newer version.
>>
>> I'm hoping someone has seen this problem before, or can suggest 
>> something that we might be doing wrong in the use of the API that 
>> could lead to this state.
>>
>> Sincere thanks for any suggestions,
>>
>> Russ
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/developers
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/developers
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 



More information about the Developers mailing list