From kgaillot at redhat.com  Mon Jan 11 13:59:27 2021
From: kgaillot at redhat.com (Ken Gaillot)
Date: Mon, 11 Jan 2021 12:59:27 -0600
Subject: [ClusterLabs Developers] Clean-up needed if updating a Pacemaker checkout
Message-ID: 

Hi all,

As part of the build changes for the next (2.1.0) Pacemaker release,
the top-level directory has been de-cluttered by moving more build
artifacts to subdirectories. A side effect is that if you pull the
latest master branch in an existing checkout, you'll have to clean up
the build files and regenerate them, e.g.:

rm -rf aclocal.m4 autom4te.cache config.guess config.log config.status \
    config.sub configure depcomp install-sh libltdl libtool ltmain.sh \
    missing py-compile test-driver
./autogen.sh
./configure
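
(Alternatively -- untested here, so preview it first -- letting git
remove everything it doesn't track should have the same effect:

git clean -ndx   # dry run: show what would be deleted
git clean -fdx   # delete all untracked and ignored files
./autogen.sh
./configure

Note that -x also removes ignored files, so don't use it in a checkout
holding local state you want to keep.)
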
-- 
Ken Gaillot

From kgaillot at redhat.com  Tue Jan 19 16:34:13 2021
From: kgaillot at redhat.com (Ken Gaillot)
Date: Tue, 19 Jan 2021 15:34:13 -0600
Subject: [ClusterLabs Developers] Any projects using Github large file storage?
Message-ID: <62b9d6281aafb5978738b23761489600853f2ce4.camel@redhat.com>

Hi all,

Does anyone remember what ClusterLabs project is using Github's Large
File Storage (https://git-lfs.github.com/)?

They notified us that "Git LFS has been disabled on the organization
ClusterLabs because you've exceeded your data plan by at least 150%"
(we're currently using 2.2 GB).

I vaguely remember someone wanting that capability, but I don't
remember the project(s). That project will need to either move away
from LFS, or cough up for a data plan.

If no one remembers, I can ask their support.
-- 
Ken Gaillot

From lists at alteeve.ca  Tue Jan 19 18:00:15 2021
From: lists at alteeve.ca (Digimer)
Date: Tue, 19 Jan 2021 18:00:15 -0500
Subject: [ClusterLabs Developers] Any projects using Github large file storage?
In-Reply-To: <62b9d6281aafb5978738b23761489600853f2ce4.camel@redhat.com>
References: <62b9d6281aafb5978738b23761489600853f2ce4.camel@redhat.com>
Message-ID: <1c4312a7-3fc3-4b43-ac76-fc31f1f857b2@alteeve.ca>

On 2021-01-19 4:34 p.m., Ken Gaillot wrote:
> Hi all,
> 
> Does anyone remember what ClusterLabs project is using Github's Large
> File Storage (https://git-lfs.github.com/)?
> 
> They notified us that "Git LFS has been disabled on the organization
> ClusterLabs because you've exceeded your data plan by at least 150%"
> (we're currently using 2.2 GB).
> 
> I vaguely remember someone wanting that capability, but I don't
> remember the project(s). That project will need to either move away
> from LFS, or cough up for a data plan.
> 
> If no one remembers, I can ask their support.

That would be anvil-striker-extra, which I just moved into ClusterLabs.
I figured it would keep billing me though. If it's affecting any other
projects and it's not a quick fix, I'll pull it out.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein's brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops."
- Stephen Jay Gould

From lists at alteeve.ca  Tue Jan 19 18:02:35 2021
From: lists at alteeve.ca (Digimer)
Date: Tue, 19 Jan 2021 18:02:35 -0500
Subject: [ClusterLabs Developers] Any projects using Github large file storage?
In-Reply-To: <62b9d6281aafb5978738b23761489600853f2ce4.camel@redhat.com>
References: <62b9d6281aafb5978738b23761489600853f2ce4.camel@redhat.com>
Message-ID: <4b4735de-6941-89db-d749-67c2b08d0b9a@alteeve.ca>

On 2021-01-19 4:34 p.m., Ken Gaillot wrote:
> Hi all,
> 
> Does anyone remember what ClusterLabs project is using Github's Large
> File Storage (https://git-lfs.github.com/)?
> 
> They notified us that "Git LFS has been disabled on the organization
> ClusterLabs because you've exceeded your data plan by at least 150%"
> (we're currently using 2.2 GB).
> 
> I vaguely remember someone wanting that capability, but I don't
> remember the project(s). That project will need to either move away
> from LFS, or cough up for a data plan.
> 
> If no one remembers, I can ask their support.

Update: I pulled it back out (pushed it under my own user's repos). Let
me know if this causes any further problems. I'll have to figure out
how to ensure that this continues to bill Alteeve and not interfere
with ClusterLabs in general.

Sorry for any hassle.

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein's brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops."
- Stephen Jay Gould

From kgaillot at redhat.com  Tue Jan 19 18:08:17 2021
From: kgaillot at redhat.com (Ken Gaillot)
Date: Tue, 19 Jan 2021 17:08:17 -0600
Subject: [ClusterLabs Developers] Any projects using Github large file storage?
In-Reply-To: <4b4735de-6941-89db-d749-67c2b08d0b9a@alteeve.ca>
References: <62b9d6281aafb5978738b23761489600853f2ce4.camel@redhat.com>
	<4b4735de-6941-89db-d749-67c2b08d0b9a@alteeve.ca>
Message-ID: 

On Tue, 2021-01-19 at 18:02 -0500, Digimer wrote:
> On 2021-01-19 4:34 p.m., Ken Gaillot wrote:
> > Hi all,
> > 
> > Does anyone remember what ClusterLabs project is using Github's
> > Large File Storage (https://git-lfs.github.com/)?
> > 
> > They notified us that "Git LFS has been disabled on the
> > organization ClusterLabs because you've exceeded your data plan by
> > at least 150%" (we're currently using 2.2 GB).
> > 
> > I vaguely remember someone wanting that capability, but I don't
> > remember the project(s). That project will need to either move away
> > from LFS, or cough up for a data plan.
> > 
> > If no one remembers, I can ask their support.
> 
> Update: I pulled it back out (pushed it under my own user's repos).
> Let me know if this causes any further problems. I'll have to figure
> out how to ensure that this continues to bill Alteeve and not
> interfere with ClusterLabs in general.
> 
> Sorry for any hassle.

No problem -- I thought someone long ago wanted it, but I most likely
misremembered :)
-- 
Ken Gaillot

From russel.miranda at lmco.com  Mon Jan 25 00:29:30 2021
From: russel.miranda at lmco.com (Miranda, Russel)
Date: Mon, 25 Jan 2021 05:29:30 +0000
Subject: [ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed
In-Reply-To: <4135cf388b704950b4da98fa4e5ac2e1@lmco.com>
References: <4135cf388b704950b4da98fa4e5ac2e1@lmco.com>
Message-ID: 

Hello ClusterLabs Developers,

I'm a developer working on an application that is using libqb (1.0.2)
for its main event loop, and after running for a few hours, I'm
encountering this assertion failure:

loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state == QB_POLL_ENTRY_ACTIVE' failed.

The assertion causes a SIGABRT, and generates a core file.
The stack trace at the point of the failure is:

#0  0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
#1  0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
#2  0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) at loop_timerlist.c:67
#5  0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at ../include/tlist.h:210
#6  expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) at loop_timerlist.c:78
#7  0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) at loop.c:175
...

Going back up to the point of the assertion (make_job_from_tmo), here
is what the "data" parameter looks like:

(gdb) p *((struct qb_loop_timer*)data)
$1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source = 0x5dffe80,
    user_data = 0x7ff3c9144f60, type = QB_LOOP_FD},
  dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p = QB_LOOP_MED,
  timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY, check = 978999470,
  install_pos = 11}

Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTY
when the assertion fails.

The timer that appears to have expired was added a few seconds earlier
at the point when a message was sent (via socket) to an external
entity. Typically, when that entity responds, the timer is stopped
(via qb_loop_timer_del) to prevent expiration. If timer expiration
occurs before the response is received, the registered dispatch
function is called to handle failure of the external entity to respond.

That this timer would expire and call the dispatch_fn
(sessionTimeoutHandler(void*)) passing in the value of user_data
(0x7ff3c9144f60) at this point in the operation of the program is
entirely plausible / expected, and I have validated that user_data
points to a valid object of the appropriate type as well - so all of
the caller-settable data looks correct to me in the timer entry. While
I cannot completely exclude the possibility that something stepped on
the t->state value, it seems unlikely that it would have affected only
that one object property in the timer entry.

It appears to me that the timer has expired, but the timer list entry
has the wrong state. I'm confused as to how that might occur. I've been
searching through the API documentation and online through forums, and
I can't find anyone else reporting this particular assertion failure
within "make_job_from_tmo", or even suggesting what we might be doing
wrong to cause it. Has anyone seen this before?

Some additional notes:

* There is another thread active that can call qb_loop_timer_add to add
  a timer to the main event loop (it is not the timer that has expired
  here), but it could have been adding a timer at the same time this
  timer expired. My understanding is that (excluding the IPC calls),
  libqb is thread-safe, so it should be safe to call the
  qb_loop_timer_add() API from another thread. qb_loop_timer_add() is
  the only libqb API called from the other thread.
* In some cases we are adding timers that expire immediately (that is,
  with a delay as short as 0ns). That is not the case with this
  particular timer, it has a delay that is several seconds long. The
  qb_loop_timer_add() API doesn't indicate a minimum for the delay, so
  I am not aware of this being an incorrect usage - though we could
  probably switch to calling the qb_loop_job_add API instead if that is
  an incorrect use of the API.
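
In case it helps, here is a stripped-down sketch of the usage pattern
described above (names are illustrative stand-ins, not our real code):

#include <qb/qbloop.h>
#include <stdio.h>

static qb_loop_t *loop;                     /* from qb_loop_create() at startup */
static qb_loop_timer_handle response_timer; /* handle for the pending timeout */

/* Dispatched only if the peer never responds in time. */
static void session_timeout_handler(void *session)
{
    fprintf(stderr, "session %p timed out\n", session);
}

static void request_sent(void *session)
{
    /* Arm a 5-second response timeout (the duration is in nanoseconds). */
    qb_loop_timer_add(loop, QB_LOOP_MED, 5ULL * 1000000000ULL,
                      session, session_timeout_handler, &response_timer);
}

static void response_received(void *session)
{
    /* The response arrived in time, so cancel the pending timeout. */
    qb_loop_timer_del(loop, response_timer);
}

The "expire immediately" case in the second note is the same
qb_loop_timer_add call with a 0ns duration; if that usage turns out to
be the problem, we could replace it with
qb_loop_job_add(loop, QB_LOOP_MED, data, dispatch_fn), which queues the
callback directly rather than going through the timer list.
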
None of the libqb releases since 1.0.2 appear to have any changes
introduced specifically to address a problem like this - though I do
see there were some minor updates made to the loop_timerlist.c code to
silence warnings. If there is a known problem in this area in 1.0.2, I
will try to see how hard it is going to be to step up to the newer
version.

I'm hoping someone has seen this problem before, or can suggest
something that we might be doing wrong in the use of the API that
could lead to this state.

Sincere thanks for any suggestions,

Russ

From ccaulfie at redhat.com  Wed Jan 27 03:07:09 2021
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 27 Jan 2021 08:07:09 +0000
Subject: [ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed
In-Reply-To: 
References: <4135cf388b704950b4da98fa4e5ac2e1@lmco.com>
Message-ID: 

Hi,

I've never seen anything like that before and, as you say, no real work
has been done in that area since 1.0.2, so I doubt it's a fixed bug.

Although libqb claims thread-safety, I wouldn't rule out a threading
bug with you adding timers from another thread. Looking at the test
suite, there is actually very little testing of libqb's thread-safety,
so that's where I would start looking.

I'll have a play with it today, but if you are able to come up with a
reproducer for us, that would be very helpful.

Chrissie

On 25/01/2021 05:29, Miranda, Russel wrote:
> Hello ClusterLabs Developers,
> 
> I'm a developer working on an application that is using libqb (1.0.2)
> for its main event loop, and after running for a few hours, I'm
> encountering this assertion failure:
> 
> loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state ==
> QB_POLL_ENTRY_ACTIVE' failed.
> 
> The assertion causes a SIGABRT, and generates a core file. The stack
> trace at the point of the failure is:
> 
> #0  0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
> #1  0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
> #2  0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
> #4  0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) at loop_timerlist.c:67
> #5  0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at ../include/tlist.h:210
> #6  expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) at loop_timerlist.c:78
> #7  0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) at loop.c:175
> ...
> 
> Going back up to the point of the assertion (make_job_from_tmo), here
> is what the "data" parameter looks like:
> 
> (gdb) p *((struct qb_loop_timer*)data)
> $1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source = 0x5dffe80,
>     user_data = 0x7ff3c9144f60, type = QB_LOOP_FD},
>   dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p = QB_LOOP_MED,
>   timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY, check = 978999470,
>   install_pos = 11}
> 
> Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTY
> when the assertion fails.
> 
> The timer that appears to have expired was added a few seconds earlier
> at the point when a message was sent (via socket) to an external
> entity. Typically, when that entity responds, the timer is stopped
> (via qb_loop_timer_del) to prevent expiration. If timer expiration
> occurs before the response is received, the registered dispatch
> function is called to handle failure of the external entity to respond.
> 
> That this timer would expire and call the dispatch_fn
> (sessionTimeoutHandler(void*)) passing in the value of user_data
> (0x7ff3c9144f60) at this point in the operation of the program is
> entirely plausible / expected, and I have validated that user_data
> points to a valid object of the appropriate type as well - so all of
> the caller-settable data looks correct to me in the timer entry. While
> I cannot completely exclude the possibility that something stepped on
> the t->state value, it seems unlikely that it would have affected only
> that one object property in the timer entry.
> 
> It appears to me that the timer has expired, but the timer list entry
> has the wrong state. I'm confused as to how that might occur. I've
> been searching through the API documentation and online through
> forums, and I can't find anyone else reporting this particular
> assertion failure within "make_job_from_tmo", or even suggesting what
> we might be doing wrong to cause it. Has anyone seen this before?
> 
> Some additional notes:
> 
> * There is another thread active that can call qb_loop_timer_add to
>   add a timer to the main event loop (it is not the timer that has
>   expired here), but it could have been adding a timer at the same
>   time this timer expired. My understanding is that (excluding the
>   IPC calls), libqb is thread-safe, so it should be safe to call the
>   qb_loop_timer_add() API from another thread. qb_loop_timer_add() is
>   the only libqb API called from the other thread.
> * In some cases we are adding timers that expire immediately (that
>   is, with a delay as short as 0ns). That is not the case with this
>   particular timer, it has a delay that is several seconds long. The
>   qb_loop_timer_add() API doesn't indicate a minimum for the delay,
>   so I am not aware of this being an incorrect usage - though we
>   could probably switch to calling the qb_loop_job_add API instead if
>   that is an incorrect use of the API.
> 
> None of the libqb releases since 1.0.2 appear to have any changes
> introduced specifically to address a problem like this - though I do
> see there were some minor updates made to the loop_timerlist.c code
> to silence warnings. If there is a known problem in this area in
> 1.0.2, I will try to see how hard it is going to be to step up to the
> newer version.
> 
> I'm hoping someone has seen this problem before, or can suggest
> something that we might be doing wrong in the use of the API that
> could lead to this state.
> 
> Sincere thanks for any suggestions,
> 
> Russ
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/developers
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 

From ccaulfie at redhat.com  Wed Jan 27 06:16:54 2021
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Wed, 27 Jan 2021 11:16:54 +0000
Subject: [ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed
In-Reply-To: 
References: <4135cf388b704950b4da98fa4e5ac2e1@lmco.com>
Message-ID: 

OK. There are definitely race conditions in the timer handling of libqb
- at least according to helgrind, which found 2 rather easily, so I
suspect that is what is happening here.

I'm preparing a patch to fix them at the moment. The patch will go into
the version 2 stream.
Depending on which distribution you're using, you might need to ask for
a backport of that patch to v1 or, if you're using sources, ask me very
nicely for a backport to upstream ;-)

Chrissie

On 27/01/2021 08:07, Christine Caulfield wrote:
> Hi,
> 
> I've never seen anything like that before and, as you say, no real
> work has been done in that area since 1.0.2, so I doubt it's a fixed
> bug.
> 
> Although libqb claims thread-safety, I wouldn't rule out a threading
> bug with you adding timers from another thread. Looking at the test
> suite, there is actually very little testing of libqb's
> thread-safety, so that's where I would start looking.
> 
> I'll have a play with it today, but if you are able to come up with a
> reproducer for us, that would be very helpful.
> 
> Chrissie
> 
> On 25/01/2021 05:29, Miranda, Russel wrote:
>> Hello ClusterLabs Developers,
>>
>> I'm a developer working on an application that is using libqb (1.0.2)
>> for its main event loop, and after running for a few hours, I'm
>> encountering this assertion failure:
>>
>> loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state ==
>> QB_POLL_ENTRY_ACTIVE' failed.
>>
>> The assertion causes a SIGABRT, and generates a core file. The stack
>> trace at the point of the failure is:
>>
>> #0  0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
>> #1  0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
>> #2  0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
>> #3  0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
>> #4  0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) at
>> loop_timerlist.c:67
>> #5  0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at
>> ../include/tlist.h:210
>> #6  expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) at
>> loop_timerlist.c:78
>> #7  0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) at
>> loop.c:175
>> ...
>>
>> Going back up to the point of the assertion (make_job_from_tmo), here
>> is what the "data" parameter looks like:
>>
>> (gdb) p *((struct qb_loop_timer*)data)
>> $1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source =
>> 0x5dffe80, user_data = 0x7ff3c9144f60, type = QB_LOOP_FD},
>>   dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p =
>> QB_LOOP_MED, timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY,
>> check = 978999470, install_pos = 11}
>>
>> Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTY
>> when the assertion fails.
>>
>> The timer that appears to have expired was added a few seconds
>> earlier at the point when a message was sent (via socket) to an
>> external entity. Typically, when that entity responds, the timer is
>> stopped (via qb_loop_timer_del) to prevent expiration. If timer
>> expiration occurs before the response is received, the registered
>> dispatch function is called to handle failure of the external entity
>> to respond.
>>
>> That this timer would expire and call the dispatch_fn
>> (sessionTimeoutHandler(void*)) passing in the value of user_data
>> (0x7ff3c9144f60) at this point in the operation of the program is
>> entirely plausible / expected, and I have validated that user_data
>> points to a valid object of the appropriate type as well - so all of
>> the caller-settable data looks correct to me in the timer entry.
>> While I cannot completely exclude the possibility that something
>> stepped on the t->state value, it seems unlikely that it would have
>> affected only that one object property in the timer entry.
>>
>> It appears to me that the timer has expired, but the timer list entry
>> has the wrong state.
>> I'm confused as to how that might occur. I've been searching through
>> the API documentation and online through forums, and I can't find
>> anyone else reporting this particular assertion failure within
>> "make_job_from_tmo", or even suggesting what we might be doing wrong
>> to cause it. Has anyone seen this before?
>>
>> Some additional notes:
>>
>> * There is another thread active that can call qb_loop_timer_add to
>> add a timer to the main event loop (it is not the timer that has
>> expired here), but it could have been adding a timer at the same
>> time this timer expired. My understanding is that (excluding the IPC
>> calls), libqb is thread-safe, so it should be safe to call the
>> qb_loop_timer_add() API from another thread. qb_loop_timer_add() is
>> the only libqb API called from the other thread.
>> * In some cases we are adding timers that expire immediately (that
>> is, with a delay as short as 0ns). That is not the case with this
>> particular timer, it has a delay that is several seconds long. The
>> qb_loop_timer_add() API doesn't indicate a minimum for the delay, so
>> I am not aware of this being an incorrect usage - though we could
>> probably switch to calling the qb_loop_job_add API instead if that
>> is an incorrect use of the API.
>>
>> None of the libqb releases since 1.0.2 appear to have any changes
>> introduced specifically to address a problem like this - though I do
>> see there were some minor updates made to the loop_timerlist.c code
>> to silence warnings. If there is a known problem in this area in
>> 1.0.2, I will try to see how hard it is going to be to step up to
>> the newer version.
>>
>> I'm hoping someone has seen this problem before, or can suggest
>> something that we might be doing wrong in the use of the API that
>> could lead to this state.
>>
>> Sincere thanks for any suggestions,
>>
>> Russ
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/developers
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/developers
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 

From russel.miranda at lmco.com  Thu Jan 28 23:18:50 2021
From: russel.miranda at lmco.com (Miranda, Russel)
Date: Fri, 29 Jan 2021 04:18:50 +0000
Subject: [ClusterLabs Developers] libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed
In-Reply-To: 
References: <4135cf388b704950b4da98fa4e5ac2e1@lmco.com>
Message-ID: <658d259c11fc4ccebe1095e900c1a08b@lmco.com>

Hi Chrissie,

I'm sorry I didn't get back to you sooner with a minimal example. I
just joined the list and I didn't realize it, but my work email is
apparently eating the mailing list traffic. I just finally had the
brilliant idea to go check the archive.

Since I hadn't heard anything back, I just dove into making a few
changes to how we are using the API. We no longer create 0-delay timers
(instead using qb_loop_job_add), and that seems to have mitigated the
problem somewhat (no make_job_from_tmo assertion failures in 2 days),
but I think it is in our best interest to upgrade to the 2.0.x baseline
anyway. I can't really think of a good reason to stay on 1.0.x unless
there's some vast difference in the API that would make the upgrade
difficult. From my quick review of the changes, there doesn't seem to
be much change to the APIs we use.
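
For the record, the change was essentially this shape (an illustrative
fragment, not our real code):

/* Before: a "run as soon as possible" callback via a 0ns timer */
qb_loop_timer_add(loop, QB_LOOP_MED, 0, ctx, run_soon_fn, &timer_handle);

/* After: an explicit job, which queues the same callback without
 * touching the timer list */
qb_loop_job_add(loop, QB_LOOP_MED, ctx, run_soon_fn);
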
I'll see if I can throw something simple together that can reproduce
the problem anyway, just to see if the patch addresses it. I don't
expect we'll revert our timer-to-job changes, but it will be somewhat
reassuring to know that we won't run into it again in the future.

Many thanks for looking into it,

Russ

On 27/01/2021 11:16, Christine Caulfield wrote:
> OK. There are definitely race conditions in the timer handling of
> libqb - at least according to helgrind, which found 2 rather easily,
> so I suspect that is what is happening here.
>
> I'm preparing a patch to fix them at the moment. The patch will go
> into the version 2 stream. Depending on which distribution you're
> using, you might need to ask for a backport of that patch to v1 or,
> if you're using sources, ask me very nicely for a backport to
> upstream ;-)
>
> Chrissie
>
> On 27/01/2021 08:07, Christine Caulfield wrote:
> > Hi,
> >
> > I've never seen anything like that before and, as you say, no real
> > work has been done in that area since 1.0.2, so I doubt it's a
> > fixed bug.
> >
> > Although libqb claims thread-safety, I wouldn't rule out a
> > threading bug with you adding timers from another thread. Looking
> > at the test suite, there is actually very little testing of libqb's
> > thread-safety, so that's where I would start looking.
> >
> > I'll have a play with it today, but if you are able to come up with
> > a reproducer for us, that would be very helpful.
> >
> > Chrissie

-----Original Message-----
From: Miranda, Russel (US)
Sent: Monday, January 25, 2021 12:30 AM
To: 'developers at clusterlabs.org'
Subject: libqb 1.0.2 / loop_timerlist.c:67: make_job_from_tmo: Assertion failed

Hello ClusterLabs Developers,

I'm a developer working on an application that is using libqb (1.0.2)
for its main event loop, and after running for a few hours, I'm
encountering this assertion failure:

loop_timerlist.c:67: make_job_from_tmo: Assertion `t->state == QB_POLL_ENTRY_ACTIVE' failed.

The assertion causes a SIGABRT, and generates a core file. The stack
trace at the point of the failure is:

#0  0x00007ff3c76dd207 in raise () from /lib64/libc.so.6
#1  0x00007ff3c76de8f8 in abort () from /lib64/libc.so.6
#2  0x00007ff3c76d6026 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ff3c76d60d2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ff3ca1c828d in make_job_from_tmo (data=0x5460250) at loop_timerlist.c:67
#5  0x00007ff3ca1c815b in timerlist_expire (timerlist=0x5dffe98) at ../include/tlist.h:210
#6  expire_the_timers (s=0x5dffe80, ms_timeout=<optimized out>) at loop_timerlist.c:78
#7  0x00007ff3ca1c6acc in qb_loop_run (lp=<optimized out>) at loop.c:175
...

Going back up to the point of the assertion (make_job_from_tmo), here
is what the "data" parameter looks like:

(gdb) p *((struct qb_loop_timer*)data)
$1 = {item = {list = {next = 0x5460250, prev = 0x5460250}, source = 0x5dffe80,
    user_data = 0x7ff3c9144f60, type = QB_LOOP_FD},
  dispatch_fn = 0xac8b00 <sessionTimeoutHandler(void*)>, p = QB_LOOP_MED,
  timerlist_handle = 0x0, state = QB_POLL_ENTRY_EMPTY, check = 978999470,
  install_pos = 11}

Indeed, as the assertion indicates, t->state is QB_POLL_ENTRY_EMPTY
when the assertion fails.

The timer that appears to have expired was added a few seconds earlier
at the point when a message was sent (via socket) to an external
entity.
That this timer would expire and call the dispatch_fn (sessionTimeoutHandler(void*)) passing in the value of user_data (0x7ff3c9144f60) at this point in the operation of the program is entirely plausible / expected, and I have validated that user_data points to a valid object of the appropriate type as well - so all of the caller-settable data looks correct to me in the timer entry. While I cannot completely exclude the possibility that something stepped on the t->state value, it seems unlikely that it would have affected only that one object property in the timer entry. It appears to me that the timer has expired, but the timer list entry has the wrong state. I'm confused to how that might occur. I've been searching through the API documentation and online through forums, and I can't find anyone else reporting this particular assertion failure within "make_job_from_tmo", or even suggesting what we might be doing wrong to cause it. Has anyone seen this before? Some additional notes: * There is another thread active that can call qb_loop_timer_add to add a timer to the main event loop (it is not the timer that has expired here), but it could have been adding a timer at the same time this timer expired. My understanding is that (excluding the IPC calls), libqb is thread-safe, so it should be safe to call the qb_loop_timer_add() API from another thread. qb_loop_timer_add() is the only libqb API called from the other thread. * In some cases we are adding timers that expire immediately (that is, with a delay as short as 0ns). That is not the case with this particular timer, it has a delay that is several seconds long. The qb_loop_timer_add() API doesn't indicate a minimum for the delay, so I am not aware of this being an incorrect usage - though we could probably switch to calling the qb_loop_job_add API instead if that is an incorrect use of the API. None of the libqb releases since 1.0.2 appear to have any changes introduced specifically to address a problem like this - though I do see there were some minor updates made to the loop_timerlist.c code to silence warnings. If there is a known problem in this area in 1.0.2, I will try see how hard it is going to be to step up to the newer version. I'm hoping someone has seen this problem before, or can suggest something that we might be doing wrong in the use of the API that could lead to this state. Sincere thanks for any suggestions, Russ From kgaillot at redhat.com Mon Jan 11 18:59:27 2021 From: kgaillot at redhat.com (Ken Gaillot) Date: Mon, 11 Jan 2021 12:59:27 -0600 Subject: [ClusterLabs Developers] Clean-up needed if updating a Pacemaker checkout Message-ID: Hi all, As part of the build changes for the next (2.1.0) Pacemaker release, the top-level directory has been de-cluttered by moving more build artifacts to subdirectories. A side effect is that if you pull the latest master branch in an existing checkout, you'll have to clean up the build files and regenerate them, e.g.: rm -rf aclocal.m4 autom4te.cache config.guess config.log config.status \ config.sub configure depcomp install-sh libltdl libtool ltmain.sh \ missing py-compile test-driver ./autogen.sh ./configure -- Ken Gaillot From kgaillot at redhat.com Tue Jan 19 21:34:13 2021 From: kgaillot at redhat.com (Ken Gaillot) Date: Tue, 19 Jan 2021 15:34:13 -0600 Subject: [ClusterLabs Developers] Any projects using Github large file storage? 