<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: Times New Roman; font-size: 12pt; color: #000000'><font size="3">Honza and Angus,</font><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">Here is the backtrace:</div><div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "># ls -l /var/lib/corosync/</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">total 8</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">-rwx------ 1 root root 8 Nov 2 09:02 ringid_10.xxx.xxx.xxx</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">-rwxr-xr-x 1 root root 8 Nov 1 14:54 ringid_127.0.0.1</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><div># ls -ltr /var/crash</div><div>total 47296</div><div>-rw-r----- 1 root whoopsie 266218 Oct 30 22:24 _usr_sbin_smbd.0.crash</div><div>-rw-r----- 1 root whoopsie 309850 Oct 30 22:25 _usr_libexec_pacemaker_lrmd.0.crash</div><div>-rw-r----- 1 root whoopsie 211640 Oct 30 22:25 _usr_sbin_nmbd.0.crash</div><div>-rw-r----- 1 root whoopsie 221656 Oct 31 22:43 _usr_sbin_corosync.0.crash</div><div>-rw------- 1 root whoopsie 23302144 Nov 1 17:05 core.corosync.0.1351807501.16625</div><div>-rw------- 1 root whoopsie 24375296 Nov 2 12:53 core.corosync.0.1351878781.28065</div></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "># with libqb 0.14.2</div><div><div><div># gdb corosync /var/crash/core.corosync.0.1351807501.16625 </div><div>GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 
7.4-2012.04</div><div>Copyright (C) 2012 Free Software Foundation, Inc.</div><div>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html></div><div>This is free software: you are free to change and redistribute it.</div><div>There is NO WARRANTY, to the extent permitted by law. Type "show copying"</div><div>and "show warranty" for details.</div><div>This GDB was configured as "x86_64-linux-gnu".</div><div>For bug reporting instructions, please see:</div><div><http://bugs.launchpad.net/gdb-linaro/>...</div><div>Reading symbols from /usr/sbin/corosync...(no debugging symbols found)...done.</div><div>[New LWP 16625]</div><div>[New LWP 16626]</div><div><br></div><div>warning: Can't read pathname for load map: Input/output error.</div><div>[Thread debugging using libthread_db enabled]</div><div>Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".</div><div>Core was generated by `corosync -f'.</div><div>Program terminated with signal 7, Bus error.</div><div>#0 0x00007f36f6a09d43 in qb_rb_chunks_used () from /usr/lib/libqb.so.0</div><div>(gdb) thread apply all bt</div><div><br></div><div>Thread 2 (Thread 0x7f36f44a9700 (LWP 16626)):</div><div>#0 0x00007f36f67f0fd0 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0</div><div>#1 0x00007f36f6a12ff3 in qb_log_init () from /usr/lib/libqb.so.0</div><div>#2 0x0000000000000000 in ?? ()</div><div><br></div><div>Thread 1 (Thread 0x7f36f729b700 (LWP 16625)):</div><div>#0 0x00007f36f6a09d43 in qb_rb_chunks_used () from /usr/lib/libqb.so.0</div><div>#1 0x00007f36f44ac463 in ?? ()</div><div>#2 0x00007f36f6c249f0 in ?? () from /usr/lib/libqb.so.0</div><div>#3 0x000000000000002f in ?? ()</div><div>#4 0x00007f36f9226e90 in ?? ()</div><div>#5 0x00007f36f91b0600 in ?? ()</div><div>#6 0x00007f36f6a135b9 in qb_log_thread_stop () from /usr/lib/libqb.so.0</div><div>#7 0x0000000000000002 in ?? ()</div><div>#8 0x00007f36f6c249f0 in ?? 
() from /usr/lib/libqb.so.0</div><div>#9 0x00007f36f9226e90 in ?? ()</div><div>#10 0x00007f36f6c20920 in ?? () from /usr/lib/libqb.so.0</div><div>#11 0x00007fff35232ee8 in ?? ()</div><div>#12 0x0000000000000000 in ?? ()</div></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "># with libqb 0.14.3</div><div><div># gdb corosync /var/crash/core.corosync.0.1351878781.28065 </div><div>GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04</div><div>Copyright (C) 2012 Free Software Foundation, Inc.</div><div>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html></div><div>This is free software: you are free to change and redistribute it.</div><div>There is NO WARRANTY, to the extent permitted by law. Type "show copying"</div><div>and "show warranty" for details.</div><div>This GDB was configured as "x86_64-linux-gnu".</div><div>For bug reporting instructions, please see:</div><div><http://bugs.launchpad.net/gdb-linaro/>...</div><div>Reading symbols from /usr/sbin/corosync...(no debugging symbols found)...done.</div><div>[New LWP 28065]</div><div>[New LWP 28066]</div><div><br></div><div>warning: Can't read pathname for load map: Input/output error.</div><div>[Thread debugging using libthread_db enabled]</div><div>Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".</div><div>Core was generated by `corosync -f'.</div><div>Program terminated with signal 7, Bus error.</div><div>#0 0x00007f5174840bbd in qb_rb_space_free () from /usr/lib/libqb.so.0</div><div>(gdb) thread apply all bt</div><div><br></div><div>Thread 2 (Thread 0x7f51722e0700 (LWP 28066)):</div><div>#0 0x00007f5174627fd0 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0</div><div>#1 0x00007f517484a0f3 in ?? 
() from /usr/lib/libqb.so.0</div><div>#2 0x00007f5174621e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0</div><div>#3 0x00007f517434f4bd in clone () from /lib/x86_64-linux-gnu/libc.so.6</div><div>#4 0x0000000000000000 in ?? ()</div><div><br></div><div>Thread 1 (Thread 0x7f51750d2700 (LWP 28065)):</div><div>#0 0x00007f5174840bbd in qb_rb_space_free () from /usr/lib/libqb.so.0</div><div>#1 0x00007f5174840d90 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0</div><div>#2 0x00007f517484a6b9 in ?? () from /usr/lib/libqb.so.0</div><div>#3 0x00007f51748487ba in qb_log_real_va_ () from /usr/lib/libqb.so.0</div><div>#4 0x00007f51751016f0 in ?? ()</div><div>#5 0x00007f5174cacdf6 in ?? () from /usr/lib/libtotem_pg.so.5</div><div>#6 0x00007f5174ca6a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5</div><div>#7 0x00007f5174ca18e2 in ?? () from /usr/lib/libtotem_pg.so.5</div><div>#8 0x00007f517484246f in ?? () from /usr/lib/libqb.so.0</div><div>#9 0x00007f5174841fe7 in qb_loop_run () from /usr/lib/libqb.so.0</div><div>#10 0x00007f51750f0935 in main ()</div></div><div><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">I see qb_rb_space_free is defined as "The amount of free space in the ring buffer".</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">Thanks,</div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><br></div><div style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; ">Andrew</div><br><hr id="zwchr" style="color: rgb(0, 0, 0); font-family: 'Times New Roman'; font-size: 12pt; "><div style="color: rgb(0, 0, 0); font-family: Helvetica, Arial, sans-serif; font-size: 12pt; font-weight: normal; font-style: normal; text-decoration: 
none; "><b>From: </b><span>"Jan Friesse" <<a class="smarterwiki-linkify" href="mailto:jfriesse@redhat.com" title="[GMCP] Compose a new mail to jfriesse@redhat.com" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=jfriesse@redhat.com','Compose new message','width=640,height=480');return false" rel="noreferrer">jfriesse@redhat.com</a>></span><br><b>To: </b><span><a class="smarterwiki-linkify" href="mailto:pacemaker@oss.clusterlabs.org," title="[GMCP] Compose a new mail to pacemaker@oss.clusterlabs.org," onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=pacemaker@oss.clusterlabs.org,','Compose new message','width=640,height=480');return false" rel="noreferrer">pacemaker@oss.clusterlabs.org,</a> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br><b>Sent: </b>Monday, November 5, 2012 2:21:09 AM<br><b>Subject: </b>Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster<br><br>Angus Salkeld wrote:<br>> On 02/11/12 13:07 -0500, Andrew Martin wrote:<br>>> Hi Angus,<br>>><br>>><br>>> Corosync died again while using libqb 0.14.3. 
Here is the coredump<br>>> from today:<br><span>>> <a class="smarterwiki-linkify" href="http://sources.xes-inc.com/downloads/corosync.nov2.coredump">http://sources.xes-inc.com/downloads/corosync.nov2.coredump</a></span><br>>><br>>><br>>><br>>> # corosync -f<br>>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to<br>>> provide service.<br>>> info [MAIN ] Corosync built-in features: pie relro bindnow<br>>> Bus error (core dumped)<br>>><br>>><br><span>>> Here's the log: <a class="smarterwiki-linkify" href="http://pastebin.com/bUfiB3T3">http://pastebin.com/bUfiB3T3</a></span><br>>><br>>><br>>> Did your analysis of the core dump reveal anything?<br>>><br>> <br>> I can't get any symbols out of these coredumps. Can you try to get a<br>> backtrace?<br>> <br><br>Andrew,<br>as I wrote in the original mail, a backtrace can be obtained as follows:<br><br>Coredumps are stored in /var/lib/corosync as core.PID, and the<br>way to obtain a backtrace is to run gdb corosync /var/lib/corosync/core.PID and<br>then enter thread apply all bt. If you are running a distribution with ABRT<br>support, you can also use ABRT to generate a report.<br><br>It's also pretty weird that you are getting SIGBUS. SIGBUS is<br>usually the result of accessing unaligned memory on processors that do not<br>support such access (for example SPARC). This doesn't seem to be your<br>case (because of AMD64).<br><br><br>>><br>>> Is there a way for me to make it generate fdata with a bus error, or<br>>> how else can I gather additional information to help debug this?<br>>><br>> <br>> if you look in exec/main.c and look for SIGSEGV you will see how the<br>> mechanism<br>> for fdata works. Just add a handler for SIGBUS and hook it up. 
Then you<br>> should<br>> be able to get the fdata for both.<br>> <br>> I'd rather be able to get a backtrace if possible.<br>> <br><br>Also if possible, please try to compile with --enable-debug (both libqb<br>and corosync) to get as much information as possible.<br><br>> -Angus<br>> <br><br>Regards,<br> Honza<br><br>>><br>>> Thanks,<br>>><br>>><br>>> Andrew<br>>><br>>> ----- Original Message -----<br>>><br><span>>> From: "Angus Salkeld" <<a class="smarterwiki-linkify" href="mailto:asalkeld@redhat.com" title="[GMCP] Compose a new mail to asalkeld@redhat.com" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=asalkeld@redhat.com','Compose new message','width=640,height=480');return false" rel="noreferrer">asalkeld@redhat.com</a>></span><br><span>>> To: <a class="smarterwiki-linkify" href="mailto:pacemaker@oss.clusterlabs.org," title="[GMCP] Compose a new mail to pacemaker@oss.clusterlabs.org," onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=pacemaker@oss.clusterlabs.org,','Compose new message','width=640,height=480');return false" rel="noreferrer">pacemaker@oss.clusterlabs.org,</a> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br>>> Sent: Thursday, November 1, 2012 5:47:16 PM<br>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes<br>>> in cluster<br>>><br>>> On 01/11/12 17:27 -0500, Andrew Martin wrote:<br>>>> Hi Angus,<br>>>><br>>>><br>>>> I'll try upgrading to the latest libqb tomorrow and see if I can<br>>>> reproduce this behavior with it. 
I was able to get a coredump by<br>>>> running corosync manually in the foreground (corosync -f):<br><span>>>> <a class="smarterwiki-linkify" href="http://sources.xes-inc.com/downloads/corosync.coredump">http://sources.xes-inc.com/downloads/corosync.coredump</a></span><br>>><br>>> Thanks, looking...<br>>><br>>>><br>>>><br>>>> There still isn't anything added to /var/lib/corosync however. What<br>>>> do I need to do to enable the fdata file to be created?<br>>><br>>> Well if it crashes with SIGSEGV it will generate it automatically.<br>>> (I see you are getting a bus error) - :(.<br>>><br>>> -A<br>>><br>>>><br>>>><br>>>> Thanks,<br>>>><br>>>> Andrew<br>>>><br>>>> ----- Original Message -----<br>>>><br><span>>>> From: "Angus Salkeld" <<a class="smarterwiki-linkify" href="mailto:asalkeld@redhat.com" title="[GMCP] Compose a new mail to asalkeld@redhat.com" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=asalkeld@redhat.com','Compose new message','width=640,height=480');return false" rel="noreferrer">asalkeld@redhat.com</a>></span><br><span>>>> To: <a class="smarterwiki-linkify" href="mailto:pacemaker@oss.clusterlabs.org," title="[GMCP] Compose a new mail to pacemaker@oss.clusterlabs.org," onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=pacemaker@oss.clusterlabs.org,','Compose new message','width=640,height=480');return false" rel="noreferrer">pacemaker@oss.clusterlabs.org,</a> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br>>>> Sent: Thursday, November 1, 2012 5:11:23 PM<br>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes<br>>>> in cluster<br>>>><br>>>> On 01/11/12 14:32 -0500, Andrew Martin wrote:<br>>>>> 
Hi Honza,<br>>>>><br>>>>><br>>>>> Thanks for the help. I enabled core dumps in<br>>>>> /etc/security/limits.conf but didn't have a chance to reboot and<br>>>>> apply the changes so I don't have a core dump this time. Do core<br>>>>> dumps need to be enabled for the fdata-DATETIME-PID file to be<br>>>>> generated? Right now all that is in /var/lib/corosync are the<br>>>>> ringid_XXX files. Do I need to set something explicitly in the<br>>>>> corosync config to enable this logging?<br>>>>><br>>>>><br>>>>> I did find something else interesting with libqb this time. I<br>>>>> compiled libqb 0.14.2 for use with the cluster. This time when<br>>>>> corosync died I noticed the following in dmesg:<br>>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap<br>>>>> divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in<br>>>>> libqb.so.0.14.2[7f657a525000+1f000]<br>>>>> This error appeared for only one of the many times corosync<br>>>>> has died.<br>>>>><br>>>>><br>>>>> I see that there is a newer version of libqb (0.14.3) out, but<br>>>>> didn't see a fix for this particular bug. Could this libqb problem<br>>>>> be related to corosync hanging up? Here's the corresponding<br>>>>> corosync log file (next time I should have a core dump as well):<br><span>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/5FLKg7We">http://pastebin.com/5FLKg7We</a></span><br>>>><br>>>> Hi Andrew<br>>>><br>>>> I can't see much wrong with the log either. 
If you could run with the<br>>>> latest<br>>>> (libqb-0.14.3) and post a backtrace if it still happens, that would<br>>>> be great.<br>>>><br>>>> Thanks<br>>>> Angus<br>>>><br>>>>><br>>>>><br>>>>> Thanks,<br>>>>><br>>>>><br>>>>> Andrew<br>>>>><br>>>>> ----- Original Message -----<br>>>>><br><span>>>>> From: "Jan Friesse" <<a class="smarterwiki-linkify" href="mailto:jfriesse@redhat.com" title="[GMCP] Compose a new mail to jfriesse@redhat.com" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=jfriesse@redhat.com','Compose new message','width=640,height=480');return false" rel="noreferrer">jfriesse@redhat.com</a>></span><br><span>>>>> To: "Andrew Martin" <<a class="smarterwiki-linkify" href="mailto:amartin@xes-inc.com" title="[GMCP] Compose a new mail to amartin@xes-inc.com" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=amartin@xes-inc.com','Compose new message','width=640,height=480');return false" rel="noreferrer">amartin@xes-inc.com</a>></span><br><span>>>>> Cc: <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org," title="[GMCP] Compose a new mail to discuss@corosync.org," onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org,','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org,</a> "The Pacemaker cluster resource manager"</span><br><span>>>>> <<a class="smarterwiki-linkify" href="mailto:pacemaker@oss.clusterlabs.org" title="[GMCP] Compose a new mail to pacemaker@oss.clusterlabs.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=pacemaker@oss.clusterlabs.org','Compose new message','width=640,height=480');return false" rel="noreferrer">pacemaker@oss.clusterlabs.org</a>></span><br>>>>> Sent: Thursday, November 1, 2012 7:55:52 AM<br>>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster<br>>>>><br>>>>> Andrew,<br>>>>> I was not able to find anything 
interesting (from the corosync point of<br>>>>> view) in the configuration/logs (corosync related).<br>>>>><br>>>>> What would be helpful:<br>>>>> - If corosync died, there should be<br>>>>> /var/lib/corosync/fdata-DATETIME-PID files from the dead corosync. Can you please<br>>>>> xz them and store them somewhere (they are quite large but<br>>>>> compress well)?<br>>>>> - If you are able to reproduce the problem (which it seems you are), can<br>>>>> you please enable generation of coredumps and store a backtrace<br>>>>> of the coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID,<br>>>>> and the<br>>>>> way to obtain a backtrace is to run gdb corosync /var/lib/corosync/core.PID and<br>>>>> then enter thread apply all bt.) If you are running a distribution with ABRT<br>>>>> support, you can also use ABRT to generate a report.<br>>>>><br>>>>> Regards,<br>>>>> Honza<br>>>>><br>>>>> Andrew Martin wrote:<br>>>>>> Corosync died an additional 3 times during the night on storage1. I<br>>>>>> wrote a daemon to attempt to start it as soon as it fails, so only<br>>>>>> one of those times resulted in a STONITH of storage1.<br>>>>>><br>>>>>> I enabled debug in the corosync config, so I was able to capture a<br>>>>>> period when corosync died with debug output:<br><span>>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/eAmJSmsQ">http://pastebin.com/eAmJSmsQ</a></span><br>>>>>> In this example, Pacemaker finishes shutting down by Nov 01<br>>>>>> 05:53:02. For reference, here is my Pacemaker configuration:<br><span>>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/DFL3hNvz">http://pastebin.com/DFL3hNvz</a></span><br>>>>>><br>>>>>> It seems that an extra node, 16777343 "localhost", has been added to<br>>>>>> the cluster after storage1 was STONITHed (must be the localhost<br>>>>>> interface on storage1). 
Is there any way to prevent this?<br>>>>>><br>>>>>> Does this help to determine why corosync is dying, and what I can<br>>>>>> do to fix it?<br>>>>>><br>>>>>> Thanks,<br>>>>>><br>>>>>> Andrew<br>>>>>><br>>>>>> ----- Original Message -----<br>>>>>><br><span>>>>>> From: "Andrew Martin" <<a class="smarterwiki-linkify" href="mailto:amartin@xes-inc.com" title="[GMCP] Compose a new mail to amartin@xes-inc.com" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=amartin@xes-inc.com','Compose new message','width=640,height=480');return false" rel="noreferrer">amartin@xes-inc.com</a>></span><br><span>>>>>> To: <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br>>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM<br>>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster<br>>>>>><br>>>>>><br>>>>>> Hello,<br>>>>>><br>>>>>> I recently configured a 3-node fileserver cluster by building<br>>>>>> Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes<br>>>>>> are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and<br>>>>>> storage1) are "real" nodes where the resources run (a DRBD disk,<br>>>>>> filesystem mount, and samba/nfs daemons), while the third node<br>>>>>> (storagequorum) is in standby mode and acts as a quorum node for<br>>>>>> the cluster. Today I discovered that corosync died on both storage0<br>>>>>> and storage1 at the same time. Since corosync died, pacemaker shut<br>>>>>> down as well on both nodes. Because the cluster no longer had<br>>>>>> quorum (and the no-quorum-policy="freeze"), storagequorum was<br>>>>>> unable to STONITH either node and just left the resources frozen<br>>>>>> where they were running, on storage0. 
I cannot find any log<br>>>>>> information to determine why corosync crashed, and this is a<br>>>>>> disturbing problem as the cluster and its messaging layer must be<br>>>>>> stable. Below is my corosync configuration file as well as the<br>>>>>> corosync log file from each node during<br>>>>>> this period.<br>>>>>><br>>>>>> corosync.conf:<br><span>>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/vWQDVmg8">http://pastebin.com/vWQDVmg8</a></span><br>>>>>> Note that I have two redundant rings. On one of them, I specify the<br>>>>>> IP address (in this example 10.10.10.7) so that it binds to the<br>>>>>> correct interface (since potentially in the future those machines<br>>>>>> may have two interfaces on the same subnet).<br>>>>>><br>>>>>> corosync.log from storage0:<br><span>>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/HK8KYDDQ">http://pastebin.com/HK8KYDDQ</a></span><br>>>>>><br>>>>>> corosync.log from storage1:<br><span>>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/sDWkcPUz">http://pastebin.com/sDWkcPUz</a></span><br>>>>>><br>>>>>> corosync.log from storagequorum (the DC during this period):<br><span>>>>>> <a class="smarterwiki-linkify" href="http://pastebin.com/uENQ5fnf">http://pastebin.com/uENQ5fnf</a></span><br>>>>>><br>>>>>> Issuing service corosync start && service pacemaker start on<br>>>>>> storage0 and storage1 resolved the problem and allowed the nodes to<br>>>>>> successfully reconnect to the cluster. 
What other information can I<br>>>>>> provide to help diagnose this problem and prevent it from recurring?<br>>>>>><br>>>>>> Thanks,<br>>>>>><br>>>>>> Andrew Martin<br>>>>>><br>>>>>> _______________________________________________<br>>>>>> discuss mailing list<br><span>>>>>> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br><span>>>>>> <a class="smarterwiki-linkify" href="http://lists.corosync.org/mailman/listinfo/discuss">http://lists.corosync.org/mailman/listinfo/discuss</a></span><br>>>>>><br>>>>><br>>>>><br>>>><br>>>>> _______________________________________________<br><span>>>>> Pacemaker mailing list: <a class="smarterwiki-linkify" href="mailto:Pacemaker@oss.clusterlabs.org" title="[GMCP] Compose a new mail to Pacemaker@oss.clusterlabs.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=Pacemaker@oss.clusterlabs.org','Compose new message','width=640,height=480');return false" rel="noreferrer">Pacemaker@oss.clusterlabs.org</a></span><br><span>>>>> <a class="smarterwiki-linkify" 
href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a></span><br>>>>><br><span>>>>> Project Home: <a class="smarterwiki-linkify" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a></span><br>>>>> Getting started:<br><span>>>>> <a class="smarterwiki-linkify" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a></span><br><span>>>>> Bugs: <a class="smarterwiki-linkify" href="http://bugs.clusterlabs.org">http://bugs.clusterlabs.org</a></span><br>>>><br>>>><br>>>> _______________________________________________<br>>>> discuss mailing list<br><span>>>> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br><span>>>> <a class="smarterwiki-linkify" href="http://lists.corosync.org/mailman/listinfo/discuss">http://lists.corosync.org/mailman/listinfo/discuss</a></span><br>>>><br>>><br>>> _______________________________________________<br>>> discuss mailing list<br><span>>> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br><span>>> <a class="smarterwiki-linkify" href="http://lists.corosync.org/mailman/listinfo/discuss">http://lists.corosync.org/mailman/listinfo/discuss</a></span><br>>><br>> <br>> _______________________________________________<br>> discuss mailing list<br><span>> <a class="smarterwiki-linkify" href="mailto:discuss@corosync.org" title="[GMCP] Compose a new mail 
to discuss@corosync.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=discuss@corosync.org','Compose new message','width=640,height=480');return false" rel="noreferrer">discuss@corosync.org</a></span><br><span>> <a class="smarterwiki-linkify" href="http://lists.corosync.org/mailman/listinfo/discuss">http://lists.corosync.org/mailman/listinfo/discuss</a></span><br><br><br>_______________________________________________<br><span>Pacemaker mailing list: <a class="smarterwiki-linkify" href="mailto:Pacemaker@oss.clusterlabs.org" title="[GMCP] Compose a new mail to Pacemaker@oss.clusterlabs.org" onclick="window.open('https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=Pacemaker@oss.clusterlabs.org','Compose new message','width=640,height=480');return false" rel="noreferrer">Pacemaker@oss.clusterlabs.org</a></span><br><span><a class="smarterwiki-linkify" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a></span><br><br><span>Project Home: <a class="smarterwiki-linkify" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a></span><br><span>Getting started: <a class="smarterwiki-linkify" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a></span><br><span>Bugs: <a class="smarterwiki-linkify" href="http://bugs.clusterlabs.org">http://bugs.clusterlabs.org</a></span><br></div><br></div></div></body></html>