[Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

Andrew Martin amartin at xes-inc.com
Thu Nov 8 09:15:51 EST 2012


Honza and Angus, 


Glad to hear about this possible breakthrough! Here's the output of df: 

root at storage1:~# df 
Filesystem 1K-blocks Used Available Use% Mounted on 
/dev/mapper/vg00-lv_root 228424996 3376236 213445408 2% / 
udev 3041428 4 3041424 1% /dev 
tmpfs 1220808 340 1220468 1% /run 
none 5120 8 5112 1% /run/lock 
none 3052016 160652 2891364 6% /run/shm 
/dev/sda1 112039 88040 18214 83% /boot 
root at storage1:~# ls -la /dev/shm 
lrwxrwxrwx 1 root root 8 Nov 6 08:11 /dev/shm -> /run/shm 



root at storage0:~# df 
Filesystem 1K-blocks Used Available Use% Mounted on 
/dev/mapper/vg00-lv_root 228424996 140301080 76520564 65% / 
udev 3041264 4 3041260 1% /dev 
tmpfs 1220808 356 1220452 1% /run 
none 5120 4 5116 1% /run/lock 
none 3052012 37868 3014144 2% /run/shm 
/dev/sda1 112039 88973 17281 84% /boot 

root at storage0:~# ls -la /dev/shm 
lrwxrwxrwx 1 root root 8 Nov 7 21:07 /dev/shm -> /run/shm 



root at storagequorum:~# df 
Filesystem 1K-blocks Used Available Use% Mounted on 
/dev/sda1 77012644 4014620 69140924 6% / 
udev 467564 4 467560 1% /dev 
tmpfs 190548 384 190164 1% /run 
none 5120 0 5120 0% /run/lock 
none 476368 53260 423108 12% /run/shm 
root at storagequorum:~# ls -la /dev/shm 
lrwxrwxrwx 1 root root 8 Sep 12 12:42 /dev/shm -> /run/shm 


It isn't full now, but corosync has been dead on storage1 for several hours. I am running it in the foreground again this morning to try to reproduce a higher used value for /run/shm. 


I will also compile corosync from git to evaluate the IPC possibility. 


Thanks, 


Andrew 
----- Original Message -----

From: "Jan Friesse" <jfriesse at redhat.com> 
To: "Andrew Martin" <amartin at xes-inc.com> 
Cc: "Angus Salkeld" <asalkeld at redhat.com>, discuss at corosync.org, pacemaker at oss.clusterlabs.org 
Sent: Thursday, November 8, 2012 7:39:45 AM 
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster 

Andrew, 
good news. I believe I've found a reproducer for the problem you are 
facing. Now, to be sure it's really the same, can you please run: 
df (interesting is /dev/shm) 
and send the output of ls -la /dev/shm? 

I believe /dev/shm is full. 

Now, as a quick workaround, just delete all qb-* files from /dev/shm and 
the cluster should work. There are basically two problems: 
- ipc_shm is leaking memory 
- if there is no memory left, libqb mmaps unallocated memory and receives SIGBUS 

Angus is working on both issues. 
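
As an illustration, the cleanup step can be sketched in shell against a stand-in directory. The qb-* file names below are made up; on a real node the target is /dev/shm itself, and corosync (and anything else using libqb IPC) must be stopped first.

```shell
# Stand-in for /dev/shm so this sketch is safe to run anywhere.
shm=$(mktemp -d)
touch "$shm/qb-request-cpg-1234" "$shm/qb-response-cpg-1234"  # fake leaked buffers
df -k "$shm" | tail -1      # check how full the filesystem is
rm -f "$shm"/qb-*           # the actual workaround: remove stale libqb files
ls -A "$shm" | wc -l        # prints 0: nothing left behind
```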

Regards, 
Honza 

Jan Friesse wrote: 
> Andrew, 
> thanks for the valgrind report (even though it didn't show anything useful) 
> and the blackbox. 
> 
> We believe the problem is caused by access to invalid memory mapped by an 
> mmap operation. There are basically three places where we do mmap. 
> 1.) corosync cpg_zcb functions (I don't believe this is the case) 
> 2.) LibQB IPC 
> 3.) LibQB blackbox 
> 
> Now, because neither Angus nor I are able to reproduce the bug, can you 
> please: 
> - apply the patches "Check successful initialization of IPC" and "Add 
> support for selecting IPC type" (later versions), or use corosync from 
> git (either the needle or master branch; they are the same) 
> - compile corosync 
> - Add 
> 
> qb { 
> ipc_type: socket 
> } 
> 
> to corosync.conf 
> - Try running corosync 
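> 
> As a sketch, the config step can be tried on a scratch copy first (on a real node you would edit /etc/corosync/corosync.conf directly):

```shell
conf=$(mktemp)                      # scratch file standing in for corosync.conf
cat >> "$conf" <<'EOF'
qb {
    ipc_type: socket
}
EOF
grep -c 'ipc_type: socket' "$conf"  # prints 1: the section is in place
```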
> 
> This may or may not solve the problem, but it should help us diagnose 
> whether or not the problem is an IPC one. 
> 
> Thanks, 
> Honza 
> 
> Andrew Martin wrote: 
>> Angus and Honza, 
>> 
>> 
>> I recompiled corosync with --enable-debug. Below is a capture of the valgrind output when corosync dies, after switching rrp_mode to passive: 
>> 
>> # valgrind corosync -f 
>> ==5453== Memcheck, a memory error detector 
>> ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al. 
>> ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info 
>> ==5453== Command: corosync -f 
>> ==5453== 
>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service. 
>> info [MAIN ] Corosync built-in features: debug pie relro bindnow 
>> ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised byte(s) 
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
>> ==5453== by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3BFC8: totemudp_token_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38CF0: totemnet_token_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E40FB5: totemrrp_token_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3C1A4: totemudp_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38EBC: totemnet_token_target_set (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== Address 0x7feff7f58 is on thread 1's stack 
>> ==5453== 
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s) 
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
>> ==5453== by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== Address 0x7feffb9da is on thread 1's stack 
>> ==5453== 
>> ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s) 
>> ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
>> ==5453== by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
>> ==5453== Address 0x7feffb9da is on thread 1's stack 
>> ==5453== 
>> Ringbuffer: 
>> ->OVERWRITE 
>> ->write_pt [0] 
>> ->read_pt [0] 
>> ->size [2097152 words] 
>> =>free [8388608 bytes] 
>> =>used [0 bytes] 
>> ==5453== 
>> ==5453== HEAP SUMMARY: 
>> ==5453== in use at exit: 13,175,149 bytes in 1,648 blocks 
>> ==5453== total heap usage: 70,091 allocs, 68,443 frees, 67,724,863 bytes allocated 
>> ==5453== 
>> ==5453== LEAK SUMMARY: 
>> ==5453== definitely lost: 0 bytes in 0 blocks 
>> ==5453== indirectly lost: 0 bytes in 0 blocks 
>> ==5453== possibly lost: 2,100,062 bytes in 35 blocks 
>> ==5453== still reachable: 11,075,087 bytes in 1,613 blocks 
>> ==5453== suppressed: 0 bytes in 0 blocks 
>> ==5453== Rerun with --leak-check=full to see details of leaked memory 
>> ==5453== 
>> ==5453== For counts of detected and suppressed errors, rerun with: -v 
>> ==5453== Use --track-origins=yes to see where uninitialised values come from 
>> ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2) 
>> Bus error (core dumped) 
>> 
>> 
>> I was also able to capture non-truncated fdata: 
>> http://sources.xes-inc.com/downloads/fdata-20121107 
>> 
>> 
>> Here is the coredump: 
>> http://sources.xes-inc.com/downloads/vgcore.5453 
>> 
>> 
>> I was not able to get corosync to crash without pacemaker also running, though I was not able to test for a long period of time. 
>> 
>> 
>> Another thing I discovered tonight was that the 127.0.1.1 entry in /etc/hosts (on both storage0 and storage1) was the source of the extra "localhost" entry in the cluster. I have removed this extraneous node, so now only the 3 real nodes remain, and commented out that line in /etc/hosts on all nodes in the cluster. 
>> http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html 
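>> 
>> The hosts-file change can be sketched on a scratch copy (the hostname follows the thread; on a real node you would edit /etc/hosts itself):

```shell
hosts=$(mktemp)                         # scratch stand-in for /etc/hosts
printf '127.0.0.1 localhost\n127.0.1.1 storage1\n' > "$hosts"
sed -i 's/^127\.0\.1\.1/# &/' "$hosts"  # comment the line out rather than delete it
grep '^#' "$hosts"                      # prints: # 127.0.1.1 storage1
```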
>> 
>> 
>> Thanks, 
>> 
>> 
>> Andrew 
>> ----- Original Message ----- 
>> 
>> From: "Jan Friesse" <jfriesse at redhat.com> 
>> To: "Andrew Martin" <amartin at xes-inc.com> 
>> Cc: "Angus Salkeld" <asalkeld at redhat.com>, discuss at corosync.org, pacemaker at oss.clusterlabs.org 
>> Sent: Wednesday, November 7, 2012 2:00:20 AM 
>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster 
>> 
>> Andrew, 
>> 
>> Andrew Martin wrote: 
>>> A bit more data on this problem: I was doing some maintenance and had to briefly disconnect storagequorum's connection to the STONITH network (ethernet cable #7 in this diagram): 
>>> http://sources.xes-inc.com/downloads/storagecluster.png 
>>> 
>>> 
>>> Since corosync has two rings (and is in active mode), this should cause no disruption to the cluster. However, as soon as I disconnected cable #7, corosync on storage0 died (corosync was already stopped on storage1), which caused pacemaker on storage0 to also shut down. I was not able to obtain a coredump this time as apport is still running on storage0. 
>> 
>> I strongly believe the corosync fault is due to the original problem you 
>> have. I would also recommend trying passive mode. Passive mode is 
>> better because if one link fails, passive mode makes progress (delivers 
>> messages) where active mode doesn't (up to the moment the ring is marked 
>> as failed; after that, passive and active behave the same). Passive mode 
>> is also much better tested. 
>> 
>>> 
>>> 
>>> What else can I do to debug this problem? Or, should I just try to downgrade to corosync 1.4.2 (the version available in the Ubuntu repositories)? 
>> 
>> I would really like to find the main issue (which looks like a libqb one 
>> rather than a corosync one). But if you decide to downgrade, please downgrade 
>> to the latest 1.4.x release (1.4.4 for now); 1.4.2 has A LOT of known bugs. 
>> 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> 
>>> Andrew 
>> 
>> Regards, 
>> Honza 
>> 
>>> 
>>> ----- Original Message ----- 
>>> 
>>> From: "Andrew Martin" <amartin at xes-inc.com> 
>>> To: "Angus Salkeld" <asalkeld at redhat.com> 
>>> Cc: discuss at corosync.org, pacemaker at oss.clusterlabs.org 
>>> Sent: Tuesday, November 6, 2012 2:01:17 PM 
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster 
>>> 
>>> 
>>> Hi Angus, 
>>> 
>>> 
>>> I recompiled corosync with the changes you suggested in exec/main.c to generate fdata when SIGBUS is triggered. Here's the corresponding coredump and fdata files: 
>>> http://sources.xes-inc.com/downloads/core.13027 
>>> http://sources.xes-inc.com/downloads/fdata.20121106 
>>> 
>>> 
>>> 
>>> (gdb) thread apply all bt 
>>> 
>>> 
>>> Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)): 
>>> #0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0 
>>> #1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0 
>>> #2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0 
>>> #3 0x0000555555571700 in ?? () 
>>> #4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5 
>>> #5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5 
>>> #6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5 
>>> #7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0 
>>> #8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0 
>>> #9 0x0000555555560945 in main () 
>>> 
>>> 
>>> 
>>> 
>>> I've also been doing some hardware tests to rule hardware out as the cause of this problem: mcelog has found no problems, and memtest finds the memory to be healthy as well. 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> 
>>> Andrew 
>>> ----- Original Message ----- 
>>> 
>>> From: "Angus Salkeld" <asalkeld at redhat.com> 
>>> To: pacemaker at oss.clusterlabs.org, discuss at corosync.org 
>>> Sent: Friday, November 2, 2012 8:18:51 PM 
>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster 
>>> 
>>> On 02/11/12 13:07 -0500, Andrew Martin wrote: 
>>>> Hi Angus, 
>>>> 
>>>> 
>>>> Corosync died again while using libqb 0.14.3. Here is the coredump from today: 
>>>> http://sources.xes-inc.com/downloads/corosync.nov2.coredump 
>>>> 
>>>> 
>>>> 
>>>> # corosync -f 
>>>> notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service. 
>>>> info [MAIN ] Corosync built-in features: pie relro bindnow 
>>>> Bus error (core dumped) 
>>>> 
>>>> 
>>>> Here's the log: http://pastebin.com/bUfiB3T3 
>>>> 
>>>> 
>>>> Did your analysis of the core dump reveal anything? 
>>>> 
>>> 
>>> I can't get any symbols out of these coredumps. Can you try to get a backtrace? 
>>> 
>>>> 
>>>> Is there a way for me to make it generate fdata with a bus error, or how else can I gather additional information to help debug this? 
>>>> 
>>> 
>>> If you look in exec/main.c and search for SIGSEGV, you will see how the mechanism 
>>> for fdata works. Just add a handler for SIGBUS and hook it up. Then you should 
>>> be able to get the fdata for both signals. 
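>>> 
>>> The idea can be illustrated in shell (the real change is in C, in corosync's exec/main.c; this sketch just shows that a process survives and reacts to a delivered SIGBUS once a handler is installed):

```shell
flag=$(mktemp)                      # records that the handler ran
trap 'echo handled > "$flag"' BUS   # hook up a SIGBUS handler, as for SIGSEGV
kill -s BUS $$                      # deliver SIGBUS to ourselves
cat "$flag"                         # prints: handled
```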
>>> 
>>> I'd rather be able to get a backtrace if possible. 
>>> 
>>> -Angus 
>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> 
>>>> Andrew 
>>>> 
>>>> ----- Original Message ----- 
>>>> 
>>>> From: "Angus Salkeld" <asalkeld at redhat.com> 
>>>> To: pacemaker at oss.clusterlabs.org, discuss at corosync.org 
>>>> Sent: Thursday, November 1, 2012 5:47:16 PM 
>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster 
>>>> 
>>>> On 01/11/12 17:27 -0500, Andrew Martin wrote: 
>>>>> Hi Angus, 
>>>>> 
>>>>> 
>>>>> I'll try upgrading to the latest libqb tomorrow and see if I can reproduce this behavior with it. I was able to get a coredump by running corosync manually in the foreground (corosync -f): 
>>>>> http://sources.xes-inc.com/downloads/corosync.coredump 
>>>> 
>>>> Thanks, looking... 
>>>> 
>>>>> 
>>>>> 
>>>>> There still isn't anything added to /var/lib/corosync however. What do I need to do to enable the fdata file to be created? 
>>>> 
>>>> Well, if it crashes with SIGSEGV it will generate it automatically. 
>>>> (I see you are getting a bus error instead.) :( 
>>>> 
>>>> -A 
>>>> 
>>>>> 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Andrew 
>>>>> 
>>>>> ----- Original Message ----- 
>>>>> 
>>>>> From: "Angus Salkeld" <asalkeld at redhat.com> 
>>>>> To: pacemaker at oss.clusterlabs.org, discuss at corosync.org 
>>>>> Sent: Thursday, November 1, 2012 5:11:23 PM 
>>>>> Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster 
>>>>> 
>>>>> On 01/11/12 14:32 -0500, Andrew Martin wrote: 
>>>>>> Hi Honza, 
>>>>>> 
>>>>>> 
>>>>>> Thanks for the help. I enabled core dumps in /etc/security/limits.conf but didn't have a chance to reboot and apply the changes, so I don't have a core dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID file to be generated? Right now all that is in /var/lib/corosync are the ringid_XXX files. Do I need to set something explicitly in the corosync config to enable this logging? 
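>>>>>> 
>>>>>> For the coredump part, a sketch of raising the limit per-shell without editing limits.conf or rebooting (processes started from that shell, e.g. corosync -f, inherit it):

```shell
# Raise this shell's soft core-size limit to its hard limit; child
# processes started from this shell inherit it. No reboot needed.
ulimit -S -c "$(ulimit -H -c)"
ulimit -S -c   # show the effective soft limit, e.g. "unlimited"
```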
>>>>>> 
>>>>>> 
>>>>>> I did find something else interesting with libqb this time. I compiled libqb 0.14.2 for use with the cluster. This time when corosync died I noticed the following in dmesg: 
>>>>>> Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000] 
>>>>>> This error was only present for one of the many times corosync has died. 
>>>>>> 
>>>>>> 
>>>>>> I see that there is a newer version of libqb (0.14.3) out, but didn't see a fix for this particular bug. Could this libqb problem be related to corosync hanging up? Here's the corresponding corosync log file (next time I should have a core dump as well): 
>>>>>> http://pastebin.com/5FLKg7We 
>>>>> 
>>>>> Hi Andrew 
>>>>> 
>>>>> I can't see much wrong with the log either. If you could run with the latest 
>>>>> (libqb-0.14.3) and post a backtrace if it still happens, that would be great. 
>>>>> 
>>>>> Thanks 
>>>>> Angus 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> 
>>>>>> Andrew 
>>>>>> 
>>>>>> ----- Original Message ----- 
>>>>>> 
>>>>>> From: "Jan Friesse" <jfriesse at redhat.com> 
>>>>>> To: "Andrew Martin" <amartin at xes-inc.com> 
>>>>>> Cc: discuss at corosync.org, "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
>>>>>> Sent: Thursday, November 1, 2012 7:55:52 AM 
>>>>>> Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster 
>>>>>> 
>>>>>> Andrew, 
>>>>>> I was not able to find anything interesting (from corosync point of 
>>>>>> view) in configuration/logs (corosync related). 
>>>>>> 
>>>>>> What would be helpful: 
>>>>>> - if corosync died, there should be a 
>>>>>> /var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you please 
>>>>>> xz them and store them somewhere? (They are quite large but compress well.) 
>>>>>> - If you are able to reproduce the problem (which it seems you are), can 
>>>>>> you please enable generation of coredumps and post a backtrace of the 
>>>>>> coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID; 
>>>>>> the way to obtain a backtrace is gdb corosync /var/lib/corosync/core.PID, then 
>>>>>> thread apply all bt.) If you are running a distribution with ABRT 
>>>>>> support, you can also use ABRT to generate a report. 
>>>>>> 
>>>>>> Regards, 
>>>>>> Honza 
>>>>>> 
>>>>>> Andrew Martin wrote: 
>>>>>>> Corosync died an additional 3 times during the night on storage1. I wrote a daemon to attempt to restart it as soon as it fails, so only one of those times resulted in a STONITH of storage1. 
>>>>>>> 
>>>>>>> I enabled debug in the corosync config, so I was able to capture a period when corosync died with debug output: 
>>>>>>> http://pastebin.com/eAmJSmsQ 
>>>>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For reference, here is my Pacemaker configuration: 
>>>>>>> http://pastebin.com/DFL3hNvz 
>>>>>>> 
>>>>>>> It seems that an extra node, 16777343 "localhost", was added to the cluster after storage1 was STONITHed (it must be the localhost interface on storage1). Is there any way to prevent this? 
>>>>>>> 
>>>>>>> Does this help to determine why corosync is dying, and what I can do to fix it? 
>>>>>>> 
>>>>>>> Thanks, 
>>>>>>> 
>>>>>>> Andrew 
>>>>>>> 
>>>>>>> ----- Original Message ----- 
>>>>>>> 
>>>>>>> From: "Andrew Martin" <amartin at xes-inc.com> 
>>>>>>> To: discuss at corosync.org 
>>>>>>> Sent: Thursday, November 1, 2012 12:11:35 AM 
>>>>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster 
>>>>>>> 
>>>>>>> 
>>>>>>> Hello, 
>>>>>>> 
>>>>>>> I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the third node (storagequorum) is in standby mode and acts as a quorum node for the cluster. Today I discovered that corosync died on both storage0 and storage1 at the same time. Since corosync died, pacemaker shut down as well on both nodes. Because the cluster no longer had quorum (and the no-quorum-policy="freeze"), storagequorum was unable to STONITH either node and just left the resources frozen where they were running, on storage0. I cannot find any log information to determine why corosync crashed, and this is a disturbing problem as the cluster and its messaging layer must be stable. Below is my corosync configuration file as well as the corosync log file from each node during 
>>>>>>> this period. 
>>>>>>> 
>>>>>>> corosync.conf: 
>>>>>>> http://pastebin.com/vWQDVmg8 
>>>>>>> Note that I have two redundant rings. On one of them, I specify the IP address (in this example 10.10.10.7) so that it binds to the correct interface (since potentially in the future those machines may have two interfaces on the same subnet). 
>>>>>>> 
>>>>>>> corosync.log from storage0: 
>>>>>>> http://pastebin.com/HK8KYDDQ 
>>>>>>> 
>>>>>>> corosync.log from storage1: 
>>>>>>> http://pastebin.com/sDWkcPUz 
>>>>>>> 
>>>>>>> corosync.log from storagequorum (the DC during this period): 
>>>>>>> http://pastebin.com/uENQ5fnf 
>>>>>>> 
>>>>>>> Issuing service corosync start && service pacemaker start on storage0 and storage1 resolved the problem and allowed the nodes to successfully reconnect to the cluster. What other information can I provide to help diagnose this problem and prevent it from recurring? 
>>>>>>> 
>>>>>>> Thanks, 
>>>>>>> 
>>>>>>> Andrew Martin 
>>>>>>> 
>>>>>>> _______________________________________________ 
>>>>>>> discuss mailing list 
>>>>>>> discuss at corosync.org 
>>>>>>> http://lists.corosync.org/mailman/listinfo/discuss 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>>> _______________________________________________ 
>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker 
>>>>>> 
>>>>>> Project Home: http://www.clusterlabs.org 
>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>>>> Bugs: http://bugs.clusterlabs.org 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 



