Hello Francois,<br><br>Can you post your cluster configuration using pastebin?<br><br>Thanks<br><br><div class="gmail_quote">2012/5/1 Francois Gaudreault <span dir="ltr">&lt;<a href="mailto:fgaudreault@inverse.ca" target="_blank">fgaudreault@inverse.ca</a>&gt;</span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
I am not sure this is the right mailing list for our problem; let me know if this is more of a Corosync issue.<br>
<br>
We are seeing odd behavior with Corosync/Pacemaker/DRBD, and I am hoping for some input and help troubleshooting it. Let me explain our setup as best I can.<br>
<br>
We have a two-node cluster using DRBD to back a MySQL database. We have a location constraint that tells Pacemaker that DRBD should be Primary on NODE A whenever possible. When we migrate the services to NODE B and then reboot NODE A, we get a split brain every single time.<br>
<br>
From what we can see in the logs, it appears that the DRBD resource, for some reason, does not wait for an established connection (and the initial sync) before changing its role to Primary. (I apologize for the length of the log.)<br>
<br>
May 1 10:30:34 npf1 kernel: d-con mysql: Starting worker thread (from drbdsetup [1960])<br>
May 1 10:30:34 npf1 kernel: block drbd0: disk( Diskless -> Attaching )<br>
May 1 10:30:34 npf1 kernel: block drbd0: Method to ensure write ordering: barrier<br>
May 1 10:30:34 npf1 kernel: block drbd0: max BIO size = 1048576<br>
May 1 10:30:34 npf1 kernel: block drbd0: drbd_bm_resize called with capacity == 143355552<br>
May 1 10:30:34 npf1 kernel: block drbd0: resync bitmap: bits=17919444 words=279992 pages=547<br>
May 1 10:30:34 npf1 kernel: block drbd0: size = 68 GB (71677776 KB)<br>
May 1 10:30:34 npf1 kernel: block drbd0: bitmap READ of 547 pages took 26 jiffies<br>
May 1 10:30:34 npf1 kernel: block drbd0: recounting of set bits took additional 1 jiffies<br>
May 1 10:30:34 npf1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.<br>
May 1 10:30:34 npf1 kernel: block drbd0: disk( Attaching -> UpToDate )<br>
May 1 10:30:34 npf1 kernel: block drbd0: attached to UUIDs 343EF8A0E434C9D8:0000000000000000:282208A916EB5687:282108A916EB5687<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)<br>
May 1 10:30:34 npf1 kernel: d-con mysql: conn( StandAlone -> Unconnected )<br>
May 1 10:30:34 npf1 kernel: d-con mysql: Starting receiver thread (from drbd_w_mysql [1961])<br>
May 1 10:30:34 npf1 kernel: d-con mysql: receiver (re)started<br>
May 1 10:30:34 npf1 kernel: d-con mysql: conn( Unconnected -> WFConnection )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_start_0 (call=7, rc=0, cib-update=12, confirmed=true) ok<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=53:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:8: notify<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)<br>
May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource op DRBD:0_notify_0 from 53:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-4<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_notify_0 (call=8, rc=0, cib-update=0, confirmed=true) ok<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=58:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:9: notify<br>
May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource op DRBD:0_notify_0 from 58:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-5<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_notify_0 (call=9, rc=0, cib-update=0, confirmed=true) ok<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=20:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_promote_0 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:10: promote<br>
May 1 10:30:34 npf1 kernel: block drbd0: role( Secondary -> Primary )<br>
May 1 10:30:34 npf1 kernel: block drbd0: new current UUID 66767846186A82BB:343EF8A0E434C9D8:282208A916EB5687:282108A916EB5687<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:promote:stdout)<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_promote_0 (call=10, rc=0, cib-update=13, confirmed=true) ok<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=59:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD:0_notify_0 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:11: notify<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)<br>
May 1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource op DRBD:0_notify_0 from 59:84:0:5bc3f587-ac97-491b-b102-2325c4352589: lrm_invoke-lrmd-1335882634-6<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD:0_notify_0 (call=11, rc=0, cib-update=0, confirmed=true) ok<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=7:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD_fs_start_0 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:12: start<br>
May 1 10:30:34 npf1 Filesystem[2089]: INFO: Running start for /dev/drbd0 on /var/lib/mysql<br>
May 1 10:30:34 npf1 kernel: EXT4-fs (drbd0): mounted filesystem with ordered data mode. Opts:<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD_fs_start_0 (call=12, rc=0, cib-update=14, confirmed=true) ok<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=8:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=DRBD_fs_monitor_120000 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:13: monitor<br>
May 1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing key=9:84:0:5bc3f587-ac97-491b-b102-2325c4352589 op=MySQL_start_0 )<br>
May 1 10:30:34 npf1 lrmd: [1803]: info: rsc:MySQL:14: start<br>
May 1 10:30:34 npf1 lrmd: [2176]: WARN: For LSB init script, no additional parameters are needed.<br>
May 1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation DRBD_fs_monitor_120000 (call=13, rc=0, cib-update=15, confirmed=false) o<br>
May 1 10:30:35 npf1 kernel: d-con mysql: Handshake successful: Agreed network protocol version 100<br>
May 1 10:30:35 npf1 kernel: d-con mysql: conn( WFConnection -> WFReportParams )<br>
May 1 10:30:35 npf1 kernel: d-con mysql: Starting asender thread (from drbd_r_mysql [1973])<br>
May 1 10:30:35 npf1 kernel: block drbd0: drbd_sync_handshake:<br>
May 1 10:30:35 npf1 kernel: block drbd0: self 66767846186A82BB:343EF8A0E434C9D8:282208A916EB5687:282108A916EB5687 bits:1 flags:0<br>
May 1 10:30:35 npf1 kernel: block drbd0: peer D06952D9712CA916:343EF8A0E434C9D8:282208A916EB5686:282108A916EB5687 bits:41 flags:0<br>
May 1 10:30:35 npf1 kernel: block drbd0: uuid_compare()=100 by rule 90<br>
May 1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0<br>
May 1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)<br>
May 1 10:30:35 npf1 kernel: block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node<br>
May 1 10:30:35 npf1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )<br>
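<br>
For anyone who wants to check a longer log for the same ordering, here is a minimal Python sketch. It is only an illustration: the function name and the set of "not yet connected" states are assumptions made for this example, not anything provided by DRBD or Pacemaker. It reports whether a promotion to Primary is logged before the connection has moved past WFConnection.<br>
<br>

```python
import re

# Hypothetical helper for illustration only; not part of DRBD or Pacemaker.
ROLE_PRIMARY = re.compile(r"role\( \w+ -> Primary \)")
CONN_CHANGE = re.compile(r"conn\( (\w+) -> (\w+) \)")

# Assumption: these states mean the peer handshake has not happened yet.
NOT_CONNECTED = (None, "StandAlone", "Unconnected", "WFConnection")


def promoted_before_connected(lines):
    """Return True if a promotion to Primary is logged while the most
    recently logged connection state is still one of NOT_CONNECTED."""
    conn_state = None
    for line in lines:
        match = CONN_CHANGE.search(line)
        if match:
            # Remember the state the connection just moved into.
            conn_state = match.group(2)
        if ROLE_PRIMARY.search(line) and conn_state in NOT_CONNECTED:
            return True
    return False


# Three lines taken from the log above, in their original order.
sample = [
    "May 1 10:30:34 npf1 kernel: d-con mysql: conn( Unconnected -> WFConnection )",
    "May 1 10:30:34 npf1 kernel: block drbd0: role( Secondary -> Primary )",
    "May 1 10:30:35 npf1 kernel: d-con mysql: conn( WFConnection -> WFReportParams )",
]
print(promoted_before_connected(sample))
```

<br>
On the three sample lines the promotion comes one second before the handshake, which is the sequence we keep seeing.<br>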
<br>
The only workarounds we have found are to enable automatic split-brain recovery and to set resource-stickiness, but we do NOT want automatic split-brain recovery, and we want the services to move back to NODE A once the cluster is back in a sane state.<br>
<br>
I can provide the configs if needed.<br>
<br>
Thanks for your help!<span class="HOEnZb"><font color="#888888"><br>
<br>
-- <br>
Francois Gaudreault, ing. jr<br>
<a href="mailto:fgaudreault@inverse.ca" target="_blank">fgaudreault@inverse.ca</a> :: <a href="tel:%2B1.514.447.4918" value="+15144474918" target="_blank">+1.514.447.4918</a> (x130) :: <a href="http://www.inverse.ca" target="_blank">www.inverse.ca</a><br>
Inverse inc. :: Leaders behind SOGo (<a href="http://www.sogo.nu" target="_blank">www.sogo.nu</a>) and PacketFence (<a href="http://www.packetfence.org" target="_blank">www.packetfence.org</a>)<br>
<br>
______________________________<u></u>_________________<br>
Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org" target="_blank">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
</font></span></blockquote></div><br><br clear="all"><br>-- <br>this is my life and I will live it for as long as God wills<br>