[Pacemaker] DRBD and Pacemaker/Corosync

emmanuel segura emi2fast at gmail.com
Tue May 1 12:51:42 EDT 2012


Hello Francois

can you post your cluster configuration using pastebin?

Thanks

2012/5/1 Francois Gaudreault <fgaudreault at inverse.ca>

> Hi,
>
> I am not sure if this is the right mailing list to post for our problems,
> let me know if this is more a Corosync issue.
>
> We are facing a weird behavior with Corosync/Pacemaker/DRBD, and I expect
> to have some inputs and help to troubleshoot what is going on. Let me try
> to explain the best I can what we do.
>
> We have a cluster with two node using DRBD running SQL stuff.  We do have
> a location constraint that tells Pacemaker that DRBD should be primary on
> NODE A when able.  When we migrate the services to NODE B, and we reboot
> NODE A, we are facing split brains every single time.
>
> From what we can see in the logs, it appears that the DRBD resource, for
> some reason, is not waiting for getting an established connection (to get
> initial sync) before changing its role to Primary.  (I apologize for the
> length of the log):
>
> May  1 10:30:34 npf1 kernel: d-con mysql: Starting worker thread (from
> drbdsetup [1960])
> May  1 10:30:34 npf1 kernel: block drbd0: disk( Diskless -> Attaching )
> May  1 10:30:34 npf1 kernel: block drbd0: Method to ensure write ordering:
> barrier
> May  1 10:30:34 npf1 kernel: block drbd0: max BIO size = 1048576
> May  1 10:30:34 npf1 kernel: block drbd0: drbd_bm_resize called with
> capacity == 143355552
> May  1 10:30:34 npf1 kernel: block drbd0: resync bitmap: bits=17919444
> words=279992 pages=547
> May  1 10:30:34 npf1 kernel: block drbd0: size = 68 GB (71677776 KB)
> May  1 10:30:34 npf1 kernel: block drbd0: bitmap READ of 547 pages took 26
> jiffies
> May  1 10:30:34 npf1 kernel: block drbd0: recounting of set bits took
> additional 1 jiffies
> May  1 10:30:34 npf1 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync
> by on disk bit-map.
> May  1 10:30:34 npf1 kernel: block drbd0: disk( Attaching -> UpToDate )
> May  1 10:30:34 npf1 kernel: block drbd0: attached to UUIDs
> 343EF8A0E434C9D8:**0000000000000000:**282208A916EB5687:**282108A916EB5687
> May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
> May  1 10:30:34 npf1 kernel: d-con mysql: conn( StandAlone -> Unconnected )
> May  1 10:30:34 npf1 kernel: d-con mysql: Starting receiver thread (from
> drbd_w_mysql [1961])
> May  1 10:30:34 npf1 kernel: d-con mysql: receiver (re)started
> May  1 10:30:34 npf1 kernel: d-con mysql: conn( Unconnected ->
> WFConnection )
> May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
> May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:start:stdout)
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD:0_start_0 (call=7, rc=0, cib-update=12, confirmed=true) ok
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=53:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_notify_0 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:8: notify
> May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)
> May  1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource
> op DRBD:0_notify_0 from 53:84:0:5bc3f587-ac97-491b-**b102-2325c4352589:
> lrm_invoke-lrmd-1335882634-4
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD:0_notify_0 (call=8, rc=0, cib-update=0, confirmed=true) ok
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=58:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_notify_0 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:9: notify
> May  1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource
> op DRBD:0_notify_0 from 58:84:0:5bc3f587-ac97-491b-**b102-2325c4352589:
> lrm_invoke-lrmd-1335882634-5
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD:0_notify_0 (call=9, rc=0, cib-update=0, confirmed=true) ok
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=20:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_promote_0 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:10: promote
> May  1 10:30:34 npf1 kernel: block drbd0: role( Secondary -> Primary )
> May  1 10:30:34 npf1 kernel: block drbd0: new current UUID
> 66767846186A82BB:**343EF8A0E434C9D8:**282208A916EB5687:**282108A916EB5687
> May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:promote:stdout)
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD:0_promote_0 (call=10, rc=0, cib-update=13, confirmed=true) ok
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=59:84:0:5bc3f587-ac97-**491b-b102-2325c4352589 op=DRBD:0_notify_0 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD:0:11: notify
> May  1 10:30:34 npf1 lrmd: [1803]: info: RA output: (DRBD:0:notify:stdout)
> May  1 10:30:34 npf1 crmd: [1806]: info: send_direct_ack: ACK'ing resource
> op DRBD:0_notify_0 from 59:84:0:5bc3f587-ac97-491b-**b102-2325c4352589:
> lrm_invoke-lrmd-1335882634-6
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD:0_notify_0 (call=11, rc=0, cib-update=0, confirmed=true) ok
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=7:84:0:5bc3f587-ac97-491b-**b102-2325c4352589 op=DRBD_fs_start_0 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:12: start
> May  1 10:30:34 npf1 Filesystem[2089]: INFO: Running start for /dev/drbd0
> on /var/lib/mysql
> May  1 10:30:34 npf1 kernel: EXT4-fs (drbd0): mounted filesystem with
> ordered data mode. Opts:
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD_fs_start_0 (call=12, rc=0, cib-update=14, confirmed=true) ok
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=8:84:0:5bc3f587-ac97-491b-**b102-2325c4352589
> op=DRBD_fs_monitor_120000 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:DRBD_fs:13: monitor
> May  1 10:30:34 npf1 crmd: [1806]: info: do_lrm_rsc_op: Performing
> key=9:84:0:5bc3f587-ac97-491b-**b102-2325c4352589 op=MySQL_start_0 )
> May  1 10:30:34 npf1 lrmd: [1803]: info: rsc:MySQL:14: start
> May  1 10:30:34 npf1 lrmd: [2176]: WARN: For LSB init script, no
> additional parameters are needed.
> May  1 10:30:34 npf1 crmd: [1806]: info: process_lrm_event: LRM operation
> DRBD_fs_monitor_120000 (call=13, rc=0, cib-update=15, confirmed=false) o
> May  1 10:30:35 npf1 kernel: d-con mysql: Handshake successful: Agreed
> network protocol version 100
> May  1 10:30:35 npf1 kernel: d-con mysql: conn( WFConnection ->
> WFReportParams )
> May  1 10:30:35 npf1 kernel: d-con mysql: Starting asender thread (from
> drbd_r_mysql [1973])
> May  1 10:30:35 npf1 kernel: block drbd0: drbd_sync_handshake:
> May  1 10:30:35 npf1 kernel: block drbd0: self 66767846186A82BB:**
> 343EF8A0E434C9D8:**282208A916EB5687:**282108A916EB5687 bits:1 flags:0
> May  1 10:30:35 npf1 kernel: block drbd0: peer D06952D9712CA916:**
> 343EF8A0E434C9D8:**282208A916EB5686:**282108A916EB5687 bits:41 flags:0
> May  1 10:30:35 npf1 kernel: block drbd0: uuid_compare()=100 by rule 90
> May  1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm
> initial-split-brain minor-0
> May  1 10:30:35 npf1 kernel: block drbd0: helper command: /sbin/drbdadm
> initial-split-brain minor-0 exit code 0 (0x0)
> May  1 10:30:35 npf1 kernel: block drbd0: Split-Brain detected, 1
> primaries, automatically solved. Sync from this node
> May  1 10:30:35 npf1 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
>
> The only solutions we found is to put automatic SB resolving and
> resource-stickiness, but we do NOT want auto split brain recovery and we
> want the services to go back on NODE A when the cluster is back to a sane
> state.
>
> I can provide the configs if needed.
>
> Thanks for your help!
>
> --
> Francois Gaudreault, ing. jr
> fgaudreault at inverse.ca  ::  +1.514.447.4918 (x130) ::  www.inverse.ca
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (
> www.packetfence.org)
>
> ______________________________**_________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/**mailman/listinfo/pacemaker<http://oss.clusterlabs.org/mailman/listinfo/pacemaker>
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/**doc/Cluster_from_Scratch.pdf<http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> Bugs: http://bugs.clusterlabs.org
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120501/1d3b9b78/attachment-0003.html>


More information about the Pacemaker mailing list