[Pacemaker] drbd under pacemaker - always get split brain

Tue Jul 10 18:06:49 EDT 2012

On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
<nikola.ciprich at linuxbox.cz> wrote:
> Hello Andreas,
>> Why not use the RA that comes with the resource-agents package?
> Well, I've historically used my own scripts and hadn't even noticed when the
> LVM resource agent appeared. I've switched to it now, thanks for the hint.
>> this "become-primary-on" was never activated?
> nope.
>
>
>> Is the drbd init script deactivated on system boot? Cluster logs should
>> give more insights ....
> Yes, it's deactivated. I tried resyncing DRBD by hand, deleted the logs,
> rebooted both nodes, checked that DRBD wasn't started, and started corosync.
> The result is here:
> http://nelide.cz/nik/logs.tar.gz

It really looks like Pacemaker promotes to Primary too quickly, before
the connection to the second node (which is already up) can be
established. Your logs show DRBD 8.3.13 userland but an 8.3.11 kernel
module installed. Can you test with the 8.3.13 kernel module? There
have been fixes since then that appear to address this problem.

Another quick fix that should also work: add a start-delay of a few
seconds to the start operation of the DRBD resource.
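
A minimal sketch of that change, reusing the primitive from the original
post quoted below. The 10s value is only an assumed placeholder, not a
tested recommendation; everything else is copied from the posted
definition:

```
primitive drbd-sas0 ocf:linbit:drbd \
    params drbd_resource="drbd-sas0" \
    operations $id="drbd-sas0-operations" \
    op start interval="0" timeout="240s" start-delay="10s" \
    op stop interval="0" timeout="200s" \
    op promote interval="0" timeout="200s" \
    op demote interval="0" timeout="200s" \
    op monitor interval="179s" role="Master" timeout="150s" \
    op monitor interval="180s" role="Slave" timeout="150s"
```

Whether a delay of this size is enough depends on how long the two nodes
need to connect. In the dmesg output quoted below, the promotion happens
about 1.3 seconds after the device enters WFConnection, and the handshake
completes roughly 4 seconds after that.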

Alternatively, fix your after-split-brain policies so that this special
type of split brain (with 0 blocks to sync) is resolved automatically.
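
For reference, the after-split-brain policies live in the net section of
the DRBD resource. The fragment below is only a sketch: the resource name
is taken from the crm configuration quoted below, your actual drbd.conf
was not posted, and the policy values are commonly used ones rather than
settings verified against this cluster. Which of the three policies
applies depends on how many nodes are Primary at the moment the split
brain is detected.

```
resource drbd-sas0 {
  net {
    allow-two-primaries;                 # assumed, since the setup is primary/primary
    after-sb-0pri discard-zero-changes;  # no Primary: if one side has no changes, take the other side's data
    after-sb-1pri discard-secondary;     # one Primary: drop the Secondary's changes
    after-sb-2pri disconnect;            # two Primaries: no automatic resolution
  }
  # ... remaining resource settings unchanged
}
```

Note that with after-sb-2pri set to disconnect, a split brain detected
while both nodes are already Primary still ends in a disconnect, so the
promotion timing matters here as well.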

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>
> thanks for Your time.
> n.
>
>
>>
>> Regards,
>> Andreas
>> >
>> > thanks a lot in advance
>> >
>> > nik
>> >
>> >
>> > On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>> >> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>> >>> hello,
>> >>>
>> >>> I'm trying to solve a rather mysterious problem here.
>> >>> I've got a new cluster with a bunch of SAS disks for testing purposes.
>> >>> I've configured the DRBD devices in a primary/primary configuration.
>> >>>
>> >>> When I start DRBD using drbdadm, it comes up nicely (both nodes
>> >>> are Primary, connected).
>> >>> However, when I start it using corosync, I always get a split brain, although
>> >>> no data has been written and there has been no network disconnection.
>> >>
>> >> Please post your full DRBD and Pacemaker configuration. Isolated
>> >> snippets are very seldom helpful.
>> >>
>> >> Regards,
>> >> Andreas
>> >>
>> >>>
>> >>> here's drbd resource config:
>> >>> primitive drbd-sas0 ocf:linbit:drbd \
>> >>>     params drbd_resource="drbd-sas0" \
>> >>>     operations $id="drbd-sas0-operations" \
>> >>>     op start interval="0" timeout="240s" \
>> >>>     op stop interval="0" timeout="200s" \
>> >>>     op promote interval="0" timeout="200s" \
>> >>>     op demote interval="0" timeout="200s" \
>> >>>     op monitor interval="179s" role="Master" timeout="150s" \
>> >>>     op monitor interval="180s" role="Slave" timeout="150s"
>> >>>
>> >>> ms ms-drbd-sas0 drbd-sas0 \
>> >>>    meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" notify="true" globally-unique="false" interleave="true" target-role="Started"
>> >>>
>> >>>
>> >>> here's the dmesg output when pacemaker tries to promote drbd, causing the splitbrain:
>> >>> [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>> >>> [  157.646539] block drbd2: disk( Diskless -> Attaching )
>> >>> [  157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>> >>> [  157.650560] block drbd2: Method to ensure write ordering: drain
>> >>> [  157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>> >>> [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>> >>> [  157.653760] block drbd2: size = 279 GB (292333844 KB)
>> >>> [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>> >>> [  157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>> >>> [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>> >>> [  157.673972] block drbd2: disk( Attaching -> UpToDate )
>> >>> [  157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>> >>> [  157.685539] block drbd2: conn( StandAlone -> Unconnected )
>> >>> [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>> >>> [  157.685928] block drbd2: receiver (re)started
>> >>> [  157.686071] block drbd2: conn( Unconnected -> WFConnection )
>> >>> [  158.960577] block drbd2: role( Secondary -> Primary )
>> >>> [  158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>> >>> [  162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>> >>> [  162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>> >>> [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>> >>> [  162.687741] block drbd2: data-integrity-alg: <not-used>
>> >>> [  162.687930] block drbd2: drbd_sync_handshake:
>> >>> [  162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>> >>> [  162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>> >>> [  162.688428] block drbd2: uuid_compare()=100 by rule 90
>> >>> [  162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>> >>> [  162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>> >>>
>> >>> To me it seems that it's being promoted too early, and I also wonder why the
>> >>> "new current UUID" is generated.
>> >>>
>> >>> I'm using CentOS 6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6.
>> >>>
>> >>> Could anybody please advise me? I'm sure I'm doing something stupid, but I can't figure out what.
>> >>>
>> >>> thanks a lot in advance
>> >>>
>> >>> with best regards
>> >>>
>> >>> nik
>> >>>
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>>
>> >>> Project Home: http://www.clusterlabs.org
>> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >>> Bugs: http://bugs.clusterlabs.org
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:    +420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: servis at linuxbox.cz
> -------------------------------------