[ClusterLabs] DRBD Split brain

Ken Gaillot kgaillot at redhat.com
Fri Jan 19 13:36:06 EST 2018


On Tue, 2017-12-12 at 15:30 +0200, Антон Сацкий wrote:
> Hi list,
> I need your help.
> I have 2 servers running Pacemaker, Corosync and DRBD.
> 
> [root@voipserver ~]# pcs config
> Cluster Name: ClusterKrusher
> Corosync Nodes:
>  voipserver.primary voipserver.backup
> Pacemaker Nodes:
>  voipserver.backup voipserver.primary
> 
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: cidr_netmask=32 ip=172.20.11.10
>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>  Master: WebDataClone
>   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
>   Resource: WebData (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=r0
>    Operations: demote interval=0s timeout=90 (WebData-demote-interval-0s)
>                monitor interval=60s (WebData-monitor-interval-60s)

My DRBD knowledge is too rusty to comment on your specific issue, but a
couple of general notes:

* Master/slave resources need two monitors: the regular one (which
monitors the slave role) and a second one, with a different interval,
that monitors the master role (role=Master). Your config has only the
60s monitor.

* Fencing needs to be configured in Pacemaker, and DRBD needs to be
configured to use Pacemaker fencing; otherwise split-brain can more
easily happen. Your config has stonith-enabled: false and no stonith
devices.
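
As a rough sketch of those two notes, something like the following
could be used. Treat it as an illustration under assumptions, not a
tested recipe: the 29s interval is arbitrary (it only has to differ
from the existing 60s monitor), the `run` wrapper is a hypothetical
dry-run helper that prints the commands unless APPLY=1 is set, and no
stonith device is shown because that depends entirely on your hardware.
The DRBD part is given as comments since it is a config fragment in
DRBD 8.4 syntax (the version in your log), not a command.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints each command unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Second monitor for the Master role; the interval must differ from the
# existing 60s monitor (29s is an arbitrary choice).
run pcs resource op add WebData monitor interval=29s role=Master

# Enable fencing only after a working stonith device has been created
# and tested (device type and parameters are hardware-specific).
run pcs property set stonith-enabled=true

# DRBD side (8.4 syntax), to be added to the r0 resource definition so
# that DRBD asks Pacemaker to fence the peer:
#
#   resource r0 {
#     disk {
#       fencing resource-and-stonith;
#     }
#     handlers {
#       fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
#       after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
#     }
#   }
```

Run as-is, the script only prints the two pcs commands; setting APPLY=1
would execute them on a cluster node.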

>                promote interval=0s timeout=90 (WebData-promote-interval-0s)
>                start interval=0s timeout=240 (WebData-start-interval-0s)
>                stop interval=0s timeout=100 (WebData-stop-interval-0s)
>  Resource: WebFS (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: device=/dev/drbd1 directory=/replica fstype=ext3
>   Operations: monitor interval=20 timeout=40 (WebFS-monitor-interval-20)
>               start interval=0s timeout=60 (WebFS-start-interval-0s)
>               stop interval=0s timeout=60 (WebFS-stop-interval-0s)
>  Resource: Asterisk (class=lsb type=asterisk)
>   Operations: monitor interval=15 timeout=15 (Asterisk-monitor-interval-15)
>               start interval=0s timeout=15 (Asterisk-start-interval-0s)
>               stop interval=0s timeout=15 (Asterisk-stop-interval-0s)
>  Resource: MYSQL (class=lsb type=mysql)
>   Operations: monitor interval=15 timeout=15 (MYSQL-monitor-interval-15)
>               start interval=0s timeout=15 (MYSQL-start-interval-0s)
>               stop interval=0s timeout=15 (MYSQL-stop-interval-0s)
> 
> Stonith Devices:
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>   promote WebDataClone then start WebFS (kind:Mandatory)
>   start WebFS then start MYSQL (kind:Mandatory)
>   start ClusterIP then start Asterisk (kind:Mandatory)
> Colocation Constraints:
>   WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master)
>   MYSQL with WebFS (score:INFINITY)
>   Asterisk with ClusterIP (score:INFINITY)
> Ticket Constraints:
> 
> Alerts:
>  No alerts defined
> 
> Resources Defaults:
>  resource-stickiness: 100
> Operations Defaults:
>  No defaults set
> 
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: ClusterKrusher
>  dc-version: 1.1.16-12.el7_4.2-94ff4df
>  have-watchdog: false
>  stonith-enabled: false
> 
> Quorum:
>   Options:
> ===================
> 
> 
> After some time, I got this in the logs:
> [root@voipserver ~]# cat /var/log/messages | grep drbd
> Dec 12 14:08:52 voipserver kernel: block drbd1: role( Secondary -> Primary )
> Dec 12 14:08:52 voipserver Filesystem(WebFS)[64935]: INFO: Running start for /dev/drbd1 on /replica
> Dec 12 14:08:52 voipserver kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem
> Dec 12 14:08:53 voipserver kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null)
> Dec 12 14:18:13 voipserver Filesystem(WebFS)[3134]: INFO: Running stop for /dev/drbd1 on /replica
> Dec 12 14:18:17 voipserver Filesystem(WebFS)[3319]: INFO: Running start for /dev/drbd1 on /replica
> Dec 12 14:18:17 voipserver kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem
> Dec 12 14:18:17 voipserver kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null)
> Dec 12 14:44:07 voipserver Filesystem(WebFS)[11669]: INFO: Running stop for /dev/drbd1 on /replica
> Dec 12 14:44:07 voipserver kernel: block drbd1: role( Primary -> Secondary )
> Dec 12 14:44:07 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
> Dec 12 14:44:08 voipserver kernel: block drbd1: disk( UpToDate -> Failed )
> Dec 12 14:44:08 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
> Dec 12 14:44:08 voipserver kernel: block drbd1: disk( Failed -> Diskless )
> Dec 12 14:44:08 voipserver kernel: drbd r0: Terminating drbd_w_r0
> Dec 12 14:44:19 voipserver kernel: drbd: loading out-of-tree module taints kernel.
> Dec 12 14:44:19 voipserver kernel: drbd: module verification failed: signature and/or required key missing - tainting kernel
> Dec 12 14:44:19 voipserver systemd-modules-load: Inserted module 'drbd'
> Dec 12 14:44:19 voipserver kernel: drbd: initialized. Version: 8.4.10-1 (api:1/proto:86-101)
> Dec 12 14:44:19 voipserver kernel: drbd: GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
> Dec 12 14:44:19 voipserver kernel: drbd: registered as block device major 147
> Dec 12 14:45:02 voipserver Filesystem(WebFS)[1400]: WARNING: Couldn't find device [/dev/drbd1]. Expected /dev/??? to exist
> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting worker thread (from drbdsetup-84 [1524])
> Dec 12 14:45:03 voipserver kernel: block drbd1: disk( Diskless -> Attaching )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Method to ensure write ordering: flush
> Dec 12 14:45:03 voipserver kernel: block drbd1: max BIO size = 524288
> Dec 12 14:45:03 voipserver kernel: block drbd1: drbd_bm_resize called with capacity == 419153344
> Dec 12 14:45:03 voipserver kernel: block drbd1: resync bitmap: bits=52394168 words=818659 pages=1599
> Dec 12 14:45:03 voipserver kernel: block drbd1: size = 200 GB (209576672 KB)
> Dec 12 14:45:03 voipserver kernel: block drbd1: recounting of set bits took additional 1 jiffies
> Dec 12 14:45:03 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
> Dec 12 14:45:03 voipserver kernel: block drbd1: disk( Attaching -> UpToDate )
> Dec 12 14:45:03 voipserver kernel: block drbd1: attached to UUIDs FBA12F26BE1DEE73:EE5942173C75DE98:1BF4DECFE20D51E2:1BF3DECFE20D51E3
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( StandAlone -> Unconnected )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [1525])
> Dec 12 14:45:03 voipserver kernel: drbd r0: receiver (re)started
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( Unconnected -> WFConnection )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Handshake successful: Agreed network protocol version 101
> Dec 12 14:45:03 voipserver kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( WFConnection -> WFReportParams )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [1534])
> Dec 12 14:45:03 voipserver kernel: block drbd1: drbd_sync_handshake:
> Dec 12 14:45:03 voipserver kernel: block drbd1: self FBA12F26BE1DEE72:EE5942173C75DE98:1BF4DECFE20D51E2:1BF3DECFE20D51E3 bits:888 flags:0
> Dec 12 14:45:03 voipserver kernel: block drbd1: peer 93BB6F0A5075345D:EE5942173C75DE99:1BF4DECFE20D51E3:1BF3DECFE20D51E3 bits:38004 flags:2
> Dec 12 14:45:03 voipserver kernel: block drbd1: uuid_compare()=100 by rule 90
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
> Dec 12 14:45:03 voipserver kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( WFReportParams -> Disconnecting )
> Dec 12 14:45:03 voipserver kernel: drbd r0: error receiving ReportState, e: -5 l: 0!
> Dec 12 14:45:03 voipserver kernel: drbd r0: ack_receiver terminated
> Dec 12 14:45:03 voipserver kernel: drbd r0: Terminating drbd_a_r0
> Dec 12 14:45:03 voipserver kernel: drbd r0: Connection closed
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( Disconnecting -> StandAlone )
> Dec 12 14:45:03 voipserver kernel: drbd r0: receiver terminated
> Dec 12 14:45:03 voipserver kernel: drbd r0: Terminating drbd_r_r0
> 
> 
> 
> So I now need to decide on the best way to configure split-brain
> recovery. Config files would be appreciated.
> 
> Primary
> [root@voipserver ~]# drbd-overview
> NOTE: drbd-overview will be deprecated soon.
> Please consider using drbdtop.
> 
>  1:r0/0  WFConnection Primary/Unknown UpToDate/DUnknown /replica ext3 197G 720M 186G 1%
> 
> Secondary
> 
> [root@voipserver ~]# drbd-overview
> NOTE: drbd-overview will be deprecated soon.
> Please consider using drbdtop.
> 
>  1:r0/0  StandAlone Secondary/Unknown UpToDate/DUnknown
> 
> 
> THANKS
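
On the recovery question, the usual manual split-brain recovery for
DRBD 8.4 (the version in your log) can be sketched as below. This is
only a sketch under assumptions: it assumes you have decided which node
is the "victim" whose changes get thrown away (for example the one
shown as Secondary/StandAlone above, but that choice is yours to verify
and is not something the logs decide), and the `run` wrapper is a
hypothetical dry-run helper that prints the commands unless APPLY=1 is
set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints each command unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

RES=r0   # DRBD resource name, taken from drbd_resource=r0 in the pcs config

# Run on the split-brain victim only: its local changes are discarded
# and it resyncs from the peer.
recover_victim() {
    run drbdadm disconnect "$RES"
    run drbdadm secondary "$RES"
    run drbdadm connect --discard-my-data "$RES"
}

# Run on the surviving node only if it is StandAlone; a node that is
# already in WFConnection should reconnect on its own.
recover_survivor() {
    run drbdadm connect "$RES"
}

case "${1:-}" in
    victim)   recover_victim ;;
    survivor) recover_survivor ;;
    *)        echo "usage: $0 victim|survivor" ;;
esac
```

For automatic recovery, DRBD 8.4 has the after-sb-0pri, after-sb-1pri
and after-sb-2pri options in the net section of the resource, but any
automatic policy means accepting that one node's writes may be dropped
without anyone looking at them first, so working fencing is the better
first step.
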
> 
> -- 
> Best regards
> Antony
-- 
Ken Gaillot <kgaillot at redhat.com>



