[ClusterLabs] DRBD Split brain
Ken Gaillot
kgaillot at redhat.com
Fri Jan 19 13:36:06 EST 2018
On Tue, 2017-12-12 at 15:30 +0200, Антон Сацкий wrote:
> Hi list,
> I need your help.
> I have 2 servers running Pacemaker, Corosync and DRBD.
>
> [root@voipserver ~]# pcs config
> Cluster Name: ClusterKrusher
> Corosync Nodes:
> voipserver.primary voipserver.backup
> Pacemaker Nodes:
> voipserver.backup voipserver.primary
>
> Resources:
> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: cidr_netmask=32 ip=172.20.11.10
> Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
> start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> Master: WebDataClone
> Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
> Resource: WebData (class=ocf provider=linbit type=drbd)
> Attributes: drbd_resource=r0
> Operations: demote interval=0s timeout=90 (WebData-demote-interval-0s)
> monitor interval=60s (WebData-monitor-interval-60s)
My DRBD is too rusty to comment on your specific issue, but a couple of
general notes:
* Master/slave resources need two monitor operations: the regular one
(which monitors the slave role) and a second one, with a different
interval, that monitors the master role. Your WebData resource has only
the 60s monitor.
* Fencing needs to be configured in Pacemaker, and DRBD needs to be
configured to use Pacemaker fencing; otherwise split-brain can happen
more easily. Your config shows stonith-enabled: false and no stonith
devices.
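
A rough sketch of what those two notes could look like in practice, not
tested against this cluster. The names WebData and r0 come from the quoted
config; everything else is an assumption: the 29s interval is arbitrary (it
only has to differ from 60s), the pcs syntax is the 0.9-series form that
ships alongside Pacemaker 1.1, the helper script paths are where drbd-utils
commonly installs them and may differ on your distribution, and the run
wrapper and function names are made up for illustration. By default the
script only prints the commands.

```shell
#!/bin/sh
# Sketch only. Prints each command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Add a second monitor for the master role. $1 = resource id,
# $2 = interval (must differ from the existing monitor's interval).
add_master_monitor() {
    run pcs resource op add "$1" monitor interval="$2" role=Master
}

# Turn fencing on. Assumes a stonith device has already been
# created and tested; none is defined in the quoted config.
enable_fencing() {
    run pcs property set stonith-enabled=true
}

# Print a DRBD 8.4 fragment that hands fencing to Pacemaker.
# Helper paths are an assumption; check where your packages put them.
print_drbd_fencing() {
    cat <<'EOF'
resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
EOF
}

add_master_monitor WebData 29s
enable_fencing
print_drbd_fencing
```

Run it as is to review the commands, then with DRY_RUN=0 once a stonith
device exists. The printed fragment would be merged by hand into the
existing r0 resource definition on both nodes.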
> promote interval=0s timeout=90 (WebData-promote-interval-0s)
> start interval=0s timeout=240 (WebData-start-interval-0s)
> stop interval=0s timeout=100 (WebData-stop-interval-0s)
> Resource: WebFS (class=ocf provider=heartbeat type=Filesystem)
> Attributes: device=/dev/drbd1 directory=/replica fstype=ext3
> Operations: monitor interval=20 timeout=40 (WebFS-monitor-interval-20)
> start interval=0s timeout=60 (WebFS-start-interval-0s)
> stop interval=0s timeout=60 (WebFS-stop-interval-0s)
> Resource: Asterisk (class=lsb type=asterisk)
> Operations: monitor interval=15 timeout=15 (Asterisk-monitor-interval-15)
> start interval=0s timeout=15 (Asterisk-start-interval-0s)
> stop interval=0s timeout=15 (Asterisk-stop-interval-0s)
> Resource: MYSQL (class=lsb type=mysql)
> Operations: monitor interval=15 timeout=15 (MYSQL-monitor-interval-15)
> start interval=0s timeout=15 (MYSQL-start-interval-0s)
> stop interval=0s timeout=15 (MYSQL-stop-interval-0s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> promote WebDataClone then start WebFS (kind:Mandatory)
> start WebFS then start MYSQL (kind:Mandatory)
> start ClusterIP then start Asterisk (kind:Mandatory)
> Colocation Constraints:
> WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master)
> MYSQL with WebFS (score:INFINITY)
> Asterisk with ClusterIP (score:INFINITY)
> Ticket Constraints:
>
> Alerts:
> No alerts defined
>
> Resources Defaults:
> resource-stickiness: 100
> Operations Defaults:
> No defaults set
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: ClusterKrusher
> dc-version: 1.1.16-12.el7_4.2-94ff4df
> have-watchdog: false
> stonith-enabled: false
>
> Quorum:
> Options:
> ===================
>
>
> After some time I got this in the logs:
> [root@voipserver ~]# cat /var/log/messages |grep drbd
> Dec 12 14:08:52 voipserver kernel: block drbd1: role( Secondary -> Primary )
> Dec 12 14:08:52 voipserver Filesystem(WebFS)[64935]: INFO: Running start for /dev/drbd1 on /replica
> Dec 12 14:08:52 voipserver kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem
> Dec 12 14:08:53 voipserver kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null)
> Dec 12 14:18:13 voipserver Filesystem(WebFS)[3134]: INFO: Running stop for /dev/drbd1 on /replica
> Dec 12 14:18:17 voipserver Filesystem(WebFS)[3319]: INFO: Running start for /dev/drbd1 on /replica
> Dec 12 14:18:17 voipserver kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem
> Dec 12 14:18:17 voipserver kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null)
> Dec 12 14:44:07 voipserver Filesystem(WebFS)[11669]: INFO: Running stop for /dev/drbd1 on /replica
> Dec 12 14:44:07 voipserver kernel: block drbd1: role( Primary -> Secondary )
> Dec 12 14:44:07 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
> Dec 12 14:44:08 voipserver kernel: block drbd1: disk( UpToDate -> Failed )
> Dec 12 14:44:08 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
> Dec 12 14:44:08 voipserver kernel: block drbd1: disk( Failed -> Diskless )
> Dec 12 14:44:08 voipserver kernel: drbd r0: Terminating drbd_w_r0
> Dec 12 14:44:19 voipserver kernel: drbd: loading out-of-tree module taints kernel.
> Dec 12 14:44:19 voipserver kernel: drbd: module verification failed: signature and/or required key missing - tainting kernel
> Dec 12 14:44:19 voipserver systemd-modules-load: Inserted module 'drbd'
> Dec 12 14:44:19 voipserver kernel: drbd: initialized. Version: 8.4.10-1 (api:1/proto:86-101)
> Dec 12 14:44:19 voipserver kernel: drbd: GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
> Dec 12 14:44:19 voipserver kernel: drbd: registered as block device major 147
> Dec 12 14:45:02 voipserver Filesystem(WebFS)[1400]: WARNING: Couldn't find device [/dev/drbd1]. Expected /dev/??? to exist
> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting worker thread (from drbdsetup-84 [1524])
> Dec 12 14:45:03 voipserver kernel: block drbd1: disk( Diskless -> Attaching )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Method to ensure write ordering: flush
> Dec 12 14:45:03 voipserver kernel: block drbd1: max BIO size = 524288
> Dec 12 14:45:03 voipserver kernel: block drbd1: drbd_bm_resize called with capacity == 419153344
> Dec 12 14:45:03 voipserver kernel: block drbd1: resync bitmap: bits=52394168 words=818659 pages=1599
> Dec 12 14:45:03 voipserver kernel: block drbd1: size = 200 GB (209576672 KB)
> Dec 12 14:45:03 voipserver kernel: block drbd1: recounting of set bits took additional 1 jiffies
> Dec 12 14:45:03 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
> Dec 12 14:45:03 voipserver kernel: block drbd1: disk( Attaching -> UpToDate )
> Dec 12 14:45:03 voipserver kernel: block drbd1: attached to UUIDs FBA12F26BE1DEE73:EE5942173C75DE98:1BF4DECFE20D51E2:1BF3DECFE20D51E3
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( StandAlone -> Unconnected )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [1525])
> Dec 12 14:45:03 voipserver kernel: drbd r0: receiver (re)started
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( Unconnected -> WFConnection )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Handshake successful: Agreed network protocol version 101
> Dec 12 14:45:03 voipserver kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( WFConnection -> WFReportParams )
> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [1534])
> Dec 12 14:45:03 voipserver kernel: block drbd1: drbd_sync_handshake:
> Dec 12 14:45:03 voipserver kernel: block drbd1: self FBA12F26BE1DEE72:EE5942173C75DE98:1BF4DECFE20D51E2:1BF3DECFE20D51E3 bits:888 flags:0
> Dec 12 14:45:03 voipserver kernel: block drbd1: peer 93BB6F0A5075345D:EE5942173C75DE99:1BF4DECFE20D51E3:1BF3DECFE20D51E3 bits:38004 flags:2
> Dec 12 14:45:03 voipserver kernel: block drbd1: uuid_compare()=100 by rule 90
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
> Dec 12 14:45:03 voipserver kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( WFReportParams -> Disconnecting )
> Dec 12 14:45:03 voipserver kernel: drbd r0: error receiving ReportState, e: -5 l: 0!
> Dec 12 14:45:03 voipserver kernel: drbd r0: ack_receiver terminated
> Dec 12 14:45:03 voipserver kernel: drbd r0: Terminating drbd_a_r0
> Dec 12 14:45:03 voipserver kernel: drbd r0: Connection closed
> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( Disconnecting -> StandAlone )
> Dec 12 14:45:03 voipserver kernel: drbd r0: receiver terminated
> Dec 12 14:45:03 voipserver kernel: drbd r0: Terminating drbd_r_r0
>
>
>
> So now I need to decide on the best way to configure split-brain
> recovery. Example config files would be appreciated.
>
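
On the split-brain recovery question itself, here is a hedged sketch of the
usual DRBD 8.4 procedure: pick one node as the "victim", discard its
changes, and let it resync from the survivor. Which node loses its data is
a decision only you can make after looking at both copies; nothing in the
log settles it. The resource name r0 is from the quoted config. The
function names and the run wrapper are made up for illustration, and the
after-sb-* values printed at the end are one common example, not a
recommendation. Since Pacemaker manages this resource, it may also need to
be kept from interfering while you do this (an assumption worth checking
first). By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only. Prints each command unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on the victim node only. Its changes since the split are lost.
discard_local_changes() {
    run drbdadm disconnect "$1"
    run drbdadm secondary "$1"
    run drbdadm connect --discard-my-data "$1"
}

# Run on the surviving node, and only if it shows StandAlone.
# A survivor already in WFConnection reconnects on its own.
reconnect_survivor() {
    run drbdadm connect "$1"
}

# Print an example net section with automatic recovery policies.
# These can discard data without asking, so review each one.
print_auto_recovery_policy() {
    cat <<'EOF'
resource r0 {
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
}
EOF
}

discard_local_changes r0
print_auto_recovery_policy
```

Automatic policies only decide what happens after a split-brain; they do
not prevent one. Fencing is what prevents it.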
> Primary
> [root@voipserver ~]# drbd-overview
> NOTE: drbd-overview will be deprecated soon.
> Please consider using drbdtop.
>
> 1:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown /replica ext3 197G 720M 186G 1%
>
> Secondary
>
> [root@voipserver ~]# drbd-overview
> NOTE: drbd-overview will be deprecated soon.
> Please consider using drbdtop.
>
> 1:r0/0 StandAlone Secondary/Unknown UpToDate/DUnknown
>
>
> THANKS
>
> --
> Best regards
> Antony
> tel. +380669197533
> tel2. +380636564340
> Paypal http://paypal.me/Satskiy
> satskiy.a at gmail.com
--
Ken Gaillot <kgaillot at redhat.com>