[ClusterLabs] DRBD Split brain

Digimer lists at alteeve.ca
Sat Jan 20 18:10:40 UTC 2018


On 2018-01-19 01:36 PM, Ken Gaillot wrote:
> On Tue, 2017-12-12 at 15:30 +0200, Антон Сацкий wrote:
>> Hi list,
>> I need your help.
>> I have 2 servers using Pacemaker, Corosync and DRBD.
>>
>> [root@voipserver ~]# pcs config
>> Cluster Name: ClusterKrusher
>> Corosync Nodes:
>>  voipserver.primary voipserver.backup
>> Pacemaker Nodes:
>>  voipserver.backup voipserver.primary
>>
>> Resources:
>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>   Attributes: cidr_netmask=32 ip=172.20.11.10
>>   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>  Master: WebDataClone
>>   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
>>   Resource: WebData (class=ocf provider=linbit type=drbd)
>>    Attributes: drbd_resource=r0
>>    Operations: demote interval=0s timeout=90 (WebData-demote-interval-0s)
>>                monitor interval=60s (WebData-monitor-interval-60s)
> 
> My drbd is too rusty to comment on your specific issue, but a couple of
> general notes:
> 
> * master/slave resources need two monitors, the regular one (which
> monitors the slave role) and a second one (with a different interval)
> monitoring the master role
> 
> * fencing needs to be configured in pacemaker, and drbd needs to be
> configured to use pacemaker fencing, otherwise split-brain can more
> easily happen

Once fencing is working in pacemaker (and tested!), set 'fencing
resource-and-stonith;' in your DRBD resource config and configure
crm-{un,}fence-peer.sh as your {un,}fence handlers.

This tells DRBD to block I/O when the peer is lost and to request a
fence in pacemaker (preventing the target from using DRBD). When the
node rejoins, the unfence handler removes the restriction.
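
As a rough sketch of what that could look like for the r0 resource from
the config above, assuming DRBD 8.4 (the version in the logs): the
handler paths are the usual install locations and are an assumption, so
check where your packages put the scripts. On 8.4 the unfence script is
commonly hooked in as the after-resync-target handler.

```
resource r0 {
  disk {
    # Block I/O on peer loss and call the fence-peer handler.
    fencing resource-and-stonith;
  }
  handlers {
    # Paths are assumed; adjust to where your distro installs them.
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # Existing device/disk/address settings for r0 stay as they are.
}
```

The same settings need to be in place on both nodes, and they only help
once stonith is actually enabled and working in pacemaker.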

This is the only way to prevent a split-brain.

>>                promote interval=0s timeout=90 (WebData-promote-interval-0s)
>>                start interval=0s timeout=240 (WebData-start-interval-0s)
>>                stop interval=0s timeout=100 (WebData-stop-interval-0s)
>>  Resource: WebFS (class=ocf provider=heartbeat type=Filesystem)
>>   Attributes: device=/dev/drbd1 directory=/replica fstype=ext3
>>   Operations: monitor interval=20 timeout=40 (WebFS-monitor-interval-20)
>>               start interval=0s timeout=60 (WebFS-start-interval-0s)
>>               stop interval=0s timeout=60 (WebFS-stop-interval-0s)
>>  Resource: Asterisk (class=lsb type=asterisk)
>>   Operations: monitor interval=15 timeout=15 (Asterisk-monitor-interval-15)
>>               start interval=0s timeout=15 (Asterisk-start-interval-0s)
>>               stop interval=0s timeout=15 (Asterisk-stop-interval-0s)
>>  Resource: MYSQL (class=lsb type=mysql)
>>   Operations: monitor interval=15 timeout=15 (MYSQL-monitor-interval-15)
>>               start interval=0s timeout=15 (MYSQL-start-interval-0s)
>>               stop interval=0s timeout=15 (MYSQL-stop-interval-0s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>>   promote WebDataClone then start WebFS (kind:Mandatory)
>>   start WebFS then start MYSQL (kind:Mandatory)
>>   start ClusterIP then start Asterisk (kind:Mandatory)
>> Colocation Constraints:
>>   WebFS with WebDataClone (score:INFINITY) (with-rsc-role:Master)
>>   MYSQL with WebFS (score:INFINITY)
>>   Asterisk with ClusterIP (score:INFINITY)
>> Ticket Constraints:
>>
>> Alerts:
>>  No alerts defined
>>
>> Resources Defaults:
>>  resource-stickiness: 100
>> Operations Defaults:
>>  No defaults set
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  cluster-name: ClusterKrusher
>>  dc-version: 1.1.16-12.el7_4.2-94ff4df
>>  have-watchdog: false
>>  stonith-enabled: false
>>
>> Quorum:
>>   Options:
>> ===================
>>
>>
>> After some time I got this in the logs:
>> [root@voipserver ~]#  cat  /var/log/messages |grep drbd
>> Dec 12 14:08:52 voipserver kernel: block drbd1: role( Secondary -> Primary )
>> Dec 12 14:08:52 voipserver Filesystem(WebFS)[64935]: INFO: Running start for /dev/drbd1 on /replica
>> Dec 12 14:08:52 voipserver kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem
>> Dec 12 14:08:53 voipserver kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null)
>> Dec 12 14:18:13 voipserver Filesystem(WebFS)[3134]: INFO: Running stop for /dev/drbd1 on /replica
>> Dec 12 14:18:17 voipserver Filesystem(WebFS)[3319]: INFO: Running start for /dev/drbd1 on /replica
>> Dec 12 14:18:17 voipserver kernel: EXT4-fs (drbd1): mounting ext3 file system using the ext4 subsystem
>> Dec 12 14:18:17 voipserver kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts: (null)
>> Dec 12 14:44:07 voipserver Filesystem(WebFS)[11669]: INFO: Running stop for /dev/drbd1 on /replica
>> Dec 12 14:44:07 voipserver kernel: block drbd1: role( Primary -> Secondary )
>> Dec 12 14:44:07 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
>> Dec 12 14:44:08 voipserver kernel: block drbd1: disk( UpToDate -> Failed )
>> Dec 12 14:44:08 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
>> Dec 12 14:44:08 voipserver kernel: block drbd1: disk( Failed -> Diskless )
>> Dec 12 14:44:08 voipserver kernel: drbd r0: Terminating drbd_w_r0
>> Dec 12 14:44:19 voipserver kernel: drbd: loading out-of-tree module taints kernel.
>> Dec 12 14:44:19 voipserver kernel: drbd: module verification failed: signature and/or required key missing - tainting kernel
>> Dec 12 14:44:19 voipserver systemd-modules-load: Inserted module 'drbd'
>> Dec 12 14:44:19 voipserver kernel: drbd: initialized. Version: 8.4.10-1 (api:1/proto:86-101)
>> Dec 12 14:44:19 voipserver kernel: drbd: GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
>> Dec 12 14:44:19 voipserver kernel: drbd: registered as block device major 147
>> Dec 12 14:45:02 voipserver Filesystem(WebFS)[1400]: WARNING: Couldn't find device [/dev/drbd1]. Expected /dev/??? to exist
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting worker thread (from drbdsetup-84 [1524])
>> Dec 12 14:45:03 voipserver kernel: block drbd1: disk( Diskless -> Attaching )
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Method to ensure write ordering: flush
>> Dec 12 14:45:03 voipserver kernel: block drbd1: max BIO size = 524288
>> Dec 12 14:45:03 voipserver kernel: block drbd1: drbd_bm_resize called with capacity == 419153344
>> Dec 12 14:45:03 voipserver kernel: block drbd1: resync bitmap: bits=52394168 words=818659 pages=1599
>> Dec 12 14:45:03 voipserver kernel: block drbd1: size = 200 GB (209576672 KB)
>> Dec 12 14:45:03 voipserver kernel: block drbd1: recounting of set bits took additional 1 jiffies
>> Dec 12 14:45:03 voipserver kernel: block drbd1: 3552 KB (888 bits) marked out-of-sync by on disk bit-map.
>> Dec 12 14:45:03 voipserver kernel: block drbd1: disk( Attaching -> UpToDate )
>> Dec 12 14:45:03 voipserver kernel: block drbd1: attached to UUIDs FBA12F26BE1DEE73:EE5942173C75DE98:1BF4DECFE20D51E2:1BF3DECFE20D51E3
>> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( StandAlone -> Unconnected )
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [1525])
>> Dec 12 14:45:03 voipserver kernel: drbd r0: receiver (re)started
>> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( Unconnected -> WFConnection )
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Handshake successful: Agreed network protocol version 101
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
>> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( WFConnection -> WFReportParams )
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [1534])
>> Dec 12 14:45:03 voipserver kernel: block drbd1: drbd_sync_handshake:
>> Dec 12 14:45:03 voipserver kernel: block drbd1: self FBA12F26BE1DEE72:EE5942173C75DE98:1BF4DECFE20D51E2:1BF3DECFE20D51E3 bits:888 flags:0
>> Dec 12 14:45:03 voipserver kernel: block drbd1: peer 93BB6F0A5075345D:EE5942173C75DE99:1BF4DECFE20D51E3:1BF3DECFE20D51E3 bits:38004 flags:2
>> Dec 12 14:45:03 voipserver kernel: block drbd1: uuid_compare()=100 by rule 90
>> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
>> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
>> Dec 12 14:45:03 voipserver kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
>> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
>> Dec 12 14:45:03 voipserver kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
>> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( WFReportParams -> Disconnecting )
>> Dec 12 14:45:03 voipserver kernel: drbd r0: error receiving ReportState, e: -5 l: 0!
>> Dec 12 14:45:03 voipserver kernel: drbd r0: ack_receiver terminated
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Terminating drbd_a_r0
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Connection closed
>> Dec 12 14:45:03 voipserver kernel: drbd r0: conn( Disconnecting -> StandAlone )
>> Dec 12 14:45:03 voipserver kernel: drbd r0: receiver terminated
>> Dec 12 14:45:03 voipserver kernel: drbd r0: Terminating drbd_r_r0
>>
>>
>>
>> So I now need to decide on the best way to configure split-brain
>> recovery. Example config files would be appreciated.
>>
>> Primary
>> [root@voipserver ~]# drbd-overview
>> NOTE: drbd-overview will be deprecated soon.
>> Please consider using drbdtop.
>>
>>  1:r0/0  WFConnection Primary/Unknown UpToDate/DUnknown /replica ext3 197G 720M 186G 1%
>>
>> Secondary
>>
>> [root@voipserver ~]# drbd-overview
>> NOTE: drbd-overview will be deprecated soon.
>> Please consider using drbdtop.
>>
>>  1:r0/0  StandAlone Secondary/Unknown UpToDate/DUnknown
>>
>>
>> THANKS
>>
>> -- 
>> Best regards
>> Antony
>> tel.   +380669197533
>> tel2. +380636564340
>> Paypal http://paypal.me/Satskiy
>> satskiy.a at gmail.com


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould



