[ClusterLabs] DRBD split brain after Cluster node recovery
ArekW
arkaduis at gmail.com
Wed Jul 12 05:33:40 EDT 2017
Hi,
Can it be fixed so that DRBD does not end up in split brain after a
cluster node recovers? After a few tests I saw DRBD recover, but in
most cases (9 out of 10) it did not resync.
1. When a node is put into standby and then taken out of standby,
everything works fine: DRBD resyncs and goes back to primary.
2. When a node is hard powered off, stonith brings it back up and the
node eventually comes online, but DRBD is left in StandAlone state on
the recovered node. I can only sync it manually, and that requires
stopping the cluster (see the sketch below).
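For reference, the manual recovery looks roughly like the split-brain
procedure from the DRBD user's guide (a sketch only, using the resource
name "storage" from my config below; exact steps may differ between
DRBD versions):

# on the node whose changes should be discarded (the recovered node):
drbdadm disconnect storage
drbdadm secondary storage
drbdadm connect --discard-my-data storage

# on the surviving node, if its connection is also StandAlone:
drbdadm connect storage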
Logs:
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Handshake to
peer 1 successful: Agreed network protocol version 112
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Feature flags
enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Starting
ack_recv thread (from drbd_r_storage [28960])
Jul 12 10:26:35 nfsnode1 kernel: drbd storage: Preparing cluster-wide
state change 2237079084 (0->1 499/145)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage: State change
2237079084: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Jul 12 10:26:35 nfsnode1 kernel: drbd storage: Committing cluster-wide
state change 2237079084 (1ms)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn(
Connecting -> Connected ) peer( Unknown -> Secondary )
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: current_size: 14679544
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
c_size: 14679544 u_size: 0 d_size: 14679544 max_size: 14679544
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
la_size: 14679544 my_usize: 0 my_max_size: 14679544
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: my node_id: 0
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
node_id: 1 idx: 0 bm-uuid: 0x441536064ceddc92 flags: 0x10 max_size:
14679544 (DUnknown)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
calling drbd_determine_dev_size()
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: my node_id: 0
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
node_id: 1 idx: 0 bm-uuid: 0x441536064ceddc92 flags: 0x10 max_size:
14679544 (DUnknown)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
drbd_sync_handshake:
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: self
342BE98297943C35:441536064CEDDC92:69D98E1FCC2BB44C:E04101C6FF76D1CC
bits:15450 flags:120
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: peer
A8908796A7CCFF6E:CE6B672F4EDA6E78:69D98E1FCC2BB44C:E04101C6FF76D1CC
bits:32768 flags:2
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
uuid_compare()=-100 by rule 100
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
command: /sbin/drbdadm initial-split-brain
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
command: /sbin/drbdadm initial-split-brain exit code 0 (0x0)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: Split-Brain
detected but unresolved, dropping connection!
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
command: /sbin/drbdadm split-brain
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
command: /sbin/drbdadm split-brain exit code 0 (0x0)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn(
Connected -> Disconnecting ) peer( Secondary -> Unknown )
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: error
receiving P_STATE, e: -5 l: 0!
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: ack_receiver terminated
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Terminating
ack_recv thread
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Connection closed
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn(
Disconnecting -> StandAlone )
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Terminating
receiver thread
Config:
resource storage {
    protocol C;
    meta-disk internal;
    device /dev/drbd1;
    syncer {
        verify-alg sha1;
    }
    net {
        allow-two-primaries;
    }
    on nfsnode1 {
        disk /dev/storage/drbd;
        address 10.0.2.15:7789;
    }
    on nfsnode2 {
        disk /dev/storage/drbd;
        address 10.0.2.4:7789;
    }
}
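Is the missing piece something like automatic split-brain recovery
policies plus a fence-peer handler? Below is only a sketch based on the
options described in drbd.conf(5) and the scripts shipped with
drbd-utils, not something I have verified here; with allow-two-primaries
the after-sb policies are quite limited, and which section takes the
fencing policy (net vs. disk) depends on the DRBD version:

resource storage {
    # ... existing device/disk/on sections as above ...
    net {
        allow-two-primaries;
        # automatic split-brain recovery policies
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
        # have DRBD fence the peer through the cluster manager
        fencing resource-and-stonith;
    }
    handlers {
        # helper scripts from drbd-utils (crm-fence-peer.9.sh on DRBD 9)
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}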
pcs resource show StorageFS-clone
 Clone: StorageFS-clone
  Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
   Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
               stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
               monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)

Full list of resources:
 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: ping-clone [ping]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing (stonith:fence_vbox): Started nfsnode2