[ClusterLabs] DRBD split brain after Cluster node recovery

emmanuel segura emi2fast at gmail.com
Wed Jul 12 05:46:32 EDT 2017


You need to configure cluster fencing and the DRBD fencing handler; that
way, the cluster can recover without manual intervention.
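
For example, something along these lines in the resource definition. This
is a minimal sketch assuming DRBD 9 (your logs show protocol version 112);
the crm-fence-peer scripts ship with drbd-utils, but verify the section
placement and script names for your version in drbd.conf(5), since on 8.4
the option sits in the disk section and the scripts are crm-fence-peer.sh
and crm-unfence-peer.sh:

resource storage {
  net {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
    unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
  }
  ...
}

With "fencing resource-and-stonith", DRBD suspends I/O on connection loss
and calls the fence-peer handler, which places a Pacemaker constraint so
the outdated peer cannot be promoted until it has resynced; the constraint
is removed again once the resync finishes.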

2017-07-12 11:33 GMT+02:00 ArekW <arkaduis at gmail.com>:

> Hi,
> Can it be fixed so that DRBD does not end up in split brain after
> cluster node recovery? After a few tests I saw DRBD recover, but in
> most cases (9/10) it didn't sync.
>
> 1. When a node is put into standby and then taken out of standby,
> everything works OK: DRBD syncs and goes to primary mode.
>
> 2. When a node is hard-powered-off, STONITH brings it up and
> eventually the node comes back online, but DRBD is left in StandAlone
> state on the recovered node. I can only sync it manually, and that
> requires stopping the cluster.
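>
> The manual fix is essentially the standard split-brain recovery, more
> or less like this (a sketch: resource name "storage" as in the config
> below, and the node whose data gets discarded chosen by hand):
>
> # on the node whose changes are to be thrown away:
> drbdadm disconnect storage
> drbdadm secondary storage
> drbdadm connect --discard-my-data storage
> # on the surviving node, if it is StandAlone as well:
> drbdadm connect storage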
>
> Logs:
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Handshake to
> peer 1 successful: Agreed network protocol version 112
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Feature flags
> enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Starting
> ack_recv thread (from drbd_r_storage [28960])
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage: Preparing cluster-wide
> state change 2237079084 (0->1 499/145)
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage: State change
> 2237079084: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage: Committing cluster-wide
> state change 2237079084 (1ms)
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn(
> Connecting -> Connected ) peer( Unknown -> Secondary )
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: current_size:
> 14679544
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> c_size: 14679544 u_size: 0 d_size: 14679544 max_size: 14679544
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> la_size: 14679544 my_usize: 0 my_max_size: 14679544
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: my node_id: 0
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> node_id: 1 idx: 0 bm-uuid: 0x441536064ceddc92 flags: 0x10 max_size:
> 14679544 (DUnknown)
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> calling drbd_determine_dev_size()
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: my node_id: 0
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> node_id: 1 idx: 0 bm-uuid: 0x441536064ceddc92 flags: 0x10 max_size:
> 14679544 (DUnknown)
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> drbd_sync_handshake:
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: self
> 342BE98297943C35:441536064CEDDC92:69D98E1FCC2BB44C:E04101C6FF76D1CC
> bits:15450 flags:120
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: peer
> A8908796A7CCFF6E:CE6B672F4EDA6E78:69D98E1FCC2BB44C:E04101C6FF76D1CC
> bits:32768 flags:2
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2:
> uuid_compare()=-100 by rule 100
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
> command: /sbin/drbdadm initial-split-brain
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
> command: /sbin/drbdadm initial-split-brain exit code 0 (0x0)
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: Split-Brain
> detected but unresolved, dropping connection!
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
> command: /sbin/drbdadm split-brain
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper
> command: /sbin/drbdadm split-brain exit code 0 (0x0)
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn(
> Connected -> Disconnecting ) peer( Secondary -> Unknown )
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: error
> receiving P_STATE, e: -5 l: 0!
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: ack_receiver
> terminated
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Terminating
> ack_recv thread
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Connection closed
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn(
> Disconnecting -> StandAlone )
> Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Terminating
> receiver thread
>
>
> Config:
> resource storage {
>   protocol C;
>   meta-disk internal;
>   device /dev/drbd1;
>   syncer {
>     verify-alg sha1;
>   }
>   net {
>     allow-two-primaries;
>   }
>   on nfsnode1 {
>     disk   /dev/storage/drbd;
>     address  10.0.2.15:7789;
>   }
>   on nfsnode2 {
>     disk   /dev/storage/drbd;
>     address  10.0.2.4:7789;
>   }
> }
>
> pcs resource show StorageFS-clone
>  Clone: StorageFS-clone
>   Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
>    Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
>                stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
>                monitor interval=20 timeout=40 (StorageFS-monitor-interval-
> 20)
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started nfsnode2
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: ping-clone [ping]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing   (stonith:fence_vbox):   Started nfsnode2
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^