<div dir="ltr">you need to configure cluster fencing and drbd fencing handler, in this way, the cluster can recevory without manual intervention.<br></div><div class="gmail_extra"><br><div class="gmail_quote">2017-07-12 11:33 GMT+02:00 ArekW <span dir="ltr"><<a href="mailto:arkaduis@gmail.com" target="_blank">arkaduis@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
Can it be fixed so that DRBD stops entering split brain after a cluster node recovers? After a few tests I saw DRBD recover, but in most cases (9 out of 10) it did not sync.

1. When a node is put into standby and then taken out of standby, everything works fine: DRBD syncs and goes back to primary mode.

2. When a node is hard powered off, STONITH brings it back up and the node eventually comes online, but DRBD is in StandAlone state on the recovered node. I can sync it only manually, and that requires stopping the cluster.
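(For the record, the usual manual recovery path should not require stopping the whole cluster; a sketch, assuming the recovered node nfsnode2 is the split-brain victim whose changes get thrown away, and that its GFS2 mount is stopped first so the device can be demoted:

# on nfsnode2, the victim:
drbdadm disconnect storage
drbdadm secondary storage
drbdadm connect --discard-my-data storage

# on nfsnode1, the survivor, only needed if it is StandAlone as well:
drbdadm connect storage

After the reconnect, DRBD resyncs the victim from the survivor automatically.)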

Logs:
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Handshake to peer 1 successful: Agreed network protocol version 112
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Starting ack_recv thread (from drbd_r_storage [28960])
Jul 12 10:26:35 nfsnode1 kernel: drbd storage: Preparing cluster-wide state change 2237079084 (0->1 499/145)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage: State change 2237079084: primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC
Jul 12 10:26:35 nfsnode1 kernel: drbd storage: Committing cluster-wide state change 2237079084 (1ms)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: current_size: 14679544
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: c_size: 14679544 u_size: 0 d_size: 14679544 max_size: 14679544
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: la_size: 14679544 my_usize: 0 my_max_size: 14679544
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: my node_id: 0
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: node_id: 1 idx: 0 bm-uuid: 0x441536064ceddc92 flags: 0x10 max_size: 14679544 (DUnknown)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: calling drbd_determine_dev_size()
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: my node_id: 0
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: node_id: 1 idx: 0 bm-uuid: 0x441536064ceddc92 flags: 0x10 max_size: 14679544 (DUnknown)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: drbd_sync_handshake:
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: self 342BE98297943C35:441536064CEDDC92:69D98E1FCC2BB44C:E04101C6FF76D1CC bits:15450 flags:120
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: peer A8908796A7CCFF6E:CE6B672F4EDA6E78:69D98E1FCC2BB44C:E04101C6FF76D1CC bits:32768 flags:2
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: uuid_compare()=-100 by rule 100
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper command: /sbin/drbdadm initial-split-brain
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper command: /sbin/drbdadm initial-split-brain exit code 0 (0x0)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1: Split-Brain detected but unresolved, dropping connection!
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper command: /sbin/drbdadm split-brain
Jul 12 10:26:35 nfsnode1 kernel: drbd storage/0 drbd1 nfsnode2: helper command: /sbin/drbdadm split-brain exit code 0 (0x0)
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: error receiving P_STATE, e: -5 l: 0!
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: ack_receiver terminated
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Terminating ack_recv thread
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Connection closed
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: conn( Disconnecting -> StandAlone )
Jul 12 10:26:35 nfsnode1 kernel: drbd storage nfsnode2: Terminating receiver thread

Config:
resource storage {
    protocol C;
    meta-disk internal;
    device /dev/drbd1;
    syncer {
        verify-alg sha1;
    }
    net {
        allow-two-primaries;
    }
    on nfsnode1 {
        disk /dev/storage/drbd;
        address 10.0.2.15:7789;
    }
    on nfsnode2 {
        disk /dev/storage/drbd;
        address 10.0.2.4:7789;
    }
}
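As the reply at the top says, the gap in this config is that allow-two-primaries is set with no fencing policy and no fence-peer handler, so after a hard poweroff DRBD cannot tell a fenced peer from real divergence and drops to StandAlone. A sketch of the missing pieces, assuming DRBD 9 (which the "protocol version 112" in the logs suggests) and the handler scripts shipped with drbd-utils; on 8.4 the fencing option sits in the disk section instead, and unfencing is done via an after-resync-target handler:

resource storage {
    net {
        # suspend I/O and call the fence-peer handler when the peer is lost
        fencing resource-and-stonith;
        allow-two-primaries;
        # optionally let DRBD auto-resolve a split brain instead of
        # going StandAlone:
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    handlers {
        # places a Pacemaker constraint so the stale side cannot promote
        fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
        # removes that constraint once resync has completed
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }
}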

pcs resource show StorageFS-clone
 Clone: StorageFS-clone
  Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
   Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
               stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
               monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
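(Once a fence-peer handler like the one sketched above fires, its effect is visible in the CIB as a temporary location constraint; the exact id is generated by the script, but it can be spotted with something like:

pcs constraint --full | grep drbd-fence-by-handler

)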

Full list of resources:

 Master/Slave Set: StorageClone [Storage]
     Masters: [ nfsnode1 nfsnode2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: ClusterIP-clone [ClusterIP] (unique)
     ClusterIP:0  (ocf::heartbeat:IPaddr2):  Started nfsnode2
     ClusterIP:1  (ocf::heartbeat:IPaddr2):  Started nfsnode1
 Clone Set: StorageFS-clone [StorageFS]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: WebSite-clone [WebSite]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: nfs-group-clone [nfs-group]
     Started: [ nfsnode1 nfsnode2 ]
 Clone Set: ping-clone [ping]
     Started: [ nfsnode1 nfsnode2 ]
 vbox-fencing   (stonith:fence_vbox):   Started nfsnode2
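(Since the DRBD handlers depend on fencing actually completing, it is worth verifying the vbox-fencing agent end to end; a quick check, assuming pcs 0.9 syntax:

pcs property show stonith-enabled   # must be true
pcs stonith fence nfsnode2          # should power-cycle the node

If the fence cycle works but DRBD still comes back StandAlone, the fencing policy and handlers are the part to revisit.)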

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

-- 
 .~.
 /V\
// \\
/( )\
^`~'^