<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>On my running CentOS 7 system I also get the "Cluster is now split"
      message on the demoting node, as well as fence-peer on the
      promoting node, so the logs look somewhat the same there. As I've
      previously stated, "Could not connect to the CIB: No such device or
      address" is the same message you get if you manually stop the
      primary node without demoting it and then try to promote the
      secondary node. That is why it fixes itself once you reconnect DRBD
      on the now standby node.<br>

    </p>
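    <p>For anyone who wants to check their own logs for the same
      messages, here is a rough sketch. The function names scan_drbd_log
      and sample_log are made up for illustration; the patterns are the
      messages mentioned in this thread, and the sample lines are copied
      from the nfs6 log quoted further down.</p>
    <pre>
```shell
#!/bin/sh
# Hypothetical helper: keep only the log lines discussed in this thread.
# The patterns are literal message fragments from the quoted logs.
scan_drbd_log() {
    grep -E 'Cluster is now split|helper command: /sbin/drbdadm fence-peer|Could not connect to the CIB|Unknown command'
}

# Four sample lines copied from the nfs6 log in this thread
# (two of them should match, two should not).
sample_log() {
    printf '%s\n' \
        'Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Cluster is now split' \
        'Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Connection closed' \
        "Jan 17 11:48:14 nfs6 drbdadm[797503]: drbdadm: Unknown command 'disconnected'" \
        'Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Restarting sender thread'
}

# Prints the "Cluster is now split" and "Unknown command" lines only.
sample_log | scan_drbd_log
```
    </pre>
    <p>On a real node the input would come from the journal instead of
      sample_log, e.g. journalctl -b | scan_drbd_log.</p>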

    <p>I'm using ELRepo's version (you would think that it should work):<br>

    </p>

    <ul>

      <li>drbd90-utils-9.13.1-1.el8.elrepo.x86_64</li>

      <li>kmod-drbd90-9.0.25-2.el8_3.elrepo.x86_64</li>

      <li>drbd90-utils-sysvinit-9.13.1-1.el8.elrepo.x86_64</li>

    </ul>

    <p>===> I would love to hear from anyone else who has a working
      master/slave DRBD setup (or any other kind) running on CentOS 8
      with ELRepo's latest packages.</p>

    <p>DRBD seems to run just fine if you drive it manually with drbdadm
      (up, primary/secondary, etc.), mount the filesystem, and so on.<br>
      <br>
      I blew away all configs and started anew with this:</p>

    <p>pcs property set no-quorum-policy=ignore (setting this or leaving
      it out made no difference)<br>

      pcs property set stonith-enabled=false<br>

      pcs resource create drbd0 ocf:linbit:drbd drbd_resource=r0

      promotable promoted-max=1 promoted-node-max=1 clone-max=2

      clone-node-max=1 notify=true</p>

    <p>That's it! One node goes primary as it should. But when I put
      that node in standby, the cluster fails to promote the other node
      until I reconnect DRBD on the standby node (e.g. drbdadm up r0).</p>
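    <p>The sequence can be sketched roughly as the script below. It is
      only a sketch: the run and repro_steps helpers and the DRY_RUN
      switch are made up for illustration, the node name nfs6 is assumed
      from the logs in this thread, and "pcs node standby" assumes the
      pcs syntax shipped with CentOS 8. By default it only prints the
      commands.</p>
    <pre>
```shell
#!/bin/sh
# Hypothetical reproduction helper, meant to be run on the node that is
# currently primary. Names are assumptions taken from this thread:
#   nfs6 = current primary node, r0 = DRBD resource.
PRIMARY_NODE="${PRIMARY_NODE:-nfs6}"
DRBD_RES="${DRBD_RES:-r0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, do not run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro_steps() {
    run pcs node standby "$PRIMARY_NODE"   # demote/stop DRBD on the primary
    run drbdadm status "$DRBD_RES"         # inspect DRBD state on this node
    run drbdadm up "$DRBD_RES"             # reconnect DRBD on the standby node
    run pcs status                         # check whether the peer got promoted
}

repro_steps
```
    </pre>
    <p>Setting DRY_RUN=0 makes it actually execute the commands; in my
      case the other node only gets promoted after the drbdadm up step.</p>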

    <p>Brent<br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 1/18/2021 2:43 PM, Ken Gaillot

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:a9cb6da29286f0b23bcf303fd0b7773ec581efd6.camel@redhat.com">

      <pre class="moz-quote-pre" wrap="">The part that sticks out to me is "Cluster is now split" followed by

"helper command: /sbin/drbdadm fence-peer", which I believe should not

happen after a clean demote/stop of the other side, and then crm-fence-

peer.9.sh says "Could not connect to the CIB: No such device or

address". The unknown command error is also suspicious.

I'd make sure the installed versions of everything are happy with each

other (i.e. the drbd utils version supports the installed kernel module

and pacemaker versions, and similarly with the resource agent if it

came separately). I'm not familiar enough with DRBD 9 to know if any

further configuration changes are needed.

On Sun, 2021-01-17 at 12:00 -0700, Brent Jensen wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Here are some more log files (notice the error on 'helper command:

/sbin/drbdadm disconnected') on the Primary Node logs

Master (Primary) Mode going Standby

-----------------------------------

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop

operation for nfs5-stonith on nfs6: ok

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of

notify operation for drbd0 on nfs6: ok

Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Running stop

for /dev/drbd0 on /data

Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Trying to

unmount /data

Jan 17 11:48:14 nfs6 systemd[1923]: data.mount: Succeeded.

Jan 17 11:48:14 nfs6 systemd[1]: data.mount: Succeeded.

Jan 17 11:48:14 nfs6 kernel: XFS (drbd0): Unmounting Filesystem

Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: unmounted

/data successfully

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop

operation for fs_drbd on nfs6: ok

Jan 17 11:48:14 nfs6 kernel: drbd r0: role( Primary -> Secondary )

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of

demote operation for drbd0 on nfs6: ok

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of

notify operation for drbd0 on nfs6: ok

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of

notify operation for drbd0 on nfs6: ok

Jan 17 11:48:14 nfs6 kernel: drbd r0: Preparing cluster-wide state

change 59605293 (1->0 496/16)

Jan 17 11:48:14 nfs6 kernel: drbd r0: State change 59605293:

primary_nodes=0, weak_nodes=0

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Cluster is now split

Jan 17 11:48:14 nfs6 kernel: drbd r0: Committing cluster-wide state

change 59605293 (0ms)

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Connected ->

Disconnecting ) peer( Secondary -> Unknown )

Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0 nfs5: pdsk( UpToDate ->

DUnknown ) repl( Established -> Off )

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: ack_receiver terminated

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating ack_recv

thread

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Restarting sender thread

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Connection closed

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command:

/sbin/drbdadm disconnected

Jan 17 11:48:14 nfs6 drbdadm[797503]: drbdadm: Unknown command

'disconnected'

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command:

/sbin/drbdadm disconnected exit code 1 (0x100)

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Disconnecting ->

StandAlone )

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating receiver

thread

Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating sender thread

Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( UpToDate ->

Detaching )

Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( Detaching ->

Diskless )

Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: drbd_bm_resize called

with capacity == 0

Jan 17 11:48:14 nfs6 kernel: drbd r0: Terminating worker thread

Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting master-

drbd0[nfs6]: 10000 -> (unset)

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop

operation for drbd0 on nfs6: ok

Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting master-

drbd0[nfs5]: 10000 -> 1000

Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Current ping

state: S_NOT_DC

Secondary Node going primary (fails)

------------------------------------

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: State

transition S_IDLE -> S_POLICY_ENGINE

Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of

quorum: Ignore

Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  *

Move       fs_drbd          (         nfs6 -> nfs5 )

Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  *

Stop       nfs5-stonith     (                 nfs6 )   due to node

availability

Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  *

Stop       drbd0:0          (          Master nfs6 )   due to node

availability

Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice:  *

Promote    drbd0:1          ( Slave -> Master nfs5 )

Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated

transition 490, saving inputs in /var/lib/pacemaker/pengine/pe-input-

123.bz2

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

stop operation fs_drbd_stop_0 on nfs6

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

stop operation nfs5-stonith_stop_0 on nfs6

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

cancel operation drbd0_monitor_20000 locally on nfs5

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

notify operation drbd0_pre_notify_demote_0 on nfs6

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

notify operation drbd0_pre_notify_demote_0 locally on nfs5

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of

notify operation for drbd0 on nfs5: ok

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

demote operation drbd0_demote_0 on nfs6

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: peer( Primary -> Secondary

)

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

notify operation drbd0_post_notify_demote_0 on nfs6

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

notify operation drbd0_post_notify_demote_0 locally on nfs5

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of

notify operation for drbd0 on nfs5: ok

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

notify operation drbd0_pre_notify_stop_0 on nfs6

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

notify operation drbd0_pre_notify_stop_0 locally on nfs5

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of

notify operation for drbd0 on nfs5: ok

Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating

stop operation drbd0_stop_0 on nfs6

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Preparing remote state

change 59605293

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Committing remote state

change 59605293 (primary_nodes=0)

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Connected ->

TearDown ) peer( Secondary -> Unknown )

Jan 17 11:48:14 nfs5 kernel: drbd r0/0 drbd0 nfs6: pdsk( UpToDate ->

DUnknown ) repl( Established -> Off )

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: ack_receiver terminated

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Terminating ack_recv

thread

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting sender thread

Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Connection closed