[ClusterLabs] DRBD 2-node M/S doesn't want to promote new master, Centos 8
Brent Jensen
jeneral9 at gmail.com
Sun Jan 17 14:00:14 EST 2021
Here are some more log excerpts; note the error from 'helper command:
/sbin/drbdadm disconnected' in the primary node's log.
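That 'Unknown command' message makes me wonder whether the drbd-utils
userland and the DRBD kernel module versions actually agree (just a guess
on my part). These show both versions on each node:

    cat /proc/drbd        # version of the loaded kernel module
    drbdadm --version     # DRBDADM_VERSION / DRBD_KERNEL_VERSION from drbd-utils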
Master (Primary) Node going Standby
-----------------------------------
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop
operation for nfs5-stonith on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Running stop for
/dev/drbd0 on /data
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Trying to
unmount /data
Jan 17 11:48:14 nfs6 systemd[1923]: data.mount: Succeeded.
Jan 17 11:48:14 nfs6 systemd[1]: data.mount: Succeeded.
Jan 17 11:48:14 nfs6 kernel: XFS (drbd0): Unmounting Filesystem
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: unmounted /data
successfully
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop
operation for fs_drbd on nfs6: ok
Jan 17 11:48:14 nfs6 kernel: drbd r0: role( Primary -> Secondary )
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of demote
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 kernel: drbd r0: Preparing cluster-wide state
change 59605293 (1->0 496/16)
Jan 17 11:48:14 nfs6 kernel: drbd r0: State change 59605293:
primary_nodes=0, weak_nodes=0
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Cluster is now split
Jan 17 11:48:14 nfs6 kernel: drbd r0: Committing cluster-wide state
change 59605293 (0ms)
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Connected ->
Disconnecting ) peer( Secondary -> Unknown )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0 nfs5: pdsk( UpToDate ->
DUnknown ) repl( Established -> Off )
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: ack_receiver terminated
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating ack_recv thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Restarting sender thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Connection closed
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command: /sbin/drbdadm
disconnected
Jan 17 11:48:14 nfs6 drbdadm[797503]: drbdadm: Unknown command
'disconnected'
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command: /sbin/drbdadm
disconnected exit code 1 (0x100)
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Disconnecting ->
StandAlone )
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating receiver thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating sender thread
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( UpToDate -> Detaching )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( Detaching -> Diskless )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: drbd_bm_resize called with
capacity == 0
Jan 17 11:48:14 nfs6 kernel: drbd r0: Terminating worker thread
Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting
master-drbd0[nfs6]: 10000 -> (unset)
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting
master-drbd0[nfs5]: 10000 -> 1000
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Current ping
state: S_NOT_DC
Secondary Node going Primary (fails)
------------------------------------
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: State
transition S_IDLE -> S_POLICY_ENGINE
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of
quorum: Ignore
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Move
fs_drbd ( nfs6 -> nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Stop
nfs5-stonith ( nfs6 ) due to node availability
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Stop
drbd0:0 ( Master nfs6 ) due to node availability
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Promote
drbd0:1 ( Slave -> Master nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated
transition 490, saving inputs in /var/lib/pacemaker/pengine/pe-input-123.bz2
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop
operation fs_drbd_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop
operation nfs5-stonith_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
cancel operation drbd0_monitor_20000 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_demote_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_demote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
demote operation drbd0_demote_0 on nfs6
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: peer( Primary -> Secondary )
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_post_notify_demote_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_post_notify_demote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_stop_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop
operation drbd0_stop_0 on nfs6
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Preparing remote state change
59605293
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Committing remote state
change 59605293 (primary_nodes=0)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Connected -> TearDown )
peer( Secondary -> Unknown )
Jan 17 11:48:14 nfs5 kernel: drbd r0/0 drbd0 nfs6: pdsk( UpToDate ->
DUnknown ) repl( Established -> Off )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: ack_receiver terminated
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Terminating ack_recv thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting sender thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Connection closed
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
disconnected
Jan 17 11:48:14 nfs5 drbdadm[570326]: drbdadm: Unknown command
'disconnected'
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
disconnected exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( TearDown -> Unconnected )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting receiver thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Unconnected -> Connecting )
Jan 17 11:48:14 nfs5 pacemaker-attrd[207003]: notice: Setting
master-drbd0[nfs6]: 10000 -> (unset)
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Transition 490
aborted by deletion of nvpair[@id='status-2-master-drbd0']: Transient
attribute change
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_post_notify_stop_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-attrd[207003]: notice: Setting
master-drbd0[nfs5]: 10000 -> 1000
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Transition 490
(Complete=27, Pending=0, Fired=0, Skipped=1, Incomplete=13,
Source=/var/lib/pacemaker/pengine/pe-input-123.bz2): Stopped
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of
quorum: Ignore
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Start
fs_drbd ( nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Promote
drbd0:0 ( Slave -> Master nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated
transition 491, saving inputs in /var/lib/pacemaker/pengine/pe-input-124.bz2
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_promote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
promote operation drbd0_promote_0 locally on nfs5
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]:
DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf
DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0
DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.3.35 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6
DRBD_PEER_ADDRESS=10.1.3.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1
DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001
/usr/lib/drbd/crm-fence-peer.9.sh
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]:
(connect_with_main_loop) #011debug: Connected to controller IPC
(attached to main loop)
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (post_connect)
#011debug: Sent IPC hello to controller
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_ipcc_disconnect)
#011debug: qb_ipcc_disconnect()
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570438-16-rx95PO/qb-request-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570438-16-rx95PO/qb-response-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570438-16-rx95PO/qb-event-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (ipc_post_disconnect)
#011info: Disconnected from controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (pcmk_free_ipc_api)
#011debug: Releasing controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (crm_xml_cleanup)
#011info: Cleaning up memory from libxml2
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (crm_exit) #011info:
Exiting crm_node | with status 0
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: /
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: Could not connect to
the CIB: No such device or address
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: Init failed, could not
perform requested operations
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: WARNING DATA INTEGRITY
at RISK: could not place the fencing constraint!
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: fence-peer helper broken,
returned 1
Jan 17 11:48:14 nfs5 kernel: drbd r0: State change failed: Refusing to
be Primary while peer is not outdated
Jan 17 11:48:14 nfs5 kernel: drbd r0: Failed: role( Secondary -> Primary )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]:
DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf
DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0
DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.3.35 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6
DRBD_PEER_ADDRESS=10.1.3.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1
DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001
/usr/lib/drbd/crm-fence-peer.9.sh
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]:
(connect_with_main_loop) #011debug: Connected to controller IPC
(attached to main loop)
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (post_connect)
#011debug: Sent IPC hello to controller
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_ipcc_disconnect)
#011debug: qb_ipcc_disconnect()
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570490-16-D2a84t/qb-request-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570490-16-D2a84t/qb-response-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570490-16-D2a84t/qb-event-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (ipc_post_disconnect)
#011info: Disconnected from controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (pcmk_free_ipc_api)
#011debug: Releasing controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (crm_xml_cleanup)
#011info: Cleaning up memory from libxml2
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (crm_exit) #011info:
Exiting crm_node | with status 0
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: /
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: Could not connect to
the CIB: No such device or address
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: Init failed, could not
perform requested operations
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: WARNING DATA INTEGRITY
at RISK: could not place the fencing constraint!
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: fence-peer helper broken,
returned 1
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Called drbdadm -c
/etc/drbd.conf primary r0
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Exit code 11
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Command output:
Jan 17 11:48:15 nfs5 drbd(drbd0)[570375]: ERROR: r0: Command stderr: r0:
State change failed: (-7) Refusing to be Primary while peer is not
outdated#012Command 'drbdsetup primary r0' terminated with exit code 11
Jan 17 11:48:15 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer
...
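For context, the fence-peer handler that keeps failing above is hooked into
the resource configuration roughly like this (paraphrased from the DRBD 9
guide, not a verbatim copy of my r0 config, so treat the exact values and
section placement as assumptions):

    resource r0 {
      net {
        fencing resource-and-stonith;   # or resource-only; placement per the DRBD 9 docs
      }
      handlers {
        fence-peer   "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
      }
      ...
    }

The script itself is clearly being invoked; it just cannot reach the CIB
('Could not connect to the CIB: No such device or address') and so never
places the fencing constraint, which appears to be why DRBD then refuses
the Secondary -> Primary transition.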
On 1/16/2021 11:07 AM, Strahil Nikolov wrote:
> At 14:10 -0700 on 15.01.2021 (Fri), Brent Jensen wrote:
>>
>> Problem: When performing "pcs node standby" on the current master,
>> this node demotes fine but the slave doesn't promote to master. It
>> keeps looping the same error including "Refusing to be Primary while
>> peer is not outdated" and "Could not connect to the CIB." At this
>> point the old master has already unloaded drbd. The only way to fix
>> it is to start drbd on the standby node (e.g. drbdadm up r0). Logs
>> contained herein are from the node trying to be master.
>>
> In order to debug, stop the cluster and verify that drbd is running
> properly. Promote one of the nodes, then demote and promote another one...
>> I have done this on DRBD9/Centos7/Pacemaker1 w/o any problems, so I
>> don't know where the issue is (crm-fence-peer.9.sh).
>>
>> Another odd data point: On the slave if I do a "pcs node standby" &
>> then unstandby, DRBD is loaded again; HOWEVER, when I do this on the
>> master (which should then be slave), DRBD doesn't get loaded.
>>
>> Stonith/Fencing doesn't seem to make a difference. Not sure if
>> auto-promote is required.
>>
> Quote from the official documentation
> (https://www.linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-pacemaker-crm-drbd-backed-service):
> If you are employing the DRBD OCF resource agent, it is recommended
> that you defer DRBD startup, shutdown, promotion, and demotion
> /exclusively/ to the OCF resource agent. That means that you should
> disable the DRBD init script:
> So remove auto-promote and disable the drbd service entirely.
>
> Best Regards, Strahil Nikolov
>
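Regarding the suggestion above to take the cluster out of the picture and
exercise DRBD by hand, I read that as roughly this sequence (my
interpretation; all standard pcs/drbdadm commands):

    pcs cluster stop --all      # stop Pacemaker/Corosync on both nodes
    drbdadm up r0               # on both nodes
    drbdadm status r0           # wait for Connected / UpToDate on both sides
    drbdadm primary r0          # promote on one node
    drbdadm secondary r0        # demote it again
    # then promote on the other node and repeat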
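And for the "disable the DRBD init script / drop auto-promote" part, I take
that to mean something like the following (a sketch; assuming auto-promote
sits in the options section of /etc/drbd.d/global_common.conf or the
resource file):

    systemctl disable drbd      # never start DRBD outside of Pacemaker

    # in the DRBD configuration, remove or comment out:
    # options {
    #     auto-promote yes;
    # }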