[ClusterLabs] DRBD 2-node M/S doesn't want to promote new master, Centos 8
Brent Jensen
jeneral9 at gmail.com
Sun Jan 17 14:00:14 EST 2021
Here are some more log excerpts; note the error from 'helper command:
/sbin/drbdadm disconnected' in the primary node's log.
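That 'Unknown command' message makes me wonder whether the drbd-utils
userland and the DRBD kernel module versions actually agree (just a guess
on my part). These show both versions on each node:

    cat /proc/drbd        # version of the loaded kernel module
    drbdadm --version     # DRBDADM_VERSION / DRBD_KERNEL_VERSION from drbd-utils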
Master (Primary) Node going Standby
-----------------------------------
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop
operation for nfs5-stonith on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Running stop for
/dev/drbd0 on /data
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: Trying to
unmount /data
Jan 17 11:48:14 nfs6 systemd[1923]: data.mount: Succeeded.
Jan 17 11:48:14 nfs6 systemd[1]: data.mount: Succeeded.
Jan 17 11:48:14 nfs6 kernel: XFS (drbd0): Unmounting Filesystem
Jan 17 11:48:14 nfs6 Filesystem(fs_drbd)[797290]: INFO: unmounted /data
successfully
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop
operation for fs_drbd on nfs6: ok
Jan 17 11:48:14 nfs6 kernel: drbd r0: role( Primary -> Secondary )
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of demote
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of notify
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 kernel: drbd r0: Preparing cluster-wide state
change 59605293 (1->0 496/16)
Jan 17 11:48:14 nfs6 kernel: drbd r0: State change 59605293:
primary_nodes=0, weak_nodes=0
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Cluster is now split
Jan 17 11:48:14 nfs6 kernel: drbd r0: Committing cluster-wide state
change 59605293 (0ms)
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Connected ->
Disconnecting ) peer( Secondary -> Unknown )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0 nfs5: pdsk( UpToDate ->
DUnknown ) repl( Established -> Off )
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: ack_receiver terminated
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating ack_recv thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Restarting sender thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Connection closed
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command: /sbin/drbdadm
disconnected
Jan 17 11:48:14 nfs6 drbdadm[797503]: drbdadm: Unknown command
'disconnected'
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: helper command: /sbin/drbdadm
disconnected exit code 1 (0x100)
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: conn( Disconnecting ->
StandAlone )
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating receiver thread
Jan 17 11:48:14 nfs6 kernel: drbd r0 nfs5: Terminating sender thread
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( UpToDate -> Detaching )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: disk( Detaching -> Diskless )
Jan 17 11:48:14 nfs6 kernel: drbd r0/0 drbd0: drbd_bm_resize called with
capacity == 0
Jan 17 11:48:14 nfs6 kernel: drbd r0: Terminating worker thread
Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting
master-drbd0[nfs6]: 10000 -> (unset)
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Result of stop
operation for drbd0 on nfs6: ok
Jan 17 11:48:14 nfs6 pacemaker-attrd[1691]: notice: Setting
master-drbd0[nfs5]: 10000 -> 1000
Jan 17 11:48:14 nfs6 pacemaker-controld[1693]: notice: Current ping
state: S_NOT_DC
Secondary Node going Primary (fails)
------------------------------------
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: State
transition S_IDLE -> S_POLICY_ENGINE
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of
quorum: Ignore
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Move
fs_drbd ( nfs6 -> nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Stop
nfs5-stonith ( nfs6 ) due to node availability
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Stop
drbd0:0 ( Master nfs6 ) due to node availability
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Promote
drbd0:1 ( Slave -> Master nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated
transition 490, saving inputs in /var/lib/pacemaker/pengine/pe-input-123.bz2
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop
operation fs_drbd_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop
operation nfs5-stonith_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
cancel operation drbd0_monitor_20000 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_demote_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_demote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
demote operation drbd0_demote_0 on nfs6
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: peer( Primary -> Secondary )
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_post_notify_demote_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_post_notify_demote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_stop_0 on nfs6
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_stop_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating stop
operation drbd0_stop_0 on nfs6
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Preparing remote state change
59605293
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Committing remote state
change 59605293 (primary_nodes=0)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Connected -> TearDown )
peer( Secondary -> Unknown )
Jan 17 11:48:14 nfs5 kernel: drbd r0/0 drbd0 nfs6: pdsk( UpToDate ->
DUnknown ) repl( Established -> Off )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: ack_receiver terminated
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Terminating ack_recv thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting sender thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Connection closed
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
disconnected
Jan 17 11:48:14 nfs5 drbdadm[570326]: drbdadm: Unknown command
'disconnected'
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
disconnected exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( TearDown -> Unconnected )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: Restarting receiver thread
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: conn( Unconnected -> Connecting )
Jan 17 11:48:14 nfs5 pacemaker-attrd[207003]: notice: Setting
master-drbd0[nfs6]: 10000 -> (unset)
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Transition 490
aborted by deletion of nvpair[@id='status-2-master-drbd0']: Transient
attribute change
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_post_notify_stop_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-attrd[207003]: notice: Setting
master-drbd0[nfs5]: 10000 -> 1000
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Transition 490
(Complete=27, Pending=0, Fired=0, Skipped=1, Incomplete=13,
Source=/var/lib/pacemaker/pengine/pe-input-123.bz2): Stopped
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: On loss of
quorum: Ignore
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Start
fs_drbd ( nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: * Promote
drbd0:0 ( Slave -> Master nfs5 )
Jan 17 11:48:14 nfs5 pacemaker-schedulerd[207004]: notice: Calculated
transition 491, saving inputs in /var/lib/pacemaker/pengine/pe-input-124.bz2
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
notify operation drbd0_pre_notify_promote_0 locally on nfs5
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Result of
notify operation for drbd0 on nfs5: ok
Jan 17 11:48:14 nfs5 pacemaker-controld[207005]: notice: Initiating
promote operation drbd0_promote_0 locally on nfs5
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]:
DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf
DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0
DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.3.35 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6
DRBD_PEER_ADDRESS=10.1.3.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1
DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001
/usr/lib/drbd/crm-fence-peer.9.sh
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]:
(connect_with_main_loop) #011debug: Connected to controller IPC
(attached to main loop)
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (post_connect)
#011debug: Sent IPC hello to controller
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_ipcc_disconnect)
#011debug: qb_ipcc_disconnect()
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570438-16-rx95PO/qb-request-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570438-16-rx95PO/qb-response-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570438-16-rx95PO/qb-event-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (ipc_post_disconnect)
#011info: Disconnected from controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (pcmk_free_ipc_api)
#011debug: Releasing controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (crm_xml_cleanup)
#011info: Cleaning up memory from libxml2
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: (crm_exit) #011info:
Exiting crm_node | with status 0
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: /
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: Could not connect to
the CIB: No such device or address
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: Init failed, could not
perform requested operations
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570403]: WARNING DATA INTEGRITY
at RISK: could not place the fencing constraint!
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: fence-peer helper broken,
returned 1
Jan 17 11:48:14 nfs5 kernel: drbd r0: State change failed: Refusing to
be Primary while peer is not outdated
Jan 17 11:48:14 nfs5 kernel: drbd r0: Failed: role( Secondary -> Primary )
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]:
DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf
DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0
DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.3.35 DRBD_MY_AF=ipv4
DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6
DRBD_PEER_ADDRESS=10.1.3.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1
DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001
/usr/lib/drbd/crm-fence-peer.9.sh
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_open_2)
#011debug: shm size:131085; real_size:135168; rb->word_size:33792
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]:
(connect_with_main_loop) #011debug: Connected to controller IPC
(attached to main loop)
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (post_connect)
#011debug: Sent IPC hello to controller
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_ipcc_disconnect)
#011debug: qb_ipcc_disconnect()
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570490-16-D2a84t/qb-request-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570490-16-D2a84t/qb-response-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (qb_rb_close_helper)
#011debug: Closing ringbuffer:
/dev/shm/qb-207005-570490-16-D2a84t/qb-event-crmd-header
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (ipc_post_disconnect)
#011info: Disconnected from controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (pcmk_free_ipc_api)
#011debug: Releasing controller IPC API
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (crm_xml_cleanup)
#011info: Cleaning up memory from libxml2
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: (crm_exit) #011info:
Exiting crm_node | with status 0
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: /
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: Could not connect to
the CIB: No such device or address
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: Init failed, could not
perform requested operations
Jan 17 11:48:14 nfs5 crm-fence-peer.9.sh[570455]: WARNING DATA INTEGRITY
at RISK: could not place the fencing constraint!
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer exit code 1 (0x100)
Jan 17 11:48:14 nfs5 kernel: drbd r0 nfs6: fence-peer helper broken,
returned 1
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Called drbdadm -c
/etc/drbd.conf primary r0
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Exit code 11
Jan 17 11:48:14 nfs5 drbd(drbd0)[570375]: ERROR: r0: Command output:
Jan 17 11:48:15 nfs5 drbd(drbd0)[570375]: ERROR: r0: Command stderr: r0:
State change failed: (-7) Refusing to be Primary while peer is not
outdated#012Command 'drbdsetup primary r0' terminated with exit code 11
Jan 17 11:48:15 nfs5 kernel: drbd r0 nfs6: helper command: /sbin/drbdadm
fence-peer
...
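For context, the fence-peer handler that keeps failing above is hooked into
the resource configuration roughly like this (paraphrased from the DRBD 9
guide, not a verbatim copy of my r0 config, so treat the exact values and
section placement as assumptions):

    resource r0 {
      net {
        fencing resource-and-stonith;   # or resource-only; placement per the DRBD 9 docs
      }
      handlers {
        fence-peer   "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
      }
      ...
    }

The script itself is clearly being invoked; it just cannot reach the CIB
('Could not connect to the CIB: No such device or address') and so never
places the fencing constraint, which appears to be why DRBD then refuses
the Secondary -> Primary transition.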
On 1/16/2021 11:07 AM, Strahil Nikolov wrote:
> At 14:10 -0700 on 15.01.2021 (Fri), Brent Jensen wrote:
>>
>> Problem: When performing "pcs node standby" on the current master,
>> this node demotes fine but the slave doesn't promote to master. It
>> keeps looping the same error including "Refusing to be Primary while
>> peer is not outdated" and "Could not connect to the CIB." At this
>> point the old master has already unloaded drbd. The only way to fix
>> it is to start drbd on the standby node (e.g. drbdadm up r0). Logs
>> contained herein are from the node trying to be master.
>>
> In order to debug, stop the cluster and verify that drbd is running
> properly. Promote one of the nodes, then demote and promote another one...
>> I have done this on DRBD9/Centos7/Pacemaker1 w/o any problems, so I
>> don't know where the issue is (crm-fence-peer.9.sh).
>>
>> Another odd data point: On the slave if I do a "pcs node standby" &
>> then unstandby, DRBD is loaded again; HOWEVER, when I do this on the
>> master (which should then be slave), DRBD doesn't get loaded.
>>
>> Stonith/Fencing doesn't seem to make a difference. Not sure if
>> auto-promote is required.
>>
> Quote from the official documentation
> (https://www.linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-pacemaker-crm-drbd-backed-service):
> If you are employing the DRBD OCF resource agent, it is recommended
> that you defer DRBD startup, shutdown, promotion, and demotion
> /exclusively/ to the OCF resource agent. That means that you should
> disable the DRBD init script:
> So remove auto-promote and disable the drbd service entirely.
>
> Best Regards, Strahil Nikolov
>
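Regarding the suggestion above to take the cluster out of the picture and
exercise DRBD by hand, I read that as roughly this sequence (my
interpretation; all standard pcs/drbdadm commands):

    pcs cluster stop --all      # stop Pacemaker/Corosync on both nodes
    drbdadm up r0               # on both nodes
    drbdadm status r0           # wait for Connected / UpToDate on both sides
    drbdadm primary r0          # promote on one node
    drbdadm secondary r0        # demote it again
    # then promote on the other node and repeat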
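And for the "disable the DRBD init script / drop auto-promote" part, I take
that to mean something like the following (a sketch; assuming auto-promote
sits in the options section of /etc/drbd.d/global_common.conf or the
resource file):

    systemctl disable drbd      # never start DRBD outside of Pacemaker

    # in the DRBD configuration, remove or comment out:
    # options {
    #     auto-promote yes;
    # }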