[ClusterLabs] DRBD 2-node M/S doesn't want to promote new master, Centos 8
Brent Jensen
jeneral9 at gmail.com
Fri Jan 15 16:10:00 EST 2021
Problem: When I run "pcs node standby" on the current master, that
node demotes fine, but the slave does not get promoted to master. It
keeps looping on the same errors, including "Refusing to be Primary
while peer is not outdated" and "Could not connect to the CIB." At
this point the old master has already unloaded DRBD. The only way I
have found to recover is to start DRBD manually on the standby node
(e.g. "drbdadm up r0"). The logs included below are from the node
that is trying to become master.
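For reference, here is the exact sequence I'm running, sketched as shell
commands (nfs5 is the current master; the comments describe the behavior
I observe):

# On either node, with nfs5 currently master:
pcs node standby nfs5      # nfs5 demotes, but nfs6 is never promoted
pcs status                 # drbd0-clone shows no promoted instance
drbdadm status r0          # on nfs6: loops, peer stays unreachable

# Only recovery I've found: bring the resource up by hand on nfs6
drbdadm up r0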
I have done this on DRBD 9 / CentOS 7 / Pacemaker 1 without any
problems, so I don't know where the issue is (crm-fence-peer.9.sh?
DRBD? the newer Pacemaker?). DRBD itself seems to work fine; it's
unclear whether there are additional configs I need. There are some
slight pcs config changes between CentOS 7 and 8 (Pacemaker 1 -> 2).
Another odd data point: on the slave, if I do a "pcs node standby" and
then unstandby, DRBD is loaded again; HOWEVER, when I do the same on
the master (which should then become the slave), DRBD doesn't get
loaded.
Stonith/fencing doesn't seem to make a difference. I'm not sure
whether auto-promote is required.
Appreciate any help!
Brent
Basic Config (Centos 8 packages):
--------------------------------
2 Node Master/Slave
OS: Centos8
Pacemaker: pacemaker-2.0.4-6.el8_3.1
Corosync: corosync-3.0.3-4.el8
DRBD config:
------------
resource r0 {
    protocol C;
    disk {
        on-io-error detach;
        no-disk-flushes;
        no-disk-barrier;
        c-plan-ahead 10;
        c-fill-target 24M;
        c-min-rate 10M;
        c-max-rate 1000M;
    }
    net {
        fencing resource-only;
        # max-epoch-size 20000;
        max-buffers 36k;
        sndbuf-size 1024k;
        rcvbuf-size 2048k;
    }
    handlers {
        # these handlers are necessary for drbd 9.0 + pacemaker compatibility
        fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh --timeout 30 --dc-timeout 60";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }
    options {
        auto-promote yes;
    }
    on nfs5 {
        node-id 0;
        device /dev/drbd0;
        disk /dev/sdb1;
        address 10.1.3.35:7788;
        meta-disk internal;
    }
    on nfs6 {
        node-id 1;
        device /dev/drbd0;
        disk /dev/sdb1;
        address 10.1.3.36:7788;
        meta-disk internal;
    }
}
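In case it helps anyone reproduce: the fence handler can also be exercised
by hand, to separate a script failure from a DRBD state problem. This is
just a sketch (not something from my logs); the environment variables
mimic the ones DRBD passes to the helper, and the minimal set the script
actually needs may differ by version:

# Run as root on the node that is trying to promote.
DRBD_RESOURCE=r0 DRBD_PEER_NODE_ID=1 UP_TO_DATE_NODES=0x00000001 \
    /usr/lib/drbd/crm-fence-peer.9.sh --timeout 30 --dc-timeout 60
echo "fence-peer exit code: $?"

# On success the script places a location constraint in the CIB;
# look for a drbd-fence-by-handler-* entry:
cibadmin --query --scope constraints | grep drbd-fence-by-handler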
Pacemaker Config
----------------
Cluster Name: nfs
Corosync Nodes:
 nfs5 nfs6
Pacemaker Nodes:
 nfs5 nfs6

Resources:
 Group: cluster_group
  Resource: fs_drbd (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd0 directory=/data/ fstype=xfs
   Meta Attrs: target-role=Started
   Operations: monitor interval=20s timeout=40s (fs_drbd-monitor-interval-20s)
               start interval=0 timeout=60 (fs_drbd-start-interval-0)
               stop interval=0 timeout=60 (fs_drbd-stop-interval-0)
 Clone: drbd0-clone
  Meta Attrs: clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: drbd0 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=r0
   Operations: demote interval=0s timeout=90 (drbd0-demote-interval-0s)
               monitor interval=20 role=Slave timeout=20 (drbd0-monitor-interval-20)
               monitor interval=10 role=Master timeout=20 (drbd0-monitor-interval-10)
               notify interval=0s timeout=90 (drbd0-notify-interval-0s)
               promote interval=0s timeout=90 (drbd0-promote-interval-0s)
               reload interval=0s timeout=30 (drbd0-reload-interval-0s)
               start interval=0s timeout=240 (drbd0-start-interval-0s)
               stop interval=0s timeout=100 (drbd0-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote drbd0-clone then start cluster_group (kind:Mandatory) (id:nfs_after_drbd)
Colocation Constraints:
  cluster_group with drbd0-clone (score:INFINITY) (with-rsc-role:Master) (id:nfs_on_drbd)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: nfs
 dc-version: 2.0.4-6.el8_3.1-2deceaa3ae
 have-watchdog: false
 last-lrm-refresh: 1610570527
 no-quorum-policy: ignore
 stonith-enabled: false

Tags:
 No tags defined

Quorum:
 Options:
  wait_for_all: 0
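(Not from my original post, but for anyone comparing setups: with
no-quorum-policy: ignore and wait_for_all: 0 set as above, these are the
commands I'd use to confirm what corosync and Pacemaker think the
membership and quorum state is while one node sits in standby:)

corosync-quorumtool -s     # vote counts and whether the partition is quorate
crm_mon -1                 # one-shot cluster status, including resource roles
pcs constraint location    # any leftover fencing location constraints?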
Error Logs
----------
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
kernel: drbd r0 nfs6: peer( Primary -> Secondary )
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
kernel: drbd r0 nfs6: Preparing remote state change 3411954157
kernel: drbd r0 nfs6: Committing remote state change 3411954157 (primary_nodes=0)
kernel: drbd r0 nfs6: conn( Connected -> TearDown ) peer( Secondary -> Unknown )
kernel: drbd r0/0 drbd0 nfs6: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
kernel: drbd r0 nfs6: ack_receiver terminated
kernel: drbd r0 nfs6: Terminating ack_recv thread
kernel: drbd r0 nfs6: Restarting sender thread
drbdadm[89851]: drbdadm: Unknown command 'disconnected'
kernel: drbd r0 nfs6: Connection closed
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm disconnected
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm disconnected exit code 1 (0x100)
kernel: drbd r0 nfs6: conn( TearDown -> Unconnected )
kernel: drbd r0 nfs6: Restarting receiver thread
kernel: drbd r0 nfs6: conn( Unconnected -> Connecting )
pacemaker-attrd[7671]: notice: Setting master-drbd0[nfs6]: 10000 -> (unset)
pacemaker-attrd[7671]: notice: Setting master-drbd0[nfs5]: 10000 -> 1000
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer DRBD_NODE_ID_1=nfs6 DRBD_PEER_ADDRESS=10.1.1.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
crm-fence-peer.9.sh[89928]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[89928]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[89928]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[89928]: (connect_with_main_loop) #011debug: Connected to controller IPC (attached to main loop)
crm-fence-peer.9.sh[89928]: (post_connect) #011debug: Sent IPC hello to controller
crm-fence-peer.9.sh[89928]: (qb_ipcc_disconnect) #011debug: qb_ipcc_disconnect()
crm-fence-peer.9.sh[89928]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-7673-89963-13-RTpTPN/qb-request-crmd-header
crm-fence-peer.9.sh[89928]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-7673-89963-13-RTpTPN/qb-response-crmd-header
crm-fence-peer.9.sh[89928]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-7673-89963-13-RTpTPN/qb-event-crmd-header
crm-fence-peer.9.sh[89928]: (ipc_post_disconnect) #011info: Disconnected from controller IPC API
crm-fence-peer.9.sh[89928]: (pcmk_free_ipc_api) #011debug: Releasing controller IPC API
crm-fence-peer.9.sh[89928]: (crm_xml_cleanup) #011info: Cleaning up memory from libxml2
crm-fence-peer.9.sh[89928]: (crm_exit) #011info: Exiting crm_node | with status 0
crm-fence-peer.9.sh[89928]: /
crm-fence-peer.9.sh[89928]: Could not connect to the CIB: No such device or address
crm-fence-peer.9.sh[89928]: Init failed, could not perform requested operations
crm-fence-peer.9.sh[89928]: WARNING DATA INTEGRITY at RISK: could not place the fencing constraint!
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer exit code 1 (0x100)
kernel: drbd r0 nfs6: fence-peer helper broken, returned 1
kernel: drbd r0: State change failed: Refusing to be Primary while peer is not outdated
kernel: drbd r0: Failed: role( Secondary -> Primary )
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/sdb1 DRBD_MINOR=0 DRBD_MINOR_0=0 DRBD_MY_ADDRESS=10.1.1.35 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=0 DRBD_NODE_ID_0=nfs5 DRBD_NODE_ID_1=nfs6 DRBD_PEER_ADDRESS=10.1.1.36 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=1 DRBD_RESOURCE=r0 DRBD_VOLUME=0 UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
crm-fence-peer.9.sh[24197]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[24197]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
...