[ClusterLabs] DRBD Cluster Problem
Tiaan Wessels
tiaanwessels at gmail.com
Thu Aug 10 06:00:49 EDT 2023
Hi,
I need some help!
I have a two-node DRBD cluster, and one node (ha2) was switched off for a
couple of days. The remaining node (ha1) ran fine on its own without a
hiccup. When I switched ha2 back on (at 08:06, for reference when reading
the logs), I ended up in a situation where all resources were stopped, with
one DRBD volume Secondary and the others Primary, seemingly because the
cluster tried to swap roles over to the node that had just been switched on:
bash-5.1# cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: 60F610B702CC05315B04B50
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
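The role split in the output above is easier to spot when each device is reduced to one line. The following is only a rough sketch, assuming the DRBD 8.4 /proc/drbd layout shown here; the function name drbd_roles is made up for illustration.

```shell
#!/bin/sh
# drbd_roles (hypothetical helper): read /proc/drbd-style text on stdin and
# print one line per minor: "<minor> <connection state> <local role> <peer role>".
drbd_roles() {
  awk '
    # Device lines start with the minor number followed by a colon.
    $1 ~ /^[0-9]+:$/ {
      minor = $1; sub(/:$/, "", minor)
      cs = ""; ro = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^cs:/) cs = substr($i, 4)
        if ($i ~ /^ro:/) ro = substr($i, 4)
      }
      split(ro, role, "/")   # role[1] is the local role, role[2] the peer role
      print minor, cs, role[1], role[2]
    }'
}

# Example using the first device from the output above.
drbd_roles <<'EOF'
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0
EOF
# prints: 0 Connected Secondary Primary
```

On a live node the same function could be fed directly with `drbd_roles < /proc/drbd`.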
The cluster state ended up as follows:
bash-5.1# pcs status
Cluster name: HA
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
Cluster Summary:
  * Stack: corosync
  * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
  * Last updated: Thu Aug 10 08:38:40 2023
  * Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
  * 2 nodes configured
  * 14 resource instances configured
Node List:
  * Online: [ ha1.local ha2.local ]
Full List of Resources:
  * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
    * Promoted: [ ha2.local ]
    * Unpromoted: [ ha1.local ]
  * Resource Group: nsdrbd:
    * LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
    * LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
    * LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
    * ClusterIP (ocf:heartbeat:IPaddr2): Stopped
  * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * postgresql (systemd:postgresql): Stopped
  * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
    * Promoted: [ ha1.local ]
    * Unpromoted: [ ha2.local ]
  * ns_mhswdog (lsb:mhswdog): Stopped
  * Clone Set: pingd-clone [pingd]:
    * Started: [ ha1.local ha2.local ]
Failed Resource Actions:
  * LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
  * LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
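To see at a glance which node each DRBD clone is promoted on and which resources are stopped, the status text above can be condensed. This is a minimal sketch that only assumes the line shapes visible in this output ("Clone Set:", "Promoted: [ node ]" and a trailing "Stopped"); status_summary is a made-up name.

```shell
#!/bin/sh
# status_summary (hypothetical helper): read `pcs status` text on stdin and print
#   "promoted <clone> <node>" for each promotable clone set, and
#   "stopped <resource>"      for each resource reported as Stopped.
status_summary() {
  awk '
    $2 == "Clone" && $3 == "Set:" { clone = $4 }          # remember current clone set
    $2 == "Promoted:"             { print "promoted", clone, $4 }
    $NF == "Stopped"              { print "stopped", $2 }
  '
}
```

On a live node this could be run as `pcs status | status_summary`. Applied to the output above, it would report LV_BLOB-clone as promoted on ha2.local, and LV_POSTGRES-clone and LV_HOME-clone as promoted on ha1.local.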
I have attached the logs of both nodes, as well as the output of
pcs config show.
My questions:
- Can anyone help me figure out what happened here?
- As a side question: once a situation like this has resolved itself, is
there a way to have pcs perform the resource cleanup automatically?
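On the cleanup question, Pacemaker has a failure-timeout resource meta-attribute, after which a recorded failure is treated as expired without a manual pcs resource cleanup. The following is a sketch only: the 10min value is an arbitrary assumption, the PCS variable and the function name set_failure_timeout are hypothetical and exist just so the command can be previewed, and whether this is appropriate for the DRBD clones here has not been verified.

```shell
#!/bin/sh
# set_failure_timeout RESOURCE DURATION (hypothetical helper)
# Sets the failure-timeout meta-attribute on RESOURCE via `pcs resource meta`,
# so that recorded failures expire after DURATION.
# PCS is an assumed override: PCS=echo prints the arguments instead of running pcs.
set_failure_timeout() {
  "${PCS:-pcs}" resource meta "$1" failure-timeout="$2"
}

# Dry run: nothing on the cluster is changed.
PCS=echo set_failure_timeout LV_BLOB-clone 10min
# prints: resource meta LV_BLOB-clone failure-timeout=10min
```

Without the PCS=echo prefix the function would call pcs for real, so it is best tried on a test cluster first.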
Thanks
-------------- next part --------------
Cluster Name: HA
Corosync Nodes:
ha1.local ha2.local
Pacemaker Nodes:
ha1.local ha2.local
Resources:
Resource: postgresql (class=systemd type=postgresql)
Operations:
monitor: postgresql-monitor-interval-60s
interval=60s
start: postgresql-start-interval-0s
interval=0s
timeout=100
stop: postgresql-stop-interval-0s
interval=0s
timeout=100
Resource: ns_mhswdog (class=lsb type=mhswdog)
Operations:
force-reload: ns_mhswdog-force-reload-interval-0s
interval=0s
timeout=15
monitor: ns_mhswdog-monitor-interval-60s
interval=60s
timeout=10s
on-fail=standby
restart: ns_mhswdog-restart-interval-0s
interval=0s
timeout=140s
start: ns_mhswdog-start-interval-0s
interval=0s
timeout=80s
stop: ns_mhswdog-stop-interval-0s
interval=0s
timeout=80s
Group: nsdrbd
Resource: LV_BLOBFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_BLOBFS-instance_attributes
device=/dev/drbd0
directory=/data
fstype=ext4
Operations:
monitor: LV_BLOBFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_BLOBFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_BLOBFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_POSTGRESFS-instance_attributes
device=/dev/drbd1
directory=/var/lib/pgsql
fstype=ext4
Operations:
monitor: LV_POSTGRESFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_POSTGRESFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_POSTGRESFS-stop-interval-0s
interval=0s
timeout=60s
Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: LV_HOMEFS-instance_attributes
device=/dev/drbd2
directory=/home
fstype=ext4
Operations:
monitor: LV_HOMEFS-monitor-interval-20s
interval=20s
timeout=40s
start: LV_HOMEFS-start-interval-0s
interval=0s
timeout=60s
stop: LV_HOMEFS-stop-interval-0s
interval=0s
timeout=60s
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ClusterIP-instance_attributes
cidr_netmask=32
ip=192.168.51.75
Operations:
monitor: ClusterIP-monitor-interval-60s
interval=60s
start: ClusterIP-start-interval-0s
interval=0s
timeout=20s
stop: ClusterIP-stop-interval-0s
interval=0s
timeout=20s
Clone: LV_BLOB-clone
Meta Attributes: LV_BLOB-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_BLOB (class=ocf provider=linbit type=drbd)
Attributes: LV_BLOB-instance_attributes
drbd_resource=lv_blob
Operations:
demote: LV_BLOB-demote-interval-0s
interval=0s
timeout=90
monitor: LV_BLOB-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_BLOB-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_BLOB-notify-interval-0s
interval=0s
timeout=90
promote: LV_BLOB-promote-interval-0s
interval=0s
timeout=90
reload: LV_BLOB-reload-interval-0s
interval=0s
timeout=30
start: LV_BLOB-start-interval-0s
interval=0s
timeout=240
stop: LV_BLOB-stop-interval-0s
interval=0s
timeout=100
Clone: LV_POSTGRES-clone
Meta Attributes: LV_POSTGRES-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
Attributes: LV_POSTGRES-instance_attributes
drbd_resource=lv_postgres
Operations:
demote: LV_POSTGRES-demote-interval-0s
interval=0s
timeout=90
monitor: LV_POSTGRES-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_POSTGRES-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_POSTGRES-notify-interval-0s
interval=0s
timeout=90
promote: LV_POSTGRES-promote-interval-0s
interval=0s
timeout=90
reload: LV_POSTGRES-reload-interval-0s
interval=0s
timeout=30
start: LV_POSTGRES-start-interval-0s
interval=0s
timeout=240
stop: LV_POSTGRES-stop-interval-0s
interval=0s
timeout=100
Clone: LV_HOME-clone
Meta Attributes: LV_HOME-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=1
promoted-node-max=1
Resource: LV_HOME (class=ocf provider=linbit type=drbd)
Attributes: LV_HOME-instance_attributes
drbd_resource=lv_home
Operations:
demote: LV_HOME-demote-interval-0s
interval=0s
timeout=90
monitor: LV_HOME-monitor-interval-60s
interval=60s
role=Promoted
monitor: LV_HOME-monitor-interval-63s
interval=63s
role=Unpromoted
notify: LV_HOME-notify-interval-0s
interval=0s
timeout=90
promote: LV_HOME-promote-interval-0s
interval=0s
timeout=90
reload: LV_HOME-reload-interval-0s
interval=0s
timeout=30
start: LV_HOME-start-interval-0s
interval=0s
timeout=240
stop: LV_HOME-stop-interval-0s
interval=0s
timeout=100
Clone: pingd-clone
Resource: pingd (class=ocf provider=pacemaker type=ping)
Attributes: pingd-instance_attributes
dampen=6s
host_list=192.168.51.251
multiplier=1000
Operations:
monitor: pingd-monitor-interval-10s
interval=10s
timeout=60s
reload-agent: pingd-reload-agent-interval-0s
interval=0s
timeout=20s
start: pingd-start-interval-0s
interval=0s
timeout=60s
stop: pingd-stop-interval-0s
interval=0s
timeout=20s
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: ClusterIP
Constraint: location-ClusterIP
Rule: boolean-op=or score=-INFINITY (id:location-ClusterIP-rule)
Expression: pingd lt 1 (id:location-ClusterIP-rule-expr)
Expression: not_defined pingd (id:location-ClusterIP-rule-expr-1)
Ordering Constraints:
promote LV_BLOB-clone then start LV_BLOBFS (kind:Mandatory) (id:order-LV_BLOB-clone-LV_BLOBFS-mandatory)
promote LV_POSTGRES-clone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRES-clone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql-mandatory)
promote LV_HOME-clone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOME-clone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_HOMEFS-ns_mhswdog-mandatory)
start LV_BLOBFS then start ns_mhswdog (kind:Mandatory) (id:order-LV_BLOBFS-ns_mhswdog-mandatory)
start postgresql then start ns_mhswdog (kind:Mandatory) (id:order-postgresql-ns_mhswdog-mandatory)
start ns_mhswdog then start ClusterIP (kind:Mandatory) (id:order-ns_mhswdog-ClusterIP-mandatory)
Colocation Constraints:
LV_BLOBFS with LV_BLOB-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_BLOBFS-LV_BLOB-clone-INFINITY)
LV_POSTGRESFS with LV_POSTGRES-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_POSTGRESFS-LV_POSTGRES-clone-INFINITY)
postgresql with LV_POSTGRESFS (score:INFINITY) (id:colocation-postgresql-LV_POSTGRESFS-INFINITY)
LV_HOMEFS with LV_HOME-clone (score:INFINITY) (with-rsc-role:Promoted) (id:colocation-LV_HOMEFS-LV_HOME-clone-INFINITY)
ns_mhswdog with LV_HOMEFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_HOMEFS-INFINITY)
ns_mhswdog with LV_BLOBFS (score:INFINITY) (id:colocation-ns_mhswdog-LV_BLOBFS-INFINITY)
ns_mhswdog with postgresql (score:INFINITY) (id:colocation-ns_mhswdog-postgresql-INFINITY)
ClusterIP with ns_mhswdog (score:INFINITY) (id:colocation-ClusterIP-ns_mhswdog-INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
Meta Attrs: build-resource-defaults
resource-stickiness=INFINITY
Operations Defaults:
Meta Attrs: op_defaults-meta_attributes
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: HA
dc-version: 2.1.4-5.el9_1.2-dc6eb4362e
have-watchdog: false
last-lrm-refresh: 1688971748
maintenance-mode: false
no-quorum-policy: ignore
stonith-enabled: false
Tags:
No tags defined
Quorum:
Options:
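Read together, the INFINITY colocation constraints in the dump above chain ns_mhswdog to all three filesystems and, through them, to the Promoted instance of all three DRBD clones. The sketch below follows that chain mechanically. It assumes constraint lines in the "A with B (...)" form printed above, ignores scores and roles, and uses a made-up function name, colocation_deps.

```shell
#!/bin/sh
# colocation_deps TARGET (hypothetical helper): read "A with B (...)" lines on
# stdin and print every resource TARGET must share a node with, directly or
# through other colocation constraints. Scores and roles are ignored.
colocation_deps() {
  awk -v target="$1" '
    $2 == "with" { n++; dep[n] = $1; on[n] = $3 }   # dep[n] is placed with on[n]
    END {
      need[target] = 1
      # Repeat until no new resource is added (simple transitive closure).
      do {
        changed = 0
        for (i = 1; i <= n; i++)
          if ((dep[i] in need) && !(on[i] in need)) {
            need[on[i]] = 1
            changed = 1
          }
      } while (changed)
      delete need[target]
      for (r in need) print r
    }' | LC_ALL=C sort
}
```

Fed with the eight colocation lines above and the target ns_mhswdog, it lists the three filesystems, postgresql and the three DRBD clones, which is why ns_mhswdog and ClusterIP can only start on a node where all three clones are promoted.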
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ha2-corosync.log
Type: text/x-log
Size: 7032 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20230810/a6b98fe1/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ha1-corosync.log
Type: text/x-log
Size: 1767 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20230810/a6b98fe1/attachment-0003.bin>