[ClusterLabs] DRBD Cluster Problem
Damiano Giuliani
damianogiuliani87 at gmail.com
Thu Aug 10 11:33:47 EDT 2023
It seems you are not using any fencing / STONITH mechanism. A cluster is not
fully functional without one.
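
For a two-node setup like yours you could start from something along these
lines (a minimal sketch only, assuming IPMI-based fencing via fence_ipmilan;
the addresses and credentials below are placeholders to replace with your own):

pcs stonith create fence-ha1 fence_ipmilan pcmk_host_list=ha1.local \
    ip=192.0.2.11 username=admin password=changeme lanplus=1 \
    op monitor interval=60s
pcs stonith create fence-ha2 fence_ipmilan pcmk_host_list=ha2.local \
    ip=192.0.2.12 username=admin password=changeme lanplus=1 \
    op monitor interval=60s
# a fence device should not run on the node it is meant to kill
pcs constraint location fence-ha1 avoids ha1.local
pcs constraint location fence-ha2 avoids ha2.local
pcs property set stonith-enabled=true

With fencing in place you would normally also tell DRBD to use it
(fencing resource-and-stonith plus the crm-fence-peer handlers in the DRBD
resource config), so a node that is behind cannot be promoted.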
On Thu, Aug 10, 2023, 4:03 PM Tiaan Wessels <tiaanwessels at gmail.com> wrote:
> Hi,
>
> I need some help!
>
> I have a DRBD cluster and one node was switched off for a couple of days.
> The remaining node ran fine without a hiccup. When I switched the other one
> back on, I got into a situation where all resources were stopped and one DRBD
> volume was secondary while the others were primary, as the cluster seemingly
> tried to perform a role swap to the node that had just been switched on (ha1
> was live and I switched on ha2 at 08:06, for the sake of following the logs).
>
> bash-5.1# cat /proc/drbd
> version: 8.4.11 (api:1/proto:86-101)
> srcversion: 60F610B702CC05315B04B50
> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
> ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0
> pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0
> lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
> ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0
> pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The cluster state ended up as:
>
> bash-5.1# pcs status
> Cluster name: HA
> Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10
> 08:38:40Z)
> Cluster Summary:
> * Stack: corosync
> * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition
> with quorum
> * Last updated: Thu Aug 10 08:38:40 2023
> * Last change: Mon Jul 10 06:49:08 2023 by hacluster via crmd on
> ha1.local
> * 2 nodes configured
> * 14 resource instances configured
>
> Node List:
> * Online: [ ha1.local ha2.local ]
>
> Full List of Resources:
> * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
> * Promoted: [ ha2.local ]
> * Unpromoted: [ ha1.local ]
> * Resource Group: nsdrbd:
> * LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
> * LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
> * LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
> * ClusterIP (ocf:heartbeat:IPaddr2): Stopped
> * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
> * Promoted: [ ha1.local ]
> * Unpromoted: [ ha2.local ]
> * postgresql (systemd:postgresql): Stopped
> * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
> * Promoted: [ ha1.local ]
> * Unpromoted: [ ha2.local ]
> * ns_mhswdog (lsb:mhswdog): Stopped
> * Clone Set: pingd-clone [pingd]:
> * Started: [ ha1.local ha2.local ]
>
> Failed Resource Actions:
> * LV_POSTGRES promote on ha2.local could not be executed (Timed Out:
> Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023
> after 1m30.003s
> * LV_BLOB promote on ha2.local could not be executed (Timed Out:
> Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023
> after 1m30.001s
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> I attach the logs of the two nodes, as well as the output of pcs config
> show.
>
> My questions:
> - Can anyone help me figure out what happened here?
> - As a side question, if a situation like this resolves itself, is there a
> way to have pcs do a resource cleanup automatically?
>
> Thanks
>