[ClusterLabs] DRBD Cluster Problem

Damiano Giuliani damianogiuliani87 at gmail.com
Thu Aug 10 11:33:47 EDT 2023


It seems you are not using any fencing (STONITH) mechanism. A cluster is
not fully functional without it.
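
One way to check this from the output you posted: fence devices normally
appear in pcs status as resources with the "stonith:" class, and none show
up in your resource list. Below is a minimal Python sketch of that check,
assuming the status text is available as a string. The function name, the
sample text, and the fence device named in the comment are only
illustrative and not taken from your configuration.

```python
import re

# Matches resource lines such as
#   "* fence_ha1 (stonith:fence_ipmilan): Started ha1.local"
# The device name in that example is hypothetical; the check relies only
# on the "(stonith:...)" resource class prefix.
STONITH_LINE = re.compile(r"^\s*\*\s+(\S+)\s+\(stonith:([^)]+)\)", re.MULTILINE)


def find_stonith_resources(status_text):
    """Return (resource_name, agent) pairs for fence devices in pcs status text."""
    # Strip mail quoting ("> ") so output pasted from a list post parses too.
    unquoted = re.sub(r"^> ?", "", status_text, flags=re.MULTILINE)
    return STONITH_LINE.findall(unquoted)


# Abbreviated excerpt of the resource list from the original post.
sample = """\
Full List of Resources:
  * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
    * Promoted: [ ha2.local ]
    * Unpromoted: [ ha1.local ]
  * postgresql (systemd:postgresql): Stopped
  * ns_mhswdog (lsb:mhswdog): Stopped
"""

devices = find_stonith_resources(sample)
if not devices:
    print("no stonith resources found in pcs status output")
```

On a live node you would feed it the real output of pcs status. An empty
result is consistent with no fence device being configured, but it is only
a hint, not proof.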


On Thu, Aug 10, 2023, 4:03 PM Tiaan Wessels <tiaanwessels at gmail.com> wrote:

> Hi,
>
> I need some help!
>
> I have a DRBD cluster, and one node was switched off for a couple of days.
> The remaining node ran fine without a hiccup. When I switched the other
> node back on, I got into a situation where all resources were stopped and
> one DRBD volume was secondary while the others were primary, as the
> cluster seemingly tried to perform a role swap to the node just switched
> on. (To help with reading the logs: ha1 was live, and I switched on ha2
> at 08:06.)
>
> bash-5.1# cat /proc/drbd
> version: 8.4.11 (api:1/proto:86-101)
> srcversion: 60F610B702CC05315B04B50
>  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>     ns:109798092 nr:90528 dw:373317496 dr:353811713 al:558387 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:415010252 nr:188601628 dw:1396698240 dr:1032339078 al:1387347 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>     ns:27957772 nr:21354732 dw:97210572 dr:100798651 al:5283 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The cluster state ended up as
>
> bash-5.1# pcs status
> Cluster name: HA
> Status of pacemakerd: 'Pacemaker is running' (last updated 2023-08-10 08:38:40Z)
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: ha2.local (version 2.1.4-5.el9_1.2-dc6eb4362e) - partition with quorum
>   * Last updated: Thu Aug 10 08:38:40 2023
>   * Last change:  Mon Jul 10 06:49:08 2023 by hacluster via crmd on ha1.local
>   * 2 nodes configured
>   * 14 resource instances configured
>
> Node List:
>   * Online: [ ha1.local ha2.local ]
>
> Full List of Resources:
>   * Clone Set: LV_BLOB-clone [LV_BLOB] (promotable):
>     * Promoted: [ ha2.local ]
>     * Unpromoted: [ ha1.local ]
>   * Resource Group: nsdrbd:
>     * LV_BLOBFS (ocf:heartbeat:Filesystem): Started ha2.local
>     * LV_POSTGRESFS (ocf:heartbeat:Filesystem): Stopped
>     * LV_HOMEFS (ocf:heartbeat:Filesystem): Stopped
>     * ClusterIP (ocf:heartbeat:IPaddr2): Stopped
>   * Clone Set: LV_POSTGRES-clone [LV_POSTGRES] (promotable):
>     * Promoted: [ ha1.local ]
>     * Unpromoted: [ ha2.local ]
>   * postgresql (systemd:postgresql): Stopped
>   * Clone Set: LV_HOME-clone [LV_HOME] (promotable):
>     * Promoted: [ ha1.local ]
>     * Unpromoted: [ ha2.local ]
>   * ns_mhswdog (lsb:mhswdog): Stopped
>   * Clone Set: pingd-clone [pingd]:
>     * Started: [ ha1.local ha2.local ]
>
> Failed Resource Actions:
>   * LV_POSTGRES promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:19:27 2023 after 1m30.003s
>   * LV_BLOB promote on ha2.local could not be executed (Timed Out: Resource agent did not complete within 1m30s) at Thu Aug 10 08:15:38 2023 after 1m30.001s
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> I have attached the logs of the two nodes, along with the output of
> pcs config show.
>
> My questions:
> - Can anyone help me figure out what happened here?
> - As a side question: if a situation resolves itself, is there a way to
> have pcs do a resource cleanup by itself?
>
> Thanks