[ClusterLabs] Cluster (sometimes) hangs during shutdown - EL9

Fri Jun 6 06:28:25 UTC 2025

Hi!

Maybe a configuration problem:
Operation 'monitor' [3482074] using fence_node2 could not be executed: Timed Out
In general I think there are too many errors. Could it be that a node is to be fenced, but fencing fails?

Do you use on-fail=block?
crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked

I would look at the regular syslog, too.

Kind regards,
Ulrich Windl

From: Users <users-bounces at clusterlabs.org> On Behalf Of Larry G. Mills via Users
Sent: Tuesday, May 13, 2025 12:05 AM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: Larry G. Mills <lgmills at fnal.gov>
Subject: [EXT] [ClusterLabs] Cluster (sometimes) hangs during shutdown - EL9

Hello all,

I have a fairly simple two-node cluster that supports three resources - promotable Postgres, fencing, and virtual IP. This cluster is running on AlmaLinux 9.5 (RHEL9 variant). In recent months, I have noticed that the cluster will occasionally hang when shutting down. I use "pcs" to manage the cluster, so the shutdown command used is "pcs cluster stop --all".

During the last hang, I observed that all the resources appeared to be shut down except the virtual IP - the VIP remained in the "Started" state, and the cluster remained running on the node where the VIP was running. I was eventually able to stop the cluster by issuing "pcs cluster stop --all --request-timeout=1".

I have been using this same cluster configuration (across multiple OS releases) for years, and have never experienced a shutdown hang before. Unfortunately, I cannot reliably reproduce the scenario, but it has definitely happened on multiple occasions.

Some config information:

Linux node1.my.org 5.14.0-503.38.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 18 08:52:10 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
corosync.x86_64    3.1.8-2.el9
pacemaker.x86_64   2.1.8-3.el9
pcs.x86_64         0.11.8-1.el9_5.1.alma.1

Cluster constraints:

Location Constraints:
  resource 'fence_node1' avoids node 'node1.my.orig' with score INFINITY
  resource 'fence_node2' avoids node 'node2.my.org' with score INFINITY
Colocation Constraints:
  Started resource 'pgsql-ha-vip' with Promoted resource 'pgsql-clone'
    score=INFINITY
Order Constraints:
  promote resource 'pgsql-clone' then start resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory
  demote resource 'pgsql-clone' then stop resource 'pgsql-ha-vip'
    symmetrical=0 kind=Mandatory

Although I'm not super adept at parsing the pacemaker logs, the following error messages looked problematic:

May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (log_list_item)     notice: Actions: Stop       pgsql-ha-vip     ( node1.my.org )  due to node availability (blocked)
May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (pcmk__create_graph)        crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)
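
For anyone wanting to pull these messages out of a longer log, here is a minimal Python sketch. It only assumes that the "Cannot shut down" lines have the same layout as the two quoted above; the function name find_blocked_shutdowns and the regular expression are illustrative and not part of pacemaker or pcs.

```python
import re

# Pattern modelled only on the "crit: Cannot shut down ..." line quoted in
# this message; other pacemaker versions may word it differently.
SHUTDOWN_BLOCKED = re.compile(
    r"crit: Cannot shut down (?P<node>\S+) "
    r"because of (?P<resource>[^:\s]+): (?P<reason>\S+)"
)


def find_blocked_shutdowns(lines):
    """Return a (node, resource, reason) tuple for every matching log line."""
    hits = []
    for line in lines:
        match = SHUTDOWN_BLOCKED.search(line)
        if match:
            hits.append(
                (match.group("node"), match.group("resource"), match.group("reason"))
            )
    return hits


# The two log lines from this message, used as sample input.
sample = [
    "May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (log_list_item)     "
    "notice: Actions: Stop       pgsql-ha-vip     ( node1.my.org )  "
    "due to node availability (blocked)",
    "May 08 14:59:19.492 node1.my.org pacemaker-schedulerd[7000] (pcmk__create_graph)        "
    "crit: Cannot shut down node1.my.org because of pgsql-ha-vip: blocked (pgsql-ha-vip_stop_0)",
]

for node, resource, reason in find_blocked_shutdowns(sample):
    print(f"{node}: shutdown held up by {resource} ({reason})")
```

To run it against a real log, pass an open file object instead of the sample list; the log location depends on the local configuration.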

A sanitized pacemaker log of the hang event is attached - 5/8/2025 @14:59.

Is this a latent configuration problem that's just now showing up, or a problem with the pacemaker version currently in EL9?

Any thoughts appreciated,

Larry Mills