[ClusterLabs] Fence node when network interface goes down
S Rogers
sa.rogers1342 at gmail.com
Sun Nov 14 16:59:37 EST 2021
The mentioned error occurs when attempting to promote the PostgreSQL
resource on the standby node, after the master PostgreSQL resource is
stopped.
For info, here is my configuration:
Corosync Nodes:
 node1.local node2.local
Pacemaker Nodes:
 node1.local node2.local

Resources:
 Clone: public_network_monitor-clone
  Resource: public_network_monitor (class=ocf provider=heartbeat type=ethmonitor)
   Attributes: interface=eth0 link_status_only=true name=ethmonitor-public
   Operations: monitor interval=10s timeout=60s (public_network_monitor-monitor-interval-10s)
               start interval=0s timeout=60s (public_network_monitor-start-interval-0s)
               stop interval=0s timeout=20s (public_network_monitor-stop-interval-0s)
 Clone: pgsqld-clone
  Meta Attrs: notify=true promotable=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/12/bin datadir=/var/lib/postgresql/12/main pgdata=/etc/postgresql/12/main
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: public_virtual_ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=192.168.50.3 nic=mgnet0
  Operations: monitor interval=30s (public_virtual_ip-monitor-interval-30s)
              start interval=0s timeout=20s (public_virtual_ip-start-interval-0s)
              stop interval=0s timeout=20s (public_virtual_ip-stop-interval-0s)

Stonith Devices:
 Resource: node1_fence_agent (class=stonith type=fence_ssh)
  Attributes: hostname=192.168.60.1 pcmk_delay_base=15 pcmk_host_list=node1.local user=root
  Operations: monitor interval=60s (node1_fence_agent-monitor-interval-60s)
 Resource: node2_fence_agent (class=stonith type=fence_ssh)
  Attributes: hostname=192.168.60.2 pcmk_host_list=node2.local user=root
  Operations: monitor interval=60s (node2_fence_agent-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: node1_fence_agent
    Disabled on: node1.local (score:-INFINITY) (id:location-node1_fence_agent-node1.local--INFINITY)
  Resource: node2_fence_agent
    Disabled on: node2.local (score:-INFINITY) (id:location-node2_fence_agent-node2.local--INFINITY)
  Resource: public_virtual_ip
    Constraint: location-public_virtual_ip
      Rule: score=INFINITY (id:location-public_virtual_ip-rule)
        Expression: ethmonitor-public eq 1 (id:location-public_virtual_ip-rule-expr)
Ordering Constraints:
  promote pgsqld-clone then start public_virtual_ip (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-public_virtual_ip-Mandatory)
  demote pgsqld-clone then stop public_virtual_ip (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-public_virtual_ip-Mandatory-1)
Colocation Constraints:
  public_virtual_ip with pgsqld-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-public_virtual_ip-pgsqld-clone-INFINITY)
Ticket Constraints:
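
For reference, the ethmonitor clone and the attribute-based location rule above would be created with pcs commands roughly like these (approximate only, reconstructed from the config dump; the exact syntax differs between pcs 0.9.x and 0.10.x):

  # clone of ethmonitor that maintains the ethmonitor-public node attribute on every node
  pcs resource create public_network_monitor ocf:heartbeat:ethmonitor \
      interface=eth0 link_status_only=true name=ethmonitor-public \
      op monitor interval=10s timeout=60s clone

  # only allow the virtual IP on nodes where the interface is reported up
  pcs constraint location public_virtual_ip rule score=INFINITY ethmonitor-public eq 1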
This is my understanding of the sequence of events:
1. Node1 is running the PostgreSQL resource as master, Node2 is running
the PostgreSQL resource as standby. Everything is working okay at this
point.
2. On Node1, the public network goes down and ethmonitor changes the
ethmonitor-public node attribute from 1 to 0 (a quick way to watch this
attribute is sketched just after this list).
3. The location-public_virtual_ip constraint (which requires the IP to
run on a node with ethmonitor-public==1) kicks in, and Pacemaker demotes
the master PostgreSQL resource so that it can then promote it on Node2.
4. The primary PostgreSQL instance on Node1 attempts to shut down in
response to the demotion, but it can't connect to the standby, so it is
unable to stop cleanly. The PostgreSQL resource shows as demoting for 60
seconds, as below:
 Clone Set: pgsqld-clone [pgsqld] (promotable)
     pgsqld (ocf::heartbeat:pgsqlms): Demoting node1.local
     Slaves: [ node2.local ]
5. After a minute, the demotion finishes and Pacemaker attempts to
promote the PostgreSQL resource on Node2. This action fails with the
"Switchover has been canceled from pre-promote action" error, because
the standby didn't receive the final WAL activity from the primary.
6. Due to the failed promotion on Node2, PAF/Pacemaker promotes the
PostgreSQL resource on Node1 again. However, because the public network
interface on Node1 is down, the PostgreSQL and virtual IP resources
provided by the HA cluster are now completely inaccessible, even though
Node2 is perfectly capable of hosting them.
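
For anyone reproducing this: the attribute flip in step 2 and the stuck demote in step 4 can be watched from either node with something like the commands below (just a sketch; option spellings may vary slightly between Pacemaker versions):

  # transient node attribute maintained by ethmonitor (1 = link up, 0 = link down)
  attrd_updater --query --name ethmonitor-public --node node1.local

  # one-shot cluster status including node attributes; shows the resource stuck in Demoting
  crm_mon --one-shot --show-node-attributes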
I believe the 60-second wait during demotion is due to the default value
of '60s' for wal_sender_timeout
(https://www.postgresql.org/docs/12/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SENDER).
After 60 seconds of trying to reach the standby node, PostgreSQL
terminates the replication connection, at which point the shutdown and
demotion complete. If I set wal_sender_timeout to a value higher than
the pgsqld resource demote timeout (e.g. demote timeout=120s,
wal_sender_timeout=150s), then the demote action times out and the node
is fenced, at which point the PostgreSQL resource is promoted
successfully on the standby node.

This is almost what I want, but it means it can take over 2 minutes just
for the failover to initiate (plus the additional time to start the
resources on the standby node, etc.), which is not an acceptable
timeframe for us, given that ethmonitor detects the problem within 10
seconds. I could reduce the pgsqld demote timeout to get a quicker
failed demotion, but that would go against the values officially
suggested by the PAF team, so I don't really want to do that.
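
For reference, the two timeouts discussed above can be adjusted roughly as follows (a sketch only; the psql invocation assumes a local postgres OS superuser as on the Debian-style PostgreSQL 12 layout shown in the pgsqld attributes, and pcs op syntax varies slightly by version):

  # raise wal_sender_timeout above the demote timeout (reloadable, no restart required)
  sudo -u postgres psql -c "ALTER SYSTEM SET wal_sender_timeout = '150s';"
  sudo -u postgres psql -c "SELECT pg_reload_conf();"

  # the pgsqld demote timeout (120s here, which is the PAF-suggested value already in use)
  pcs resource update pgsqld op demote interval=0s timeout=120s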
Pacemaker logs can be found here:
Node1: https://pastebin.com/iT6GgWTe
Node2: https://pastebin.com/Yj8Xjxe7