[ClusterLabs] Need help debugging a STONITH resource

Casey & Gina caseyandgina at icloud.com
Wed Jul 11 14:56:02 EDT 2018


I have a number of clusters in a VMware ESX environment, all set up following the same steps (unless I did something wrong on some of them without realizing it).

The issue I am facing is that on some of the clusters, after adding the STONITH resource, testing with `stonith_admin -F <node_hostname>` is failing with the error "Command failed: No route to host".  Executing it with --verbose adds no additional output.

The stonith plugin I am using is external/vcenter, which in turn uses the vSphere CLI package.  I'm not certain what command it might be trying to run, or how to debug this further.  It's not an ESX issue, since the same command works fine on other clusters when tested at the same time.

Here is the output of `pcs config`:

------
Cluster Name: d-gp2-dbpg35
Corosync Nodes:
 d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
Pacemaker Nodes:
 d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.167.158 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true 
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
  Operations: monitor interval=60s (vfencing-monitor-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg35
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: false
Node Attributes:
 d-gp2-dbpg35-1: master-postgresql-10-main=1001
 d-gp2-dbpg35-2: master-postgresql-10-main=1000
 d-gp2-dbpg35-3: master-postgresql-10-main=990
------

Here is a failure of fence testing on the same cluster:

------
root@d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
Command failed: No route to host
------

For comparison, here is the output of `pcs config` on another cluster where the stonith_admin commands work:

------
Cluster Name: d-gp2-dbpg64
Corosync Nodes:
 d-gp2-dbpg64-1 d-gp2-dbpg64-2
Pacemaker Nodes:
 d-gp2-dbpg64-1 d-gp2-dbpg64-2

Resources:
 Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.124.165.40 cidr_netmask=22
  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
 Master: postgresql-ha
  Meta Attrs: notify=true 
  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)

Stonith Devices:
 Resource: vfencing (class=stonith type=external/vcenter)
  Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg64-1;d-gp2-dbpg64-2 RESETPOWERON=1
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
Colocation Constraints:
  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg64
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 last-lrm-refresh: 1527114792
 no-quorum-policy: ignore
 stonith-enabled: true
Node Attributes:
 d-gp2-dbpg64-1: master-postgresql-10-main=1001
 d-gp2-dbpg64-2: master-postgresql-10-main=1000
------
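
The two outputs above are long enough that differences are easy to miss by eye. Below is a minimal sketch of one way to list them mechanically, restricted to the Cluster Properties lines copied from the two outputs. The helper names `parse_properties` and `diff_properties` are made up for illustration and are not part of pcs or Pacemaker.

```python
# Sketch: compare "key: value" settings from two `pcs config` outputs.
# The strings below hold only the Cluster Properties lines copied from the
# two outputs above. parse_properties and diff_properties are hypothetical
# helpers written for this example, not part of any cluster tooling.

FAILING = """\
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg35
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 stonith-enabled: false
"""

WORKING = """\
 cluster-infrastructure: corosync
 cluster-name: d-gp2-dbpg64
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 last-lrm-refresh: 1527114792
 no-quorum-policy: ignore
 stonith-enabled: true
"""


def parse_properties(text):
    """Turn lines of the form 'key: value' into a dict, skipping other lines."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            props[key] = value
    return props


def diff_properties(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ.

    A key that is missing on one side is reported with None on that side.
    """
    out = {}
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            out[key] = (a.get(key), b.get(key))
    return out


if __name__ == "__main__":
    diffs = diff_properties(parse_properties(FAILING), parse_properties(WORKING))
    for key, (failing, working) in diffs.items():
        print(f"{key}: failing={failing} working={working}")
```

With the lines above, the keys reported as different are cluster-name, last-lrm-refresh, no-quorum-policy and stonith-enabled. The sketch only lists the differences; it does not show whether any of them is related to the failure.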

I have also verified that the username and password saved in the /etc/pacemaker/vicredentials.xml file are identical, and that the version of the vSphere CLI is identical between clusters.  I don't know how to test a vCLI command directly to rule out a problem with that package, but I hope there is some way to find out what the stonith_admin command is in turn trying to execute, so that I can debug further.

Thank you in advance for any help,
-- 
Casey

