[ClusterLabs] Need help debugging a STONITH resource

Casey & Gina caseyandgina at icloud.com
Wed Jul 11 22:28:58 UTC 2018


I was able to get this sorted out thanks to Ken's help on IRC.  For some reason, `stonith_admin -L` did not list the device I'd added until I set stonith-enabled=true, even though on other clusters this was not necessary.  My process was to ensure that stonith_admin could successfully fence/reboot a node in the cluster before enabling fencing in the Pacemaker config.  I'm still not sure why the device sometimes registered and sometimes didn't, but it seems that enabling stonith always registers it.
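
In case it helps anyone hitting the same thing, this is roughly the sequence that ended up working for me, using the names from my config below.  It's a sketch from memory rather than a copy/paste, so double-check the syntax against your pcs version:

------
# create the fence device (it sat unregistered for me while stonith-enabled=false)
pcs stonith create vfencing external/vcenter \
    VI_SERVER=vcenter.imovetv.com \
    VI_CREDSTORE=/etc/pacemaker/vicredentials.xml \
    HOSTLIST="d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3" \
    RESETPOWERON=1 \
    op monitor interval=60s

# enabling fencing is what finally got the device registered with stonithd
pcs property set stonith-enabled=true

# confirm the device is now listed, then test-reboot one node
stonith_admin -L
stonith_admin -B d-gp2-dbpg35-3
------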

> On 2018-07-11, at 12:56 PM, Casey & Gina <caseyandgina at icloud.com> wrote:
> 
> I have a number of clusters in a VMware ESX environment which have all been set up following the same steps, unless somehow I did something wrong on some of them without realizing it.
> 
> The issue I am facing is that on some of the clusters, after adding the STONITH resource, testing with `stonith_admin -F <node_hostname>` is failing with the error "Command failed: No route to host".  Executing it with --verbose adds no additional output.
> 
> The stonith plugin I am using is external/vcenter, which in turn utilizes the vSphere CLI package.  I'm not certain what command it might be trying to run, or how to debug this further...  It's not an ESX issue, since the same command works fine on other clusters.
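> 
> If I understand correctly, external/* stonith plugins read their parameters from environment variables and take the action as their first argument, so I'm guessing that something like the following would exercise the vCenter connection outside of Pacemaker (an untested sketch; the plugin path is where cluster-glue installs it on my Ubuntu systems and may differ elsewhere):
> 
> ------
> # export the same parameters the vfencing resource uses
> export VI_SERVER=vcenter.imovetv.com
> export VI_CREDSTORE=/etc/pacemaker/vicredentials.xml
> export HOSTLIST="d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3"
> export RESETPOWERON=1
> # "gethosts" should list the managed VMs; "status" just checks connectivity
> /usr/lib/stonith/plugins/external/vcenter gethosts
> /usr/lib/stonith/plugins/external/vcenter status
> ------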
> 
> Here is the output of `pcs config`:
> 
> ------
> Cluster Name: d-gp2-dbpg35
> Corosync Nodes:
> d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
> Pacemaker Nodes:
> d-gp2-dbpg35-1 d-gp2-dbpg35-2 d-gp2-dbpg35-3
> 
> Resources:
> Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
>  Attributes: ip=10.124.167.158 cidr_netmask=22
>  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
>              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
>              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
> Master: postgresql-ha
>  Meta Attrs: notify=true 
>  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
>   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
>   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
>               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
>               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
>               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
>               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
>               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
>               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)
> 
> Stonith Devices:
> Resource: vfencing (class=stonith type=external/vcenter)
>  Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg35-1;d-gp2-dbpg35-2;d-gp2-dbpg35-3 RESETPOWERON=1
>  Operations: monitor interval=60s (vfencing-monitor-60s)
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
>  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
> Colocation Constraints:
>  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
> 
> Resources Defaults:
> migration-threshold: 5
> resource-stickiness: 10
> Operations Defaults:
> No defaults set
> 
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: d-gp2-dbpg35
> dc-version: 1.1.14-70404b0
> have-watchdog: false
> stonith-enabled: false
> Node Attributes:
> d-gp2-dbpg35-1: master-postgresql-10-main=1001
> d-gp2-dbpg35-2: master-postgresql-10-main=1000
> d-gp2-dbpg35-3: master-postgresql-10-main=990
> ------
> 
> Here is a failure of fence testing on the same cluster:
> 
> ------
> root at d-gp2-dbpg35-1:~# stonith_admin -FV d-gp2-dbpg35-3
> Command failed: No route to host
> ------
> 
> For comparison's sake, here is the output of `pcs config` on another cluster where the stonith_admin commands work:
> 
> ------
> Cluster Name: d-gp2-dbpg64
> Corosync Nodes:
> d-gp2-dbpg64-1 d-gp2-dbpg64-2
> Pacemaker Nodes:
> d-gp2-dbpg64-1 d-gp2-dbpg64-2
> 
> Resources:
> Resource: postgresql-master-vip (class=ocf provider=heartbeat type=IPaddr2)
>  Attributes: ip=10.124.165.40 cidr_netmask=22
>  Operations: start interval=0s timeout=20s (postgresql-master-vip-start-interval-0s)
>              stop interval=0s timeout=20s (postgresql-master-vip-stop-interval-0s)
>              monitor interval=10s (postgresql-master-vip-monitor-interval-10s)
> Master: postgresql-ha
>  Meta Attrs: notify=true 
>  Resource: postgresql-10-main (class=ocf provider=heartbeat type=pgsqlms)
>   Attributes: bindir=/usr/lib/postgresql/10/bin pgdata=/var/lib/postgresql/10/main pghost=/var/run/postgresql pgport=5432 recovery_template=/etc/postgresql/10/main/recovery.conf start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf"
>   Operations: start interval=0s timeout=60s (postgresql-10-main-start-interval-0s)
>               stop interval=0s timeout=60s (postgresql-10-main-stop-interval-0s)
>               promote interval=0s timeout=30s (postgresql-10-main-promote-interval-0s)
>               demote interval=0s timeout=120s (postgresql-10-main-demote-interval-0s)
>               monitor interval=15s role=Master timeout=10s (postgresql-10-main-monitor-interval-15s)
>               monitor interval=16s role=Slave timeout=10s (postgresql-10-main-monitor-interval-16s)
>               notify interval=0s timeout=60s (postgresql-10-main-notify-interval-0s)
> 
> Stonith Devices:
> Resource: vfencing (class=stonith type=external/vcenter)
>  Attributes: VI_SERVER=vcenter.imovetv.com VI_CREDSTORE=/etc/pacemaker/vicredentials.xml HOSTLIST=d-gp2-dbpg64-1;d-gp2-dbpg64-2 RESETPOWERON=1
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>  promote postgresql-ha then start postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory)
>  demote postgresql-ha then stop postgresql-master-vip (kind:Mandatory) (non-symmetrical) (id:order-postgresql-ha-postgresql-master-vip-Mandatory-1)
> Colocation Constraints:
>  postgresql-master-vip with postgresql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-postgresql-master-vip-postgresql-ha-INFINITY)
> 
> Resources Defaults:
> migration-threshold: 5
> resource-stickiness: 10
> Operations Defaults:
> No defaults set
> 
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: d-gp2-dbpg64
> dc-version: 1.1.14-70404b0
> have-watchdog: false
> last-lrm-refresh: 1527114792
> no-quorum-policy: ignore
> stonith-enabled: true
> Node Attributes:
> d-gp2-dbpg64-1: master-postgresql-10-main=1001
> d-gp2-dbpg64-2: master-postgresql-10-main=1000
> ------
> 
> I have also verified that the username and password saved in the /etc/pacemaker/vicredentials.xml file are identical, and that the version of the vSphere CLI is identical between clusters.  I don't know how to test a vCLI command directly to rule out something related to that package, but I hope there is some way to figure out what command stonith_admin is in turn trying to execute so I can debug further.
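> 
> I'm guessing the credstore_admin.pl utility that ships with the vSphere SDK for Perl could at least confirm that the credential store is readable, along these lines (the path is where the vSphere CLI puts its sample apps on my installs, and I haven't verified that this is what the plugin itself uses):
> 
> ------
> # list the server/username entries in the credential store the plugin points at
> VI_CREDSTORE=/etc/pacemaker/vicredentials.xml \
>     /usr/lib/vmware-vcli/apps/general/credstore_admin.pl list
> ------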
> 
> Thank you in advance for any help,
> -- 
> Casey
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


