[ClusterLabs] How to set up fencing/stonith

Ken Gaillot kgaillot at redhat.com
Fri May 18 19:38:12 UTC 2018


On Fri, 2018-05-18 at 12:33 -0600, Casey & Gina wrote:
> I think that I finally managed to get fencing working!  To do this,
> I've (for now) kept the stock Ubuntu package for pcs but used crmsh to
> create the fencing resource.  I've had a
> lot of trouble trying to get a newer pcs compiled and working, and
> don't really know what I'm doing well enough to overcome that as of
> yet.  I'm thinking of reporting the issue of pcs not supporting the
> external stonith plugins as a bug with Ubuntu, in hopes that they can
> update the version of the package available.  I think I should also
> be able to just edit the relevant XML into the cib without the help
> of crmsh, although I'll have to research how to do that more
> later.  It may well be easier to use fence_vmware_soap instead if I
> can get it working at some point, so I'm still keen to track down the
> problems with that too.
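> 
> (Here's roughly what I expect the direct-XML route would look like,
> untested and with ids I made up; it mirrors the crmsh primitive shown
> further down:)
> 
> ------
> cat > vfencing.xml <<'EOF'
> <primitive id="vfencing" class="stonith" type="external/vcenter">
>   <instance_attributes id="vfencing-params">
>     <nvpair id="vfencing-VI_SERVER" name="VI_SERVER" value="10.124.137.100"/>
>     <nvpair id="vfencing-VI_CREDSTORE" name="VI_CREDSTORE" value="/etc/pacemaker/vicredentials.xml"/>
>     <nvpair id="vfencing-HOSTLIST" name="HOSTLIST" value="d-gp2-dbpg0-1=d-gp2-dbpg0-1;d-gp2-dbpg0-2=d-gp2-dbpg0-2;d-gp2-dbpg0-3=d-gp2-dbpg0-3"/>
>     <nvpair id="vfencing-RESETPOWERON" name="RESETPOWERON" value="0"/>
>   </instance_attributes>
>   <operations>
>     <op id="vfencing-monitor-60s" name="monitor" interval="60s"/>
>   </operations>
> </primitive>
> EOF
> # load the new primitive into the resources section of the live CIB
> cibadmin -C -o resources -x vfencing.xml
> ------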
> 
> I wish that I knew how to diagnose what was going wrong from the logs
> (quoted below) or some debugging mode of pacemaker, but I took a wild
> guess and installed vCLI to a prefix of /usr (its default) instead
> of /usr/local (where I'd prefer it since it's not installed by
> apt).  Once this was done, I added the fencing resource with crmsh,
> and it didn't start failing between the nodes as it had before.  I
> was then able to use `stonith_admin -F` and `stonith_admin -U` to
> power off and on a node in the cluster.  I can't tell you how
> exciting that was to finally see!
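> 
> (For reference, the manual test was along these lines, run from one of
> the other nodes; node 1 is just the example target here:)
> 
> ------
> # ask the cluster's fencer to power the node off, then back on
> stonith_admin -F d-gp2-dbpg0-1
> stonith_admin -U d-gp2-dbpg0-1
> ------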
> 
> Sadly, my excitement was quickly squashed.  I proceeded to add
> PostgreSQL and VIP resources to the cluster as per how I had done
> them before without fencing, and everything looked good when I
> checked `pcs status`.  So then I logged in to vSphere and powered off
> the primary node, expecting the VIP and PostgreSQL to come up on one
> of the standby nodes.  Instead, I ended up with this:
> 
> ------
> Node d-gp2-dbpg0-1: UNCLEAN (offline)
> Online: [ d-gp2-dbpg0-2 d-gp2-dbpg0-3 ]
> 
> Full list of resources:
> 
>  vfencing       (stonith:external/vcenter):     Started[ d-gp2-dbpg0-1 d-gp2-dbpg0-2 ]
>  postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbpg0-1 (UNCLEAN)
>  Master/Slave Set: postgresql-ha [postgresql-10-main]
>      postgresql-10-main (ocf::heartbeat:pgsqlms):       Master d-gp2-dbpg0-1 (UNCLEAN)
>      Slaves: [ d-gp2-dbpg0-2 d-gp2-dbpg0-3 ]
> ------
> 
> Why does it show above that the vfencing resource is started on nodes
> 1 and 2, when node 1 is down?  Why is it not started on node
> 3?  Prior to powering off node 1, it said that it was only started on
> node 1 - is that a misconfiguration on my part or normal?

Having it started on one node is normal. Fence devices default to
requires=quorum, meaning they can start on a new node even before the
original node is fenced. It looks like that's what happened here, but
something went wrong with the fencing, so the cluster assumes it's
still active on the old node as well.
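
As a quick sanity check, it can help to confirm what each node's fencer
actually knows about the device; roughly (run on each node, node names
taken from your output):

------
# fence devices registered with the local fencer
stonith_admin -L
# devices the cluster believes are able to fence node 1
stonith_admin -l d-gp2-dbpg0-1
------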

I'm not sure what went wrong with the fencing. Once fencing succeeds,
the node should show up as offline without also being unclean. Anything
interesting in the logs around the time of the fencing?
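
For example, something like this on each node (assuming the Ubuntu
default of Pacemaker logging via syslog), plus the fencer's own history:

------
# messages from the fencing subsystem around the time node 1 went down
grep -iE 'stonith|fence' /var/log/syslog
# what the cluster recorded about fencing attempts against node 1
stonith_admin --history d-gp2-dbpg0-1 --verbose
------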

> 
> Most importantly, what's keeping a standby from taking over after the
> primary is powered off?
> 
> Strangely, when I power back on node 1 and `pcs cluster start` on it,
> the cluster ends up promoting node 2 as the primary, but with errors
> reported on node 1:
> 
> ------
> Online: [ d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3 ]
> 
> Full list of resources:
> 
>  vfencing       (stonith:external/vcenter):     Started d-gp2-dbpg0-2
>  postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbpg0-2
>  Master/Slave Set: postgresql-ha [postgresql-10-main]
>      postgresql-10-main (ocf::heartbeat:pgsqlms):       FAILED Master d-gp2-dbpg0-1
>      Masters: [ d-gp2-dbpg0-2 ]
>      Slaves: [ d-gp2-dbpg0-3 ]
> 
> Failed Actions:
> * postgresql-10-main_monitor_0 on d-gp2-dbpg0-1 'master (failed)' (9): call=14, status=complete,
>     exitreason='Instance "postgresql-10-main" controldata indicates a running primary instance, the instance has probably crashed',
>     last-rc-change='Fri May 18 18:29:51 2018', queued=0ms, exec=90ms
> ------
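> 
> (If I'm reading the pgsqlms docs right, a crashed old primary has to be
> rebuilt as a standby from the new primary before the cluster will run
> it again; this is roughly what I plan to try, untested, and it assumes
> a working replication user and that node 1's data can be discarded:)
> 
> ------
> # on d-gp2-dbpg0-1, as postgres, with d-gp2-dbpg0-2 now the primary
> rm -rf /var/lib/postgresql/10/main
> pg_basebackup -h d-gp2-dbpg0-2 -D /var/lib/postgresql/10/main -U replication -X stream
> 
> # then clear the failed probe so Pacemaker re-detects the instance
> pcs resource cleanup postgresql-10-main
> ------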
> 
> Here is the full list of commands that I have used to configure the
> cluster after a fresh installation:
> 
> ------
> crm configure primitive vfencing stonith::external/vcenter \
>   params VI_SERVER="10.124.137.100" \
>          VI_CREDSTORE="/etc/pacemaker/vicredentials.xml" \
>          HOSTLIST="d-gp2-dbpg0-1=d-gp2-dbpg0-1;d-gp2-dbpg0-2=d-gp2-dbpg0-2;d-gp2-dbpg0-3=d-gp2-dbpg0-3" \
>          RESETPOWERON="0" \
>   op monitor interval="60s"
> pcs cluster cib /tmp/dbpg.xml
> pcs -f /tmp/dbpg.xml property set stonith-enabled=true
> pcs -f /tmp/dbpg.xml resource defaults migration-threshold=5
> pcs -f /tmp/dbpg.xml resource defaults resource-stickiness=10
> pcs -f /tmp/dbpg.xml resource create postgresql-master-vip ocf:heartbeat:IPaddr2 \
>   ip=10.124.164.250 cidr_netmask=22 op monitor interval=10s
> pcs -f /tmp/dbpg.xml resource create postgresql-10-main ocf:heartbeat:pgsqlms \
>   bindir="/usr/lib/postgresql/10/bin" \
>   pgdata="/var/lib/postgresql/10/main" \
>   pghost="/var/run/postgresql" \
>   pgport=5432 \
>   recovery_template="/etc/postgresql/10/main/recovery.conf" \
>   start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
>   op start timeout=60s op stop timeout=60s \
>   op promote timeout=30s op demote timeout=120s \
>   op monitor interval=15s timeout=10s role="Master" \
>   op monitor interval=16s timeout=10s role="Slave" \
>   op notify timeout=60s
> pcs -f /tmp/dbpg.xml resource master postgresql-ha postgresql-10-main notify=true
> pcs -f /tmp/dbpg.xml constraint colocation add postgresql-master-vip with master postgresql-ha INFINITY
> pcs -f /tmp/dbpg.xml constraint order promote postgresql-ha then start postgresql-master-vip symmetrical=false kind=Mandatory
> pcs -f /tmp/dbpg.xml constraint order demote postgresql-ha then stop postgresql-master-vip symmetrical=false kind=Mandatory
> pcs cluster cib-push /tmp/dbpg.xml
> ------
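> 
> (For completeness, a quick way to sanity-check the pushed configuration:)
> 
> ------
> crm_verify -L -V     # validate the live CIB
> pcs constraint       # confirm the colocation/ordering constraints
> pcs status --full
> ------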
> 
> Here is the output of `pcs status` before powering off the primary:
> 
> ------
> Online: [ d-gp2-dbpg0-1 d-gp2-dbpg0-2 d-gp2-dbpg0-3 ]
> 
> Full list of resources:
> 
>  vfencing       (stonith:external/vcenter):     Started d-gp2-dbpg0-1
>  postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbpg0-1
>  Master/Slave Set: postgresql-ha [postgresql-10-main]
>      Masters: [ d-gp2-dbpg0-1 ]
>      Slaves: [ d-gp2-dbpg0-2 d-gp2-dbpg0-3 ]
> ------
> 
> As always, thank you all for any help that you can provide,
-- 
Ken Gaillot <kgaillot at redhat.com>

