[Pacemaker] Avoid one node from being a target for resources migration

Thu Jan 15 03:14:59 UTC 2015

> On 15 Jan 2015, at 12:43 am, Dmitry Koterov <dmitry.koterov at gmail.com> wrote:
> 
> Sorry!
> 
> Pacemaker 1.1.10
> Corosync 2.3.30
> 
> BTW I removed quorum.two_node:1 from corosync.conf, and it helped! Now an isolated node stops its services in the 3-node cluster. Was that the right solution?

Yes. 'quorum.two_node:1' is only sane for a 2-node cluster.
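
For reference, a minimal sketch of what the quorum section of corosync.conf might look like for a 3-node cluster under Corosync 2.x. The votequorum provider and the expected_votes value are assumptions (neither appears in the thread); the only point is that two_node is left out:

```
quorum {
    provider: corosync_votequorum
    # 3 nodes vote, so a partition needs 2 votes to keep quorum
    expected_votes: 3
    # two_node: 1 is deliberately absent; it belongs only in 2-node clusters
}
```

With a layout like this, a node that loses contact with the other two holds 1 of 3 votes and should lose quorum, at which point the default no-quorum-policy of "stop" applies.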

> 
> On Wednesday, January 14, 2015, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> > On 14 Jan 2015, at 12:06 am, Dmitry Koterov <dmitry.koterov at gmail.com> wrote:
> >
> >
> > > Then I see that, although node2 clearly knows it's isolated (it doesn't see the other 2 nodes and does not have quorum)
> >
> > we don't know that - there are several algorithms for calculating quorum and the information isn't included in your output.
> > are you using cman, or corosync underneath pacemaker? corosync version? pacemaker version? have you set no-quorum-policy?
> >
> > no-quorum-policy is not set, so, according to http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-options.html , it defaults to "stop - stop all resources in the affected cluster partition". I suppose this is the right option, but why are the resources not stopped on a node when this one node of three becomes isolated and clearly sees the other nodes as offline (so it knows it's isolated)? What else should I configure?
> >
> > I'm using corosync+pacemaker, no cman. Below (in quotes) is output of "crm configure show". Versions are from Ubuntu 14.04, so almost new.
> 
> I don't have Ubuntu installed.  You'll have to be more specific as to what package versions you have.
> 
> >
> >
> > > , it does not stop its services:
> > >
> > > > root@node2:~# crm status
> > > Online: [ node2 ]
> > > OFFLINE: [ node1 node3 ]
> > > Master/Slave Set: ms_drbd [drbd]
> > >      Masters: [ node2 ]
> > >      Stopped: [ node1 node3 ]
> > > Resource Group: server
> > >      fs       (ocf::heartbeat:Filesystem):    Started node2
> > >      postgresql       (lsb:postgresql):       Started node2
> > >      bind9    (lsb:bind9):    Started node2
> > >      nginx    (lsb:nginx):    Started node2
> > >
> > > > So is there a way to tell pacemaker to shut down a node's services when it becomes isolated?
> > >
> > >
> > >
> > > On Mon, Jan 12, 2015 at 8:25 PM, David Vossel <dvossel at redhat.com> wrote:
> > >
> > >
> > > ----- Original Message -----
> > > > Hello.
> > > >
> > > > > I have a 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2 are
> > > > > a DRBD master-slave pair; they also have a number of other services installed
> > > > > (postgresql, nginx, ...). Node3 is just a corosync node (for quorum); no
> > > > > DRBD/postgresql/... is installed on it, only corosync+pacemaker.
> > > >
> > > > > But when I add resources to the cluster, some of them are somehow moved to
> > > > > node3 and then fail. Note that I have a "colocation" directive to
> > > > > place these resources on the DRBD master only and a "location" with -inf for
> > > > > node3, but this does not help - why? How can I make pacemaker not run anything
> > > > > on node3?
> > > >
> > > > > All the resources are added in a single transaction: "cat config.txt | crm -w
> > > > > -f- configure", where config.txt contains the directives and a "commit" statement
> > > > > at the end.
> > > >
> > > > Below are "crm status" (error messages) and "crm configure show" outputs.
> > > >
> > > >
> > > > > root@node3:~# crm status
> > > > Current DC: node2 (1017525950) - partition with quorum
> > > > 3 Nodes configured
> > > > 6 Resources configured
> > > > Online: [ node1 node2 node3 ]
> > > > Master/Slave Set: ms_drbd [drbd]
> > > > Masters: [ node1 ]
> > > > Slaves: [ node2 ]
> > > > Resource Group: server
> > > > fs (ocf::heartbeat:Filesystem): Started node1
> > > > postgresql (lsb:postgresql): Started node3 FAILED
> > > > bind9 (lsb:bind9): Started node3 FAILED
> > > > nginx (lsb:nginx): Started node3 (unmanaged) FAILED
> > > > Failed actions:
> > > > > drbd_monitor_0 (node=node3, call=744, rc=5, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not installed
> > > > > postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown error
> > > > > bind9_monitor_0 (node=node3, call=757, rc=1, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown error
> > > > > nginx_stop_0 (node=node3, call=767, rc=5, status=complete, last-rc-change=Mon Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed
> > >
> > > > Here's what is going on. Even when you say "never run this resource on node3",
> > > > pacemaker still probes for the resource on node3, just to verify that
> > > > the resource isn't running there.
> > > >
> > > > The failures you are seeing ("monitor_0 failed") indicate that pacemaker could
> > > > not verify whether the resources are running on node3, because the related
> > > > packages for the resources are not installed. Given pacemaker's default
> > > > behavior, I'd expect this.
> > >
> > > You have two options.
> > >
> > > > 1. Install the resource-related packages on node3 even though you never want
> > > > them to run there. This will allow the resource agents to verify that the
> > > > resource is in fact inactive.
> > > >
> > > > 2. If you are using the current master branch of pacemaker, there's a new
> > > > location constraint option called 'resource-discovery=always|never|exclusive'.
> > > > If you add the 'resource-discovery=never' option to the location constraint
> > > > that keeps resources off node3, pacemaker will also skip
> > > > the 'monitor_0' actions on node3.
> > >
> > > -- Vossel
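
For reference, a sketch of option 2 in crm shell syntax. It assumes a Pacemaker and crmsh new enough to understand resource-discovery (the 1.1.10 mentioned elsewhere in this thread predates it); the constraint id loc_drbd is hypothetical, added because drbd is probed on node3 as well:

```
location loc_server server resource-discovery=never -inf: node3
location loc_drbd ms_drbd resource-discovery=never -inf: node3
```

The first line would replace the existing rule-based loc_server constraint rather than sit alongside it.
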
> > >
> > > >
> > > > > root@node3:~# crm configure show | cat
> > > > node $id="1017525950" node2
> > > > node $id="13071578" node3
> > > > node $id="1760315215" node1
> > > > primitive drbd ocf:linbit:drbd \
> > > > params drbd_resource="vlv" \
> > > > op start interval="0" timeout="240" \
> > > > op stop interval="0" timeout="120"
> > > > primitive fs ocf:heartbeat:Filesystem \
> > > > > params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root" options="noatime,nodiratime" fstype="xfs" \
> > > > op start interval="0" timeout="300" \
> > > > op stop interval="0" timeout="300"
> > > > primitive postgresql lsb:postgresql \
> > > > op monitor interval="10" timeout="60" \
> > > > op start interval="0" timeout="60" \
> > > > op stop interval="0" timeout="60"
> > > > primitive bind9 lsb:bind9 \
> > > > op monitor interval="10" timeout="60" \
> > > > op start interval="0" timeout="60" \
> > > > op stop interval="0" timeout="60"
> > > > primitive nginx lsb:nginx \
> > > > op monitor interval="10" timeout="60" \
> > > > op start interval="0" timeout="60" \
> > > > op stop interval="0" timeout="60"
> > > > group server fs postgresql bind9 nginx
> > > > > ms ms_drbd drbd meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > > > location loc_server server rule $id="loc_server-rule" -inf: #uname eq node3
> > > > colocation col_server inf: server ms_drbd:Master
> > > > order ord_server inf: ms_drbd:promote server:start
> > > > property $id="cib-bootstrap-options" \
> > > > stonith-enabled="false" \
> > > > last-lrm-refresh="1421079189" \
> > > > maintenance-mode="false"
> > > >
> > > > _______________________________________________
> > > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > >
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org
> > > >
> > >