[Pacemaker] Avoid one node from being a target for resources migration

Mon Jan 12 15:56:00 EST 2015

>
> 1. install the resource related packages on node3 even though you never
> want them to run there. This will allow the resource-agents to verify the
> resource is in fact inactive.

Thanks, your advice helped: I installed all the services on node3 as well
(including DRBD, but without its configs) and stopped+disabled them. Then I
added the following line to my configuration:

location loc_drbd drbd rule -inf: #uname eq node3

So node3 is never a target for DRBD, and this helped: "crm node standby
node1" doesn't try to use node3 anymore.
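
For comparison, the second option from the reply quoted below
(resource-discovery=never) could be written roughly as follows in crm shell
syntax. This is an untested sketch: it assumes a Pacemaker build that already
supports resource-discovery and a crmsh version that accepts the attribute in
a location constraint, so the exact syntax may differ. The constraint and
resource names are the ones from the configuration in this thread.

```
# Assumption: Pacemaker and crmsh versions that understand resource-discovery.
# Ban the resources from node3 and skip the monitor_0 probes there.
location loc_drbd drbd resource-discovery=never -inf: node3
location loc_server server resource-discovery=never -inf: node3
```

With probes disabled this way, the resource packages would not need to be
installed on node3 at all.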

But I have another (related) issue. If some node (e.g. node1) becomes
isolated from the other 2 nodes, how do I force it to shut down its services?
I cannot use IPMI-based fencing/stonith, because there are no reliable
connections between the nodes at all (they are in geo-distributed
datacenters), so an IPMI call to shut down one node from another is
impossible.

E.g. initially I have the following:

# crm status
Online: [ node1 node2 node3 ]
Master/Slave Set: ms_drbd [drbd]
     Masters: [ node2 ]
     Slaves: [ node1 ]
Resource Group: server
     fs (ocf::heartbeat:Filesystem):    Started node2
     postgresql (lsb:postgresql):       Started node2
     bind9      (lsb:bind9):    Started node2
     nginx      (lsb:nginx):    Started node2

Then I turn on the firewall on node2 to isolate it from the outside internet:

root@node2:~# iptables -A INPUT -p tcp --dport 22 -j ACCEPT
root@node2:~# iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT
root@node2:~# iptables -A INPUT -i lo -j ACCEPT
root@node2:~# iptables -A OUTPUT -o lo -j ACCEPT
root@node2:~# iptables -P INPUT DROP; iptables -P OUTPUT DROP

Then I see that, although node2 clearly knows it's isolated (it doesn't see
the other 2 nodes and does not have quorum), it does not stop its services:

root@node2:~# crm status
Online: [ node2 ]
OFFLINE: [ node1 node3 ]
Master/Slave Set: ms_drbd [drbd]
     Masters: [ node2 ]
     Stopped: [ node1 node3 ]
Resource Group: server
     fs (ocf::heartbeat:Filesystem): Started node2
     postgresql (lsb:postgresql): Started node2
     bind9 (lsb:bind9): Started node2
     nginx (lsb:nginx): Started node2

So is there a way to tell pacemaker to shut down a node's services when it
becomes isolated?

On Mon, Jan 12, 2015 at 8:25 PM, David Vossel <dvossel at redhat.com> wrote:

>
>
> ----- Original Message -----
> > Hello.
> >
> > I have 3-node cluster managed by corosync+pacemaker+crm. Node1 and Node2
> > are DRBD master-slave, also they have a number of other services installed
> > (postgresql, nginx, ...). Node3 is just a corosync node (for quorum), no
> > DRBD/postgresql/... are installed at it, only corosync+pacemaker.
> >
> > But when I add resources to the cluster, a part of them are somehow
> > moved to node3 and since then fail. Note that I have a "colocation"
> > directive to place these resources to the DRBD master only and "location"
> > with -inf for node3, but this does not help - why? How to make pacemaker
> > not run anything at node3?
> >
> > All the resources are added in a single transaction: "cat config.txt |
> > crm -w -f- configure" where config.txt contains directives and "commit"
> > statement at the end.
> >
> > Below are "crm status" (error messages) and "crm configure show" outputs.
> >
> >
> > root@node3:~# crm status
> > Current DC: node2 (1017525950) - partition with quorum
> > 3 Nodes configured
> > 6 Resources configured
> > Online: [ node1 node2 node3 ]
> > Master/Slave Set: ms_drbd [drbd]
> > Masters: [ node1 ]
> > Slaves: [ node2 ]
> > Resource Group: server
> > fs (ocf::heartbeat:Filesystem): Started node1
> > postgresql (lsb:postgresql): Started node3 FAILED
> > bind9 (lsb:bind9): Started node3 FAILED
> > nginx (lsb:nginx): Started node3 (unmanaged) FAILED
> > Failed actions:
> > drbd_monitor_0 (node=node3, call=744, rc=5, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=2ms, exec=0ms): not installed
> > postgresql_monitor_0 (node=node3, call=753, rc=1, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=8ms, exec=0ms): unknown error
> > bind9_monitor_0 (node=node3, call=757, rc=1, status=complete, last-rc-change=Mon Jan 12 11:16:43 2015, queued=11ms, exec=0ms): unknown error
> > nginx_stop_0 (node=node3, call=767, rc=5, status=complete, last-rc-change=Mon Jan 12 11:16:44 2015, queued=1ms, exec=0ms): not installed
>
> Here's what is going on. Even when you say "never run this resource on
> node3" pacemaker is going to probe for the resource regardless on node3
> just to verify the resource isn't running.
>
> The failures you are seeing "monitor_0 failed" indicate that pacemaker
> failed to be able to verify resources are running on node3 because the
> related packages for the resources are not installed. Given pacemaker's
> default behavior I'd expect this.
>
> You have two options.
>
> 1. install the resource related packages on node3 even though you never
> want them to run there. This will allow the resource-agents to verify the
> resource is in fact inactive.
>
> 2. If you are using the current master branch of pacemaker, there's a new
> location constraint option called
> 'resource-discovery=always|never|exclusive'.
> If you add the 'resource-discovery=never' option to your location
> constraint that attempts to keep resources from node3, you'll avoid having
> pacemaker perform the 'monitor_0' actions on node3 as well.
>
> -- Vossel
>
> >
> > root@node3:~# crm configure show | cat
> > node $id="1017525950" node2
> > node $id="13071578" node3
> > node $id="1760315215" node1
> > primitive drbd ocf:linbit:drbd \
> > params drbd_resource="vlv" \
> > op start interval="0" timeout="240" \
> > op stop interval="0" timeout="120"
> > primitive fs ocf:heartbeat:Filesystem \
> > params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root" options="noatime,nodiratime" fstype="xfs" \
> > op start interval="0" timeout="300" \
> > op stop interval="0" timeout="300"
> > primitive postgresql lsb:postgresql \
> > op monitor interval="10" timeout="60" \
> > op start interval="0" timeout="60" \
> > op stop interval="0" timeout="60"
> > primitive bind9 lsb:bind9 \
> > op monitor interval="10" timeout="60" \
> > op start interval="0" timeout="60" \
> > op stop interval="0" timeout="60"
> > primitive nginx lsb:nginx \
> > op monitor interval="10" timeout="60" \
> > op start interval="0" timeout="60" \
> > op stop interval="0" timeout="60"
> > group server fs postgresql bind9 nginx
> > ms ms_drbd drbd meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > location loc_server server rule $id="loc_server-rule" -inf: #uname eq node3
> > colocation col_server inf: server ms_drbd:Master
> > order ord_server inf: ms_drbd:promote server:start
> > property $id="cib-bootstrap-options" \
> > stonith-enabled="false" \
> > last-lrm-refresh="1421079189" \
> > maintenance-mode="false"
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >