[ClusterLabs] node went to stand-by after one single resource-failure

Mon Jun 8 13:32:06 UTC 2015

2015-06-08 15:25 GMT+02:00 Andrei Borzenkov <arvidjaar at gmail.com>:

> On Mon, Jun 8, 2015 at 3:58 PM, Oscar Salvador
> <osalvador.vilardaga at gmail.com> wrote:
> >
> > 2015-06-08 14:23 GMT+02:00 Andrei Borzenkov <arvidjaar at gmail.com>:
> >>
> >> On Mon, Jun 8, 2015 at 3:05 PM, Oscar Salvador
> >> <osalvador.vilardaga at gmail.com> wrote:
> >> > Hi guys!
> >> >
> >> > I've configured two nodes with the pacemaker + corosync stack, with only
> >> > one resource (just for test purposes), and I'm getting a strange result.
> >> >
> >> > First a little bit of information:
> >> >
> >> > pacemaker version: 1.1.12-1
> >> > corosync version: 2.3.4-1
> >> >
> >> >
> >> > # crm configure show
> >> > node 1053402612: server1 \
> >> > node 1053402613: server2
> >> > primitive IP-rsc_apache IPaddr2 \
> >> >         params ip=xx.xx.xx.xy nic=eth0 cidr_netmask=255.255.255.192 \
> >> >         meta migration-threshold=2 \
> >> >         op monitor interval=20 timeout=60 on-fail=standby
> >> > property cib-bootstrap-options: \
> >> >         last-lrm-refresh=1433763004 \
> >> >         stonith-enabled=false \
> >> >         no-quorum-policy=ignore
> >> >
> >> ...
> >> >
> >> >
> >> > It seems like pacemaker assumes that the monitor operation failed and,
> >> > because of this, decides to put the node in standby. But it should not
> >> > do that, no?
> >> >
> >>
> >> You told it to do exactly that (on-fail=standby).
> >>
> >> _______________________________________________
> >> Users mailing list: Users at clusterlabs.org
> >> http://clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> > Yes, I told it that: if the monitor operation fails, put the node in standby.
> > But from my point of view, it is not the monitor operation that fails, but the
> > resource itself.
>
> The only way pacemaker can determine a resource failure is by the result of
> operations. So in a sense a resource can never fail; an operation can only
> return an unexpected result.
>
> > I'm quite puzzled by this because, as I said, I tested this with an old
> > version of pacemaker, and it didn't have this behaviour.
>
> Here I cannot say anything, sorry; hopefully someone who has been here
> for a longer time can chime in.
>
> > Maybe I was confused because of that.
> >
> > So it is somehow redundant to do something like this:
> >
> > meta migration-threshold=2
> > op monitor interval=20 timeout=60 on-fail=standby
> >
> > since it will never reach the failcount of 2, no?
> >
>
> The migration threshold defines when pacemaker will force a resource away
> from a node. But here you say that the node is put in standby as soon as the
> first error occurs. So yes, this combination makes no sense IMHO.
>
>

Thanks for clarifying, I was confused by the old behaviour ;)
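
For the archive, a minimal sketch of the variant in which
migration-threshold does come into play. It assumes that restarting the
resource in place after the first monitor failure is acceptable, so on-fail
is set to restart (the default for monitor operations) instead of standby;
that choice is an assumption, not something decided in this thread. All
other values are copied from the configuration quoted above.

    # Assumption: on-fail=restart replaces on-fail=standby from the original config
    primitive IP-rsc_apache IPaddr2 \
        params ip=xx.xx.xx.xy nic=eth0 cidr_netmask=255.255.255.192 \
        meta migration-threshold=2 \
        op monitor interval=20 timeout=60 on-fail=restart

With this, the first failed monitor should only restart the resource on the
same node and raise its failcount; once the failcount reaches 2, pacemaker
moves the resource away from that node, and the node itself stays online.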