[ClusterLabs] node went to stand-by after one single resource-failure

Mon Jun 8 09:25:35 EDT 2015

On Mon, Jun 8, 2015 at 3:58 PM, Oscar Salvador
<osalvador.vilardaga at gmail.com> wrote:
>
> 2015-06-08 14:23 GMT+02:00 Andrei Borzenkov <arvidjaar at gmail.com>:
>>
>> On Mon, Jun 8, 2015 at 3:05 PM, Oscar Salvador
>> <osalvador.vilardaga at gmail.com> wrote:
>> > Hi guys!
>> >
>> > I've configured two nodes with the pacemaker + corosync stack, with
>> > only one resource (just for test purposes), and I'm getting a strange
>> > result.
>> >
>> > First a little bit of information:
>> >
>> > pacemaker version: 1.1.12-1
>> > corosync version: 2.3.4-1
>> >
>> >
>> > # crm configure show
>> > node 1053402612: server1 \
>> > node 1053402613: server2
>> > primitive IP-rsc_apache IPaddr2 \
>> > params ip=xx.xx.xx.xy nic=eth0 cidr_netmask=255.255.255.192 \
>> > meta migration-threshold=2 \
>> > op monitor interval=20 timeout=60 on-fail=standby
>> > property cib-bootstrap-options: \
>> > last-lrm-refresh=1433763004 \
>> > stonith-enabled=false \
>> > no-quorum-policy=ignore
>> >
>> ...
>> >
>> >
>> > It seems that pacemaker assumes the monitor operation failed and,
>> > because of this, decides to put the node in standby. But it should
>> > not, no?
>> >
>>
>> You told it to do exactly that (on-fail=standby).
>
> Yes, I told it that: if the monitor operation fails, put the node in
> standby. But from my point of view, it is not the monitor operation that
> fails, but the resource itself.

The only way pacemaker can detect a resource failure is through the
results of operations. So in a sense a resource can never fail; only an
operation can return an unexpected result.

> I find this very strange because, as I said, I tested this with an old
> version of pacemaker, and it didn't behave this way.

I cannot say anything about that, sorry; hopefully someone who has been
here for a longer time can chime in.

> Maybe I was confused because of that.
>
> So it is somehow redundant to do something like this:
>
> meta migration-threshold=2
> op monitor interval=20 timeout=60 on-fail=standby
>
> since it will never reach the failcount of 2, no?
>

migration-threshold defines after how many failures pacemaker will force
the resource away from the node. But here you say that the node is put in
standby as soon as the first error occurs, so the failcount never gets a
chance to reach 2. So yes, this combination makes no sense IMHO.
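
If the goal is to recover the resource in place on the first failure and
only move it away after the second one, a minimal sketch would be to drop
on-fail=standby and let migration-threshold do the work. This assumes that
on-fail=restart (the documented default for monitor operations) is the
behaviour you want; I have not tested it against your 1.1.12 setup, and
the ip value is left masked as in your mail:

```
# Sketch only: same primitive as in the original post, with
# on-fail=standby replaced by on-fail=restart (the default for monitor).
primitive IP-rsc_apache IPaddr2 \
    params ip=xx.xx.xx.xy nic=eth0 cidr_netmask=255.255.255.192 \
    meta migration-threshold=2 \
    op monitor interval=20 timeout=60 on-fail=restart
```

With this, a failed monitor should lead to a stop/start of the resource,
and once the failcount on a node reaches 2 the resource is moved to the
other node while the failing node itself stays online.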