[ClusterLabs] A processor failed, forming new configuration very often and without reason
Andrew Beekhof
andrew at beekhof.net
Sun Apr 26 20:08:56 UTC 2015
> On 13 Apr 2015, at 7:08 pm, Philippe Carbonnier <Philippe.Carbonnier at vif.fr> wrote:
>
> Hello Mr Beekhof,
>
> thanks for your answer. The error when trying to stop the service is just the result of the unsuccessful start of the service: the start tries to create a new IP alias, which fails because the other node is still running it,
Is this IP also a managed resource? If so, it should have been removed when the service was asked to stop (and the stop should not have reported 'success' if it could not do so).
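
As a rough sketch (assuming the alias is managed with "ip addr" and the agent sources the usual OCF shell functions for OCF_SUCCESS/OCF_ERR_GENERIC; the device, netmask and function name are illustrative, not your agent's), a correct stop action is idempotent:

    routing_stop() {
        # Remove the alias only if it is actually present on this node.
        if ip addr show dev eth0 | grep -qF "$OCF_RESKEY_ip"; then
            ip addr del "$OCF_RESKEY_ip/24" dev eth0 || return $OCF_ERR_GENERIC
        fi
        # An alias that is already absent means the resource is stopped: success.
        return $OCF_SUCCESS
    }
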
> so the stop can't be successful because the IP alias is not up on this node.
I think we'd better see your full config and agent; something sounds very wrong.
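
For comparison, a cluster-managed IP on pacemaker 1.0 would normally be a primitive of its own, e.g. (crm shell; address, netmask and nic are placeholders):

    primitive routing-ip ocf:heartbeat:IPaddr2 \
            params ip="192.168.1.100" cidr_netmask="24" nic="eth0" \
            op monitor interval="30s"

and then grouped or ordered with the service that needs it, so that the cluster, not the service's own start script, owns the alias.
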
> IMHO, it's just a symptom; the root cause is that the two nodes sometimes don't see each other.
>
> But I've followed your suggestion and changed the return code of the agent when it is asked to stop: now it returns 0 even if it can't remove an iptables NAT rule.
>
> Do you think this message can help? The two VMs are on VMware, which sometimes reports strange times: "current "epoch" is greater than required"
>
> Best regards,
>
>
> 2015-04-13 5:00 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
>
> > On 10 Apr 2015, at 11:37 pm, Philippe Carbonnier <Philippe.Carbonnier at vif.fr> wrote:
> >
> > Hello,
> >
> > The context :
> > Red Hat Enterprise Linux Server release 5.7
> > corosynclib-1.2.7-1.1.el5.x86_64
> > corosync-1.2.7-1.1.el5.x86_64
> > pacemaker-1.0.10-1.4.el5.x86_64
> > pacemaker-libs-1.0.10-1.4.el5.x86_64
> > 2 nodes, both on same ESX server
> >
> > I get lots of "processor joined or left the membership" messages but can't understand why, because the two hosts are up and running, and when corosync tries to start the cluster's resources it can't, because they are already up on the first node.
> > We can see "Another DC detected", so the communication between the two VMs is OK.
> >
> > I've tried to raise the totem parameters, without success.
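> >
> > As a rough illustration (not my exact values), raising them means editing the totem section of /etc/corosync/corosync.conf on both nodes, e.g.:
> >
> >     totem {
> >             version: 2
> >             token: 5000
> >             token_retransmits_before_loss_const: 10
> >             consensus: 6000
> >             ...
> >     }
> >
> > and restarting corosync.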
>
> > Apr 10 13:34:55 host2.example.com pengine: [26529]: WARN: unpack_rsc_op: Processing failed op routing-jboss_stop_0 on tango2.luxlait.lan: invalid parameter (2)
>
> ^^^ Failed stops lead to fencing.
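>
> (For reference: the "(2)" in that log line is the agent's exit code. In OCF terms, 2 is OCF_ERR_ARGS, i.e. "invalid parameter", meaning the agent itself decided it was invoked with bad parameters; 0 is OCF_SUCCESS, 1 is OCF_ERR_GENERIC, 7 is OCF_NOT_RUNNING.)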
>
> The agent and/or your config need fixing.
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>