[ClusterLabs] A processor failed, forming new configuration very often and without reason
Andrew Beekhof
andrew at beekhof.net
Sun Apr 26 20:08:56 UTC 2015
> On 13 Apr 2015, at 7:08 pm, Philippe Carbonnier <Philippe.Carbonnier at vif.fr> wrote:
>
> Hello Mr Beekhof,
>
> thanks for your answer. The error when trying to stop the service is just the result of the unsuccessful start of the service: the start tries to create a new IP alias, which fails because the other node is still running it,
Is this IP also a managed resource? If so, it should have been removed when the service was asked to stop (and the stop should not have reported 'success' if it could not do so).
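
As a rough sketch (assuming the alias is managed with "ip addr" and the agent sources the usual OCF shell functions for OCF_SUCCESS/OCF_ERR_GENERIC; the device, netmask and function name are illustrative, not your agent's), a correct stop action is idempotent:

    routing_stop() {
        # Remove the alias only if it is actually present on this node.
        if ip addr show dev eth0 | grep -qF "$OCF_RESKEY_ip"; then
            ip addr del "$OCF_RESKEY_ip/24" dev eth0 || return $OCF_ERR_GENERIC
        fi
        # An alias that is already absent means the resource is stopped: success.
        return $OCF_SUCCESS
    }
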
> so the stop can't be successful because the IP alias is not up on this node.
I think we'd better see your full config and agent; something sounds very wrong.
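
For comparison, a cluster-managed IP on pacemaker 1.0 would normally be a primitive of its own, e.g. (crm shell; address, netmask and nic are placeholders):

    primitive routing-ip ocf:heartbeat:IPaddr2 \
            params ip="192.168.1.100" cidr_netmask="24" nic="eth0" \
            op monitor interval="30s"

and then grouped or ordered with the service that needs it, so that the cluster, not the service's own start script, owns the alias.
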
> IMHO, it's just a symptom; the root cause is that the two nodes sometimes don't see each other.
>
> But I've followed your suggestion and changed the return code of the agent when it is asked to stop: now it returns 0 even if it can't remove an iptables NAT rule.
>
> Do you think this message can help? The two VMs are on VMware, which sometimes reports strange times: "current "epoch" is greater than required"
>
> Best regards,
>
>
> 2015-04-13 5:00 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
>
> > On 10 Apr 2015, at 11:37 pm, Philippe Carbonnier <Philippe.Carbonnier at vif.fr> wrote:
> >
> > Hello,
> >
> > The context :
> > Red Hat Enterprise Linux Server release 5.7
> > corosynclib-1.2.7-1.1.el5.x86_64
> > corosync-1.2.7-1.1.el5.x86_64
> > pacemaker-1.0.10-1.4.el5.x86_64
> > pacemaker-libs-1.0.10-1.4.el5.x86_64
> > 2 nodes, both on same ESX server
> >
> > I get lots of "processor joined or left the membership" messages but can't understand why, because the two hosts are up and running, and when corosync tries to start the cluster's resources it can't, because they are already up on the first node.
> > We can see "Another DC detected", so the communication between the two VMs is OK.
> >
> > I've tried to raise the totem parameters, without success.
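> >
> > As a rough illustration (not my exact values), raising them means editing the totem section of /etc/corosync/corosync.conf on both nodes, e.g.:
> >
> >     totem {
> >             version: 2
> >             token: 5000
> >             token_retransmits_before_loss_const: 10
> >             consensus: 6000
> >             ...
> >     }
> >
> > and restarting corosync.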
>
> > Apr 10 13:34:55 host2.example.com pengine: [26529]: WARN: unpack_rsc_op: Processing failed op routing-jboss_stop_0 on tango2.luxlait.lan: invalid parameter (2)
>
> ^^^ Failed stops lead to fencing.
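>
> (For reference: the "(2)" in that log line is the agent's exit code. In OCF terms, 2 is OCF_ERR_ARGS, i.e. "invalid parameter", meaning the agent itself decided it was invoked with bad parameters; 0 is OCF_SUCCESS, 1 is OCF_ERR_GENERIC, 7 is OCF_NOT_RUNNING.)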
>
> The agent and/or your config need fixing.
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>