[ClusterLabs] Antw: Re: A processor failed, forming new configuration very often and without reason

Philippe Carbonnier philippe.carbonnier at vif.fr
Thu Apr 30 10:20:34 UTC 2015


Hello,

thanks for your help.

I don't see "Corosync main process was not scheduled for ..." anywhere in the log, just:
grep -i main corosync.log
Apr 20 02:13:30 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 02:21:23 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 04:18:40 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 04:18:40 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 15:38:42 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 15:38:42 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 15:57:36 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 15:57:38 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 16:20:07 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 16:20:11 corosync [MAIN  ] Completed service synchronization, ready
to provide service.

grep -i schedul corosync.log
Apr 14 02:14:31 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 15 02:14:26 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 16 02:14:45 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 17 02:14:33 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 18 02:14:34 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 19 02:15:11 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 20 02:13:29 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 21 02:12:14 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 22 02:12:13 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 23 02:10:52 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 24 02:10:18 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 25 02:10:35 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 26 02:10:35 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 27 02:09:36 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 28 02:10:37 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 29 02:10:40 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown
Apr 30 02:09:30 host2.example.com pengine: [16959]: info: stage6:
Scheduling Node vif5_7 for shutdown

Node vif5_7 is rebooted after its backup, at around 2 a.m.


The firewall is OK: it accepts all traffic on eth0, without any rate
limiting...
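
For reference, a quick way to double-check both of those points from the shell
might look like the sketch below (assuming iptables is in use; the exact rules
and counters will of course differ per setup):

iptables -L INPUT -n -v                     # list INPUT rules with packet counters; look for a blanket ACCEPT on eth0
iptables -S | grep -iE 'limit|drop|reject'  # any rate-limit or DROP/REJECT rules that could hit corosync's UDP traffic
netstat -su | grep -iE 'error|buffer'       # UDP receive/buffer errors would hint at dropped totem packets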

Best regards,



Philippe CARBONNIER
Research & Development Department
Tel. +33 (0)2 51 89 12 58




2015-04-30 8:30 GMT+02:00 Jan Friesse <jfriesse at redhat.com>:

> Philippe,
>
> Philippe Carbonnier wrote:
>
>> Thanks for your answers.
>> The token value was previously 5000, but I already increased it to 10000,
>> without any change. So that is 10 seconds before TOTEM fires the "A processor
>> failed, forming new configuration" message, yet in the log we see that in
>> the same second the other node reappeared!
>>
>
> That's weird
>
>>  Should I use a higher token value?
>>
>
> I don't think so. I mean, if both nodes are running on the same ESX, it
> shouldn't be needed.
>
> - Is corosync scheduled regularly (you would see the message "Corosync main
> process was not scheduled for ... sec" in the logs if not)?
> - Is firewall correctly configured?
> - Isn't there some kind of rate limiting for packets?
>
> Regards,
>   Honza
>
>
>
>> Best regards,
>>
>> 2015-04-29 14:17 GMT+02:00 Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>:
>>
>>  Jan Friesse <jfriesse at redhat.com> schrieb am 29.04.2015 um 13:10 in
>>>>>>
>>>>> Nachricht
>>> <5540BC0B.50409 at redhat.com>:
>>>
>>>> Philippe,
>>>>
>>>> Philippe Carbonnier wrote:
>>>>
>>>>> Hello,
>>>>> just for the guys who don't want to read all the logs, I put my
>>>>> question on top (and at the end) of the post:
>>>>> Is there a timer that I can raise to give each node more time to see
>>>>> the other BEFORE TOTEM fires the "A processor failed, forming new
>>>>> configuration" message, because the 2 nodes are really up and running.
>>>>>
>>>>
>>>> There are many timers, but basically almost everything depends on the
>>>> token timeout, so just set "token" to a higher value.
>>>>
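
For readers wondering where that knob lives: the token timeout is set in the
totem section of /etc/corosync/corosync.conf and takes effect after corosync
has been restarted on both nodes. A minimal sketch follows; the 10000 ms value
is just the one already tried earlier in this thread, not a recommendation,
and any existing interface/mcast settings stay as they are:

totem {
        version: 2
        # token timeout in milliseconds before "A processor failed,
        # forming new configuration" is declared
        token: 10000
        # consensus, if left unset, defaults to 1.2 * token
}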
>>>
>>> Please correct me if I'm wrong: A token timeout is only triggered when
>>> 1) The token is lost in the network (i.e. a packet is lost and not
>>> retransmitted in time)
>>> 2) The token is lost on a node (e.g. it crashes while it has the token)
>>> 3) The host or the network don't respond in time (the token is not lost,
>>> but late)
>>> 4) There's a major bug in the TOTEM protocol (its implementation)
>>>
>>> I really wonder whether the reason for frequent token timeouts is 1);
>>> usually it's not 2) either. For me 3) is hard to believe also. And nobody
>>> admits it's 4).
>>>
>>> So everybody says it's 3) and suggests to increase the timeout.
>>>
>>>
>>>>> The 2 Linux servers (vif5_7 and host2.example.com) are 2 VMs on the
>>>>> same VMware ESX server. Maybe the network is 'not working' the way
>>>>> corosync wants?
>>>>>
>>>>
>>> OK, for virtual hosts I might add:
>>> 5) The virtual time is not flowing steadily, i.e. the number of usable
>>> CPU cycles per walltime unit is highly variable.
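
A rough way to check whether 5) is biting on these guests, assuming a Linux VM
where the hypervisor exposes steal time, is to watch the CPU "steal" figure;
sustained, non-trivial steal means the ESX host is not scheduling the guest's
vCPU, which is exactly what makes corosync late with the token:

vmstat 1 5               # the last column, "st", is CPU steal time in percent
grep '^cpu ' /proc/stat  # the 8th number after "cpu" is cumulative steal time, in jiffies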
>>>
>>>
>>>> Yep. But first give the token timeout increase a chance.
>>>>
>>>
>>> I agree that for 5) a longer token timeout might be a workaround, but
>>> finding the root cause may be worth the time spent doing so.
>>>
>>>
>>> Regards,
>>> Ulrich
>>>
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
