[Pacemaker] very slow pacemaker/corosync shutdown

Fri Sep 20 01:12:53 EDT 2013

20.09.2013 02:52, Andrew Beekhof wrote:
> 
> On 19/09/2013, at 7:45 PM, David Lang <david at lang.hm> wrote:
> 
>> On Thu, 19 Sep 2013, Florian Crouzat wrote:
>>
>>> On 19/09/2013 00:25, David Lang wrote:
>>>> I'm frequently running into a problem where shutting down
>>>> pacemaker/corosync takes a very long time (several minutes)
>>>
>>> Just to be 100% sure: do you always respect the stop order? Pacemaker *then* CMAN/corosync?
>>
>> 'service pacemaker stop' seems to take down cman as well, but frequently stalls before that.
> 
> logs?
> 
>>
>> we are definitely not taking down cman ahead of time.
>>
>> But we are seeing problems on some systems where we start everything up, verify both nodes are seen, and then a day or
>> so later notice that the two boxes are not communicating. (This is one of the reasons we are looking at disabling
>> multicast: the local networking people have 'interesting' ideas about multicast, and they may be causing problems.)
> 
> this is quite likely the problem.
> multicast support in various parts of the hardware and software stacks seems to be getting worse and worse over time :(

+1
With a modern EL6 kernel I now see cluster nodes advertising
themselves as multicast routers for some reason in *some* bridged
VLANs, and the switch forwards all the multicast packets to them instead
of consulting the IGMP snooping table. For some reason the switch is
forwarding mcast in *all* VLANs to those "mrouters".
It seems that nothing perfect exists in the multicast world. :(
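
For anyone wanting to check the bridge side of this on their own nodes,
here is a minimal sketch that prints the multicast-related settings the
Linux bridge driver exposes in sysfs. It assumes the kernel provides
multicast_snooping, multicast_router and multicast_querier under
/sys/class/net/<bridge>/bridge/; the function name and its optional
sysfs-root argument are made up for illustration. Whether any of these
settings explains the "mrouter" behaviour described above is a guess,
not something established in this thread.

```shell
#!/bin/sh
# Print multicast-related settings for every Linux bridge under a sysfs root.
# show_bridge_mcast is a hypothetical helper name; the optional first argument
# exists only so the function can be pointed at a fake tree instead of /sys.
show_bridge_mcast() {
    root="${1:-/sys}"
    for dir in "$root"/class/net/*/bridge; do
        # Skip the unexpanded glob (no bridges) and anything not a directory.
        [ -d "$dir" ] || continue
        br=$(basename "$(dirname "$dir")")
        for key in multicast_snooping multicast_router multicast_querier; do
            # Older kernels may lack some of these files; print what is there.
            if [ -r "$dir/$key" ]; then
                printf '%s %s=%s\n' "$br" "$key" "$(cat "$dir/$key")"
            fi
        done
    done
}
```

Calling show_bridge_mcast with no argument reads the live /sys tree and
prints one "bridge key=value" line per setting found. For
multicast_router, 0 means disabled, 1 means automatic detection and 2
means permanently enabled.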