[Pacemaker] [Openais] very slow pacemaker/corosync shutdown

Thu Sep 19 19:50:34 EDT 2013

On 20/09/2013, at 8:19 AM, Lists <lists at benjamindsmith.com> wrote:

> On 09/18/2013 06:49 PM, Andrew Beekhof wrote:
>> On 19/09/2013, at 8:25 AM, David Lang <david at lang.hm> wrote:
>> 
>>> What's the best way to see what it's getting stuck doing?
>> Log files.
>> 
>>> Is there a good way to tell if this is a pacemaker or corosync problem (so I can drop one of the lists from the thread)?
>> Not without further information
>> 
> 
> We've had the same problem here while trying to get an HA DNS (named) service working. It works well for a day or so, then seizes up: simple commands like `crm_standby -v true` time out after 120 seconds. We're testing for release and keep running into issues like this. At first we suspected firewall issues, but even after we confirmed operation with several hand-offs of HA services back and forth, it still dies within a day or so.
> 
> We're on CentOS 6/64 with yum packages augmented from http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/
> with exclude=pacemaker* corosync*
> 
> To make the log files available, I've snipped out a time period during which the cluster becomes unresponsive; it is posted at http://hal.schoolpathways.com/details/
> 
> I don't know the exact moment,

I do.

It is right when you start seeing messages like:
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: send_ais_text: 	Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)

Eventually that escalates to:
Sep 19 00:59:39 [9004] nomad.schoolpathways.com       crmd:    error: send_ais_text: 	Sending message 94 via cpg: FAILED (rc=6): Try again: Success (0)

From this we can infer that corosync has gotten horribly confused and, as a consequence, pacemaker can't talk to its peers anymore.
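
To find that moment in a longer log, something like the following Python sketch could help. It assumes the message wording matches the two excerpts above exactly; the function name `find_cpg_trouble` and the log path in the usage comment are hypothetical, not anything Pacemaker or corosync ships.

```python
import re

# Patterns copied from the two crmd log excerpts quoted above.
# \s+ covers the space-plus-tab that follows "send_ais_text:".
RESEND = re.compile(
    r"send_ais_text:\s+Peer overloaded or membership in flux: "
    r"Re-sending message \(Attempt (\d+) of (\d+)\)"
)
FAILED = re.compile(
    r"send_ais_text:\s+Sending message (\d+) via cpg: FAILED \(rc=(-?\d+)\)"
)


def find_cpg_trouble(lines):
    """Return (first_resend_line, first_failed_line); either may be None."""
    first_resend = None
    first_failed = None
    for line in lines:
        if first_resend is None and RESEND.search(line):
            first_resend = line.rstrip("\n")
        if first_failed is None and FAILED.search(line):
            first_failed = line.rstrip("\n")
        if first_resend is not None and first_failed is not None:
            break  # both markers found, no need to read further
    return first_resend, first_failed


# Usage (the path is an assumption; use wherever your cluster logs go):
#   with open("/var/log/cluster/corosync.log") as fh:
#       resend, failed = find_cpg_trouble(fh)
#   print(resend)
#   print(failed)
```

The timestamp on the first "Re-sending message" line is the point at which crmd stopped getting messages through; the corosync membership messages just before it are the ones worth reading.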

> this is a test cluster and not being monitored by a netmon. Any other details I could provide that would be useful/helpful?

Shortly before this, Corosync claims:

Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: pcmk_cpg_membership: 	Left[2.0] crmd.1 
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: crm_update_peer_proc: 	pcmk_cpg_membership: Node bender.schoolpathways.com[1] - corosync-cpg is now offline
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: peer_update_callback: 	Client bender.schoolpathways.com/peer now has status [offline] (DC=true)

Is this true?
If not, perhaps some timeouts need to be adjusted. Switching to udpu (UDP unicast) instead of multicast may also help.

