<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Carlos,<br>
<br>
Increasing the corosync timeouts and the 'monitor' action timeouts in
pacemaker might help, but do you have a separate, dedicated network
connection for corosync? It is better to connect your servers
directly with a crossover cable (to be independent of switches and
network infrastructure) and use this connection for cluster
communication.<br>
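<br>
As a rough sketch of the timeout tuning, the totem section of
corosync.conf might look like the fragment below. The values are
illustrative assumptions rather than tested recommendations, and the
bindnetaddr assumes that the 10.10.1.x ring from your log is a /24
network on the direct link. The multicast address and port are
placeholders too, so adjust everything to your setup.<br>
<pre wrap="">totem {
        version: 2

        # Time in ms to wait for the token before declaring it lost
        # (illustrative value; the default is much lower).
        token: 10000

        # Token retransmit attempts before a new configuration is formed
        # (illustrative value).
        token_retransmits_before_loss_const: 10

        # Time in ms to wait for consensus; must be at least 1.2 * token.
        consensus: 12000

        # Ring 0 on the dedicated link (addresses are assumptions).
        # Any other interface blocks and settings stay as they are.
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.1.0
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }
}
</pre>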
<br>
Best regards,<br>
Alex<br>
<br>
07.02.2013 03:07, Andrew Beekhof:<br>
</div>
<blockquote
cite="mid:CAEDLWG1QJkwKTfDtJO6wNP4DZBDkZ6JVA6GwBZ28p8T7nf0quQ@mail.gmail.com"
type="cite">
<blockquote type="cite">
<pre wrap="">Feb 6 04:31:47 diana corosync[2902]: [CLM ] CLM CONFIGURATION CHANGE
Feb 6 04:31:47 diana corosync[2902]: [CLM ] New Configuration:
Feb 6 04:31:47 diana corosync[2902]: [CLM ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
Feb 6 04:31:47 diana corosync[2902]: [CLM ] Members Left:
Feb 6 04:31:47 diana corosync[2902]: [CLM ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
Feb 6 04:31:47 diana corosync[2902]: [CLM ] Members Joined:
</pre>
</blockquote>
<pre wrap="">
This appears to be the (almost) root of your problem.
The load is starving corosync of CPU (or possibly network bandwidth)
and it can no longer talk to its peer.
Corosync then informs pacemaker, which initiates recovery.
I'd start by tuning some of your timeout values in corosync.conf.
</pre>
</blockquote>
<br>
</body>
</html>