<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Carlos,<br>

      <br>

      Increasing the corosync timeouts and the 'monitor' action timeouts in

      pacemaker might help, but do you have a separate, dedicated network

      connection for corosync? It is better to connect your servers

      directly with a crossover cable (to be independent of switches and

      network infrastructure) and use this connection for intra-cluster

      communication.<br>

      <br>
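
      As a rough sketch of the timeout tuning (the numbers are only

      illustrative assumptions, not values tested on your cluster), the

      totem section of corosync.conf could be adjusted along these lines,

      keeping your existing interface settings as they are:<br>

      <pre wrap="">totem {
        # Existing settings (version, interface blocks, etc.) stay unchanged.

        # Assumed example value: milliseconds to wait for the token before
        # declaring it lost. The default is 1000.
        token: 10000

        # Assumed example value: token retransmit attempts before a loss
        # is declared. The default is 4.
        token_retransmits_before_loss_const: 10

        # Milliseconds to wait for consensus before starting a new
        # membership round. Must be at least 1.2 * token; if omitted,
        # corosync calculates it as 1.2 * token.
        consensus: 12000
}
</pre>

      Larger values make corosync more tolerant of load spikes, at the

      cost of slower detection of a node that has really failed.<br>

      <br>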

      Best regards,<br>

      Alex<br>

      <br>

      07.02.2013 03:07, Andrew Beekhof:<br>

    </div>

    <blockquote

cite="mid:CAEDLWG1QJkwKTfDtJO6wNP4DZBDkZ6JVA6GwBZ28p8T7nf0quQ@mail.gmail.com"

      type="cite">

      <blockquote type="cite">

        <pre wrap="">Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] CLM CONFIGURATION CHANGE

Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] New Configuration:

Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)

Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Left:

Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)

Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Joined:

</pre>

      </blockquote>

      <pre wrap="">

This appears to be the (almost) root of your problem.

The load is starving corosync of CPU (or possibly network bandwidth)

and it can no longer talk to its peer.

Corosync then informs pacemaker, which initiates recovery.

I'd start by tuning some of your timeout values in corosync.conf

</pre>

    </blockquote>

    <br>

  </body>

</html>