I don't think so, as I have other similar clusters on the same network and they didn't have any issues.<br>The only thing I could detect was that the virtual machine was unresponsive.<br>But I think the VM crash was not like a power shutdown; it was more that the VM became very slow and then crashed totally.<br>

<br>Even if the drbd-nagios resource times out, shouldn't it fail over to the other node?<br><br>Regards,<br><br><br><div class="gmail_quote">On 20 February 2012 12:35, Andrew Beekhof <span dir="ltr"><<a href="mailto:andrew@beekhof.net">andrew@beekhof.net</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Mon, Feb 13, 2012 at 9:57 PM, Hugo Deprez <<a href="mailto:hugo.deprez@gmail.com">hugo.deprez@gmail.com</a>> wrote:<br>


> Hello,<br>

><br>

> does anyone have an idea ?<br>

<br>

</div>Well I see:<br>

<br>

Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM<br>

operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)<br>

Feb  8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback:<br>

Resource update 415 failed: (rc=-41) Remote node did not respond<br>

Feb  8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch:<br>

Membership 128: quorum lost<br>

<br>

which looks suspicious.  Network problem?<br>
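<br>
A minimal sketch for pulling lines like these out of a larger log, assuming the syslog-style layout shown in the excerpt above with each entry on a single line. The function name, the level set and the phrase list are illustrative assumptions, not part of Pacemaker or corosync:<br>

```python
import re

# Syslog-style layout as in the excerpt, for example:
# "Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: ..."
# Groups: timestamp, host, daemon, pid, level, message.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+([\w-]+):"
    r"\s+\[(\d+)\]:\s+(\w+):\s+(.*)$"
)

# Assumed filter criteria: tune these for your own logs.
SUSPICIOUS_LEVELS = {"ERROR", "WARN"}
SUSPICIOUS_PHRASES = ("quorum lost", "Timed Out", "did not respond")


def suspicious_lines(lines):
    """Return (timestamp, daemon, level, message) tuples worth a closer look."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the expected layout, skip it
        stamp, _host, daemon, _pid, level, message = match.groups()
        if level in SUSPICIOUS_LEVELS or any(p in message for p in SUSPICIOUS_PHRASES):
            hits.append((stamp, daemon, level, message))
    return hits


if __name__ == "__main__":
    sample = [
        "Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: "
        "LRM operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)",
        "Feb  8 13:06:39 server01 pengine: [19469]: info: "
        "determine_online_status: Node server01 is online",
    ]
    for hit in suspicious_lines(sample):
        print(" | ".join(hit))
```

Running it over the logs of both nodes and comparing the timestamps should show whether the monitor timeout or the membership change came first.<br>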

<div class="HOEnZb"><div class="h5"><br>

><br>

> it seems that at 13:06:38 resources get started on the slave member.<br>

> But then there is something wrong on server01 :<br>

><br>

> Feb  8 13:06:39 server01 pengine: [19469]: info: determine_online_status:<br>

> Node server01 is online<br>

> Feb  8 13:06:39 server01 pengine: [19469]: notice: unpack_rsc_op: Operation<br>

> apache2_monitor_0 found resource apache2 active on server01<br>

> Feb  8 13:06:39 server01 pengine: [19469]: notice: group_print:  Resource<br>

> Group: supervision-grp<br>

> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:<br>

> fs-data    (ocf::heartbeat:Filesystem):    Stopped<br>

> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:<br>

> nagios-ip    (ocf::heartbeat:IPaddr2):    Stopped<br>

> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:<br>

> apache2    (ocf::heartbeat:apache):    Started server01<br>

> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:<br>

> nagios    (lsb:nagios3):    Stopped<br>

><br>

><br>

> But I don't understand what fails, whether it is DRBD or apache2 that causes the<br>

> issue.<br>

><br>

> Any idea ?<br>

><br>

><br>

><br>

> On 10 February 2012 09:39, Hugo Deprez <<a href="mailto:hugo.deprez@gmail.com">hugo.deprez@gmail.com</a>> wrote:<br>

>><br>

>> Hello,<br>

>><br>

>> please find attached to this mail the corosync logs.<br>

>> If you have any tips :)<br>

>><br>

>><br>

>><br>

>> Regards,<br>

>><br>

>> Hugo<br>

>><br>

>><br>

>> On 8 February 2012 15:39, Florian Haas <<a href="mailto:florian@hastexo.com">florian@hastexo.com</a>> wrote:<br>

>>><br>

>>> On Wed, Feb 8, 2012 at 2:29 PM, Hugo Deprez <<a href="mailto:hugo.deprez@gmail.com">hugo.deprez@gmail.com</a>><br>

>>> wrote:<br>

>>> > Dear community,<br>

>>> ><br>

>>> > I am currently running several corosync / DRBD clusters using VMs<br>

>>> > running on<br>

>>> > a VMware ESXi host.<br>

>>> > The guest OSes are Debian Squeeze.<br>

>>> ><br>

>>> > The active member of the cluster just froze; the VM was unreachable.<br>

>>> > But the resources did not manage to move to the other node.<br>

>>> ><br>

>>> > My cluster has the following resources:<br>

>>> ><br>

>>> > Resource Group: grp<br>

>>> >      fs-data    (ocf::heartbeat:Filesystem):<br>

>>> >      nagios-ip  (ocf::heartbeat:IPaddr2):<br>

>>> >      apache2    (ocf::heartbeat:apache):<br>

>>> >      nagios     (lsb:nagios3):<br>

>>> >      pnp        (lsb:npcd):<br>

>>> ><br>

>>> ><br>

>>> > I am currently troubleshooting this issue. I don't really know where to<br>

>>> > look. Of course I had a look at the logs, but it is pretty hard for me<br>

>>> > to<br>

>>> > understand what happened.<br>

>>><br>

>>> It's pretty hard for anyone else to understand _without_ logs. :)<br>

>>><br>

>>> > I noticed that the VM crashed at 12:09 and that the cluster only tried to<br>

>>> > move<br>

>>> > the resources at 12:58; this does not make sense to me. Or maybe the<br>

>>> > host<br>

>>> > wasn't totally down?<br>

>>> ><br>

>>> > Do you have any idea how I can troubleshoot ?<br>

>>><br>

>>> Log analysis is where I would start.<br>

>>><br>

>>> > Last thing: I noticed that if I start apache2 on the slave server,<br>

>>> > corosync<br>

>>> > doesn't detect that the resource is started. Could that be an issue?<br>

>>><br>

>>> Sure it could, but Pacemaker should happily recover from that.<br>

>>><br>

>>> Cheers,<br>

>>> Florian<br>

>>><br>

>>> --<br>

>>> Need help with High Availability?<br>

>>> <a href="http://www.hastexo.com/now" target="_blank">http://www.hastexo.com/now</a><br>

>>><br>

>>> _______________________________________________<br>

>>> Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

>>> <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

>>><br>

>>> Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>

>>> Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>

>>> Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>

>><br>

>><br>

><br>

><br>


><br>

<br>


</div></div></blockquote></div><br>