[Pacemaker] Cluster crash

Mon Feb 13 05:57:07 EST 2012

Hello,

Does anyone have an idea?

It seems that at 13:06:38 the resources get started on the slave member.
But then something goes wrong on server01:

Feb  8 13:06:39 server01 pengine: [19469]: info: determine_online_status:
Node server01 is online
Feb  8 13:06:39 server01 pengine: [19469]: notice: unpack_rsc_op: Operation
apache2_monitor_0 found resource apache2 active on server01
Feb  8 13:06:39 server01 pengine: [19469]: notice: group_print:  Resource
Group: supervision-grp
Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
fs-data    (ocf::heartbeat:Filesystem):    Stopped
Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
nagios-ip    (ocf::heartbeat:IPaddr2):    Stopped
Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
apache2    (ocf::heartbeat:apache):    Started server01
Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
nagios    (lsb:nagios3):    Stopped

But I don't understand what fails, or whether it is DRBD or apache2 that
causes the issue.

Any ideas?
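
For anyone reading wrapped excerpts like the one above, here is a minimal
Python sketch that rejoins lines wrapped by the mail client and lists the
state each native_print entry reports. It is not part of Pacemaker: the
function names (rejoin_wrapped, resource_states) are made up for illustration,
and it assumes every real log entry starts with a syslog timestamp and host
name such as "Feb  8 13:06:39 server01".

```python
import re

# Assumption: every real syslog entry begins with "Mon DD HH:MM:SS host";
# any other non-empty line is a continuation wrapped by the mail client.
ENTRY_START = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ ")

# Matches the payload of a pengine native_print entry, e.g.
# "apache2 (ocf::heartbeat:apache): Started server01"
NATIVE_PRINT = re.compile(
    r"native_print:\s+(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>.+)$"
)


def rejoin_wrapped(lines):
    """Merge continuation lines back onto the log entry they belong to."""
    entries = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def resource_states(lines):
    """Return {resource name: reported state} from native_print entries."""
    states = {}
    for entry in rejoin_wrapped(lines):
        m = NATIVE_PRINT.search(entry)
        if m:
            # Collapse the padding pengine puts between state and node name.
            states[m.group("name")] = " ".join(m.group("state").split())
    return states


SAMPLE = """\
Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
fs-data    (ocf::heartbeat:Filesystem):    Stopped
Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
apache2    (ocf::heartbeat:apache):    Started server01
""".splitlines()

if __name__ == "__main__":
    for name, state in resource_states(SAMPLE).items():
        print(f"{name}: {state}")
```

This only summarizes what the policy engine printed at that moment; it does
not explain why a resource is stopped, which still requires reading the
surrounding entries.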

On 10 February 2012 09:39, Hugo Deprez <hugo.deprez at gmail.com> wrote:

> Hello,
>
> Please find the corosync logs attached to this mail.
> If you have any tips :)
>
>
>
> Regards,
>
> Hugo
>
>
> On 8 February 2012 15:39, Florian Haas <florian at hastexo.com> wrote:
>
>> On Wed, Feb 8, 2012 at 2:29 PM, Hugo Deprez <hugo.deprez at gmail.com>
>> wrote:
>> > Dear community,
>> >
>> > I am currently running several corosync / drbd clusters on VMs hosted on
>> > VMware ESXi hosts.
>> > The guest OS is Debian Squeeze.
>> >
>> > The active member of the cluster just froze and the VM was unreachable.
>> > But the resources did not manage to move to the other node.
>> >
>> > My cluster has the following resources:
>> >
>> > Resource Group: grp
>> >      fs-data    (ocf::heartbeat:Filesystem):
>> >      nagios-ip  (ocf::heartbeat:IPaddr2):
>> >      apache2    (ocf::heartbeat:apache):
>> >      nagios     (lsb:nagios3):
>> >      pnp        (lsb:npcd):
>> >
>> >
>> > I am currently troubleshooting this issue. I don't really know where to
>> > look. Of course I had a look at the logs, but it is pretty hard for me to
>> > understand what happened.
>>
>> It's pretty hard for anyone else to understand _without_ logs. :)
>>
>> > I noticed that the VM crashed at 12:09 and that the cluster only tried to
>> > move the resources at 12:58; this does not make sense to me. Or maybe the
>> > host wasn't totally down?
>> >
>> > Do you have any idea how I can troubleshoot this?
>>
>> Log analysis is where I would start.
>>
>> > One last thing: I noticed that if I start apache2 on the slave server,
>> > corosync doesn't detect that the resource is started. Could that be an
>> > issue?
>>
>> Sure it could, but Pacemaker should happily recover from that.
>>
>> Cheers,
>> Florian
>>
>> --
>> Need help with High Availability?
>> http://www.hastexo.com/now
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>