[Pacemaker] Cluster crash

Mon Feb 20 06:35:14 EST 2012

On Mon, Feb 13, 2012 at 9:57 PM, Hugo Deprez <hugo.deprez at gmail.com> wrote:
> Hello,
>
> Does anyone have an idea?

Well I see:

Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM
operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)
Feb  8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback:
Resource update 415 failed: (rc=-41) Remote node did not respond
Feb  8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch:
Membership 128: quorum lost

which looks suspicious.  Network problem?
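
If it helps, the filtering I did by eye above could be sketched roughly like this in Python. This is only an illustration: the regex assumes the syslog layout seen in the lines quoted here, the year is not in the log and has to be supplied, and the names (LINE_RE, parse_line, suspicious) and the keyword list are made up for this example, not part of Pacemaker or corosync.

```python
import re
from datetime import datetime

# Assumed layout: "Feb  8 12:59:05 host daemon: [pid]: LEVEL: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w.-]+): \[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)

# Hypothetical choice of what counts as "suspicious" for this sketch.
LEVELS = {"ERROR", "WARN"}
KEYWORDS = ("quorum lost", "Timed Out")


def parse_line(line, year=2012):
    """Return a dict for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Collapse the padding in "Feb  8"; %b assumes English month names.
    stamp = " ".join(m.group("ts").split())
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return {
        "time": when,
        "host": m.group("host"),
        "daemon": m.group("daemon"),
        "level": m.group("level"),
        "msg": m.group("msg"),
    }


def suspicious(lines, year=2012):
    """Keep entries logged at ERROR/WARN or mentioning a keyword."""
    hits = []
    for line in lines:
        entry = parse_line(line, year)
        if entry is None:
            continue
        if entry["level"] in LEVELS or any(k in entry["msg"] for k in KEYWORDS):
            hits.append(entry)
    return hits


# The lines quoted in this thread, with the mail wrapping undone.
SAMPLE = [
    "Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM "
    "operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)",
    "Feb  8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback: "
    "Resource update 415 failed: (rc=-41) Remote node did not respond",
    "Feb  8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch: "
    "Membership 128: quorum lost",
    "Feb  8 13:06:39 server01 pengine: [19469]: info: "
    "determine_online_status: Node server01 is online",
]

if __name__ == "__main__":
    for e in suspicious(SAMPLE):
        print(e["time"].time(), e["daemon"], e["level"], e["msg"])
```

Run against the full corosync log, something like this would at least narrow things down to the timed-out operations and membership changes before reading the rest in detail.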

>
> It seems that at 13:06:38 the resources got started on the slave member.
> But then there is something wrong on server01:
>
> Feb  8 13:06:39 server01 pengine: [19469]: info: determine_online_status:
> Node server01 is online
> Feb  8 13:06:39 server01 pengine: [19469]: notice: unpack_rsc_op: Operation
> apache2_monitor_0 found resource apache2 active on server01
> Feb  8 13:06:39 server01 pengine: [19469]: notice: group_print:  Resource
> Group: supervision-grp
> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> fs-data    (ocf::heartbeat:Filesystem):    Stopped
> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> nagios-ip    (ocf::heartbeat:IPaddr2):    Stopped
> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> apache2    (ocf::heartbeat:apache):    Started server01
> Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> nagios    (lsb:nagios3):    Stopped
>
>
> But I don't understand what fails, or whether DRBD or apache2 is causing
> the issue.
>
> Any idea?
>
>
>
> On 10 February 2012 09:39, Hugo Deprez <hugo.deprez at gmail.com> wrote:
>>
>> Hello,
>>
>> Please find the corosync logs attached to this mail.
>> If you have any tips :)
>>
>>
>>
>> Regards,
>>
>> Hugo
>>
>>
>> On 8 February 2012 15:39, Florian Haas <florian at hastexo.com> wrote:
>>>
>>> On Wed, Feb 8, 2012 at 2:29 PM, Hugo Deprez <hugo.deprez at gmail.com>
>>> wrote:
>>> > Dear community,
>>> >
>>> > I am currently running several corosync / DRBD clusters on VMs
>>> > running on VMware ESXi.
>>> > The guest OS is Debian Squeeze.
>>> >
>>> > The active member of the cluster just froze; the VM was unreachable.
>>> > But the resources didn't manage to move to the other node.
>>> >
>>> > My cluster has the following resources:
>>> >
>>> > Resource Group: grp
>>> >      fs-data    (ocf::heartbeat:Filesystem):
>>> >      nagios-ip  (ocf::heartbeat:IPaddr2):
>>> >      apache2    (ocf::heartbeat:apache):
>>> >      nagios     (lsb:nagios3):
>>> >      pnp        (lsb:npcd):
>>> >
>>> >
>>> > I am currently troubleshooting this issue. I don't really know where to
>>> > look. Of course I had a look at the logs, but it is pretty hard for me
>>> > to understand what happened.
>>>
>>> It's pretty hard for anyone else to understand _without_ logs. :)
>>>
>>> > I noticed that the VM crashed at 12:09 and that the cluster only tried
>>> > to move the resources at 12:58, which does not make sense to me. Or
>>> > maybe the host wasn't totally down?
>>> >
>>> > Do you have any idea how I can troubleshoot this?
>>>
>>> Log analysis is where I would start.
>>>
>>> > Last thing: I noticed that if I start apache2 on the slave server,
>>> > corosync doesn't detect that the resource is started. Could that be
>>> > an issue?
>>>
>>> Sure it could, but Pacemaker should happily recover from that.
>>>
>>> Cheers,
>>> Florian
>>>
>>> --
>>> Need help with High Availability?
>>> http://www.hastexo.com/now
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>