[Pacemaker] Reason for cluster resource migration

Andrew Beekhof andrew at beekhof.net
Mon Feb 11 23:07:41 EST 2013


On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amartin at xes-inc.com> wrote:
>> Hello,
>>
>> Unfortunately this same failure occurred again tonight,
>
> It might be the same effect, but there was no indication that the PE
> died last time.
>
>> taking down a production cluster. Here is the part of the log where pengine died:
>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Respawning failed child process: pengine
>> Feb 11 17:05:16 storage0 pengine[12660]:   notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read: Connection to pengine failed
>> Feb 11 17:05:16 storage0 crmd[19358]:    error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
>> Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>> Feb 11 17:05:16 storage0 crmd[19358]:   notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover: Action A_RECOVER (0000000001000000) not supported
>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>> Feb 11 17:05:16 storage0 crmd[19358]:   notice: terminate_cs_connection: Disconnecting from Corosync
>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could not recover from internal error
>>
>> The rest of the log:
>> http://sources.xes-inc.com/downloads/pengine.log
>> Looking through the full log, it seems that pengine recovers,
>
> Right, pacemakerd watches for this and restarts it.
>
>> but perhaps not quickly enough to prevent the STONITH and resource migration?
>
> Highly likely.
> However, the PE crashing is quite serious.  I'd like to get to the
> bottom of that ASAP.
>
>>
>> Here is the pe-core dump file mentioned in the log:
>> http://sources.xes-inc.com/downloads/pe-core.bz2
>
> Unfortunately core files are specific to the machine that generated them.
> If you create a crm_report for around that time, it will find the core
> file, open it, and record a backtrace for us to look at.
>
> Also very important is the contents of:
>    /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2

Ohhh, that's what the pe-core link was.
I've run it through crm_simulate but couldn't reproduce the crash.
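
For reference, replaying a saved PE file looks roughly like this (the path
is the one from your log, and the exact crm_simulate options can vary a
little between Pacemaker versions):

  # replay the saved policy engine input and show the actions it would take
  crm_simulate -S -x /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2

  # add -s to also print the placement scores behind each decision
  crm_simulate -s -S -x /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2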

So we'll still need the crm_report; it will have more detail on the
"Child process pengine terminated with signal 6 (pid=19357, core=128)"
part.
The core file will likely be somewhere under /var/lib/pacemaker/cores,
but crm_report should be able to find it.
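
Something along these lines should do it (the time window below is a guess
based on the log timestamps above, and the binary/core paths are the usual
Debian/Ubuntu defaults, so adjust to suit):

  # collect logs, PE files and any cores from around the time of the crash
  crm_report --from "2013-02-11 16:55:00" --to "2013-02-11 17:15:00" /tmp/feb11-pengine-crash

  # if a core does show up under /var/lib/pacemaker/cores, a backtrace can
  # also be pulled by hand (with the pacemaker debug symbols installed):
  gdb -batch -ex "thread apply all bt" /usr/lib/pacemaker/pengine /var/lib/pacemaker/cores/<core-file>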

>
>>
>> Thanks,
>>
>> Andrew
>>
>>
>>
>>
>> ----- Original Message -----
>>> From: "Andrew Martin" <amartin at xes-inc.com>
>>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>>> Sent: Friday, February 1, 2013 4:32:26 PM
>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>
>>> ----- Original Message -----
>>> > From: "Andrew Beekhof" <andrew at beekhof.net>
>>> > To: "The Pacemaker cluster resource manager"
>>> > <pacemaker at oss.clusterlabs.org>
>>> > Sent: Thursday, December 6, 2012 8:36:27 PM
>>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
>>> >
>>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amartin at xes-inc.com>
>>> > wrote:
>>> > > Hello,
>>> > >
>>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum
>>> > > node in standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8
>>> > > and Corosync 2.1.0. My cluster configuration is:
>>> > > http://pastebin.com/6TPkWtbt
>>> > >
>>> > > Recently, pengine died on storage0 (where the resources were running)
>>> > > which also happened to be the DC at the time. Consequently, Pacemaker
>>> > > went into recovery mode and released its role as DC, at which point
>>> > > storage1 took over the DC role and migrated the resources away from
>>> > > storage0 and onto storage1. Looking through the logs, it seems like
>>> > > storage0 came back into the cluster before the migration of the
>>> > > resources began:
>>> > > Dec 03 08:31:20 [3165] storage1       crmd:     info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
>>> > > ...
>>> > > Dec 03 08:31:20 [3164] storage1    pengine:   notice: LogActions: Start   rscXXX    (storage1)
>>> > >
>>> > > Thus, why did the migration occur, rather than aborting and having
>>> > > the resources simply remain running on storage0? Here are the logs
>>> > > from each of the nodes:
>>> > > storage0: http://pastebin.com/ZqqnH9uf
>>> > > storage1: http://pastebin.com/rvSLVcZs
>>> >
>>> > Hmm, that's an interesting one.
>>> > Can you provide this file?  It will hold the answer:
>>> >
>>> > Dec 03 08:31:31 [3164] storage1    pengine:   notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
>>> >
>>> >
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Andrew
>>> > >
>>>
>>> Andrew,
>>>
>>> Sorry for the delayed response. Here is the file you requested:
>>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
>>>
>>> This same condition just occurred again on storage1 today (pengine
>>> died, and then storage1 was STONITHed).
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>



