[Pacemaker] Reason for cluster resource migration

Mon Feb 11 23:11:53 EST 2013

On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amartin at xes-inc.com> wrote:
>>> Hello,
>>>
>>> Unfortunately this same failure occurred again tonight,
>>
>> It might be the same effect, but there was no indication that the PE
>> died last time.
>>
>>> taking down a production cluster. Here is the part of the log where pengine died:
>>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
>>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Respawning failed child process: pengine
>>> Feb 11 17:05:16 storage0 pengine[12660]:   notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
>>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read: Connection to pengine failed
>>> Feb 11 17:05:16 storage0 crmd[19358]:    error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
>>> Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>>> Feb 11 17:05:16 storage0 crmd[19358]:   notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
>>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
>>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover: Action A_RECOVER (0000000001000000) not supported
>>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>>> Feb 11 17:05:16 storage0 crmd[19358]:   notice: terminate_cs_connection: Disconnecting from Corosync
>>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could not recover from internal error
>>>
>>> The rest of the log:
>>> http://sources.xes-inc.com/downloads/pengine.log
>>> Looking through the full log, it seems that pengine recovers,
>>
>> Right, pacemakerd watches for this and restarts it.
>>
>>> but perhaps not quickly enough to prevent the STONITH and resource migration?
>>
>> Highly likely.
>> However, the PE crashing is quite serious.  I'd like to get to the
>> bottom of that ASAP.
>>
>>>
>>> Here is the pe-core dump file mentioned in the log:
>>> http://sources.xes-inc.com/downloads/pe-core.bz2
>>
>> Unfortunately, core files are specific to the machine that generated them.
>> If you create a crm_report covering about that time, it will open the core
>> file and record a backtrace for us to look at.
>>
>> Also very important is the contents of:
>>    /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>
> Ohhh, that's what the pe-core link was.
> I've run it through crm_simulate but couldn't reproduce the crash.
>
> So we'll still need the crm_report; it will have more detail on the
> "Child process pengine terminated with signal 6 (pid=19357, core=128)"
> part.

Signal 6 is SIGABRT, which usually means an assertion failure, but strangely
there is no mention of one in syslog.
Can you grep /var/log/corosync.log for lines containing 19357 please?
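
If plain grep is awkward, the same search can be sketched in Python. This is
only an illustration: the helper names (lines_mentioning_pid, search_log) are
made up for this example and are not part of Pacemaker. Unlike a bare
"grep 19357", it matches the PID only as a whole number, so a line that
mentions 119357 is not returned.

```python
import re
import signal

# On Linux, signal 6 is SIGABRT, the signal raised by abort() and by a
# failed assert().
ABORT_SIGNAL_NAME = signal.Signals(6).name


def lines_mentioning_pid(lines, pid):
    """Return the lines that contain pid as a whole number.

    Hypothetical helper: the digits of pid must not be preceded or
    followed by another digit.
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(str(pid)) + r"(?!\d)")
    return [line for line in lines if pattern.search(line)]


def search_log(path, pid):
    """Read a log file and return the lines that mention pid."""
    # errors="replace" keeps going if the log holds undecodable bytes.
    with open(path, errors="replace") as log:
        return lines_mentioning_pid(log, pid)
```

Calling search_log("/var/log/corosync.log", 19357) on storage0 should return
whatever the crashed pengine process logged, assuming that file is readable
by the user running the script.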

> The core file will likely be somewhere under /var/lib/pacemaker/cores
> but crm_report should be able to find it.
>
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Andrew Martin" <amartin at xes-inc.com>
>>>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>>>> Sent: Friday, February 1, 2013 4:32:26 PM
>>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>>
>>>> ----- Original Message -----
>>>> > From: "Andrew Beekhof" <andrew at beekhof.net>
>>>> > To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>>>> > Sent: Thursday, December 6, 2012 8:36:27 PM
>>>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>> >
>>>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amartin at xes-inc.com> wrote:
>>>> > > Hello,
>>>> > >
>>>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum node in
>>>> > > standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8 and Corosync
>>>> > > 2.1.0. My cluster configuration is:
>>>> > > http://pastebin.com/6TPkWtbt
>>>> > >
>>>> > > Recently, pengine died on storage0 (where the resources were running) which
>>>> > > also happened to be the DC at the time. Consequently, Pacemaker went into
>>>> > > recovery mode and released its role as DC, at which point storage1 took over
>>>> > > the DC role and migrated the resources away from storage0 and onto storage1.
>>>> > > Looking through the logs, it seems like storage0 came back into the cluster
>>>> > > before the migration of the resources began:
>>>> > > Dec 03 08:31:20 [3165] storage1       crmd:     info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
>>>> > > ...
>>>> > > Dec 03 08:31:20 [3164] storage1    pengine:   notice: LogActions: Start   rscXXX    (storage1)
>>>> > >
>>>> > > Thus, why did the migration occur, rather than aborting and having the
>>>> > > resources simply remain running on storage0? Here are the logs from each of
>>>> > > the nodes:
>>>> > > storage0: http://pastebin.com/ZqqnH9uf
>>>> > > storage1: http://pastebin.com/rvSLVcZs
>>>> >
>>>> > Hmm, that's an interesting one.
>>>> > Can you provide this file?  It will hold the answer:
>>>> >
>>>> > Dec 03 08:31:31 [3164] storage1    pengine:   notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
>>>> >
>>>> >
>>>> > >
>>>> > > Thanks,
>>>> > >
>>>> > > Andrew
>>>> > >
>>>> > > _______________________________________________
>>>> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> > >
>>>> > > Project Home: http://www.clusterlabs.org
>>>> > > Getting started:
>>>> > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> > > Bugs: http://bugs.clusterlabs.org
>>>> > >
>>>> >
>>>>
>>>> Andrew,
>>>>
>>>> Sorry for the delayed response. Here is the file you requested:
>>>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
>>>>
>>>> This same condition just occurred again on storage1 today (pengine
>>>> died, and then storage1 was STONITHed).
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>