[Pacemaker] Reason for cluster resource migration

Tue Feb 12 23:52:23 EST 2013

On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amartin at xes-inc.com> wrote:
> ----- Original Message -----
>> From: "Andrew Beekhof" <andrew at beekhof.net>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Monday, February 11, 2013 10:11:53 PM
>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>
>> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <andrew at beekhof.net>
>> wrote:
>> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof
>> > <andrew at beekhof.net> wrote:
>> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin
>> >> <amartin at xes-inc.com> wrote:
>> >>> Hello,
>> >>>
>> >>> Unfortunately this same failure occurred again tonight,
>> >>
>> >> It might be the same effect, but there was no indication that the
>> >> PE
>> >> died last time.
>> >>
>> >>> taking down a production cluster. Here is the part of the log
>> >>> where pengine died:
>> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice:
>> >>> pcmk_child_exit: Child process pengine terminated with signal 6
>> >>> (pid=19357, core=128)
>> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice:
>> >>> pcmk_child_exit: Respawning failed child process: pengine
>> >>> Feb 11 17:05:16 storage0 pengine[12660]:   notice:
>> >>> crm_add_logfile: Additional logging available in
>> >>> /var/log/corosync.log
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read:
>> >>> Connection to pengine failed
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error:
>> >>> mainloop_gio_callback: Connection to pengine[0x891680] closed
>> >>> (I/O condition=25)
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy:
>> >>> Connection to the Policy Engine failed (pid=-1,
>> >>> uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
>> >>> save_cib_contents: Saved CIB contents after PE crash to
>> >>> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
>> >>> Input I_ERROR from save_cib_contents() received in state
>> >>> S_POLICY_ENGINE
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
>> >>> do_state_transition: State transition S_POLICY_ENGINE ->
>> >>> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
>> >>> origin=save_cib_contents ]
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover:
>> >>> Action A_RECOVER (0000000001000000) not supported
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote:
>> >>> Not voting in election, we're in state S_RECOVERY
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
>> >>> Input I_TERMINATE from do_recover() received in state S_RECOVERY
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
>> >>> terminate_cs_connection: Disconnecting from Corosync
>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could
>> >>> not recover from internal error
>> >>>
>> >>> The rest of the log:
>> >>> http://sources.xes-inc.com/downloads/pengine.log
>> >>> Looking through the full log, it seems that pengine recovers,
>> >>
>> >> Right, pacemakerd watches for this and restarts it.
>> >>
>> >>> but perhaps not quickly enough to prevent the STONITH and
>> >>> resource migration?
>> >>
>> >> Highly likely.
>> >> However, the PE crashing is quite serious.  I'd like to get to the
>> >> bottom of that ASAP.
>> >>
>> >>>
>> >>> Here is the pe-core dump file mentioned in the log:
>> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
>> >>
>> >> Unfortunately core files are specific to the machine that
>> >> generated them.
>> >> If you create a crm_report covering that time, it will open the
>> >> core file and record a backtrace for us to look at.
>> >>
>> >> Also very important is the contents of:
>> >>    /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> >
>> > Ohhh, that's what the pe-core link was.
>> > I've run it through crm_simulate but couldn't reproduce the crash.
>> >
>> > So we'll still need the crm_report, it will have more detail on the
>> > "Child process pengine terminated with signal 6 (pid=19357,
>> > core=128)"
>> > part.
>>
>> Signal 6 is an assertion failure, but strangely there is no mention
>> of
>> one in syslog.
>> Can you grep /var/log/corosync.log for lines containing 19357 please?
>>
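For reference, a minimal sketch of that grep. The function name
lines_for_pid is made up for illustration; -w is used so that the PID
only matches as a whole word.

```shell
# Print every line of a log file that mentions a given PID as a whole word,
# so 19357 matches "pengine[19357]" and "pid=19357" but not "193570".
# lines_for_pid is a hypothetical helper name, not part of Pacemaker.
lines_for_pid() {
    pid=$1
    logfile=$2
    grep -w -- "$pid" "$logfile"
}

# Example, using the path mentioned in the thread:
# lines_for_pid 19357 /var/log/corosync.log
```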
> Andrew,
>
> Thanks for the help. Here are the lines containing 19357:
> http://sources.xes-inc.com/downloads/19357.log
> cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource. Postfix
> is installed and running, so I am not sure why these failures are occurring.
>
>> > The core file will likely be somewhere under
>> > /var/lib/pacemaker/cores
> That directory doesn't exist on this server, and it doesn't appear to be in /var/crash either:

It looks like /var/lib/heartbeat/cores/ on your system.

> # ls /var/crash/ -ltr
> total 67548
> -rw-r----- 1 hacluster whoopsie  1293711 Feb  6 10:01 _usr_libexec_pacemaker_pengine.110.crash
> ---------- 1 root      whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash
> In case they would be helpful, here are those two files:
> http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
> http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
>
> Here is the crm_report from storage0 from this time period:
> http://sources.xes-inc.com/downloads/pengine-report.tar.bz2

Are you sure?
The pengine crashed on "Feb 11 17:05:15" but the report appears to be
from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013".
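
A report that brackets the crash could be requested roughly as below.
This is only a sketch: it assumes GNU date, takes the year 2013 from the
thread dates (syslog does not record it), uses an arbitrary 15 minute
margin, and "pengine-crash" is a made-up destination name. The helper
only prints the crm_report command so the window can be checked before
running it.

```shell
# Build a crm_report command covering a window around a crash time.
# report_cmd is a hypothetical helper; -f/-t are crm_report's from/to options.
report_cmd() {
    crash=$1      # e.g. "2013-02-11 17:05:15"
    margin=$2     # seconds either side of the crash
    # GNU date is assumed for -d and the @epoch form.
    epoch=$(date -d "$crash" +%s)
    from=$(date -d "@$((epoch - margin))" "+%Y-%m-%d %H:%M:%S")
    to=$(date -d "@$((epoch + margin))" "+%Y-%m-%d %H:%M:%S")
    echo "crm_report -f \"$from\" -t \"$to\" pengine-crash"
}

# Print the command for the crash seen in the log excerpt.
report_cmd "2013-02-11 17:05:15" 900
```

Run the printed command on storage0 so that the core file can be opened
on the machine that generated it.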

There was one crash in there, but it was of the lrmd.
Unfortunately it looks like the binaries and libraries have been stripped.

Where did you get them from?  Do you know how to install the -debug packages?