[Pacemaker] Reason for cluster resource migration

Mon Feb 11 21:40:46 EST 2013

Hello,

Unfortunately, this same failure occurred again tonight, taking down a production cluster. Here is the part of the log where pengine died:
Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Respawning failed child process: pengine
Feb 11 17:05:16 storage0 pengine[12660]:   notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read: Connection to pengine failed
Feb 11 17:05:16 storage0 crmd[19358]:    error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
Feb 11 17:05:16 storage0 crmd[19358]:   notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover: Action A_RECOVER (0000000001000000) not supported
Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Feb 11 17:05:16 storage0 crmd[19358]:   notice: terminate_cs_connection: Disconnecting from Corosync
Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could not recover from internal error

The rest of the log:
http://sources.xes-inc.com/downloads/pengine.log
Looking through the full log, it seems that pengine recovers, but perhaps not quickly enough to prevent the STONITH and resource migration?
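
To put a number on "quickly enough", here is a minimal sketch that measures how long pacemakerd took to respawn a crashed child. It assumes the syslog line format shown in the excerpt above. The function name respawn_delays and the year default are hypothetical and are not part of Pacemaker. Signal 6 is SIGABRT on Linux.

```python
import re
from datetime import datetime

# Both patterns assume the syslog prefix seen in the excerpt above:
# "<Mon> <day> <HH:MM:SS> <host> pacemakerd[<pid>]: <level>: pcmk_child_exit: ..."
PREFIX = (r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
          r"pacemakerd\[\d+\]:\s+\w+: pcmk_child_exit: ")
EXIT_RE = re.compile(
    PREFIX + r"Child process (?P<child>\S+) terminated with signal (?P<sig>\d+)")
RESPAWN_RE = re.compile(
    PREFIX + r"Respawning failed child process: (?P<child>\S+)")


def parse_ts(ts, year):
    # Syslog timestamps carry no year, so the caller has to supply one.
    normalized = " ".join(ts.split())
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def respawn_delays(lines, year=2013):
    """Return (child, signal, seconds) for each crash followed by a respawn."""
    pending = {}   # (host, child) -> (exit time, signal number)
    results = []
    for raw in lines:
        line = raw.strip()
        m = EXIT_RE.match(line)
        if m:
            key = (m.group("host"), m.group("child"))
            pending[key] = (parse_ts(m.group("ts"), year), int(m.group("sig")))
            continue
        m = RESPAWN_RE.match(line)
        if m:
            key = (m.group("host"), m.group("child"))
            if key in pending:
                exited_at, sig = pending.pop(key)
                delay = (parse_ts(m.group("ts"), year) - exited_at).total_seconds()
                results.append((m.group("child"), sig, delay))
    return results


if __name__ == "__main__":
    sample = [
        "Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: "
        "Child process pengine terminated with signal 6 (pid=19357, core=128)",
        "Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: "
        "Respawning failed child process: pengine",
    ]
    for child, sig, delay in respawn_delays(sample):
        print(f"{child}: signal {sig}, respawned after {delay:.0f}s")
```

Syslog timestamps only have one-second resolution, so this gives a coarse bound at best. It also says nothing about whether crmd had already entered S_RECOVERY before the new pengine was ready.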

Here is the pe-core dump file mentioned in the log:
http://sources.xes-inc.com/downloads/pe-core.bz2

Thanks,

Andrew 

----- Original Message -----
> From: "Andrew Martin" <amartin at xes-inc.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Friday, February 1, 2013 4:32:26 PM
> Subject: Re: [Pacemaker] Reason for cluster resource migration
> 
> ----- Original Message -----
> > From: "Andrew Beekhof" <andrew at beekhof.net>
> > To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> > Sent: Thursday, December 6, 2012 8:36:27 PM
> > Subject: Re: [Pacemaker] Reason for cluster resource migration
> > 
> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amartin at xes-inc.com>
> > wrote:
> > > Hello,
> > >
> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum node in
> > > standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8 and Corosync
> > > 2.1.0. My cluster configuration is:
> > > http://pastebin.com/6TPkWtbt
> > >
> > > Recently, pengine died on storage0 (where the resources were running) which
> > > also happened to be the DC at the time. Consequently, Pacemaker went into
> > > recovery mode and released its role as DC, at which point storage1 took over
> > > the DC role and migrated the resources away from storage0 and onto storage1.
> > > Looking through the logs, it seems like storage0 came back into the cluster
> > > before the migration of the resources began:
> > > Dec 03 08:31:20 [3165] storage1       crmd:     info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
> > > ...
> > > Dec 03 08:31:20 [3164] storage1    pengine:   notice: LogActions: Start   rscXXX    (storage1)
> > >
> > > Thus, why did the migration occur, rather than aborting and having the
> > > resources simply remain running on storage0? Here are the logs from each of
> > > the nodes:
> > > storage0: http://pastebin.com/ZqqnH9uf
> > > storage1: http://pastebin.com/rvSLVcZs
> > 
> > Hmm, that's an interesting one.
> > Can you provide this file?  It will hold the answer:
> > 
> > Dec 03 08:31:31 [3164] storage1    pengine:   notice: process_pe_message: 	Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
> > 
> > 
> > >
> > > Thanks,
> > >
> > > Andrew
> > >
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started:
> > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> > >
> > 
> 
> Andrew,
> 
> Sorry for the delayed response. Here is the file you requested:
> http://sources.xes-inc.com/downloads/pe-input-28.bz2
> 
> This same condition just occurred again on storage1 today (pengine
> died, and then storage1 was STONITHed).
> 
> Thanks,
> 
> Andrew
> 