[Pacemaker] Master won't get promoted

Mon Oct 3 09:27:56 EDT 2011

Hi,

Thanks for the answer below; that clears up why it wasn't working without
a stonith device.

I'm wondering whether I would still need a stonith device if we have 2
redundant NICs on each node (each connected to a different switch) for
the LAN connection, plus one NIC for the DRBD sync connection.  The 2
redundant LAN connections would be bonded together so they would share the
same IP.  If one LAN link goes down, the other would still be up, so
there would be no split-brain scenario; we'd just have to make sure we
fix the downed link before the other one has a chance to fail and
cause a split brain.

I realize there's still a single point of failure there, as the bonded
interface could fail as a whole.  I don't think the company will spring
for stonith devices, and I don't see how I could use fence_ipmilan/IPMI
as a stonith device in my 2-node primary/secondary scenario.  My
understanding of this is still far from where it should be, but my
impression is that the stonith device will only reboot a failed node
if the DRBD sync connection is down.
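
For reference, here is a rough sketch of what an IPMI-based stonith setup
might look like in crm shell syntax.  It assumes the external/ipmi plugin
from cluster-glue is available and that each node has a BMC reachable over
the network; the resource names, BMC addresses, and credentials are
placeholders, not values from this cluster.

```
# Sketch only: st_* / l_st_* names, ipaddr, userid and passwd are hypothetical.
# One stonith resource per node, each able to power-cycle that node via its BMC.
primitive st_staging1 stonith:external/ipmi \
        params hostname="staging1.dev.applepeak.com" ipaddr="192.0.2.11" \
               userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
primitive st_staging2 stonith:external/ipmi \
        params hostname="staging2.dev.applepeak.com" ipaddr="192.0.2.12" \
               userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
# A node should not run the resource that fences itself.
location l_st_staging1 st_staging1 -inf: staging1.dev.applepeak.com
location l_st_staging2 st_staging2 -inf: staging2.dev.applepeak.com
property stonith-enabled="true"
```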

Thanks,
Charles

On Thu, Sep 29, 2011 at 10:25 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> Hi,
>
> On Thu, Sep 29, 2011 at 09:30:55AM -0300, Charles Richard wrote:
> > Here it is attached.
> >
> > I also see the following 2 errors in the node 2 logs, which I assume mean
> > the problem is really that node1 is not getting demoted, and I'm not sure
> > why:
> >
> > Error 1:
> > Sep 28 19:53:20 staging2 drbd[8587]: ERROR: mysqld: Called drbdadm -c /etc/drbd.conf primary mysqld
> > Sep 28 19:53:20 staging2 drbd[8587]: ERROR: mysqld: Exit code 11
> > Sep 28 19:53:20 staging2 drbd[8587]: ERROR: mysqld: Command output:
> > Sep 28 19:53:20 staging2 lrmd: [1442]: info: RA output: (drbd_mysql:1:promote:stdout)
> > Sep 28 19:53:22 staging2 lrmd: [1442]: info: RA output: (drbd_mysql:1:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config
> >
> > Error 2:
> > Sep 28 19:53:27 staging2 kernel: d-con mysqld: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
> > Sep 28 19:53:27 staging2 kernel: d-con mysqld: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
> > Sep 28 19:53:27 staging2 kernel: d-con mysqld: meta connection shut down by peer.
> >
> > Also, failover works fine if I reboot either machine.  The outdated
> > machine comes back up as secondary.  The scenario where I get the errors
> > above is when I pull the network cable from the primary.  Is that
> > something a stonith device should be protecting against, potentially by
> > rebooting the primary?
>
> Yes. That's the only way for the cluster to keep sanity in case
> of split-brain caused by pulling the network cable.
>
> Thanks,
>
> Dejan
>
> > Feels like I'm getting so close to getting this working!
> >
> > Thanks!
> > Charles
> >
> > On Thu, Sep 29, 2011 at 4:15 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> >
> > > Could you attach  /var/lib/pengine/pe-input-3802.bz2 from staging1?
> > > That would tell us why.
> > >
> > > On Mon, Sep 26, 2011 at 10:28 PM, Charles Richard
> > > <chachi.richard at gmail.com> wrote:
> > > > Hi,
> > > >
> > > > I'm finally making some headway with my pacemaker install, but now that
> > > > crm_mon doesn't return errors any more and crm_verify is clear, I'm
> > > > having a problem where my master won't get promoted.  Not sure what to
> > > > do with this one; any suggestions?  Here's the log snippet and config
> > > > files:
> > > >
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_pe_invoke: Query 106: Requesting the current CIB: S_POLICY_ENGINE
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_pe_invoke_callback: Invoking the PE: query=106, ref=pe_calc-dc-1317020772-95, seq=2564, quorate=1
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: unpack_config: Startup probes: enabled
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: unpack_domains: Unpacking domains
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: determine_online_status: Node staging1.dev.applepeak.com is online
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: determine_online_status: Node staging2.dev.applepeak.com is online
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: group_print:  Resource Group: mysql
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: native_print: fs_mysql#011(ocf::heartbeat:Filesystem):#011Stopped
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: native_print: ip_mysql#011(ocf::heartbeat:IPaddr2):#011Stopped
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: native_print: mysqld#011(lsb:mysqld):#011Stopped
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: clone_print: Master/Slave Set: ms_drbd_mysql
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: short_print: Stopped: [ drbd_mysql:0 drbd_mysql:1 ]
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: native_merge_weights: fs_mysql: Rolling back scores from ip_mysql
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: native_merge_weights: ip_mysql: Rolling back scores from mysqld
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource fs_mysql#011(Stopped)
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource ip_mysql#011(Stopped)
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource mysqld#011(Stopped)
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource drbd_mysql:0#011(Stopped)
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource drbd_mysql:1#011(Stopped)
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: unpack_graph: Unpacked transition 72: 0 actions in 0 synapses
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_te_invoke: Processing graph 72 (ref=pe_calc-dc-1317020772-95) derived from /var/lib/pengine/pe-input-3802.bz2
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: run_graph: ====================================================
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: notice: run_graph: Transition 72 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3802.bz2): Complete
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: te_graph_trigger: Transition 72 is now complete
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: notify_crmd: Transition 72 status: done - <null>
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: Starting PEngine Recheck Timer
> > > > Sep 26 04:06:12 staging1 pengine: [1685]: info: process_pe_message: Transition 72: PEngine Input stored in: /var/lib/pengine/pe-input-3802.bz2
> > > > Sep 26 04:15:09 staging1 cib: [1682]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
> > > >
> > > > My drbd config file:
> > > >
> > > > resource mysqld {
> > > >   protocol C;
> > > >   startup { wfc-timeout 0; degr-wfc-timeout 120; }
> > > >   disk { on-io-error detach; }
> > > >
> > > >   on staging1 {
> > > >     device /dev/drbd0;
> > > >     disk /dev/vg_staging1/lv_data;
> > > >     meta-disk internal;
> > > >     address 10.10.20.1:7788;
> > > >   }
> > > >
> > > >   on staging2 {
> > > >     device /dev/drbd0;
> > > >     disk /dev/vg_staging2/lv_data;
> > > >     meta-disk internal;
> > > >     address 10.10.20.2:7788;
> > > >   }
> > > > }
> > > >
> > > > corosync.conf:
> > > >
> > > > compatibility: whitetank
> > > >
> > > > aisexec {
> > > >   user: root
> > > >   group: root
> > > > }
> > > >
> > > > totem {
> > > >         version: 2
> > > >         secauth: off
> > > >         threads: 0
> > > >         interface {
> > > >                 ringnumber: 0
> > > >                 bindnetaddr: 10.10.10.0
> > > >                 mcastaddr: 226.94.1.1
> > > >                 mcastport: 5405
> > > >         }
> > > > }
> > > >
> > > > logging {
> > > >         fileline: off
> > > >         to_stderr: no
> > > >         to_logfile: no
> > > >         to_syslog: yes
> > > >         logfile: /var/log/cluster/corosync.log
> > > >         debug: off
> > > >         timestamp: on
> > > >         logger_subsys {
> > > >                 subsys: AMF
> > > >                 debug: off
> > > >         }
> > > > }
> > > >
> > > > amf {
> > > >         mode: disabled
> > > > }
> > > >
> > > > service {
> > > > #Load Pacemaker
> > > > name: pacemaker
> > > > ver: 0
> > > > use_mgmtd: yes
> > > > }
> > > >
> > > > And my crm config:
> > > >
> > > > node staging1.dev.applepeak.com
> > > > node staging2.dev.applepeak.com
> > > > primitive drbd_mysql ocf:linbit:drbd \
> > > >         params drbd_resource="mysqld" \
> > > >         op monitor interval="15s" \
> > > >         op start interval="0" timeout="240s" \
> > > >         op stop interval="0" timeout="100s"
> > > > primitive fs_mysql ocf:heartbeat:Filesystem \
> > > >         params device="/dev/drbd0" directory="/opt/data/mysql/data/mysql" fstype="ext4" \
> > > >         op start interval="0" timeout="60s" \
> > > >         op stop interval="0" timeout="60s"
> > > > primitive ip_mysql ocf:heartbeat:IPaddr2 \
> > > >         params ip="10.10.10.31" nic="eth0"
> > > > primitive mysqld lsb:mysqld
> > > > group mysql fs_mysql ip_mysql mysqld
> > > > ms ms_drbd_mysql drbd_mysql \
> > > >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > > > colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
> > > > order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
> > > > property $id="cib-bootstrap-options" \
> > > >         dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
> > > >         cluster-infrastructure="openais" \
> > > >         expected-quorum-votes="2" \
> > > >         stonith-enabled="false" \
> > > >         last-lrm-refresh="1316961847" \
> > > >         stop-all-resources="true" \
> > > >         no-quorum-policy="ignore"
> > > > rsc_defaults $id="rsc-options" \
> > > >         resource-stickiness="100"
> > > >
> > > > Thanks,
> > > > Charles
> > > >
> > > > _______________________________________________
> > > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > >
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > > >
> > > >
> > >