[Pacemaker] Cluster Refuses to Stop/Shutdown

Steven Dake sdake at redhat.com
Thu Sep 24 18:47:54 EDT 2009


Remi,

Likely a defect.  We will have to look into it.  Please file a bug per
the instructions on the corosync wiki at www.corosync.org.

On Thu, 2009-09-24 at 16:47 -0600, Remi Broemeling wrote:
> I've spent all day working on this, even going so far as to build my
> own set of packages from the Debian-available ones (which appear to be
> different from the Ubuntu-available ones).  It had no effect on the
> issue at all: the cluster still ends up in a split-brain after a
> single SIGQUIT.
> 
> The Debian packages that also demonstrate this behavior were the
> following versions:
>     cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
>     corosync_1.0.0-5~bpo50+1_i386.deb
>     libcorosync4_1.0.0-5~bpo50+1_i386.deb
>     libopenais3_1.0.0-4~bpo50+1_i386.deb
>     openais_1.0.0-4~bpo50+1_i386.deb
>     pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
> 
> These packages were re-built (under Ubuntu Hardy Heron LTS) from the
> *.diff.gz, *.dsc, and *.orig.tar.gz files available at
> http://people.debian.org/~madkiss/ha-corosync, and as I said the
> symptoms remain exactly the same, both under the configuration that I
> list below and under the sample configuration that came with these
> packages.  I also tried the same test with a single IP address
> resource associated with the cluster, just to be sure it wasn't an
> edge case for a cluster with no resources, but again that had no
> effect.
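> 
> (For anyone wanting to reproduce the rebuild, the standard Debian
> source-package dance should be all that is needed; roughly the
> following, with illustrative filenames:)
> 
>     # Unpack a source package and rebuild its binary packages.
>     dpkg-source -x corosync_1.0.0-5~bpo50+1.dsc
>     cd corosync-1.0.0
>     dpkg-buildpackage -b -us -uc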
> 
> Basically, I'm still exactly where I was yesterday morning at about
> 0900.
> 
> Remi Broemeling wrote: 
> > I posted this to the OpenAIS mailing list
> > (openais at lists.linux-foundation.org) yesterday, but haven't received
> > a response, and on further reflection I think I chose the wrong list
> > to post it to.  That list seems to be far less about user support and
> > far more about developer communication.  I'm therefore re-trying here,
> > as the archives show this list to be somewhat more user-focused.
> > 
> > The problem is that corosync refuses to shut down in response to a
> > QUIT signal.  Given the cluster below (output of crm_mon):
> > 
> > ============
> > Last updated: Wed Sep 23 15:56:24 2009
> > Stack: openais
> > Current DC: boot1 - partition with quorum
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, 2 expected votes
> > 0 Resources configured.
> > ============
> > 
> > Online: [ boot1 boot2 ]
> > 
> > If I log into the host 'boot2' and issue the command "killall -QUIT
> > corosync", the anticipated result is that boot2 goes offline (out of
> > the cluster) and all of the cluster processes
> > (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) shut down.  However,
> > that is not what happens, and I don't really have any idea why.  After
> > issuing "killall -QUIT corosync" on boot2, the result is instead a
> > split-brain:
> > 
> > From boot1's viewpoint:
> > ============
> > Last updated: Wed Sep 23 15:58:27 2009
> > Stack: openais
> > Current DC: boot1 - partition WITHOUT quorum
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, 2 expected votes
> > 0 Resources configured.
> > ============
> > 
> > Online: [ boot1 ]
> > OFFLINE: [ boot2 ]
> > 
> > From boot2's viewpoint:
> > ============
> > Last updated: Wed Sep 23 15:58:35 2009
> > Stack: openais
> > Current DC: boot1 - partition with quorum
> > Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> > 2 Nodes configured, 2 expected votes
> > 0 Resources configured.
> > ============
> > 
> > Online: [ boot1 boot2 ]
> > 
> > At this point the status quo holds until ANOTHER QUIT signal is sent
> > to corosync (i.e. "killall -QUIT corosync" is executed on boot2
> > again).  Then boot2 shuts down properly and everything appears to be
> > fine.  In short, what I expect to happen after a single QUIT signal
> > instead takes two QUIT signals, and that is my question: why does it
> > take two QUIT signals to force corosync to actually shut down?  Is
> > that the intended behavior?  From everything I have read online it
> > seems very strange, and it makes me think that I have a problem in my
> > configuration(s), but I've no idea what that would be, even after
> > playing with things and investigating all day.
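> > 
> > For reference, a minimal way to see which of those daemons survive
> > the first signal (process names taken from the list above) is simply:
> > 
> >     # Send the first QUIT to corosync, then list what is still
> >     # running a few seconds later.
> >     killall -QUIT corosync
> >     sleep 5
> >     ps -eo pid,comm | egrep 'corosync|stonithd|cib|lrmd|attrd|pengine|crmd'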
> > 
> > I would be very grateful for any guidance that could be provided, as
> > at the moment I seem to be at an impasse.
> > 
> > Log files, with debugging set to 'on', can be found at the following
> > pastebin locations:
> >     After first QUIT signal issued on boot2:
> >         boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
> >         boot2:/var/log/syslog: http://pastebin.com/d26fdfee
> >     After second QUIT signal issued on boot2:
> >         boot1:/var/log/syslog: http://pastebin.com/m755fb989
> >         boot2:/var/log/syslog: http://pastebin.com/m22dcef45
> > 
> > OS, Software Packages, and Versions:
> >     * two nodes, each running Ubuntu Hardy Heron LTS
> >     * ubuntu-ha packages, as downloaded from
> >       http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
> >         * pacemaker-openais package version
> >           1.0.5+hg20090813-0ubuntu2~hardy1
> >         * openais package version 1.0.0-3ubuntu1~hardy1
> >         * corosync package version 1.0.0-4ubuntu1~hardy2
> >         * heartbeat-common package version
> >           heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1
> > 
> > Network Setup:
> >     * boot1
> >         * eth0 is 192.168.10.192
> >         * eth1 is 172.16.1.1
> >     * boot2
> >         * eth0 is 192.168.10.193
> >         * eth1 is 172.16.1.2
> >     * boot1:eth0 and boot2:eth0 both connect to the same switch.
> >     * boot1:eth1 and boot2:eth1 are connected directly to each other
> >       via a cross-over cable.
> >     * no firewalls are involved, and tcpdump shows the multicast and
> >       UDP traffic flowing correctly over these links (a sample capture
> >       command is shown after this list).
> >     * I attempted a broadcast (rather than multicast) configuration to
> >       see if that would fix the problem.  It did not.
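> > 
> > (The capture used to confirm that was nothing more elaborate than
> > something along these lines; interface names and ports are taken from
> > the corosync.conf further below:)
> > 
> >     # Ring 0: totem/UDP traffic on the cross-over link.
> >     tcpdump -ni eth1 'udp and port 5505'
> >     # Ring 1: totem/UDP traffic on the switched link.
> >     tcpdump -ni eth0 'udp and port 6606'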
> > 
> > `crm configure show` output:
> >     node boot1
> >     node boot2
> >     property $id="cib-bootstrap-options" \
> >             dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
> >             cluster-infrastructure="openais" \
> >             expected-quorum-votes="2" \
> >             stonith-enabled="false" \
> >             no-quorum-policy="ignore"
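> > 
> > In case it helps, those properties are reproducible from the crm
> > shell; roughly the following (this assumes the crm tool shipped with
> > the pacemaker packages listed above):
> > 
> >     # Disable STONITH and ignore loss of quorum on this 2-node cluster.
> >     crm configure property stonith-enabled=false
> >     crm configure property no-quorum-policy=ignore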
> > 
> > Contents of /etc/corosync/corosync.conf:
> >     # Please read the corosync.conf.5 manual page
> >     compatibility: whitetank
> > 
> >     totem {
> >         clear_node_high_bit: yes
> >         version: 2
> >         secauth: on
> >         threads: 1
> >         heartbeat_failures_allowed: 3
> >         interface {
> >                 ringnumber: 0
> >                 bindnetaddr: 172.16.1.0
> >                 mcastaddr: 239.42.0.1
> >                 mcastport: 5505
> >         }
> >         interface {
> >                 ringnumber: 1
> >                 bindnetaddr: 192.168.10.0
> >                 mcastaddr: 239.42.0.2
> >                 mcastport: 6606
> >         }
> >         rrp_mode: passive
> >     }
> > 
> >     amf {
> >         mode: disabled
> >     }
> > 
> >     service {
> >         name: pacemaker
> >         ver: 0
> >     }
> > 
> >     aisexec {
> >         user: root
> >         group: root
> >     }
> > 
> >     logging {
> >         debug: on
> >         fileline: off
> >         function_name: off
> >         to_logfile: no
> >         to_stderr: no
> >         to_syslog: yes
> >         timestamp: on
> >         logger_subsys {
> >                 subsys: AMF
> >                 debug: off
> >                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >         }
> >     }
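> > 
> > In case it's useful: with rrp_mode set to passive, the state of both
> > rings can be checked from either node with corosync-cfgtool (shipped
> > with the corosync package), e.g.:
> > 
> >     # Print the local node ID and the status of ring 0 and ring 1.
> >     corosync-cfgtool -s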
> 
> -- 
> 
> Remi Broemeling
> Sr System Administrator
> 
> Nexopia.com Inc.
> direct: 780 444 1250 ext 435
> email: remi at nexopia.com
> fax: 780 487 0376 
> 
> www.nexopia.com
> 
> On going to war over religion: "You're basically killing each other to
> see who's got the better imaginary friend."
> Rich Jeni
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker




