[Pacemaker] Cluster Refuses to Stop/Shutdown

Thu Sep 24 22:47:42 UTC 2009

I've spent all day working on this, even going so far as to completely 
build my own set of packages from the Debian-available ones (which 
appear to differ from the Ubuntu-available ones).  It had no effect on 
the issue at all: the cluster still freaks out and becomes a 
split-brain after a single SIGQUIT.

The Debian packages that also demonstrate this behavior are the 
following versions:
    cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
    corosync_1.0.0-5~bpo50+1_i386.deb
    libcorosync4_1.0.0-5~bpo50+1_i386.deb
    libopenais3_1.0.0-4~bpo50+1_i386.deb
    openais_1.0.0-4~bpo50+1_i386.deb
    pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb

These packages were rebuilt (under Ubuntu Hardy Heron LTS) from the 
*.diff.gz, *.dsc, and *.orig.tar.gz files available at 
http://people.debian.org/~madkiss/ha-corosync and, as I said, the 
symptoms remain exactly the same, both under the configuration that I 
list below and under the sample configuration that came with these 
packages.  I also attempted the same with a single IP address resource 
associated with the cluster, just to be sure it wasn't an edge case for 
a cluster with no resources, but again that had no effect.

Basically I'm still exactly at the point that I was at yesterday morning 
at about 0900.

Remi Broemeling wrote:
> I posted this to the OpenAIS Mailing List 
> (openais at lists.linux-foundation.org) yesterday, but haven't received a 
> response and upon further reflection I think that maybe I chose the 
> wrong list to post it to.  That list seems to be far less about user 
> support and far more about developer communication.  Therefore 
> re-trying here, as the archives show it to be somewhat more user-focused.
>
> The problem is that I'm having an issue with corosync refusing to 
> shut down in response to a QUIT signal.  Given the below cluster 
> (output of crm_mon):
>
> ============
> Last updated: Wed Sep 23 15:56:24 2009
> Stack: openais
> Current DC: boot1 - partition with quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ boot1 boot2 ]
>
> If I go onto the host 'boot2' and issue the command "killall -QUIT 
> corosync", the anticipated result is that boot2 goes offline 
> (out of the cluster) and all of the cluster processes 
> (corosync/stonithd/cib/lrmd/attrd/pengine/crmd) shut down.  
> However, this is not occurring, and I don't really have any idea why.  
> After logging into boot2 and issuing the command "killall -QUIT 
> corosync", the result is a split-brain:
>
> From boot1's viewpoint:
> ============
> Last updated: Wed Sep 23 15:58:27 2009
> Stack: openais
> Current DC: boot1 - partition WITHOUT quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ boot1 ]
> OFFLINE: [ boot2 ]
>
> From boot2's viewpoint:
> ============
> Last updated: Wed Sep 23 15:58:35 2009
> Stack: openais
> Current DC: boot1 - partition with quorum
> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ boot1 boot2 ]
>
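
The disagreement between the two views above can be checked 
mechanically.  Below is a minimal sketch, assuming crm_mon output in the 
format shown; the names parse_crm_mon and views_disagree are made up for 
illustration and are not part of Pacemaker.  It pulls the DC, the quorum 
flag, and the node lists out of each view and reports whether two nodes 
disagree.

```python
import re


def parse_crm_mon(text):
    """Extract the DC, quorum state, and node lists from crm_mon text output."""
    view = {"dc": None, "quorum": None, "online": set(), "offline": set()}
    for raw in text.splitlines():
        line = raw.strip()
        # e.g. "Current DC: boot1 - partition WITHOUT quorum"
        m = re.match(r"Current DC: (\S+) - partition (with|WITHOUT) quorum", line)
        if m:
            view["dc"] = m.group(1)
            view["quorum"] = m.group(2) == "with"
            continue
        # e.g. "Online: [ boot1 boot2 ]" or "OFFLINE: [ boot2 ]"
        m = re.match(r"(Online|OFFLINE): \[ (.*?) \]", line)
        if m:
            view[m.group(1).lower()].update(m.group(2).split())
    return view


def views_disagree(view_a, view_b):
    """True when two nodes report different membership or quorum state."""
    return (view_a["online"] != view_b["online"]
            or view_a["quorum"] != view_b["quorum"])


# The relevant lines from the two crm_mon outputs quoted above.
BOOT1_VIEW = """\
Current DC: boot1 - partition WITHOUT quorum
Online: [ boot1 ]
OFFLINE: [ boot2 ]
"""

BOOT2_VIEW = """\
Current DC: boot1 - partition with quorum
Online: [ boot1 boot2 ]
"""
```

Fed the two views above, views_disagree returns True: boot1 has dropped 
boot2 and lost quorum, while boot2 still believes both nodes are online.
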
> At this point the status quo holds until ANOTHER QUIT signal is sent 
> to corosync (i.e. the command "killall -QUIT corosync" is executed on 
> boot2 again).  Then boot2 shuts down properly and everything appears 
> to be kosher.  Basically, what I expect to happen after a single QUIT 
> signal is instead taking two QUIT signals to occur, and that 
> summarizes my question: why does it take two QUIT signals to force 
> corosync to actually shut down?  Is that desired behavior?  From 
> everything that I have read online it seems very strange, and it makes 
> me think that I have a problem in my configuration(s), but I've no 
> idea what that would be, even after a day of playing with things and 
> investigating.
>
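
To see which daemons actually survive the first QUIT, something like the 
following could be run on boot2 a few seconds after sending the signal.  
It is only a sketch: the daemon names come from the list earlier in this 
message, survivors and running_process_names are hypothetical helper 
names, and it assumes `ps -e -o comm=` is available to list command 
names.

```python
import subprocess

# Daemon names taken from the process list given earlier in this message.
CLUSTER_DAEMONS = ["corosync", "stonithd", "cib", "lrmd", "attrd", "pengine", "crmd"]


def survivors(running_names, expected=CLUSTER_DAEMONS):
    """Return the expected daemons that still appear among the running names."""
    running = set(running_names)
    return [name for name in expected if name in running]


def running_process_names():
    """List the command names of all running processes via `ps -e -o comm=`."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


# Intended use on the node, after the first "killall -QUIT corosync":
#     print(survivors(running_process_names()))
```

An empty list would mean everything shut down as expected; anything 
else names the processes still holding on after one signal.
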
> I would be very grateful for any guidance that could be provided, as 
> at the moment I seem to be at an impasse.
>
> Log files, with debugging set to 'on', can be found at the following 
> pastebin locations:
>     After first QUIT signal issued on boot2:
>         boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
>         boot2:/var/log/syslog: http://pastebin.com/d26fdfee
>     After second QUIT signal issued on boot2:
>         boot1:/var/log/syslog: http://pastebin.com/m755fb989
>         boot2:/var/log/syslog: http://pastebin.com/m22dcef45
>
> OS, Software Packages, and Versions:
>     * two nodes, each running Ubuntu Hardy Heron LTS
>     * ubuntu-ha packages, as downloaded from 
> http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
>         * pacemaker-openais package version 
> 1.0.5+hg20090813-0ubuntu2~hardy1
>         * openais package version 1.0.0-3ubuntu1~hardy1
>         * corosync package version 1.0.0-4ubuntu1~hardy2
>         * heartbeat-common package version 
> heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1
>
> Network Setup:
>     * boot1
>         * eth0 is 192.168.10.192
>         * eth1 is 172.16.1.1
>     * boot2
>         * eth0 is 192.168.10.193
>         * eth1 is 172.16.1.2
>     * boot1:eth0 and boot2:eth0 both connect to the same switch.
>     * boot1:eth1 and boot2:eth1 are connected directly to each other 
> via a cross-over cable.
>     * no firewalls are involved, and tcpdump shows the multicast and 
> UDP traffic flowing correctly over these links.
>     * I attempted a broadcast (rather than multicast) configuration, 
> to see if that would fix the problem.  It did not.
>
> `crm configure show` output:
>     node boot1
>     node boot2
>     property $id="cib-bootstrap-options" \
>             dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
>             cluster-infrastructure="openais" \
>             expected-quorum-votes="2" \
>             stonith-enabled="false" \
>             no-quorum-policy="ignore"
>
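
For what it's worth, the "partition WITHOUT quorum" shown on boot1 is 
consistent with expected-quorum-votes="2" if quorum means a strict 
majority of the expected votes.  That is my understanding of how it is 
counted here, but treat it as an assumption rather than something 
verified against the source.  A tiny sketch of the arithmetic, with 
has_quorum being an illustrative name only:

```python
def has_quorum(votes_seen, expected_votes):
    """Strict majority: more than half of the expected votes must be present."""
    return votes_seen > expected_votes // 2


# Two-node cluster from this message: expected-quorum-votes="2".
both_nodes_visible = has_quorum(2, 2)   # boot2's (stale) view
one_node_visible = has_quorum(1, 2)     # boot1's view after the first QUIT
```

With no-quorum-policy="ignore" the loss of quorum should not by itself 
stop anything; the sketch only accounts for the wording in crm_mon.
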
> Contents of /etc/corosync/corosync.conf:
>     # Please read the corosync.conf.5 manual page
>     compatibility: whitetank
>
>     totem {
>         clear_node_high_bit: yes
>         version: 2
>         secauth: on
>         threads: 1
>         heartbeat_failures_allowed: 3
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 172.16.1.0
>                 mcastaddr: 239.42.0.1
>                 mcastport: 5505
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.10.0
>                 mcastaddr: 239.42.0.2
>                 mcastport: 6606
>         }
>         rrp_mode: passive
>     }
>
>     amf {
>         mode: disabled
>     }
>
>     service {
>         name: pacemaker
>         ver: 0
>     }
>
>     aisexec {
>         user: root
>         group: root
>     }
>
>     logging {
>         debug: on
>         fileline: off
>         function_name: off
>         to_logfile: no
>         to_stderr: no
>         to_syslog: yes
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
>     }
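
As a sanity check on the two bindnetaddr values, they should be the 
network addresses of eth1 and eth0 respectively.  A minimal sketch, 
assuming both interfaces use a /24 netmask (the netmask is not stated 
above, so that part is a guess) and using a made-up helper name, 
bindnetaddr_for:

```python
import ipaddress


def bindnetaddr_for(address, prefix_len):
    """Return the network address that an interface address belongs to."""
    interface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    return str(interface.network.network_address)


# Interface addresses from the network setup above; /24 is an assumption.
ring0 = bindnetaddr_for("172.16.1.1", 24)       # eth1 on boot1
ring1 = bindnetaddr_for("192.168.10.192", 24)   # eth0 on boot1
```

Under that assumption both results match the bindnetaddr lines in the 
totem section, so the interface definitions look sane.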

-- 

Remi Broemeling
Sr System Administrator

Nexopia.com Inc.
direct: 780 444 1250 ext 435
email: remi at nexopia.com
fax: 780 487 0376

www.nexopia.com

On going to war over religion: "You're basically killing each other to 
see who's got the better imaginary friend."
Rich Jeni