Hi Michel,<br>Yes, I have tried a simpler configuration. I followed these steps:<br>1) Master/Slave ocf:linbit:drbd RA + ocf:heartbeat:Filesystem RA -> shutdown -r now -> Ok, no split brain<br>2) ..+ ocf:heartbeat:IPaddr2 RA -> shutdown -r now -> Ok<br>
3) ..+ heartbeat:drbdlinks RA -> shutdown -r now -> Ok<br>4) ..+ ocf:heartbeat:pgsql RA -> shutdown -r now -> Ok<br>5) ..+ ocf:custom:Asterisk RA -> shutdown -r now -> Ok<br>6) ..+ ocf:heartbeat:apache RA -> shutdown -r now -> Ok<br>
7) ..+ lsb:postfix RA -> shutdown -r now -> Ok<br>8) ..+ lsb:dhcp3-server -> shutdown -r now -> Ok<br>9) ..+ lsb:atftpd -> shutdown -r now -> FAIL, split brain<br>At this point I got the first split brain. After a lot of Google searching I finally added a start-delay of ten seconds to the start and promote operations of the drbd RA. After that I rebooted a couple of times and everything worked fine, no more split brain.<br>
10) ..+ ocf:custom:JBoss RA -> shutdown -r now -> FAIL, split brain<br>With this resource enabled I always get a split brain after a "normal" reboot. I tried to increase the start-delay on both the start and promote operations to 40 seconds, which is more than the time required to stop JBoss.<br>
If I remove the start-delay, I can see this in the secondary's logs:<br><br>Dec 21 23:38:02 secondary drbd[17758]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf primary r0<br>Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config<br>
Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) Command 'drbdsetup 0 primary' terminated with exit code 11<br>Dec 21 23:38:03 secondary drbd[17758]: ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0<br>
Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 2009/12/21_23:38:03 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0<br>Dec 21 23:38:03 secondary drbd[17758]: ERROR: r0: Exit code 11<br>
Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 2009/12/21_23:38:03 ERROR: r0: Exit code 11<br>Dec 21 23:38:03 secondary drbd[17758]: ERROR: r0: Command output:<br>Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 2009/12/21_23:38:03 ERROR: r0: Command output:<br>
Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stdout)<br>Dec 21 23:38:03 secondary drbd[17758]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf primary r0<br>Dec 21 23:38:04 secondary kernel: [215495.004740] tg3: eth1: Link is down.<br>
Dec 21 23:38:04 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config<br>Dec 21 23:38:04 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) Command 'drbdsetup 0 primary' terminated with exit code 11<br>
Dec 21 23:38:04 secondary drbd[17758]: ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0<br>Dec 21 23:38:04 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 2009/12/21_23:38:04 ERROR: r0: Called drbdadm -c /etc/drbd.conf primary r0<br>
Dec 21 23:38:04 secondary drbd[17758]: ERROR: r0: Exit code 11<br>Dec 21 23:38:04 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 2009/12/21_23:38:04 ERROR: r0: Exit code 11<br>Dec 21 23:38:04 secondary drbd[17758]: ERROR: r0: Command output:<br>
Dec 21 23:38:04 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 2009/12/21_23:38:04 ERROR: r0: Command output:<br>Dec 21 23:38:04 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stdout)<br>Dec 21 23:38:05 secondary drbd[17758]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf primary r0<br>
Dec 21 23:38:06 secondary kernel: [215496.595684] block drbd0: PingAck did not arrive in time.<br>Dec 21 23:38:06 secondary kernel: [215496.602119] tg3: eth1: Link is up at 100 Mbps, full duplex.<br>Dec 21 23:38:06 secondary kernel: [215496.602122] tg3: eth1: Flow control is off for TX and off for RX.<br>
Dec 21 23:38:06 secondary kernel: [215496.638538] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )<br>Dec 21 23:38:06 secondary kernel: [215496.638547] block drbd0: asender terminated<br>
Dec 21 23:38:06 secondary kernel: [215496.638550] block drbd0: Terminating asender thread<br>Dec 21 23:38:06 secondary kernel: [215496.638589] block drbd0: short read expecting header on sock: r=-512<br>Dec 21 23:38:06 secondary kernel: [215496.697734] block drbd0: Connection closed<br>
Dec 21 23:38:06 secondary kernel: [215496.697734] block drbd0: conn( NetworkFailure -> Unconnected )<br>Dec 21 23:38:06 secondary kernel: [215496.697734] block drbd0: receiver terminated<br>Dec 21 23:38:06 secondary kernel: [215496.697734] block drbd0: Restarting receiver thread<br>
Dec 21 23:38:06 secondary kernel: [215496.697734] block drbd0: receiver (re)started<br>Dec 21 23:38:06 secondary kernel: [215496.697734] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0<br>Dec 21 23:38:07 secondary crm-fence-peer.sh[17839]: invoked for r0<br>
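<br>One way to check the order of events in the excerpt above is a small script. This is only a sketch, assuming syslog lines in the format shown. The event labels and the function names `parse_events` and `first_seen` are made up for illustration and are not part of DRBD or Pacemaker, and the year is an assumption because syslog does not record it.<br>

```python
import re
from datetime import datetime

# Hypothetical event labels; the substrings are copied from the log excerpt.
PATTERNS = {
    "promote_attempt": "Calling drbdadm -c /etc/drbd.conf primary",
    "promote_refused": "Multiple primaries not allowed by config",
    "link_down": "Link is down",
    "link_up": "Link is up",
    "peer_lost": "conn( Connected -> NetworkFailure )",
}

# "Dec 21 23:38:02 secondary drbd[17758]: ..." -> timestamp, host, message
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def parse_events(lines, year=2009):
    """Return (timestamp, label) pairs for lines matching a known pattern."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, _host, message = m.groups()
        # Syslog omits the year, so it is supplied by the caller (assumed).
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for label, needle in PATTERNS.items():
            if needle in message:
                events.append((when, label))
                break
    return events


def first_seen(events):
    """Map each label to the timestamp of its first occurrence."""
    first = {}
    for when, label in events:
        first.setdefault(label, when)
    return first


# A few lines taken from the excerpt, one per event type.
SAMPLE = [
    "Dec 21 23:38:02 secondary drbd[17758]: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf primary r0",
    "Dec 21 23:38:03 secondary lrmd: [19818]: info: RA output: (drbd:0:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config",
    "Dec 21 23:38:04 secondary kernel: [215495.004740] tg3: eth1: Link is down.",
    "Dec 21 23:38:06 secondary kernel: [215496.602119] tg3: eth1: Link is up at 100 Mbps, full duplex.",
    "Dec 21 23:38:06 secondary kernel: [215496.638538] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
]

if __name__ == "__main__":
    first = first_seen(parse_events(SAMPLE))
    # Print the first occurrence of each event in chronological order.
    for label, when in sorted(first.items(), key=lambda kv: kv[1]):
        print(when.time(), label)
```

On these sample lines, the refused promote (23:38:03) is listed before eth1 goes down (23:38:04) and before the peer is lost (23:38:06).<br>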
<br>Does this mean that I have a communication failure because the network goes down too fast, or that the secondary wants to become Master before the primary can become Slave? Or both? :-)<br><br>Thanks for the help!<br>Andres<br><br>
<br><br><div class="gmail_quote">2009/12/21 <a href="mailto:andschais@gmail.com" target="_blank">andschais@gmail.com</a> <span dir="ltr"><<a href="mailto:andschais@gmail.com" target="_blank">andschais@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi all, <br><br>I'm having trouble with a two-node Pacemaker+DRBD cluster. I have been trying to solve it for about a week and I really need help! <br>If I disconnect the power cord, failover works great: resources migrate to the secondary node and back to the primary when I turn it on again.<br>
But when I turn off the primary node with a "shutdown -r now" command, I always end up with a split brain. That's not all: if I configure just a few resources (for example: virtual IP, DRBD, Apache and PostgreSQL) split brain does not occur, but as soon as I configure 8 or 9 resources (especially when one of them is JBoss AS) I always get a split brain...<br>
Can someone give me some hints?<br><br>My systems are:<br><br>OS: Debian Lenny 2.6.26-2-686<br>
Corosync 1.1.2<br>DRBD 8.3.6<br><br>And my configuration files are:<br><br>/etc/corosync/corosync.conf<br><br># Please read the openais.conf.5 manual page<br>totem {<br> version: 2<br> # How long before declaring a token lost (ms)<br>
token: 3000<br> # How many token retransmits before forming a new configuration<br> token_retransmits_before_loss_const: 10<br> # How long to wait for join messages in the membership protocol (ms)<br>
join: 60<br> # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)<br> consensus: 1500<br> # Turn off the virtual synchrony filter<br> vsftype: none<br>
# Number of messages that may be sent by one processor on receipt of the token<br> max_messages: 20<br> # Limit generated nodeids to 31-bits (positive signed integers)<br> clear_node_high_bit: yes<br>
# Disable encryption<br> secauth: on<br> # How many threads to use for encryption/decryption<br> threads: 0<br> # Optionally assign a fixed node id (integer)<br> # nodeid: 1234<br>
# This specifies the mode of redundant ring, which may be none, active, or passive.<br> rrp_mode: passive<br> interface {<br> # The following values need to be set based on your environment<br>
ringnumber: 0<br> bindnetaddr: 172.16.1.0<br> mcastaddr: 226.94.1.1<br> mcastport: 5405<br> }<br> interface {<br> # The following values need to be set based on your environment<br>
ringnumber: 1<br> bindnetaddr: 10.186.68.0<br> mcastaddr: 226.94.2.1<br> mcastport: 5405<br> }<br>}<br>amf {<br> mode: disabled<br>}<br>service {<br>
# Load the Pacemaker Cluster Resource Manager<br> ver: 0<br> name: pacemaker<br>}<br>aisexec {<br> user: root<br> group: root<br>}<br>logging {<br> to_stderr: yes<br> debug: on<br>
timestamp: on<br> to_file: yes<br> logfile: /var/log/corosync.log<br> to_syslog: no<br> syslog_facility: daemon<br>}<br>}<br><br><br>/etc/drbd.conf<br><br>global {<br> usage-count yes;<br>}<br>common {<br>
syncer { rate 33M; }<br>}<br>resource r0 {<br> protocol C;<br> handlers {<br> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br>
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";<br> local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";<br>
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";<br> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";<br> outdate-peer "/usr/lib/drbd/outdate-peer.sh";<br> split-brain "/usr/lib/drbd/notify-split-brain.sh root@localhost";<br>
}<br> startup {<br> degr-wfc-timeout 30;<br> wfc-timeout 30;<br> }<br> disk {<br> fencing resource-only;<br> on-io-error detach;<br> }<br> net {<br> after-sb-0pri disconnect;<br>
after-sb-1pri disconnect;<br> after-sb-2pri disconnect;<br> rr-conflict disconnect;<br> }<br><br> on primary {<br> device /dev/drbd0;<br> disk /dev/vg00/drbd;<br> address 172.16.1.1:7788;<br>
meta-disk internal;<br> }<br> on secondary {<br> device /dev/drbd0;<br> disk /dev/vg00/drbd;<br> address 172.16.1.2:7788;<br>
meta-disk internal;<br>
}<br>}<br><br><br>and my crm config<br><br><configuration><br> <crm_config><br> <cluster_property_set id="cib-bootstrap-options"><br> <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/><br>
<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/><br> <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/><br>
<nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1261424411"/><br> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe"/><br>
<nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/><br> </cluster_property_set><br> </crm_config><br> <nodes><br>
<node uname="primary" type="normal" id="primary"><br> <instance_attributes id="nodes-primary"><br> <nvpair name="standby" id="nodes-primary-standby" value="off"/><br>
</instance_attributes><br> </node><br> <node uname="secondary" type="normal" id="secondary"><br> <instance_attributes id="nodes-secondary"><br>
<nvpair name="standby" id="nodes-secondary-standby" value="off"/><br> </instance_attributes><br> </node><br> </nodes><br> <resources><br>
<master id="ms-drbd"><br> <meta_attributes id="ms-drbd-meta_attributes"><br> <nvpair id="ms-drbd-meta_attributes-master-max" name="master-max" value="1"/><br>
<nvpair id="ms-drbd-meta_attributes-master-node-max" name="master-node-max" value="1"/><br> <nvpair id="ms-drbd-meta_attributes-clone-max" name="clone-max" value="2"/><br>
<nvpair id="ms-drbd-meta_attributes-clone-node-max" name="clone-node-max" value="1"/><br> <nvpair id="ms-drbd-meta_attributes-notify" name="notify" value="true"/><br>
<nvpair id="ms-drbd-meta_attributes-globally-unique" name="globally-unique" value="false"/><br> <nvpair name="target-role" id="ms-drbd-meta_attributes-target-role" value="Started"/><br>
</meta_attributes><br> <primitive class="ocf" id="drbd" provider="linbit" type="drbd"><br> <instance_attributes id="drbd-instance_attributes"><br>
<nvpair id="drbd-instance_attributes-drbd_resource" name="drbd_resource" value="r0"/><br> </instance_attributes><br> <operations><br> <op id="drbd-monitor-59s" interval="59s" name="monitor" role="Master" timeout="30s"/><br>
<op id="drbd-monitor-60s" interval="60s" name="monitor" role="Slave" timeout="30s"/><br> <op id="drbd-start-0s" interval="0s" name="start" start-delay="10s"/><br>
<op id="drbd-promote-0s" interval="0s" name="promote" start-delay="10s"/><br> </operations><br> </primitive><br> </master><br>
<group id="p-group"><br> <primitive class="ocf" id="fs" provider="heartbeat" type="Filesystem"><br> <instance_attributes id="fs-instance_attributes"><br>
<nvpair id="fs-instance_attributes-fstype" name="fstype" value="ext3"/><br> <nvpair id="fs-instance_attributes-directory" name="directory" value="/drbd"/><br>
<nvpair id="fs-instance_attributes-device" name="device" value="/dev/drbd0"/><br> </instance_attributes><br> <meta_attributes id="fs-meta_attributes"><br>
<nvpair id="fs-meta_attributes-is-managed" name="is-managed" value="true"/><br> </meta_attributes><br> </primitive><br> <primitive class="ocf" id="ip" provider="heartbeat" type="IPaddr2"><br>
<instance_attributes id="ip-instance_attributes"><br> <nvpair id="ip-instance_attributes-ip" name="ip" value="10.186.68.1"/><br> <nvpair id="ip-instance_attributes-broadcast" name="broadcast" value="10.186.68.127"/><br>
<nvpair id="ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="25"/><br> </instance_attributes><br> <operations><br> <op id="ip-monitor-10s" interval="10s" name="monitor"/><br>
</operations><br> </primitive><br> <primitive class="heartbeat" id="drbdlinks" type="drbdlinks"><br> <operations><br> <op id="drbdlinks-monitor-60s" interval="60s" name="monitor"/><br>
</operations><br> </primitive><br> <primitive class="ocf" id="postgresql" provider="heartbeat" type="pgsql"><br> <instance_attributes id="postgresql-instance_attributes"><br>
<nvpair id="postgresql-instance_attributes-pgctl" name="pgctl" value="/usr/lib/postgresql/8.3/bin/pg_ctl"/><br> <nvpair id="postgresql-instance_attributes-psql" name="psql" value="/usr/bin/psql"/><br>
<nvpair id="postgresql-instance_attributes-pgdata" name="pgdata" value="/var/lib/postgresql/8.3/main"/><br> <nvpair id="postgresql-instance_attributes-pgdba" name="pgdba" value="postgres"/><br>
<nvpair id="postgresql-instance_attributes-pgdb" name="pgdb" value="postgres"/><br> <nvpair id="postgresql-instance_attributes-logfile" name="logfile" value="/var/log/postgresql/postgresql-8.3-main.log"/><br>
</instance_attributes><br> <operations><br> <op id="postgresql-monitor-60s" interval="60s" name="monitor" timeout="30s"/><br> </operations><br>
</primitive><br> <primitive class="ocf" id="asterisk" provider="custom" type="Asterisk"><br> <operations><br> <op id="asterisk-monitor-60s" interval="60s" name="monitor" start-delay="30s" timeout="30s"/><br>
</operations><br> </primitive><br> <primitive class="lsb" id="postfix" type="postfix"/><br> <primitive class="ocf" id="apache2" provider="heartbeat" type="apache"><br>
<instance_attributes id="apache2-instance_attributes"><br> <nvpair id="apache2-instance_attributes-configfile" name="configfile" value="/etc/apache2/apache2.conf"/><br>
</instance_attributes><br> <operations><br> <op id="apache2-monitor-60s" interval="60s" name="monitor"/><br> </operations><br> </primitive><br>
<primitive class="lsb" id="dhcp" type="dhcp3-server"/><br> <primitive class="ocf" id="jboss" provider="custom" type="JBoss"><br>
<instance_attributes id="jboss-instance_attributes"><br> <nvpair id="jboss-instance_attributes-java_home" name="java_home" value="/opt/java/"/><br>
<nvpair id="jboss-instance_attributes-jboss_home" name="jboss_home" value="/opt/jboss"/><br>
</instance_attributes><br> <operations><br> <op id="jboss-monitor-60s" interval="60s" name="monitor" start-delay="100s" timeout="30s"/><br>
<op id="jboss-start-0s" interval="0s" name="start" timeout="99s"/><br> </operations><br> </primitive><br> </group><br> </resources><br>
<constraints><br> <rsc_colocation id="p-group-on-ms-drbd" rsc="p-group" score="INFINITY" with-rsc="ms-drbd" with-rsc-role="Master"/><br> <rsc_location id="ms-drbd-master-on-primary" rsc="ms-drbd"><br>
<rule id="ms-drbd-master-on-primary-rule" role="Master" score="100"><br> <expression attribute="#uname" id="ms-drbd-master-on-primary-expression" operation="eq" value="primary"/><br>
</rule><br> </rsc_location><br> <rsc_order first="ms-drbd" first-action="promote" id="ms-drbd-before-group" score="INFINITY" then="p-group" then-action="start"/><br>
</constraints><br> <rsc_defaults/><br> <op_defaults/><br> </configuration><br><br>Thanks in advance.<br>Andres.<br><br>
</blockquote></div><br>