[Pacemaker] A/P Corosync, PGSQL and Split Brains questions

Lars Ellenberg lars.ellenberg at linbit.com
Thu Feb 10 19:54:04 EST 2011


On Wed, Feb 09, 2011 at 02:48:52PM +0100, Stephan-Frank Henry wrote:
> Hello again,
> 
> after fixing up my VirtualIP problem, I have been doing some Split
> Brain tests and while everything 'returns to normal', it is not quite
> what I had desired.
> 
> My scenario:
> Active/Passive 2-node cluster (serverA & serverB) with Corosync, DRBD & PGSQL.
> The resources are configured as Master/Slave and so far it is fine.
> 
> Since bullet points speak more than words: ;)
> Test:
>  1) Pull the plug on the master (serverA)
>  2) Then Reattach
> Expected results:
>  1) serverB becomes Master
>  2) serverB remains Master, serverA syncs with serverB
> Actual results:
>  1) serverB becomes Master
>  2) serverA becomes Master, data written on serverB is lost.
> 
> In all honesty, I am not an expert in HA, DRBD and Corosync. I know
> the basics but it is not my domain of excellence.  Most of my configs
> have been influenced... ok, blatantly copied from the net and tweaked
> until they worked.  Yet now I am at a loss.

Without logs, it does not make much sense to guess what may have
happened, and why.

> Am I presuming something that is not possible with Corosync (which I doubt) or is my config wrong (probably)?
> Yet I am unable to find any smoking gun.
> 
> I have visited all the sites that might hold the information, but none really point anything out.
> The only difference I could tell was that some examples did not have the split brain handling in the drbd.conf.
> 
> Can someone possibly point me into the correct direction?
> 
> Thanks!
> 
> Frank
> 
> Here are the obligatory config file contents:
> 
> ############### /etc/drbd.conf 
> 
> global {
>   usage-count no;
> }
> common {
>   syncer {
>     rate 100M;
>   }
>   protocol C;
> }
> resource drbd0 {
> 
>   startup {
>     wfc-timeout 20;
>     degr-wfc-timeout 10;
>   }
>   disk {
>     on-io-error detach;
>   }
>   net {
>     cram-hmac-alg sha1;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;


This is configuring data loss.
Just because one node happens to be Secondary during the connection
handshake after a split brain does not necessarily mean its data set
is the one you want to throw away.
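For illustration only (not a drop-in recommendation), a more
conservative net section, in DRBD 8.3-style syntax, could look like
this; it auto-resolves only the cases where the victim is unambiguous
and otherwise disconnects, so an administrator decides which data set
survives:

    net {
      cram-hmac-alg sha1;
      # only discard a data set with provably no changes (0pri),
      # or the Secondary's data when the 0pri policy would pick the
      # same victim anyway (1pri consensus); in every other case
      # disconnect and let an admin resolve the split brain by hand
      after-sb-0pri discard-zero-changes;
      after-sb-1pri consensus;
      after-sb-2pri disconnect;
    }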

>     after-sb-2pri disconnect; 
>   }
>   on serverA {
>     device /dev/drbd0;
>     disk /dev/sda5;
>     meta-disk internal;
>     address 150.158.183.22:7788;
>   }
>   on serverB {
>     device /dev/drbd0;
>     disk /dev/sda5;
>     meta-disk internal;
>     address 150.158.183.23:7788;
>   }
> }
> 
> ############### /etc/ha.d/ha.cf 
> 
> udpport 694
> ucast eth0 150.158.183.23

You absolutely want redundant communication links.

> 
> autojoin none
> debug 1
> logfile /var/log/ha-log
> use_logd false
> logfacility daemon
> keepalive 2 # 2 second(s)
> deadtime 10
> # warntime 10
> initdead 80
> 
> # list all shared ip addresses we want to ping
> ping 150.158.183.30

The ping directive in ha.cf is heartbeat haresources-mode stuff.
Discard it.

> # list all node names
> node serverB serverA
> crm yes
> respawn root /usr/lib/heartbeat/pingd -m 100 -d 5s

I don't think that makes much sense nowadays.
Use the pacemaker/ping resource agent instead.
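Roughly, with the crm shell, that could look like the following (the
multiplier and dampen values mirror your pingd respawn line; the
monitor interval and timeout are just placeholders):

    crm configure primitive p_ping ocf:pacemaker:ping \
        params host_list="150.158.183.30" multiplier="100" dampen="5s" \
        op monitor interval="15s" timeout="60s"
    crm configure clone cl_ping p_ping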

Besides, you are using corosync.
So why include the _heartbeat_ configuration here?
It is completely irrelevant.


> ############### /etc/corosync/corosync.conf
> 
> totem {
> 	version: 2
> 	token: 1000
> 	hold: 180
> 	token_retransmits_before_loss_const: 20
> 	join: 60
> 	# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
> 	consensus: 4800
> 	vsftype: none
> 	max_messages: 20
> 	clear_node_high_bit: yes
> 	secauth: off
> 	threads: 0
> 	rrp_mode: none
> 	interface {
> 		ringnumber: 0
> 		bindnetaddr: 150.158.183.0
> 		mcastaddr: 226.94.1.22
> 		mcastport: 5427
> 	}

As I said earlier, you want redundant communication channels.
I'm not up to date on how corosync's redundant ring mode behaves these
days, but it has had some quirks in the past.
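If you do add a second ring, a rough sketch would be rrp_mode plus a
second interface block; the 10.0.0.0 network below is made up, use
whatever second link you actually have:

    totem {
            rrp_mode: passive
            interface {
                    ringnumber: 0
                    bindnetaddr: 150.158.183.0
                    mcastaddr: 226.94.1.22
                    mcastport: 5427
            }
            interface {
                    ringnumber: 1
                    # second, dedicated network; addresses are made up
                    bindnetaddr: 10.0.0.0
                    mcastaddr: 226.94.1.23
                    mcastport: 5429
            }
    }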

> <cib have_quorum="true" generated="true" ignore_dtd="false" epoch="14" num_updates="0" admin_epoch="0" validate-with="transitional-0.6" cib-last-written="Wed Feb  9 14:03:30 2011" crm_feature_set="3.0.1" have-quorum="0" dc-uuid="serverA">

validate-with transitional 0.6? Really?
Is this an upgrade from something?
Or a copy'n'paste?

Where does this cib come from?

>         <primitive class="ocf" type="drbd" provider="heartbeat" id="drbddisk_rep">

Please use the linbit drbd resource agent (ocf:linbit:drbd) instead.
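Something along these lines (crm shell syntax; the monitor intervals
are only an example):

    crm configure primitive p_drbd0 ocf:linbit:drbd \
        params drbd_resource="drbd0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
    crm configure ms ms_drbd0 p_drbd0 \
        meta master-max="1" master-node-max="1" \
             clone-max="2" clone-node-max="1" notify="true"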

>       <group id="rg_drbd" ordered="true">

>         <primitive id="ip_resource" class="ocf" type="IPaddr2" provider="heartbeat">

>         <primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">

>         <primitive id="pgsql" class="ocf" type="pgsql" provider="heartbeat">


>         <rule id="drbd0-master-on-1" role="master" score="100">
>           <expression id="exp-1" attribute="#uname" operation="eq" value="serverA"/>

Get rid of that rule.
Seriously.

Combined with your use of the heartbeat/drbd (instead of the
linbit/drbd) agent, and the "after-sb-1pri discard-secondary;" in your
drbd.conf, it is most likely the root cause of your trouble.
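If what you actually want is "whoever is currently Master stays
Master until it fails", that is usually expressed with resource
stickiness rather than a node preference, e.g. (the value is
arbitrary):

    crm configure rsc_defaults resource-stickiness="100"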

>         </rule>


>       </rsc_location>
>       <rsc_order id="mount_after_drbd" from="rg_drbd" action="start" to="ms_drbd0" to_action="promote"/>
>       <rsc_colocation id="mount_on_drbd" to="ms_drbd0" to_role="master" from="rg_drbd" score="INFINITY"/>

And please start using the crm shell.
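Your two constraints above, for example, would read roughly like this
in crm shell syntax:

    crm configure order mount_after_drbd inf: ms_drbd0:promote rg_drbd:start
    crm configure colocation mount_on_drbd inf: rg_drbd ms_drbd0:Master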

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



