[Pacemaker] Seeking advice after cluster freeze

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Feb 18 15:02:35 EST 2010


Hi,

On Thu, Feb 18, 2010 at 05:15:42PM +0100, Patrick Zwahlen wrote:
> Dear list,
> 
> I am looking for some advice regarding a freeze that we experienced. My
> project is a 2-node active-passive NFS cluster on two virtual machines.
> I am using CentOS 5.4 x86_64, drbd, xfs, corosync and pacemaker.
> Following are the RPM versions:
> 
> From clusterlabs:
> cluster-glue.x86_64        1.0.1-1.el5
> cluster-glue-libs.x86_64   1.0.1-1.el5

Please upgrade to 1.0.3. I'm not sure, but the versions you have
may contain a bad bug.
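Assuming the clusterlabs yum repository is still configured, something
along these lines should pull in the newer packages (untested here;
restart the cluster stack on each node afterwards, one node at a time,
so the updated lrmd actually gets used):

    yum upgrade cluster-glue cluster-glue-libs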

> corosync.x86_64            1.2.0-1.el5
> corosynclib.x86_64         1.2.0-1.el5
> heartbeat.x86_64           3.0.1-1.el5
> heartbeat-libs.x86_64      3.0.1-1.el5

You don't need both heartbeat and corosync.
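With Pacemaker running as a corosync plugin, the heartbeat daemon itself
is never started; the only stack-related bit you need is the usual
service stanza in corosync.conf (or a file under
/etc/corosync/service.d/), along the lines of:

    service {
            # load Pacemaker's CRM as a corosync plugin
            name: pacemaker
            ver: 0
    }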

> pacemaker.x86_64           1.0.7-2.el5
> pacemaker-libs.x86_64      1.0.7-2.el5
> resource-agents.x86_64     1.0.1-1.el5
> 
> From CentOS extras:
> drbd83.x86_64              8.3.2-6.el5_3
> kmod-drbd83.x86_64         8.3.2-6.el5_3
> 
> I ran many tests before going into production, and the cluster has been
> running fine for some weeks. We regularly test failover by powering off
> one of the physical nodes running the VMs.
> 
> Our problem appeared after shutting down the host that was running the
> backup node. After powering off the backup node, the primary became
> totally unresponsive and we lost the NFS store. We had to reboot the
> primary node.
> 
> I rebuilt a lab and tried to replicate the problem by powering off the
> backup node. After about 50 tries I was able to reproduce it, and saw that:
> 
> - It was not a kernel panic
> - The VM console was totally unresponsive
> - The VM was using 100% CPU
> - I was still able to ping the VM
> - I was unable to log in on the console or via ssh

Anything in the logs? Or is that the attached log?

> I have attached all my config files, as well as the /var/log/messages

You can use hb_report to collect all relevant info.
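For example (adjust the time window around the incident; the last
argument is just an example destination, hb_report turns it into a
tarball):

    hb_report -f "2010-02-04 17:30" -t "2010-02-04 18:00" /tmp/nfs2a-freeze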

> around the crash (messages from the primary node). We see the secondary
> leaving the cluster, some drbd activity, and then nothing until the reboot.
> Since the crash, I have made a single change to the pacemaker config,
> which was to change my drbd location rule from +INF to 1000, as I
> thought the rule added by drbd fencing (with a -INF weight) could
> conflict with my +INF.
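For comparison, the constraint that drbd's crm-fence-peer.sh handler adds
is typically of this shape (ids and node name depend on your setup; crm
syntax shown just to illustrate where the -INF score comes from):

    location drbd-fence-by-handler-ms_drbd ms_drbd \
            rule $role="Master" -inf: #uname ne nfs2a.test.local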

Feb  4 17:41:54 nfs2a lrmd: [3072]: info: RA output: (res_drbd:1:start:stderr) 0 : Failure: (124) Device is attached to a disk (use detach first) 
Feb  4 17:41:54 nfs2a lrmd: [3072]: info: RA output: (res_drbd:1:start:stderr) Command 'drbdsetup 0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --fencing=resource-only --on-io-error=detach' terminated with exit code 10 
Feb  4 17:41:54 nfs2a drbd[3243]: ERROR: nfs: Called drbdadm -c /etc/drbd.conf --peer nfs2b.test.local up nfs
Feb  4 17:41:54 nfs2a drbd[3243]: ERROR: nfs: Exit code 1

That's what I could find in the logs.
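Not necessarily the cause of the hang, but that "Device is attached to a
disk" failure suggests the drbd device was still (partly) configured when
the RA tried to start it again. Should you run into it again, with the
ms_drbd resource stopped you can inspect and reset the device by hand,
for instance:

    cat /proc/drbd          # current connection and disk state
    drbdadm dstate nfs      # disk state of the "nfs" resource
    drbdadm down nfs        # tear the resource down completely
    drbdadm up nfs          # and bring it back up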

Thanks,

Dejan


> Of course, I have no clue whether this is a
> pacemaker/drbd/corosync/other issue; I am just looking for advice or
> similar experience. Corosync 1.2.0 being quite new, I thought I might
> run another test using the heartbeat stack.
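If you do try the heartbeat stack, a minimal /etc/ha.d/ha.cf along these
lines (plus a matching /etc/ha.d/authkeys) should be enough to run
Pacemaker on top of it; the interface name below is just a placeholder:

    autojoin none
    bcast eth0                  # or ucast/mcast, whatever fits your network
    node nfs2a.test.local
    node nfs2b.test.local
    crm respawn                 # start Pacemaker on top of heartbeat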
> 
> Any hint appreciated. Thx, - Patrick -
> 
> 

> node nfs2a.test.local \
> 	attributes standby="off"
> node nfs2b.test.local \
> 	attributes standby="off"
> primitive res_drbd ocf:linbit:drbd \
> 	params drbd_resource="nfs" \
> 	op monitor interval="9s" role="Master" timeout="20s" \
> 	op monitor interval="10s" role="Slave" timeout="20s"
> primitive res_fs ocf:heartbeat:Filesystem \
> 	params fstype="xfs" directory="/mnt/drbd" device="/dev/drbd0" options="noatime,nodiratime,logbufs=8" \
> 	op monitor interval="10s"
> primitive res_ip ocf:heartbeat:IPaddr2 \
> 	params ip="10.1.111.33" \
> 	op monitor interval="10s"
> primitive res_nfs lsb:nfs \
> 	op monitor interval="10s"
> group grp_nfs res_fs res_nfs res_ip \
> 	meta target-role="Started"
> ms ms_drbd res_drbd \
> 	meta clone-max="2" notify="true"
> location loc_drbd-master ms_drbd \
> 	rule $id="loc_drbd-master-rule" $role="master" 1000: #uname eq nfs2a.test.local
> colocation col_grp_nfs_on_drbd_master inf: grp_nfs ms_drbd:Master
> order ord_drbd_before_grp_nfs inf: ms_drbd:promote grp_nfs:start
> property $id="cib-bootstrap-options" \
> 	dc-version="1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782" \
> 	cluster-infrastructure="openais" \
> 	expected-quorum-votes="2" \
> 	stonith-enabled="false" \
> 	no-quorum-policy="ignore" \
> 	last-lrm-refresh="1263554345"


