[Pacemaker] DRBD monitor time out in high I/O situations

Tue Jul 12 08:37:47 UTC 2011

 Hi!

 We have set up a 2-node Pacemaker cluster using SLES 11 SP1 +
 HA-Extension.
 Each machine has two DRBD resources; one is called 'mysql' and the other
 'wwwdata'.
 The mysql resource has an XFS filesystem; wwwdata uses an OCFS2 1.4
 filesystem.
 Our goal is to create an Active/Standby MySQL cluster with the databases
 on the XFS filesystem. The OCFS2 filesystem is supposed to store data
 created by scripts that access the MySQL server database.

 The primitive resources are set up as follows:
 ----- snip -----
 primitive p_controld ocf:pacemaker:controld \
         op start interval="0" timeout="90s" \
         op stop interval="0" timeout="100s"
 primitive p_drbd_mysql ocf:linbit:drbd \
         params drbd_resource="mysql" \
         op monitor interval="20" role="Master" timeout="20" \
         op monitor interval="30" role="Slave" timeout="20" \
         op notify interval="0" timeout="90" \
         op start interval="0" timeout="240s" \
         op stop interval="0" timeout="100s"
 primitive p_drbd_wwwdata ocf:linbit:drbd \
         params drbd_resource="wwwdata" \
         op monitor interval="20" role="Master" timeout="20" \
         op monitor interval="30" role="Slave" timeout="20" \
         op notify interval="0" timeout="90" \
         op start interval="0" timeout="240s" \
         op stop interval="0" timeout="360s"
 primitive p_fs_mysql ocf:heartbeat:Filesystem \
         params device="/dev/drbd/by-res/mysql" directory="/data/mysql" \
                 fstype="xfs" options="rw,noatime" \
         op start interval="0" timeout="90s" \
         op stop interval="0" timeout="100s" \
         meta is-managed="true"
 primitive p_fs_wwwdata ocf:heartbeat:Filesystem \
         params device="/dev/drbd/by-res/wwwdata" directory="/data/www" \
                 fstype="ocfs2" \
                 options="rw,noatime,noacl,nouser_xattr,commit=30,data=writeback" \
         op start interval="0" timeout="90s" \
         op stop interval="0" timeout="300s"
 primitive p_ip_float_cluster ocf:heartbeat:IPaddr2 \
         params ip="1.2.3.4" nic="bond0" cidr_netmask="24" \
                 flush_routes="true" \
         meta target-role="Started"
 primitive p_o2cb ocf:ocfs2:o2cb \
         op monitor interval="120s" \
         op start interval="0" timeout="90s" \
         op stop interval="0" timeout="100s" \
         meta target-role="Started"
 ----- snip -----

 The problem with this setup is that the DRBD monitor operations seem to
 time out under high I/O load, triggering a failover attempt. One node
 then gets STONITH'd because the file system is still busy with the
 operation that caused the load in the first place. For example, this is
 what happened yesterday when I ran a "chmod -R" on a directory tree
 containing about 4.5 million rather small files on the OCFS2 filesystem:

 ----- snip -----
 Jul 11 11:06:14 node01 lrmd: [25011]: info: rsc:p_drbd_mysql:0:39: 
 monitor
 Jul 11 11:06:14 node01 lrmd: [25011]: info: rsc:p_drbd_wwwdata:0:38: 
 monitor
 Jul 11 11:06:29 node01 mysql[6665]: INFO: MySQL monitor succeeded
 Jul 11 11:07:37 node01 lrmd: [25011]: WARN: p_drbd_wwwdata:0:monitor 
 process (PID 6776) timed out (try 1).  Killing with signal SIGTERM (15).
 Jul 11 11:07:37 node01 lrmd: [25011]: WARN: operation monitor[38] on 
 ocf::drbd::p_drbd_wwwdata:0 for client 25014, its parameters: 
 CRM_meta_clone=[0] CRM_meta_role=[Master] 
 CRM_meta_notify_slave_resource=[ ] CRM_meta_notify_active_resource=[ ] 
 CRM_meta_notify_demote_uname=[ ] drbd_resource=[wwwdata] 
 CRM_meta_notify_inactive_resource=[p_drbd_wwwdata:0 p_drbd_wwwdata:1 ] 
 CRM_meta_master_node_max=[1] CRM_meta_notify_stop_resource=[ ] 
 CRM_meta_notify_master_resource=[ ] CRM_meta_clone_node_max=[1] 
 CRM_meta_notify=[true] CRM_meta_notify_demote_resource=[: pid [6776] 
 timed out
 Jul 11 11:07:37 node01 crmd: [25014]: ERROR: process_lrm_event: LRM 
 operation p_drbd_wwwdata:0_monitor_20000 (38) Timed Out 
 (timeout=20000ms)
 Jul 11 11:07:37 node01 crmd: [25014]: info: process_graph_event: 
 Detected action p_drbd_wwwdata:0_monitor_20000 from a different 
 transition: 11 vs. 135
 Jul 11 11:07:37 node01 crmd: [25014]: info: abort_transition_graph: 
 process_graph_event:477 - Triggered transition abort (complete=1, 
 tag=lrm_rsc_op, id=p_drbd_wwwdata:0_monitor_20000, 
 magic=2:-2;15:11:8:6f0304c9-522b-4582-a26b-cffe24afe9e2, cib=0.349.10) : 
 Old event
 Jul 11 11:07:37 node01 crmd: [25014]: WARN: update_failcount: Updating 
 failcount for p_drbd_wwwdata:0 on node01 after failed monitor: rc=-2 
 (update=value++, time=1310375257)
 ----- snip -----
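
 To pull events like this out of a longer syslog, something along the
 following lines might help. It is only a rough sketch: the pattern is
 modelled on the single crmd "Timed Out" line shown above, it assumes the
 mail client's line wrapping has been undone, and the names TIMEOUT_RE and
 find_timeouts are made up for illustration.

```python
import re

# Modelled on the crmd line quoted above (an assumption, not a log format spec):
#   "<stamp> <node> crmd: [...]: ERROR: process_lrm_event: LRM operation
#    <op> (<call id>) Timed Out (timeout=<ms>ms)"
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<node>\S+)\scrmd:.*"
    r"LRM operation (?P<op>\S+) \((?P<call>\d+)\) Timed Out "
    r"\(timeout=(?P<ms>\d+)ms\)"
)


def find_timeouts(lines):
    """Return (timestamp, node, operation, timeout_ms) for each timed-out op."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append((m["stamp"], m["node"], m["op"], int(m["ms"])))
    return hits


if __name__ == "__main__":
    # Two lines from the excerpt above, with the wrapping undone.
    sample = [
        "Jul 11 11:06:29 node01 mysql[6665]: INFO: MySQL monitor succeeded",
        "Jul 11 11:07:37 node01 crmd: [25014]: ERROR: process_lrm_event: LRM "
        "operation p_drbd_wwwdata:0_monitor_20000 (38) Timed Out "
        "(timeout=20000ms)",
    ]
    for hit in find_timeouts(sample):
        print(hit)
```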

 The chmod would have taken a few minutes to complete, but it shouldn't
 have had any larger impact on the rest of the system. Raising the monitor
 timeout further and further doesn't look like the way to go here.
 Is there a way to ensure that the monitor operations return within a
 reasonable time frame even under high load?
 Or is there something fundamentally flawed in our setup?
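
 One way to narrow this down might be to time a DRBD status query by hand
 while the load is running. The sketch below is only that: it assumes the
 monitor's run time is dominated by calls such as "drbdadm role wwwdata"
 (the resource agent may well do more than this), and time_command is a
 hypothetical helper, not part of any cluster tool.

```python
import subprocess
import time


def time_command(cmd, timeout_s=20.0):
    """Run cmd and return (elapsed_seconds, outcome).

    outcome is the exit code, or "timeout" if cmd exceeded timeout_s,
    or "not found" if the executable does not exist.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        outcome = proc.returncode
    except subprocess.TimeoutExpired:
        outcome = "timeout"
    except FileNotFoundError:
        outcome = "not found"
    return time.monotonic() - start, outcome


if __name__ == "__main__":
    # Assumption: drbdadm is installed and the resource is named "wwwdata".
    # The 20 s limit mirrors the monitor timeout configured above.
    elapsed, outcome = time_command(["drbdadm", "role", "wwwdata"], timeout_s=20.0)
    print(f"elapsed={elapsed:.2f}s outcome={outcome}")
```

 If the query itself returns quickly under load, the delay would have to
 come from somewhere else in the monitor path.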

 Thanks in advance!

-- 
 Sebastian Kaps