[Pacemaker] DRBD monitor time out in high I/O situations

Lars Marowsky-Bree lmb at suse.de
Tue Jul 12 06:05:39 EDT 2011

On 2011-07-12T10:37:47, Sebastian Kaps <sebastian.kaps at imail.de> wrote:

Hi Sebastian,

> Our goal is to create an Active/Standby MySQL cluster with the
> databases being
> on the XFS filesystem. The OCFS2 FS is supposed to store data that
> is created by
> scripts that access the MySQL server database.

That sounds perfectly viable.

> The problem with the setup is that the DRBD monitor operation seem
> to time out in situations with high I/O load,

That shouldn't happen, obviously. The question is why it does; do you
see high network traffic during these times? How's the performance of
DRBD in general?

Is DRBD's backing device on the same local disk as the system itself? If
so, then they might impact each other.

> triggering a Failover-attempt followed by one node getting STONITH'd
> since the file system is still busy running
> the operation that caused this in the first place.

Well, in theory, the Filesystem RA should kill everything before trying
to umount, so assuming you have constraints as well, at least the
STONITH shouldn't happen, either.

> ----- snip -----
> Jul 11 11:06:14 node01 lrmd: [25011]: info: rsc:p_drbd_mysql:0:39:
> monitor
> Jul 11 11:06:14 node01 lrmd: [25011]: info: rsc:p_drbd_wwwdata:0:38:
> monitor
> Jul 11 11:06:29 node01 mysql[6665]: INFO: MySQL monitor succeeded
> Jul 11 11:07:37 node01 lrmd: [25011]: WARN: p_drbd_wwwdata:0:monitor
> process (PID 6776) timed out (try 1).  Killing with signal SIGTERM
> (15).

DRBD's monitor operation is not that heavy-weight; I can't immediately
see why the IO load on the file system it hosts should affect it this much.

As a work-around, increasing the timeout is fine - gather some
statistics on how long the monitor actually takes to complete, both in
normal operation and under load, and then tune the timeout accordingly.
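
One rough way to collect those numbers is to time a status command
repeatedly, once while the system is idle and once under load. The
script below is only a sketch: the file name time_monitor.py, the
function names and the run count are made up for illustration, and the
command you pass in is merely a stand-in for what the resource agent's
monitor really executes.

```python
#!/usr/bin/env python3
"""Time a command repeatedly and summarize how long each run takes."""
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=20, pause=1.0):
    """Run cmd `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for i in range(runs):
        start = time.monotonic()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.monotonic() - start)
        if i < runs - 1:
            time.sleep(pause)
    return durations


def summarize(durations):
    """Return run count, minimum, median and maximum of the durations."""
    ordered = sorted(durations)
    return {
        "runs": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    if len(sys.argv) > 1:
        stats = summarize(time_command(sys.argv[1:]))
        print("runs={runs} min={min:.3f}s median={median:.3f}s "
              "max={max:.3f}s".format(**stats))
    else:
        print("usage: time_monitor.py COMMAND [ARG...]")
```

For example, "python3 time_monitor.py drbdadm role <resource>" would
time DRBD's role query; whether that is a fair proxy for the monitor
operation is an assumption. Compare the maximum seen under load with
the configured monitor timeout and leave some headroom above it.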

You can either file a support ticket with Novell/SUSE (for addressing
the DRBD slowdown), or if you want to continue to pursue the community
angle, the drbd mailing lists are a better place for this than
pacemaker - it's not a pacemaker issue.

Good luck!


Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
