[ClusterLabs] Regression in Filesystem RA

Tue Oct 17 07:13:11 EDT 2017

Hi Lars,

On Mon, Oct 16, 2017 at 08:52:04PM +0200, Lars Ellenberg wrote:
> On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
> > > 
> > > Hello,
> > > 
> > > 2nd post in 10 years, let's see if this one gets an answer unlike the first
> > > one...
> 
> Do you want to make me check for the old one? ;-)
> 
> > > One of the main use cases for pacemaker here is DRBD replicated
> > > active/active mailbox servers (dovecot/exim) on Debian machines. 
> > > We've been doing this for a loong time, as evidenced by the oldest pair
> > > still running Wheezy with heartbeat and pacemaker 1.1.7.
> > > 
> > > The majority of cluster pairs is on Jessie with corosync and backported
> > > pacemaker 1.1.16.
> > > 
> > > Yesterday we had a hiccup, resulting in half the machines losing
> > > their upstream router for 50 seconds which in turn caused the pingd RA to
> > > trigger a fail-over of the DRBD RA and associated resource group
> > > (filesystem/IP) to the other node. 
> > > 
> > > The old cluster performed flawlessly, the newer clusters all wound up with
> > > DRBD and FS resource being BLOCKED as the processes holding open the
> > > filesystem didn't get killed fast enough.
> > > 
> > > Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> > > "signal_processes" routine.
> > > 
> > > So with the old Filesystem RA using fuser we get something like this and
> > > thousands of processes killed per second:
> > > ---
> > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout)   3478  3593   ...
> > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stderr) cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout)   4032  4058   ...
> > > ---
> > > 
> > > Whereas the new RA (newer isn't better) that goes around killing processes
> > > individually with beautiful logging was a total fail at about 4 processes
> > > per second killed...
> > > ---
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4226    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4229    4909  0 09:43 ?        S      0:00 dovecot/imap [idling]
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4238    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4239    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > > ---
> > > 
> > > So my questions are:
> > > 
> > > 1. Am I the only one with more than a handful of processes per FS who
> > > can't afford to wait hours for the new routine to finish?
> > 
> > The change was introduced about five years ago.
> 
> Also, usually there should be no process anymore,
> because whatever is using the Filesystem should have its own RA,
> which should have appropriate constraints,
> which means that should have been called and "stop"ped first,
> before the Filesystem stop and umount, and only the "accidental,
> stray, abandoned, idle for three weeks, operator shell session,
> that happened to cd into that file system" is supposed to be around
> *unexpectedly* and in need of killing, and not "thousands of service
> processes, expectedly".

Indeed, but obviously one can never tell ;-)

> So arguably your setup is broken,

Or the other RA didn't/couldn't stop the resource ...

> relying on a fall-back workaround
> which used to "perform" better.
> 
> The bug is not that this fall-back workaround now
> has pretty printing and is much slower (and eventually times out),
> the bug is that you don't properly kill the service first.
> [and that you don't have fencing].

... and didn't exit with an appropriate exit code (i.e. fail).

> > > 2. Can we have the old FUSER (kill) mode back?
> > 
> > Yes. I'll make a pull request.
> 
> Still, that's a sane thing to do,
> thanks, dejanm.

Right. We probably cannot fix all issues coming from various RAs
or configurations, but we should at least try a bit harder.

> Maybe we can even come up with a way
> to both "pretty print" and kill fast?

My best guess right now is no ;-) But we could log nicely for the
usual case of a small number of stray processes ... maybe
something like this:

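	# killnlog and justkill would be new helpers, they don't exist yet
	i=""
	get_pids | tr '\n' ' ' | fold -s |
	# tr eats the final newline, so also accept an unterminated last chunk
	while read -r procs || [ -n "$procs" ]; do
		if [ -z "$i" ]; then
			killnlog $procs		# first chunk only: log, then kill
			i="nolog"
		else
			justkill $procs		# remaining chunks: kill without logging
		fi
	done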
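
To try the chunking idea without touching any real processes, here is a
minimal stand-alone sketch. Everything in it is an assumption made for
illustration: get_pids is stubbed with seq to print 100 fake PIDs,
killnlog and justkill are hypothetical helpers, and echo stands in for
the actual kill call. It is not the RA code, just the loop in isolation.

```shell
#!/bin/sh
# Stub for the RA's PID lookup: prints 100 fake PIDs, one per line.
get_pids() {
	seq 1001 1100
}

# Hypothetical helper: log the PIDs, then signal them.
# echo replaces the real kill so the sketch is safe to run.
killnlog() {
	echo "INFO: sending signal TERM to: $*"
	echo "kill -s TERM $*"
}

# Hypothetical helper: signal the PIDs without logging.
justkill() {
	echo "kill -s TERM $*"
}

signal_in_chunks() {
	i=""
	# Join all PIDs on one line, then let fold -s split that line
	# at blanks into chunks of at most 80 columns.
	get_pids | tr '\n' ' ' | fold -s |
	# The last chunk has no trailing newline, so read returns
	# non-zero for it; the -n test keeps that chunk from being lost.
	while read -r procs || [ -n "$procs" ]; do
		if [ -z "$i" ]; then
			killnlog $procs
			i="nolog"
		else
			justkill $procs
		fi
	done
}

signal_in_chunks
```

With these stubs the output is a single INFO line followed by several
"kill -s TERM ..." lines, one per chunk. The loop runs in the pipeline's
subshell, so $i is only visible inside it, which is all that is needed here.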

Cheers,

Dejan

> -- 
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> : R&D, Integration, Ops, Consulting, Support
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT