[ClusterLabs] Regression in Filesystem RA

Mon Oct 16 14:52:04 EDT 2017

On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote:
> Hi,
> 
> On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > 2nd post in 10 years, let's see if this one gets an answer unlike the first
> > one...

Do you want to make me check for the old one? ;-)

> > One of the main use cases for pacemaker here is DRBD replicated
> > active/active mailbox servers (dovecot/exim) on Debian machines.
> > We've been doing this for a loong time, as evidenced by the oldest pair
> > still running Wheezy with heartbeat and pacemaker 1.1.7.
> > 
> > The majority of cluster pairs is on Jessie with corosync and backported
> > pacemaker 1.1.16.
> > 
> > Yesterday we had a hiccup, resulting in half the machines losing
> > their upstream router for 50 seconds which in turn caused the pingd RA to
> > trigger a fail-over of the DRBD RA and associated resource group
> > (filesystem/IP) to the other node. 
> > 
> > The old cluster performed flawlessly, the newer clusters all wound up with
> > the DRBD and FS resources being BLOCKED as the processes holding open the
> > filesystem didn't get killed fast enough.
> > 
> > Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> > "signal_processes" routine.
> > 
> > So with the old Filesystem RA using fuser we get something like this and
> > thousands of processes killed per second:
> > ---
> > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout)   3478  3593   ...
> > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stderr) cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout)   4032  4058   ...
> > ---
> > 
> > Whereas the new RA (newer isn't better) that goes around killing processes
> > individually with beautiful logging was a total fail at about 4 processes
> > per second killed...
> > ---
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4226    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4229    4909  0 09:43 ?        S      0:00 dovecot/imap [idling]
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4238    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4239    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > ---
> > 
> > So my questions are:
> > 
> > 1. Am I the only one with more than a handful of processes per FS who
> > can't afford to wait hours for the new routine to finish?
> 
> The change was introduced about five years ago.

Also, usually there should be no processes left at that point,
because whatever is using the Filesystem should have its own RA,
with appropriate constraints,
which means that RA should have been called and "stop"ped first,
before the Filesystem stop and umount. Only the "accidental,
stray, abandoned, idle for three weeks, operator shell session
that happened to cd into that file system" is supposed to be around
*unexpectedly* and in need of killing, not "thousands of service
processes, expectedly".
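
To illustrate the kind of constraints meant here, a sketch in crm
shell syntax. It assumes a primitive named res_dovecot that manages
the mail service; that name is hypothetical and not taken from the
setup described above. The Filesystem resource name is the one from
the quoted logs.

```sh
# Assumption: res_dovecot is an already defined primitive for the mail
# service (hypothetical name). res_Filesystem_mb10 is from the logs above.

# Run the service only where the filesystem is mounted.
crm configure colocation col_dovecot_with_fs inf: res_dovecot res_Filesystem_mb10

# Start the filesystem before the service. Ordering is symmetrical by
# default, so on stop the service is stopped before the umount.
crm configure order ord_fs_before_dovecot Mandatory: res_Filesystem_mb10 res_dovecot
```

With something like this in place, the Filesystem stop should only
ever find the odd stray process, not the whole service.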

So arguably your setup is broken,
relying on a fall-back workaround
which used to "perform" better.

The bug is not that this fall-back workaround now
has pretty printing and is much slower (and eventually times out);
the bug is that you don't properly kill the service first.
[and that you don't have fencing].

> > 2. Can we have the old FUSER (kill) mode back?
> 
> Yes. I'll make a pull request.

Still, that's a sane thing to do,
thanks, dejanm.

Maybe we can even come up with a way
to both "pretty print" and kill fast?
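
A rough sketch of what that could look like, not the actual
Filesystem RA code: take the whole PID list, log it with a single ps
call, then signal it with a single kill call. The function name
signal_pids_batched is made up for illustration, plain echo stands in
for the RA's logging, and a procps-style ps that accepts a
comma-separated list for -p is assumed.

```sh
#!/bin/sh
# Hypothetical helper, not taken from the Filesystem RA.
# Usage: signal_pids_batched SIGNAL PID...
signal_pids_batched() {
    sig=$1
    shift
    # Nothing to do if no PIDs were given.
    [ $# -gt 0 ] || return 0

    # Join the PIDs with commas so one ps call covers the whole batch.
    pidlist=$(printf '%s,' "$@")
    pidlist=${pidlist%,}

    # Pretty print: one log line per process, but only one ps invocation.
    # sed 1d drops the ps header line.
    ps -f -p "$pidlist" | sed 1d | while IFS= read -r line; do
        echo "INFO: sending signal $sig to: $line"
    done

    # Kill fast: one kill call for the whole batch. Processes that have
    # exited in the meantime are not treated as an error.
    kill -s "$sig" "$@" 2>/dev/null
    return 0
}
```

The PID list would come from wherever the agent already collects it,
for example from fuser -m on the mount point. The cost per stop is
then two process spawns for the batch instead of several per process.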

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT