[ClusterLabs] Regression in Filesystem RA
Christian Balzer
chibi at gol.com
Wed Oct 18 21:23:52 EDT 2017
Hello Dejan,
On Tue, 17 Oct 2017 13:13:11 +0200 Dejan Muhamedagic wrote:
> Hi Lars,
>
> On Mon, Oct 16, 2017 at 08:52:04PM +0200, Lars Ellenberg wrote:
> > On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote:
> > > Hi,
> > >
> > > On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
> > > >
> > > > Hello,
> > > >
> > > > 2nd post in 10 years, let's see if this one gets an answer, unlike the first
> > > > one...
> >
> > Do you want to make me check for the old one? ;-)
> >
> > > > One of the main use cases for pacemaker here is DRBD-replicated
> > > > active/active mailbox servers (dovecot/exim) on Debian machines.
> > > > We've been doing this for a long time, as evidenced by the oldest pair
> > > > still running Wheezy with heartbeat and pacemaker 1.1.7.
> > > >
> > > > The majority of cluster pairs is on Jessie with corosync and backported
> > > > pacemaker 1.1.16.
> > > >
> > > > Yesterday we had a hiccup, resulting in half the machines losing
> > > > their upstream router for 50 seconds, which in turn caused the pingd RA to
> > > > trigger a fail-over of the DRBD RA and associated resource group
> > > > (filesystem/IP) to the other node.
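For those not familiar with this kind of setup: a connectivity-driven
failover like this is typically a ping(d) clone plus a location rule, roughly
along these lines (crm shell syntax; resource and group names made up for the
example):
---
# clone a ping resource so every node tracks the upstream router
primitive p_ping ocf:pacemaker:ping \
        params host_list="192.0.2.1" multiplier="1000" \
        op monitor interval="10s"
clone cl_ping p_ping
# push the group away from a node that has lost connectivity
location loc_grp_on_ping grp_mb rule -inf: not_defined pingd or pingd lte 0
---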
> > > >
> > > > The old cluster performed flawlessly; the newer clusters all wound up with
> > > > the DRBD and FS resources BLOCKED, as the processes holding the
> > > > filesystem open didn't get killed fast enough.
> > > >
> > > > Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> > > > "signal_processes" routine.
> > > >
> > > > So with the old Filesystem RA, which uses fuser, we get something like this,
> > > > with thousands of processes killed per second:
> > > > ---
> > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout) 3478 3593 ...
> > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stderr) cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout) 4032 4058 ...
> > > > ---
> > > >
> > > > Whereas the new RA (newer isn't better), which goes around killing processes
> > > > individually with beautiful logging, was a total failure at about 4 processes
> > > > killed per second...
> > > > ---
> > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail 4226 4909 0 09:43 ? S 0:00 dovecot/imap
> > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail 4229 4909 0 09:43 ? S 0:00 dovecot/imap [idling]
> > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail 4238 4909 0 09:43 ? S 0:00 dovecot/imap
> > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail 4239 4909 0 09:43 ? S 0:00 dovecot/imap
> > > > ---
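To spell out the difference: the old stop path boils down to a single fuser
invocation that signals everything holding the mount at once, while the new
one walks the ps output and signals PIDs one at a time. Paraphrased sketch of
the old behaviour (not the literal RA code):
---
# one fuser call, one pass over /proc, every offender signalled in bulk;
# $sig is e.g. TERM or KILL, $dir is the mountpoint
signal_processes() {
        local dir="$1" sig="$2"
        fuser -"$sig" -m -k "$dir"
}
---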
> > > >
> > > > So my questions are:
> > > >
> > > > 1. Am I the only one with more than a handful of processes per FS who
> > > > can't afford to wait hours for the new routine to finish?
> > >
> > > The change was introduced about five years ago.
> >
Yeah, that was thanks to Debian Jessie not having pacemaker at all from
the start; when the backport arrived it was corosync-only, with no
graceful transition path from heartbeat, so quite a few machines stayed
on Wheezy (thanks to the LTS efforts).
> > Also, usually there should be no processes left anymore,
> > because whatever is using the Filesystem should have its own RA,
> > which should have appropriate constraints,
> > which means it should have been called and "stop"ped first,
> > before the Filesystem stop and umount, and only the "accidental,
> > stray, abandoned, idle-for-three-weeks operator shell session
> > that happened to cd into that file system" is supposed to be around
> > *unexpectedly* and in need of killing, not "thousands of service
> > processes, expectedly".
>
> Indeed, but obviously one can never tell ;-)
>
> > So arguably your setup is broken,
>
> Or the other RA didn't/couldn't stop the resource ...
>
See my previous mail: there is no good/right way to solve this with an RA
for dovecot, since such an RA would essentially have to mimic what the FS RA
should be doing, and stopping dovecot entirely is not what is called for.
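For reference, the ordering/colocation Lars is describing would look roughly
like this in crm shell (res_dovecot is a made-up name for the example):
---
# run the service where its filesystem is mounted, start it after the
# filesystem and (symmetrically) stop it before the filesystem is unmounted
colocation col_dovecot_with_fs inf: res_dovecot res_Filesystem_mb10
order ord_fs_before_dovecot inf: res_Filesystem_mb10 res_dovecot
---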
> > relying on a fall-back workaround
> > which used to "perform" better.
> >
> > The bug is not that this fall-back workaround now
> > has pretty printing and is much slower (and eventually times out),
> > the bug is that you don't properly kill the service first.
> > [and that you don't have fencing].
>
> ... and didn't exit with an appropriate exit code (i.e. fail).
>
Could somebody elaborate on this, especially the fencing part?
Because DRBD fencing is configured and working as expected.
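For context, "DRBD fencing" here means the usual resource-level handler setup
from the DRBD user's guide, along these lines (resource name made up; section
placement varies between DRBD versions):
---
resource r_mail {
        disk {
                fencing resource-only;
        }
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
}
---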
> > > > 2. Can we have the old FUSER (kill) mode back?
> > >
> > > Yes. I'll make a pull request.
> >
> > Still, that's a sane thing to do,
> > thanks, dejanm.
>
> Right. We probably cannot fix all issues coming from various RAs
> or configurations, but we should at least try a bit harder.
>
Indeed, your response is much appreciated.
And while we're at it, I'd like to point out that while fuser is much faster
than the per-process kill loop, it still slows down dramatically (far worse
than linearly) as the number of processes grows.
To wit:
---
# time lsof -n |grep mb07.dentaku |awk '{print $2}' |uniq |wc
1381 1381 7777
real 0m0.874s
user 0m0.396s
sys 0m0.496s
# time fuser -m /mail/spool/mb07.dentaku.gol.com >/dev/null
real 0m0.426s
user 0m0.112s
sys 0m0.308s
---
Snappy: fuser beats my crappy lsof pipe at 1400 processes.
Alas, on a busier server:
---
# time lsof -n |grep mb10.dentaku |awk '{print $2}' |uniq |wc
8319 8319 56849
real 0m4.767s
user 0m2.176s
sys 0m2.676s
# time fuser -m /mail/spool/mb10.dentaku.gol.com >/dev/null
real 0m11.414s
user 0m10.292s
sys 0m1.112s
---
So with 8K processes fuser is clearly falling behind.
And finally on my largest machines it turns into molasses:
---
# time lsof -n |grep mb11.dentaku |awk '{print $2}' |uniq |wc
33319 33319 231577
real 0m26.349s
user 0m15.920s
sys 0m10.688s
# time fuser -m /mail/spool/mb11.dentaku.gol.com >/dev/null
real 2m32.957s
user 2m28.556s
sys 0m4.376s
---
While the 2.5 minutes would still be more or less acceptable (with a properly
set stop timeout), the 26s version is clearly preferable.
So what I'm suggesting here is looking at an lsof-based approach, which keeps
the stop fast no matter how many processes there are.
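Roughly what I have in mind, reusing the crude pipe from above (mount name
and signal to taste):
---
# one full lsof pass, collect the unique PIDs with files open on the mount,
# then signal them in bulk instead of issuing one kill per process
lsof -n | grep mb10.dentaku | awk '{print $2}' | sort -u | xargs -r kill -TERM
---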
Regards,
Christian
> > Maybe we can even come up with a way
> > to both "pretty print" and kill fast?
>
> My best guess right now is no ;-) But we could log nicely for the
> usual case of a small number of stray processes ... maybe
> something like this:
>
> i=""
> get_pids | tr '\n' ' ' | fold -s |
> while read procs; do
> if [ -z "$i" ]; then
> killnlog $procs
> i="nolog"
> else
> justkill $procs
> fi
> done
>
> Cheers,
>
> Dejan
>
> > --
> > : Lars Ellenberg
> > : LINBIT | Keeping the Digital World Running
> > : DRBD -- Heartbeat -- Corosync -- Pacemaker
> > : R&D, Integration, Ops, Consulting, Support
> >
> > DRBD® and LINBIT® are registered trademarks of LINBIT
> >
--
Christian Balzer Network/Systems Engineer
chibi at gol.com Rakuten Communications