[ClusterLabs] Regression in Filesystem RA

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Oct 24 06:59:17 UTC 2017


Hi,

On Thu, Oct 19, 2017 at 10:23:52AM +0900, Christian Balzer wrote:
> 
> Hello Dejan,
> 
> On Tue, 17 Oct 2017 13:13:11 +0200 Dejan Muhamedagic wrote:
> 
> > Hi Lars,
> > 
> > On Mon, Oct 16, 2017 at 08:52:04PM +0200, Lars Ellenberg wrote:
> > > On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote:  
> > > > Hi,
> > > > 
> > > > On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:  
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > 2nd post in 10 years, let's see if this one gets an answer, unlike the first
> > > > > one...  
> > > 
> > > Do you want to make me check for the old one? ;-)
> > >   
> > > > > One of the main use cases for pacemaker here is DRBD-replicated
> > > > > active/active mailbox servers (dovecot/exim) on Debian machines.
> > > > > We've been doing this for a long time, as evidenced by the oldest pair
> > > > > still running Wheezy with heartbeat and pacemaker 1.1.7.
> > > > > 
> > > > > The majority of cluster pairs are on Jessie with corosync and backported
> > > > > pacemaker 1.1.16.
> > > > > 
> > > > > Yesterday we had a hiccup, resulting in half the machines losing
> > > > > their upstream router for 50 seconds, which in turn caused the pingd RA to
> > > > > trigger a fail-over of the DRBD RA and associated resource group
> > > > > (filesystem/IP) to the other node. 
> > > > > 
> > > > > The old cluster performed flawlessly; the newer clusters all wound up with
> > > > > the DRBD and FS resources being BLOCKED, as the processes holding open the
> > > > > filesystem didn't get killed fast enough.
> > > > > 
> > > > > Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> > > > > "signal_processes" routine.
> > > > > 
> > > > > So with the old Filesystem RA using fuser we get something like this, with
> > > > > thousands of processes killed per second:
> > > > > ---
> > > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout)   3478  3593   ...
> > > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stderr) cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> > > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: (res_Filesystem_mb07:stop:stdout)   4032  4058   ...
> > > > > ---
> > > > > 
> > > > > Whereas the new RA (newer isn't better), which goes around killing processes
> > > > > individually with beautiful logging, was a total fail at about 4 processes
> > > > > killed per second...
> > > > > ---
> > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4226    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4229    4909  0 09:43 ?        S      0:00 dovecot/imap [idling]
> > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4238    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending signal TERM to: mail        4239    4909  0 09:43 ?        S      0:00 dovecot/imap 
> > > > > ---
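> > > > > 
> > > > > The difference boils down to roughly this (my rough sketch from reading
> > > > > the two scripts, not the exact upstream code; note that fuser prints the
> > > > > PIDs on stdout and the access flags on stderr, which is exactly the old
> > > > > log output above):
> > > > > ---
> > > > > # old: one fuser call signals every process using the mount at once
> > > > > signal_processes_bulk() {
> > > > >     dir="$1" sig="$2"
> > > > >     fuser -$sig -m -k "$dir"
> > > > > }
> > > > > 
> > > > > # new: collect the PIDs, then run ps and kill once per PID, which
> > > > > # is where the ~4 processes per second come from
> > > > > signal_processes_per_pid() {
> > > > >     dir="$1" sig="$2"
> > > > >     for pid in $(fuser -m "$dir" 2>/dev/null); do
> > > > >         ocf_log info "sending signal $sig to: $(ps --no-headers -f -p $pid)"
> > > > >         kill -s "$sig" "$pid"
> > > > >     done
> > > > > }
> > > > > ---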
> > > > > 
> > > > > So my questions are:
> > > > > 
> > > > > 1. Am I the only one with more than a handful of processes per FS who
> > > > > can't afford to wait hours for the new routine to finish?  
> > > > 
> > > > The change was introduced about five years ago.  
> > > 
> Yeah, that was thanks to Debian Jessie not having pacemaker at all from
> the start, and when the backport arrived it was corosync-only, without a
> graceful transition path from heartbeat, so quite a few machines stayed
> on Wheezy (thanks to the LTS efforts). 
> 
> > > Also, usually there should be no process anymore,
> > > because whatever is using the Filesystem should have its own RA,
> > > which should have appropriate constraints,
> > > which means it should have been called and "stop"ped first,
> > > before the Filesystem stop and umount, and only the "accidental,
> > > stray, abandoned, idle since three weeks, operator shell session,
> > > that happened to cd into that file system" is supposed to be around
> > > *unexpectedly* and in need of killing, and not "thousands of service
> > > processes, expectedly".  
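> > > 
> > > A minimal sketch of that intended shape, in crm shell syntax with
> > > made-up resource names (the dovecot agent here is purely illustrative):
> > > ---
> > > primitive res_fs ocf:heartbeat:Filesystem \
> > >         params device=/dev/drbd0 directory=/mail/spool fstype=ext4
> > > primitive res_mailsvc systemd:dovecot
> > > # group members start in listed order and stop in reverse, so the
> > > # mail service is stopped before the Filesystem is unmounted
> > > group grp_mail res_fs res_mailsvc
> > > ---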
> > 
> > Indeed, but obviously one can never tell ;-)
> > 
> > > So arguably your setup is broken,  
> > 
> > Or the other RA didn't/couldn't stop the resource ...
> > 
> See my previous mail: there is no good/right way to solve this with an RA
> for dovecot, which would essentially mimic what the FS RA should be doing,
> since stopping dovecot entirely is not what is called for.
> 
> > > relying on a fall-back workaround
> > > which used to "perform" better.
> > > 
> > > The bug is not that this fall-back workaround now
> > > has pretty printing and is much slower (and eventually times out),
> > > the bug is that you don't properly kill the service first.
> > > [and that you don't have fencing].  
> > 
> > ... and didn't exit with an appropriate exit code (i.e. fail).
> > 
> Could somebody elaborate on this, especially the fencing part?

If a stop action of any cluster resource fails, the node is fenced
(provided stonith is configured and enabled).
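
For example, something like this (a sketch only, with made-up names;
adjust to your fence hardware):

	property stonith-enabled=true
	primitive st_mbx10 stonith:external/ipmi \
		params hostname=mbx10 ipaddr=192.0.2.10 userid=admin passwd=secret

Without working node-level fencing the cluster has no safe way to
recover from a failed stop, so the resource is left BLOCKED instead,
which matches what you saw.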

> Because DRBD fencing is configured and working as expected.
> 
> > > > > 2. Can we have the old FUSER (kill) mode back?  
> > > > 
> > > > Yes. I'll make a pull request.  
> > > 
> > > Still, that's a sane thing to do,
> > > thanks, dejanm.  
> > 
> > Right. We probably cannot fix all issues coming from various RAs
> > or configurations, but we should at least try a bit harder.
> > 
> Indeed, your response is much appreciated.
> 
> And while we're at it, I'd like to point out that while fuser is faster
> than this, it still slows down much worse than linearly with the number of
> processes.
> 
> To wit:
> ---
> # time lsof -n |grep mb07.dentaku |awk '{print $2}' |uniq |wc
>    1381    1381    7777
> 
> real    0m0.874s
> user    0m0.396s
> sys     0m0.496s
> 
> # time fuser -m /mail/spool/mb07.dentaku.gol.com >/dev/null 
> real    0m0.426s
> user    0m0.112s
> sys     0m0.308s
> ---
> Snappy, fuser beats my crappy pipe for 1400 processes.
> 
> Alas on a busier server:
> ---
> # time lsof -n |grep mb10.dentaku |awk '{print $2}' |uniq |wc
>    8319    8319   56849
> 
> real    0m4.767s
> user    0m2.176s
> sys     0m2.676s
> 
> # time fuser -m /mail/spool/mb10.dentaku.gol.com >/dev/null
> real    0m11.414s
> user    0m10.292s
> sys     0m1.112s
> ---
> So with 8K processes fuser is clearly falling behind.
> 
> 
> And finally on my largest machines it turns into molasses:
> ---
> # time lsof -n |grep mb11.dentaku |awk '{print $2}' |uniq |wc 
>   33319   33319  231577
> 
> real    0m26.349s
> user    0m15.920s
> sys     0m10.688s
> 
> # time fuser -m /mail/spool/mb11.dentaku.gol.com >/dev/null
> real    2m32.957s
> user    2m28.556s
> sys     0m4.376s
> ---

Heh, somebody should take a look at fuser.

> While the 2.5 minutes would still be acceptable of sorts (with a properly
> set stop-timeout), clearly the 26s version is preferable. 
> 
> So what I'm suggesting here is looking at the lsof approach to speed
> things up, no matter how many processes there are.
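> 
> Something along these lines, perhaps (rough and untested; lsof given a
> mount point lists everything open on that filesystem, and -t prints
> bare PIDs, so no awk/uniq post-processing is needed):
> ---
> get_pids() {
>         dir="$1"
>         if command -v lsof >/dev/null 2>&1; then
>                 lsof -t "$dir" 2>/dev/null | sort -u
>         else
>                 # fall back to fuser if lsof is not installed
>                 fuser -m "$dir" 2>/dev/null
>         fi
> }
> ---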

I just made a pull request:

https://github.com/ClusterLabs/resource-agents/pull/1042

It would be great if you could test it!

Cheers,

Dejan

> Regards,
> 
> Christian
> 
> > > Maybe we can even come up with a way
> > > to both "pretty print" and kill fast?  
> > 
> > My best guess right now is no ;-) But we could log nicely for the
> > usual case of a small number of stray processes ... maybe
> > something like this:
> > 
> > 	# get_pids, killnlog and justkill are placeholders here; fold -s
> > 	# chops the PID list into chunks of at most 80 characters, the
> > 	# first chunk is killed with logging, the rest without.
> > 	i=""
> > 	get_pids | tr '\n' ' ' | fold -s |
> > 	while read procs || [ -n "$procs" ]; do
> > 		if [ -z "$i" ]; then
> > 			killnlog $procs
> > 			i="nolog"
> > 		else
> > 			justkill $procs
> > 		fi
> > 	done
> > 
> > Cheers,
> > 
> > Dejan
> > 
> > > -- 
> > > : Lars Ellenberg
> > > : LINBIT | Keeping the Digital World Running
> > > : DRBD -- Heartbeat -- Corosync -- Pacemaker
> > > : R&D, Integration, Ops, Consulting, Support
> > > 
> > > DRBD® and LINBIT® are registered trademarks of LINBIT
> > > 
> > > _______________________________________________
> > > Users mailing list: Users at clusterlabs.org
> > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > 
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org  
> > 
> > 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi at gol.com   	Rakuten Communications



