[ClusterLabs] Antw: Re: ocf_take_lock is NOT actually safe to use
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Jul 5 04:37:56 EDT 2017
Hi!
Could it be that I pointed out this problem about six years ago? At least I found a locking implementation for OCF RAs dating from 2011. My approach is a bit different: I use a global MUTEX lock, and the actual lock files are created while holding it:
---
# Shell functions to support file-based locking
# (c) 2011 by Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
MUTEX_LOCK=${HA_RSCTMP}/_MUTEX_LOCK_ # lock for mutual exclusion
# create the given lockfile, writing the shell's PID as owner into it
# we assume that file descriptor 123 is not used elsewhere
xola_try_lock()
{
typeset pidfile="$1"
(flock -e 123 &&
if [ -e "$pidfile" ]; then # a lock file exists
typeset locker="$(<"$pidfile")"
if ! kill -0 "$locker" > /dev/null 2>&1; then
ocf_log info "grabbing stale PID lock $pidfile from $locker"
echo $$ > "$pidfile" # overwrite stale lock file, success
else
ocf_log info "PID $locker owns lock $pidfile"
false # valid lock file exists, failure
fi
else
echo $$ > "$pidfile" # create a new lock file, success
fi) 123> "$MUTEX_LOCK"
}
# create the given lock file, waiting if necessary until the lock becomes available
xola_lock()
{
typeset delay=${2:-5} limit=${3:-500} # initial and maximum delay (hundredths of a second)
typeset sdelay # the delay formatted in seconds for sleep
while ! xola_try_lock "$1"; do
[ -w "$MUTEX_LOCK" ] || return 1
sdelay=$(printf '%d.%02d' $(expr $delay / 100) $(expr $delay % 100))
ocf_log info "waiting $sdelay seconds before re-trying lock $1"
sleep $sdelay
# use limited exponential backoff wait strategy
[ "$delay" -lt "$limit" ] && (( delay += delay / 3 ))
done
}
# check whether the given lock file is a valid lock
# we assume that file descriptor 123 is not used elsewhere
xola_check_lock()
{
typeset pidfile="$1"
(flock -e 123 &&
if [ -e "$pidfile" ]; then # a lock file exists
typeset locker="$(<"$pidfile")"
kill -0 "$locker" > /dev/null 2>&1
else
false
fi) 123> "$MUTEX_LOCK"
}
# remove the given lock file unconditionally
# we assume that file descriptor 124 is not used elsewhere
xola_unlock()
{
typeset pidfile="$1"
(flock -e 124 &&
if [ -e "$pidfile" ]; then
typeset locker="$(<"$pidfile")"
if [ "$locker" = $$ ]; then
rm "$pidfile"
else
ocf_log err "xola_unlock: lock $pidfile owned by $locker"
false
fi
else
ocf_log warn "xola_unlock: lock $pidfile was not locked"
false
fi) 124> "$MUTEX_LOCK"
}
---
So far this seems to work ;-)
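For illustration, a hypothetical RA action might wrap its critical section like this (the resource name and do_critical_start_work are made up):
---
pidfile="${HA_RSCTMP}/myresource.lock"
xola_lock "$pidfile" || exit $OCF_ERR_GENERIC # wait (with backoff) for the lock
do_critical_start_work                        # the protected region
xola_unlock "$pidfile"
---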
Regards,
Ulrich
>>> Dejan Muhamedagic <dejanmm at fastmail.fm> wrote on 26.06.2017 at 17:05 in
message <20170626150502.GB4588 at tuttle.homenet>:
> Hi,
>
> On Wed, Jun 21, 2017 at 04:40:47PM +0200, Lars Ellenberg wrote:
>>
>> Repost to a wider audience, to raise awareness for this.
>> ocf_take_lock may or may not be better than nothing.
>>
>> It at least "annotates" that the author would like to protect something
>> that is considered a "critical region" of the resource agent.
>>
>> At the same time, it does NOT deliver what the name seems to imply.
>>
>
> Lars, many thanks for the analysis and bringing this up again.
>
> I'm not going to address the details below, just to say that
> there's now a pull request for the issue:
>
> https://github.com/ClusterLabs/resource-agents/pull/995
>
> In short, it consists of reducing the race window size (by using
> mkdir*), a double test for stale locks, and an improved random
> number function. I ran numerous tests with and without stale locks
> and it seems to hold up quite well.
>
> The comments there contain a detailed description of the
> approach.
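>
> For anyone unfamiliar with the idiom: mkdir(2) is atomic, so only one
> caller can succeed in creating the lock directory. A rough sketch of
> the general pattern (not the actual PR code):
>
> while ! mkdir "$lockdir" 2>/dev/null; do
>     sleep 1  # lock held by someone else; staleness checks would go here
> done
> trap 'rmdir "$lockdir"' EXIT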
>
> Please review and comment whoever finds time.
>
> Cheers,
>
> Dejan
>
> *) Though the current implementation uses just a file and the
> proposed one uses directories, the locks are short-lived, so there
> shouldn't be problems on upgrades.
>
>> I think I brought this up a few times over the years, but was not noisy
>> enough about it, because it seemed not important enough: no-one was
>> actually using this anyway.
>>
>> But since new usage has been recently added with
>> [ClusterLabs/resource-agents] targetcli lockfile (#917)
>> here goes:
>>
>> On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
>> > On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
>> > > Note: ocf_take_lock is NOT actually safe to use.
>> > >
>> > > As implemented, it uses "echo $pid > lockfile" to create the lockfile,
>> > > which means if several such "ocf_take_lock" happen at the same time,
>> they all "succeed", but only the last one will be the "visible" one
>> to future waiters.
>> >
>> > Ugh.
>>
>> Exactly.
>>
>> Reproducer:
>> #############################################################
>> #!/bin/bash
>> export OCF_ROOT=/usr/lib/ocf/ ;
>> . /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs ;
>>
>> x() (
>> ocf_take_lock dummy-lock ;
>> ocf_release_lock_on_exit dummy-lock ;
>> set -C;
>> echo x > protected && sleep 0.15 && rm -f protected || touch BROKEN;
>> );
>>
>> mkdir -p /run/ocf_take_lock_demo
>> cd /run/ocf_take_lock_demo
>> rm -f BROKEN; i=0;
>> time while ! test -e BROKEN; do
>> x & x &
>> wait;
>> i=$(( i+1 ));
>> done ;
>> test -e BROKEN && echo "reproduced race in $i iterations"
>> #############################################################
>>
>> x() above takes the "dummy-lock" and, because of the () subshell
>> and ocf_release_lock_on_exit, releases it again; within the
>> protected region of code it creates and removes a file "protected".
>>
>> If ocf_take_lock were correct, there could never be two instances
>> inside the lock, so "echo x > protected" should never fail.
>>
>> With the current implementation of ocf_take_lock,
>> it takes "just a few" iterations here to reproduce the race.
>> (usually within a minute).
>>
>> The races I see in ocf_take_lock:
>> "creation race":
>> test -e $lock
>> # someone else may create it here
>> echo $$ > $lock
>> # but we override it with ours anyways
>>
>> "still empty race":
>> test -e $lock # maybe it already exists (open O_CREAT|O_TRUNC)
>> # but does not yet contain target pid,
>> pid=`cat $lock` # this one is empty,
>> kill -0 $pid # and this one fails
>> and thus a "just being created" one is considered stale
>>
>> There are other problems around "stale pid file detection",
>> but let's not go into that minefield right now.
>>
>> > > Maybe we should change it to
>> > > ```
>> > > while ! ( set -C; echo $pid > lockfile ); do
>> > > if test -e lockfile ; then
>> > > : error handling for existing lockfile, stale lockfile detection
>> > > else
>> > > : error handling for not being able to create lockfile
>> > > fi
>> > > done
>> > > : only reached if lockfile was successfully created
>> > > ```
>> > >
>> > > (or use flock or other tools designed for that purpose)
>> >
>> > flock would probably be the easiest. mkdir would do too, but for
>> > upgrade issues.
>>
>> and, being part of util-linux, flock should be available "everywhere".
>>
>> but because writing "wrappers" around flock similar to the intended
>> semantics of ocf_take_lock and ocf_release_lock_on_exit is not easy
>> either, usually you'd be better off using flock directly in the RA.
>>
>> so, still trying to do this with shell:
>>
>> "set -C" (respectively set -o noclober):
>> If set, disallow existing regular files to be overwritten
>> by redirection of output.
>>
>> normal '>' means: O_WRONLY|O_CREAT|O_TRUNC,
>> set -C '>' means: O_WRONLY|O_CREAT|O_EXCL
>>
>> using "set -C ; echo $$ > $lock" instead of
>> "test -e $lock || echo $$ > $lock"
>> gets rid of the "creation race".
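>>
>> A quick way to see the noclobber semantics, e.g. in bash:
>>
>> $ set -C
>> $ echo $$ > /tmp/lk
>> $ echo $$ > /tmp/lk
>> bash: /tmp/lk: cannot overwrite existing file
>>
>> the second redirection fails because the open uses O_EXCL.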
>>
>> To get rid of the "still empty race",
>> we'd need to play games with hardlinks:
>>
>> (
>> set -C
>> success=false
>> if echo $$ > .$lock ; then
>> ln .$lock $lock && success=true
>> rm -f .$lock
>> fi
>> $success
>> )
>>
>> That should be "good enough": the lock only becomes visible under
>> $lock once it already contains the PID, and the ln fails if $lock
>> already exists. Much better than what we have now.
>>
>>
>>
>> back to a possible "flock" wrapper,
>> maybe something like this:
>>
>> ocf_synchronize() {
>> local lockfile=$1
>> shift
>> (
>> flock -x 8 || exit 1
>> ( "$@" ) 8>&-
>> ) 8> "$lockfile"
>> }
>> # and then
>> ocf_synchronize my_exclusive_shell_function with some args
>>
>> As this runs in subshells,
>> it would not be able to change any variables visible
>> to the rest of the script, which may limit its usefulness.
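>> (One could capture stdout instead, e.g.
>> result=$(ocf_synchronize some_function args)
>> but that only transports text out of the subshell, not shell state.)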
>>
>>
>> or maybe have an ocf_rarun_synchronized(),
>> that will do (directly from the flock man page):
>> [ "${FLOCKER}" != "$0" ] && exec env FLOCKER="$0" flock -e "$0" "$0" "$@" ||
> :
>> with nonblocking and timeout variants?
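>> e.g. (untested), using flock's own -n and -w options:
>> # fail immediately if the lock is already held:
>> [ "${FLOCKER}" != "$0" ] && exec env FLOCKER="$0" flock -en "$0" "$0" "$@" || :
>> # or give up after a 10 second timeout:
>> [ "${FLOCKER}" != "$0" ] && exec env FLOCKER="$0" flock -e -w 10 "$0" "$0" "$@" || :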
>>
>>
>> I'd very much like some volunteer to step forward
>> and implement the actual patch...
>>
>>
>> Cheers,
>>
>> Lars
>>
>>