[ClusterLabs] ocf_take_lock is NOT actually safe to use

Wed Jun 21 10:40:47 EDT 2017

Repost to a wider audience, to raise awareness for this.
ocf_take_lock may or may not be better than nothing.

It at least "annotates" that the auther would like to protect something
that is considered a "critical region" of the resource agent.

At the same time, it does NOT deliver what the name seems to imply.

I think I brought this up a few times over the years, but was not noisy
enough about it, because it seemed not important enough: no-one was
actually using this anyways.

But since new usage has been recently added with
[ClusterLabs/resource-agents] targetcli lockfile (#917)
here goes:

On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > Note: ocf_take_lock is NOT actually safe to use.
> > 
> > As implemented, it uses "echo $pid > lockfile" to create the lockfile,
> > which means if several such "ocf_take_lock" happen at the same time,
> > they all "succeed", only the last one will be the "visible" one to future waiters.
> 
> Ugh.

Exactly.

Reproducer:
#############################################################
#!/bin/bash
export OCF_ROOT=/usr/lib/ocf/ ;
.  /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs ;

x() (
	ocf_take_lock dummy-lock ;
	ocf_release_lock_on_exit dummy-lock  ;
	set -C;
	echo x > protected && sleep 0.15 && rm -f protected || touch BROKEN;
);

mkdir -p /run/ocf_take_lock_demo
cd /run/ocf_take_lock_demo
rm -f BROKEN; i=0;
time while ! test -e BROKEN; do
	x &  x &
	wait;
	i=$(( i+1 ));
done ;
test -e BROKEN && echo "reproduced race in $i iterations"
#############################################################

x() above takes, and, because of the () subshell and
ocf_release_lock_on_exit, releases the "dummy-lock",
and within the protected region of code,
creates and removes a file "protected".

If ocf_take_lock was good, there could never be two instances
inside the lock, so echo x > protected should never fail.

With the current implementation of ocf_take_lock,
it takes "just a few" iterations here to reproduce the race.
(usually within a minute).

The races I see in ocf_take_lock:
"creation race":
	test -e $lock
		# someone else may create it here
	echo $$ > $lock
		# but we override it with ours anyways

"still empty race":
	test -e $lock	# maybe it already exists (open O_CREAT|O_TRUNC)
			# but does not yet contain target pid,
	pid=`cat $lock` # this one is empty,
	kill -0 $pid    # and this one fails
	and thus a "just being created" one is considered stale

There are other problems around "stale pid file detection",
but let's not go into that minefield right now.

> > Maybe we should change it to 
> > ```
> > while ! ( set -C; echo $pid > lockfile ); do
> >     if test -e lockfile ; then
> >         : error handling for existing lockfile, stale lockfile detection
> >     else
> >         : error handling for not being able to create lockfile
> >     fi
> > done
> > : only reached if lockfile was successfully created
> > ```
> > 
> > (or use flock or other tools designed for that purpose)
> 
> flock would probably be the easiest. mkdir would do too, but for
> upgrade issues.

and, being part of util-linux, flock should be available "everywhere".

but because writing "wrappers" around flock similar to the intended
semantics of ocf_take_lock and ocf_release_lock_on_exit is not easy
either, usually you'd be better of using flock directly in the RA.

so, still trying to do this with shell:

"set -C" (respectively set -o noclober):
	If set, disallow existing regular files to be overwritten
	by redirection of output.

normal '>' means: O_WRONLY|O_CREAT|O_TRUNC,
set -C '>' means: O_WRONLY|O_CREAT|O_EXCL

using "set -C ; echo $$ > $lock" instead of 
"test -e $lock || echo $$ > $lock"
gets rid of the "creation race".

To get rid of the "still empty race",
we'd need to play games with hardlinks:

(
set -C
success=false
if echo $$ > .$lock ; then
	ln .$lock $lock && success=true
	rm -f .$lock
fi
$success
)

That should be "good enough",
and much better than what we have now.

back to a possible "flock" wrapper,
maybe something like this:

ocf_synchronize() {
	local lockfile=$1
	shift
	(
		flock -x 8 || exit 1
		( "$@" ) 8>&-
	) 8> "$lockfile"
}
# and then
ocf_synchronize my_exclusive_shell_function with some args

As this runs in subshells,
it would not be able to change any variales visible
to the rest of the script, which may limit its usefulness.

or maybe, have a ocf_rarun_synchronized(),
that will do (directly from the flock man page):
   [ "${FLOCKER}" != "$0" ] && exec env FLOCKER="$0" flock -e "$0" "$0" "$@" || :
with nonblocking and timeout variants?

I'd very much like some volunteer to step forward
and implement the actual patch...

Cheers,

	Lars