[ClusterLabs] [ClusterLabs Developers] checking all procs on system enough during stop action?

Lars Ellenberg lars.ellenberg at linbit.com
Mon Apr 24 11:08:15 EDT 2017


On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
> 
> In the PostgreSQL Automatic Failover (PAF) project, one of the most frequent
> pieces of negative feedback we get is how difficult it is to experiment with
> it because fencing occurs way too frequently. I am currently hunting down
> this kind of useless fencing to make life easier.
> 
> It occurs to me that a frequent reason for fencing is that, during the stop
> action, we check the status of the PostgreSQL instance using our monitor
> function before trying to stop the resource. If the function does not return
> OCF_NOT_RUNNING, OCF_SUCCESS or OCF_RUNNING_MASTER, we just raise an error,
> leading to fencing. See:
> https://github.com/dalibo/PAF/blob/d50d0d783cfdf5566c3b7c8bd7ef70b11e4d1043/script/pgsqlms#L1291-L1301
> 
> I am considering adding a check to determine whether the instance is stopped
> even if the monitor action returns an error. The idea would be to scan **all**
> the local processes looking for at least one pair of
> "/proc/<PID>/{comm,cwd}" related to the PostgreSQL instance we want to stop.
> If none is found, we consider the instance not running. Gracefully or not, we
> just know it is down and we can return OCF_SUCCESS.
> 
> Just for completeness, the piece of code would be:
> 
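>    use File::Basename;   # provides basename()
> 
>    my @pids;
>    foreach my $f (glob "/proc/[0-9]*") {
>        # readlink() returns undef for kernel threads or unreadable entries
>        my $exe = readlink "$f/exe" or next;
>        my $cwd = readlink "$f/cwd" or next;
>        push @pids => basename($f)
>            if basename($exe) eq "postgres" and $cwd eq $pgdata;
>    }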
> 
> It feels safe enough to me. The only risk I can think of is in a shared-disk
> cluster with multiple nodes accessing the same data in RW (such a setup can
> fail in so many ways :)). However, PAF is not supposed to work in such a
> context, so I can live with this.
> 
> Do you guys have any advice? Do you see any drawbacks? Hazards?

Isn't that the wrong place to "fix" it?
Why did your _monitor return something "weird"?
What did it return?
Should you not fix it there?
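
To make the question concrete, here is a minimal Perl sketch of the stop-time
decision under discussion. It assumes the standard OCF return code values
(OCF_SUCCESS=0, OCF_ERR_GENERIC=1, OCF_NOT_RUNNING=7, OCF_RUNNING_MASTER=8).
The helper name stop_precheck and its decision strings are hypothetical and
are not taken from pgsqlms. The point is only that the unexpected monitor code
is kept in the message, so it can be logged and investigated instead of being
silently masked by the process scan.

```perl
use strict;
use warnings;

# Standard OCF return codes used in this thread.
use constant {
    OCF_SUCCESS        => 0,
    OCF_ERR_GENERIC    => 1,
    OCF_NOT_RUNNING    => 7,
    OCF_RUNNING_MASTER => 8,
};

# Hypothetical helper: decide what the stop action should do.
#   $monitor_rc : return code of the monitor function
#   $nb_procs   : number of "postgres" processes found whose cwd is $pgdata
# Returns a list: (decision, message).
sub stop_precheck {
    my ( $monitor_rc, $nb_procs ) = @_;

    return ( 'already_stopped', 'monitor reports not running' )
        if $monitor_rc == OCF_NOT_RUNNING;

    return ( 'stop', 'monitor reports a running instance' )
        if $monitor_rc == OCF_SUCCESS or $monitor_rc == OCF_RUNNING_MASTER;

    # Unexpected code: keep it in the message so it can be logged.
    my $msg = "unexpected monitor rc=$monitor_rc";

    return ( 'already_stopped', "$msg, but no matching process found" )
        if $nb_procs == 0;

    return ( 'error', "$msg, $nb_procs matching process(es) still present" );
}
```

With this shape, stop_precheck( OCF_ERR_GENERIC, 0 ) yields the
'already_stopped' decision together with a message naming the code the monitor
returned, which is the part I would want to see in the logs.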

Just thinking out loud.

Cheers,
	Lars




