[ClusterLabs Developers] checking all procs on system enough during stop action?
Lars Ellenberg
lars.ellenberg at linbit.com
Mon Apr 24 15:08:15 UTC 2017
On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
>
> In the PostgreSQL Automatic Failover (PAF) project, one of the most frequent
> pieces of negative feedback we get is how difficult it is to experiment with
> it because fencing occurs way too frequently. I am currently hunting down this
> kind of useless fencing to make life easier.
>
> It occurs to me that a frequent cause of fencing is that, during the stop
> action, we check the status of the PostgreSQL instance using our monitor
> function before trying to stop the resource. If the function does not return
> OCF_NOT_RUNNING, OCF_SUCCESS or OCF_RUNNING_MASTER, we just raise an error,
> leading to fencing. See:
> https://github.com/dalibo/PAF/blob/d50d0d783cfdf5566c3b7c8bd7ef70b11e4d1043/script/pgsqlms#L1291-L1301
>
> I am considering adding a check to determine whether the instance is stopped
> even if the monitor action returns an error. The idea would be to scan **all**
> the local processes looking for at least one pair of "/proc/<PID>/{exe,cwd}"
> related to the PostgreSQL instance we want to stop. If none is found, we
> consider the instance not running. Gracefully or not, we just know it is down
> and we can return OCF_SUCCESS.
>
> Just for completeness, the piece of code would be:
>
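>   use File::Basename;
>
>   # Collect PIDs of "postgres" processes whose cwd is our $pgdata.
>   my @pids;
>   foreach my $f (glob "/proc/[0-9]*") {
>       # readlink returns undef if the process is gone or not accessible
>       my $exe = readlink "$f/exe";
>       my $cwd = readlink "$f/cwd";
>       push @pids => basename($f)
>           if -r $f
>           and defined $exe and defined $cwd
>           and basename($exe) eq "postgres"
>           and $cwd eq $pgdata;
>   }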
>
> It feels safe enough to me. The only risk I can think of is in a shared disk
> cluster with multiple nodes accessing the same data in RW (such a setup can
> fail in so many ways :)). However, PAF is not supposed to work in such a
> context, so I can live with this.
>
> Do you guys have any advice? Do you see any drawbacks? Hazards?
Isn't that the wrong place to "fix" it?
Why did your _monitor return something "weird"?
What did it return?
Should you not fix it there?
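One way to answer "what did it return?" would be to have the stop action
report the unexpected monitor code before raising the error. Below is a
minimal sketch of that, not PAF's actual code: classify_monitor_rc() and its
verdict strings are hypothetical names, and the mapping of each code to a
verdict is an assumption based on the description quoted above. Only the
numeric values are the standard OCF return codes.

```perl
use strict;
use warnings;

# Standard OCF return codes (only the ones needed here).
my %OCF = (
    OCF_SUCCESS        => 0,
    OCF_ERR_GENERIC    => 1,
    OCF_NOT_RUNNING    => 7,
    OCF_RUNNING_MASTER => 8,
);
my %OCF_NAME = reverse %OCF;    # numeric code => symbolic name

# Hypothetical helper: decide what the stop action should do with the
# monitor result. Returns a verdict ('done', 'proceed' or 'error') and
# a message suitable for the logs.
sub classify_monitor_rc {
    my ($rc) = @_;

    # Assumption: an instance that is already down needs no further work.
    return ( 'done', 'instance already stopped' )
        if $rc == $OCF{OCF_NOT_RUNNING};

    # Assumption: a running instance (standby or master) can be stopped.
    return ( 'proceed', 'instance running, stopping it' )
        if $rc == $OCF{OCF_SUCCESS} or $rc == $OCF{OCF_RUNNING_MASTER};

    # Anything else is the "weird" case: name it in the logs so it can
    # be traced back to the monitor function.
    my $name = $OCF_NAME{$rc} // 'unknown';
    return ( 'error', "unexpected monitor rc=$rc ($name) during stop" );
}

# Usage: report what the monitor said before failing the stop action.
my ( $verdict, $msg ) = classify_monitor_rc(1);
print STDERR "$verdict: $msg\n";
```

With the code and its name in the logs, it should be easier to tell whether
the problem belongs in the monitor function rather than in the stop action.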
Just thinking out loud.
Cheers,
Lars