[ClusterLabs] checking all procs on system enough during stop action?

Mon Apr 24 10:34:07 EDT 2017

Hi all,

In the PostgreSQL Automatic Failover (PAF) project, one of the most frequent
complaints we get is how difficult it is to experiment with it, because
fencing occurs far too often. I am currently hunting down this kind of
needless fencing to make life easier.

It occurs to me that a frequent cause of fencing is that, during the stop
action, we check the status of the PostgreSQL instance using our monitor
function before trying to stop the resource. If the function does not return
OCF_NOT_RUNNING, OCF_SUCCESS or OCF_RUNNING_MASTER, we just raise an error,
which leads to fencing. See:
https://github.com/dalibo/PAF/blob/d50d0d783cfdf5566c3b7c8bd7ef70b11e4d1043/script/pgsqlms#L1291-L1301

I am considering adding a check to determine whether the instance is stopped
even if the monitor action returns an error. The idea would be to scan **all**
the local processes looking for at least one pair of "/proc/<PID>/{exe,cwd}"
related to the PostgreSQL instance we want to stop. If none is found, we
consider that the instance is not running. Whether it stopped gracefully or
not, we know it is down and we can return OCF_SUCCESS.

Just for completeness, the piece of code would be:

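   use File::Basename;    # provides basename()

   # Collect the PIDs of "postgres" processes whose cwd is $pgdata.
   my @pids;
   foreach my $f (glob "/proc/[0-9]*") {
       # readlink() returns undef for kernel threads and for processes
       # that exited meanwhile; basename() must not be called on undef.
       my $exe = readlink "$f/exe";
       my $cwd = readlink "$f/cwd";
       push @pids => basename($f)
           if -r $f and defined $exe and defined $cwd
               and basename($exe) eq "postgres"
               and $cwd eq $pgdata;
   }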
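
As a rough, self-contained sketch of how this check could be wrapped and
tested outside a cluster: the names find_instance_pids() and
instance_is_down(), and the $proc_root parameter, are hypothetical and not
part of PAF. The sketch assumes $pgdata is given exactly as it appears in the
"cwd" link (canonical path, no trailing slash), and it additionally strips the
" (deleted)" suffix that Linux appends to "exe" when the binary was replaced
on disk, which the snippet above does not handle.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: return the PIDs of processes whose executable is
# named "postgres" and whose working directory is exactly $pgdata.
# $proc_root defaults to "/proc"; it is a parameter only so the function
# can be exercised against a fake directory tree.
sub find_instance_pids {
    my ( $pgdata, $proc_root ) = @_;
    $proc_root = '/proc' unless defined $proc_root;

    # Fail loudly if the process table cannot be read: silently returning
    # an empty list would wrongly report the instance as stopped.
    opendir( my $dh, $proc_root ) or die "cannot open $proc_root: $!";
    my @candidates = grep { /^[0-9]+$/ } readdir $dh;
    closedir $dh;

    my @pids;
    foreach my $pid (@candidates) {
        my $exe = readlink "$proc_root/$pid/exe";
        my $cwd = readlink "$proc_root/$pid/cwd";

        # Kernel threads and processes that exited meanwhile have no
        # readable links: skip them.
        next unless defined $exe and defined $cwd;

        # A binary replaced on disk shows up as "<path> (deleted)".
        $exe =~ s/ \(deleted\)\z//;

        push @pids, $pid
            if basename($exe) eq 'postgres' and $cwd eq $pgdata;
    }
    return @pids;
}

# Hypothetical wrapper for the stop action: true when no matching
# process is left, whatever the monitor function reported.
sub instance_is_down {
    my @pids = find_instance_pids(@_);
    return @pids == 0;
}
```

In the stop action, the idea would be to call instance_is_down($pgdata) only
when the monitor function returned an unexpected code, and to return
OCF_SUCCESS if it is true.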

It feels safe enough to me. The only risk I can think of is a shared-disk
cluster with multiple nodes accessing the same data in RW (such a setup can
fail in so many ways :)). However, PAF is not supposed to work in such a
context, so I can live with this.

Do you have any advice? Do you see any drawbacks or hazards?

Thanks in advance!
-- 
Jehan-Guillaume de Rorthais
Dalibo