[ClusterLabs] strange behaviour from pacemaker_remote

Sun Oct 1 12:32:01 EDT 2017

On Thu, 2017-09-28 at 01:39 +0200, Adam Spiers wrote:
> Hi all,
> 
> When I do a
> 
>     pkill -9 -f pacemaker_remote
> 
> to simulate failure of a remote node, sometimes I see things like:
> 
>     08:29:32 d52-54-00-da-4e-05 pacemaker_remoted[5806]: error: No ipc providers available for uid 0 gid 0
>     08:29:32 d52-54-00-da-4e-05 pacemaker_remoted[5806]: error: Error in connection setup (5806-5805-15): Remote I/O error (121)
> 
> ... and the node doesn't get fenced as expected.  Other times it does.
> Is this my fault for using an invalid way of simulating failure, or
> some kind of bug?
> 
> Sadly I don't have the exact version of pacemaker_remoted to hand, but
> I can provide it tomorrow if necessary.  It's not the latest release,
> maybe not even the one immediately preceding it.
> 
> Thanks!
> Adam

Before fencing, the cluster will try re-establishing the connection. If
you've got pacemaker_remote enabled via systemd, systemd may be
respawning it quickly enough that the cluster's reconnect succeeds.
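
One way to check for that, as a rough sketch: systemd's Restart= setting
decides whether a daemon killed with SIGKILL gets respawned. The helper
name respawns_after_sigkill below is made up for illustration and is not
part of Pacemaker; the list of policies follows my reading of
systemd.service(5).

```shell
#!/bin/sh
# Sketch: would a given systemd Restart= policy respawn a service that was
# killed with SIGKILL (which is what `pkill -9` sends)?
# respawns_after_sigkill is a hypothetical helper, not a Pacemaker tool.
respawns_after_sigkill() {
    case "$1" in
        # These policies restart the service after an unclean signal.
        always|on-failure|on-abnormal|on-abort) return 0 ;;
        # no, on-success, on-watchdog: the daemon stays down after SIGKILL.
        *) return 1 ;;
    esac
}
```

On the remote node you could feed it the live setting, assuming the unit
is named pacemaker_remote.service on your distribution:

    policy=$(systemctl show -p Restart pacemaker_remote.service | cut -d= -f2)
    respawns_after_sigkill "$policy" && echo "systemd will respawn it"

If it does respawn, kill -9 alone may not be a reliable way to simulate a
failed remote node.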

Also, until a recent master branch commit, remote nodes would not get
fenced if they were not running any resources.

And of course, a fencing resource has to be configured for the remote
node.

If none of those things is the reason, there may be a bug -- a policy
engine (PE) input file from the DC for that transition would be helpful.