[ClusterLabs] pacemaker-fenced /dev/shm errors

Tue Mar 28 10:01:16 EDT 2023

On Tue, 2023-03-28 at 13:11 +0800, d tbsky wrote:
> Ken Gaillot <kgaillot at redhat.com>
> > I'm glad it's resolved, but for future reference, that does indicate
> > a serious problem. It means the fencer is not accepting any requests,
> > so any fencing attempts or even attempts to monitor a fencing device
> > from that node will fail.
> > 
> 
>    That sounds like pacemaker-fenced became some kind of zombie.
> For testing, I blocked the connection between the node and the
> IPMI fencing device. The fencing resource stopped and reported an
> error like the one below:
> 
> Failed Resource Actions:
>   * fence_ipmi start on c1.example.tw could not be executed (Timed
> Out) because 'Fence agent did not complete in time' at Tue Mar 28
> 12:49:58 2023 after 20.004s
> 
> and it recovered when the connection recovered.
> Does that mean fencing is still working?
> I want to make sure: if I see a message like "pacemaker-fenced[2405] is
> unresponsive to ipc after 1 tries", does it mean a permanent failure,
> or did the second try succeed, so that it no longer complains?
> 

If successful client connections are shown later in the log, the fencer
has recovered and that message should not be a problem. Of course, if
fencing failed or timed out, the cluster will want to keep retrying it
before recovering resources.
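
As a rough illustration of that log check, here is a minimal Python
sketch. It assumes the warning has the form quoted above
("pacemaker-fenced[2405] is unresponsive to ipc after 1 tries"). The
function name and the success_pattern argument are hypothetical: the
exact text of a successful client connection message is not given in
this thread, so you would have to supply a pattern that matches your
own logs.

```python
import re

# Warning format taken from the message quoted in this thread.
UNRESPONSIVE = re.compile(
    r"(?P<daemon>[\w-]+)\[(?P<pid>\d+)\] is unresponsive to ipc "
    r"after (?P<tries>\d+) tries"
)


def recovered_after_ipc_warning(lines, success_pattern):
    """Check whether a success line follows the last IPC warning.

    Returns None if no warning was seen, True if a line matching
    success_pattern (an assumption, supplied by the caller) appears
    after the last warning, and False otherwise.
    """
    success = re.compile(success_pattern)
    seen_warning = False
    recovered = False
    for line in lines:
        if UNRESPONSIVE.search(line):
            # A newer warning resets any earlier recovery.
            seen_warning = True
            recovered = False
        elif seen_warning and success.search(line):
            recovered = True
    return recovered if seen_warning else None
```

You could call it with the lines of the Pacemaker log file and a
regular expression for whatever your version logs when a client
connects to the fencer. A False result only means no such line was
found after the last warning; it does not by itself prove that the
fencer is still stuck.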
-- 
Ken Gaillot <kgaillot at redhat.com>