[ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

Mon Sep 16 10:59:25 EDT 2019

On Tue, 2019-09-03 at 10:09 +0200, Marco Marino wrote:
> Hi, I have a problem with fencing on a two-node cluster. The cluster
> seems to randomly fail to complete the monitor operation for the fence
> devices. In the log I see:
> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> ld2.mydomain.it: Timed Out
> Attached are:
> - /var/log/messages for node1 (only the important part)
> - /var/log/messages for node2 (only the important part) <-- Problem
> starts here
> - pcs status
> - pcs stonith show (for both fence devices)
> 
> I think it could be a timeout problem, so how can I see the timeout
> value for the monitor operation on stonith devices?
> Can someone please help me with this problem?
> Also, how can I fix the state of the fence devices without downtime?
> 
> Thank you

How to investigate depends on whether this is an occasional monitor
failure or one that happens every time a start of the device is
attempted. From the status you attached, I'm guessing it's at start.

In that case, my next step (since you've already verified ipmitool
works directly) would be to run the fence agent manually using the same
arguments used in the cluster configuration.

Check the man page for the fence agent, looking at the section for
"Stdin Parameters". These are what's used in the cluster configuration,
so make a note of what values you've configured. Then run the fence
agent like this:

echo -e "action=status\nPARAMETER=VALUE\nPARAMETER=VALUE\n..." | /path/to/agent

where the PARAMETER=VALUE entries are what you have configured in the
cluster. If the problem isn't obvious from the agent's output, you can
try adding a debug_file parameter to the same input.
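
A rough sketch of the same idea as a small script, in case it helps. The
helper name build_fence_input, the AGENT variable, and the
PARAMETER1/PARAMETER2 pairs are placeholders made up for illustration,
not taken from your configuration; substitute the stdin parameters
listed in your agent's man page. AGENT defaults to cat here, so by
default the script only prints the input that would be sent, which lets
you check it before pointing it at the real agent.

```shell
#!/bin/sh
# Print a fence agent's stdin input: one "action=..." line followed by
# one KEY=VALUE line per remaining argument.
# (build_fence_input is a hypothetical helper, not part of any fence agent.)
build_fence_input() {
    action=$1
    shift
    printf 'action=%s\n' "$action"
    for kv in "$@"; do
        printf '%s\n' "$kv"
    done
}

# Dry run by default: cat just echoes the input back.
# Set AGENT to the real agent path to actually run it.
AGENT=${AGENT:-cat}

# PARAMETER1/PARAMETER2 are placeholders for the values configured in the cluster.
build_fence_input status "PARAMETER1=VALUE1" "PARAMETER2=VALUE2" | "$AGENT"
```

Once the printed input looks right, run it again with AGENT set to the
agent's path. A debug_file entry can be passed as one more KEY=VALUE
argument alongside the others.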
-- 
Ken Gaillot <kgaillot at redhat.com>