[ClusterLabs] stonith-ng - performing action 'monitor' timed out with signal 15

Marco Marino marino.mrc at gmail.com
Wed Sep 4 04:46:24 EDT 2019


First of all, thank you for your support.
Andrei: sure, I can reach the machines through IPMI.
Here is a short log:

# From ld1, trying to contact ld1
[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XXXXXX sdr elist all
SEL              | 72h | ns  |  7.1 | No Reading
Intrusion        | 73h | ok  |  7.1 |
iDRAC8           | 00h | ok  |  7.1 | Dynamic MC @ 20h
...

# From ld1, trying to contact ld2
[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P XXXXXX sdr elist all
SEL              | 72h | ns  |  7.1 | No Reading
Intrusion        | 73h | ok  |  7.1 |
iDRAC7           | 00h | ok  |  7.1 | Dynamic MC @ 20h
.......


# From ld2, trying to contact ld1
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XXXXX sdr elist all
SEL              | 72h | ns  |  7.1 | No Reading
Intrusion        | 73h | ok  |  7.1 |
iDRAC8           | 00h | ok  |  7.1 | Dynamic MC @ 20h
System Board     | 00h | ns  |  7.1 | Logical FRU @00h
.....

# From ld2, trying to contact ld2
[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P XXXX sdr elist all
SEL              | 72h | ns  |  7.1 | No Reading
Intrusion        | 73h | ok  |  7.1 |
iDRAC7           | 00h | ok  |  7.1 | Dynamic MC @ 20h
System Board     | 00h | ns  |  7.1 | Logical FRU @00h
........
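
As a cross-check, it may also be worth invoking the fence agent itself
outside Pacemaker, since that exercises the same path the stonith
monitor uses. A sketch using fence_ipmilan's standard long options
(same credentials as above, masked here):

# Ask ld1's iDRAC for its power status through the fence agent
fence_ipmilan --ip=192.168.254.250 --username=root --password=XXXXXX --lanplus --action=status

# Same check against ld2's iDRAC
fence_ipmilan --ip=192.168.254.251 --username=root --password=XXXXXX --lanplus --action=status

If these occasionally hang while plain ipmitool succeeds, the agent's
own defaults (login/power timeouts) would be worth a look.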

Jan: actually, the cluster uses /etc/hosts to resolve names:
172.16.77.10    ld1.mydomain.it      ld1
172.16.77.11    ld2.mydomain.it      ld2

Furthermore, I'm using IP addresses for the IPMI interfaces in the configuration:
[root@ld1 ~]# pcs stonith show fence-node1
 Resource: fence-node1 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=XXXXX pcmk_host_check=static-list pcmk_host_list=ld1.mydomain.it
  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)
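
Note that the monitor operation above carries no explicit timeout, so
the cluster-wide default applies. A sketch of how I believe this can
be inspected and raised with pcs (syntax per my pcs 0.9.x install;
please check "pcs stonith update --help" on your version):

# Show the cluster-wide operation defaults (the fallback timeout)
pcs resource op defaults

# Give the fence device's monitor a generous explicit timeout
pcs stonith update fence-node1 op monitor interval=60s timeout=60s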


Any ideas?
How can I reset the state of the cluster without downtime? Is "pcs
resource cleanup" enough?
Thank you,
Marco


On Wed, 4 Sep 2019 at 10:29, Jan Pokorný <jpokorny at redhat.com>
wrote:

> On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:
> > On 03.09.2019 at 11:09, Marco Marino wrote:
> >> Hi, I have a problem with fencing on a two node cluster. It seems that
> >> randomly the cluster cannot complete monitor operation for fence
> devices.
> >> In log I see:
> >> crmd[8206]:   error: Result of monitor operation for fence-node2 on
> >> ld2.mydomain.it: Timed Out
> >
> > Can you actually access IP addresses of your IPMI ports?
>
> [
> Tangentially, an interesting aspect beyond that, applicable to any
> non-IP cross-host referencing and not mentioned anywhere so far, is
> the risk of DNS resolution (where /etc/hosts falls short) running
> into trouble (stale records, a blocked port, DNS server overload
> [DNSSEC, etc.], parallel IPv4/IPv6 records that the software cannot
> handle gracefully, etc.).  In any case, just a single DNS server
> would apparently be an undesired SPOF, and it would be unfortunate
> to be unable to fence a node because of that.
>
> I think the most robust approach is to use IP addresses whenever
> possible, and unambiguous records in /etc/hosts when practical.
> ]
>
> >> Attached are:
> >> - /var/log/messages for node1 (only the important part)
> >> - /var/log/messages for node2 (only the important part) <-- the
> >>   problem starts here
> >> - pcs status
> >> - pcs stonith show (for both fence devices)
> >>
> >> I think it could be a timeout problem, so how can I see the timeout
> >> value for the monitor operation on stonith devices?
> >> Please, can someone help me with this problem?
> >> Furthermore, how can I fix the state of fence devices without downtime?
>
> --
> Jan (Poki)
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/