<div dir="auto">Hi, any updates on this?<div dir="auto">Thank you</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 4 Sep 2019, 10:46 Marco Marino <<a href="mailto:marino.mrc@gmail.com">marino.mrc@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>First of all, thank you for your support.</div><div>Andrei: sure, I can reach the machines through IPMI.</div><div>Here is a short "log":</div><div><br></div><div>#From ld1 trying to contact ld1<br></div><div>[root@ld1 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XXXXXX sdr elist all<br>SEL              | 72h | ns  |  7.1 | No Reading<br>Intrusion        | 73h | ok  |  7.1 | <br>iDRAC8           | 00h | ok  |  7.1 | Dynamic MC @ 20h<br></div><div>...<br></div><div><br></div><div>#From ld1 trying to contact ld2</div><div>ipmitool -I lanplus -H 192.168.254.251 -U root -P XXXXXX sdr elist all<br>SEL              | 72h | ns  |  7.1 | No Reading<br>Intrusion        | 73h | ok  |  7.1 | <br>iDRAC7           | 00h | ok  |  7.1 | Dynamic MC @ 20h<br></div><div>.......<br></div><div><br></div><div><br></div><div>#From ld2 trying to contact ld1:</div><div>[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.250 -U root -P XXXXX sdr elist all<br>SEL              | 72h | ns  |  7.1 | No Reading<br>Intrusion        | 73h | ok  |  7.1 | <br>iDRAC8           | 00h | ok  |  7.1 | Dynamic MC @ 20h<br>System Board     | 00h | ns  |  7.1 | Logical FRU @00h<br></div><div>.....</div><div><br></div><div>#From ld2 trying to contact ld2<br></div><div>[root@ld2 ~]# ipmitool -I lanplus -H 192.168.254.251 -U root -P XXXX sdr elist all<br>SEL              | 72h | ns  |  7.1 | No Reading<br>Intrusion        | 73h | ok  |  7.1 | <br>iDRAC7           | 00h | ok  |  7.1 | Dynamic MC @ 20h<br>System Board     | 00h | ns  |  7.1 | Logical FRU @00h<br></div><div>........<br></div><div><br></div><div>Jan: Actually, the 
cluster uses /etc/hosts to resolve names:</div><div>172.16.77.10    <a href="http://ld1.mydomain.it" target="_blank" rel="noreferrer">ld1.mydomain.it</a>      ld1<br>172.16.77.11    <a href="http://ld2.mydomain.it" target="_blank" rel="noreferrer">ld2.mydomain.it</a>      ld2<br></div><div><br></div><div>Furthermore, I'm using IP addresses for the IPMI interfaces in the configuration:</div><div>[root@ld1 ~]# pcs stonith show fence-node1<br> Resource: fence-node1 (class=stonith type=fence_ipmilan)<br>  Attributes: ipaddr=192.168.254.250 lanplus=1 login=root passwd=XXXXX pcmk_host_check=static-list pcmk_host_list=<a href="http://ld1.mydomain.it" target="_blank" rel="noreferrer">ld1.mydomain.it</a><br>  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)<br></div><div><br></div><div><br></div><div>Any ideas?</div><div>How can I reset the state of the cluster without downtime? Is "pcs resource cleanup" enough?</div><div>Thank you,</div><div>Marco<br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 4 Sep 2019 at 10:29, Jan Pokorný <<a href="mailto:jpokorny@redhat.com" target="_blank" rel="noreferrer">jpokorny@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 03/09/19 20:15 +0300, Andrei Borzenkov wrote:<br>

> 03.09.2019 11:09, Marco Marino wrote:<br>

>> Hi, I have a problem with fencing on a two-node cluster. It seems that<br>

>> the cluster randomly cannot complete the monitor operation for fence devices.<br>

>> In the log I see:<br>

>> crmd[8206]:   error: Result of monitor operation for fence-node2 on<br>

>> <a href="http://ld2.mydomain.it" rel="noreferrer noreferrer" target="_blank">ld2.mydomain.it</a>: Timed Out<br>

> <br>

> Can you actually access IP addresses of your IPMI ports?<br>

<br>

[<br>

Tangentially, an interesting aspect beyond that, applicable to any<br>

cross-host reference that is not a plain IP address and one I haven't seen<br>

mentioned anywhere so far, is the risk of DNS resolution (when /etc/hosts<br>

falls short) running into trouble (stale records, blocked port, DNS<br>

server overload [DNSSEC, etc.], parallel IPv4/IPv6 records that the<br>

software cannot handle gracefully, etc.).  In any case, a single DNS<br>

server would clearly be an undesirable SPOF, and it would be unfortunate<br>

to be unable to fence a node because of that.<br>

<br>

I think the most robust approach is to use IP addresses whenever<br>

possible, and unambiguous records in /etc/hosts when practical.<br>

]<br>
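As a rough illustration of "unambiguous records in /etc/hosts", here is a minimal Python sketch that checks whether each cluster node name maps to exactly one address. The sample text is the two /etc/hosts lines quoted in this thread; the function names parse_hosts and check_names are hypothetical helpers made up for this example, not part of any cluster tool.<br>

```python
from collections import defaultdict

# Sample /etc/hosts content copied from the thread.
HOSTS_TEXT = """\
172.16.77.10    ld1.mydomain.it      ld1
172.16.77.11    ld2.mydomain.it      ld2
"""


def parse_hosts(text):
    """Map each host name or alias to the set of addresses listed for it."""
    names = defaultdict(set)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        address, *hostnames = line.split()
        for name in hostnames:
            names[name.lower()].add(address)
    return names


def check_names(text, required):
    """Return {name: problem} for names that are missing or ambiguous."""
    names = parse_hosts(text)
    problems = {}
    for name in required:
        addresses = names.get(name.lower(), set())
        if not addresses:
            problems[name] = "missing"
        elif len(addresses) > 1:
            problems[name] = "ambiguous: " + ", ".join(sorted(addresses))
    return problems


if __name__ == "__main__":
    # An empty dict means every required name has exactly one address.
    print(check_names(HOSTS_TEXT, ["ld1.mydomain.it", "ld2.mydomain.it"]))
```

On a real node one would pass the contents of /etc/hosts instead of the sample string; the check only covers the hosts file and says nothing about DNS.<br>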

<br>

>> Attached are:<br>

>> - /var/log/messages for node1 (only the important part)<br>

>> - /var/log/messages for node2 (only the important part) <-- Problem starts<br>

>> here<br>

>> - pcs status<br>

>> - pcs stonith show (for both fence devices)<br>

>> <br>

>> I think it could be a timeout problem, so how can I see the timeout value for<br>

>> the monitor operation on stonith devices?<br>

>> Could someone please help me with this problem?<br>

>> Furthermore, how can I fix the state of fence devices without downtime?<br>
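On the timeout question, a minimal Python sketch, assuming the "Operations:" line format shown in the "pcs stonith show fence-node1" output elsewhere in this thread. It pulls out the operation options so it is easy to see whether an explicit timeout is configured. parse_operation is a hypothetical helper written for this example. When no timeout appears on the operation, the cluster-wide operation default should apply (20s in Pacemaker unless it has been changed, as far as I know).<br>

```python
import re

# Line copied from the "pcs stonith show fence-node1" output in the thread.
OPERATIONS_LINE = "  Operations: monitor interval=60s (fence-node1-monitor-interval-60s)"


def parse_operation(line):
    """Return (operation name, options dict) from one 'Operations:' line."""
    body = line.split("Operations:", 1)[1]
    body = re.sub(r"\([^)]*\)", "", body)  # drop the trailing operation id
    name, *pairs = body.split()
    options = dict(pair.split("=", 1) for pair in pairs)
    return name, options


if __name__ == "__main__":
    name, options = parse_operation(OPERATIONS_LINE)
    # Prints the interval and either the explicit timeout or "not set".
    print(name, options.get("interval"), options.get("timeout", "not set"))
```

For the line above, the parsed options contain interval=60s and no timeout key, so this device's monitor runs with whatever default the cluster applies.<br>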

<br>

-- <br>

Jan (Poki)<br>

</blockquote></div>

</blockquote></div>