<div dir="ltr"><div>Hey first of all thank you for you answer<br></div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">Does 'fence_ipmilan ...' work when called manually from the command line? <br></blockquote><div class="gmail_extra"><br>Yes it does (off, on, status...)<br><br><span style="font-family:monospace,monospace">[node1:~]# fence_ipmilan -v -p </span><span style="font-family:monospace,monospace"><span>********</span> -l fencer -L OPERATOR -P -a 1.1.1.1 -o status<br>Getting status of IPMI:1.1.1.1...Spawning: '/usr/bin/ipmitool -I lanplus -H '1.1.1.1' -U 'fencer' -L 'OPERATOR' -P '[set]' -v chassis power status'...<br>Chassis power = On<br>Done <br></span></div><div class="gmail_extra"><br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">Looks like the same node is defined twice, instead of 'node3'.<br></blockquote>
<span>Sorry about that, I mistyped the hostname right after I pasted the configuration.<br></span></div><div class="gmail_extra"><span><br>The configuration looks like this:<br></span></div><div class="gmail_extra"><span style="font-family:monospace,monospace"><br><?xml version="1.0"?><br><cluster config_version="10" name="mycluster"><br> <fence_daemon/><br> <clusternodes><br> <clusternode name="node1" nodeid="1"><br> <fence><br> <method name="pcmk-redirect"><br> <device action="off" name="pcmk" port="node1"/><br> </method><br> </fence><br> </clusternode><br> <clusternode name="node2" nodeid="2"><br> <fence><br> <method name="pcmk-redirect"><br> <device action="off" name="pcmk" port="node2"/><br> </method><br> </fence><br> </clusternode><br> <clusternode name="node3" nodeid="3"><br> <fence><br> <method name="pcmk-redirect"><br> <device action="off" name="pcmk" port="node3"/><br> </method><br> </fence><br> </clusternode><br> </clusternodes><br> <cman/><br> <fencedevices><br> <fencedevice agent="fence_pcmk" name="pcmk"/><br> </fencedevices><br> <rm><br> <failoverdomains/><br> <resources/><br> </rm><br> <logging debug="on"/><br> <quorumd interval="1" label="QuorumDisk" status_file="/qdisk_status" tko="70"/><br> <totem token="108000"/><br></cluster></span><br><br><br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">Run 'fence_check' (this tests cman's fencing which is hooked into<br>
pacemaker's stonith).<br></blockquote>
<br><span style="font-family:monospace,monospace">fence_check run at Mon Jun 22 09:35:38 CEST 2015 pid: 16091<br>Checking if cman is running: running<br>Checking if node is quorate: quorate<br>Checking if node is in fence domain: yes<br>Checking if node is fence master: this node is fence master<br>Checking if real fencing is in progress: no fencing in progress<br>Get node list: node1 node2 node3<br><br>Testing node1 fencing<br>Checking if cman is running: running<br>Checking if node is quorate: quorate<br>Checking if node is in fence domain: yes<br>Checking if node is fence master: this node is fence master<br>Checking if real fencing is in progress: no fencing in progress<br>Checking how many fencing methods are configured for node node1<br>Found 1 method(s) to test for node node1<br>Testing node1 method 1 status<br>Testing node1 method 1: success<br><br>Testing node2 fencing<br>Checking if cman is running: running<br>Checking if node is quorate: quorate<br>Checking if node is in fence domain: yes<br>Checking if node is fence master: this node is fence master<br>Checking if real fencing is in progress: no fencing in progress<br>Checking how many fencing methods are configured for node node2<br>Found 1 method(s) to test for node node2<br>Testing node2 method 1 status<br>Testing node2 method 1: success<br><br>Testing node3 fencing<br>Checking if cman is running: running<br>Checking if node is quorate: quorate<br>Checking if node is in fence domain: yes<br>Checking if node is fence master: this node is fence master<br>Checking if real fencing is in progress: no fencing in progress<br>Checking how many fencing methods are configured for node node3<br>Found 1 method(s) to test for node node3<br>Testing node3 method 1 status<br>Testing node3 method 1: success<br>cleanup: 0<br></span><br><br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">Also, I'm not sure how well qdisk is tested/supported. Do you even need<br>
it with three nodes?<br></blockquote><div>Qdisk is well tested in our production environment, where we're using rgmanager, so I just mirrored that configuration.<br></div><div>Hmm, yes, in a three-node cluster we probably don't need it.<br></div>
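<div class="gmail_extra">If we drop it, I'd expect the relevant part of cluster.conf to simply shrink to something like this (just a rough sketch on my side: quorumd removed, config_version bumped, relying on the default of one vote per node, so quorum would be 2 out of 3 and no two_node/expected_votes tuning should be needed):<br><br><span style="font-family:monospace,monospace"><cluster config_version="11" name="mycluster"><br> ...<br> <cman/><br> <fencedevices><br> <fencedevice agent="fence_pcmk" name="pcmk"/><br> </fencedevices><br> <!-- quorumd element dropped --><br> ...<br></cluster><br></span></div>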
<br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">
<totem token="108000"/><br>
That is a VERY high number!<br></blockquote>
<span>You're probably right; I've changed this value back to the default (10 sec):<br></span></div><div class="gmail_extra"><span style="font-family:monospace,monospace"><totem token="10000"/><br><br>[node1:~]# fence_node -vv node3<br>fence node3 dev 0.0 agent fence_pcmk result: error from agent<br>agent args: action=off port=node3 nodename=node3 agent=fence_pcmk<br>fence node3 failed<br><br></span></div><div class="gmail_extra">Messages log captured on node2, which is running the fencing resource for node3:<br><br><span style="font-family:monospace,monospace">[node2:~]# tail -0f /var/log/messages<br>...<br>Jun 22 10:04:28 node2 stonith-ng[7382]: notice: can_fence_host_with_device: node1-ipmi can not fence (off) node3: static-list<br>Jun 22 10:04:28 node2 stonith-ng[7382]: notice: can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list<br>Jun 22 10:04:28 node2 stonith-ng[7382]: notice: can_fence_host_with_device: node1-ipmi can not fence (off) node3: static-list<br>Jun 22 10:04:28 node2 stonith-ng[7382]: notice: can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list<br>Jun 22 10:04:38 node2 stonith-ng[7382]: notice: log_operation: Operation 'off' [5288] (call 2 from stonith_admin.cman.7377) for host 'node3' with device 'node3-ipmi' returned: 0 (OK)<br>Jun 22 10:05:44 node2 qdiskd[5948]: Node 3 evicted<br><br><br></span></div><div class="gmail_extra"><span style="font-family:arial,helvetica,sans-serif">This is where the delay happens (~3.5 min):<br></span></div><div class="gmail_extra"><span style="font-family:monospace,monospace"><br><br>Jun 22 10:08:06 node2 corosync[5861]: [QUORUM] Members[2]: 1 2<br>Jun 22 10:08:06 node2 corosync[5861]: [TOTEM ] A processor joined or left the membership and a new membership was formed.<br>Jun 22 10:08:06 node2 crmd[7386]: notice: crm_update_peer_state: cman_event_callback: Node node3[3] - state is now lost (was member)<br>Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for shutdown action on node3<br>Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback: Stonith/shutdown of node3 not matched<br>Jun 22 10:08:06 node2 crmd[7386]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]<br>Jun 22 10:08:06 node2 rsyslogd-2177: imuxsock begins to drop messages from pid 5861 due to rate-limiting<br>Jun 22 10:08:06 node2 kernel: dlm: closing connection to node 3<br>Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_local_callback: Sending full refresh (origin=crmd)<br>Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)<br>Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for shutdown action on node3<br>Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback: Stonith/shutdown of node3 not matched<br>Jun 22 10:08:06 node2 stonith-ng[7382]: notice: remote_op_done: Operation off of node3 by node2 for stonith_admin.cman.7377@node1.753ce4e5: OK<br>Jun 22 10:08:06 node2 fenced[6211]: fencing deferred to node1<br>Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)<br>Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify: Peer node3 was terminated (off) by node2 for node1: OK (ref=753ce4e5-a84a-491b-8ed9-044667946381) by client stonith_admin.cman.7377<br>Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify: Notified CMAN that 'node3' is now fenced<br>Jun 22 10:08:07 node2 rsyslogd-2177: imuxsock lost 108 messages from pid 5861 
due to rate-limiting<br>Jun 22 10:08:07 node2 pengine[7385]: notice: unpack_config: On loss of CCM Quorum: Ignore<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start testvm102#011(node1)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate testvm103#011(Started node1 -> node2)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start testvm105#011(node1)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start testvm108#011(node1)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate testvm109#011(Started node1 -> node2)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start testvm111#011(node1)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start testvm114#011(node1)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate testvm115#011(Started node1 -> node2)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start testvm117#011(node1)<br>Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start node1-ipmi#011(node2)<br>...</span><br></div><div class="gmail_extra"><br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">I noticed you're not a mailing list member. Please register if you want<br>
your emails to come through without getting stuck in the moderator queue.<br></blockquote>Thanks, man, I will.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">The problem still persists :(<br><br><br><div class="gmail_quote">On Mon, Jun 22, 2015 at 8:02 AM, Digimer <span dir="ltr"><<a href="mailto:lists@alteeve.ca" target="_blank">lists@alteeve.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>On 21/06/15 02:12 PM, Milos Buncic wrote:<br>
> Hey people<br>
><br>
> I'm experiencing very strange issue and it's appearing every time I try<br>
> to fence a node.<br>
> I have a test environment with three node cluster (CentOS 6.6 x86_64)<br>
> where rgmanager is replaced with pacemaker (CMAN + pacemaker).<br>
><br>
> I've configured fencing with pcs for all three nodes<br>
><br>
> Pacemaker:<br>
> pcs stonith create node1-ipmi \<br>
> fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer<br>
> passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \<br>
> op monitor interval=10s timeout=30s<br>
<br>
</span>Does 'fence_ipmilan ...' work when called manually from the command line? <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> </div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div>
> pcs constraint location node1-ipmi avoids node1<br>
><br>
> pcs property set stonith-enabled=true<br>
><br>
><br>
> CMAN - /etc/cluster/cluster.conf:<br>
> <?xml version="1.0"?><br>
> <cluster config_version="10" name="mycluster"><br>
> <fence_daemon/><br>
> <clusternodes><br>
> <clusternode name="node1" nodeid="1"><br>
> <fence><br>
> <method name="pcmk-redirect"><br>
> <device action="off" name="pcmk"<br>
> port="node1"/><br>
> </method><br>
> </fence><br>
> </clusternode><br>
> <clusternode name="node2" nodeid="2"><br>
> <fence><br>
> <method name="pcmk-redirect"><br>
> <device action="off" name="pcmk"<br>
> port="node2"/><br>
> </method><br>
> </fence><br>
> </clusternode><br>
> <clusternode name="node2" nodeid="3"><br>
> <fence><br>
> <method name="pcmk-redirect"><br>
> <device action="off" name="pcmk"<br>
> port="node2"/><br>
> </method><br>
> </fence><br>
> </clusternode><br>
<br>
</div></div>Looks like the same node is defined twice, instead of 'node3'.<br>
<span><br>
> </clusternodes><br>
> <cman/><br>
> <fencedevices><br>
> <fencedevice agent="fence_pcmk" name="pcmk"/><br>
> </fencedevices><br>
> <rm><br>
> <failoverdomains/><br>
> <resources/><br>
> </rm><br>
> <logging debug="on"/><br>
> <quorumd interval="1" label="QuorumDisk"<br>
> status_file="/qdisk_status" tko="70"/><br>
<br>
</span>Also, I'm not sure how well qdisk is tested/supported. Do you even need<br>
it with three nodes?<br>
<br>
> <totem token="108000"/><br>
<br>
That is a VERY high number!<br>
<span><br>
> </cluster><br>
><br>
> Every time I try to fence a node I'm getting a timeout error with node<br>
> being fenced at the end (on second attempt) but I'm wondering why it<br>
> took so long to fence a node?<br>
<br>
</span>Run 'fence_check' (this tests cman's fencing which is hooked into<br>
pacemaker's stonith).<br>
<div><div><br>
> So when I run stonith_admin or fence_node (which at the end also runs<br>
> stonith_admin, you can see that clearly from the log file) it's always<br>
> failing on the first attempt, my guess probably because it doesn't get<br>
> status code or something like that:<br>
> strace stonith_admin --fence node1 --tolerance 5s --tag cman<br>
><br>
> Partial output from strace:<br>
> ...<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)<br>
> poll([{fd=4, events=POLLIN}], 1, 291) = 0 (Timeout)<br>
> fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0<br>
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,<br>
> 0) = 0x7fb2a8c37000<br>
> write(1, "Command failed: Timer expired\n", 30Command failed: Timer<br>
> expired<br>
> ) = 30<br>
> poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)<br>
> shutdown(4, 2 /* send and receive */) = 0<br>
> close(4) = 0<br>
> munmap(0x7fb2a8b98000, 270336) = 0<br>
> munmap(0x7fb2a8bda000, 8248) = 0<br>
> munmap(0x7fb2a8b56000, 270336) = 0<br>
> munmap(0x7fb2a8c3b000, 8248) = 0<br>
> munmap(0x7fb2a8b14000, 270336) = 0<br>
> munmap(0x7fb2a8c38000, 8248) = 0<br>
> munmap(0x7fb2a8bdd000, 135168) = 0<br>
> munmap(0x7fb2a8bfe000, 135168) = 0<br>
> exit_group(-62) = ?<br>
><br>
><br>
> Or via cman:<br>
> [node1:~]# fence_node -vv node3<br>
> fence node3 dev 0.0 agent fence_pcmk result: error from agent<br>
> agent args: action=off port=node3 timeout=15 nodename=node3 agent=fence_pcmk<br>
> fence node3 failed<br>
><br>
><br>
> /var/log/messages:<br>
> Jun 19 10:57:43 node1 stonith_admin[3804]: notice: crm_log_args:<br>
> Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman<br>
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice: handle_request:<br>
> Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1'<br>
> with device '(any)'<br>
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice:<br>
> initiate_remote_stonith_op: Initiating remote operation off for node1:<br>
> fbc7fe61-9451-4634-9c12-57d933ccd0a4 ( 0)<br>
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice:<br>
> can_fence_host_with_device: node2-ipmi can not fence (off) node1:<br>
> static-list<br>
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice:<br>
> can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list<br>
> Jun 19 10:57:54 node1 stonith-ng[8283]: warning: get_xpath_object: No<br>
> match for //@st_delegate in /st-reply<br>
> Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted<br>
> Jun 19 10:59:31 node1 corosync[7349]: [TOTEM ] A processor failed,<br>
> forming new configuration.<br>
> Jun 19 11:01:21 node1 corosync[7349]: [QUORUM] Members[2]: 1 2<br>
> Jun 19 11:01:21 node1 corosync[7349]: [TOTEM ] A processor joined or<br>
> left the membership and a new membership was formed.<br>
> Jun 19 11:01:21 node1 crmd[8287]: notice: crm_update_peer_state:<br>
> cman_event_callback: Node node3[3] - state is now lost (was member)<br>
> Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3<br>
> Jun 19 11:01:21 node1 stonith-ng[8283]: notice: remote_op_done:<br>
> Operation off of node3 by node2 for stonith_admin.cman.3804@node1.<br>
> com.fbc7fe61: OK<br>
> Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:<br>
> Peer node3 was terminated (off) by node2 for node1: OK (<br>
> ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client stonith_admin.cman.3804<br>
> Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:<br>
> Notified CMAN that 'node3' is now fenced<br>
><br>
> Jun 19 11:01:21 node1 fenced[7625]: fencing node node3<br>
> Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence<br>
> node3 (off)<br>
> Jun 19 11:01:22 node1 stonith_admin[8068]: notice: crm_log_args:<br>
> Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman<br>
> Jun 19 11:01:22 node1 stonith-ng[8283]: notice: handle_request:<br>
> Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3'<br>
> with device '(any)'<br>
> Jun 19 11:01:22 node1 stonith-ng[8283]: notice:<br>
> stonith_check_fence_tolerance: Target node3 was fenced (off) less than<br>
> 5s ago by node2 on behalf of node1<br>
> Jun 19 11:01:22 node1 fenced[7625]: fence node3 success<br>
><br>
><br>
><br>
> [node1:~]# ls -ahl /proc/22505/fd<br>
> total 0<br>
> dr-x------ 2 root root 0 Jun 19 11:55 .<br>
> dr-xr-xr-x 8 root root 0 Jun 19 11:55 ..<br>
> lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8<br>
> lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8<br>
> lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8<br>
> lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]<br>
> lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]<br>
><br>
> [node1:~]# lsof -p 22505<br>
> ...<br>
> stonith_admin 22505 root 3u unix 0xffff880c14889b80 0t0<br>
> 4061683 socket<br>
> stonith_admin 22505 root 4u unix 0xffff880c2a4fbc40 0t0<br>
> 4061684 socket<br>
><br>
><br>
> Obviously it's trying to read some data from unix socket but doesn't get<br>
> anything from the other side, is there anyone there who can explain me<br>
> why fence command is always failing on first attempt?<br>
><br>
> Thanks<br>
<br>
</div></div>I noticed you're not a mailing list member. Please register if you want<br>
your emails to come through without getting stuck in the moderator queue.<br>
<span><font color="#888888"><br>
--<br>
Digimer<br>
Papers and Projects: <a href="https://alteeve.ca/w/" rel="noreferrer" target="_blank">https://alteeve.ca/w/</a><br>
What if the cure for cancer is trapped in the mind of a person without<br>
access to education?<br>
</font></span></blockquote></div><br></div></div>