[ClusterLabs] stonith_admin timeouts

Mon Jun 22 06:02:56 UTC 2015

On 21/06/15 02:12 PM, Milos Buncic wrote:
> Hey people
> 
> I'm experiencing a very strange issue that appears every time I try
> to fence a node.
> I have a test environment with a three-node cluster (CentOS 6.6 x86_64)
> where rgmanager is replaced with pacemaker (CMAN + pacemaker).
> 
> I've configured fencing with pcs for all three nodes
> 
> Pacemaker:
> pcs stonith create node1-ipmi \
> fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer \
> passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
> op monitor interval=10s timeout=30s

Does 'fence_ipmilan ...' work when called manually from the command line?

> pcs constraint location node1-ipmi avoids node1
> 
> pcs property set stonith-enabled=true
> 
> 
> CMAN - /etc/cluster/cluster.conf:
> <?xml version="1.0"?>
> <cluster config_version="10" name="mycluster">
>         <fence_daemon/>
>         <clusternodes>
>                 <clusternode name="node1" nodeid="1">
>                         <fence>
>                                 <method name="pcmk-redirect">
>                                         <device action="off" name="pcmk"
> port="node1"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node2" nodeid="2">
>                         <fence>
>                                 <method name="pcmk-redirect">
>                                         <device action="off" name="pcmk"
> port="node2"/>
>                                 </method>
>                         </fence>
>                 </clusternode>
>                 <clusternode name="node2" nodeid="3">
>                         <fence>
>                                 <method name="pcmk-redirect">
>                                         <device action="off" name="pcmk"
> port="node2"/>
>                                 </method>
>                         </fence>
>                 </clusternode>

Looks like the same node is defined twice, instead of 'node3'.
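
A quick way to catch this kind of copy/paste slip is to count the
clusternode names in cluster.conf. The snippet below is only a sketch:
the function name duplicate_node_names and the trimmed-down SAMPLE
config are made up for illustration, and it checks only the "name"
attribute of each <clusternode> (the port= values on the fence devices
would need a similar check).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_node_names(xml_text):
    """Return the sorted clusternode names that appear more than once."""
    root = ET.fromstring(xml_text)
    # Collect the name attribute of every <clusternode>, skipping any
    # element that has no name at all.
    names = [n.get("name") for n in root.iter("clusternode") if n.get("name")]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical, shortened version of the config quoted above.
SAMPLE = """<cluster config_version="10" name="mycluster">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
    <clusternode name="node2" nodeid="3"/>
  </clusternodes>
</cluster>"""

if __name__ == "__main__":
    print(duplicate_node_names(SAMPLE))
```

To run it against a real file, read /etc/cluster/cluster.conf into a
string and pass that in instead of SAMPLE.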

>         </clusternodes>
>         <cman/>
>         <fencedevices>
>                 <fencedevice agent="fence_pcmk" name="pcmk"/>
>         </fencedevices>
>         <rm>
>                 <failoverdomains/>
>                 <resources/>
>         </rm>
>         <logging debug="on"/>
>         <quorumd interval="1" label="QuorumDisk"
> status_file="/qdisk_status" tko="70"/>

Also, I'm not sure how well qdisk is tested/supported. Do you even need
it with three nodes?

>         <totem token="108000"/>

That is a VERY high number!
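
To put that number in perspective: the totem token value is in
milliseconds, so 108000 is 108 seconds. The sketch below does the
conversion and compares it with two timestamps taken from the
/var/log/messages excerpt further down (the fence request at 10:57:43
and corosync's "A processor failed" at 10:59:31). The helper name
seconds_between is made up for illustration, and reading the match as
cause and effect is an assumption, not something the logs prove.

```python
from datetime import datetime

# totem token value from the quoted cluster.conf, in milliseconds.
TOKEN_MS = 108000


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Seconds elapsed between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


token_s = TOKEN_MS / 1000
# Timestamps copied from the log excerpt quoted in this message.
gap_s = seconds_between("10:57:43", "10:59:31")

if __name__ == "__main__":
    print("token timeout: %.0f s" % token_s)
    print("fence request -> 'A processor failed': %.0f s" % gap_s)
```

Both values come out the same, which at least suggests the token
timeout is worth looking at first.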

> </cluster>
> 
> Every time I try to fence a node I get a timeout error, and the node
> is only fenced at the end (on the second attempt). Why does it take
> so long to fence a node?

Run 'fence_check' (this tests cman's fencing which is hooked into
pacemaker's stonith).

> So when I run stonith_admin or fence_node (which in the end also runs
> stonith_admin, as you can see clearly from the log file) it always
> fails on the first attempt. My guess is that it doesn't get a
> status code or something like that:
> strace stonith_admin --fence node1 --tolerance 5s --tag cman
> 
> Partial output from strace:
>   ...
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
>   poll([{fd=4, events=POLLIN}], 1, 291)   = 0 (Timeout)
>   fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
>   mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7fb2a8c37000
>   write(1, "Command failed: Timer expired\n", 30Command failed: Timer
> expired
>   ) = 30
>   poll([{fd=4, events=POLLIN}], 1, 0)     = 0 (Timeout)
>   shutdown(4, 2 /* send and receive */)   = 0
>   close(4)                                = 0
>   munmap(0x7fb2a8b98000, 270336)          = 0
>   munmap(0x7fb2a8bda000, 8248)            = 0
>   munmap(0x7fb2a8b56000, 270336)          = 0
>   munmap(0x7fb2a8c3b000, 8248)            = 0
>   munmap(0x7fb2a8b14000, 270336)          = 0
>   munmap(0x7fb2a8c38000, 8248)            = 0
>   munmap(0x7fb2a8bdd000, 135168)          = 0
>   munmap(0x7fb2a8bfe000, 135168)          = 0
>   exit_group(-62)                         = ?
> 
> 
> Or via cman:
> [node1:~]# fence_node -vv node3
> fence node3 dev 0.0 agent fence_pcmk result: error from agent
> agent args: action=off port=node3 timeout=15 nodename=node3 agent=fence_pcmk
> fence node3 failed
> 
> 
> /var/log/messages:
>   Jun 19 10:57:43 node1 stonith_admin[3804]:   notice: crm_log_args:
> Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
>   Jun 19 10:57:43 node1 stonith-ng[8283]:   notice: handle_request:
> Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1'
> with device '(any)'
>   Jun 19 10:57:43 node1 stonith-ng[8283]:   notice:
> initiate_remote_stonith_op: Initiating remote operation off for node1:
> fbc7fe61-9451-4634-9c12-57d933ccd0a4 (  0)
>   Jun 19 10:57:43 node1 stonith-ng[8283]:   notice:
> can_fence_host_with_device: node2-ipmi can not fence (off) node1:
> static-list
>   Jun 19 10:57:43 node1 stonith-ng[8283]:   notice:
> can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
>   Jun 19 10:57:54 node1 stonith-ng[8283]:  warning: get_xpath_object: No
> match for //@st_delegate in /st-reply
>   Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
>   Jun 19 10:59:31 node1 corosync[7349]:   [TOTEM ] A processor failed,
> forming new configuration.
>   Jun 19 11:01:21 node1 corosync[7349]:   [QUORUM] Members[2]: 1 2
>   Jun 19 11:01:21 node1 corosync[7349]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
>   Jun 19 11:01:21 node1 crmd[8287]:   notice: crm_update_peer_state:
> cman_event_callback: Node node3[3] - state is now lost (was member)
>   Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
>   Jun 19 11:01:21 node1 stonith-ng[8283]:   notice: remote_op_done:
> Operation off of node3 by node2 for stonith_admin.cman.3804 at node1. 
> com.fbc7fe61: OK
>   Jun 19 11:01:21 node1 crmd[8287]:   notice: tengine_stonith_notify:
> Peer node3 was terminated (off) by node2 for node1: OK ( 
> ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client stonith_admin.cman.3804
>   Jun 19 11:01:21 node1 crmd[8287]:   notice: tengine_stonith_notify:
> Notified CMAN that 'node3' is now fenced
>  
>   Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
>   Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence
> node3 (off)
>   Jun 19 11:01:22 node1 stonith_admin[8068]:   notice: crm_log_args:
> Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
>   Jun 19 11:01:22 node1 stonith-ng[8283]:   notice: handle_request:
> Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3'
> with device '(any)'
>   Jun 19 11:01:22 node1 stonith-ng[8283]:   notice:
> stonith_check_fence_tolerance: Target node3 was fenced (off) less than
> 5s ago by node2 on   behalf of node1
>   Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
> 
> 
> 
>   [node1:~]# ls -ahl /proc/22505/fd
>   total 0
>   dr-x------ 2 root root  0 Jun 19 11:55 .
>   dr-xr-xr-x 8 root root  0 Jun 19 11:55 ..
>   lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
>   lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
>   lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
>   lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
>   lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]
> 
>   [node1:~]# lsof -p 22505
>   ...
>   stonith_admin 22505 root    3u  unix 0xffff880c14889b80      0t0
> 4061683 socket
>   stonith_admin 22505 root    4u  unix 0xffff880c2a4fbc40      0t0
> 4061684 socket
> 
> 
> Obviously it's trying to read some data from a unix socket but doesn't
> get anything from the other side. Can anyone explain why the fence
> command always fails on the first attempt?
> 
> Thanks

I noticed you're not a mailing list member. Please register if you want
your emails to come through without getting stuck in the moderator queue.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?