[ClusterLabs] stonith_admin timeouts
Digimer
lists at alteeve.ca
Mon Jun 22 06:02:56 UTC 2015
On 21/06/15 02:12 PM, Milos Buncic wrote:
> Hey people
>
> I'm experiencing a very strange issue that appears every time I try
> to fence a node.
> I have a test environment with a three-node cluster (CentOS 6.6 x86_64)
> where rgmanager is replaced with pacemaker (CMAN + pacemaker).
>
> I've configured fencing with pcs for all three nodes
>
> Pacemaker:
> pcs stonith create node1-ipmi \
> fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer
> passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
> op monitor interval=10s timeout=30s
Does 'fence_ipmilan ...' work when called manually from the command line?
> pcs constraint location node1-ipmi avoids node1
>
> pcs property set stonith-enabled=true
>
>
> CMAN - /etc/cluster/cluster.conf:
> <?xml version="1.0"?>
> <cluster config_version="10" name="mycluster">
> <fence_daemon/>
> <clusternodes>
> <clusternode name="node1" nodeid="1">
> <fence>
> <method name="pcmk-redirect">
> <device action="off" name="pcmk"
> port="node1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="node2" nodeid="2">
> <fence>
> <method name="pcmk-redirect">
> <device action="off" name="pcmk"
> port="node2"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="node2" nodeid="3">
> <fence>
> <method name="pcmk-redirect">
> <device action="off" name="pcmk"
> port="node2"/>
> </method>
> </fence>
> </clusternode>
Looks like the same node is defined twice, instead of 'node3'.
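A quick way to catch this kind of copy/paste slip is to list every
clusternode name that appears more than once. The following is only a
minimal Python sketch: the helper name duplicate_node_names and the
trimmed-down SAMPLE string are made up for illustration, and the sample
keeps only the clusternode elements from the config quoted above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_node_names(cluster_conf_xml):
    """Return the sorted clusternode names that occur more than once."""
    root = ET.fromstring(cluster_conf_xml)
    # Collect the name attribute of every <clusternode> element.
    names = [node.get("name") for node in root.iter("clusternode")]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical, trimmed-down copy of the quoted cluster.conf.
SAMPLE = """<cluster config_version="10" name="mycluster">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
    <clusternode name="node2" nodeid="3"/>
  </clusternodes>
</cluster>"""

if __name__ == "__main__":
    print(duplicate_node_names(SAMPLE))
```

To check a real file instead of the sample string, read
/etc/cluster/cluster.conf and pass its contents to the same function.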
> </clusternodes>
> <cman/>
> <fencedevices>
> <fencedevice agent="fence_pcmk" name="pcmk"/>
> </fencedevices>
> <rm>
> <failoverdomains/>
> <resources/>
> </rm>
> <logging debug="on"/>
> <quorumd interval="1" label="QuorumDisk"
> status_file="/qdisk_status" tko="70"/>
Also, I'm not sure how well qdisk is tested/supported. Do you even need
it with three nodes?
> <totem token="108000"/>
That is a VERY high number! The token timeout is in milliseconds, so
this is 108 seconds.
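For comparison with the log excerpt further down, here is a minimal
Python sketch that converts the token value to seconds and measures the
gap between the first stonith_admin call (10:57:43) and corosync's
"A processor failed" message (10:59:31). The helper name seconds_between
is made up for illustration, and reading the match as cause and effect
is an assumption on my part, not something the log proves.

```python
from datetime import datetime

TOKEN_MS = 108000  # <totem token="108000"/> from the quoted cluster.conf


def seconds_between(start, end):
    """Seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


token_seconds = TOKEN_MS / 1000

# Timestamps copied from the quoted /var/log/messages excerpt.
gap_to_failure = seconds_between("10:57:43", "10:59:31")
gap_to_fenced = seconds_between("10:57:43", "11:01:21")

if __name__ == "__main__":
    print("token timeout (s):", token_seconds)
    print("fence request -> processor failed (s):", gap_to_failure)
    print("fence request -> fence confirmed (s):", gap_to_fenced)
```

If the first two numbers line up, the token timeout is a good first
suspect for the long wait.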
> </cluster>
>
> Every time I try to fence a node I get a timeout error, and the node is
> only fenced at the end (on the second attempt). Why does it take so
> long to fence a node?
Run 'fence_check' (this tests cman's fencing which is hooked into
pacemaker's stonith).
> So when I run stonith_admin or fence_node (which in the end also runs
> stonith_admin, as you can see clearly from the log file), it always
> fails on the first attempt. My guess is that it doesn't get a
> status code or something like that:
> strace stonith_admin --fence node1 --tolerance 5s --tag cman
>
> Partial output from strace:
> ...
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}], 1, 291) = 0 (Timeout)
> fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x7fb2a8c37000
> write(1, "Command failed: Timer expired\n", 30Command failed: Timer
> expired
> ) = 30
> poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
> shutdown(4, 2 /* send and receive */) = 0
> close(4) = 0
> munmap(0x7fb2a8b98000, 270336) = 0
> munmap(0x7fb2a8bda000, 8248) = 0
> munmap(0x7fb2a8b56000, 270336) = 0
> munmap(0x7fb2a8c3b000, 8248) = 0
> munmap(0x7fb2a8b14000, 270336) = 0
> munmap(0x7fb2a8c38000, 8248) = 0
> munmap(0x7fb2a8bdd000, 135168) = 0
> munmap(0x7fb2a8bfe000, 135168) = 0
> exit_group(-62) = ?
>
>
> Or via cman:
> [node1:~]# fence_node -vv node3
> fence node3 dev 0.0 agent fence_pcmk result: error from agent
> agent args: action=off port=node3 timeout=15 nodename=node3 agent=fence_pcmk
> fence node3 failed
>
>
> /var/log/messages:
> Jun 19 10:57:43 node1 stonith_admin[3804]: notice: crm_log_args:
> Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice: handle_request:
> Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1'
> with device '(any)'
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> initiate_remote_stonith_op: Initiating remote operation off for node1:
> fbc7fe61-9451-4634-9c12-57d933ccd0a4 ( 0)
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> can_fence_host_with_device: node2-ipmi can not fence (off) node1:
> static-list
> Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
> Jun 19 10:57:54 node1 stonith-ng[8283]: warning: get_xpath_object: No
> match for //@st_delegate in /st-reply
> Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
> Jun 19 10:59:31 node1 corosync[7349]: [TOTEM ] A processor failed,
> forming new configuration.
> Jun 19 11:01:21 node1 corosync[7349]: [QUORUM] Members[2]: 1 2
> Jun 19 11:01:21 node1 corosync[7349]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Jun 19 11:01:21 node1 crmd[8287]: notice: crm_update_peer_state:
> cman_event_callback: Node node3[3] - state is now lost (was member)
> Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
> Jun 19 11:01:21 node1 stonith-ng[8283]: notice: remote_op_done:
> Operation off of node3 by node2 for stonith_admin.cman.3804 at node1.
> com.fbc7fe61: OK
> Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
> Peer node3 was terminated (off) by node2 for node1: OK (
> ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client stonith_admin.cman.3804
> Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
> Notified CMAN that 'node3' is now fenced
>
> Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
> Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence
> node3 (off)
> Jun 19 11:01:22 node1 stonith_admin[8068]: notice: crm_log_args:
> Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
> Jun 19 11:01:22 node1 stonith-ng[8283]: notice: handle_request:
> Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3'
> with device '(any)'
> Jun 19 11:01:22 node1 stonith-ng[8283]: notice:
> stonith_check_fence_tolerance: Target node3 was fenced (off) less than
> 5s ago by node2 on behalf of node1
> Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
>
>
>
> [node1:~]# ls -ahl /proc/22505/fd
> total 0
> dr-x------ 2 root root 0 Jun 19 11:55 .
> dr-xr-xr-x 8 root root 0 Jun 19 11:55 ..
> lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
> lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
> lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
> lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
> lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]
>
> [node1:~]# lsof -p 22505
> ...
> stonith_admin 22505 root 3u unix 0xffff880c14889b80 0t0
> 4061683 socket
> stonith_admin 22505 root 4u unix 0xffff880c2a4fbc40 0t0
> 4061684 socket
>
>
> Obviously it's trying to read some data from a unix socket but doesn't
> get anything from the other side. Can anyone explain to me why the
> fence command always fails on the first attempt?
>
> Thanks
I noticed you're not a mailing list member. Please register if you want
your emails to come through without getting stuck in the moderator queue.
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?