[ClusterLabs] stonith_admin timeouts
Milos Buncic
htchak19 at gmail.com
Sun Jun 21 18:12:02 UTC 2015
Hey people,
I'm experiencing a very strange issue, and it shows up every time I try to
fence a node.
I have a test environment with a three-node cluster (CentOS 6.6 x86_64) where
rgmanager has been replaced with Pacemaker (CMAN + Pacemaker).
I've configured fencing with pcs for all three nodes:
Pacemaker:
pcs stonith create node1-ipmi \
fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer
passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
op monitor interval=10s timeout=30s
pcs constraint location node1-ipmi avoids node1
pcs property set stonith-enabled=true
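For reference, this is how I check that the devices are registered and mapped
to the right hosts (the stonith_admin flags are from the 1.1.x man page, so
treat the exact options as a sketch):
[node1:~]# pcs stonith show
[node1:~]# stonith_admin --list-registered
[node1:~]# stonith_admin --list node1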
CMAN - /etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="10" name="mycluster">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device action="off" name="pcmk" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device action="off" name="pcmk" port="node2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node3" nodeid="3">
      <fence>
        <method name="pcmk-redirect">
          <device action="off" name="pcmk" port="node3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
  <logging debug="on"/>
  <quorumd interval="1" label="QuorumDisk" status_file="/qdisk_status" tko="70"/>
  <totem token="108000"/>
</cluster>
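In case it's useful, this is how I sanity-check the config and membership on
each node (stock RHEL 6 CMAN tools; output omitted here):
[node1:~]# ccs_config_validate
[node1:~]# cman_tool nodes
[node1:~]# cman_tool status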
Every time I try to fence a node I get a timeout error, and the node does get
fenced in the end (on the second attempt), but I'm wondering why it takes so
long to fence a node.
So when I run stonith_admin or fence_node (which in the end also runs
stonith_admin, as you can clearly see from the log file), it always fails on
the first attempt, my guess is because it never gets a status code back, or
something like that:
strace stonith_admin --fence node1 --tolerance 5s --tag cman
Partial output from strace:
...
poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
[... the same 500 ms poll timeout repeated 12 more times ...]
poll([{fd=4, events=POLLIN}], 1, 291) = 0 (Timeout)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb2a8c37000
write(1, "Command failed: Timer expired\n", 30) = 30
Command failed: Timer expired
poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
shutdown(4, 2 /* send and receive */) = 0
close(4) = 0
munmap(0x7fb2a8b98000, 270336) = 0
munmap(0x7fb2a8bda000, 8248) = 0
munmap(0x7fb2a8b56000, 270336) = 0
munmap(0x7fb2a8c3b000, 8248) = 0
munmap(0x7fb2a8b14000, 270336) = 0
munmap(0x7fb2a8c38000, 8248) = 0
munmap(0x7fb2a8bdd000, 135168) = 0
munmap(0x7fb2a8bfe000, 135168) = 0
exit_group(-62) = ?
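For comparison, I can retry the same request with an explicit operation
timeout instead of relying on the default (the 300 seconds below is an
arbitrary value, just to see whether the first attempt still expires):
[node1:~]# stonith_admin --fence node1 --tolerance 5s --timeout 300 --tag cman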
Or via CMAN:
[node1:~]# fence_node -vv node3
fence node3 dev 0.0 agent fence_pcmk result: error from agent
agent args: action=off port=node3 timeout=15 nodename=node3 agent=fence_pcmk
fence node3 failed
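If I read the agent args correctly, fenced hands fence_pcmk a 15-second
timeout. As an experiment I could raise it on the device entry in
cluster.conf, assuming fence_pcmk honours a timeout attribute there the same
way other agents do:
<device action="off" name="pcmk" port="node3" timeout="60"/>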
/var/log/messages:
Jun 19 10:57:43 node1 stonith_admin[3804]: notice: crm_log_args:
Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
Jun 19 10:57:43 node1 stonith-ng[8283]: notice: handle_request: Client
stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1' with device
'(any)'
Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
initiate_remote_stonith_op: Initiating remote operation off for node1:
fbc7fe61-9451-4634-9c12-57d933ccd0a4 ( 0)
Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
can_fence_host_with_device: node2-ipmi can not fence (off) node1:
static-list
Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
Jun 19 10:57:54 node1 stonith-ng[8283]: warning: get_xpath_object: No
match for //@st_delegate in /st-reply
Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
Jun 19 10:59:31 node1 corosync[7349]: [TOTEM ] A processor failed,
forming new configuration.
Jun 19 11:01:21 node1 corosync[7349]: [QUORUM] Members[2]: 1 2
Jun 19 11:01:21 node1 corosync[7349]: [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Jun 19 11:01:21 node1 crmd[8287]: notice: crm_update_peer_state:
cman_event_callback: Node node3[3] - state is now lost (was member)
Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
Jun 19 11:01:21 node1 stonith-ng[8283]: notice: remote_op_done:
Operation off of node3 by node2 for stonith_admin.cman.3804 at node1.
com.fbc7fe61: OK
Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify: Peer
node3 was terminated (off) by node2 for node1: OK (
ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client stonith_admin.cman.3804
Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
Notified CMAN that 'node3' is now fenced
Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence node3
(off)
Jun 19 11:01:22 node1 stonith_admin[8068]: notice: crm_log_args:
Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
Jun 19 11:01:22 node1 stonith-ng[8283]: notice: handle_request: Client
stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3' with device
'(any)'
Jun 19 11:01:22 node1 stonith-ng[8283]: notice:
stonith_check_fence_tolerance: Target node3 was fenced (off) less than 5s
ago by node2 on behalf of node1
Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
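In case the defaults matter here, this is how I'd read the cluster-wide
fencing timeout and, as an experiment, raise the per-device off timeout
(pcmk_off_timeout applying to the fence_ipmilan devices this way is an
assumption on my part):
[node1:~]# crm_attribute --type crm_config --name stonith-timeout --query
[node1:~]# pcs stonith update node1-ipmi pcmk_off_timeout=120s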
While a stonith_admin call is stuck waiting, its file descriptors look like this:
[node1:~]# ls -ahl /proc/22505/fd
total 0
dr-x------ 2 root root 0 Jun 19 11:55 .
dr-xr-xr-x 8 root root 0 Jun 19 11:55 ..
lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]
[node1:~]# lsof -p 22505
...
stonith_admin 22505 root 3u unix 0xffff880c14889b80 0t0 4061683 socket
stonith_admin 22505 root 4u unix 0xffff880c2a4fbc40 0t0 4061684 socket
Obviously it's trying to read data from a unix socket but never gets anything
back from the other side. Can anyone explain why the fence command always
fails on the first attempt?
Thanks