[ClusterLabs] stonith_admin timeouts

Milos Buncic htchak19 at gmail.com
Sun Jun 21 14:12:02 EDT 2015


Hey people

I'm experiencing a very strange issue, and it appears every time I try to
fence a node. I have a test environment with a three-node cluster (CentOS
6.6 x86_64) where rgmanager has been replaced with pacemaker (CMAN +
pacemaker).

I've configured fencing with pcs for all three nodes:

Pacemaker:
pcs stonith create node1-ipmi fence_ipmilan \
  pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer passwd=******** \
  privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
  op monitor interval=10s timeout=30s

pcs constraint location node1-ipmi avoids node1

pcs property set stonith-enabled=true
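
For what it's worth, the device and the agent can be sanity-checked before
any real fencing; a quick sketch using the names from the config above
(the password is the placeholder, obviously):

  # is the device registered with stonith-ng, and does it answer?
  [node1:~]# stonith_admin --list-registered
  [node1:~]# stonith_admin --query node1-ipmi

  # exercise the agent directly, outside of pacemaker
  [node1:~]# fence_ipmilan -a 1.1.1.1 -l fencer -p '********' -P -o status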


CMAN - /etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="10" name="mycluster">
        <fence_daemon/>
        <clusternodes>
                <clusternode name="node1" nodeid="1">
                        <fence>
                                <method name="pcmk-redirect">
                                        <device action="off" name="pcmk"
port="node1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2">
                        <fence>
                                <method name="pcmk-redirect">
                                        <device action="off" name="pcmk"
port="node2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="3">
                        <fence>
                                <method name="pcmk-redirect">
                                        <device action="off" name="pcmk"
port="node2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_pcmk" name="pcmk"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
        <logging debug="on"/>
        <quorumd interval="1" label="QuorumDisk"
status_file="/qdisk_status" tko="70"/>
        <totem token="108000"/>
</cluster>
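
As an aside, cluster.conf can be validated against the schema before it's
pushed out, which catches things like mismatched or duplicated node names;
a sketch with the stock CMAN tooling:

  # validate the live config, or a candidate file
  [node1:~]# ccs_config_validate
  [node1:~]# ccs_config_validate -f /etc/cluster/cluster.conf.new

  # after bumping config_version, propagate and activate it
  [node1:~]# cman_tool version -r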

Every time I try to fence a node I get a timeout error; the node does get
fenced in the end (on the second attempt), but I'm wondering why it takes
so long.

So when I run stonith_admin or fence_node (which in the end also runs
stonith_admin, as you can see clearly from the log file), it always fails
on the first attempt, my guess being that it never gets a status code back
or something like that:
strace stonith_admin --fence node1 --tolerance 5s --tag cman

Partial output from strace:
  ...
  poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
  poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
  ... (the same 500 ms poll repeated, 15 times in all) ...
  poll([{fd=4, events=POLLIN}], 1, 291)   = 0 (Timeout)
  fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x7fb2a8c37000
  write(1, "Command failed: Timer expired\n", 30Command failed: Timer
expired
  ) = 30
  poll([{fd=4, events=POLLIN}], 1, 0)     = 0 (Timeout)
  shutdown(4, 2 /* send and receive */)   = 0
  close(4)                                = 0
  munmap(0x7fb2a8b98000, 270336)          = 0
  munmap(0x7fb2a8bda000, 8248)            = 0
  munmap(0x7fb2a8b56000, 270336)          = 0
  munmap(0x7fb2a8c3b000, 8248)            = 0
  munmap(0x7fb2a8b14000, 270336)          = 0
  munmap(0x7fb2a8c38000, 8248)            = 0
  munmap(0x7fb2a8bdd000, 135168)          = 0
  munmap(0x7fb2a8bfe000, 135168)          = 0
  exit_group(-62)                         = ?
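
The exit_group(-62) matches the message: errno 62 is ETIME, "Timer
expired", so the client is hitting its own timeout while waiting for a
reply on fd 4 (the 500 ms polls are just its wait loop). A quick way to
confirm the errno mapping, assuming perl is on the box (it is on a stock
CentOS 6 install):

  [node1:~]# perl -e '$! = 62; print "$!\n"'
  Timer expired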


Or via cman:
[node1:~]# fence_node -vv node3
fence node3 dev 0.0 agent fence_pcmk result: error from agent
agent args: action=off port=node3 timeout=15 nodename=node3 agent=fence_pcmk
fence node3 failed
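
Since fence agents take their arguments as key=value lines on stdin, the
exact call fenced makes can be replayed by hand to take fenced out of the
picture; a sketch using the agent args from the output above:

  [node1:~]# printf 'action=off\nport=node3\nnodename=node3\ntimeout=15\n' | fence_pcmk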


/var/log/messages:
  Jun 19 10:57:43 node1 stonith_admin[3804]:   notice: crm_log_args: Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
  Jun 19 10:57:43 node1 stonith-ng[8283]:   notice: handle_request: Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1' with device '(any)'
  Jun 19 10:57:43 node1 stonith-ng[8283]:   notice: initiate_remote_stonith_op: Initiating remote operation off for node1: fbc7fe61-9451-4634-9c12-57d933ccd0a4 (0)
  Jun 19 10:57:43 node1 stonith-ng[8283]:   notice: can_fence_host_with_device: node2-ipmi can not fence (off) node1: static-list
  Jun 19 10:57:43 node1 stonith-ng[8283]:   notice: can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
  Jun 19 10:57:54 node1 stonith-ng[8283]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
  Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
  Jun 19 10:59:31 node1 corosync[7349]:   [TOTEM ] A processor failed, forming new configuration.
  Jun 19 11:01:21 node1 corosync[7349]:   [QUORUM] Members[2]: 1 2
  Jun 19 11:01:21 node1 corosync[7349]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
  Jun 19 11:01:21 node1 crmd[8287]:   notice: crm_update_peer_state: cman_event_callback: Node node3[3] - state is now lost (was member)
  Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
  Jun 19 11:01:21 node1 stonith-ng[8283]:   notice: remote_op_done: Operation off of node3 by node2 for stonith_admin.cman.3804@node1.com.fbc7fe61: OK
  Jun 19 11:01:21 node1 crmd[8287]:   notice: tengine_stonith_notify: Peer node3 was terminated (off) by node2 for node1: OK (ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client stonith_admin.cman.3804
  Jun 19 11:01:21 node1 crmd[8287]:   notice: tengine_stonith_notify: Notified CMAN that 'node3' is now fenced

  Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
  Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence node3 (off)
  Jun 19 11:01:22 node1 stonith_admin[8068]:   notice: crm_log_args: Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
  Jun 19 11:01:22 node1 stonith-ng[8283]:   notice: handle_request: Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3' with device '(any)'
  Jun 19 11:01:22 node1 stonith-ng[8283]:   notice: stonith_check_fence_tolerance: Target node3 was fenced (off) less than 5s ago by node2 on behalf of node1
  Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
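
If it's just the client timing out while the cluster is still busy
evicting the node (the off request at 10:57:43 isn't confirmed until
11:01:21, roughly 3.5 minutes, and token=108 s plus qdiskd interval*tko
of about 70 s alone exceed the stock 60 s stonith-timeout), one thing to
look at is the fencing timeout; a sketch, assuming the default property
names:

  # what the cluster is currently using (defaults to 60s)
  [node1:~]# crm_attribute --type crm_config --name stonith-timeout --query

  # allow more time than token loss + qdisk eviction take
  [node1:~]# pcs property set stonith-timeout=240s

  # or per invocation
  [node1:~]# stonith_admin --fence node1 --tolerance 5s --timeout 240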



While stonith_admin sits in that poll loop (PID 22505 here, from a
separate run), its open file descriptors look like this:

  [node1:~]# ls -ahl /proc/22505/fd
  total 0
  dr-x------ 2 root root  0 Jun 19 11:55 .
  dr-xr-xr-x 8 root root  0 Jun 19 11:55 ..
  lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
  lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
  lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
  lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
  lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]

  [node1:~]# lsof -p 22505
  ...
  stonith_admin 22505 root    3u  unix 0xffff880c14889b80      0t0 4061683 socket
  stonith_admin 22505 root    4u  unix 0xffff880c2a4fbc40      0t0 4061684 socket
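
For what it's worth, fd 4 is the unix socket the poll loop above is
waiting on, so the question is which daemon holds (or dropped) the other
end. The socket tables can narrow that down; a rough sketch, using the
inode numbers from the lsof output (nothing cluster-specific here):

  # every unix socket holder; look for the stonith-ng side
  [node1:~]# lsof -U | grep stonith

  # the raw kernel view, by inode
  [node1:~]# grep -e 4061683 -e 4061684 /proc/net/unix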


Obviously it's trying to read some data from a unix socket but never gets
anything back from the other side. Can anyone explain why the fence
command always fails on the first attempt?

Thanks