[ClusterLabs] stonith_admin timeouts
Milos Buncic
htchak19 at gmail.com
Mon Jun 22 08:21:20 UTC 2015
Hey, first of all, thank you for your answer.
> Does 'fence_ipmilan ...' work when called manually from the command line?
>
Yes it does (off, on, status...)
[node1:~]# fence_ipmilan -v -p ******** -l fencer -L OPERATOR -P -a 1.1.1.1
-o status
Getting status of IPMI:1.1.1.1...Spawning: '/usr/bin/ipmitool -I lanplus -H
'1.1.1.1' -U 'fencer' -L 'OPERATOR' -P '[set]' -v chassis power status'...
Chassis power = On
Done
> Looks like the same node is defined twice, instead of 'node3'.
>
Sorry about that, I mistyped the hostname when I pasted the
configuration.
The configuration looks like this:
<?xml version="1.0"?>
<cluster config_version="10" name="mycluster">
<fence_daemon/>
<clusternodes>
<clusternode name="node1" nodeid="1">
<fence>
<method name="pcmk-redirect">
<device action="off" name="pcmk"
port="node1"/>
</method>
</fence>
</clusternode>
<clusternode name="node2" nodeid="2">
<fence>
<method name="pcmk-redirect">
<device action="off" name="pcmk"
port="node2"/>
</method>
</fence>
</clusternode>
<clusternode name="node3" nodeid="3">
<fence>
<method name="pcmk-redirect">
<device action="off" name="pcmk"
port="node3"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman/>
<fencedevices>
<fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>
<rm>
<failoverdomains/>
<resources/>
</rm>
<logging debug="on"/>
<quorumd interval="1" label="QuorumDisk"
status_file="/qdisk_status" tko="70"/>
<totem token="108000"/>
</cluster>
> Run 'fence_check' (this tests cman's fencing which is hooked into
> pacemaker's stonith).
>
fence_check run at Mon Jun 22 09:35:38 CEST 2015 pid: 16091
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Get node list: node1 node2 node3
Testing node1 fencing
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Checking how many fencing methods are configured for node node1
Found 1 method(s) to test for node node1
Testing node1 method 1 status
Testing node1 method 1: success
Testing node2 fencing
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Checking how many fencing methods are configured for node node2
Found 1 method(s) to test for node node2
Testing node2 method 1 status
Testing node2 method 1: success
Testing node3 fencing
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Checking how many fencing methods are configured for node node3
Found 1 method(s) to test for node node3
Testing node3 method 1 status
Testing node3 method 1: success
cleanup: 0
> Also, I'm not sure how well qdisk is tested/supported. Do you even need
> it with three nodes?
>
Qdisk is in use in our production clusters, where we're running rgmanager,
so I just mirrored that configuration.
Hm, yes, in a three-node cluster we probably don't need it.
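If we end up dropping it, the change would just be deleting the <quorumd> element and bumping config_version, letting cman quorum on simple node majority (a sketch based on the config above; expected_votes is optional here since cman derives it from the node count):

```xml
<cluster config_version="11" name="mycluster">
  <!-- clusternodes / fencedevices / rm sections unchanged -->
  <cman expected_votes="3"/>
  <!-- the <quorumd .../> element removed entirely -->
  <totem token="10000"/>
</cluster>
```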
> <totem token="108000"/>
> That is a VERY high number!
>
You're probably right, I changed this value back to the default (10 s):
<totem token="10000"/>
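For reference, a back-of-the-envelope sketch of corosync's failure-detection latency under the two token values (assuming the consensus timeout is left at its default of 1.2 × token; this is an approximation only, retransmit/join tuning shifts the real figure):

```python
# Rough corosync failure-detection latency: a lost node is only
# declared dead after the token timeout expires, and a new membership
# forms after the consensus timeout (default: 1.2 x token).
def detection_window_s(token_ms, consensus_ms=None):
    if consensus_ms is None:
        consensus_ms = 1.2 * token_ms  # corosync default
    return (token_ms + consensus_ms) / 1000.0

print(detection_window_s(108000))  # old setting: ~237.6 s, about 4 minutes
print(detection_window_s(10000))   # new setting: ~22 s
```

So the old 108 s token alone accounts for roughly four minutes of delay before corosync even forms a new membership.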
[node1:~]# fence_node -vv node3
fence node3 dev 0.0 agent fence_pcmk result: error from agent
agent args: action=off port=node3 nodename=node3 agent=fence_pcmk
fence node3 failed
Messages log captured on node2, which is running the fencing resource for node3:
[node2:~]# tail -0f /var/log/messages
...
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node1-ipmi can not fence (off) node3:
static-list
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node1-ipmi can not fence (off) node3:
static-list
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
Jun 22 10:04:38 node2 stonith-ng[7382]: notice: log_operation: Operation
'off' [5288] (call 2 from stonith_admin.cman.7377) for host 'node3' with
device 'node3-ipmi' returned: 0 (OK)
Jun 22 10:05:44 node2 qdiskd[5948]: Node 3 evicted
This is where the delay happens (~3.5 min):
Jun 22 10:08:06 node2 corosync[5861]: [QUORUM] Members[2]: 1 2
Jun 22 10:08:06 node2 corosync[5861]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Jun 22 10:08:06 node2 crmd[7386]: notice: crm_update_peer_state:
cman_event_callback: Node node3[3] - state is now lost (was member)
Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for
shutdown action on node3
Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback:
Stonith/shutdown of node3 not matched
Jun 22 10:08:06 node2 crmd[7386]: notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
origin=check_join_state ]
Jun 22 10:08:06 node2 rsyslogd-2177: imuxsock begins to drop messages from
pid 5861 due to rate-limiting
Jun 22 10:08:06 node2 kernel: dlm: closing connection to node 3
Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending
flush op to all hosts for: shutdown (0)
Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for
shutdown action on node3
Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback:
Stonith/shutdown of node3 not matched
Jun 22 10:08:06 node2 stonith-ng[7382]: notice: remote_op_done: Operation
off of node3 by node2 for stonith_admin.cman.7377 at node1.753ce4e5: OK
Jun 22 10:08:06 node2 fenced[6211]: fencing deferred to node1
Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify: Peer
node3 was terminated (off) by node2 for node1: OK
(ref=753ce4e5-a84a-491b-8ed9-044667946381) by client stonith_admin.cman.7377
Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify:
Notified CMAN that 'node3' is now fenced
Jun 22 10:08:07 node2 rsyslogd-2177: imuxsock lost 108 messages from pid
5861 due to rate-limiting
Jun 22 10:08:07 node2 pengine[7385]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm102#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
testvm103#011(Started node1 -> node2)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm105#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm108#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
testvm109#011(Started node1 -> node2)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm111#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm114#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
testvm115#011(Started node1 -> node2)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm117#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
node1-ipmi#011(node2)
...
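For what it's worth, the gap in the log above works out to exactly the delay I'm seeing (comparing the last stonith-ng progress message with the moment the new membership forms):

```python
from datetime import datetime

fmt = "%b %d %H:%M:%S"
# 'Operation off ... returned: 0 (OK)' vs. '[QUORUM] Members[2]: 1 2'
fence_ok = datetime.strptime("Jun 22 10:04:38", fmt)
membership = datetime.strptime("Jun 22 10:08:06", fmt)
gap = (membership - fence_ok).total_seconds()
print(gap)  # 208.0 seconds, i.e. roughly 3.5 minutes
```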
> I noticed you're not a mailing list member. Please register if you want
> your emails to come through without getting stuck in the moderator queue.
>
Thanks man, I will.
The problem still persists :(
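The strace in the original message below shows stonith_admin polling its reply socket in fixed 500 ms slices until its overall timer expires, then printing "Command failed: Timer expired". A minimal sketch of that pattern (hypothetical names; the silent peer simulates stonith-ng never answering):

```python
import select
import socket
import time

def wait_for_reply(sock, total_timeout_s, slice_ms=500):
    """Poll a socket in fixed slices until data arrives or the overall
    timer expires -- the pattern visible in the strace output:
    repeated poll([{fd, POLLIN}], 1, 500) = 0 (Timeout)."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    deadline = time.monotonic() + total_timeout_s
    while True:
        remaining_ms = (deadline - time.monotonic()) * 1000
        if remaining_ms <= 0:
            return None  # timer expired with no reply
        if poller.poll(int(min(slice_ms, remaining_ms))):
            return sock.recv(4096)

# Peer that never sends anything, like stonith-ng in this failure mode.
client, _silent_peer = socket.socketpair()
reply = wait_for_reply(client, total_timeout_s=1.5)
print("Command failed: Timer expired" if reply is None else reply)
```

The point being: the client gives up on its own timer; the fence operation itself may still complete on another node afterwards, which matches the "failed first attempt, success on retry" behaviour.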
On Mon, Jun 22, 2015 at 8:02 AM, Digimer <lists at alteeve.ca> wrote:
> On 21/06/15 02:12 PM, Milos Buncic wrote:
> > Hey people
> >
> > I'm experiencing a very strange issue, and it appears every time I try
> > to fence a node.
> > I have a test environment with three node cluster (CentOS 6.6 x86_64)
> > where rgmanager is replaced with pacemaker (CMAN + pacemaker).
> >
> > I've configured fencing with pcs for all three nodes
> >
> > Pacemaker:
> > pcs stonith create node1-ipmi \
> > fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer
> > passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
> > op monitor interval=10s timeout=30s
>
> Does 'fence_ipmilan ...' work when called manually from the command line?
>
>
> pcs constraint location node1-ipmi avoids node1
> >
> > pcs property set stonith-enabled=true
> >
> >
> > CMAN - /etc/cluster/cluster.conf:
> > <?xml version="1.0"?>
> > <cluster config_version="10" name="mycluster">
> > <fence_daemon/>
> > <clusternodes>
> > <clusternode name="node1" nodeid="1">
> > <fence>
> > <method name="pcmk-redirect">
> > <device action="off" name="pcmk"
> > port="node1"/>
> > </method>
> > </fence>
> > </clusternode>
> > <clusternode name="node2" nodeid="2">
> > <fence>
> > <method name="pcmk-redirect">
> > <device action="off" name="pcmk"
> > port="node2"/>
> > </method>
> > </fence>
> > </clusternode>
> > <clusternode name="node2" nodeid="3">
> > <fence>
> > <method name="pcmk-redirect">
> > <device action="off" name="pcmk"
> > port="node2"/>
> > </method>
> > </fence>
> > </clusternode>
>
> Looks like the same node is defined twice, instead of 'node3'.
>
> > </clusternodes>
> > <cman/>
> > <fencedevices>
> > <fencedevice agent="fence_pcmk" name="pcmk"/>
> > </fencedevices>
> > <rm>
> > <failoverdomains/>
> > <resources/>
> > </rm>
> > <logging debug="on"/>
> > <quorumd interval="1" label="QuorumDisk"
> > status_file="/qdisk_status" tko="70"/>
>
> Also, I'm not sure how well qdisk is tested/supported. Do you even need
> it with three nodes?
>
> > <totem token="108000"/>
>
> That is a VERY high number!
>
> > </cluster>
> >
> > Every time I try to fence a node I get a timeout error, with the node
> > being fenced in the end (on the second attempt), but I'm wondering why
> > it takes so long to fence a node?
>
> Run 'fence_check' (this tests cman's fencing which is hooked into
> pacemaker's stonith).
>
> > So when I run stonith_admin or fence_node (which in the end also runs
> > stonith_admin, as you can see clearly from the log file) it always
> > fails on the first attempt, my guess is because it doesn't get a
> > status code or something like that:
> > strace stonith_admin --fence node1 --tolerance 5s --tag cman
> >
> > Partial output from strace:
> > ...
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 291) = 0 (Timeout)
> > fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> > 0) = 0x7fb2a8c37000
> > write(1, "Command failed: Timer expired\n", 30Command failed: Timer
> > expired
> > ) = 30
> > poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
> > shutdown(4, 2 /* send and receive */) = 0
> > close(4) = 0
> > munmap(0x7fb2a8b98000, 270336) = 0
> > munmap(0x7fb2a8bda000, 8248) = 0
> > munmap(0x7fb2a8b56000, 270336) = 0
> > munmap(0x7fb2a8c3b000, 8248) = 0
> > munmap(0x7fb2a8b14000, 270336) = 0
> > munmap(0x7fb2a8c38000, 8248) = 0
> > munmap(0x7fb2a8bdd000, 135168) = 0
> > munmap(0x7fb2a8bfe000, 135168) = 0
> > exit_group(-62) = ?
> >
> >
> > Or via cman:
> > [node1:~]# fence_node -vv node3
> > fence node3 dev 0.0 agent fence_pcmk result: error from agent
> > agent args: action=off port=node3 timeout=15 nodename=node3
> agent=fence_pcmk
> > fence node3 failed
> >
> >
> > /var/log/messages:
> > Jun 19 10:57:43 node1 stonith_admin[3804]: notice: crm_log_args:
> > Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice: handle_request:
> > Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1'
> > with device '(any)'
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> > initiate_remote_stonith_op: Initiating remote operation off for node1:
> > fbc7fe61-9451-4634-9c12-57d933ccd0a4 ( 0)
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> > can_fence_host_with_device: node2-ipmi can not fence (off) node1:
> > static-list
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> > can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
> > Jun 19 10:57:54 node1 stonith-ng[8283]: warning: get_xpath_object: No
> > match for //@st_delegate in /st-reply
> > Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
> > Jun 19 10:59:31 node1 corosync[7349]: [TOTEM ] A processor failed,
> > forming new configuration.
> > Jun 19 11:01:21 node1 corosync[7349]: [QUORUM] Members[2]: 1 2
> > Jun 19 11:01:21 node1 corosync[7349]: [TOTEM ] A processor joined or
> > left the membership and a new membership was formed.
> > Jun 19 11:01:21 node1 crmd[8287]: notice: crm_update_peer_state:
> > cman_event_callback: Node node3[3] - state is now lost (was member)
> > Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
> > Jun 19 11:01:21 node1 stonith-ng[8283]: notice: remote_op_done:
> > Operation off of node3 by node2 for stonith_admin.cman.3804 at node1.
> > com.fbc7fe61: OK
> > Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
> > Peer node3 was terminated (off) by node2 for node1: OK (
> > ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client
> stonith_admin.cman.3804
> > Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
> > Notified CMAN that 'node3' is now fenced
> >
> > Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
> > Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence
> > node3 (off)
> > Jun 19 11:01:22 node1 stonith_admin[8068]: notice: crm_log_args:
> > Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
> > Jun 19 11:01:22 node1 stonith-ng[8283]: notice: handle_request:
> > Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3'
> > with device '(any)'
> > Jun 19 11:01:22 node1 stonith-ng[8283]: notice:
> > stonith_check_fence_tolerance: Target node3 was fenced (off) less than
> > 5s ago by node2 on behalf of node1
> > Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
> >
> >
> >
> > [node1:~]# ls -ahl /proc/22505/fd
> > total 0
> > dr-x------ 2 root root 0 Jun 19 11:55 .
> > dr-xr-xr-x 8 root root 0 Jun 19 11:55 ..
> > lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
> > lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
> > lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
> > lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
> > lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]
> >
> > [node1:~]# lsof -p 22505
> > ...
> > stonith_admin 22505 root 3u unix 0xffff880c14889b80 0t0
> > 4061683 socket
> > stonith_admin 22505 root 4u unix 0xffff880c2a4fbc40 0t0
> > 4061684 socket
> >
> >
> > Obviously it's trying to read some data from a unix socket but doesn't
> > get anything from the other side. Can anyone explain why the fence
> > command always fails on the first attempt?
> >
> > Thanks
>
> I noticed you're not a mailing list member. Please register if you want
> your emails to come through without getting stuck in the moderator queue.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>