[ClusterLabs] stonith_admin timeouts
Milos Buncic
htchak19 at gmail.com
Mon Jun 22 11:29:23 UTC 2015
Hey guys
After removing the qdisk configuration from /etc/cluster/cluster.conf I got a
successful fence!?
<quorumd interval="1" label="QuorumDisk" status_file="/qdisk_status"
tko="70"/>
Maybe tko="70" is the cause of the issue? I have to inspect this further.
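
A rough back-of-the-envelope check, assuming I'm reading the qdiskd docs
right (a node is declared dead after roughly interval * tko missed updates):

  interval="1", tko="70"  ->  1 s * 70 = ~70 s before qdiskd evicts a node

which roughly matches the gap between the fence request and the
"Node 3 evicted" line in the logs below, so tko does look like a good suspect.
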
[node1:~]# fence_node -vv node3
fence node3 dev 0.0 agent fence_pcmk result: success
agent args: action=off port=node3 nodename=node3 agent=fence_pcmk
fence node3 success
On Mon, Jun 22, 2015 at 10:21 AM, Milos Buncic <htchak19 at gmail.com> wrote:
> Hey, first of all thank you for your answer.
>
>
>> Does 'fence_ipmilan ...' work when called manually from the command line?
>>
>>
>
> Yes it does (off, on, status...)
>
> [node1:~]# fence_ipmilan -v -p ******** -l fencer -L OPERATOR -P -a
> 1.1.1.1 -o status
> Getting status of IPMI:1.1.1.1...Spawning: '/usr/bin/ipmitool -I lanplus
> -H '1.1.1.1' -U 'fencer' -L 'OPERATOR' -P '[set]' -v chassis power
> status'...
> Chassis power = On
> Done
>
>> Looks like the same node is defined twice, instead of 'node3'.
>>
> Sorry about that, I mistyped the hostname just after I pasted the
> configuration.
>
> The configuration looks like this:
>
> <?xml version="1.0"?>
> <cluster config_version="10" name="mycluster">
> <fence_daemon/>
> <clusternodes>
> <clusternode name="node1" nodeid="1">
> <fence>
> <method name="pcmk-redirect">
> <device action="off" name="pcmk"
> port="node1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="node2" nodeid="2">
> <fence>
> <method name="pcmk-redirect">
> <device action="off" name="pcmk"
> port="node2"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="node3" nodeid="3">
> <fence>
> <method name="pcmk-redirect">
> <device action="off" name="pcmk"
> port="node3"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman/>
> <fencedevices>
> <fencedevice agent="fence_pcmk" name="pcmk"/>
> </fencedevices>
> <rm>
> <failoverdomains/>
> <resources/>
> </rm>
> <logging debug="on"/>
> <quorumd interval="1" label="QuorumDisk"
> status_file="/qdisk_status" tko="70"/>
> <totem token="108000"/>
> </cluster>
>
>
>> Run 'fence_check' (this tests cman's fencing which is hooked into
>> pacemaker's stonith).
>>
>
> fence_check run at Mon Jun 22 09:35:38 CEST 2015 pid: 16091
> Checking if cman is running: running
> Checking if node is quorate: quorate
> Checking if node is in fence domain: yes
> Checking if node is fence master: this node is fence master
> Checking if real fencing is in progress: no fencing in progress
> Get node list: node1 node2 node3
>
> Testing node1 fencing
> Checking if cman is running: running
> Checking if node is quorate: quorate
> Checking if node is in fence domain: yes
> Checking if node is fence master: this node is fence master
> Checking if real fencing is in progress: no fencing in progress
> Checking how many fencing methods are configured for node node1
> Found 1 method(s) to test for node node1
> Testing node1 method 1 status
> Testing node1 method 1: success
>
> Testing node2 fencing
> Checking if cman is running: running
> Checking if node is quorate: quorate
> Checking if node is in fence domain: yes
> Checking if node is fence master: this node is fence master
> Checking if real fencing is in progress: no fencing in progress
> Checking how many fencing methods are configured for node node2
> Found 1 method(s) to test for node node2
> Testing node2 method 1 status
> Testing node2 method 1: success
>
> Testing node3 fencing
> Checking if cman is running: running
> Checking if node is quorate: quorate
> Checking if node is in fence domain: yes
> Checking if node is fence master: this node is fence master
> Checking if real fencing is in progress: no fencing in progress
> Checking how many fencing methods are configured for node node3
> Found 1 method(s) to test for node node3
> Testing node3 method 1 status
> Testing node3 method 1: success
> cleanup: 0
>
>
>> Also, I'm not sure how well qdisk is tested/supported. Do you even need
>> it with three nodes?
>>
> Qdisk is used in production, where we're running rgmanager, so I just
> mirrored that configuration.
> Hm, yes, in a three-node cluster we probably don't need it.
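>
> If my quorum math is right (one vote per node by default), a three-node
> cluster needs a majority to stay quorate:
>
>   expected_votes = 3
>   quorum         = floor(3 / 2) + 1 = 2
>   votes left after one node fails = 2  >=  2  ->  still quorate
>
> so it already survives a single node failure without the quorum disk.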
>
> <totem token="108000"/>
>> That is a VERY high number!
>>
> You're probably right. I changed this value back to the default (10 s):
> <totem token="10000"/>
>
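> A minimal sketch of how I pushed the change, assuming the usual cman
> tooling (and after bumping config_version in cluster.conf):
>
>   ccs_config_validate      # sanity-check the edited cluster.conf
>   cman_tool version -r     # distribute and activate the new version
>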
> [node1:~]# fence_node -vv node3
> fence node3 dev 0.0 agent fence_pcmk result: error from agent
> agent args: action=off port=node3 nodename=node3 agent=fence_pcmk
> fence node3 failed
>
> This is the messages log captured on node2, which is running the
> fencing resource for node3:
>
> [node2:~]# tail -0f /var/log/messages
> ...
> Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
> can_fence_host_with_device: node1-ipmi can not fence (off) node3:
> static-list
> Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
> can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
> Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
> can_fence_host_with_device: node1-ipmi can not fence (off) node3:
> static-list
> Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
> can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
> Jun 22 10:04:38 node2 stonith-ng[7382]: notice: log_operation: Operation
> 'off' [5288] (call 2 from stonith_admin.cman.7377) for host 'node3' with
> device 'node3-ipmi' returned: 0 (OK)
> Jun 22 10:05:44 node2 qdiskd[5948]: Node 3 evicted
>
>
> This is where the delay happens (~3.5 min):
>
>
> Jun 22 10:08:06 node2 corosync[5861]: [QUORUM] Members[2]: 1 2
> Jun 22 10:08:06 node2 corosync[5861]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Jun 22 10:08:06 node2 crmd[7386]: notice: crm_update_peer_state:
> cman_event_callback: Node node3[3] - state is now lost (was member)
> Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for
> shutdown action on node3
> Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback:
> Stonith/shutdown of node3 not matched
> Jun 22 10:08:06 node2 crmd[7386]: notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Jun 22 10:08:06 node2 rsyslogd-2177: imuxsock begins to drop messages from
> pid 5861 due to rate-limiting
> Jun 22 10:08:06 node2 kernel: dlm: closing connection to node 3
> Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_local_callback: Sending
> full refresh (origin=crmd)
> Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: shutdown (0)
> Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for
> shutdown action on node3
> Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback:
> Stonith/shutdown of node3 not matched
> Jun 22 10:08:06 node2 stonith-ng[7382]: notice: remote_op_done:
> Operation off of node3 by node2 for stonith_admin.cman.7377 at node1.753ce4e5:
> OK
> Jun 22 10:08:06 node2 fenced[6211]: fencing deferred to node1
> Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: probe_complete (true)
> Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify: Peer
> node3 was terminated (off) by node2 for node1: OK
> (ref=753ce4e5-a84a-491b-8ed9-044667946381) by client stonith_admin.cman.7377
> Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify:
> Notified CMAN that 'node3' is now fenced
> Jun 22 10:08:07 node2 rsyslogd-2177: imuxsock lost 108 messages from pid
> 5861 due to rate-limiting
> Jun 22 10:08:07 node2 pengine[7385]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> testvm102#011(node1)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
> testvm103#011(Started node1 -> node2)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> testvm105#011(node1)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> testvm108#011(node1)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
> testvm109#011(Started node1 -> node2)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> testvm111#011(node1)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> testvm114#011(node1)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
> testvm115#011(Started node1 -> node2)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> testvm117#011(node1)
> Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
> node1-ipmi#011(node2)
> ...
>
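> My reading of the timestamps above, assuming the default stonith_admin
> timeout of 120 s (I haven't changed it anywhere):
>
>   10:04:38  the 'off' operation completes on node2 (returned: 0 OK)
>   10:05:44  qdiskd evicts node 3 (~70 s = interval * tko after the request)
>   10:08:06  corosync forms the new membership, and only then does
>             remote_op_done report the result back to the caller
>
> So the IPMI power-off itself is quick, but the caller doesn't see the
> result until the membership change ~3.5 min later, long after its timer
> has expired.
>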
>> I noticed you're not a mailing list member. Please register if you want
>> your emails to come through without getting stuck in the moderator queue.
>>
> Thanks, man, I will.
>
> The problem still persists :(
>
>
>
> On Mon, Jun 22, 2015 at 8:02 AM, Digimer <lists at alteeve.ca> wrote:
>
>> On 21/06/15 02:12 PM, Milos Buncic wrote:
>> > Hey people
>> >
>> > I'm experiencing a very strange issue and it appears every time I try
>> > to fence a node.
>> > I have a test environment with a three-node cluster (CentOS 6.6 x86_64)
>> > where rgmanager has been replaced with pacemaker (CMAN + pacemaker).
>> >
>> > I've configured fencing with pcs for all three nodes
>> >
>> > Pacemaker:
>> > pcs stonith create node1-ipmi \
>> > fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer
>> > passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
>> > op monitor interval=10s timeout=30s
>>
>> Does 'fence_ipmilan ...' work when called manually from the command line?
>>
>
>>
>> > pcs constraint location node1-ipmi avoids node1
>> >
>> > pcs property set stonith-enabled=true
>> >
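>> > Roughly how I double-check the stonith setup afterwards (nothing
>> > exotic; the exact pcs syntax may differ slightly between versions):
>> >
>> >   pcs stonith show --full            # fence resources and their options
>> >   pcs property show stonith-enabled
>> >   stonith_admin --list-registered    # devices known to stonith-ng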
>> >
>> > CMAN - /etc/cluster/cluster.conf:
>> > <?xml version="1.0"?>
>> > <cluster config_version="10" name="mycluster">
>> > <fence_daemon/>
>> > <clusternodes>
>> > <clusternode name="node1" nodeid="1">
>> > <fence>
>> > <method name="pcmk-redirect">
>> > <device action="off" name="pcmk"
>> > port="node1"/>
>> > </method>
>> > </fence>
>> > </clusternode>
>> > <clusternode name="node2" nodeid="2">
>> > <fence>
>> > <method name="pcmk-redirect">
>> > <device action="off" name="pcmk"
>> > port="node2"/>
>> > </method>
>> > </fence>
>> > </clusternode>
>> > <clusternode name="node2" nodeid="3">
>> > <fence>
>> > <method name="pcmk-redirect">
>> > <device action="off" name="pcmk"
>> > port="node2"/>
>> > </method>
>> > </fence>
>> > </clusternode>
>>
>> Looks like the same node is defined twice, instead of 'node3'.
>>
>> > </clusternodes>
>> > <cman/>
>> > <fencedevices>
>> > <fencedevice agent="fence_pcmk" name="pcmk"/>
>> > </fencedevices>
>> > <rm>
>> > <failoverdomains/>
>> > <resources/>
>> > </rm>
>> > <logging debug="on"/>
>> > <quorumd interval="1" label="QuorumDisk"
>> > status_file="/qdisk_status" tko="70"/>
>>
>> Also, I'm not sure how well qdisk is tested/supported. Do you even need
>> it with three nodes?
>>
>> > <totem token="108000"/>
>>
>> That is a VERY high number!
>>
>> > </cluster>
>> >
>> > Every time I try to fence a node I get a timeout error, with the node
>> > being fenced in the end (on the second attempt), but I'm wondering why
>> > it takes so long to fence the node?
>>
>> Run 'fence_check' (this tests cman's fencing which is hooked into
>> pacemaker's stonith).
>>
>> > So when I run stonith_admin or fence_node (which in the end also runs
>> > stonith_admin, as you can clearly see from the log file) it always fails
>> > on the first attempt, my guess is because it doesn't get a status code
>> > back or something like that:
>> > strace stonith_admin --fence node1 --tolerance 5s --tag cman
>> >
>> > Partial output from strace:
>> > ...
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}], 1, 291) = 0 (Timeout)
>> > fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
>> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
>> > 0) = 0x7fb2a8c37000
>> > write(1, "Command failed: Timer expired\n", 30Command failed: Timer
>> > expired
>> > ) = 30
>> > poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
>> > shutdown(4, 2 /* send and receive */) = 0
>> > close(4) = 0
>> > munmap(0x7fb2a8b98000, 270336) = 0
>> > munmap(0x7fb2a8bda000, 8248) = 0
>> > munmap(0x7fb2a8b56000, 270336) = 0
>> > munmap(0x7fb2a8c3b000, 8248) = 0
>> > munmap(0x7fb2a8b14000, 270336) = 0
>> > munmap(0x7fb2a8c38000, 8248) = 0
>> > munmap(0x7fb2a8bdd000, 135168) = 0
>> > munmap(0x7fb2a8bfe000, 135168) = 0
>> > exit_group(-62) = ?
>> >
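>> > (Side note: if I read that exit status right, 62 is ETIME on Linux,
>> > i.e. "Timer expired", which matches the message printed above:
>> >
>> >   grep -w 62 /usr/include/asm-generic/errno.h
>> >   #define ETIME           62      /* Timer expired */
>> > )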
>> >
>> > Or via cman:
>> > [node1:~]# fence_node -vv node3
>> > fence node3 dev 0.0 agent fence_pcmk result: error from agent
>> > agent args: action=off port=node3 timeout=15 nodename=node3
>> agent=fence_pcmk
>> > fence node3 failed
>> >
>> >
>> > /var/log/messages:
>> > Jun 19 10:57:43 node1 stonith_admin[3804]: notice: crm_log_args:
>> > Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
>> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice: handle_request:
>> > Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1'
>> > with device '(any)'
>> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
>> > initiate_remote_stonith_op: Initiating remote operation off for node1:
>> > fbc7fe61-9451-4634-9c12-57d933ccd0a4 ( 0)
>> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
>> > can_fence_host_with_device: node2-ipmi can not fence (off) node1:
>> > static-list
>> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
>> > can_fence_host_with_device: node3-ipmi can fence (off) node3:
>> static-list
>> > Jun 19 10:57:54 node1 stonith-ng[8283]: warning: get_xpath_object: No
>> > match for //@st_delegate in /st-reply
>> > Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
>> > Jun 19 10:59:31 node1 corosync[7349]: [TOTEM ] A processor failed,
>> > forming new configuration.
>> > Jun 19 11:01:21 node1 corosync[7349]: [QUORUM] Members[2]: 1 2
>> > Jun 19 11:01:21 node1 corosync[7349]: [TOTEM ] A processor joined or
>> > left the membership and a new membership was formed.
>> > Jun 19 11:01:21 node1 crmd[8287]: notice: crm_update_peer_state:
>> > cman_event_callback: Node node3[3] - state is now lost (was member)
>> > Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
>> > Jun 19 11:01:21 node1 stonith-ng[8283]: notice: remote_op_done:
>> > Operation off of node3 by node2 for stonith_admin.cman.3804 at node1.
>> > com.fbc7fe61: OK
>> > Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
>> > Peer node3 was terminated (off) by node2 for node1: OK (
>> > ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client
>> stonith_admin.cman.3804
>> > Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
>> > Notified CMAN that 'node3' is now fenced
>> >
>> > Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
>> > Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence
>> > node3 (off)
>> > Jun 19 11:01:22 node1 stonith_admin[8068]: notice: crm_log_args:
>> > Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
>> > Jun 19 11:01:22 node1 stonith-ng[8283]: notice: handle_request:
>> > Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3'
>> > with device '(any)'
>> > Jun 19 11:01:22 node1 stonith-ng[8283]: notice:
>> > stonith_check_fence_tolerance: Target node3 was fenced (off) less than
>> > 5s ago by node2 on behalf of node1
>> > Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
>> >
>> >
>> >
>> > [node1:~]# ls -ahl /proc/22505/fd
>> > total 0
>> > dr-x------ 2 root root 0 Jun 19 11:55 .
>> > dr-xr-xr-x 8 root root 0 Jun 19 11:55 ..
>> > lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
>> > lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
>> > lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
>> > lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
>> > lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]
>> >
>> > [node1:~]# lsof -p 22505
>> > ...
>> > stonith_admin 22505 root 3u unix 0xffff880c14889b80 0t0
>> > 4061683 socket
>> > stonith_admin 22505 root 4u unix 0xffff880c2a4fbc40 0t0
>> > 4061684 socket
>> >
>> >
>> > Obviously it's trying to read some data from a unix socket but doesn't
>> > get anything from the other side. Is there anyone here who can explain
>> > to me why the fence command always fails on the first attempt?
>> >
>> > Thanks
>>
>> I noticed you're not a mailing list member. Please register if you want
>> your emails to come through without getting stuck in the moderator queue.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>>
>
>