[ClusterLabs] stonith_admin timeouts
Milos Buncic
htchak19 at gmail.com
Mon Jun 22 08:21:20 UTC 2015
Hey, first of all, thank you for your answer.
> Does 'fence_ipmilan ...' work when called manually from the command line?
>
Yes it does (off, on, status...)
[node1:~]# fence_ipmilan -v -p ******** -l fencer -L OPERATOR -P -a 1.1.1.1
-o status
Getting status of IPMI:1.1.1.1...Spawning: '/usr/bin/ipmitool -I lanplus -H
'1.1.1.1' -U 'fencer' -L 'OPERATOR' -P '[set]' -v chassis power status'...
Chassis power = On
Done
> Looks like the same node is defined twice, instead of 'node3'.
>
Sorry about that, I mistyped the hostname when I pasted the
configuration.
The configuration looks like this:
<?xml version="1.0"?>
<cluster config_version="10" name="mycluster">
<fence_daemon/>
<clusternodes>
<clusternode name="node1" nodeid="1">
<fence>
<method name="pcmk-redirect">
<device action="off" name="pcmk"
port="node1"/>
</method>
</fence>
</clusternode>
<clusternode name="node2" nodeid="2">
<fence>
<method name="pcmk-redirect">
<device action="off" name="pcmk"
port="node2"/>
</method>
</fence>
</clusternode>
<clusternode name="node3" nodeid="3">
<fence>
<method name="pcmk-redirect">
<device action="off" name="pcmk"
port="node3"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman/>
<fencedevices>
<fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>
<rm>
<failoverdomains/>
<resources/>
</rm>
<logging debug="on"/>
<quorumd interval="1" label="QuorumDisk"
status_file="/qdisk_status" tko="70"/>
<totem token="108000"/>
</cluster>
> Run 'fence_check' (this tests cman's fencing which is hooked into
> pacemaker's stonith).
>
fence_check run at Mon Jun 22 09:35:38 CEST 2015 pid: 16091
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Get node list: node1 node2 node3
Testing node1 fencing
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Checking how many fencing methods are configured for node node1
Found 1 method(s) to test for node node1
Testing node1 method 1 status
Testing node1 method 1: success
Testing node2 fencing
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Checking how many fencing methods are configured for node node2
Found 1 method(s) to test for node node2
Testing node2 method 1 status
Testing node2 method 1: success
Testing node3 fencing
Checking if cman is running: running
Checking if node is quorate: quorate
Checking if node is in fence domain: yes
Checking if node is fence master: this node is fence master
Checking if real fencing is in progress: no fencing in progress
Checking how many fencing methods are configured for node node3
Found 1 method(s) to test for node node3
Testing node3 method 1 status
Testing node3 method 1: success
cleanup: 0
> Also, I'm not sure how well qdisk is tested/supported. Do you even need
> it with three nodes?
>
Qdisk is in use in our production clusters, where we're running rgmanager,
so I just mirrored that configuration.
Hm, yes, in a three-node cluster we probably don't need it.
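If we end up dropping it, the change would just be deleting the <quorumd> element and bumping config_version, letting cman quorum on simple node majority (a sketch based on the config above; expected_votes is optional here since cman derives it from the node count):

```xml
<cluster config_version="11" name="mycluster">
  <!-- clusternodes / fencedevices / rm sections unchanged -->
  <cman expected_votes="3"/>
  <!-- the <quorumd .../> element removed entirely -->
  <totem token="10000"/>
</cluster>
```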
> <totem token="108000"/>
> That is a VERY high number!
>
You're probably right, I changed this value back to the default (10 s):
<totem token="10000"/>
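For reference, a back-of-the-envelope sketch of corosync's failure-detection latency under the two token values (assuming the consensus timeout is left at its default of 1.2 × token; this is an approximation only, retransmit/join tuning shifts the real figure):

```python
# Rough corosync failure-detection latency: a lost node is only
# declared dead after the token timeout expires, and a new membership
# forms after the consensus timeout (default: 1.2 x token).
def detection_window_s(token_ms, consensus_ms=None):
    if consensus_ms is None:
        consensus_ms = 1.2 * token_ms  # corosync default
    return (token_ms + consensus_ms) / 1000.0

print(detection_window_s(108000))  # old setting: ~237.6 s, about 4 minutes
print(detection_window_s(10000))   # new setting: ~22 s
```

So the old 108 s token alone accounts for roughly four minutes of delay before corosync even forms a new membership.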
[node1:~]# fence_node -vv node3
fence node3 dev 0.0 agent fence_pcmk result: error from agent
agent args: action=off port=node3 nodename=node3 agent=fence_pcmk
fence node3 failed
Messages log captured on node2, which is running the fencing resource for node3:
[node2:~]# tail -0f /var/log/messages
...
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node1-ipmi can not fence (off) node3:
static-list
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node1-ipmi can not fence (off) node3:
static-list
Jun 22 10:04:28 node2 stonith-ng[7382]: notice:
can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
Jun 22 10:04:38 node2 stonith-ng[7382]: notice: log_operation: Operation
'off' [5288] (call 2 from stonith_admin.cman.7377) for host 'node3' with
device 'node3-ipmi' returned: 0 (OK)
Jun 22 10:05:44 node2 qdiskd[5948]: Node 3 evicted
This is where the delay happens (~3.5 min):
Jun 22 10:08:06 node2 corosync[5861]: [QUORUM] Members[2]: 1 2
Jun 22 10:08:06 node2 corosync[5861]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Jun 22 10:08:06 node2 crmd[7386]: notice: crm_update_peer_state:
cman_event_callback: Node node3[3] - state is now lost (was member)
Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for
shutdown action on node3
Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback:
Stonith/shutdown of node3 not matched
Jun 22 10:08:06 node2 crmd[7386]: notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
origin=check_join_state ]
Jun 22 10:08:06 node2 rsyslogd-2177: imuxsock begins to drop messages from
pid 5861 due to rate-limiting
Jun 22 10:08:06 node2 kernel: dlm: closing connection to node 3
Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending
flush op to all hosts for: shutdown (0)
Jun 22 10:08:06 node2 crmd[7386]: warning: match_down_event: No match for
shutdown action on node3
Jun 22 10:08:06 node2 crmd[7386]: notice: peer_update_callback:
Stonith/shutdown of node3 not matched
Jun 22 10:08:06 node2 stonith-ng[7382]: notice: remote_op_done: Operation
off of node3 by node2 for stonith_admin.cman.7377 at node1.753ce4e5: OK
Jun 22 10:08:06 node2 fenced[6211]: fencing deferred to node1
Jun 22 10:08:06 node2 attrd[7384]: notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify: Peer
node3 was terminated (off) by node2 for node1: OK
(ref=753ce4e5-a84a-491b-8ed9-044667946381) by client stonith_admin.cman.7377
Jun 22 10:08:06 node2 crmd[7386]: notice: tengine_stonith_notify:
Notified CMAN that 'node3' is now fenced
Jun 22 10:08:07 node2 rsyslogd-2177: imuxsock lost 108 messages from pid
5861 due to rate-limiting
Jun 22 10:08:07 node2 pengine[7385]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm102#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
testvm103#011(Started node1 -> node2)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm105#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm108#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
testvm109#011(Started node1 -> node2)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm111#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm114#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Migrate
testvm115#011(Started node1 -> node2)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
testvm117#011(node1)
Jun 22 10:08:07 node2 pengine[7385]: notice: LogActions: Start
node1-ipmi#011(node2)
...
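For what it's worth, the gap in the log above works out to exactly the delay I'm seeing (comparing the last stonith-ng progress message with the moment the new membership forms):

```python
from datetime import datetime

fmt = "%b %d %H:%M:%S"
# 'Operation off ... returned: 0 (OK)' vs. '[QUORUM] Members[2]: 1 2'
fence_ok = datetime.strptime("Jun 22 10:04:38", fmt)
membership = datetime.strptime("Jun 22 10:08:06", fmt)
gap = (membership - fence_ok).total_seconds()
print(gap)  # 208.0 seconds, i.e. roughly 3.5 minutes
```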
> I noticed you're not a mailing list member. Please register if you want
> your emails to come through without getting stuck in the moderator queue.
>
Thanks man, I will.
The problem still persists :(
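The strace in the original message below shows stonith_admin polling its reply socket in fixed 500 ms slices until its overall timer expires, then printing "Command failed: Timer expired". A minimal sketch of that pattern (hypothetical names; the silent peer simulates stonith-ng never answering):

```python
import select
import socket
import time

def wait_for_reply(sock, total_timeout_s, slice_ms=500):
    """Poll a socket in fixed slices until data arrives or the overall
    timer expires -- the pattern visible in the strace output:
    repeated poll([{fd, POLLIN}], 1, 500) = 0 (Timeout)."""
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    deadline = time.monotonic() + total_timeout_s
    while True:
        remaining_ms = (deadline - time.monotonic()) * 1000
        if remaining_ms <= 0:
            return None  # timer expired with no reply
        if poller.poll(int(min(slice_ms, remaining_ms))):
            return sock.recv(4096)

# Peer that never sends anything, like stonith-ng in this failure mode.
client, _silent_peer = socket.socketpair()
reply = wait_for_reply(client, total_timeout_s=1.5)
print("Command failed: Timer expired" if reply is None else reply)
```

The point being: the client gives up on its own timer; the fence operation itself may still complete on another node afterwards, which matches the "failed first attempt, success on retry" behaviour.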
On Mon, Jun 22, 2015 at 8:02 AM, Digimer <lists at alteeve.ca> wrote:
> On 21/06/15 02:12 PM, Milos Buncic wrote:
> > Hey people
> >
> > I'm experiencing a very strange issue, and it appears every time I try
> > to fence a node.
> > I have a test environment with three node cluster (CentOS 6.6 x86_64)
> > where rgmanager is replaced with pacemaker (CMAN + pacemaker).
> >
> > I've configured fencing with pcs for all three nodes
> >
> > Pacemaker:
> > pcs stonith create node1-ipmi \
> > fence_ipmilan pcmk_host_list="node1" ipaddr=1.1.1.1 login=fencer
> > passwd=******** privlvl=OPERATOR power_wait=10 lanplus=1 action=off \
> > op monitor interval=10s timeout=30s
>
> Does 'fence_ipmilan ...' work when called manually from the command line?
>
>
> pcs constraint location node1-ipmi avoids node1
> >
> > pcs property set stonith-enabled=true
> >
> >
> > CMAN - /etc/cluster/cluster.conf:
> > <?xml version="1.0"?>
> > <cluster config_version="10" name="mycluster">
> > <fence_daemon/>
> > <clusternodes>
> > <clusternode name="node1" nodeid="1">
> > <fence>
> > <method name="pcmk-redirect">
> > <device action="off" name="pcmk"
> > port="node1"/>
> > </method>
> > </fence>
> > </clusternode>
> > <clusternode name="node2" nodeid="2">
> > <fence>
> > <method name="pcmk-redirect">
> > <device action="off" name="pcmk"
> > port="node2"/>
> > </method>
> > </fence>
> > </clusternode>
> > <clusternode name="node2" nodeid="3">
> > <fence>
> > <method name="pcmk-redirect">
> > <device action="off" name="pcmk"
> > port="node2"/>
> > </method>
> > </fence>
> > </clusternode>
>
> Looks like the same node is defined twice, instead of 'node3'.
>
> > </clusternodes>
> > <cman/>
> > <fencedevices>
> > <fencedevice agent="fence_pcmk" name="pcmk"/>
> > </fencedevices>
> > <rm>
> > <failoverdomains/>
> > <resources/>
> > </rm>
> > <logging debug="on"/>
> > <quorumd interval="1" label="QuorumDisk"
> > status_file="/qdisk_status" tko="70"/>
>
> Also, I'm not sure how well qdisk is tested/supported. Do you even need
> it with three nodes?
>
> > <totem token="108000"/>
>
> That is a VERY high number!
>
> > </cluster>
> >
> > Every time I try to fence a node I get a timeout error, with the node
> > being fenced in the end (on the second attempt), but I'm wondering why
> > it takes so long to fence a node?
>
> Run 'fence_check' (this tests cman's fencing which is hooked into
> pacemaker's stonith).
>
> > So when I run stonith_admin or fence_node (which in the end also runs
> > stonith_admin, as you can see clearly from the log file) it always
> > fails on the first attempt, my guess is because it doesn't get a
> > status code or something like that:
> > strace stonith_admin --fence node1 --tolerance 5s --tag cman
> >
> > Partial output from strace:
> > ...
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 500) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}], 1, 291) = 0 (Timeout)
> > fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 8), ...}) = 0
> > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> > 0) = 0x7fb2a8c37000
> > write(1, "Command failed: Timer expired\n", 30Command failed: Timer
> > expired
> > ) = 30
> > poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
> > shutdown(4, 2 /* send and receive */) = 0
> > close(4) = 0
> > munmap(0x7fb2a8b98000, 270336) = 0
> > munmap(0x7fb2a8bda000, 8248) = 0
> > munmap(0x7fb2a8b56000, 270336) = 0
> > munmap(0x7fb2a8c3b000, 8248) = 0
> > munmap(0x7fb2a8b14000, 270336) = 0
> > munmap(0x7fb2a8c38000, 8248) = 0
> > munmap(0x7fb2a8bdd000, 135168) = 0
> > munmap(0x7fb2a8bfe000, 135168) = 0
> > exit_group(-62) = ?
> >
> >
> > Or via cman:
> > [node1:~]# fence_node -vv node3
> > fence node3 dev 0.0 agent fence_pcmk result: error from agent
> > agent args: action=off port=node3 timeout=15 nodename=node3
> agent=fence_pcmk
> > fence node3 failed
> >
> >
> > /var/log/messages:
> > Jun 19 10:57:43 node1 stonith_admin[3804]: notice: crm_log_args:
> > Invoked: stonith_admin --fence node1 --tolerance 5s --tag cman
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice: handle_request:
> > Client stonith_admin.cman.3804.65de6378 wants to fence (off) 'node1'
> > with device '(any)'
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> > initiate_remote_stonith_op: Initiating remote operation off for node1:
> > fbc7fe61-9451-4634-9c12-57d933ccd0a4 ( 0)
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> > can_fence_host_with_device: node2-ipmi can not fence (off) node1:
> > static-list
> > Jun 19 10:57:43 node1 stonith-ng[8283]: notice:
> > can_fence_host_with_device: node3-ipmi can fence (off) node3: static-list
> > Jun 19 10:57:54 node1 stonith-ng[8283]: warning: get_xpath_object: No
> > match for //@st_delegate in /st-reply
> > Jun 19 10:59:00 node1 qdiskd[7409]: Node 3 evicted
> > Jun 19 10:59:31 node1 corosync[7349]: [TOTEM ] A processor failed,
> > forming new configuration.
> > Jun 19 11:01:21 node1 corosync[7349]: [QUORUM] Members[2]: 1 2
> > Jun 19 11:01:21 node1 corosync[7349]: [TOTEM ] A processor joined or
> > left the membership and a new membership was formed.
> > Jun 19 11:01:21 node1 crmd[8287]: notice: crm_update_peer_state:
> > cman_event_callback: Node node3[3] - state is now lost (was member)
> > Jun 19 11:01:21 node1 kernel: dlm: closing connection to node 3
> > Jun 19 11:01:21 node1 stonith-ng[8283]: notice: remote_op_done:
> > Operation off of node3 by node2 for stonith_admin.cman.3804 at node1.
> > com.fbc7fe61: OK
> > Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
> > Peer node3 was terminated (off) by node2 for node1: OK (
> > ref=fbc7fe61-9451-4634-9c12-57d933ccd0a4) by client
> stonith_admin.cman.3804
> > Jun 19 11:01:21 node1 crmd[8287]: notice: tengine_stonith_notify:
> > Notified CMAN that 'node3' is now fenced
> >
> > Jun 19 11:01:21 node1 fenced[7625]: fencing node node3
> > Jun 19 11:01:22 node1 fence_pcmk[8067]: Requesting Pacemaker fence
> > node3 (off)
> > Jun 19 11:01:22 node1 stonith_admin[8068]: notice: crm_log_args:
> > Invoked: stonith_admin --fence node3 --tolerance 5s --tag cman
> > Jun 19 11:01:22 node1 stonith-ng[8283]: notice: handle_request:
> > Client stonith_admin.cman.8068.fcd7f751 wants to fence (off) 'node3'
> > with device '(any)'
> > Jun 19 11:01:22 node1 stonith-ng[8283]: notice:
> > stonith_check_fence_tolerance: Target node3 was fenced (off) less than
> > 5s ago by node2 on behalf of node1
> > Jun 19 11:01:22 node1 fenced[7625]: fence node3 success
> >
> >
> >
> > [node1:~]# ls -ahl /proc/22505/fd
> > total 0
> > dr-x------ 2 root root 0 Jun 19 11:55 .
> > dr-xr-xr-x 8 root root 0 Jun 19 11:55 ..
> > lrwx------ 1 root root 64 Jun 19 11:56 0 -> /dev/pts/8
> > lrwx------ 1 root root 64 Jun 19 11:56 1 -> /dev/pts/8
> > lrwx------ 1 root root 64 Jun 19 11:55 2 -> /dev/pts/8
> > lrwx------ 1 root root 64 Jun 19 11:56 3 -> socket:[4061683]
> > lrwx------ 1 root root 64 Jun 19 11:56 4 -> socket:[4061684]
> >
> > [node1:~]# lsof -p 22505
> > ...
> > stonith_admin 22505 root 3u unix 0xffff880c14889b80 0t0
> > 4061683 socket
> > stonith_admin 22505 root 4u unix 0xffff880c2a4fbc40 0t0
> > 4061684 socket
> >
> >
> > Obviously it's trying to read some data from a unix socket but doesn't
> > get anything from the other side. Can anyone explain why the fence
> > command always fails on the first attempt?
> >
> > Thanks
>
> I noticed you're not a mailing list member. Please register if you want
> your emails to come through without getting stuck in the moderator queue.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>