[Pacemaker] Timeout value of STONITH resource is too large

Kazunori INOUE inouekazu at intellilink.co.jp
Mon Jul 30 06:13:40 EDT 2012


Hi,

I am using Pacemaker-1.1.
- glue       (2012 Jul 16) 2719:18489f275f75
- libqb      (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
- corosync   (2012 Jul 12) 908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
- pacemaker  (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc

The timeout that stonithd applies to the _start_ operation of a STONITH
resource is too large. As a result, the plugin process is still running
even after the start operation has timed out.

The following gdb session was captured while the STONITH resource was starting.
----
   [root@dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep stonithd`
   (gdb) b run_stonith_agent
   Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
   (gdb) c
   Continuing.

   Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
    <snip>
   479     {
   (gdb) bt
   #0  run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
       victim=0x0, device_args=Traceback (most recent call last):0xcffe30, port_map=
       Traceback (most recent call last):0xcffe80, agent_result=0x7fff70214ef4,
       output=0x0, track=0xe11d20) at st_client.c:479
   #1  0x0000000000406230 in stonith_device_execute (device=0xe10ff0) at commands.c:140
   #2  0x0000000000406404 in stonith_device_dispatch (user_data=0xe10ff0) at commands.c:160
   #3  0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160, callback=
       0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at mainloop.c:105
   #4  0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
   #5  0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
   #6  0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
   #7  0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at main.c:853
   (gdb) n 15
   Detaching after fork from child process 28915.
   510       if (pid) {
   (gdb) n 15
   542                 track->pid = pid;
   (gdb) list
   537             track->stdout = p_read_fd;
   538             g_child_watch_add(pid, track->done, track);
   539             crm_trace("Op: %s on %s, pid: %d, timeout: %ds", action, agent, pid, track->timeout);
   540
   541             if (track->timeout) {
   542                 track->pid = pid;
   543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
   544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
   545
   546             } else {
   (gdb) n
   543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
   (gdb) n
   544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
   (gdb) p agent
   $1 = 0xe0f820 "fence_legacy"
   (gdb) p action
   $2 = 0xe11fb0 "monitor"
   (gdb) p args
   $3 = 0xe11500 "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
 * (gdb) p track->timeout
   $4 = 61000
 * (gdb) p 1000*track->timeout
   $5 = 61000000
----
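To spell out the arithmetic in the lines marked with "*": the expression
1000*track->timeout on lines 543 and 544 only yields milliseconds (the unit
g_timeout_add() expects) if track->timeout is in seconds. The value printed
above, 61000, looks as if it is already in milliseconds, since the start
timeout is 61 seconds. The following is a minimal sketch of that calculation,
not Pacemaker code; the helper names timer_delay_ms, ms_to_hours and
print_delay are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper mirroring the expression 1000*track->timeout from
 * st_client.c line 543: it treats its argument as seconds and returns the
 * delay in milliseconds, which is the unit g_timeout_add() takes. */
long long timer_delay_ms(long long timeout)
{
    assert(timeout >= 0);   /* a negative timeout makes no sense here */
    return 1000LL * timeout;
}

/* Hypothetical helper: convert a millisecond delay to hours. */
double ms_to_hours(long long ms)
{
    return (double)ms / (1000.0 * 60.0 * 60.0);
}

/* Print the SIGTERM timer delay that results from a given timeout value. */
void print_delay(long long timeout)
{
    long long ms = timer_delay_ms(timeout);
    printf("timeout=%lld -> %lld ms (%.1f hours)\n",
           timeout, ms, ms_to_hours(ms));
}
```

With an argument of 61 the delay is 61000 ms, which is the intended 61
seconds. With 61000 (the value of $4) the delay is 61000000 ms, matching $5,
which is roughly 16.9 hours. If that reading is right, st_child_term would not
fire until long after the start operation has been reported as timed out.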
1. I added "sleep 3600" to status() of
   /usr/lib64/stonith/plugins/external/libvirt.

   [root@dev1 external]# diff -u libvirt.ORG libvirt
   --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
   +++ libvirt     2012-07-30 13:36:19.661431208 +0900
   @@ -221,6 +221,7 @@
        ;;

        status)
   +    sleep 3600
        libvirt_check_config
        libvirt_status
        exit $?

2. service corosync start ; service pacemaker start
3. cibadmin -U -x test.xml
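4. After waiting 61 seconds (the timeout value of start), the start
   operation is reported as timed out, but the plugin processes are still
   running, and lrmd and crmd are using almost 100% CPU: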

   [root@dev1 ~]# crm_mon -rf1
   ============
   Last updated: Mon Jul 30 13:18:48 2012
   Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
   Stack: corosync
   Current DC: dev1 (-1788499776) - partition with quorum
   Version: 1.1.7-33119da
   2 Nodes configured, unknown expected votes
   1 Resources configured.
   ============

   Online: [ dev1 dev2 ]

   Full list of resources:

    f-2    (stonith:external/libvirt):     Started dev1 FAILED

   Migration summary:
   * Node dev2:
   * Node dev1:
      f-2: migration-threshold=1 fail-count=1000000

   Failed actions:
 *     f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out): unknown error

   [root@dev1 ~]# ps -ef|egrep "UID|corosync|pacemaker|stonith|fence|sleep"
   UID    PID  PPID  C STIME TTY     TIME CMD
   root 28840     1  0 13:13 ?   00:00:01 corosync
   root 28858     1  0 13:13 ?   00:00:00 pacemakerd
   496  28860 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/cib
   root 28861 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/stonithd
   root 28862 28858 73 13:13 ?   00:04:16 /usr/libexec/pacemaker/lrmd
   496  28863 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/attrd
   496  28864 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/pengine
   496  28865 28858 51 13:13 ?   00:02:58 /usr/libexec/pacemaker/crmd
 * root 28915 28861  0 13:15 ?   00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
 * root 28916 28915  0 13:15 ?   00:00:00 stonith -t external/libvirt -E -S
 * root 28921 28916  0 13:15 ?   00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
   root 28925 28921  0 13:15 ?   00:00:00 sleep 3600

   [root@dev1 ~]# top -bn1
   top - 13:21:26 up 5 days,  3:23,  5 users,  load average: 1.99, 1.42, 0.72
   Tasks: 198 total,   3 running, 195 sleeping,   0 stopped,   0 zombie
   Cpu(s):  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
   Mem:   5089052k total,  2423104k used,  2665948k free,   265756k buffers
   Swap:  1048568k total,        0k used,  1048568k free,  1717712k cached

     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 * 28862 root      20   0 83816 3412 2572 R 98.2  0.1   6:17.18 lrmd
 * 28865 hacluste  20   0  166m 6380 3428 R 98.2  0.1   4:59.84 crmd
   28860 hacluste  20   0 93888 7192 4472 S  2.0  0.1   0:00.23 cib
   29052 root      20   0 15024 1136  792 R  2.0  0.0   0:00.01 top
       1 root      20   0 19348 1520 1212 S  0.0  0.0   0:00.77 init
       2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
       3 root      RT   0     0    0    0 S  0.0  0.0   0:06.85 migration/0
       4 root      20   0     0    0    0 S  0.0  0.0  14:25.15 ksoftirqd/0
       5 root      RT   0     0    0    0 S  0.0  0.0   0:00.10 migration/0

Best Regards,
Kazunori INOUE
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.xml
Type: text/xml
Size: 2059 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120730/5e5266db/attachment-0002.xml>