[Pacemaker] Timeout value of STONITH resource is too large
Kazunori INOUE
inouekazu at intellilink.co.jp
Thu Aug 2 05:37:45 EDT 2012
Hi David,
The resource (process forked from stonithd) stays until sleep is completed.
stonithd does not call st_child_term().
I set 61 seconds in timeout of start.
Therefore, the value of track->timeout in run_stonith_agent() is 61000(ms).
However, it is after '1000*track->timeout' milliseconds that st_child_term()
is called. That is, it is approximately _17 hours_ later.
- lib/fencing/st_client.c
476 run_stonith_agent(const char *agent, const char *action, const char *victim,
477 GHashTable * device_args, GHashTable * port_map, int *agent_result, char **output,
478 async_command_t * track)
541 if (track->timeout) {
542 track->pid = pid;
543 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
544 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
Best Regards,
Kazunori INOUE
(12.08.01 00:01), David Vossel wrote:
> ----- Original Message -----
>> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
>> To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
>> Cc: shimazakik at intellilink.co.jp
>> Sent: Monday, July 30, 2012 5:13:40 AM
>> Subject: [Pacemaker] Timeout value of STONITH resource is too large
>>
>> Hi,
>>
>> I am using Pacemaker-1.1.
>> - glue (2012 Jul 16) 2719:18489f275f75
>> - libqb (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
>> - corosync (2012 Jul 12) 908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
>> - pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
>>
>> The timeout value of the _start_ operation of STONITH resource is
>> large.
>> Therefore, even after the start operation is timed out, the process
>> of
>> plugin remains.
>
> How long after the timeout occurs does the process stay around? Does it terminate a few seconds after the timeout, or does the resource wait until the entire duration of the sleep 3600 finishes?
>
> -- Vossel
>
>
>>
>> The following is gdb at the time of STONITH resource starting.
>> ----
>> [root at dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep
>> stonithd`
>> (gdb) b run_stonith_agent
>> Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
>> (gdb) c
>> Continuing.
>>
>> Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy",
>> action=0xe11fb0 "monitor",
>> <snip>
>> 479 {
>> (gdb) bt
>> #0 run_stonith_agent (agent=0xe0f820 "fence_legacy",
>> action=0xe11fb0 "monitor",
>> victim=0x0, device_args=Traceback (most recent call
>> last):0xcffe30, port_map=
>> Traceback (most recent call last):0xcffe80,
>> agent_result=0x7fff70214ef4,
>> output=0x0, track=0xe11d20) at st_client.c:479
>> #1 0x0000000000406230 in stonith_device_execute (device=0xe10ff0)
>> at commands.c:140
>> #2 0x0000000000406404 in stonith_device_dispatch
>> (user_data=0xe10ff0) at commands.c:160
>> #3 0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160,
>> callback=
>> 0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at
>> mainloop.c:105
>> #4 0x0000003642638f0e in g_main_context_dispatch () from
>> /lib64/libglib-2.0.so.0
>> #5 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
>> #6 0x000000364263cd55 in g_main_loop_run () from
>> /lib64/libglib-2.0.so.0
>> #7 0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at
>> main.c:853
>> (gdb) n 15
>> Detaching after fork from child process 28915.
>> 510 if (pid) {
>> (gdb) n 15
>> 542 track->pid = pid;
>> (gdb) list
>> 537 track->stdout = p_read_fd;
>> 538 g_child_watch_add(pid, track->done, track);
>> 539 crm_trace("Op: %s on %s, pid: %d, timeout: %ds",
>> action, agent, pid, track->timeout);
>> 540
>> 541 if (track->timeout) {
>> 542 track->pid = pid;
>> 543 track->timer_sigterm =
>> g_timeout_add(1000*track->timeout, st_child_term, track);
>> 544 track->timer_sigkill =
>> g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
>> 545
>> 546 } else {
>> (gdb) n
>> 543 track->timer_sigterm =
>> g_timeout_add(1000*track->timeout, st_child_term, track);
>> (gdb) n
>> 544 track->timer_sigkill =
>> g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
>> (gdb) p agent
>> $1 = 0xe0f820 "fence_legacy"
>> (gdb) p action
>> $2 = 0xe11fb0 "monitor"
>> (gdb) p args
>> $3 = 0xe11500
>> "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
>> * (gdb) p track->timeout
>> $4 = 61000
>> * (gdb) p 1000*track->timeout
>> $5 = 61000000
>> ----
>> 1. I added "sleep 3600" to status() of
>> /usr/lib64/stonith/plugins/external/libvirt.
>>
>> [root at dev1 external]# diff -u libvirt.ORG libvirt
>> --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
>> +++ libvirt 2012-07-30 13:36:19.661431208 +0900
>> @@ -221,6 +221,7 @@
>> ;;
>>
>> status)
>> + sleep 3600
>> libvirt_check_config
>> libvirt_status
>> exit $?
>>
>> 2. service corosync start ; service pacemaker start
>> 3. cibadmin -U -x test.xml
>> 4. When I wait for 61 seconds (timeout value of start),
>>
>> [root at dev1 ~]# crm_mon -rf1
>> ============
>> Last updated: Mon Jul 30 13:18:48 2012
>> Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
>> Stack: corosync
>> Current DC: dev1 (-1788499776) - partition with quorum
>> Version: 1.1.7-33119da
>> 2 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>>
>> Online: [ dev1 dev2 ]
>>
>> Full list of resources:
>>
>> f-2 (stonith:external/libvirt): Started dev1 FAILED
>>
>> Migration summary:
>> * Node dev2:
>> * Node dev1:
>> f-2: migration-threshold=1 fail-count=1000000
>>
>> Failed actions:
>> * f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out):
>> unknown error
>>
>> [root at dev1 ~]# ps -ef|egrep
>> "UID|corosync|pacemaker|stonith|fence|sleep"
>> UID PID PPID C STIME TTY TIME CMD
>> root 28840 1 0 13:13 ? 00:00:01 corosync
>> root 28858 1 0 13:13 ? 00:00:00 pacemakerd
>> 496 28860 28858 0 13:13 ? 00:00:00 /usr/libexec/pacemaker/cib
>> root 28861 28858 0 13:13 ? 00:00:00
>> /usr/libexec/pacemaker/stonithd
>> root 28862 28858 73 13:13 ? 00:04:16 /usr/libexec/pacemaker/lrmd
>> 496 28863 28858 0 13:13 ? 00:00:00
>> /usr/libexec/pacemaker/attrd
>> 496 28864 28858 0 13:13 ? 00:00:00
>> /usr/libexec/pacemaker/pengine
>> 496 28865 28858 51 13:13 ? 00:02:58 /usr/libexec/pacemaker/crmd
>> * root 28915 28861 0 13:15 ? 00:00:00 /usr/bin/perl
>> /usr/sbin/fence_legacy
>> * root 28916 28915 0 13:15 ? 00:00:00 stonith -t external/libvirt
>> -E -S
>> * root 28921 28916 0 13:15 ? 00:00:00 /bin/sh
>> /usr/lib64/stonith/plugins/external/libvirt status
>> root 28925 28921 0 13:15 ? 00:00:00 sleep 3600
>>
>> [root at dev1 ~]# top -bn1
>> top - 13:21:26 up 5 days, 3:23, 5 users, load average: 1.99,
>> 1.42, 0.72
>> Tasks: 198 total, 3 running, 195 sleeping, 0 stopped, 0
>> zombie
>> Cpu(s): 0.7%us, 0.7%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi,
>> 0.0%si, 0.0%st
>> Mem: 5089052k total, 2423104k used, 2665948k free, 265756k
>> buffers
>> Swap: 1048568k total, 0k used, 1048568k free, 1717712k
>> cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
>> COMMAND
>> * 28862 root 20 0 83816 3412 2572 R 98.2 0.1 6:17.18 lrmd
>> * 28865 hacluste 20 0 166m 6380 3428 R 98.2 0.1 4:59.84 crmd
>> 28860 hacluste 20 0 93888 7192 4472 S 2.0 0.1 0:00.23 cib
>> 29052 root 20 0 15024 1136 792 R 2.0 0.0 0:00.01 top
>> 1 root 20 0 19348 1520 1212 S 0.0 0.0 0:00.77 init
>> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00
>> kthreadd
>> 3 root RT 0 0 0 0 S 0.0 0.0 0:06.85
>> migration/0
>> 4 root 20 0 0 0 0 S 0.0 0.0 14:25.15
>> ksoftirqd/0
>> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.10
>> migration/0
>>
>> Best Regards,
>> Kazunori INOUE
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
More information about the Pacemaker
mailing list