[Pacemaker] Among monitor of STONITH resource, monitor of other resources is not performed

David Vossel dvossel at redhat.com
Tue Jul 31 17:49:20 EDT 2012



----- Original Message -----
> From: "David Vossel" <dvossel at redhat.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, July 31, 2012 10:11:44 AM
> Subject: Re: [Pacemaker] Among monitor of STONITH resource, monitor of other resources is not performed
> 
> ----- Original Message -----
> > From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
> > To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
> > Cc: shimazakik at intellilink.co.jp
> > Sent: Monday, July 30, 2012 5:15:26 AM
> > Subject: [Pacemaker] Among monitor of STONITH resource, monitor of
> > other resources is not performed
> > 
> > Hi,
> > 
> > I am using Pacemaker-1.1.
> > - glue       (2012 Jul 16) 2719:18489f275f75
> > - libqb      (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
> > - corosync   (2012 Jul 19) b9eb19e623d2b69c86cf78f1fa50a004c804ac20
> > - pacemaker  (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
> > 
> > Monitor operations of other resources are not performed while the
> > monitor of a STONITH resource is running.
> 
> Looking at the code I can see this is true.  How important is it for
> the stonith monitors to be able to be performed in parallel with the
> other resource monitors?   I would expect the stonith monitor
> interval to be quite large compared to all the other monitors,
> meaning that it only executes once every hour or so.
> 
> -- Vossel

Yeah, this should be fixed. Can you please create an issue for this on bugs.clusterlabs.org and attach your crm_report?

-- Vossel


> 
> > 1. service corosync start ; service pacemaker start
> > 2. cibadmin -U -x test.xml
> >    (STONITH resource + Dummy resource are started on the same node)
> > 3. After the libvirt STONITH resource started, I added a sleep to
> >    status() of the libvirt plugin.
> > 
> >    [root at dev1 external]# diff -u libvirt.ORG libvirt
> >    --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
> >    +++ libvirt     2012-07-30 13:36:19.661431208 +0900
> >    @@ -221,6 +221,7 @@
> >         ;;
> > 
> >         status)
> >    +    sleep 3600
> >         libvirt_check_config
> >         libvirt_status
> >         exit $?
> > 
> >    [root at dev1 ~]# ps -ef|egrep "UID|corosync|pacemaker|stonith|fence|sleep"
> >    UID    PID  PPID  C STIME TTY     TIME CMD
> >    root 18567     1  0 18:47 ?   00:00:02 corosync
> >    root 18585     1  0 18:47 ?   00:00:00 pacemakerd
> >    496  18587 18585  0 18:47 ?   00:00:00 /usr/libexec/pacemaker/cib
> >    root 18588 18585  0 18:47 ?   00:00:00 /usr/libexec/pacemaker/stonithd
> >    root 18589 18585 76 18:47 ?   00:05:27 /usr/libexec/pacemaker/lrmd
> >    496  18590 18585  0 18:47 ?   00:00:00 /usr/libexec/pacemaker/attrd
> >    496  18591 18585  0 18:47 ?   00:00:00 /usr/libexec/pacemaker/pengine
> >    496  18592 18585  0 18:47 ?   00:00:00 /usr/libexec/pacemaker/crmd
> >    root 18767 18588  0 18:48 ?   00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
> >    root 18768 18767  0 18:48 ?   00:00:00 stonith -t external/libvirt -E -S
> >    root 18778 18768  0 18:48 ?   00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
> >    root 18792 18778  0 18:48 ?   00:00:00 sleep 3600
> > 
> > 4. Then the monitor of the Dummy resource is not performed.
> > 
> > The following shows the behavior of lrmd at that time.
> > 
> > # gdb /usr/libexec/pacemaker/lrmd `pgrep lrmd`
> > (gdb) bt
> > #0  0x0000003f808e83e2 in recv () from /lib64/libc.so.6
> > #1  0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8, msg=0x111f390,
> >     len=20480, timeout=500) at ipc_us.c:299
> > #2  0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390, msg_len=
> >     20480, ms_timeout=500) at ipcc.c:249
> > #3  0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580, message=0x111d860, reply=
> >     0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
> > #4  0x00007f0de3c97e29 in stonith_send_command (stonith=0x111a6c0, op=
> >     0x7f0de3c998dd "st_execute", data=0x1119170, output_data=0x0, call_options=4096,
> >     timeout=61000) at st_client.c:1676
> > #5  0x00007f0de3c94bd1 in stonith_api_call (stonith=0x111a6c0, call_options=4096,
> >     id=0x1129860 "f-2", action=0x7f0de3c998f7 "monitor", victim=0x0, timeout=61000,
> >     output=0x0) at st_client.c:951
> > #6  0x00007f0de3c94d31 in stonith_api_monitor (stonith=0x111a6c0, call_options=4096,
> >     id=0x1129860 "f-2", timeout=61000) at st_client.c:985
> > #7  0x00000000004044e2 in lrmd_rsc_execute_stonith (rsc=0x111f130, cmd=0x1129660) at
> >     lrmd.c:522
> > #8  0x0000000000404cd6 in lrmd_rsc_execute (rsc=0x111f130) at lrmd.c:667
> > #9  0x0000000000404d2d in lrmd_rsc_dispatch (user_data=0x111f130) at lrmd.c:678
> > #10 0x00007f0de42dcd00 in crm_trigger_dispatch (source=0x111f300, callback=
> >     0x404d06 <lrmd_rsc_dispatch>, userdata=0x111f300) at mainloop.c:105
> > #11 0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> > #12 0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
> > #13 0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #14 0x0000000000402d3f in main (argc=1, argv=0x7fffc8ec5188) at main.c:285
> > (gdb) fin
> > Run till exit from #0  0x0000003f808e83e2 in recv () from /lib64/libc.so.6
> > 0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=0x1118ee8, msg=0x111f390,
> >     len=20480, timeout=500) at ipc_us.c:299
> > 299         result = recv(one_way->u.us.sock, &data[processed], to_recv,
> > (gdb) fin
> > Run till exit from #0  0x00007f0de3820062 in qb_ipc_us_recv_at_most (one_way=
> >     0x1118ee8, msg=0x111f390, len=20480, timeout=500) at ipc_us.c:299
> > 0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=0x111f390, msg_len=20480,
> >     ms_timeout=500) at ipcc.c:249
> > 249         res = c->funcs.recv(&c->response, msg_ptr, msg_len, ms_timeout);
> > Value returned is $1 = -11
> > (gdb) fin
> > Run till exit from #0  0x00007f0de381a28e in qb_ipcc_recv (c=0x1118ba0, msg_ptr=
> >     0x111f390, msg_len=20480, ms_timeout=500) at ipcc.c:249
> > 0x00007f0de42bc5fb in crm_ipc_send (client=0x111b580, message=0x111d860, reply=
> >     0x7fffc8ec4c60, ms_timeout=61060000) at ipc.c:517
> > 517                 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
> > Value returned is $2 = -11
> > (gdb) n
> > 518                 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
> > (gdb) p rc
> > $3 = -11
> > (gdb) n
> > 522             } while(time(NULL) < timeout);
> > (gdb) n
> > 517                 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
> > (gdb) n
> > 518                 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
> > (gdb) p rc
> > $4 = -11
> > (gdb) n
> > 522             } while(time(NULL) < timeout);
> > (gdb) n
> > 517                 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
> > (gdb) n
> > 518                 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
> > (gdb) p rc
> > $5 = -11
> > (gdb) n
> > 522             } while(time(NULL) < timeout);
> > (gdb) n
> > 517                 rc = qb_ipcc_recv(client->ipc, client->buffer, client->buf_size, 500);
> > (gdb) n
> > 518                 if(rc > 0 || crm_ipc_connected(client) == FALSE) {
> > (gdb) p rc
> > $6 = -11
> > (gdb)
> > 
> > It seems that lrmd keeps retrying the receive of the reply from
> > stonithd without returning to the g_main_loop. Therefore, the
> > monitor of the Dummy resource is not performed.
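
Side note for readers of this thread: the loop that gdb keeps stepping through above (ipc.c lines 517 to 522) has roughly the shape sketched below. This is a simplified, standalone C illustration and not the actual Pacemaker source. The names wait_for_reply(), recv_fn, struct wait_result, struct fake_peer and fake_recv() are hypothetical and exist only for this example, and the connection check from line 518 is left out. The -11 in the trace matches -EAGAIN on Linux. The only point is that the caller stays inside the do/while until a reply arrives or the deadline passes, so anything else scheduled on the same main loop has to wait.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical receive callback: returns bytes received (> 0),
 * or -EAGAIN when no reply is available yet. */
typedef int (*recv_fn)(void *ctx, int ms_timeout);

struct wait_result {
    int rc;         /* last return code from the receive callback */
    long attempts;  /* how many times the callback was invoked */
};

/* Synchronous wait: retry the receive until a reply arrives or the
 * deadline passes. The caller is blocked for the whole duration. */
static struct wait_result
wait_for_reply(recv_fn recv_reply, void *ctx, time_t deadline)
{
    struct wait_result res = { -EAGAIN, 0 };

    assert(recv_reply != NULL);
    do {
        res.rc = recv_reply(ctx, 500);  /* 500 ms per attempt, as in the trace */
        res.attempts++;
        if (res.rc > 0) {
            break;                      /* got a reply */
        }
    } while (time(NULL) < deadline);

    return res;
}

/* Hypothetical peer that answers only on the Nth receive attempt. */
struct fake_peer {
    long calls;
    long reply_on_call;
};

static int
fake_recv(void *ctx, int ms_timeout)
{
    struct fake_peer *peer = ctx;

    (void) ms_timeout;  /* the fake never sleeps; it returns at once */
    peer->calls++;
    return (peer->calls >= peer->reply_on_call) ? 1 : -EAGAIN;
}
```

With a fake peer that answers on the 1000th attempt, wait_for_reply() goes around the loop 1000 times before it returns. With a deadline that is already in the past, it gives up after a single attempt and returns -EAGAIN.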
> > 
> > [root at dev1 ~]# top -bn1
> > top - 18:53:31 up 5 days,  8:56,  4 users,  load average: 0.98, 0.63, 0.44
> > Tasks: 198 total,   2 running, 196 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  0.9%us,  1.2%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Mem:   5089052k total,  2417444k used,  2671608k free,   266660k buffers
> > Swap:  1048568k total,        0k used,  1048568k free,  1724616k cached
> > 
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 18589 root      20   0 83932 3448 2548 R 98.2  0.1   4:15.89 lrmd
> >     1 root      20   0 19348 1520 1212 S  0.0  0.0   0:00.78 init
> >     2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
> >     3 root      RT   0     0    0    0 S  0.0  0.0   0:06.93 migration/0
> >     4 root      20   0     0    0    0 S  0.0  0.0  15:23.59 ksoftirqd/0
> >     5 root      RT   0     0    0    0 S  0.0  0.0   0:00.10 migration/0
> > 
> > Best Regards,
> > Kazunori INOUE
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 
> 