[Pacemaker] hangs pending

Mon Jan 13 23:34:32 EST 2014

14.01.2014, 06:25, "Andrew Beekhof" <andrew at beekhof.net>:
> Apart from anything else, your timeout needs to be bigger:
>
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>

Bigger than that?
By :21, node2 had already booted long ago and was (almost) working.
# cat /var/log/cluster/mystonith.log
.....
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-devdescr
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-devid
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(): getinfo-xml
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru): getconfignames
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru): status
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru): getconfignames
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () STONITH DEBUG(/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru): reset dev-cluster2-node2.unix.tensor.ru
Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
.......
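
A quick check of the arithmetic, assuming both logs are stamped with the same local clock on node1 (the stonith-ng line carries no timezone, so that is an assumption). The helper name to_seconds is made up for this sketch:

```shell
# Convert HH:MM:SS into seconds since midnight (awk treats "08" as decimal 8).
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Timestamps copied from the two logs quoted in this message (same day, Jan 13).
reset_sent=$(to_seconds 12:11:37)      # mystonith.log: "reset dev-cluster2-node2..."
timer_expired=$(to_seconds 12:21:36)   # stonith-ng: "returned: -62 (Timer expired)"

elapsed=$((timer_expired - reset_sent))
echo "elapsed: ${elapsed}s"
```

That gives 599 seconds between sending the reset and the "Timer expired" error, which would be consistent with the stonith-timeout="600s" set earlier in the thread, so it looks as if the operation ran until the configured limit.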

> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>  13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>  10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>  10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>    10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>     On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>      08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>      On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>       Hi, ALL.
>>>>>>>>>>>>
>>>>>>>>>>>>       I'm still trying to cope with the fact that after fencing, the node hangs in "pending".
>>>>>>>>>>>      Please define "pending".  Where did you see this?
>>>>>>>>>>      In crm_mon:
>>>>>>>>>>      ......
>>>>>>>>>>      Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>      ......
>>>>>>>>>>
>>>>>>>>>>      The experiment was like this:
>>>>>>>>>>      Four nodes in the cluster.
>>>>>>>>>>      On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>>>>>      After that, the remaining nodes start rebooting it constantly, under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>      Then "Too many failures ...." appeared in the log.
>>>>>>>>>>      All this time the status in crm_mon is "pending".
>>>>>>>>>>      Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>      Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>
>>>>>>>>>>      Now I am in the following state:
>>>>>>>>>>      I tried to locate the problem, and this is what I came up with.
>>>>>>>>>>      I set a large value in the property stonith-timeout="600s".
>>>>>>>>>>      And got the following behavior:
>>>>>>>>>>      1. pkill -4 corosync
>>>>>>>>>>      2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>      3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>>     Hmmm.... what version of pacemaker?
>>>>>>>>>     This sounds like a timing issue that we fixed a while back
>>>>>>>>    It was version 1.1.11 from December 3.
>>>>>>>>    I will now try a full update and retest.
>>>>>>>   That should be recent enough.  Can you create a crm_report the next time you reproduce?
>>>>>>  Yes, of course. There will be a small delay.... :)
>>>>>>
>>>>>>  ......
>>>>>>  cc1: warnings being treated as errors
>>>>>>  upstart.c: In function ‘upstart_job_property’:
>>>>>>  upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>  upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>  upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>  make[1]: *** [all-recursive] Error 1
>>>>>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>  make: *** [core] Error 1
>>>>>>
>>>>>>  I'm trying to solve this problem.
>>>>>  It won't be solved quickly...
>>>>>
>>>>>  https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>  g_variant_lookup_value () Since 2.28
>>>>>
>>>>>  # yum list installed glib2
>>>>>  Loaded plugins: fastestmirror, rhnplugin, security
>>>>>  This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>  Loading mirror speeds from cached hostfile
>>>>>  Installed Packages
>>>>>  glib2.x86_64                                                              2.26.1-3.el6                                                               installed
>>>>>
>>>>>  # cat /etc/issue
>>>>>  CentOS release 6.5 (Final)
>>>>>  Kernel \r on an \m
>>>>  Can you try this patch?
>>>>  Upstart jobs won't work, but the code will compile
>>>>
>>>>  diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>  index 831e7cf..195c3a4 100644
>>>>  --- a/lib/services/upstart.c
>>>>  +++ b/lib/services/upstart.c
>>>>  @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>  static char *
>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>  {
>>>>  +    char *output = NULL;
>>>>  +
>>>>  +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>  +    static bool err = TRUE;
>>>>  +
>>>>  +    if(err) {
>>>>  +        crm_err("This version of glib is too old to support upstart jobs");
>>>>  +        err = FALSE;
>>>>  +    }
>>>>  +#else
>>>>      GError *error = NULL;
>>>>      GDBusProxy *proxy;
>>>>      GVariant *asv = NULL;
>>>>      GVariant *value = NULL;
>>>>      GVariant *_ret = NULL;
>>>>  -    char *output = NULL;
>>>>
>>>>      crm_info("Calling GetAll on %s", obj);
>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>  @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>
>>>>      g_object_unref(proxy);
>>>>      g_variant_unref(_ret);
>>>>  +#endif
>>>>      return output;
>>>>  }
>>>  OK :) I patched the source.
>>>  Ran "make rc": the same error.
>>  Because it's not building your local changes
>>>  Made a new copy via "fetch": the same error.
>>>  It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded.
>>>  Otherwise the existing archive is used.
>>>  Trimmed log .......
>>>
>>>  # make rc
>>>  make TAG=Pacemaker-1.1.11-rc3 rpm
>>>  make[1]: Entering directory `/root/ha/pacemaker'
>>>  rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>  if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then                                             \
>>>            rm -f pacemaker.tar.*;                                              \
>>>            if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                   \
>>>                git commit -m "DO-NOT-PUSH" -a;                                 \
>>>                git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;       \
>>>                git reset --mixed HEAD^;                                        \
>>>            else                                                                \
>>>                git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;     \
>>>            fi;                                                                 \
>>>            echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                                     \
>>>        else                                                                    \
>>>            echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                     \
>>>        fi
>>>  Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>  .......
>>>
>>>  Well, "make rpm" built the rpms and I created the cluster.
>>>  I ran the same tests and confirmed the behavior.
>>>  The crm_report log is here: http://send2me.ru/crmrep.tar.bz2
>>  Thanks!