[Pacemaker] hangs pending

Andrew Beekhof andrew at beekhof.net
Mon Jan 13 21:19:27 EST 2014


Apart from anything else, your timeout needs to be bigger:

Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: 	Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
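
To see which fencing operations ran into this timeout, the log can be filtered for stonith-ng results that ended in "Timer expired". The following is a minimal sketch, assuming the log_operation line format shown above; the function name fence_timeouts and the temporary sample file are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "operation host device" for every stonith-ng
# operation in the given log file that ended with "Timer expired".
# Assumes the log_operation line format from the log line quoted above.
fence_timeouts() {
    grep 'log_operation:' "$1" | grep 'Timer expired' |
        sed -n "s/.*Operation '\([^']*\)'.*for host '\([^']*\)' with device '\([^']*\)'.*/\1 \2 \3/p"
}

# Example run against a one-line sample copied from the log above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
EOF
fence_timeouts "$sample"
rm -f "$sample"
```

Each output line names the operation, the target host and the fencing device, which helps tell a single slow device from a timeout that is too small in general.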


On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
> 
>> 
>> 
>> 13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>> 
>>>> 10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>  On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>   10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>     08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>     On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>      Hi, ALL.
>>>>>>>>>>> 
>>>>>>>>>>>      I'm still struggling with a node that hangs in "pending" after it has been fenced.
>>>>>>>>>>     Please define "pending".  Where did you see this?
>>>>>>>>>     In crm_mon:
>>>>>>>>>     ......
>>>>>>>>>     Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>     ......
>>>>>>>>> 
>>>>>>>>>     The experiment was like this:
>>>>>>>>>     Four nodes in the cluster.
>>>>>>>>>     On one of them, kill corosync or pacemakerd (signal 4 or 6 or 11).
>>>>>>>>>     After that, the remaining nodes keep rebooting it under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>     Then "Too many failures ...." appeared in the log.
>>>>>>>>>     All this time the status in crm_mon was "pending".
>>>>>>>>>     Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>     Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>> 
>>>>>>>>>     Now I am in the following state:
>>>>>>>>>     I tried to locate the problem and got this far.
>>>>>>>>>     I set a large value in the property stonith-timeout="600s".
>>>>>>>>>     And got the following behavior:
>>>>>>>>>     1. pkill -4 corosync
>>>>>>>>>     2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>     3. It reboots the victim and waits until it comes back to life.
>>>>>>>>    Hmmm.... what version of pacemaker?
>>>>>>>>    This sounds like a timing issue that we fixed a while back
>>>>>>>   It was version 1.1.11 from December 3.
>>>>>>>   I will now do a full update and retest.
>>>>>>  That should be recent enough.  Can you create a crm_report the next time you reproduce?
>>>>> Yes, of course. There will be a little delay.... :)
>>>>> 
>>>>> ......
>>>>> cc1: warnings being treated as errors
>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>> make[1]: *** [all-recursive] Error 1
>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>> make: *** [core] Error 1
>>>>> 
>>>>> I'm trying to solve this problem.
>>>> It won't be solved quickly...
>>>> 
>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>> g_variant_lookup_value () Since 2.28
>>>> 
>>>> # yum list installed glib2
>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>> Loading mirror speeds from cached hostfile
>>>> Installed Packages
>>>> glib2.x86_64                                                              2.26.1-3.el6                                                               installed
>>>> 
>>>> # cat /etc/issue
>>>> CentOS release 6.5 (Final)
>>>> Kernel \r on an \m
>>> 
>>> Can you try this patch?
>>> Upstart jobs won't work, but the code will compile
>>> 
>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>> index 831e7cf..195c3a4 100644
>>> --- a/lib/services/upstart.c
>>> +++ b/lib/services/upstart.c
>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>> static char *
>>> upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>> {
>>> +    char *output = NULL;
>>> +
>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>> +    static bool err = TRUE;
>>> +
>>> +    if(err) {
>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>> +        err = FALSE;
>>> +    }
>>> +#else
>>>     GError *error = NULL;
>>>     GDBusProxy *proxy;
>>>     GVariant *asv = NULL;
>>>     GVariant *value = NULL;
>>>     GVariant *_ret = NULL;
>>> -    char *output = NULL;
>>> 
>>>     crm_info("Calling GetAll on %s", obj);
>>>     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>> 
>>>     g_object_unref(proxy);
>>>     g_variant_unref(_ret);
>>> +#endif
>>>     return output;
>>> }
>>> 
>> 
>> OK :) I patched the source.
>> Ran "make rc": the same error.
> 
> Because it's not building your local changes
> 
>> Made a fresh copy via "fetch": the same error.
>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded.
>> Otherwise the existing archive is used.
>> Trimmed log .......
>> 
>> # make rc
>> make TAG=Pacemaker-1.1.11-rc3 rpm
>> make[1]: Entering directory `/root/ha/pacemaker'
>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then                                             \
>>           rm -f pacemaker.tar.*;                                              \
>>           if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                   \
>>               git commit -m "DO-NOT-PUSH" -a;                                 \
>>               git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;       \
>>               git reset --mixed HEAD^;                                        \
>>           else                                                                \
>>               git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;     \
>>           fi;                                                                 \
>>           echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                                     \
>>       else                                                                    \
>>           echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                     \
>>       fi
>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>> .......
>> 
>> Well, "make rpm" built the rpms and I created the cluster.
>> I ran the same tests and confirmed the behavior.
>> The crm_report log is here - http://send2me.ru/crmrep.tar.bz2
> 
> Thanks!
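
The recipe printed in the quoted "make rc" output suggests why the patch was not picked up: an existing tarball is reused as-is, and when none exists the tag is archived unless TAG is "dirty". A minimal sketch of that decision follows. It is a simplified model that runs no git commands, and the function name tarball_source and the file names under the temporary directory are hypothetical.

```shell
#!/bin/sh
# Simplified model of the tarball step in the quoted "make rc" recipe.
# It only reports which source would end up in the build.
# Hypothetical function: tarball_source TAG TARBALL_PATH
tarball_source() {
    tag=$1
    tarball=$2
    if [ -f "$tarball" ]; then
        echo "existing tarball"         # recipe branch: "Using existing tarball"
    elif [ "$tag" = dirty ]; then
        echo "HEAD plus local changes"  # recipe branch: git commit -a, git archive HEAD
    else
        echo "tag $tag"                 # recipe branch: git archive of the tag
    fi
}

workdir=$(mktemp -d)
rc_tarball="$workdir/ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz"

# No tarball yet: the tagged release is archived, local edits are ignored.
tarball_source Pacemaker-1.1.11-rc3 "$rc_tarball"

# Tarball already present: it is reused without looking at the tree.
: > "$rc_tarball"
tarball_source Pacemaker-1.1.11-rc3 "$rc_tarball"

# TAG=dirty with no tarball: the only branch that includes local edits.
tarball_source dirty "$workdir/placeholder-dirty.tar.gz"

rm -rf "$workdir"
```

In this model, local edits reach the build only through the "dirty" branch, and only when no tarball for that tag already exists.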
