[Pacemaker] hangs pending
Andrey Groshev
greenx at yandex.ru
Fri Feb 7 04:55:44 UTC 2014
Hi, Andrew and ALL!
Andrew, we haven't buried this topic, have we?
16.01.2014, 12:32, "Andrey Groshev" <greenx at yandex.ru>:
> 16.01.2014, 01:30, "Andrew Beekhof" <andrew at beekhof.net>:
>
>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>> 15.01.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>> 14.01.2014, 10:00, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>> Ok, here's what happens:
>>>>>>>
>>>>>>> 1. node2 is lost
>>>>>>> 2. fencing of node2 starts
>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>> 4. node2 returns to the membership
>>>>>>> 5. node2 is marked as a cluster member
>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>> Which is a problem since the node2 fencing operation is part of that
>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>
>>>>>>> That's in theory, except we automatically try again. Which isn't appropriate.
>>>>>>> This should be relatively easy to fix.
>>>>>>>
>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>
>>>>>>> This I have no idea how to fix yet.
>>>>>>>
>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>> Damn. Looks like you're right. At some point I broke my agent and did not notice. Who knows how.
>>>>> I repaired my agent: after sending the reboot, it was waiting on STDIN.
>>>>> The "normal" behavior is back: the node hangs in "pending" until I send a reboot manually. :)
>>>> Right. Now you're in case 8b.
>>>>
>>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>>> I spent the whole day on experiments.
>>> Here is what happens:
>>> 1. Set up the cluster.
>>> 2. On node-2, sent a signal (-4) to kill corosync.
>>> 3. From node-1 (the DC), stonith sent a reboot.
>>> 4. The node rebooted and resources started.
>>> 5. Again: on node-2, sent a signal (-4) to kill corosync.
>>> 6. Again: from node-1 (the DC), stonith sent a reboot.
>>> 7. Node-2 rebooted and hangs in "pending".
>>> 8. Waited and waited..... then rebooted it manually.
>>> 9. Node-2 rebooted and resources started.
>>> 10. GOTO step 2.
>> Logs?
>
> Yesterday I sent an additional message explaining why I had not posted the logs.
> Please read it; it contains a few more questions.
> Today it began to hang again and continues in the same cycle.
> Logs are here: http://send2me.ru/crmrep2.tar.bz2
>
>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>>
>>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
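For anyone scanning crm_report logs for the same symptom, here is a minimal sketch of pulling the fields out of a stonith-ng log_operation line like the one above. The regular expression and the name parse_fence_result are illustrative assumptions inferred from this single log line, not from any documented log format.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the one log line quoted in this thread
# (an assumption, not a documented format).
FENCE_RESULT = re.compile(
    r"Operation '(?P<op>[^']+)' \[(?P<pid>\d+)\] "
    r"\(call (?P<call>\d+) from (?P<client>[^)]+)\) "
    r"for host '(?P<host>[^']+)' with device '(?P<device>[^']+)' "
    r"returned: (?P<rc>-?\d+) \((?P<reason>[^)]*)\)"
)


def parse_fence_result(line: str) -> Optional[Dict[str, str]]:
    """Return the fencing fields of a log_operation line, or None if absent."""
    match = FENCE_RESULT.search(line)
    return match.groupdict() if match else None
```

A result with rc "-62" and reason "Timer expired" marks a fencing operation that ran out of time, which is the case discussed here.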
>>>>>>>>
>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm still struggling with the fact that, after fencing, the node hangs in "pending".
>>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The experiment went like this:
>>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>>>>>>>>>>>>>> After that, the remaining nodes keep rebooting it under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>> Then "Too many failures ...." appeared in the log.
>>>>>>>>>>>>>>>>> All this time the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>> Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>>> I tried to localize the problem, and this is where I ended up.
>>>>>>>>>>>>>>>>> I set a large value for the property stonith-timeout="600s".
>>>>>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>>> 2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>>>>>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
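Step 3 above (send the reboot, then wait for the victim to come back) amounts to polling with a deadline. A minimal sketch of that waiting part follows. The names ssh_port_open and wait_for_host, the port, and the intervals are all hypothetical. This is not the actual "sshbykey" agent, whose source is not shown in this thread.

```python
import socket
import time
from typing import Callable


def ssh_port_open(host: str, port: int = 22, connect_timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_for_host(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() until it succeeds or deadline_s elapses.

    Returns 0 on success and 1 on timeout, in the style of an exit code.
    """
    start = clock()
    while True:
        if probe():
            return 0
        if clock() - start >= deadline_s:
            return 1
        sleep(interval_s)
```

Usage might look like wait_for_host(lambda: ssh_port_open("dev-cluster2-node2"), deadline_s=540), with the deadline kept below the stonith-timeout of 600s so the agent can report failure itself rather than being cut off. A real agent would presumably also need to confirm that the victim actually went down before accepting a successful probe.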
>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>>>>>>> I will now do a full update and retest.
>>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>> Yes, of course. There will be a small delay.... :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> ......
>>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>>> It is not going to be solved quickly...
>>>>>>>>>>>>
>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>
>>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>>> Installed Packages
>>>>>>>>>>>> glib2.x86_64 2.26.1-3.el6 installed
>>>>>>>>>>>>
>>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>>> Kernel \r on an \m
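The mismatch above (g_variant_lookup_value needs glib 2.28, the installed package is 2.26.1-3.el6) is a plain version comparison. Here is a small sketch of that check. The helpers parse_version and meets_minimum are hypothetical names, and this is only a runtime analogue of a compile-time check such as GLIB_CHECK_VERSION, not part of Pacemaker or glib.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract (major, minor, micro) from strings like '2.26.1-3.el6'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    # A missing micro component is treated as 0.
    major, minor, micro = match.groups(default="0")
    return int(major), int(minor), int(micro)


def meets_minimum(installed: str, required: Tuple[int, int, int]) -> bool:
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= required
```

With the values from this message, meets_minimum("2.26.1-3.el6", (2, 28, 0)) is False, which matches the implicit-declaration errors in the build.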
>>>>>>>>>>> Can you try this patch?
>>>>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>> static char *
>>>>>>>>>>> upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>> {
>>>>>>>>>>> + char *output = NULL;
>>>>>>>>>>> +
>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>> + static bool err = TRUE;
>>>>>>>>>>> +
>>>>>>>>>>> + if(err) {
>>>>>>>>>>> + crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>> + err = FALSE;
>>>>>>>>>>> + }
>>>>>>>>>>> +#else
>>>>>>>>>>> GError *error = NULL;
>>>>>>>>>>> GDBusProxy *proxy;
>>>>>>>>>>> GVariant *asv = NULL;
>>>>>>>>>>> GVariant *value = NULL;
>>>>>>>>>>> GVariant *_ret = NULL;
>>>>>>>>>>> - char *output = NULL;
>>>>>>>>>>>
>>>>>>>>>>> crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>> proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>
>>>>>>>>>>> g_object_unref(proxy);
>>>>>>>>>>> g_variant_unref(_ret);
>>>>>>>>>>> +#endif
>>>>>>>>>>> return output;
>>>>>>>>>>> }
>>>>>>>>>> OK :) I patched the source.
>>>>>>>>>> Ran "make rc": the same error.
>>>>>>>>> Because it's not building your local changes
>>>>>>>>>> Made a new copy via "fetch": the same error.
>>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it downloads it.
>>>>>>>>>> Otherwise it uses the existing archive.
>>>>>>>>>> Trimmed log .......
>>>>>>>>>>
>>>>>>>>>> # make rc
>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>>>> rm -f pacemaker.tar.*; \
>>>>>>>>>> if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>>>> git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>>>> git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> git reset --mixed HEAD^; \
>>>>>>>>>> else \
>>>>>>>>>> git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> fi; \
>>>>>>>>>> echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> else \
>>>>>>>>>> echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> fi
>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>> .......
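The shell fragment in the make output above decides where the source tarball comes from, which is why a patched working tree did not end up in the build. Its branches can be restated as a small sketch. The function name tarball_action and the returned labels are made up for illustration, and only the branches visible in the quoted output are modeled.

```python
def tarball_action(tag: str, tarball_exists: bool) -> str:
    """Mirror the if/else structure of the quoted make recipe."""
    if tarball_exists:
        # "Using existing tarball": nothing is rebuilt from the tree.
        return "use-existing-tarball"
    if tag == "dirty":
        # Local changes are committed temporarily and HEAD is archived.
        return "archive-HEAD-with-local-changes"
    # Otherwise the tagged revision is archived, not the working tree.
    return "archive-tag"
```

For the run shown here, tarball_action("Pacemaker-1.1.11-rc3", True) gives "use-existing-tarball". Even with the tarball removed, the rc3 tag would be archived rather than the locally patched files.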
>>>>>>>>>>
>>>>>>>>>> Well, "make rpm" built the RPMs and I created the cluster.
>>>>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>>>>> The crm_report log is here: http://send2me.ru/crmrep.tar.bz2
>>>>>>>>> Thanks!
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org