[Pacemaker] hangs pending

Andrew Beekhof andrew at beekhof.net
Wed Jan 15 16:24:53 EST 2014


On 16 Jan 2014, at 12:41 am, Andrey Groshev <greenx at yandex.ru> wrote:

> 
> 
> 15.01.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <greenx at yandex.ru> wrote:
>> 
>>>  14.01.2014, 10:00, "Andrey Groshev" <greenx at yandex.ru>:
>>>>  14.01.2014, 07:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>   Ok, here's what happens:
>>>>> 
>>>>>   1. node2 is lost
>>>>>   2. fencing of node2 starts
>>>>>   3. node2 reboots (and cluster starts)
>>>>>   4. node2 returns to the membership
>>>>>   5. node2 is marked as a cluster member
>>>>>   6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>      Which is a problem since the node2 fencing operation is part of that
>>>>>   7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>   8a. fencing fails: transition completes and the node joins the cluster
>>>>> 
>>>>>   That's the theory, except we automatically try again, which isn't appropriate.
>>>>>   This should be relatively easy to fix.
>>>>> 
>>>>>   8b. fencing passes: the node is incorrectly marked as offline
>>>>> 
>>>>>   This I have no idea how to fix yet.
>>>>> 
>>>>>   On another note, it doesn't look like this agent works at all.
>>>>>   The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>   So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
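The "reboot, then wait for ssh" contract described above can be sketched as a small shell helper. This is a hypothetical sketch, not the actual sshbykey agent; the probe command, names, and timings are placeholders:

```shell
# wait_for_host PROBE TIMEOUT [INTERVAL]: poll PROBE until it succeeds
# (the victim is back up) or TIMEOUT seconds elapse. The return code
# mirrors what a fence agent should report: 0 = fenced and back, 1 = give up.
wait_for_host() {
    probe=$1 timeout=$2 interval=${3:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # </dev/null so an ssh-based probe can never block reading STDIN
        if sh -c "$probe" </dev/null >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# A real agent might use a probe like the following (placeholders,
# not the agent's actual code), after first issuing the reboot:
#   ssh -o BatchMode=yes -o ConnectTimeout=3 "root@$victim" true
```

The explicit stdin redirection matters: ssh inherits the caller's stdin by default and can sit reading it indefinitely, one plausible cause of an agent that never exits.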
>>>>  Damn. Looks like you're right. At some point I broke my agent and didn't notice. Who knows how.
>>>  I repaired my agent - after sending the reboot it was waiting on STDIN.
>>>  The "normal" behaviour is back - it hangs in "pending" until I manually send a reboot. :)
>> 
>> Right. Now you're in case 8b.
>> 
>> Can you try this patch:  http://paste.fedoraproject.org/68450/38973966
> 
> 
> I spent the whole day on experiments.
> It turns out like this:
> 1. Built the cluster.
> 2. On node-2, sent signal 4 (SIGILL) - killed corosync.
> 3. From node-1 (the DC) - stonith sent a reboot.
> 4. Node-2 rebooted and the resources started.
> 5. Again: on node-2, sent signal 4 - killed corosync.
> 6. Again: from node-1 (the DC) - stonith sent a reboot.
> 7. Node-2 rebooted and hangs in "pending".
> 8. Waited, waited..... then rebooted manually.
> 9. Node-2 rebooted and the resources started.
> 10. GOTO step 2.

Logs?

> 
> 
> 
>>>  New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>   On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>    Apart from anything else, your timeout needs to be bigger:
>>>>>> 
>>>>>>    Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>> 
>>>>>>    On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>    On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>    13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>    On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>    10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>    10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>    On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>     10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>      On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>       08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>       On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>        Hi, ALL.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>        I'm still trying to cope with the fact that after the fence - node hangs in "pending".
>>>>>>>>>>>>>>>>       Please define "pending".  Where did you see this?
>>>>>>>>>>>>>>>       In crm_mon:
>>>>>>>>>>>>>>>       ......
>>>>>>>>>>>>>>>       Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>       ......
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>       The experiment was like this:
>>>>>>>>>>>>>>>       Four nodes in cluster.
>>>>>>>>>>>>>>>       On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>>>>>       After that, the remaining nodes constantly reboot it under various pretexts ("softly whistling", "fly low", "not a cluster member!") ...
>>>>>>>>>>>>>>>       Then "Too many failures ...." fell out in the log.
>>>>>>>>>>>>>>>       All this time its status in crm_mon was "pending".
>>>>>>>>>>>>>>>       Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>       Much time has passed and I cannot accurately describe the behaviour...
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>       Now I am in the following state:
>>>>>>>>>>>>>>>       I tried to localize the problem and arrived at this.
>>>>>>>>>>>>>>>       I set big value in property stonith-timeout="600s".
>>>>>>>>>>>>>>>       And got the following behavior:
>>>>>>>>>>>>>>>       1. pkill -4 corosync
>>>>>>>>>>>>>>>       2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>>>>>>       3. It sends a reboot to the victim and waits until it comes back to life.
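For reference, the stonith-timeout property mentioned above can be set cluster-wide like this. A configuration fragment in crmsh syntax (pcs has an equivalent `pcs property set`):

```shell
# Set the cluster-wide fencing timeout to 600s (crmsh syntax).
# It must exceed the time the agent needs to see the victim come
# back over ssh, or the reboot operation is reported as failed.
crm configure property stonith-timeout=600s
```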
>>>>>>>>>>>>>>      Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>      This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>     It was version 1.1.11 from December 3.
>>>>>>>>>>>>>     I will now do a full update and retest.
>>>>>>>>>>>>    That should be recent enough.  Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>    Of course, yes. A little delay.... :)
>>>>>>>>>>> 
>>>>>>>>>>>    ......
>>>>>>>>>>>    cc1: warnings being treated as errors
>>>>>>>>>>>    upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>    upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>    upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>    upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>    gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>    gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>    make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>    make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>    make: *** [core] Error 1
>>>>>>>>>>> 
>>>>>>>>>>>    I'm trying to solve this problem.
>>>>>>>>>>    It's not getting solved quickly...
>>>>>>>>>> 
>>>>>>>>>>    https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>    g_variant_lookup_value () Since 2.28
>>>>>>>>>> 
>>>>>>>>>>    # yum list installed glib2
>>>>>>>>>>    Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>    This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>    Loading mirror speeds from cached hostfile
>>>>>>>>>>    Installed Packages
>>>>>>>>>>    glib2.x86_64                                                              2.26.1-3.el6                                                               installed
>>>>>>>>>> 
>>>>>>>>>>    # cat /etc/issue
>>>>>>>>>>    CentOS release 6.5 (Final)
>>>>>>>>>>    Kernel \r on an \m
>>>>>>>>>    Can you try this patch?
>>>>>>>>>    Upstart jobs won't work, but the code will compile.
>>>>>>>>> 
>>>>>>>>>    diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>    index 831e7cf..195c3a4 100644
>>>>>>>>>    --- a/lib/services/upstart.c
>>>>>>>>>    +++ b/lib/services/upstart.c
>>>>>>>>>    @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>    static char *
>>>>>>>>>    upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>    {
>>>>>>>>>    +    char *output = NULL;
>>>>>>>>>    +
>>>>>>>>>    +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>    +    static bool err = TRUE;
>>>>>>>>>    +
>>>>>>>>>    +    if(err) {
>>>>>>>>>    +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>    +        err = FALSE;
>>>>>>>>>    +    }
>>>>>>>>>    +#else
>>>>>>>>>       GError *error = NULL;
>>>>>>>>>       GDBusProxy *proxy;
>>>>>>>>>       GVariant *asv = NULL;
>>>>>>>>>       GVariant *value = NULL;
>>>>>>>>>       GVariant *_ret = NULL;
>>>>>>>>>    -    char *output = NULL;
>>>>>>>>> 
>>>>>>>>>       crm_info("Calling GetAll on %s", obj);
>>>>>>>>>       proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>    @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>> 
>>>>>>>>>       g_object_unref(proxy);
>>>>>>>>>       g_variant_unref(_ret);
>>>>>>>>>    +#endif
>>>>>>>>>       return output;
>>>>>>>>>    }
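The guard in the patch relies on GLIB_CHECK_VERSION being a compile-time version comparison against the installed headers. A sketch of the same comparison in shell (the function name is illustrative, not glib's macro), showing why CentOS 6.5's glib2 2.26.1 takes the "too old" branch:

```shell
# version_at_least HAVE_MAJ HAVE_MIN HAVE_MIC WANT_MAJ WANT_MIN WANT_MIC
# succeeds when the installed version is >= the wanted one, which is
# roughly what GLIB_CHECK_VERSION(want...) evaluates at compile time.
version_at_least() {
    [ "$1" -gt "$4" ] && return 0
    [ "$1" -eq "$4" ] && [ "$2" -gt "$5" ] && return 0
    [ "$1" -eq "$4" ] && [ "$2" -eq "$5" ] && [ "$3" -ge "$6" ] && return 0
    return 1
}

# CentOS 6.5 ships glib2 2.26.1, so the 2.28.0 check fails and only
# the crm_err() branch of the patch gets compiled:
version_at_least 2 26 1  2 28 0 || echo "glib too old: upstart support disabled"
```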
>>>>>>>>    Ok :) I patched the source.
>>>>>>>>    Typed "make rc" - the same error.
>>>>>>>    Because it's not building your local changes.
>>>>>>>>    Made a new copy via "fetch" - the same error.
>>>>>>>>    It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it is downloaded.
>>>>>>>>    Otherwise the existing archive is used.
>>>>>>>>    Trimmed log .......
>>>>>>>> 
>>>>>>>>    # make rc
>>>>>>>>    make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>    make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>    rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>    if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then                                             \
>>>>>>>>             rm -f pacemaker.tar.*;                                              \
>>>>>>>>             if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                   \
>>>>>>>>                 git commit -m "DO-NOT-PUSH" -a;                                 \
>>>>>>>>                 git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;       \
>>>>>>>>                 git reset --mixed HEAD^;                                        \
>>>>>>>>             else                                                                \
>>>>>>>>                 git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;     \
>>>>>>>>             fi;                                                                 \
>>>>>>>>             echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                                     \
>>>>>>>>         else                                                                    \
>>>>>>>>             echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                     \
>>>>>>>>         fi
>>>>>>>>    Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>    .......
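The "Using existing tarball" branch in the Makefile snippet above is why the patched sources were not built: `make rc` re-archives the tree only when the tarball is absent. A small model of that check (tarball name copied from the log):

```shell
# Mirrors the Makefile's if-block above: an existing tarball is reused
# verbatim, so local patches never reach the rpm build.
check_tarball() {
    if [ -f "$1" ]; then
        echo "Using existing tarball: $1"   # local patches silently ignored
    else
        echo "Rebuilt $1"                   # git archive runs afresh
    fi
}

# The fix is simply to delete the cached tarball before re-running
# "make rc" (illustrative):
#   rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz && make rc
check_tarball ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
```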
>>>>>>>> 
>>>>>>>>    Well, "make rpm" built the rpms and I created the cluster.
>>>>>>>>    I ran the same tests and confirmed the behaviour.
>>>>>>>>    crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>    Thanks!
>>>>>   _______________________________________________
>>>>>   Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>> 
>>>>>   Project Home: http://www.clusterlabs.org
>>>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>   Bugs: http://bugs.clusterlabs.org
