[Pacemaker] hangs pending

Andrew Beekhof andrew at beekhof.net
Sun Jan 12 17:45:57 EST 2014


On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:

> 
> 
> 10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>> 10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>> 
>>>  On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>   10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>     08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>     On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>      Hi, ALL.
>>>>>>>> 
>>>>>>>>      I'm still trying to cope with the fact that after a fence, the node hangs in "pending".
>>>>>>>     Please define "pending".  Where did you see this?
>>>>>>     In crm_mon:
>>>>>>     ......
>>>>>>     Node dev-cluster2-node2 (172793105): pending
>>>>>>     ......
>>>>>> 
>>>>>>     The experiment was like this:
>>>>>>     Four nodes in the cluster.
>>>>>>     On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>     After that, the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>     Then "Too many failures ...." appeared in the log.
>>>>>>     All this time the node's status in crm_mon is "pending".
>>>>>>     Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>     Much time has passed and I cannot accurately describe the behavior...
>>>>>> 
>>>>>>     Now I am in the following state:
>>>>>>     I tried to locate the problem and came up with this.
>>>>>>     I set a large value for the property stonith-timeout="600s".
>>>>>>     And got the following behavior:
>>>>>>     1. pkill -4 corosync
>>>>>>     2. The DC node calls my fence agent "sshbykey".
>>>>>>     3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>    Hmmm.... what version of pacemaker?
>>>>>    This sounds like a timing issue that we fixed a while back
>>>>   It was version 1.1.11 from December 3.
>>>>   Now I'll try a full update and retest.
>>>  That should be recent enough.  Can you create a crm_report the next time you reproduce?
>> 
>> Of course, yes. A little delay.... :)
>> 
>> ......
>> cc1: warnings being treated as errors
>> upstart.c: In function ‘upstart_job_property’:
>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>> upstart.c:264: error: assignment makes pointer from integer without a cast
>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>> make: *** [core] Error 1
>> 
>> I'm trying to solve this problem.
> 
> 
> It will not be solved quickly...
> 
> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
> g_variant_lookup_value () Since 2.28
> 
> # yum list installed glib2
> Loaded plugins: fastestmirror, rhnplugin, security
> This system is receiving updates from RHN Classic or Red Hat Satellite.
> Loading mirror speeds from cached hostfile
> Installed Packages
> glib2.x86_64                                                              2.26.1-3.el6                                                               installed
> 
> # cat /etc/issue
> CentOS release 6.5 (Final)
> Kernel \r on an \m

Can you try this patch?
Upstart jobs won't work, but the code will compile.

diff --git a/lib/services/upstart.c b/lib/services/upstart.c
index 831e7cf..195c3a4 100644
--- a/lib/services/upstart.c
+++ b/lib/services/upstart.c
@@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char *name)
 {
+    char *output = NULL;
+
+#if !GLIB_CHECK_VERSION(2,28,0)
+    static bool err = TRUE;
+
+    if(err) {
+        crm_err("This version of glib is too old to support upstart jobs");
+        err = FALSE;
+    }
+#else
     GError *error = NULL;
     GDBusProxy *proxy;
     GVariant *asv = NULL;
     GVariant *value = NULL;
     GVariant *_ret = NULL;
-    char *output = NULL;
 
     crm_info("Calling GetAll on %s", obj);
     proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
@@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
 
     g_object_unref(proxy);
     g_variant_unref(_ret);
+#endif
     return output;
 }


> 
> 
>>>>>>       Once the script makes sure that the victim has rebooted and is again available via ssh, it exits with 0 (a rough sketch of this wait loop appears below).
>>>>>>       All commands are logged on both the victim and the killer - everything looks right.
>>>>>>     4. A little later, the status of the victim node in crm_mon changes to online.
>>>>>>     5. BUT... not one resource starts! Despite the fact that "crm_simulate -sL" shows the correct resource to start:
>>>>>>       * Start   pingCheck:3  (dev-cluster2-node2)
>>>>>>     6. We spend the next 600 seconds in this state.
>>>>>>       After this timeout expires, another node (not the DC) decides to kill our victim again.
>>>>>>       All commands are again logged on both the victim and the killer - all documented :)
>>>>>>     7. NOW all resources start in the right sequence.
>>>>>> 
>>>>>>     I'm almost happy, but I don't like it: two reboots and 10 minutes of waiting ;)
>>>>>>     And if something happens on another node, this behavior is superimposed on the old one, and no resources start until the last node has rebooted twice.
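
As an aside, here is a rough model of the "wait until the victim is reachable over ssh again" step from item 3 above. It is a minimal sketch only: it assumes reachability is checked by polling TCP port 22, while the real sshbykey agent is Andrey's own script and may check differently.

/*
 * wait_for_ssh.c -- toy model of the fence agent's wait loop (example only).
 *
 *   gcc wait_for_ssh.c -o wait_for_ssh && ./wait_for_ssh 192.0.2.10 300
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Returns 0 as soon as <ip>:22 accepts a TCP connection, 1 on deadline. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <ip> <deadline-seconds>\n", argv[0]);
        return 2;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(22);
    if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) {
        fprintf(stderr, "bad address: %s\n", argv[1]);
        return 2;
    }

    time_t deadline = time(NULL) + atoi(argv[2]);

    while (time(NULL) < deadline) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd >= 0 &&
            connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
            close(fd);
            printf("victim is back\n");
            return 0;          /* the agent exits 0 here */
        }
        if (fd >= 0)
            close(fd);
        sleep(5);              /* poll every few seconds */
    }

    fprintf(stderr, "victim did not come back before the deadline\n");
    return 1;
}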
>>>>>> 
>>>>>>     I tried to understand this behavior.
>>>>>>     As I understand it:
>>>>>>     1. Ultimately, ./lib/fencing/st_client.c calls internal_stonith_action_execute().
>>>>>>     2. It forks and sets up pipes from the child.
>>>>>>     3. For the async case it calls mainloop_child_add with a callback to stonith_action_async_done.
>>>>>>     4. It adds a timeout via g_timeout_add to send TERM and KILL signals.
>>>>>>
>>>>>>     If all goes well, stonith_action_async_done should be called and the timeout removed.
>>>>>>     For some reason this does not happen. I sit and think ....
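
For what it's worth, the mechanism described in steps 2-4 is the standard GLib child-watch plus timeout pattern. Below is a minimal stand-alone sketch of that interaction, using the generic g_child_watch_add() rather than pacemaker's internal mainloop_child_add(), and made-up timeouts, so it only illustrates the pattern, not the actual st_client.c code.

/*
 * child_watch_demo.c -- illustrates the child-watch/timeout interaction
 * described above (not the actual st_client.c code).
 *
 *   gcc child_watch_demo.c $(pkg-config --cflags --libs glib-2.0) -o demo
 */
#include <signal.h>
#include <unistd.h>
#include <glib.h>

static GMainLoop *loop = NULL;
static guint timeout_id = 0;

/* Runs when the forked child exits: the timeout must be removed here,
 * otherwise it will later fire against a job that already finished. */
static void
child_done(GPid pid, gint status, gpointer user_data)
{
    g_print("child %d exited with status %d\n", (int) pid, status);
    if (timeout_id) {
        g_source_remove(timeout_id);
        timeout_id = 0;
    }
    g_spawn_close_pid(pid);
    g_main_loop_quit(loop);
}

/* Runs only if the child outlives the timeout: escalate with SIGTERM. */
static gboolean
child_timeout(gpointer user_data)
{
    GPid pid = GPOINTER_TO_INT(user_data);

    g_print("timeout: sending SIGTERM to %d\n", (int) pid);
    kill(pid, SIGTERM);
    /* a real implementation would follow up with SIGKILL after a grace period */
    return FALSE;                      /* one-shot source */
}

int
main(void)
{
    GPid pid = fork();

    if (pid == 0) {                    /* child: stand-in for the fence agent */
        sleep(2);
        _exit(0);
    }

    loop = g_main_loop_new(NULL, FALSE);
    g_child_watch_add(pid, child_done, NULL);
    timeout_id = g_timeout_add(10000, child_timeout, GINT_TO_POINTER(pid));

    g_main_loop_run(loop);
    g_main_loop_unref(loop);
    return 0;
}

If the child-watch callback (stonith_action_async_done in the real code) never runs, nothing removes the timeout source, which matches the behaviour described above.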
>>>>>>>>      During this time, there are constant re-elections.
>>>>>>>>      Also, I noticed a difference in how pacemaker starts.
>>>>>>>>      At normal startup:
>>>>>>>>      * corosync
>>>>>>>>      * pacemakerd
>>>>>>>>      * attrd
>>>>>>>>      * pengine
>>>>>>>>      * lrmd
>>>>>>>>      * crmd
>>>>>>>>      * cib
>>>>>>>> 
>>>>>>>>      When the start hangs:
>>>>>>>>      * corosync
>>>>>>>>      * pacemakerd
>>>>>>>>      * attrd
>>>>>>>>      * pengine
>>>>>>>>      * crmd
>>>>>>>>      * lrmd
>>>>>>>>      * cib.
>>>>>>>     Are you referring to the order of the daemons here?
>>>>>>>     The cib should not be at the bottom in either case.
>>>>>>>>      Who knows who runs lrmd?
>>>>>>>     Pacemakerd.
