[Pacemaker] hangs pending

Andrey Groshev greenx at yandex.ru
Sat Feb 22 08:07:55 UTC 2014



21.02.2014, 04:00, "Andrew Beekhof" <andrew at beekhof.net>:
> On 20 Feb 2014, at 10:04 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  20.02.2014, 13:57, "Andrew Beekhof" <andrew at beekhof.net>:
>>>  On 20 Feb 2014, at 5:33 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>   20.02.2014, 01:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>   On 20 Feb 2014, at 4:18 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>    19.02.2014, 06:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>    On 18 Feb 2014, at 9:29 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>     Hi, ALL and Andrew!
>>>>>>>>
>>>>>>>>     Today is a good day - I did a lot of killing, and took a lot of shots in return.
>>>>>>>>     In general - I am happy (almost like an elephant)   :)
>>>>>>>>     Besides the resources, eight processes on the node matter to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>>     I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>     The behavior does not depend on the signal number - that's good.
>>>>>>>>     If STONITH sends a reboot to the node, it reboots and rejoins the cluster - that's also good.
>>>>>>>>     But the behavior differs depending on which daemon is killed.
>>>>>>>>
>>>>>>>>     They fall into four groups:
>>>>>>>>     1. corosync, cib - STONITH works 100%.
>>>>>>>>     Killing them with any signal calls STONITH and reboots the node.
>>>>>>>    excellent
>>>>>>>>     3. stonithd, attrd, pengine - STONITH not needed.
>>>>>>>>     These daemons simply restart; resources stay running.
>>>>>>>    right
>>>>>>>>     2. lrmd, crmd - strange STONITH behavior.
>>>>>>>>     Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>>     Sometimes the daemon restarts
>>>>>>>    The daemon will always try to restart, the only variable is how long it takes the peer to notice and initiate fencing.
>>>>>>>    If the failure happens just before they're due to receive the totem token, the failure will be detected very quickly and the node fenced.
>>>>>>>    If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>>>>
>>>>>>>    So fence/not fence is normal and to be expected.
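The detection window described above is bounded by corosync's totem token timeout. As a rough illustration (the values below are examples only, not recommendations), the timeout is set in the totem section of corosync.conf:

```
# /etc/corosync/corosync.conf -- illustrative values only
totem {
    version: 2
    # Milliseconds to wait for the token before declaring a processor failed;
    # larger values mean slower failure detection but more tolerance of load.
    token: 10000
    # How many token retransmits are attempted before a new membership is formed
    token_retransmits_before_loss_const: 10
}
```

With a longer token timeout, the window in which a crashed lrmd/crmd can respawn before the peers notice grows accordingly.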
>>>>>>>>     and resources restart, with a large delay for MS:pgsql.
>>>>>>>>     One time, after restarting crmd, pgsql did not restart at all.
>>>>>>>    I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>>>>>    In the case where the node is not fenced, the crmd will respawn and the PE will request that it re-detect the state of all resources.
>>>>>>>
>>>>>>>    If the agent reports "all good", then there is nothing more to do.
>>>>>>>    If the agent is not reporting "all good", you should really be asking why.
>>>>>>>>     4. pacemakerd - nothing happens.
>>>>>>>    On non-systemd based machines, correct.
>>>>>>>
>>>>>>>    On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>>>>>    Any subsequent daemon failure will be detected and the daemon respawned.
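For reference, the respawning on systemd machines comes from the Restart= directive in Pacemaker's unit file; a minimal sketch of the relevant part (the unit actually shipped by a distribution may differ):

```
# Sketch of pacemaker.service -- check your distribution's real unit file
[Service]
ExecStart=/usr/sbin/pacemakerd -f
# systemd restarts pacemakerd when it exits abnormally; on startup,
# pacemakerd then finds and reattaches to any surviving child daemons.
Restart=on-failure
```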
>>>>>>    And! I almost forgot about THAT!
>>>>>>    Are there other (NORMAL) variants, methods, ideas?
>>>>>>    Without this  ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>>>>    Otherwise - it's a complete epic fail ;)
>>>>>   -ENOPARSE
>>>>   OK, I will set aside my personal attitude toward "systemd".
>>>>   Let me explain.
>>>>
>>>>   Somewhere near the beginning of this topic, I wrote:
>>>>   A.G.: Who knows who runs lrmd?
>>>>   A.B.: Pacemakerd.
>>>>   That's one!
>>>>
>>>>   Let's see the list of processes:
>>>>   #ps -axf
>>>>   .....
>>>>   6067 ?        Ssl    7:24 corosync
>>>>   6092 ?        S      0:25 pacemakerd
>>>>   6094 ?        Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>>>   6095 ?        Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>>>   6096 ?        Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>>>   6097 ?        Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>>>   6098 ?        Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>>>   6099 ?        Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>>>>   .....
>>>>   That's two!
>>>  What's two?  I don't follow.
>>  In the sense that it spawns the other processes. But that does not matter.
>>>>   And more, and more...
>>>>   Now you must understand why I want this process to always be running.
>>>>   I don't think anyone here needs that explained!
>>>>
>>>>   And now you say that pacemakerd works nicely, but only on systemd distros!!!
>>>  No, I'm saying it works _better_ on systemd distros.
>>>  On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no-one saw any of the error messages and ran "service pacemaker restart" before the additional failures).
>>  Can you show me the place where:
>>  "On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons."?
>
> The code for it is in mcp/pacemaker.c, look for find_and_track_existing_processes()
>
> The ps tree will look different though
>
>  6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
>  6095 ?        Ss     0:25  /usr/libexec/pacemaker/stonithd
>  6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
>  6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
>  6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
>  6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
> ...
>  6666 ?        S      0:25 pacemakerd
>
> but pacemakerd will be watching the old children and respawning them on failure.
> at which point you might see:
>
>  6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
>  6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
>  6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
>  6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
>  6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
> ...
>  6666 ?        S      0:25 pacemakerd
>  6667 ?        Ss     0:25 \_ /usr/libexec/pacemaker/stonithd
>
>>  If I respawn the pacemakerd process via upstart, does it "reattach to the existing daemons"?
>
> If upstart is capable of detecting the pacemakerd failure and automagically respawning it, then yes - the same process will happen.
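A hedged sketch of an upstart job that would provide that respawn (the file name and stanzas here are illustrative, not Pacemaker's shipped job):

```
# /etc/init/pacemaker.conf -- hypothetical upstart job
description "Pacemaker cluster manager"
start on started corosync
stop on stopping corosync
# Respawn pacemakerd if it dies (up to 10 times in 5 seconds);
# on restart it reattaches to the existing child daemons.
respawn
respawn limit 10 5
exec /usr/sbin/pacemakerd
```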

Some people defend you and send me hate mail when I'm not restrained.
But you're still a sly one :)
Why didn't you say anything about upstart support in the spec file?

>>>>   What should I do now?
>>>>   * Integrate systemd in CentOS?
>>>>   * Migrate to Fedora?
>>>>   * Buy RHEL7 !?
>>>  Option 3 is particularly good :)
>>  It's too easy. Normal heroes always take the long way round :)
>>>>   Each of these options is great, but none of them fits me.
>>>>
>>>>   P.S. And I'm not talking about distros which have not migrated to systemd (and will not).
>>>  Are there any?  Even debian and ubuntu have raised the white flag.
>>  That's certainly poetic license, but potentially it could be any Unix-like system.
>>>>   Do not be offended! We do the same.
>>>>   We build a secret military factory,
>>>>   put a large concrete fence around it,
>>>>   top the wall with barbed wire - but forget to install the gates. :)
>>>>>>>>     And then I can kill any process in the third group. They do not restart.
>>>>>>>    Until they become needed.
>>>>>>>    Eg. if the DC goes to invoke the policy engine, that will fail causing the crmd to fail and the node to be fenced.
>>>>>>>>     Generally, I don't touch corosync, cib, and maybe lrmd, crmd.
>>>>>>>>
>>>>>>>>     What do you think about this?
>>>>>>>>     The main question of this topic we have resolved.
>>>>>>>>     But this varied behavior is another big problem.
>>>>>>>>
>>>>>>>>     17.02.2014, 08:52, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>     17.02.2014, 02:27, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>      With no quick follow-up, dare one hope that means the patch worked? :-)
>>>>>>>>>     Hi,
>>>>>>>>>     No, unfortunately the chief changed my plans on Friday and I spent the whole day on a parallel project.
>>>>>>>>>     I hope to have time today to run the necessary tests.
>>>>>>>>>>      On 14 Feb 2014, at 3:37 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>       Yes, of course. Now beginning to build world and test )
>>>>>>>>>>>
>>>>>>>>>>>       14.02.2014, 04:41, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>       The previous patch wasn't quite right.
>>>>>>>>>>>>       Could you try this new one?
>>>>>>>>>>>>
>>>>>>>>>>>>          http://paste.fedoraproject.org/77123/13923376/
>>>>>>>>>>>>
>>>>>>>>>>>>       [11:23 AM] beekhof at f19 ~/Development/sources/pacemaker/devel ☺ # git diff
>>>>>>>>>>>>       diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>>>>>>>       index ac4b905..d49525b 100644
>>>>>>>>>>>>       --- a/crmd/callbacks.c
>>>>>>>>>>>>       +++ b/crmd/callbacks.c
>>>>>>>>>>>>       @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>>>>>>>>>>                        stop_te_timer(down->timer);
>>>>>>>>>>>>
>>>>>>>>>>>>                        flags |= node_update_join | node_update_expected;
>>>>>>>>>>>>       -                crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>>>>>>>>>       -                crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>>       +                crmd_peer_down(node, FALSE);
>>>>>>>>>>>>                        check_join_state(fsa_state, __FUNCTION__);
>>>>>>>>>>>>
>>>>>>>>>>>>                        update_graph(transition_graph, down);
>>>>>>>>>>>>       diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>>>>>>>>       index bc472c2..1a2577a 100644
>>>>>>>>>>>>       --- a/crmd/crmd_utils.h
>>>>>>>>>>>>       +++ b/crmd/crmd_utils.h
>>>>>>>>>>>>       @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>>>>>>>>>        const char *get_timer_desc(fsa_timer_t * timer);
>>>>>>>>>>>>        gboolean too_many_st_failures(void);
>>>>>>>>>>>>        void st_fail_count_reset(const char * target);
>>>>>>>>>>>>       +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>>>>>>>>
>>>>>>>>>>>>        #  define fsa_register_cib_callback(id, flag, data, fn) do {              \
>>>>>>>>>>>>                fsa_cib_conn->cmds->register_callback(                          \
>>>>>>>>>>>>       diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>>>>>>>>       index f31d4ec..3bfce59 100644
>>>>>>>>>>>>       --- a/crmd/te_actions.c
>>>>>>>>>>>>       +++ b/crmd/te_actions.c
>>>>>>>>>>>>       @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>>>>>>>>>>>>                crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>>>>>>>>>                peer->uuid = strdup(uuid);
>>>>>>>>>>>>            }
>>>>>>>>>>>>       -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>>       -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>>       -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>>       -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>>
>>>>>>>>>>>>       +    crmd_peer_down(peer, TRUE);
>>>>>>>>>>>>            node_state =
>>>>>>>>>>>>                do_update_node_cib(peer,
>>>>>>>>>>>>                                   node_update_cluster | node_update_peer | node_update_join |
>>>>>>>>>>>>       diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>>>>>>>>>>       index ad7e573..0c92e95 100644
>>>>>>>>>>>>       --- a/crmd/te_utils.c
>>>>>>>>>>>>       +++ b/crmd/te_utils.c
>>>>>>>>>>>>       @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>>>>>>>>>>>>
>>>>>>>>>>>>                }
>>>>>>>>>>>>
>>>>>>>>>>>>       -        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>>       -        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>>       -        crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>>       -        crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>>       +        crmd_peer_down(peer, TRUE);
>>>>>>>>>>>>             }
>>>>>>>>>>>>        }
>>>>>>>>>>>>
>>>>>>>>>>>>       diff --git a/crmd/utils.c b/crmd/utils.c
>>>>>>>>>>>>       index 3988cfe..2df53ab 100644
>>>>>>>>>>>>       --- a/crmd/utils.c
>>>>>>>>>>>>       +++ b/crmd/utils.c
>>>>>>>>>>>>       @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>>>>>>>>>>>>            crm_trace("telling attrd to clear attributes for remote host %s", host);
>>>>>>>>>>>>            update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>>>>>>>>>>        }
>>>>>>>>>>>>       +
>>>>>>>>>>>>       +void crmd_peer_down(crm_node_t *peer, bool full)
>>>>>>>>>>>>       +{
>>>>>>>>>>>>       +    if(full && peer->state == NULL) {
>>>>>>>>>>>>       +        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>>       +        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>>       +    }
>>>>>>>>>>>>       +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>>       +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>>       +}
>>>>>>>>>>>>
>>>>>>>>>>>>       On 16 Jan 2014, at 7:24 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>        16.01.2014, 01:30, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>        On 16 Jan 2014, at 12:41 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>         15.01.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>         On 15 Jan 2014, at 12:15 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>          14.01.2014, 10:00, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>>>>>>>          14.01.2014, 07:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>           Ok, here's what happens:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           1. node2 is lost
>>>>>>>>>>>>>>>>>>>           2. fencing of node2 starts
>>>>>>>>>>>>>>>>>>>           3. node2 reboots (and cluster starts)
>>>>>>>>>>>>>>>>>>>           4. node2 returns to the membership
>>>>>>>>>>>>>>>>>>>           5. node2 is marked as a cluster member
>>>>>>>>>>>>>>>>>>>           6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>>>>>>>>>>>>              Which is a problem since the node2 fencing operation is part of that
>>>>>>>>>>>>>>>>>>>           7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>>>>>>>>>>>>>           8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           Thats in theory, except we automatically try again. Which isn't appropriate.
>>>>>>>>>>>>>>>>>>>           This should be relatively easy to fix.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           This I have no idea how to fix yet.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           On another note, it doesn't look like this agent works at all.
>>>>>>>>>>>>>>>>>>>           The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>>>>>>>>>>>>>           So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>>>>>>>>>>>>>          Damn. Looks like you're right. At some point I broke my agent and had not noticed. I will look into it.
>>>>>>>>>>>>>>>>>          I repaired my agent - after sending the reboot, it waits on STDIN.
>>>>>>>>>>>>>>>>>          The "normal" behavior returned - it hangs "pending" until I manually send a reboot. :)
>>>>>>>>>>>>>>>>         Right. Now you're in case 8b.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         Can you try this patch:  http://paste.fedoraproject.org/68450/38973966
>>>>>>>>>>>>>>>         I spent the whole day on experiments.
>>>>>>>>>>>>>>>         It turns out like this:
>>>>>>>>>>>>>>>         1. Built the cluster.
>>>>>>>>>>>>>>>         2. On node-2, sent signal (-4) - killed corosync.
>>>>>>>>>>>>>>>         3. From node-1 (the DC) - stonith sent a reboot.
>>>>>>>>>>>>>>>         4. The node rebooted and resources started.
>>>>>>>>>>>>>>>         5. Again: on node-2, sent signal (-4) - killed corosync.
>>>>>>>>>>>>>>>         6. Again: from node-1 (the DC) - stonith sent a reboot.
>>>>>>>>>>>>>>>         7. Node-2 rebooted and hangs in "pending".
>>>>>>>>>>>>>>>         8. Waited and waited..... rebooted manually.
>>>>>>>>>>>>>>>         9. Node-2 rebooted and resources started.
>>>>>>>>>>>>>>>         10. GOTO step 2.
>>>>>>>>>>>>>>        Logs?
>>>>>>>>>>>>>        Yesterday I wrote a separate letter explaining why I could not post the logs.
>>>>>>>>>>>>>        Please read it; it contains a few more questions.
>>>>>>>>>>>>>        Today it began to hang again and continues the same cycle.
>>>>>>>>>>>>>        Logs here http://send2me.ru/crmrep2.tar.bz2
>>>>>>>>>>>>>>>>>          New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>>>>>>>>>>>>>>           On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>>>>>>>>            Apart from anything else, your timeout needs to be bigger:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>            Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>            On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>>>>>>>>>            On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>            13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>            On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>            10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>>>>>>>>>>>>>>            10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>            On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>             10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                Hi, ALL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>                I'm still trying to cope with the fact that after fencing, the node hangs in "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Please define "pending".  Where did you see this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               In crm_mon:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               ......
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               ......
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               The experiment was like this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Four nodes in the cluster.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               After that, the remaining nodes start constantly rebooting it, under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Then "Too many failures ...." fell into the log.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               All this time the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Much time has passed and I cannot describe the behavior accurately...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               Now I am in the following state:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               I tried to localize the problem, and came up with this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               I set a big value for the property stonith-timeout="600s".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               And got the following behavior:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               1. pkill -4 corosync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               2. the node with the DC calls my fence agent "sshbykey"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>               3. It sends a reboot to the victim and waits until it comes to life again.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>              This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>>>>>>>>>>>>>>             It was version 1.1.11 from December 3.
>>>>>>>>>>>>>>>>>>>>>>>>>>>             Now I will do a full update and retest.
>>>>>>>>>>>>>>>>>>>>>>>>>>            That should be recent enough.  Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>>>>>>>>>>>>>>            Of course, yes. A little delay.... :)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>            ......
>>>>>>>>>>>>>>>>>>>>>>>>>            cc1: warnings being treated as errors
>>>>>>>>>>>>>>>>>>>>>>>>>            upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>>>>>>>>>>>>>>            upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>>>>>            upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>>>>>            upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>>>>>>>>>>>>>>            gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>>            gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>>>>>>>>>>>>>>            make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>>            make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>>>>>>>>>>>>>>            make: *** [core] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>            I'm trying to solve this problem.
>>>>>>>>>>>>>>>>>>>>>>>>            It's not getting solved quickly...
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>            https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>>>>>>>>>>>>>>            g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>            # yum list installed glib2
>>>>>>>>>>>>>>>>>>>>>>>>            Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>>>>>>>>>>>>>>            This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>>>>>>>>>>>>>>            Loading mirror speeds from cached hostfile
>>>>>>>>>>>>>>>>>>>>>>>>            Installed Packages
>>>>>>>>>>>>>>>>>>>>>>>>            glib2.x86_64                                                              2.26.1-3.el6                                                               installed
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>            # cat /etc/issue
>>>>>>>>>>>>>>>>>>>>>>>>            CentOS release 6.5 (Final)
>>>>>>>>>>>>>>>>>>>>>>>>            Kernel \r on an \m
>>>>>>>>>>>>>>>>>>>>>>>            Can you try this patch?
>>>>>>>>>>>>>>>>>>>>>>>            Upstart jobs won't work, but the code will compile
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>            diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>>            index 831e7cf..195c3a4 100644
>>>>>>>>>>>>>>>>>>>>>>>            --- a/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>>            +++ b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>>            @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>>>>>>>>>>>>>            static char *
>>>>>>>>>>>>>>>>>>>>>>>            upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>>>>            {
>>>>>>>>>>>>>>>>>>>>>>>            +    char *output = NULL;
>>>>>>>>>>>>>>>>>>>>>>>            +
>>>>>>>>>>>>>>>>>>>>>>>            +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>>>>>>>>>>>>>>            +    static bool err = TRUE;
>>>>>>>>>>>>>>>>>>>>>>>            +
>>>>>>>>>>>>>>>>>>>>>>>            +    if(err) {
>>>>>>>>>>>>>>>>>>>>>>>            +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>>>>>>>>>>>>>>            +        err = FALSE;
>>>>>>>>>>>>>>>>>>>>>>>            +    }
>>>>>>>>>>>>>>>>>>>>>>>            +#else
>>>>>>>>>>>>>>>>>>>>>>>               GError *error = NULL;
>>>>>>>>>>>>>>>>>>>>>>>               GDBusProxy *proxy;
>>>>>>>>>>>>>>>>>>>>>>>               GVariant *asv = NULL;
>>>>>>>>>>>>>>>>>>>>>>>               GVariant *value = NULL;
>>>>>>>>>>>>>>>>>>>>>>>               GVariant *_ret = NULL;
>>>>>>>>>>>>>>>>>>>>>>>            -    char *output = NULL;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>               crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>>>>>>>>>>>>>               proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>>>>>>>>>>>>>>            @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>               g_object_unref(proxy);
>>>>>>>>>>>>>>>>>>>>>>>               g_variant_unref(_ret);
>>>>>>>>>>>>>>>>>>>>>>>            +#endif
>>>>>>>>>>>>>>>>>>>>>>>               return output;
>>>>>>>>>>>>>>>>>>>>>>>            }
>>>>>>>>>>>>>>>>>>>>>>            Ok :) I patched the source.
>>>>>>>>>>>>>>>>>>>>>>            Typed "make rc" - the same error.
>>>>>>>>>>>>>>>>>>>>>            Because it's not building your local changes
>>>>>>>>>>>>>>>>>>>>>>            Made a new copy via "fetch" - the same error.
>>>>>>>>>>>>>>>>>>>>>>            It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it downloads it.
>>>>>>>>>>>>>>>>>>>>>>            Otherwise it uses the existing archive.
>>>>>>>>>>>>>>>>>>>>>>            Cut log .......
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>            # make rc
>>>>>>>>>>>>>>>>>>>>>>            make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>>>>>>>>>>>>>>            make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>>>>>>>>>>>>>>            rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>>>>>>>>>>>>>>            if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then                                             \
>>>>>>>>>>>>>>>>>>>>>>                     rm -f pacemaker.tar.*;                                              \
>>>>>>>>>>>>>>>>>>>>>>                     if [ Pacemaker-1.1.11-rc3 = dirty ]; then                                   \
>>>>>>>>>>>>>>>>>>>>>>                         git commit -m "DO-NOT-PUSH" -a;                                 \
>>>>>>>>>>>>>>>>>>>>>>                         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;       \
>>>>>>>>>>>>>>>>>>>>>>                         git reset --mixed HEAD^;                                        \
>>>>>>>>>>>>>>>>>>>>>>                     else                                                                \
>>>>>>>>>>>>>>>>>>>>>>                         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;     \
>>>>>>>>>>>>>>>>>>>>>>                     fi;                                                                 \
>>>>>>>>>>>>>>>>>>>>>>                     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                                     \
>>>>>>>>>>>>>>>>>>>>>>                 else                                                                    \
>>>>>>>>>>>>>>>>>>>>>>                     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;                     \
>>>>>>>>>>>>>>>>>>>>>>                 fi
>>>>>>>>>>>>>>>>>>>>>>            Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>>>>>>>>>>>>>>            .......
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>            Well, "make rpm" built the rpms and I created the cluster.
>>>>>>>>>>>>>>>>>>>>>>            I ran the same tests and confirmed the behavior.
>>>>>>>>>>>>>>>>>>>>>>            crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>>>>>>>>>>>>>>            Thanks!
>>>>>>>>>>>>>>>>>>>           _______________________________________________
>>>>>>>>>>>>>>>>>>>           Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>>>>>>>>>>>>>>           http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>>>>>           Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>>>>>           Bugs: http://bugs.clusterlabs.org
>>>>>>>>>>>>>>>>>>          _______________________________________________
>>>>>>>>>>>>>>>>>>          Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>>>>>>>>>>>>>          http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>          Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>>>>          Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>>>>          Bugs: http://bugs.clusterlabs.org
>>>>>>>>>>>>>>>>>          _______________________________________________
>>>>>>>>>>>>>>>>>          Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>>>>>>>>>>>>          http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>          Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>>>          Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>>>          Bugs: http://bugs.clusterlabs.org