[Pacemaker] hangs pending
Andrey Groshev
greenx at yandex.ru
Sat Feb 22 03:07:55 EST 2014
21.02.2014, 04:00, "Andrew Beekhof" <andrew at beekhof.net>:
> On 20 Feb 2014, at 10:04 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>> 20.02.2014, 13:57, "Andrew Beekhof" <andrew at beekhof.net>:
>>> On 20 Feb 2014, at 5:33 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>> 20.02.2014, 01:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>> On 20 Feb 2014, at 4:18 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>> 19.02.2014, 06:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>> On 18 Feb 2014, at 9:29 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>
>>>>>>>> Today was a good day - I killed a lot, and got shot at a lot.
>>>>>>>> In general - I am happy (almost like an elephant) :)
>>>>>>>> Besides the resources, eight processes on the node matter to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the cluster - that's good too.
>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>
>>>>>>>> They fall into four groups:
>>>>>>>> 1. corosync,cib - STONITH works 100% of the time.
>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>> excellent
>>>>>>>> 3. stonithd,attrd,pengine - no STONITH needed
>>>>>>>> These daemons simply restart, and the resources stay running.
>>>>>>> right
>>>>>>>> 2. lrmd,crmd - strange STONITH behavior.
>>>>>>>> Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>> Sometimes the daemon restarts
>>>>>>> The daemon will always try to restart, the only variable is how long it takes the peer to notice and initiate fencing.
>>>>>>> If the failure happens just before they're due to receive the totem token, the failure will be very quickly detected and the node fenced.
>>>>>>> If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>>>>
>>>>>>> So fence/not fence is normal and to be expected.
>>>>>>>> and the resources (MS:pgsql) restart with a large delay.
>>>>>>>> One time after a crmd restart, pgsql didn't restart.
>>>>>>> I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>>>>> In the case the node is not fenced, the crmd will respawn and the PE will request that it re-detect the state of all resources.
>>>>>>>
>>>>>>> If the agent reports "all good", then there is nothing more to do.
>>>>>>> If the agent is not reporting "all good", you should really be asking why.
>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>> On non-systemd based machines, correct.
>>>>>>>
>>>>>>> On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>>>>> Any subsequent daemon failure will be detected and the daemon respawned.
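For reference, the four groups reported above can be written down as a small lookup. This is only a sketch of the behavior observed in this thread; the function name expected_reaction and the strings it prints are made up for illustration and are not part of Pacemaker.

```shell
# Hypothetical helper: maps a daemon name to the reaction reported in this
# thread when that daemon is killed. Not a Pacemaker command.
expected_reaction() {
    case "$1" in
        corosync|cib)           echo "fence" ;;             # group 1: STONITH every time
        stonithd|attrd|pengine) echo "respawn" ;;           # group 3: daemon restarts, resources stay
        lrmd|crmd)              echo "respawn-or-fence" ;;  # group 2: depends on detection timing
        pacemakerd)             echo "nothing-unless-respawned-by-init" ;;  # group 4
        *)                      echo "unknown" ;;
    esac
}
```

For example, `expected_reaction crmd` prints `respawn-or-fence`, which matches the explanation that the outcome depends on how quickly the peer notices the failure.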
>>>>>> And! I almost forgot about IT!
>>>>>> Are there any other (NORMAL) variants, methods, ideas?
>>>>>> Without this ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>>>> Otherwise - it's a full epic fail ;)
>>>>> -ENOPARSE
>>>> OK, I'll set aside my personal attitude to "systemd".
>>>> Let me explain.
>>>>
>>>> Somewhere in the beginning of this topic, I wrote:
>>>> A.G.:Who knows who runs lrmd?
>>>> A.B.:Pacemakerd.
>>>> That's one!
>>>>
>>>> Let's see the list of processes:
>>>> #ps -axf
>>>> .....
>>>> 6067 ? Ssl 7:24 corosync
>>>> 6092 ? S 0:25 pacemakerd
>>>> 6094 ? Ss 116:13 \_ /usr/libexec/pacemaker/cib
>>>> 6095 ? Ss 0:25 \_ /usr/libexec/pacemaker/stonithd
>>>> 6096 ? Ss 1:27 \_ /usr/libexec/pacemaker/lrmd
>>>> 6097 ? Ss 0:49 \_ /usr/libexec/pacemaker/attrd
>>>> 6098 ? Ss 0:25 \_ /usr/libexec/pacemaker/pengine
>>>> 6099 ? Ss 0:29 \_ /usr/libexec/pacemaker/crmd
>>>> .....
>>>> That's two!
>>> What's two? I don't follow.
>> In the sense that it creates other processes. But it does not matter.
>>>> And more, more...
>>>> Now you must understand why I want this process to always work.
>>>> I think nobody here even needs it explained!
>>>>
>>>> And now you tell me "pacemakerd works nicely, but only on systemd distros" !!!
>>> No, I'm saying it works _better_ on systemd distros.
>>> On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no-one saw any of the error messages and ran "service pacemaker restart" prior to the additional failures).
>> Can you show me the place where:
>> "On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons."?
>
> The code for it is in mcp/pacemaker.c, look for find_and_track_existing_processes()
>
> The ps tree will look different though
>
> 6094 ? Ss 116:13 /usr/libexec/pacemaker/cib
> 6095 ? Ss 0:25 /usr/libexec/pacemaker/stonithd
> 6096 ? Ss 1:27 /usr/libexec/pacemaker/lrmd
> 6097 ? Ss 0:49 /usr/libexec/pacemaker/attrd
> 6098 ? Ss 0:25 /usr/libexec/pacemaker/pengine
> 6099 ? Ss 0:29 /usr/libexec/pacemaker/crmd
> ...
> 6666 ? S 0:25 pacemakerd
>
> but pacemakerd will be watching the old children and respawning them on failure.
> at which point you might see:
>
> 6094 ? Ss 116:13 /usr/libexec/pacemaker/cib
> 6096 ? Ss 1:27 /usr/libexec/pacemaker/lrmd
> 6097 ? Ss 0:49 /usr/libexec/pacemaker/attrd
> 6098 ? Ss 0:25 /usr/libexec/pacemaker/pengine
> 6099 ? Ss 0:29 /usr/libexec/pacemaker/crmd
> ...
> 6666 ? S 0:25 pacemakerd
> 6667 ? Ss 0:25 \_ /usr/libexec/pacemaker/stonithd
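A rough way to see which daemons a respawned pacemakerd has re-attached to, rather than spawned itself, is to compare parent PIDs. The sketch below assumes input lines of the form "PID PPID COMMAND" (for example from `ps -eo pid=,ppid=,args=`); the function name detached_daemons is hypothetical.

```shell
# Hypothetical helper: reads "PID PPID COMMAND" lines on stdin and prints the
# Pacemaker daemons whose parent is NOT the running pacemakerd process.
detached_daemons() {
    awk '
        # Remember every line; the pacemakerd PID may appear after its daemons.
        { pid[NR] = $1; ppid[NR] = $2; cmd[NR] = $3 }
        $3 ~ /(^|\/)pacemakerd$/ { mcp = $1 }
        END {
            for (i = 1; i <= NR; i++)
                if (cmd[i] ~ /\/pacemaker\/(cib|stonithd|lrmd|attrd|pengine|crmd)$/ && ppid[i] != mcp)
                    print pid[i], cmd[i]
        }
    '
}
```

Possible usage: `ps -eo pid=,ppid=,args= | detached_daemons`. In the second listing above, everything except the freshly respawned stonithd would be printed.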
>
>> If I respawn the pacemakerd process via upstart, will it also "reattach to the existing daemons"?
>
> If upstart is capable of detecting the pacemakerd failure and automagically respawning it, then yes - the same process will happen.
Some people defend you and send me hate mail when I'm not restrained.
But you're a sly one too :)
Why didn't you say anything about upstart support in the spec?
>>>> What should I do now?
>>>> * Integrate systemd in CentOS?
>>>> * Migrate to Fedora?
>>>> * Buy RHEL7 !?
>>> Option 3 is particularly good :)
>> It's too easy. Normal heroes are always going to bypass :)
>>>> Each variant is great, but none of them fits me.
>>>>
>>>> P.S. And I'm not even talking about distros which haven't migrated to systemd (and never will).
>>> Are there any? Even debian and ubuntu have raised the white flag.
>> That's a lyrical digression, of course, but potentially it can be any Unix-like system.
>>>> Do not be offended! We also do so.
>>>> We are building a secret military factory,
>>>> large concrete fence around it,
>>>> wall barbed wire, but forget to install the gates. :)
>>>>>>>> And then I can kill any process of the third group. They do not restart.
>>>>>>> Until they become needed.
>>>>>>> Eg. if the DC goes to invoke the policy engine, that will fail causing the crmd to fail and the node to be fenced.
>>>>>>>> Generally, don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>>>
>>>>>>>> What do you think about this?
>>>>>>>> The main question of this topic is solved.
>>>>>>>> But this varied behavior is another big problem.
>>>>>>>>
>>>>>>>> 17.02.2014, 08:52, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>> 17.02.2014, 02:27, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>> With no quick follow-up, dare one hope that means the patch worked? :-)
>>>>>>>>> Hi,
>>>>>>>>> No, unfortunately the chief changed my plans on Friday and I spent the whole day on a parallel project.
>>>>>>>>> I hope to have time today to carry out the necessary tests.
>>>>>>>>>> On 14 Feb 2014, at 3:37 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>> Yes, of course. Now beginning build world and test )
>>>>>>>>>>>
>>>>>>>>>>> 14.02.2014, 04:41, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>> The previous patch wasn't quite right.
>>>>>>>>>>>> Could you try this new one?
>>>>>>>>>>>>
>>>>>>>>>>>> http://paste.fedoraproject.org/77123/13923376/
>>>>>>>>>>>>
>>>>>>>>>>>> [11:23 AM] beekhof at f19 ~/Development/sources/pacemaker/devel ☺ # git diff
>>>>>>>>>>>> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>>>>>>> index ac4b905..d49525b 100644
>>>>>>>>>>>> --- a/crmd/callbacks.c
>>>>>>>>>>>> +++ b/crmd/callbacks.c
>>>>>>>>>>>> @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>>>>>>>>>> stop_te_timer(down->timer);
>>>>>>>>>>>>
>>>>>>>>>>>> flags |= node_update_join | node_update_expected;
>>>>>>>>>>>> - crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>>>>>>>>> - crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>> + crmd_peer_down(node, FALSE);
>>>>>>>>>>>> check_join_state(fsa_state, __FUNCTION__);
>>>>>>>>>>>>
>>>>>>>>>>>> update_graph(transition_graph, down);
>>>>>>>>>>>> diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>>>>>>>> index bc472c2..1a2577a 100644
>>>>>>>>>>>> --- a/crmd/crmd_utils.h
>>>>>>>>>>>> +++ b/crmd/crmd_utils.h
>>>>>>>>>>>> @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>>>>>>>>> const char *get_timer_desc(fsa_timer_t * timer);
>>>>>>>>>>>> gboolean too_many_st_failures(void);
>>>>>>>>>>>> void st_fail_count_reset(const char * target);
>>>>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>>>>>>>>
>>>>>>>>>>>> # define fsa_register_cib_callback(id, flag, data, fn) do { \
>>>>>>>>>>>> fsa_cib_conn->cmds->register_callback( \
>>>>>>>>>>>> diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>>>>>>>> index f31d4ec..3bfce59 100644
>>>>>>>>>>>> --- a/crmd/te_actions.c
>>>>>>>>>>>> +++ b/crmd/te_actions.c
>>>>>>>>>>>> @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>>>>>>>>>>>> crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>>>>>>>>> peer->uuid = strdup(uuid);
>>>>>>>>>>>> }
>>>>>>>>>>>> - crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>> - crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>> - crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>> - crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>>
>>>>>>>>>>>> + crmd_peer_down(peer, TRUE);
>>>>>>>>>>>> node_state =
>>>>>>>>>>>> do_update_node_cib(peer,
>>>>>>>>>>>> node_update_cluster | node_update_peer | node_update_join |
>>>>>>>>>>>> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>>>>>>>>>> index ad7e573..0c92e95 100644
>>>>>>>>>>>> --- a/crmd/te_utils.c
>>>>>>>>>>>> +++ b/crmd/te_utils.c
>>>>>>>>>>>> @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> - crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>> - crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>> - crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>> - crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>> + crmd_peer_down(peer, TRUE);
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/crmd/utils.c b/crmd/utils.c
>>>>>>>>>>>> index 3988cfe..2df53ab 100644
>>>>>>>>>>>> --- a/crmd/utils.c
>>>>>>>>>>>> +++ b/crmd/utils.c
>>>>>>>>>>>> @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>>>>>>>>>>>> crm_trace("telling attrd to clear attributes for remote host %s", host);
>>>>>>>>>>>> update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>>>>>>>>>> }
>>>>>>>>>>>> +
>>>>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + if(full && peer->state == NULL) {
>>>>>>>>>>>> + crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>>>> + crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>>>> + }
>>>>>>>>>>>> + crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>>> + crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>>>> +}
>>>>>>>>>>>>
>>>>>>>>>>>> On 16 Jan 2014, at 7:24 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>> 16.01.2014, 01:30, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>> 15.01.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>> 14.01.2014, 10:00, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>> Ok, here's what happens:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. node2 is lost
>>>>>>>>>>>>>>>>>>> 2. fencing of node2 starts
>>>>>>>>>>>>>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>>>>>>>>>>>>>> 4. node2 returns to the membership
>>>>>>>>>>>>>>>>>>> 5. node2 is marked as a cluster member
>>>>>>>>>>>>>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>>>>>>>>>>>> Which is a problem since the node2 fencing operation is part of that
>>>>>>>>>>>>>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>>>>>>>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That's in theory, except we automatically try again. Which isn't appropriate.
>>>>>>>>>>>>>>>>>>> This should be relatively easy to fix.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This I have no idea how to fix yet.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>>>>>>>>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>>>>>>>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>>>>>>>>>>>>> Damn. Looks like you're right. At some point I broke my agent and did not notice it. Who will understand.
>>>>>>>>>>>>>>>>> I repaired my agent - after sending the reboot it was waiting on STDIN.
>>>>>>>>>>>>>>>>> The "normal" behavior is back - the node hangs in "pending" until I manually send a reboot. :)
>>>>>>>>>>>>>>>> Right. Now you're in case 8b.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>>>>>>>>>>>>>>> Spent the whole day on experiments.
>>>>>>>>>>>>>>> It turns out like this:
>>>>>>>>>>>>>>> 1. Built the cluster.
>>>>>>>>>>>>>>> 2. On node-2, sent signal (-4) - killed corosync
>>>>>>>>>>>>>>> 3. From node-1 (the DC is there) - stonith sent a reboot
>>>>>>>>>>>>>>> 4. The node rebooted and the resources started.
>>>>>>>>>>>>>>> 5. Again. On node-2, sent signal (-4) - killed corosync
>>>>>>>>>>>>>>> 6. Again. From node-1 (the DC is there) - stonith sent a reboot
>>>>>>>>>>>>>>> 7. Node-2 rebooted and hangs in "pending"
>>>>>>>>>>>>>>> 8. Waiting, waiting..... rebooted manually.
>>>>>>>>>>>>>>> 9. Node-2 rebooted and the resources started.
>>>>>>>>>>>>>>> 10. GOTO step 2
>>>>>>>>>>>>>> Logs?
>>>>>>>>>>>>> Yesterday I wrote a separate email explaining why I did not post the logs.
>>>>>>>>>>>>> Please read it, it contains a few more questions.
>>>>>>>>>>>>> Today it began to hang again and continues along the same cycle.
>>>>>>>>>>>>> Logs here http://send2me.ru/crmrep2.tar.bz2
>>>>>>>>>>>>>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>>>>>>>>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>>>>>>>>>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence - node hangs in "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ......
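To catch this state from a script instead of watching the screen, the node lines can be filtered out of one-shot crm_mon output. This is a minimal sketch that assumes the line format shown above ("Node NAME (ID): pending"); the function name pending_nodes is hypothetical.

```shell
# Hypothetical helper: reads crm_mon text output on stdin and prints the
# names of nodes reported as "pending", one per line.
pending_nodes() {
    # In basic regular expressions plain ( ) are literal, \( \) capture.
    sed -n 's/^[[:space:]]*Node \([^ ]*\) ([0-9]*): pending[[:space:]]*$/\1/p'
}
```

Possible usage: `crm_mon -1 | pending_nodes`. An empty result means no node is currently shown as pending.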
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Four nodes in cluster.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thereafter, the remaining start it constantly reboot, under various pretexts, "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then in the log fell out "Too many failures ...."
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> All this time in the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Depending on the wind direction changed to "UNCLEAN"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Much time has passed and I can not accurately describe the behavior...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I tried locate the problem. Came here with this.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I set big value in property stonith-timeout="600s".
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. from node with DC call my fence agent "sshbykey"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. It sends reboot victim and waits until she comes to life again.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>>>>>>>>>>>>>> Was a version 1.1.11 from December 3.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Now try full update and retest.
>>>>>>>>>>>>>>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>>>>>>>>>>>>>> Of course yes. Little delay.... :)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>>>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>>>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm trying to solve this a problem.
>>>>>>>>>>>>>>>>>>>>>>>> Do not get solved quickly...
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>>>>>>>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>>>>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>>>>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>>>>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>>>>>>>>>>>>>>> Installed Packages
>>>>>>>>>>>>>>>>>>>>>>>> glib2.x86_64 2.26.1-3.el6 installed
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>>>>>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>>>>>>>>>>>>>>> Kernel \r on an \m
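The mismatch above (g_variant_lookup_value needs glib 2.28, the box has 2.26.1) can be checked before starting a build. The sketch below assumes a version string in the "MAJOR.MINOR.PATCH-RELEASE" form that yum prints; the function name glib_has_variant_lookup is hypothetical.

```shell
# Hypothetical helper: prints "yes" when the given glib2 version is at least
# 2.28 (the release that introduced g_variant_lookup_value), otherwise "no".
glib_has_variant_lookup() {
    ver=${1%%-*}                                   # drop the RPM release suffix, e.g. "-3.el6"
    major=$(printf '%s' "$ver" | cut -d. -f1)
    minor=$(printf '%s' "$ver" | cut -d. -f2)
    if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 28 ]; }; then
        echo yes
    else
        echo no
    fi
}
```

With the package listed above, `glib_has_variant_lookup 2.26.1-3.el6` prints `no`, which agrees with the compile error.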
>>>>>>>>>>>>>>>>>>>>>>> Can you try this patch?
>>>>>>>>>>>>>>>>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>>>>>>>>>>>>> static char *
>>>>>>>>>>>>>>>>>>>>>>> upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> + char *output = NULL;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>>>>>>>>>>>>>> + static bool err = TRUE;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + if(err) {
>>>>>>>>>>>>>>>>>>>>>>> + crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>>>>>>>>>>>>>> + err = FALSE;
>>>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>>>> +#else
>>>>>>>>>>>>>>>>>>>>>>> GError *error = NULL;
>>>>>>>>>>>>>>>>>>>>>>> GDBusProxy *proxy;
>>>>>>>>>>>>>>>>>>>>>>> GVariant *asv = NULL;
>>>>>>>>>>>>>>>>>>>>>>> GVariant *value = NULL;
>>>>>>>>>>>>>>>>>>>>>>> GVariant *_ret = NULL;
>>>>>>>>>>>>>>>>>>>>>>> - char *output = NULL;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>>>>>>>>>>>>> proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>>>>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> g_object_unref(proxy);
>>>>>>>>>>>>>>>>>>>>>>> g_variant_unref(_ret);
>>>>>>>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>>>>>>>> return output;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> Ok :) I patch source.
>>>>>>>>>>>>>>>>>>>>>> Type "make rc" - the same error.
>>>>>>>>>>>>>>>>>>>>> Because it's not building your local changes
>>>>>>>>>>>>>>>>>>>>>> Make new copy via "fetch" - the same error.
>>>>>>>>>>>>>>>>>>>>>> It seems that if not exist ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz, then download it.
>>>>>>>>>>>>>>>>>>>>>> Otherwise use exist archive.
>>>>>>>>>>>>>>>>>>>>>> Cutted log .......
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # make rc
>>>>>>>>>>>>>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>>>>>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>>>>>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>>>>>>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>>>>>>>>>>>>>>>> rm -f pacemaker.tar.*; \
>>>>>>>>>>>>>>>>>>>>>> if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>>>>>>>>>>>>>>>> git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>>>>>>>>>>>>>>>> git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>>>>>>>>>>>>> git reset --mixed HEAD^; \
>>>>>>>>>>>>>>>>>>>>>> else \
>>>>>>>>>>>>>>>>>>>>>> git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>>>>>>>>>>>>> fi; \
>>>>>>>>>>>>>>>>>>>>>> echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>>>>>>>>>>>>> else \
>>>>>>>>>>>>>>>>>>>>>> echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>>>>>>>>>>>>>> .......
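The recipe quoted above boils down to a three-way decision, which is why the local patch never reached the build. The sketch below only mirrors the logic visible in that make output; the function name tarball_action and the strings it prints are made up for illustration.

```shell
# Hypothetical helper mirroring the quoted recipe.
# $1: directory holding the tarball, $2: value of TAG
tarball_action() {
    dir=$1
    tag=$2
    if [ -f "$dir/ClusterLabs-pacemaker-$tag.tar.gz" ]; then
        echo "reuse-existing-tarball"           # nothing is re-archived at all
    elif [ "$tag" = dirty ]; then
        echo "archive-head-with-local-changes"  # temporary commit, then git archive HEAD
    else
        echo "archive-tag"                      # git archive of the tag, local edits ignored
    fi
}
```

So with TAG=Pacemaker-1.1.11-rc3 the uncommitted patch is skipped in both possible branches, whether or not the tarball already exists.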
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Well, "make rpm" built the rpms and I created the cluster.
>>>>>>>>>>>>>>>>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>>>>>>>>>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>>>>>>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org