[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Vladislav Bogdanov bubble at hoster-ok.com
Thu Nov 24 07:21:39 UTC 2011


24.11.2011 08:49, Andrew Beekhof wrote:
> On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
>> 24.11.2011 07:33, Andrew Beekhof wrote:
>>> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
>>> <bubble at hoster-ok.com> wrote:
>>>> Hi Andrew,
>>>>
>>>> I just found another problem with dlm_controld.pcmk (with your latest
>>>> patch from github applied and also my fixes to actually build it - they
>>>> are included in a message referenced by this one).
>>>> One node which just requested fencing of another one gets stuck
>>>> printing the message where you print ctime() in fence_node_time()
>>>> (pacemaker.c near line 293) every second.
>>>
>>> So not blocked, it just keeps repeating that message?
>>> What date does it print?
>>
>> Blocked... kern_stop
> 
> I'm confused.

So am I...

> How can it do that every second?

Only in one case:
if both of (last_fenced_time >= node->fail_time) and
(!node->fence_queries || node->fence_time != last_fenced_time) are *false*.

So, three conditions are *true* at the same moment:
* last_fenced_time < node->fail_time
* node->fence_queries != 0
* node->fence_time == last_fenced_time

If all three are true, check_fencing_done() just silently returns 0.

In all other cases I'd see one of the messages "check_fencing %d done" or
"check_fencing %d wait" (the first one should stop that loop, btw) between
consecutive "Node %d/%s was last shot at: %s" lines.
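
The three-way outcome described above can be sketched in C. This is only a
model of the reasoning in this message, not the dlm_controld source:
struct node_state, classify_fencing() and the enum names are hypothetical,
and the mapping of each branch to the "done" or "wait" message is assumed
from the description.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-node state; only the fields named
 * in this thread are modelled. */
struct node_state {
    uint64_t fail_time;      /* when the node was seen to fail */
    uint64_t fence_time;     /* last fence time recorded for the node */
    int      fence_queries;  /* how many times fencing was queried */
};

enum fence_outcome {
    FENCE_DONE,         /* would log "check_fencing %d done" */
    FENCE_WAIT_LOGGED,  /* would log "check_fencing %d wait" */
    FENCE_WAIT_SILENT   /* neither message: the case seen in the logs */
};

/* Classify one check for one node, given the time reported by the
 * fencing query (last_fenced_time). */
static enum fence_outcome
classify_fencing(const struct node_state *node, uint64_t last_fenced_time)
{
    if (last_fenced_time >= node->fail_time)
        return FENCE_DONE;

    if (!node->fence_queries || node->fence_time != last_fenced_time)
        return FENCE_WAIT_LOGGED;

    /* last_fenced_time < fail_time, fence_queries != 0 and
     * fence_time == last_fenced_time: nothing is logged. */
    return FENCE_WAIT_SILENT;
}
```

With fail_time = 100, fence_time = 50 and fence_queries = 1, a query that
keeps reporting 50 lands in the silent branch on every pass, which matches
a log showing only the "was last shot at" line.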

> 
>>
>> It prints the same date, a fairly recent one (in that case).
>> I caught it only once and cannot reproduce it yet. The date is printed
>> correctly under "normal" fencing circumstances.
>>
>>>
>>> Did you change it to the following?
>>>   log_debug("Node %d was last shot at: %s", nodeid, ctime(*last_fenced_time));
>>
>> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09959.html
>> contains patches against 3.0.17 which I use. I only backported commits
>> to dlm_controld core from 3.1.1 (and 3.1.7 last days) to make it up2date
>> (they are minor).
> 
> Ok, this (which was from my original patch) is wrong:
> 
> +        log_debug("Node %d/%s was last shot at: %s", nodeid,
> ctime(*last_fenced_time));

Agreed, and I use
log_debug("Node %d/%s was last shot at: %s", nodeid, node_uname,
ctime(last_fenced_time));
Please see the patches included in the message referenced above (a little
below the backport of your original patch).

gcc is sometimes smart enough ;)
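
A side note on the ctime(last_fenced_time) call: it passes a uint64_t * where
ctime() expects a const time_t *, which only works where time_t happens to be
64 bits wide, and the result ends in a newline. A minimal sketch of one way
around both, assuming a POSIX system; format_fence_time() is a hypothetical
helper, not something in the dlm_controld tree:

```c
#define _POSIX_C_SOURCE 200809L  /* for localtime_r() */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: render a uint64_t timestamp into buf without the
 * trailing newline that ctime() appends. Returns buf. */
static const char *
format_fence_time(uint64_t stamp, char *buf, size_t len)
{
    time_t t = (time_t)stamp;  /* copy by value; do not cast the pointer */
    struct tm tm_buf;

    if (len == 0)
        return "";

    /* Fall back to the raw number if conversion or formatting fails. */
    if (localtime_r(&t, &tm_buf) == NULL ||
        strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm_buf) == 0)
        snprintf(buf, len, "%llu", (unsigned long long)stamp);

    return buf;
}
```

The caller would supply a small stack buffer (32 bytes is enough for this
format) and pass the returned string to the %s in the log call.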

> 
> The format string expects 3 parameters but there are only 2 supplied.
> This could easily result in what you're seeing.

So, no, that's not it.
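
For reference, the kind of mismatch discussed here (three conversions, two
arguments) is something GCC and Clang can flag at compile time when the
logging function carries a printf format attribute and -Wformat (part of
-Wall) is enabled. A minimal sketch, where debug_format() is a hypothetical
wrapper and not the real log_debug():

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper. The attribute says: argument 3 is a printf-style
 * format string and the variadic arguments start at argument 4. */
static int debug_format(char *out, size_t len, const char *fmt, ...)
    __attribute__((format(printf, 3, 4)));

static int debug_format(char *out, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(out, len, fmt, ap);  /* number of chars that were needed */
    va_end(ap);
    return n;
}
```

A call such as debug_format(buf, sizeof buf, "Node %d/%s was last shot at: %s",
nodeid, when) would then produce a format warning, because the second
conversion expects a string and the third has no argument at all.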

> 
> 
>>
>> man ctime
>> char *ctime(const time_t *timep);
>>
>> int fence_node_time(int nodeid, uint64_t *last_fenced_time)
>> is called from check_fencing_done() with
>> uint64_t last_fenced_time;
>> rv = fence_node_time(node->nodeid, &last_fenced_time);
>> so, I changed it to ctime(last_fenced_time). btw ctime adds a trailing
>> newline, so it is a poor fit for logs.
>>
>> One thought: maybe the latest commits to dlm.git (with membership
>> monitoring, notably e529211682418a8e33feafc9f703cff87e23aeba) would help
>> here?
>>
>> And one note - I use fence_xvm for that failed VM, and I found that it
>> is a little bit deficient - only one instance of it can run on a host
>> at a time, as it binds to a predefined TCP port. Maybe that has an
>> influence as well...
>>
>>>
>>>> No other messages appear, although
>>>> fence_node_time() is called only from check_fencing_done() (cpg.c near
>>>> line 444). So, both (last_fenced_time >= node->fail_time) and
>>>> (!node->fence_queries || node->fence_time != last_fenced_time) are
>>>> false, otherwise one of the messages for those cases would be shown.
>>>> Then, fence_node_time() seems to return 0 from
>>>> if (wait_count)
>>>>        return 0;
>>>> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
>>>> false), so it never reaches the check_fencing_done() call and never
>>>> returns the expected 1.
>>>> The offending node was actually fenced, but that was not handled by
>>>> dlm_controld.
>>>>
>>>> May I ask you to help me a bit with all that logic (as you have already
>>>> dived into the dlm_controld sources again)? I seem to be so near
>>>> success... :|
>>>>
>>>> btw, I can't find which source your dlm repo was forked from; maybe you
>>>> remember?
>>>
>>> iirc, it was dlm.git on fedorahosted.
>>
>> Yep, I found that already, pacemaker branch. It seems to be a little bit
>> outdated compared to 3.0.17 btw.
>>
>>>
>>>>
>>>> Best,
>>>> Vladislav
>>>>
>>>> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>>>>> Hi Andrew,
>>>>>
>>>>>>> All the more reason to start using the stonith api directly.
>>>>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>>>>    https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>>>>
>>>>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>>>>> my build. Then it doesn't compile without attached patch.
>>>>>> It may need to be rebased a bit against your tree.
>>>>>>
>>>>>> Now I have package built and am building node images. Will try shortly.
>>>>>
>>>>> Fencing from within dlm_controld.pcmk still did not work with your first
>>>>> patch against that _no_mainloop function (expected).
>>>>>
>>>>> So I did my best to build packages from the current git tree.
>>>>>
>>>>> Voila! I got the failed node correctly fenced!
>>>>> I'll do some more extensive testing over the next few days, but I
>>>>> believe everything should be much better now.
>>>>>
>>>>> I knew you're a genius he-he ;)
>>>>>
>>>>> So, here are the steps to get DLM to handle CPG NODEDOWN events
>>>>> correctly with pacemaker using the openais stack:
>>>>>
>>>>> 1. Build pacemaker (as of 2011-09-28) from git.
>>>>> 2. Apply attached patches to cluster-3.0.17 source tree.
>>>>> 3. Build dlm_controld.pcmk
>>>>>
>>>>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>>>>
>>>>> Best regards,
>>>>> Vladislav
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>>
>>>>
>>
>>




