[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Andrew Beekhof andrew at beekhof.net
Thu Nov 24 04:33:01 UTC 2011


On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:
> Hi Andrew,
>
> I just found another problem with dlm_controld.pcmk (with your latest
> patch from github applied and also my fixes to actually build it - they
> are included in a message referenced by this one).
> A node that has just requested fencing of another one gets stuck printing
> the message where you print ctime() in fence_node_time() (pacemaker.c,
> near line 293) every second.

So not blocked, it just keeps repeating that message?
What date does it print?

Did you change it to the following?
  log_debug("Node %d was last shot at: %s", nodeid, ctime(*last_fenced_time));
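
One thing to watch with a line like the one above: ctime() takes a
pointer to time_t (const time_t *), and the string it returns ends in a
newline. If last_fenced_time is a pointer to a 64-bit integer holding
seconds since the epoch (an assumption here, not checked against the
tree), the value has to be copied into a time_t first. A minimal sketch
of that, using a hypothetical helper name and strftime() instead of
ctime() to avoid the trailing newline:

```c
#define _POSIX_C_SOURCE 200809L  /* for gmtime_r() */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper, not part of dlm_controld: format a fence timestamp
 * (assumed to be seconds since the epoch in a uint64_t) into buf.
 * A value of 0 is treated here as "never fenced" (also an assumption).
 */
static const char *format_fence_time(uint64_t last_fenced_time,
                                     char *buf, size_t len)
{
    time_t t = (time_t)last_fenced_time;  /* convert the value, then use &t */
    struct tm tm;

    assert(buf != NULL && len > 0);

    if (last_fenced_time == 0) {
        snprintf(buf, len, "never");
    } else if (gmtime_r(&t, &tm) == NULL ||
               strftime(buf, len, "%Y-%m-%d %H:%M:%S UTC", &tm) == 0) {
        snprintf(buf, len, "unknown");
    }
    return buf;
}

/*
 * Usage, e.g. inside a function that receives uint64_t *last_fenced_time:
 *
 *   char when[32];
 *   log_debug("Node %d was last shot at: %s", nodeid,
 *             format_fence_time(*last_fenced_time, when, sizeof(when)));
 */
```

Printing "never" for 0 also makes it obvious in the log when no fence
time was reported at all.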

> No other messages appear, although
> fence_node_time() is called only from check_fencing_done() (cpg.c near
> 444). So, both of (last_fenced_time >= node->fail_time) and
> (!node->fence_queries || node->fence_time != last_fenced_time) are
> false, otherwise one of the messages for those cases would be shown. Then,
> fence_node_time() seems to return 0 from
> if (wait_count)
>        return 0;
> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
> false), so it never reaches the check_fencing_done() call and never
> returns the expected 1.
> The offending node was actually fenced, but dlm_controld did not handle
> that.
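
The control flow described above can be modelled roughly as follows.
This is a sketch built only from the conditions quoted in the message,
not the dlm_controld source: the struct, the function name
poll_fencing_done() and the messages counter are hypothetical, and
last_fenced[i] stands in for whatever fence_node_time() reports for
node i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the per-node state discussed above. */
struct node_state {
    int nodeid;
    int check_fencing;    /* 1 while we still wait for this node to be fenced */
    uint64_t fail_time;   /* when the node was seen to fail */
    uint64_t fence_time;  /* last_fenced_time recorded on an earlier poll */
    int fence_queries;    /* how many polls logged a "wait" message */
};

/*
 * One poll over all nodes. Returns 1 when no node is still waiting for
 * fencing, 0 otherwise. *messages counts how many lines were logged.
 */
static int poll_fencing_done(struct node_state *nodes, size_t n,
                             const uint64_t *last_fenced, int *messages)
{
    int wait_count = 0;

    assert(nodes != NULL && last_fenced != NULL && messages != NULL);

    for (size_t i = 0; i < n; i++) {
        struct node_state *node = &nodes[i];

        if (!node->check_fencing)
            continue;

        if (last_fenced[i] >= node->fail_time) {
            /* Fenced after the failure: nothing left to wait for. */
            node->check_fencing = 0;
            printf("node %d fencing done\n", node->nodeid);
            (*messages)++;
            continue;
        }

        /* Log only on the first query or when the reported time changed. */
        if (!node->fence_queries || node->fence_time != last_fenced[i]) {
            printf("node %d wait: fail %llu last %llu\n", node->nodeid,
                   (unsigned long long)node->fail_time,
                   (unsigned long long)last_fenced[i]);
            (*messages)++;
            node->fence_queries++;
            node->fence_time = last_fenced[i];
        }
        wait_count++;
    }

    if (wait_count)
        return 0;  /* silent on repeat polls if nothing changed */
    return 1;
}
```

In this model, as long as the reported fence time stays older than
fail_time and does not change, every poll after the first returns 0
without logging anything, which matches the symptom described: the only
visible output is whatever the time query itself prints.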
>
> May I ask you to help me a bit with all that logic (as you have already
> dived into the dlm_controld sources again)? I seem to be so near success... :|
>
> btw, I can't find which source your dlm repo was forked from, maybe you
> remember?

iirc, it was dlm.git on fedorahosted.

>
> Best,
> Vladislav
>
> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>> Hi Andrew,
>>
>>>> All the more reason to start using the stonith api directly.
>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>    https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>
>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>> my build. Then it doesn't compile without attached patch.
>>> It may need to be rebased a bit against your tree.
>>>
>>> Now I have the package built and am building node images. Will try shortly.
>>
>> Fencing from within dlm_controld.pcmk still did not work with your first
>> patch against that _no_mainloop function (expected).
>>
>> So I did my best to build packages from the current git tree.
>>
>> Voila! I got the failed node correctly fenced!
>> I'll do some more extensive testing over the next few days, but I believe
>> everything should be much better now.
>>
>> I knew you're a genius he-he ;)
>>
>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>> with pacemaker using the openais stack:
>>
>> 1. Build pacemaker (as of 2011-09-28) from git.
>> 2. Apply attached patches to cluster-3.0.17 source tree.
>> 3. Build dlm_controld.pcmk
>>
>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>
>> Best regards,
>> Vladislav