[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Vladislav Bogdanov bubble at hoster-ok.com
Mon Dec 19 06:39:44 EST 2011


09.12.2011 08:44, Andrew Beekhof wrote:
> On Fri, Dec 9, 2011 at 3:16 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>> 09.12.2011 03:11, Andrew Beekhof wrote:
>>> On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>> Hi Andrew,
>>>>
>>>> I investigated on my test cluster what actually happens with dlm and
>>>> fencing.
>>>>
>>>> I added more debug messages to dlm dump, and also did a re-kick of nodes
>>>> after some time.
>>>>
>>>> Results are that stonith history actually doesn't contain any
>>>> information until pacemaker decides to fence node itself.
>>>
>>> ...
>>>
>>>> From my PoV that means that the call to
>>>> crm_terminate_member_no_mainloop() does not actually schedule fencing
>>>> operation.
>>>
>>> You're going to have to remind me... what does your copy of
>>> crm_terminate_member_no_mainloop() look like?
>>> This is with the non-cman editions of the controlds too right?
>>
>> Just latest github's version. You changed some dlm_controld.pcmk
>> functionality, so it asks stonithd for fencing results instead of XML
>> magic. But call to crm_terminate_member_no_mainloop() remains the same
>> there. But yes, that version communicates stonithd directly too.
>>
>> SO, the problem here is just with crm_terminate_member_no_mainloop()
>> which for some reason skips actual fencing request.
> 
> There should be some logs, either indicating that it tried, or that it failed.

Nothing about fencing.
Only messages about history requests:

stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0

I even moved all the fencing code into dlm_controld to have better
control over what it does (and to avoid rebuilding pacemaker every time
I want to play with that code). dlm_tool dump prints the same line every
second, and stonith-ng logs only the history requests.
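
For anyone trying to reproduce this, here is a minimal sketch (Python,
not part of pacemaker or dlm) of how the stonith-ng log could be tallied
to confirm that only history requests arrive from cluster-dlm. It
assumes every relevant line has the same shape as the message quoted
above; the function name tally_stonith_commands is made up for
illustration.

```python
import re
from collections import Counter

# Assumption: lines look like the stonith-ng message quoted in this mail:
#   stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0
LINE_RE = re.compile(
    r"stonith_command: Processed (?P<op>\S+) from (?P<client>\S+): rc=(?P<rc>-?\d+)"
)


def tally_stonith_commands(lines):
    """Count (client, operation, rc) triples found in stonith-ng log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match.group("client"), match.group("op"), int(match.group("rc")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Sample input: the quoted log line repeated, plus one unrelated line.
    sample = [
        "stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0",
        "stonith-ng: [1905]: info: stonith_command: Processed st_fence_history from cluster-dlm: rc=0",
        "some unrelated log line",
    ]
    for (client, op, rc), n in sorted(tally_stonith_commands(sample).items()):
        print(f"{client} {op} rc={rc} count={n}")
```

If cluster-dlm ever got a real fencing request through, it should show
up here as an operation other than st_fence_history for that client.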

A little bit odd, but I once saw a fencing request from cluster-dlm
succeed, though only right after the node had been fenced by pacemaker.
As a result, the node was switched off instead of being rebooted.

That raises one more question: is it correct to call st->cmds->fence()
with the third parameter set to "off"?
I think that "reboot" is more consistent with the rest of the fencing
subsystem.

At the same time, stonith_admin -B succeeds.
The main difference I see is that st_opt_sync_call is set in the latter
case. I will try to experiment with it.

Best,
Vladislav



