[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Ferenc Wágner wferi at niif.hu
Fri Sep 1 11:51:11 UTC 2017


Digimer <lists at alteeve.ca> writes:

> On 2017-08-29 10:45 AM, Ferenc Wágner wrote:
>
>> Digimer <lists at alteeve.ca> writes:
>> 
>>> On 2017-08-28 12:07 PM, Ferenc Wágner wrote:
>>>
>>>> [...]
>>>> While dlm_tool status reports (similar on all nodes):
>>>>
>>>> cluster nodeid 167773705 quorate 1 ring seq 3088 3088
>>>> daemon now 2941405 fence_pid 0 
>>>> node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
>>>> node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
>>>> node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
>>>> node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
>>>> node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
>>>> node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
>>>>
>>>> dlm_tool ls shows "kern_stop":
>>>>
>>>> dlm lockspaces
>>>> name          clvmd
>>>> id            0x4104eefa
>>>> flags         0x00000004 kern_stop
>>>> change        member 5 joined 0 remove 1 failed 1 seq 8,8
>>>> members       167773705 167773706 167773707 167773708 167773710 
>>>> new change    member 6 joined 1 remove 0 failed 0 seq 9,9
>>>> new status    wait messages 1
>>>> new members   167773705 167773706 167773707 167773708 167773709 167773710 
>>>>
>>>> on all nodes except for vhbl07 (167773709), where it gives
>>>>
>>>> dlm lockspaces
>>>> name          clvmd
>>>> id            0x4104eefa
>>>> flags         0x00000000 
>>>> change        member 6 joined 1 remove 0 failed 0 seq 11,11
>>>> members       167773705 167773706 167773707 167773708 167773709 167773710 
>>>>
>>>> instead.
>>>>
>>>> [...] Is there a way to unblock DLM without rebooting all nodes?
>>>
>>> Looks like the lost node wasn't fenced.
>> 
>> Why does dlm status not report any lost node then?  Or do I misinterpret
>> its output?
>> 
>>> Do you have fencing configured and tested? If not, DLM will block
>>> forever because it won't recover until it has been told that the lost
>>> peer has been fenced, by design.
>> 
>> What command would you recommend for unblocking DLM in this case?
>
> First, fix fencing. Do you have that set up and working?

I really don't want DLM to do fencing.  DLM blocking for a couple of
days is not an issue in this setup (cLVM isn't a "service" of this
cluster, only a rarely needed administration tool).  Fencing is set up
and works fine for Pacemaker, so it's used to recover actual HA
services.  But letting DLM use it resulted in disaster one and a half
years ago (see Message-ID: <87r3g5a969.fsf at lant.ki.iif.hu>), which I
still don't understand, and I'd rather not go there again until that's
taken care of properly.  So for now, a manual unblock path is all I'm
after.
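
For comparing the lockspace state across nodes, something like the
following sketch might help.  It only parses text in the layout of the
"dlm_tool ls" output quoted above and does not unblock anything; the
function name, the dictionary keys and the SAMPLE constant are made up
for illustration, and the field layout is assumed from that one sample.

```python
# Hypothetical helper: parse text shaped like the `dlm_tool ls` output
# quoted in this thread.  The layout is assumed from that sample only.

SAMPLE = """\
dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 5 joined 0 remove 1 failed 1 seq 8,8
members       167773705 167773706 167773707 167773708 167773710
new change    member 6 joined 1 remove 0 failed 0 seq 9,9
new status    wait messages 1
new members   167773705 167773706 167773707 167773708 167773709 167773710
"""


def parse_lockspaces(text):
    """Return one dict per lockspace found in `dlm_tool ls`-style text."""
    lockspaces = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "dlm lockspaces":
            continue
        fields = line.split()
        if fields[0] == "name" and len(fields) > 1:
            # A "name" line starts a new lockspace record.
            current = {"name": fields[1], "flags": [],
                       "members": [], "new_members": []}
            lockspaces.append(current)
        elif current is None:
            continue
        elif fields[0] == "flags":
            # fields[1] is the hex mask; the rest are symbolic flag names.
            current["flags"] = fields[2:]
        elif fields[0] == "members":
            current["members"] = [int(n) for n in fields[1:]]
        elif fields[:2] == ["new", "members"]:
            current["new_members"] = [int(n) for n in fields[2:]]
    return lockspaces


def pending_joiners(lockspace):
    """Node IDs listed under "new members" but not yet under "members"."""
    return sorted(set(lockspace["new_members"]) - set(lockspace["members"]))


if __name__ == "__main__":
    for ls in parse_lockspaces(SAMPLE):
        stopped = "kern_stop" in ls["flags"]
        print(ls["name"], "kern_stop" if stopped else "running",
              "pending:", pending_joiners(ls))
```

Feeding it the output captured on each node would show at a glance
which lockspaces are in kern_stop and which node IDs are still waiting
to be admitted.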
-- 
Thanks,
Feri



