[Pacemaker] pacemaker/dlm problems

Vladislav Bogdanov bubble at hoster-ok.com
Mon Sep 26 04:41:14 EDT 2011


26.09.2011 11:16, Andrew Beekhof wrote:
[snip]
>>
>>>
>>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>>
>>>           rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>>
>>> from fencing/admin.c
>>>
>>> That would talk directly to the fencing daemon, bypassing attrd, crmd
>>> and PE - and thus be more reliable.
>>>
>>> This is what the cman plugin will be doing soon too.
>>
>> Great to know, I'll try that in the near future. Thank you very much
>> for the pointer.
> 
> 1.1.7 will actually make use of this API regardless of any *_controld
> changes - I'm in the middle of updating the two library functions they
> use (crm_terminate_member and crm_terminate_member_no_mainloop).

Ah, then I'll try your patch and wait for that to be resolved.
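
For my own notes, here is a minimal sketch of what such a direct call to
the fencing daemon might look like, put together from the line you quoted
and from fencing/admin.c (untested; the client name "dlm_controld", the
helper name and the exact constants/headers are my assumptions and may
differ between versions):

    #include <crm/stonith-ng.h>

    /* Ask stonith-ng directly to reboot a node - no attrd/crmd/PE involved. */
    static int fence_node_directly(const char *target)
    {
        stonith_t *st = stonith_api_new();
        int rc = st->cmds->connect(st, "dlm_controld", NULL);

        if (rc == stonith_ok) {
            /* Same call as in your example: reboot, 120 second timeout. */
            rc = st->cmds->fence(st, st_opt_sync_call, target, "reboot", 120);
        }

        st->cmds->disconnect(st);
        stonith_api_delete(st);
        return rc;
    }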

> 
>>
>>>
>>>>
>>>> I agree with Jiaju
>>>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
>>>> that this could be solely a pacemaker problem, because it probably
>>>> should originate the fencing itself in such a situation, I think.
>>>>
>>>> So, using pacemaker/dlm with openais stack is currently risky due to
>>>> possible hangs of dlm_lockspaces.
>>>
>>> It shouldn't be; failing to connect to attrd is very unusual.
>>
>> By the way, one of the underlying problems, which actually made me
>> notice all this, is that the pacemaker cluster does not fence its DC if
>> it leaves the cluster for a very short time. That is what Jiaju said in
>> his notes, and I can confirm it.
> 
> That's highly surprising.  Do the logs you sent display this behaviour?

They do. The rest of the cluster begins the election, but then accepts
the returned DC back (I'm writing this from memory; I looked at the logs
on Sep 5-6, so I may be mixing something up).

[snip]
>>>> Although it took 25 seconds instead of 3 to break the cluster (I
>>>> understand it is almost impossible to load a host that much, but
>>>> anyway), I then got a real nightmare: two nodes of the 3-node cluster
>>>> had cman stopped (and pacemaker too, because of the lost cman
>>>> connection) - they requested kick_node_from_cluster() for each other,
>>>> and that succeeded. But fencing didn't happen (I still need to look
>>>> into why, but this is cman specific).

Btw, the underlying logic of this part is tricky for me to understand:
* cman just stops the cman processes on remote nodes, disregarding
quorum. I hope that could be fixed in corosync, if I understand one of
the latest threads there correctly.
* But cman does not fence those nodes, and they still run resources.
This could be extremely dangerous under some circumstances. And cman
does not fence them even if it has fence devices configured in
cluster.conf (I verified that).

>>>> The remaining node had pacemaker hung; it didn't even notice the
>>>> cluster infrastructure change. The down nodes were listed as online,
>>>> one of them was the DC, and all resources were marked as started on
>>>> all nodes (including the down ones). No log entries from pacemaker
>>>> at all.
>>>
>>> Well, I can't see any logs from anyone, so it's hard for me to comment.
>>
>> Logs were sent privately.
>>
>>>

Vladislav
