[ClusterLabs] DLM hanging when corosync is OK causes cluster to hang

Digimer lists at alteeve.ca
Sun Apr 3 13:58:18 EDT 2016


On 19/01/16 08:04 PM, Jan Pokorný wrote:
> On 11/01/16 11:59 -0500, Digimer wrote:
>>   We hit a strange problem where a RAID controller on a node failed,
>> causing DLM (gfs2/clvmd) to hang, but the node was never fenced. I
>> assume this was because corosync was still working.
>>
>>   Is there a way in rhel6/cman/rgmanager to have a node suicide or get
>> fenced in a condition like this?
> 
> Something like this in the crontab (though cron and the other
> components involved now become a SPOF, and an I/O spike or DoS would
> finish the apocalypse)?
> 
> */1 * * * * timeout 30s touch <file on respective fs> || fence_node <myself>
> 
> A more sophisticated approach within the components you mentioned
> might be preferable, though.

Very very late reply, but I finally implemented this. :D

https://github.com/ClusterLabs/striker/commit/8c11cf1edd9278c4fe5256096748fb62c330a948
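
For the archives, here is the suggestion above spelled out as a
script. The mount point and node name are placeholders, and it assumes
a cman cluster where coreutils' 'timeout' and fence_node are
available:

    #!/bin/sh
    # Node-side self-fence check, meant to run from cron every minute.
    # FS_FILE and the hostname mapping are placeholders, not from a
    # real deployment.
    FS_FILE=/shared/.fence_probe   # any file on the gfs2 filesystem
    ME=$(hostname -s)              # this node's cluster node name

    # If the touch blocks for more than 30 seconds, assume storage is
    # wedged; 'timeout' exits non-zero and we ask cman to fence this
    # very node.
    timeout 30s touch "$FS_FILE" || fence_node "$ME"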

The way I've done it: in ScanCore, certain post-scan actions run
depending on whether the machine is a cluster node or a dashboard.

If the user enables the feature, then at the end of a dashboard scan,
ScanCore checks for access to each node (a trivial 'echo 1' over the
connection) and, if the node answers, calls 'timeout X ls /shared ||
echo timeout' ('/shared' being the gfs2 mount point while the node is
in the cluster). If 'X' seconds elapse and the command returns
'timeout', the dashboard reboots the node using the node's primary
fence method (which we cache on the dashboards).
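
In shell terms, the per-node check boils down to roughly this sketch;
the node names, the 10-second timeout and the fence_ipmilan call are
illustrative stand-ins for what ScanCore actually does with its cached
fence data:

    #!/bin/sh
    # Dashboard-side storage check (sketch; all names illustrative).
    for NODE in node1 node2; do
        # The access check; can we reach the node at all?
        ssh "$NODE" 'echo 1' >/dev/null 2>&1 || continue

        # Ask the node to list the gfs2 mount. If DLM is hung, the
        # 'ls' blocks, 'timeout' kills it after 10 seconds, and we
        # get back the literal string 'timeout'.
        RESULT=$(ssh "$NODE" 'timeout 10 ls /shared >/dev/null 2>&1 || echo timeout')

        if [ "$RESULT" = "timeout" ]; then
            # Corosync may still look healthy, but storage is wedged;
            # reboot the node via its primary fence method.
            fence_ipmilan -a "${NODE}-ipmi" -l admin -p secret -o reboot
        fi
    done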

Initial testing is working great!

Thank you again for the 'timeout' pointer. I never knew it existed and I
can imagine this helping me elsewhere, too. :D

digi

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



