[ClusterLabs] Re: DLM hanging when corosync is OK causes cluster to hang

Tue Jan 12 02:20:37 EST 2016

>>> Digimer <lists at alteeve.ca> wrote on 11.01.2016 at 17:59 in message
<5693DF77.7000506 at alteeve.ca>:
> Hi all,
> 
>   We hit a strange problem where a RAID controller on a node failed,
> causing DLM (gfs2/clvmd) to hang, but the node was never fenced. I
> assume this was because corosync was still working.

I would guess that when I/O hangs, DLM itself is still happy, so no fencing is triggered. Something has to time out before the cluster takes action. From your description it's not obvious which disks were affected. I can imagine that if every local disk came to a stop, even fencing would not succeed (if triggered locally).

> 
>   Is there a way in rhel6/cman/rgmanager to have a node suicide or get
> fenced in a condition like this?

I guess you would need a monitor for disk I/O. I wonder what would happen if you used sbd on a local (not shared) disk. Maybe it would detect that the disk is not responding and reset the node via the watchdog (if available).
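
To illustrate what I mean by a disk I/O monitor, here is a minimal sketch in Python. It is not an existing cluster agent; the function name probe_disk, the probe file prefix and the 5-second timeout are all made up for illustration. The idea is to do a small write plus fsync in a helper thread and treat the disk as failed if that does not finish in time. What to do on failure (for example asking the cluster to fence the node) is deliberately left out, and it should not depend on the disk that just hung.

```python
import os
import tempfile
import threading


def probe_disk(directory, timeout_s=5.0):
    """Return True if a small write+fsync in `directory` completes in time.

    Hypothetical helper: name, probe prefix and default timeout are assumptions.
    """
    result = {"ok": False}

    def _write_probe():
        try:
            fd, path = tempfile.mkstemp(prefix=".ioprobe-", dir=directory)
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # push the write down to the device
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            # I/O error, missing directory, read-only file system, ...
            result["ok"] = False

    # Daemon thread: a probe stuck in uninterruptible I/O cannot be cancelled,
    # so we simply stop waiting for it after timeout_s.
    worker = threading.Thread(target=_write_probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # Still alive means the I/O is hung (or at least slower than the timeout).
    return (not worker.is_alive()) and result["ok"]
```

Something would have to call this periodically against a directory on the storage in question and react when it returns False. A hung probe thread stays behind until the I/O returns, so the caller should stop probing after the first failure rather than pile up stuck threads.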

Regards,
Ulrich

> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.ca/w/ 
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org