[Pacemaker] clvmd hangs on node1 if node2 is fenced

Tim Serong tserong at novell.com
Thu Aug 26 22:50:59 EDT 2010


On 8/27/2010 at 08:50 AM, Michael Smith <msmith at cbnco.com> wrote: 
> > Xinwei Hu <hxinwei at ...> writes:
> >
> > > That sounds worrying actually.
> > > I think this is logged as bug 585419 on SLES' bugzilla.
> > > If you can reproduce this issue, it's worth reopening, I think.
>  
> I've got a pair of fully patched SLES11 SP1 nodes and they're showing  
> what I guess is the same behaviour: if I hard-poweroff node2, operations  
> like "vgdisplay -v" hang on node1 for quite some time. Sometimes a  
> minute, sometimes two, sometimes forever. They get stuck here: 
>  
> Aug 26 18:31:42 xen-test1 clvmd[8906]: doing PRE command LOCK_VG  
> 'V_vm_store' at 
> 1 (client=0x7f2714000b40) 
> Aug 26 18:31:42 xen-test1 clvmd[8906]: lock_resource 'V_vm_store',  
> flags=0, mode=3 
>  
>  
> After a few seconds, corosync & dlm notice the node is gone, but  
> vgdisplay and 
> friends still hang while trying to lock the VG. 
>  
> Aug 26 18:31:44 xen-test1 corosync[8476]:  [TOTEM ] A processor failed,  
> forming new configuration. 
> Aug 26 18:31:50 xen-test1 cluster-dlm[8870]: update_cluster: Processing 
> membership 1260 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Skipped  
> active node 219878572: born-on=1256, last-seen=1260, this-event=1260,  
> last-event=1256 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: del_configfs_node:  
> del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/236655788" 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Removed  
> inactive node 236655788: born-on=1252, last-seen=1256, this-event=1260,  
> last-event=1256 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:controld  
> conf 1 0 1 memb 219878572 join left 236655788 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:ls:clvmd  
> conf 1 0 1 memb 219878572 join left 236655788 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd  
> add_change cg 3 remove nodeid 236655788 reason 3 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd  
> add_change cg 3 counts member 1 joined 0 remove 1 failed 1 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: stop_kernel: clvmd  
> stop_kernel cg 3 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: do_sysfs: write "0" to 
> "/sys/kernel/dlm/clvmd/control" 
> Aug 26 18:31:51 xen-test1 kernel: [  365.267802] dlm: closing connection  
> to node 236655788 
> Aug 26 18:31:51 xen-test1 clvmd[8906]: confchg callback. 0 joined, 1  
> left, 1 members 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: fence_node_time: Node 
> 236655788/xen-test2 has not been shot yet 
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: check_fencing_done: clvmd 
> check_fencing 23665578 not fenced add 1282861615 fence 0 
> Aug 26 18:31:51 xen-test1 crmd: [8489]: info: ais_dispatch: Membership  
> 1260: quorum still lost 
> Aug 26 18:31:51 xen-test1 cluster-dlm: [8870]: info: ais_dispatch:  
> Membership 1260: quorum still lost 

Do you have STONITH configured?  Note that it says "xen-test2 has not
been shot yet" and "clvmd ... not fenced".  DLM (and with it clvmd) is
just going to sit there until the down node is successfully fenced.
This is intentional: it's not safe to keep running until you *know*
the dead node is dead.
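
If you want to confirm what DLM is waiting on, the dlm userspace tools
can show it directly (this assumes dlm_tool is available on your nodes;
the exact output varies between versions):

    # list lockspaces - the clvmd lockspace should show a
    # configuration change still in progress while it waits
    # for the failed node to be fenced
    dlm_tool ls

    # dump dlm_controld's debug buffer, which contains the
    # "not fenced" / fencing wait messages quoted above
    dlm_tool dump

As for getting fencing in place, a minimal sketch via the crm shell
might look like the following - the agent and parameters here are only
placeholders, substitute whatever STONITH device actually matches your
hardware (external/ipmi, external/sbd, etc.):

    # example only: SBD-based fencing on a shared disk
    crm configure primitive stonith-sbd stonith:external/sbd \
            params sbd_device="/dev/disk/by-id/<your-shared-disk>"

    # and make sure fencing is enabled cluster-wide
    crm configure property stonith-enabled="true"

Once a working STONITH resource is configured, the failed node should
be shot shortly after the membership change, and DLM/clvmd recovery
(and your hanging vgdisplay) will complete as soon as fencing does.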

Regards,

Tim


-- 
Tim Serong <tserong at novell.com>
Senior Clustering Engineer, OPS Engineering, Novell Inc.
