[Pacemaker] Pacemaker hang with hardware reset
    Damiano Scaramuzza 
    cesello at daimonlab.it
       
    Tue Jul  3 22:27:00 UTC 2012
    
    
  
Hi all, this is my first post on this mailing list.
I used heartbeat in 2008 for a big project, and now I'm back with
pacemaker for a smaller one.
I have two nodes running drbd/clvm/ocfs2 with KVM virtual machines, all on
Debian wheezy using packages from testing (quite stable).
I have configured stonith with meatware and some colocation rules
(I can post the CIB file if needed).
If I gracefully stop one of the two nodes, everything works well (I mean the
VM resources migrate to the other node, drbd fences, and
all colocation and start/stop order constraints are fulfilled).
Bad things happen when I force a reset of one of the two nodes with echo b >
/proc/sysrq-trigger
Scenario 1) the cluster software hangs completely: crm_mon still reports 2
nodes online, although the other node has rebooted and come back up
without corosync/pacemaker loaded. No stonith message at all.
Scenario 2) sometimes I see the meatware stonith message, I run
meatclient, and the cluster hangs.
Scenario 3) I see the meatware message and run meatclient; crm_mon reports
"node unclean", but some resources are stopped and others are still running
or Master.
With the full configuration using ocfs2 (I tested gfs2 too), I see
these messages in syslog:
kernel: [ 2277.229622] INFO: task virsh:11370 blocked for more than 120
seconds.
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229626] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229629] virsh           D
ffff88041fc53540     0 11370  11368 0x00000000
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229635]  ffff88040b50ce60
0000000000000082 0000000000000000 ffff88040f235610
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229642]  0000000000013540
ffff8803e1953fd8 ffff8803e1953fd8 ffff88040b50ce60
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229648]  0000000000000246
0000000181349294 ffff8803f5ca2690 ffff8803f5ca2000
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229655] Call Trace:
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229673]  [<ffffffffa06da2d9>] ?
ocfs2_wait_for_recovery+0xa2/0xbc [ocfs2]
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229679]  [<ffffffff8105f51b>] ?
add_wait_queue+0x3c/0x3c
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229696]  [<ffffffffa06c8896>] ?
ocfs2_inode_lock_full_nested+0xeb/0x925 [ocfs2]
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229714]  [<ffffffffa06cdd2a>] ?
ocfs2_permission+0x2b/0xe1 [ocfs2]
Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229721]  [<ffffffff811019e9>] ?
unlazy_walk+0x100/0x132
So, to simplify and rule out ocfs2 as the cause of the hang, I tried
drbd/clvm only, but after resetting one node with the same echo b
the cluster still hangs, with these messages in syslog:
kernel: [ 8747.118110] INFO: task clvmd:8514 blocked for more than 120
seconds.
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118115] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118119] clvmd           D
ffff88043fc33540     0  8514      1 0x00000000
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118126]  ffff8803e1b35810
0000000000000082 ffff880416efbd00 ffff88042f1f40c0
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118134]  0000000000013540
ffff8803e154bfd8 ffff8803e154bfd8 ffff8803e1b35810
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118140]  ffffffff8127a5fe
0000000000000000 0000000000000000 ffff880411b8a698
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118147] Call Trace:
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118157]  [<ffffffff8127a5fe>] ?
sock_sendmsg+0xc1/0xde
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118165]  [<ffffffff81349227>] ?
rwsem_down_failed_common+0xe0/0x114
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118172]  [<ffffffff811b1b64>] ?
call_rwsem_down_read_failed+0x14/0x30
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118177]  [<ffffffff81348bad>] ?
down_read+0x17/0x19
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118195]  [<ffffffffa0556a44>] ?
dlm_user_request+0x3a/0x1a9 [dlm]
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118206]  [<ffffffffa055e61b>] ?
device_write+0x28b/0x616 [dlm]
Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118214]  [<ffffffff810eb4a9>] ?
__kmalloc+0x100/0x112
It seems that dlm or corosync stops communicating, or does not "sense"
that the other node is gone,
and all the layers above are left waiting.
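To compare the two traces, a small script can pull the blocked task and the
kernel modules in its call trace out of syslog. This is only a minimal sketch:
it assumes each kernel message sits on a single line, as in the original
syslog file (not wrapped as in this mail), and the name parse_hung_tasks is
hypothetical, not part of any cluster tool.

```python
import re

# Hung-task warning, e.g.
# "kernel: [ 8747.118110] INFO: task clvmd:8514 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# Call-trace frame, e.g.
# "[<ffffffffa0556a44>] ? dlm_user_request+0x3a/0x1a9 [dlm]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report, with its (symbol, module) frames."""
    reports = []
    current = None
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            # A new report starts; following frames belong to it.
            current = {
                "comm": hung.group("comm"),
                "pid": int(hung.group("pid")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            # The module is None for symbols built into the kernel image.
            current["frames"].append((frame.group("sym"), frame.group("mod")))
    return reports


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python hung_tasks.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as fh:
        for report in parse_hung_tasks(fh):
            modules = sorted({mod for _, mod in report["frames"] if mod})
            print(report["comm"], report["pid"], ",".join(modules) or "-")
```

On the two excerpts above this would list virsh with frames in ocfs2 and clvmd
with frames in dlm, which matches the impression that both are waiting on the
lock manager.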
Package versions:
corosync                1.4.2-2
dlm-pcmk                3.0.12-3.1
gfs-pcmk                3.0.12-3.1
ocfs2-tools-pacemaker   1.6.4-1
pacemaker               1.1.7-1
Any clue?
    
    