On Tue, 07/03/2012 06:27 PM, Damiano Scaramuzza wrote:
> Hi all, my first post on this ML.
> I used heartbeat in 2008 for a big project and now I'm back with
> pacemaker for a smaller one.
>
> I have two nodes with drbd/clvm/ocfs2/kvm virtual machines, all on
> Debian wheezy using testing (quite stable) packages.
> The configuration uses stonith meatware and some colocation rules
> (if needed I can post the cib file).
> If I stop one of the two nodes gracefully, everything works fine
> (VM resources migrate to the other node, drbd fences, and all
> colocation/start-stop orders are fulfilled).
>
> Bad things happen when I force a reset of one of the two nodes with
> echo b > /proc/sysrq-trigger
>
> Scenario 1) the cluster software hangs completely: crm_mon still
> reports 2 nodes online, but the other node has rebooted and sits
> there without corosync/pacemaker running. No stonith message at all.
>
> Scenario 2) sometimes I see the meatware stonith message, I call
> meatclient, and the cluster hangs.
> Scenario 3) meatware message, I call meatclient, crm_mon reports
> "node unclean", but I see some resources stopped and some running
> or Master.
>
> Using the full configuration with ocfs2 (but I tested gfs2 too) I see
> these messages in syslog:
>
> kernel: [ 2277.229622] INFO: task virsh:11370 blocked for more than 120
> seconds.
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229626] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229629] virsh D
> ffff88041fc53540 0 11370 11368 0x00000000
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229635] ffff88040b50ce60
> 0000000000000082 0000000000000000 ffff88040f235610
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229642] 0000000000013540
> ffff8803e1953fd8 ffff8803e1953fd8 ffff88040b50ce60
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229648] 0000000000000246
> 0000000181349294 ffff8803f5ca2690 ffff8803f5ca2000
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229655] Call Trace:
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229673] [] ?
> ocfs2_wait_for_recovery+0xa2/0xbc [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229679] [] ?
> add_wait_queue+0x3c/0x3c
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229696] [] ?
> ocfs2_inode_lock_full_nested+0xeb/0x925 [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229714] [] ?
> ocfs2_permission+0x2b/0xe1 [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229721] [] ?
> unlazy_walk+0x100/0x132
>
> So, to simplify and rule ocfs2 out as the cause of the hang, I tried
> drbd/clvm only, but after resetting one node with the same echo b
> I still see the cluster hang, with these messages in syslog:

Interesting.. I ran into this same problem last night. I'm running on
Debian squeeze, using debs from squeeze-backports, on XCP (open-source
XenServer). When I force-shutdown a node, the remaining node hangs with
a similar message. I originally thought it was an OCFS2 problem - but it
may be something different. Either way, I'm going to try my
configuration on CentOS - but that is having its own unique challenges
lol.
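
For comparing notes, this is roughly the shape I'd expect the
meatware/DRBD/colocation part of such a config to take in crm syntax.
It is only a sketch: all resource names, hostnames, the DRBD resource
"r0" and the VirtualDomain example are placeholders I made up, not
taken from either cluster's actual CIB:

# sketch only -- every name below is a placeholder, not a real CIB entry
primitive st-meat stonith:meatware \
        params hostlist="hvlinux01 hvlinux02" \
        op monitor interval="60s"

primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"

# dual-primary DRBD so a cluster filesystem can be mounted on both nodes
ms ms_drbd_r0 p_drbd_r0 \
        meta master-max="2" clone-max="2" notify="true" interleave="true"

# example VM, colocated with (and ordered after) the DRBD master
primitive p_vm_web ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/web.xml" \
        op monitor interval="30s"
colocation col_vm_on_drbd inf: p_vm_web ms_drbd_r0:Master
order o_drbd_before_vm inf: ms_drbd_r0:promote p_vm_web:start

property stonith-enabled="true" no-quorum-policy="ignore"

One thing worth noting, as far as I understand meatware: DLM/OCFS2
recovery on the surviving node is expected to block until someone runs
meatclient to acknowledge the fence, so a hang before confirmation is
by design; the hang that persists after confirming is the part that
looks like a real problem.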