Hello Damiano,<br><br>Are you using DRBD fencing together with Pacemaker fencing (STONITH)?<br><br><div class="gmail_quote">2012/7/4 Damiano Scaramuzza <span dir="ltr"><<a href="mailto:cesello@daimonlab.it" target="_blank">cesello@daimonlab.it</a>></span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all, this is my first post on this ML.<br>

I used Heartbeat in 2008 for a big project, and now I'm back with<br>

Pacemaker for a smaller one.<br>

<br>

I have two nodes running DRBD/cLVM/OCFS2 with KVM virtual machines, all on Debian<br>

wheezy using testing (quite stable) packages.<br>

I configured STONITH with meatware and some colocation rules<br>

(I can post the CIB file if needed).<br>

If I gracefully stop one of the two nodes, everything works well (the VM<br>

resources migrate to the other node, DRBD fences, and<br>

all colocation and start/stop ordering constraints are fulfilled).<br>

<br>

Bad things happen when I force a reset of one of the two nodes with echo b ><br>

/proc/sysrq-trigger<br>

<br>

Scenario 1) The cluster software hangs completely: crm_mon still reports 2<br>

nodes online, but the other node has rebooted and stays up<br>

with corosync/pacemaker not loaded. No STONITH message at all.<br>

<br>

Scenario 2) Sometimes I see the meatware STONITH message, I call<br>

meatclient, and the cluster hangs.<br>

Scenario 3) I get the meatware message and call meatclient; crm_mon reports "node<br>

unclean", but I see some resources stopped and some running or Master.<br>

<br>

Using the full configuration with OCFS2 (I tested GFS2 too), I see<br>

these messages in syslog:<br>

<br>

kernel: [ 2277.229622] INFO: task virsh:11370 blocked for more than 120 seconds.<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229626] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229629] virsh           D ffff88041fc53540     0 11370  11368 0x00000000<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229635]  ffff88040b50ce60 0000000000000082 0000000000000000 ffff88040f235610<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229642]  0000000000013540 ffff8803e1953fd8 ffff8803e1953fd8 ffff88040b50ce60<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229648]  0000000000000246 0000000181349294 ffff8803f5ca2690 ffff8803f5ca2000<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229655] Call Trace:<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229673]  [<ffffffffa06da2d9>] ? ocfs2_wait_for_recovery+0xa2/0xbc [ocfs2]<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229679]  [<ffffffff8105f51b>] ? add_wait_queue+0x3c/0x3c<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229696]  [<ffffffffa06c8896>] ? ocfs2_inode_lock_full_nested+0xeb/0x925 [ocfs2]<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229714]  [<ffffffffa06cdd2a>] ? ocfs2_permission+0x2b/0xe1 [ocfs2]<br>

Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229721]  [<ffffffff811019e9>] ? unlazy_walk+0x100/0x132<br>

<br>

<br>

So, to simplify and rule out OCFS2 as the cause of the hang, I tried DRBD/cLVM only, but<br>

after resetting one node with the same echo b,<br>

the cluster hangs with these messages in syslog:<br>

<br>

kernel: [ 8747.118110] INFO: task clvmd:8514 blocked for more than 120 seconds.<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118119] clvmd           D ffff88043fc33540     0  8514      1 0x00000000<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118126]  ffff8803e1b35810 0000000000000082 ffff880416efbd00 ffff88042f1f40c0<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118134]  0000000000013540 ffff8803e154bfd8 ffff8803e154bfd8 ffff8803e1b35810<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118140]  ffffffff8127a5fe 0000000000000000 0000000000000000 ffff880411b8a698<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118147] Call Trace:<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118157]  [<ffffffff8127a5fe>] ? sock_sendmsg+0xc1/0xde<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118165]  [<ffffffff81349227>] ? rwsem_down_failed_common+0xe0/0x114<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118172]  [<ffffffff811b1b64>] ? call_rwsem_down_read_failed+0x14/0x30<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118177]  [<ffffffff81348bad>] ? down_read+0x17/0x19<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118195]  [<ffffffffa0556a44>] ? dlm_user_request+0x3a/0x1a9 [dlm]<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118206]  [<ffffffffa055e61b>] ? device_write+0x28b/0x616 [dlm]<br>

Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118214]  [<ffffffff810eb4a9>] ? __kmalloc+0x100/0x112<br>

<br>

It seems that DLM or corosync stops communicating, or does not "sense"<br>

that the other node is gone,<br>

and all the layers above keep waiting.<br>

<br>

Corosync version is     1.4.2-2<br>

dlm-pcmk                3.0.12-3.1<br>

gfs-pcmk                3.0.12-3.1<br>

ocfs2-tools-pacemaker   1.6.4-1<br>

pacemaker               1.1.7-1<br>

<br>

Any clue?<br>

<br>

</blockquote></div><br><br clear="all"><br>-- <br>this is my life and I live it for as long as God wills<br>