<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Try using RSTP on the switches, if possible; it has a lower convergence
time.<br>
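For example, on a Cisco-style switch this is usually only a couple of
lines. Treat the snippet below as a sketch: the interface name is made up
and the exact commands depend on your switch vendor and model.<br>
<pre wrap="">! hypothetical Cisco IOS-style example, adjust for your hardware
! use Rapid PVST+ (RSTP) so reconvergence takes seconds instead of ~50s
switch(config)# spanning-tree mode rapid-pvst
! on the ports facing the cluster nodes, mark them as edge ports so they
! go straight to forwarding instead of sitting in listening/learning
switch(config)# interface GigabitEthernet0/1
switch(config-if)# spanning-tree portfast
</pre>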
<br>
Roberto Giordani wrote:
<blockquote cite="mid:4C78BDB5.7000001@tiscali.it" type="cite">
<pre wrap="">Thanks,
who should I contact? Which mailing list?
I've discovered that this problem occours when the port of my switch
where the cluster ring is connected became "blocked" due spanning tree.
I've resolved the bug using for the ring a separate switch without
spanning tre enabled and different subnet.
Is there a configuration to avoid that before the spanning tree
recalculate the route due a failure, the cluster nodes doesn't hang?
The hang occurses on SLES11sp1 too where the servers are up running, the
cluster status is ok, but when try to connect to the server with ssh,
after the login hang the session.
Usually the recalculate takes 50 seconds.
Regards,
Roberto.
On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hi,
On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani <a class="moz-txt-link-rfc2396E" href="mailto:r.giordani@libero.it"><r.giordani@libero.it></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hello,
I'll explain what happened after a network blackout.
I have a cluster with Pacemaker on openSUSE 11.2 64-bit:
============
Last updated: Wed Aug 18 18:13:33 2010
Current DC: nodo1 (nodo1)
Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
3 Nodes configured.
11 Resources configured.
============
Node: nodo1 (nodo1): online
Node: nodo3 (nodo3): online
Node: nodo4 (nodo4): online
Clone Set: dlm-clone
dlm:0 (ocf::pacemaker:controld): Started nodo3
dlm:1 (ocf::pacemaker:controld): Started nodo1
dlm:2 (ocf::pacemaker:controld): Started nodo4
Clone Set: o2cb-clone
o2cb:0 (ocf::ocfs2:o2cb): Started nodo3
o2cb:1 (ocf::ocfs2:o2cb): Started nodo1
o2cb:2 (ocf::ocfs2:o2cb): Started nodo4
Clone Set: XencfgFS-Clone
XencfgFS:0 (ocf::heartbeat:Filesystem): Started nodo3
XencfgFS:1 (ocf::heartbeat:Filesystem): Started nodo1
XencfgFS:2 (ocf::heartbeat:Filesystem): Started nodo4
Clone Set: XenimageFS-Clone
XenimageFS:0 (ocf::heartbeat:Filesystem): Started nodo3
XenimageFS:1 (ocf::heartbeat:Filesystem): Started nodo1
XenimageFS:2 (ocf::heartbeat:Filesystem): Started nodo4
rsa1-fencing (stonith:external/ibmrsa-telnet): Started nodo4
rsa2-fencing (stonith:external/ibmrsa-telnet): Started nodo3
rsa3-fencing (stonith:external/ibmrsa-telnet): Started nodo4
rsa4-fencing (stonith:external/ibmrsa-telnet): Started nodo3
mailsrv-rm (ocf::heartbeat:Xen): Started nodo3
dbsrv-rm (ocf::heartbeat:Xen): Started nodo4
websrv-rm (ocf::heartbeat:Xen): Started nodo4
After a switch failure, all the nodes and the RSA STONITH devices were
unreachable.
The following error occurred on one of the cluster nodes:
Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored:
receive_plocks_stored 1778493632:2 need_plocks 0#012
Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] ------------[ cut here
]------------
Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
/usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: 0000 [#1] SMP
Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file:
/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control
Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1
Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in:
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree
ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk
blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac
dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop
dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb
ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp
ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250
i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid
uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal
thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not
tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]-
Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[<ffffffff801331c2>]
[<ffffffff801331c2>] iput+0x82/0x90
Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:ffff88014ec03c30
EFLAGS: 00010246
Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: 0000000000000000 RBX:
ffff880148a703c8 RCX: 0000000000000000
Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: ffffc90000010000 RSI:
ffff880148a70380 RDI: ffff880148a703c8
Aug 18 13:11:38 nodo1 kernel: [ 4154.272125] RBP: ffff88014ec03c50 R08:
b038000000000000 R09: fe99594c51a57607
Aug 18 13:11:38 nodo1 kernel: [ 4154.272126] R10: ffff880040410270 R11:
0000000000000000 R12: ffff8801713e6e08
Aug 18 13:11:38 nodo1 kernel: [ 4154.272128] R13: ffff88014ec03d20 R14:
0000000000000000 R15: ffffc9000331d108
Aug 18 13:11:38 nodo1 kernel: [ 4154.272133] FS: 00007ff4cb11a730(0000)
GS:ffffc90000010000(0000) knlGS:0000000000000000
Aug 18 13:11:38 nodo1 kernel: [ 4154.272135] CS: e033 DS: 0000 ES: 0000 CR0:
000000008005003b
Aug 18 13:11:38 nodo1 kernel: [ 4154.272136] CR2: 00007ff4c5c45000 CR3:
0000000135b2a000 CR4: 0000000000002660
Aug 18 13:11:38 nodo1 kernel: [ 4154.272138] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Aug 18 13:11:38 nodo1 kernel: [ 4154.272140] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Aug 18 13:11:38 nodo1 kernel: [ 4154.272142] Process dlm_send (pid: 8889,
threadinfo ffff88014ec02000, task ffff8801381e45c0)
Aug 18 13:11:38 nodo1 kernel: [ 4154.272143] Stack:
Aug 18 13:11:38 nodo1 kernel: [ 4154.272144] 0000000000000000
00000000072f0874 ffff880148a70380 ffff880148a70380
Aug 18 13:11:38 nodo1 kernel: [ 4154.272146] <0> ffff88014ec03c80
ffffffff803add09 ffff88014ec03c80 00000000072f0874
Aug 18 13:11:38 nodo1 kernel: [ 4154.272147] <0> ffff8801713e6df8
ffff8801713e6e08 ffff88014ec03de0 ffffffffa05661e1
Aug 18 13:11:38 nodo1 kernel: [ 4154.272150] Call Trace:
Aug 18 13:11:38 nodo1 kernel: [ 4154.272164] [<ffffffff803add09>]
sock_release+0x89/0xa0
Aug 18 13:11:38 nodo1 kernel: [ 4154.272177] [<ffffffffa05661e1>]
tcp_connect_to_sock+0x161/0x2b0 [dlm]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272206] [<ffffffffa0568764>]
process_send_sockets+0x34/0x60 [dlm]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272222] [<ffffffff800693f3>]
run_workqueue+0x83/0x230
Aug 18 13:11:38 nodo1 kernel: [ 4154.272227] [<ffffffff80069654>]
worker_thread+0xb4/0x140
Aug 18 13:11:38 nodo1 kernel: [ 4154.272231] [<ffffffff8006fac6>]
kthread+0xb6/0xc0
Aug 18 13:11:38 nodo1 kernel: [ 4154.272236] [<ffffffff8000d38a>]
child_rip+0xa/0x20
Aug 18 13:11:38 nodo1 kernel: [ 4154.272240] Code: 42 20 48 c7 c2 b0 4c 13
80 48 85 c0 48 0f 44 c2 48 89 df ff d0 48 8b 45 e8 65 48 33 04 25 28 00 00
00 75 0b 48 83 c4 18 5b c9 c3 <0f> 0b eb fe e8 35 c6 f1 ff 0f 1f 44 00 00 55
48 8d 97 10 02 00
Aug 18 13:11:38 nodo1 kernel: [ 4154.272256] RIP [<ffffffff801331c2>]
iput+0x82/0x90
Aug 18 13:11:38 nodo1 kernel: [ 4154.272259] RSP <ffff88014ec03c30>
Aug 18 13:11:38 nodo1 kernel: [ 4154.272264] ---[ end trace 7707d0d92a7f5415
]---
Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster
node
A few log lines later, the following messages repeated until I killed
the node:
Aug 18 13:12:31 nodo1 cluster-dlm: start_kernel: start_kernel cg 3
member_count 1#012
Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
1778493632#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_configfs_members: set_members rmdir
"/sys/kernel/config/dlm/cluster/spaces/0BB443F896254AD3BA8FB960C425B666/nodes/1812048064"#012
Aug 18 13:12:31 nodo1 cluster-dlm: do_sysfs: write "1" to
"/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control"#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
The log file is attached.
Can someone explain the reason?
</pre>
</blockquote>
<pre wrap="">Perhaps the membership got out of sync...
Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster node
Maybe lmb or dejan can suggest something... I don't have much to do
with ocfs2 anymore.
</pre>
</blockquote>
<pre wrap="">Me neither. But this looks like a kernel bug:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
/usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
</pre>
</blockquote>
</blockquote>
<pre wrap="">Perhaps ask on the kernel ML?
Thanks,
Dejan
</pre>
<blockquote type="cite">
<pre wrap="">_______________________________________________
Pacemaker mailing list: <a class="moz-txt-link-abbreviated" href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>
<a class="moz-txt-link-freetext" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>
Project Home: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a>
Getting started: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>
Bugs: <a class="moz-txt-link-freetext" href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>
</pre>
</blockquote>
<pre wrap="">_______________________________________________
Pacemaker mailing list: <a class="moz-txt-link-abbreviated" href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>
<a class="moz-txt-link-freetext" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>
Project Home: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a>
Getting started: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>
Bugs: <a class="moz-txt-link-freetext" href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>
</pre>
</blockquote>
<pre wrap=""><!---->
_______________________________________________
Pacemaker mailing list: <a class="moz-txt-link-abbreviated" href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>
<a class="moz-txt-link-freetext" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>
Project Home: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a>
Getting started: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>
Bugs: <a class="moz-txt-link-freetext" href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
</pre>
</body>
</html>