<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Try using RSTP on the switches, if possible; it has a lower convergence
time.<br>
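For example, on a Cisco-style switch this is usually only a couple of
lines. Treat the snippet below as a sketch: the interface name is made up
and the exact commands depend on your switch vendor and model.<br>
<pre wrap="">! hypothetical Cisco IOS-style example, adjust for your hardware
! use Rapid PVST+ (RSTP) so reconvergence takes seconds instead of ~50s
switch(config)# spanning-tree mode rapid-pvst
! on the ports facing the cluster nodes, mark them as edge ports so they
! go straight to forwarding instead of sitting in listening/learning
switch(config)# interface GigabitEthernet0/1
switch(config-if)# spanning-tree portfast
</pre>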
<br>
Roberto Giordani wrote:
<blockquote cite="mid:4C78BDB5.7000001@tiscali.it" type="cite">
<pre wrap="">Thanks,
who should I contact? Which mailing list?
I've discovered that this problem occours when the port of my switch
where the cluster ring is connected became "blocked" due spanning tree.
I've resolved the bug using for the ring a separate switch without
spanning tre enabled and different subnet.
Is there a configuration to avoid that before the spanning tree
recalculate the route due a failure, the cluster nodes doesn't hang?
The hang occurses on SLES11sp1 too where the servers are up running, the
cluster status is ok, but when try to connect to the server with ssh,
after the login hang the session.
Usually the recalculate takes 50 seconds.
Regards,
Roberto.
On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hi,
On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani <a class="moz-txt-link-rfc2396E" href="mailto:r.giordani@libero.it"><r.giordani@libero.it></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hello,
I'll explain what happened after a network blackout.
I have a cluster with Pacemaker on openSUSE 11.2 64-bit:
============
Last updated: Wed Aug 18 18:13:33 2010
Current DC: nodo1 (nodo1)
Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
3 Nodes configured.
11 Resources configured.
============
Node: nodo1 (nodo1): online
Node: nodo3 (nodo3): online
Node: nodo4 (nodo4): online
Clone Set: dlm-clone
dlm:0 (ocf::pacemaker:controld): Started nodo3
dlm:1 (ocf::pacemaker:controld): Started nodo1
dlm:2 (ocf::pacemaker:controld): Started nodo4
Clone Set: o2cb-clone
o2cb:0 (ocf::ocfs2:o2cb): Started nodo3
o2cb:1 (ocf::ocfs2:o2cb): Started nodo1
o2cb:2 (ocf::ocfs2:o2cb): Started nodo4
Clone Set: XencfgFS-Clone
XencfgFS:0 (ocf::heartbeat:Filesystem): Started nodo3
XencfgFS:1 (ocf::heartbeat:Filesystem): Started nodo1
XencfgFS:2 (ocf::heartbeat:Filesystem): Started nodo4
Clone Set: XenimageFS-Clone
XenimageFS:0 (ocf::heartbeat:Filesystem): Started nodo3
XenimageFS:1 (ocf::heartbeat:Filesystem): Started nodo1
XenimageFS:2 (ocf::heartbeat:Filesystem): Started nodo4
rsa1-fencing (stonith:external/ibmrsa-telnet): Started nodo4
rsa2-fencing (stonith:external/ibmrsa-telnet): Started nodo3
rsa3-fencing (stonith:external/ibmrsa-telnet): Started nodo4
rsa4-fencing (stonith:external/ibmrsa-telnet): Started nodo3
mailsrv-rm (ocf::heartbeat:Xen): Started nodo3
dbsrv-rm (ocf::heartbeat:Xen): Started nodo4
websrv-rm (ocf::heartbeat:Xen): Started nodo4
After a switch failure, all the nodes and the RSA STONITH devices were
unreachable.
The following error occurred on one of the cluster nodes:
Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored:
receive_plocks_stored 1778493632:2 need_plocks 0#012
Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] ------------[ cut here
]------------
Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
/usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: 0000 [#1] SMP
Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file:
/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control
Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1
Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in:
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree
ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk
blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac
dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop
dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb
ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp
ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250
i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid
uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal
thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not
tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]-
Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[<ffffffff801331c2>]
[<ffffffff801331c2>] iput+0x82/0x90
Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:ffff88014ec03c30
EFLAGS: 00010246
Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: 0000000000000000 RBX:
ffff880148a703c8 RCX: 0000000000000000
Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: ffffc90000010000 RSI:
ffff880148a70380 RDI: ffff880148a703c8
Aug 18 13:11:38 nodo1 kernel: [ 4154.272125] RBP: ffff88014ec03c50 R08:
b038000000000000 R09: fe99594c51a57607
Aug 18 13:11:38 nodo1 kernel: [ 4154.272126] R10: ffff880040410270 R11:
0000000000000000 R12: ffff8801713e6e08
Aug 18 13:11:38 nodo1 kernel: [ 4154.272128] R13: ffff88014ec03d20 R14:
0000000000000000 R15: ffffc9000331d108
Aug 18 13:11:38 nodo1 kernel: [ 4154.272133] FS: 00007ff4cb11a730(0000)
GS:ffffc90000010000(0000) knlGS:0000000000000000
Aug 18 13:11:38 nodo1 kernel: [ 4154.272135] CS: e033 DS: 0000 ES: 0000 CR0:
000000008005003b
Aug 18 13:11:38 nodo1 kernel: [ 4154.272136] CR2: 00007ff4c5c45000 CR3:
0000000135b2a000 CR4: 0000000000002660
Aug 18 13:11:38 nodo1 kernel: [ 4154.272138] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Aug 18 13:11:38 nodo1 kernel: [ 4154.272140] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Aug 18 13:11:38 nodo1 kernel: [ 4154.272142] Process dlm_send (pid: 8889,
threadinfo ffff88014ec02000, task ffff8801381e45c0)
Aug 18 13:11:38 nodo1 kernel: [ 4154.272143] Stack:
Aug 18 13:11:38 nodo1 kernel: [ 4154.272144] 0000000000000000
00000000072f0874 ffff880148a70380 ffff880148a70380
Aug 18 13:11:38 nodo1 kernel: [ 4154.272146] <0> ffff88014ec03c80
ffffffff803add09 ffff88014ec03c80 00000000072f0874
Aug 18 13:11:38 nodo1 kernel: [ 4154.272147] <0> ffff8801713e6df8
ffff8801713e6e08 ffff88014ec03de0 ffffffffa05661e1
Aug 18 13:11:38 nodo1 kernel: [ 4154.272150] Call Trace:
Aug 18 13:11:38 nodo1 kernel: [ 4154.272164] [<ffffffff803add09>]
sock_release+0x89/0xa0
Aug 18 13:11:38 nodo1 kernel: [ 4154.272177] [<ffffffffa05661e1>]
tcp_connect_to_sock+0x161/0x2b0 [dlm]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272206] [<ffffffffa0568764>]
process_send_sockets+0x34/0x60 [dlm]
Aug 18 13:11:38 nodo1 kernel: [ 4154.272222] [<ffffffff800693f3>]
run_workqueue+0x83/0x230
Aug 18 13:11:38 nodo1 kernel: [ 4154.272227] [<ffffffff80069654>]
worker_thread+0xb4/0x140
Aug 18 13:11:38 nodo1 kernel: [ 4154.272231] [<ffffffff8006fac6>]
kthread+0xb6/0xc0
Aug 18 13:11:38 nodo1 kernel: [ 4154.272236] [<ffffffff8000d38a>]
child_rip+0xa/0x20
Aug 18 13:11:38 nodo1 kernel: [ 4154.272240] Code: 42 20 48 c7 c2 b0 4c 13
80 48 85 c0 48 0f 44 c2 48 89 df ff d0 48 8b 45 e8 65 48 33 04 25 28 00 00
00 75 0b 48 83 c4 18 5b c9 c3 <0f> 0b eb fe e8 35 c6 f1 ff 0f 1f 44 00 00 55
48 8d 97 10 02 00
Aug 18 13:11:38 nodo1 kernel: [ 4154.272256] RIP [<ffffffff801331c2>]
iput+0x82/0x90
Aug 18 13:11:38 nodo1 kernel: [ 4154.272259] RSP <ffff88014ec03c30>
Aug 18 13:11:38 nodo1 kernel: [ 4154.272264] ---[ end trace 7707d0d92a7f5415
]---
Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster
node
A few log lines later, the following messages repeated until I killed
the node:
Aug 18 13:12:31 nodo1 cluster-dlm: start_kernel: start_kernel cg 3
member_count 1#012
Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
1778493632#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_configfs_members: set_members rmdir
"/sys/kernel/config/dlm/cluster/spaces/0BB443F896254AD3BA8FB960C425B666/nodes/1812048064"#012
Aug 18 13:12:31 nodo1 cluster-dlm: do_sysfs: write "1" to
"/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control"#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
nodeid 1812048064#012
The log file is attached.
Can someone explain the reason?
</pre>
</blockquote>
<pre wrap="">Perhaps the membership got out of sync...
Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster node
Maybe lmb or dejan can suggest something... I don't have much to do
with ocfs2 anymore.
</pre>
</blockquote>
<pre wrap="">Me neither. But this looks like a kernel bug:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
/usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
</pre>
</blockquote>
</blockquote>
<pre wrap="">Perhaps ask on the kernel ML?
Thanks,
Dejan
</pre>
<blockquote type="cite">
<pre wrap="">_______________________________________________
Pacemaker mailing list: <a class="moz-txt-link-abbreviated" href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>
<a class="moz-txt-link-freetext" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>
Project Home: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a>
Getting started: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>
Bugs: <a class="moz-txt-link-freetext" href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>
</pre>
</blockquote>
<pre wrap="">_______________________________________________
Pacemaker mailing list: <a class="moz-txt-link-abbreviated" href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>
<a class="moz-txt-link-freetext" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>
Project Home: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a>
Getting started: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>
Bugs: <a class="moz-txt-link-freetext" href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>
</pre>
</blockquote>
<pre wrap=""><!---->
_______________________________________________
Pacemaker mailing list: <a class="moz-txt-link-abbreviated" href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>
<a class="moz-txt-link-freetext" href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>
Project Home: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org">http://www.clusterlabs.org</a>
Getting started: <a class="moz-txt-link-freetext" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>
Bugs: <a class="moz-txt-link-freetext" href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
</pre>
</body>
</html>