[Pacemaker] cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012

Roberto Giordani r.giordani at libero.it
Mon Aug 30 11:56:17 EDT 2010


OK, I'll do that.
Thanks!
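
For reference: one way to ride out the ~50-second spanning-tree reconvergence described below is to raise corosync's totem token timeout above the convergence time, so a temporarily blocked switch port does not trigger a false membership change. This is a sketch only; the timeout values and the ring subnet are illustrative, not taken from this cluster:

```
# /etc/corosync/corosync.conf (fragment) -- illustrative values only
totem {
    version: 2
    # Declare token loss only after the ~50 s STP convergence window
    # has passed (the default token timeout is only 1000 ms):
    token: 60000          # milliseconds
    consensus: 72000      # milliseconds; should be >= 1.2 * token
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0    # hypothetical ring subnet
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
```

Note that a long token timeout also delays detection of real node failures; a second, independent ring (a second interface block plus rrp_mode: passive) is generally the more robust fix.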

On 08/30/2010 11:16 AM, Dan Frincu wrote:
> Try using RSTP on the switches, if possible; it has a lower
> convergence time.
>
> Roberto Giordani wrote:
>> Thanks,
>> who should I contact? Which mailing list?
>> I've discovered that this problem occurs when the switch port that the
>> cluster ring is connected to becomes "blocked" due to spanning tree.
>> I've worked around it by using a separate switch for the ring, with
>> spanning tree disabled and a different subnet.
>> Is there a configuration that keeps the cluster nodes from hanging
>> while spanning tree recalculates the routes after a failure?
>> The hang occurs on SLES11 SP1 too: the servers are up and running and
>> the cluster status is OK, but when I try to connect to a server with
>> ssh, the session hangs right after login.
>>
>> The recalculation usually takes about 50 seconds.
>>
>> Regards,
>> Roberto.
>>
>> On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:
>>>> On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani <r.giordani at libero.it> wrote:
>>>>     
>>>>       
>>>>> Hello,
>>>>> I'll explain what happened after a network blackout.
>>>>> I have a cluster with Pacemaker on openSUSE 11.2 (64-bit):
>>>>> ============
>>>>> Last updated: Wed Aug 18 18:13:33 2010
>>>>> Current DC: nodo1 (nodo1)
>>>>> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
>>>>> 3 Nodes configured.
>>>>> 11 Resources configured.
>>>>> ============
>>>>>
>>>>> Node: nodo1 (nodo1): online
>>>>> Node: nodo3 (nodo3): online
>>>>> Node: nodo4 (nodo4): online
>>>>>
>>>>> Clone Set: dlm-clone
>>>>>     dlm:0       (ocf::pacemaker:controld):      Started nodo3
>>>>>     dlm:1       (ocf::pacemaker:controld):      Started nodo1
>>>>>     dlm:2       (ocf::pacemaker:controld):      Started nodo4
>>>>> Clone Set: o2cb-clone
>>>>>     o2cb:0      (ocf::ocfs2:o2cb):      Started nodo3
>>>>>     o2cb:1      (ocf::ocfs2:o2cb):      Started nodo1
>>>>>     o2cb:2      (ocf::ocfs2:o2cb):      Started nodo4
>>>>> Clone Set: XencfgFS-Clone
>>>>>     XencfgFS:0  (ocf::heartbeat:Filesystem):    Started nodo3
>>>>>     XencfgFS:1  (ocf::heartbeat:Filesystem):    Started nodo1
>>>>>     XencfgFS:2  (ocf::heartbeat:Filesystem):    Started nodo4
>>>>> Clone Set: XenimageFS-Clone
>>>>>     XenimageFS:0        (ocf::heartbeat:Filesystem):    Started nodo3
>>>>>     XenimageFS:1        (ocf::heartbeat:Filesystem):    Started nodo1
>>>>>     XenimageFS:2        (ocf::heartbeat:Filesystem):    Started nodo4
>>>>> rsa1-fencing    (stonith:external/ibmrsa-telnet):       Started nodo4
>>>>> rsa2-fencing    (stonith:external/ibmrsa-telnet):       Started nodo3
>>>>> rsa3-fencing    (stonith:external/ibmrsa-telnet):       Started nodo4
>>>>> rsa4-fencing    (stonith:external/ibmrsa-telnet):       Started nodo3
>>>>> mailsrv-rm      (ocf::heartbeat:Xen):   Started nodo3
>>>>> dbsrv-rm        (ocf::heartbeat:Xen):   Started nodo4
>>>>> websrv-rm       (ocf::heartbeat:Xen):   Started nodo4
>>>>>
>>>>> After a switch failure, all the nodes and the RSA stonith devices
>>>>> were unreachable.
>>>>>
>>>>> The following error occurred on one cluster node:
>>>>>
>>>>> Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored:
>>>>> receive_plocks_stored 1778493632:2 need_plocks 0#012
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] ------------[ cut here
>>>>> ]------------
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
>>>>> /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: 0000 [#1] SMP
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file:
>>>>> /sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in:
>>>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>>>> iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree
>>>>> ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk
>>>>> blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac
>>>>> dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop
>>>>> dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb
>>>>> ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp
>>>>> ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250
>>>>> i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid
>>>>> uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal
>>>>> thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue]
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not
>>>>> tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]-
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[<ffffffff801331c2>]
>>>>> [<ffffffff801331c2>] iput+0x82/0x90
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:ffff88014ec03c30
>>>>> EFLAGS: 00010246
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: 0000000000000000 RBX:
>>>>> ffff880148a703c8 RCX: 0000000000000000
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: ffffc90000010000 RSI:
>>>>> ffff880148a70380 RDI: ffff880148a703c8
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272125] RBP: ffff88014ec03c50 R08:
>>>>> b038000000000000 R09: fe99594c51a57607
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272126] R10: ffff880040410270 R11:
>>>>> 0000000000000000 R12: ffff8801713e6e08
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272128] R13: ffff88014ec03d20 R14:
>>>>> 0000000000000000 R15: ffffc9000331d108
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272133] FS: 00007ff4cb11a730(0000)
>>>>> GS:ffffc90000010000(0000) knlGS:0000000000000000
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272135] CS: e033 DS: 0000 ES: 0000 CR0:
>>>>> 000000008005003b
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272136] CR2: 00007ff4c5c45000 CR3:
>>>>> 0000000135b2a000 CR4: 0000000000002660
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272138] DR0: 0000000000000000 DR1:
>>>>> 0000000000000000 DR2: 0000000000000000
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272140] DR3: 0000000000000000 DR6:
>>>>> 00000000ffff0ff0 DR7: 0000000000000400
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272142] Process dlm_send (pid: 8889,
>>>>> threadinfo ffff88014ec02000, task ffff8801381e45c0)
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272143] Stack:
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272144] 0000000000000000
>>>>> 00000000072f0874 ffff880148a70380 ffff880148a70380
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272146] <0> ffff88014ec03c80
>>>>> ffffffff803add09 ffff88014ec03c80 00000000072f0874
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272147] <0> ffff8801713e6df8
>>>>> ffff8801713e6e08 ffff88014ec03de0 ffffffffa05661e1
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272150] Call Trace:
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272164] [<ffffffff803add09>]
>>>>> sock_release+0x89/0xa0
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272177] [<ffffffffa05661e1>]
>>>>> tcp_connect_to_sock+0x161/0x2b0 [dlm]
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272206] [<ffffffffa0568764>]
>>>>> process_send_sockets+0x34/0x60 [dlm]
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272222] [<ffffffff800693f3>]
>>>>> run_workqueue+0x83/0x230
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272227] [<ffffffff80069654>]
>>>>> worker_thread+0xb4/0x140
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272231] [<ffffffff8006fac6>]
>>>>> kthread+0xb6/0xc0
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272236] [<ffffffff8000d38a>]
>>>>> child_rip+0xa/0x20
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272240] Code: 42 20 48 c7 c2 b0 4c 13
>>>>> 80 48 85 c0 48 0f 44 c2 48 89 df ff d0 48 8b 45 e8 65 48 33 04 25 28 00 00
>>>>> 00 75 0b 48 83 c4 18 5b c9 c3 <0f> 0b eb fe e8 35 c6 f1 ff 0f 1f 44 00 00 55
>>>>> 48 8d 97 10 02 00
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272256] RIP [<ffffffff801331c2>]
>>>>> iput+0x82/0x90
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272259] RSP <ffff88014ec03c30>
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272264] ---[ end trace 7707d0d92a7f5415
>>>>> ]---
>>>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster
>>>>> node
>>>>>
>>>>> and after a few more log lines, the following message repeated until
>>>>> I killed the node:
>>>>>
>>>>> Aug 18 13:12:31 nodo1 cluster-dlm: start_kernel: start_kernel cg 3
>>>>> member_count 1#012
>>>>>
>>>>> Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
>>>>> 1812048064#012
>>>>>
>>>>> Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
>>>>> 1778493632#012
>>>>>
>>>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_configfs_members: set_members rmdir
>>>>> "/sys/kernel/config/dlm/cluster/spaces/0BB443F896254AD3BA8FB960C425B666/nodes/1812048064"#012
>>>>>
>>>>> Aug 18 13:12:31 nodo1 cluster-dlm: do_sysfs: write "1" to
>>>>> "/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control"#012
>>>>>
>>>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>>>> nodeid 1812048064#012
>>>>>
>>>>> [the message above repeated several more times]
>>>>>
>>>>> The log file is attached.
>>>>>
>>>>> Can someone explain the reason?
>>>> Perhaps the membership got out of sync...
>>>>
>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster node
>>>>
>>>> Maybe lmb or dejan can suggest something... I don't have much to do
>>>> with ocfs2 anymore.
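
As an aside, the numeric node ids in the cluster-dlm messages can be mapped back to ring addresses, which helps identify which member the DLM is complaining about. A hedged sketch, assuming corosync's default nodeid derivation (the ring0 IPv4 address read as a 32-bit integer) on a little-endian host; the resulting addresses are inferred from that assumption, not confirmed from this cluster:

```python
import socket
import struct

def nodeid_to_ip(nodeid: int) -> str:
    """Decode a corosync auto-generated nodeid back to a dotted quad.

    Assumes the default scheme where the nodeid is the ring0 IPv4
    address stored as a 32-bit integer in little-endian host order.
    """
    return socket.inet_ntoa(struct.pack("<I", nodeid))

# Node ids taken from the log above:
print(nodeid_to_ip(1812048064))  # -> 192.168.1.108 (inferred)
print(nodeid_to_ip(1778493632))  # -> 192.168.1.106 (inferred)
```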
>>> Me neither. But this looks like a kernel bug:
>>>
>>>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
>>>>> /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
>>> Perhaps ask on the kernel ML?
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
>>
>
> -- 
> Dan FRINCU
> Systems Engineer
> CCNA, RHCE
> Streamwide Romania
>
>
