[ClusterLabs] Antw: Re: GFS2 problem after host name change

Digimer lists at alteeve.ca
Mon Jan 15 10:31:58 EST 2018


On a fresh boot, fsck.gfs2 found no errors on either node.
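
For reference, that check amounts to the usual offline fsck, run with the
filesystem unmounted on both nodes; the LV path and mount point below are
illustrative, not the real ones:

====
# Unmount the GFS2 filesystem on every node first (mount point illustrative)
umount /shared

# Then run the check from a single node; -n does a read-only pass, -y would
# also apply any repairs (device path illustrative)
fsck.gfs2 -n /dev/vg_storage/lv_shared
====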

On 2018-01-15 01:03 AM, Ulrich Windl wrote:
> I'd deal with "fatal: filesystem consistency error" first.
> 
> 
>>>> Digimer <lists at alteeve.ca> wrote on 14.01.2018 at 21:48 in message
> <6a036895-8964-ca76-3774-4b7e9bcf5601 at alteeve.ca>:
>> On 2018-01-14 12:29 PM, Digimer wrote:
>>> I recently changed the host name of a cluster. It may or may not be
>>> related, but afterwards I noticed that I can start GFS2 cleanly when the
>>> node boots. However, if the node is withdrawn and I then try to rejoin it
>>> without a reboot, it hangs with this in syslog:
>>>
>>> ====
>>> Jan 14 12:21:34 kp-a10n01 kernel: Pid: 22580, comm: kslowd000 Not
>>> tainted 2.6.32-696.18.7.el6.x86_64 #1
>>> Jan 14 12:21:34 kp-a10n01 kernel: Call Trace:
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa0714308>] ?
>>> gfs2_lm_withdraw+0x128/0x160 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa071451d>] ?
>>> gfs2_consist_inode_i+0x5d/0x60 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b466>] ?
>>> find_good_lh+0x76/0x90 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b509>] ?
>>> gfs2_find_jhead+0x89/0x170 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8107e0ee>] ?
>>> vprintk_default+0xe/0x10
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b6ee>] ?
>>> gfs2_recover_work+0xfe/0x790 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8106b73e>] ?
>>> perf_event_task_sched_out+0x2e/0x70
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8100968b>] ?
>>> __switch_to+0x6b/0x320
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8154b728>] ?
>>> schedule+0x458/0xc50
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81063883>] ?
>>> __wake_up+0x53/0x70
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121be3>] ?
>>> slow_work_execute+0x233/0x310
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121e17>] ?
>>> slow_work_thread+0x157/0x360
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a71a0>] ?
>>> autoremove_wake_function+0x0/0x40
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121cc0>] ?
>>> slow_work_thread+0x0/0x360
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6d0e>] ? kthread+0x9e/0xc0
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557afa>] ?
>>> child_rip+0xa/0x20
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6c70>] ? kthread+0x0/0xc0
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557af0>] ?
>>> child_rip+0x0/0x20
>>> Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>>> jid=0: Failed
>>> Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>>> jid=1: Trying to acquire journal lock...
>>> Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: can't
>>> read in statfs inode: -5
>>> Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not
>>> found shared
>>> Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not
>>> found shared
>>> Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not
>>> found shared
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,19 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,19 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,805b err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,18 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,17 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,16 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,0 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,18 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 9,0 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,1 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 4,0 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,17 err=-22
>>> ====
>>>
>>> I have to fence the node to get the system back up. It happens on either
>>> node, and it happens regardless of whether the peer node is connected.
>>>
>>> GFS2 on top of clvmd in an RHCS cluster on RHEL 6. Would configs help?
>>>
>>> digimer
>>>
>>
>> Happened again (well, many times); here's the log output from another
>> hang:
>>
>> ====
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2 (built Jan  4 2018 17:32:36)
>> installed
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=: Trying to join cluster
>> "lock_dlm", "kp-anvil-10:shared"
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> Joined cluster. Now mounting FS...
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=0, already locked for use
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=0: Looking at journal...
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> fatal: filesystem consistency error
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> inode = 4 25
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> function = find_good_lh, file = fs/gfs2/recovery.c, line = 205
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: about
>> to withdraw this file system
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> telling LM to unmount
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> withdrawn
>> Jan 14 12:46:42 kp-a10n01 kernel: Pid: 11668, comm: kslowd000 Not
>> tainted 2.6.32-696.18.7.el6.x86_64 #1
>> Jan 14 12:46:42 kp-a10n01 kernel: Call Trace:
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06fb308>] ?
>> gfs2_lm_withdraw+0x128/0x160 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06fb51d>] ?
>> gfs2_consist_inode_i+0x5d/0x60 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06f2466>] ?
>> find_good_lh+0x76/0x90 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06f2509>] ?
>> gfs2_find_jhead+0x89/0x170 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8107e0ee>] ?
>> vprintk_default+0xe/0x10
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06f26ee>] ?
>> gfs2_recover_work+0xfe/0x790 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8106b73e>] ?
>> perf_event_task_sched_out+0x2e/0x70
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81074a83>] ?
>> dequeue_entity+0x113/0x2e0
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8100968b>] ?
>> __switch_to+0x6b/0x320
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8154b728>] ?
>> schedule+0x458/0xc50
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8107543b>] ?
>> enqueue_task_fair+0xfb/0x100
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81121be3>] ?
>> slow_work_execute+0x233/0x310
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81121e17>] ?
>> slow_work_thread+0x157/0x360
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff810a71a0>] ?
>> autoremove_wake_function+0x0/0x40
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81121cc0>] ?
>> slow_work_thread+0x0/0x360
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff810a6d0e>] ? kthread+0x9e/0xc0
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81557afa>] ? child_rip+0xa/0x20
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff810a6c70>] ? kthread+0x0/0xc0
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81557af0>] ? child_rip+0x0/0x20
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=0: Failed
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=1: Trying to acquire journal lock...
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: can't
>> read in statfs inode: -5
>> Jan 14 12:46:42 kp-a10n01 gfs_controld[5371]: recovery_uevent mg not
>> found shared
>> Jan 14 12:46:42 kp-a10n01 gfs_controld[5371]: recovery_uevent mg not
>> found shared
>> Jan 14 12:46:42 kp-a10n01 gfs_controld[5371]: recovery_uevent mg not
>> found shared
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 2,19 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,19 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,805b err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,18 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,17 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,16 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 1,0 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 4,0 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 2,17 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 2,18 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 9,0 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 1,1 err=-22
>> ====
>>
>>
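
As an aside, the fsid in those messages ("kp-anvil-10:shared") is the on-disk
lock table, i.e. cluster_name:fs_name, and it has to match the cluster name
cman is running with. Assuming only host names changed (not the cluster name
in cluster.conf), the two should still agree, but it is cheap to confirm; the
device path below is illustrative:

====
# Cluster name the running cluster is using (look for "Cluster Name:")
cman_tool status | grep -i "cluster name"

# Lock table recorded in the GFS2 superblock (device path illustrative)
gfs2_tool sb /dev/vg_storage/lv_shared table
====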


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould



