[ClusterLabs] GFS2 problem after host name change

Sun Jan 14 15:29:40 EST 2018

I recently changed the host name of a cluster. It may or may not be
related, but afterwards I noticed that I can cleanly start gfs2 when the
node boots. However, if the node is withdrawn and I then try to rejoin it
without a reboot, it hangs with this in syslog:

====
Jan 14 12:21:34 kp-a10n01 kernel: Pid: 22580, comm: kslowd000 Not tainted 2.6.32-696.18.7.el6.x86_64 #1
Jan 14 12:21:34 kp-a10n01 kernel: Call Trace:
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa0714308>] ? gfs2_lm_withdraw+0x128/0x160 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa071451d>] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b466>] ? find_good_lh+0x76/0x90 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b509>] ? gfs2_find_jhead+0x89/0x170 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8107e0ee>] ? vprintk_default+0xe/0x10
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b6ee>] ? gfs2_recover_work+0xfe/0x790 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8106b73e>] ? perf_event_task_sched_out+0x2e/0x70
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8100968b>] ? __switch_to+0x6b/0x320
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8154b728>] ? schedule+0x458/0xc50
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81063883>] ? __wake_up+0x53/0x70
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121be3>] ? slow_work_execute+0x233/0x310
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121e17>] ? slow_work_thread+0x157/0x360
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a71a0>] ? autoremove_wake_function+0x0/0x40
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121cc0>] ? slow_work_thread+0x0/0x360
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6d0e>] ? kthread+0x9e/0xc0
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557afa>] ? child_rip+0xa/0x20
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6c70>] ? kthread+0x0/0xc0
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557af0>] ? child_rip+0x0/0x20
Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: jid=0: Failed
Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: jid=1: Trying to acquire journal lock...
Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: can't read in statfs inode: -5
Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not found shared
Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not found shared
Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not found shared
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,19 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,19 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,805b err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,18 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,17 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,16 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,0 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,18 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 9,0 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,1 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 4,0 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,17 err=-22
====
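
For anyone reading the log, the negative numbers in it look like standard
kernel errno values. Below is a minimal sketch for decoding them, assuming
standard Linux errno numbering. The helper name decode_errors and the
regular expression are only illustrative and are not part of any GFS2 or
cluster tooling; the sample lines are copied from the log above.

```python
import errno
import os
import re

# Matches the negative return codes seen in the log above,
# e.g. "err=-22" or "statfs inode: -5". (Illustrative pattern only.)
ERR_PATTERN = re.compile(r"(?:err=|inode: )-(\d+)")


def decode_errors(lines):
    """Return {code: (symbolic name, description)} for each code found."""
    found = {}
    for line in lines:
        for match in ERR_PATTERN.finditer(line):
            code = int(match.group(1))
            name = errno.errorcode.get(code, "UNKNOWN")
            found[code] = (name, os.strerror(code))
    return found


# Two lines taken from the syslog excerpt.
sample = [
    "kernel: GFS2: fsid=kp-anvil-10:shared.0: can't read in statfs inode: -5",
    "kernel: gdlm_unlock 2,19 err=-22",
]

for code, (name, text) in sorted(decode_errors(sample).items()):
    print(f"-{code}: {name} ({text})")
```

On Linux this reports -5 as EIO and -22 as EINVAL, so the statfs inode
read failed with an I/O error and each gdlm_unlock call was rejected as
an invalid argument.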

I have to fence the node to get the system back up. It happens on either
node, and it happens regardless of whether the peer node is connected.

The setup is GFS2 on top of clvmd on an RHCS cluster on RHEL 6. Would
configs help?

digimer

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould