[ClusterLabs] Automatic restart of Pacemaker after reboot and filesystem unmount problem

Grégory Sacré gregory.sacre at s-clinica.com
Tue Jul 14 07:56:17 EDT 2020


Dear all,


I'm pretty new to Pacemaker, so I must be missing something, but I cannot find it in the documentation.

I'm setting up a Samba file server cluster with DRBD and Pacemaker. Here are the relevant pcs commands for the mount part:

user $ sudo pcs cluster cib fs_cfg
user $ sudo pcs -f fs_cfg resource create VPSFSMount Filesystem device="/dev/drbd1" directory="/srv/vps-fs" fstype="gfs2" "options=acl,noatime"
  Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')
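
The clone, constraint, and cib-push steps aren't shown above; roughly, what I have is the following (reconstructed from memory, so the exact options and scores may differ):

user $ sudo pcs -f fs_cfg resource clone VPSFSMount
user $ sudo pcs -f fs_cfg constraint order start dlm-clone then VPSFSMount-clone
user $ sudo pcs -f fs_cfg constraint order promote VPSFSClone then start VPSFSMount-clone
user $ sudo pcs -f fs_cfg constraint colocation add VPSFSMount-clone with VPSFSClone INFINITY with-rsc-role=Master
user $ sudo pcs cluster cib-push fs_cfg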

It all works fine; here is an extract of the pcs status output:

user $ sudo pcs status
Cluster name: vps-fs
Stack: corosync
Current DC: vps-fs-04 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 14 11:13:55 2020
Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04

2 nodes configured
7 resources configured

Online: [ vps-fs-03 vps-fs-04 ]

Full list of resources:

stonith_vps-fs (stonith:external/ssh): Started vps-fs-04
Clone Set: dlm-clone [dlm]
     Started: [ vps-fs-03 vps-fs-04 ]
Master/Slave Set: VPSFSClone [VPSFS]
     Masters: [ vps-fs-03 vps-fs-04 ]
Clone Set: VPSFSMount-clone [VPSFSMount]
     Started: [ vps-fs-03 vps-fs-04 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I can start CTDB (the Samba cluster manager) manually and it's fine. However, CTDB shares a lock file between both nodes, which is located on the shared mount point.
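
For reference, I'm currently starting CTDB by hand, outside of Pacemaker. If I were to put it under cluster control, I imagine it would look something like the commands below, so that CTDB is stopped before the filesystem is unmounted (the lock path here is just a placeholder, not my actual value):

user $ sudo pcs resource create CTDBClu ocf:heartbeat:CTDB ctdb_recovery_lock="/srv/vps-fs/ctdb/.ctdb.lock" clone
user $ sudo pcs constraint order start VPSFSMount-clone then CTDBClu-clone
user $ sudo pcs constraint colocation add CTDBClu-clone with VPSFSMount-clone INFINITY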

The problem appears the moment I reboot one of the servers (vps-fs-04) and Pacemaker (and Corosync) start automatically at boot. I'm talking about an unexpected reboot, not a maintenance reboot, which I haven't tried yet.
After the reboot, the server (vps-fs-04) comes back online and rejoins the cluster, but the node that wasn't rebooted (vps-fs-03) has an issue with the mount resource:

user $ sudo pcs status
Cluster name: vps-fs
Stack: corosync
Current DC: vps-fs-03 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 14 11:33:44 2020
Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04

2 nodes configured
7 resources configured

Node vps-fs-03: UNCLEAN (online)
Online: [ vps-fs-04 ]

Full list of resources:

stonith_vps-fs (stonith:external/ssh): Started vps-fs-03
Clone Set: dlm-clone [dlm]
     Started: [ vps-fs-03 vps-fs-04 ]
Master/Slave Set: VPSFSClone [VPSFS]
     Masters: [ vps-fs-03 ]
     Slaves: [ vps-fs-04 ]
Clone Set: VPSFSMount-clone [VPSFSMount]
     VPSFSMount (ocf::heartbeat:Filesystem):    FAILED vps-fs-03
     Stopped: [ vps-fs-04 ]

Failed Actions:
* VPSFSMount_stop_0 on vps-fs-03 'unknown error' (1): call=65, status=Timed Out, exitreason='Couldn't unmount /srv/vps-fs; trying cleanup with KILL',
    last-rc-change='Tue Jul 14 11:23:46 2020', queued=0ms, exec=60011ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

The problem seems to come from the fact that the mount point (/srv/vps-fs) is busy (probably the CTDB lock file), but what I don't understand is why the node that was not rebooted (vps-fs-03) needs to remount an already mounted filesystem when the other node comes back online.
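
One thing I'm wondering about, and this is only a guess on my part: as far as I understand, when ordered clones are not interleaved, an instance stopping or starting on one node can force the dependent clone instances on the other node to restart as well. If that is what's happening here, something like the following (untested) might be what I'm missing:

user $ sudo pcs resource meta dlm-clone interleave=true
user $ sudo pcs resource meta VPSFSClone interleave=true
user $ sudo pcs resource meta VPSFSMount-clone interleave=true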

I've checked the 'ocf:heartbeat:Filesystem' documentation, but nothing there seemed to help. The only thing I changed was the following:

user $ sudo pcs resource update VPSFSMount fast_stop="no" op monitor timeout="60"

However, this didn't help. Google doesn't give me much either (but maybe I'm not searching for the right thing).
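
The only other thing I can think of trying, which I haven't tested yet, is giving the stop operation more time so the unmount (and the KILL cleanup) has a chance to finish:

user $ sudo pcs resource update VPSFSMount op stop timeout=120s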

Thank you in advance for any pointer!


Kr,

Gregory