[ClusterLabs] Automatic restart of Pacemaker after reboot and filesystem unmount problem

Andrei Borzenkov arvidjaar at gmail.com
Tue Jul 14 12:41:14 EDT 2020


14.07.2020 14:56, Grégory Sacré wrote:
> Dear all,
> 
> 
> I'm pretty new to Pacemaker, so I must be missing something, but I cannot find it in the documentation.
> 
> I'm setting up a SAMBA File Server cluster with DRBD and Pacemaker. Here are the relevant pcs commands related to the mount part:
> 
> user $ sudo pcs cluster cib fs_cfg
> user $ sudo pcs -f fs_cfg resource create VPSFSMount Filesystem device="/dev/drbd1" directory="/srv/vps-fs" fstype="gfs2" "options=acl,noatime"
>   Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')
> 
> It all works fine, here is an extract of the pcs status command:
> 
> user $ sudo pcs status
> Cluster name: vps-fs
> Stack: corosync
> Current DC: vps-fs-04 (version 1.1.18-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 14 11:13:55 2020
> Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04
> 
> 2 nodes configured
> 7 resources configured
> 
> Online: [ vps-fs-03 vps-fs-04 ]
> 
> Full list of resources:
> 
> stonith_vps-fs (stonith:external/ssh): Started vps-fs-04
> Clone Set: dlm-clone [dlm]
>      Started: [ vps-fs-03 vps-fs-04 ]
> Master/Slave Set: VPSFSClone [VPSFS]
>      Masters: [ vps-fs-03 vps-fs-04 ]
> Clone Set: VPSFSMount-clone [VPSFSMount]
>      Started: [ vps-fs-03 vps-fs-04 ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> I can start CTDB (the SAMBA cluster manager) manually and it works fine. However, CTDB uses a lock file shared between both nodes, which is located on the shared mount point.
> 
> The problem arises the moment I reboot one of the servers (vps-fs-04) and Pacemaker (and Corosync) start automatically upon boot (I'm talking about an unexpected reboot, not a maintenance reboot, which I haven't tried yet).
> After the reboot, the server (vps-fs-04) comes back online and rejoins the cluster, but the one that wasn't rebooted has an issue with the mount resource:
> 
> user $ sudo pcs status
> Cluster name: vps-fs
> Stack: corosync
> Current DC: vps-fs-03 (version 1.1.18-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 14 11:33:44 2020
> Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04
> 
> 2 nodes configured
> 7 resources configured
> 
> Node vps-fs-03: UNCLEAN (online)

Your node was not fenced. I wonder how pacemaker should handle this
situation.

> Online: [ vps-fs-04 ]
> 
> Full list of resources:
> 
> stonith_vps-fs (stonith:external/ssh): Started vps-fs-03

ssh fencing is not suitable for a production environment. ssh cannot
fence an unreachable node, and that is exactly when you need to fence
a node. There is little point in fencing healthy nodes (except in rare
cases).

ssh may help in case the interconnect breaks while a separate network
you use for ssh stays up, but it cannot cope with total node loss.
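
A power- or BMC-based agent is the usual choice for production. As a
rough sketch only (assuming your nodes have IPMI interfaces; the
addresses and credentials below are placeholders, one such stonith
resource per node):

user $ sudo pcs stonith create fence_vps-fs-03 fence_ipmilan pcmk_host_list="vps-fs-03" ipaddr="<bmc-ip-of-vps-fs-03>" login="<ipmi-user>" passwd="<ipmi-password>" lanplus="1"

That way the cluster can actually power off a node that has stopped
responding, instead of relying on being able to reach it over ssh.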

> Clone Set: dlm-clone [dlm]
>      Started: [ vps-fs-03 vps-fs-04 ]
> Master/Slave Set: VPSFSClone [VPSFS]
>      Masters: [ vps-fs-03 ]
>      Slaves: [ vps-fs-04 ]
> Clone Set: VPSFSMount-clone [VPSFSMount]
>      VPSFSMount (ocf::heartbeat:Filesystem):    FAILED vps-fs-03
>      Stopped: [ vps-fs-04 ]
> 
> Failed Actions:
> * VPSFSMount_stop_0 on vps-fs-03 'unknown error' (1): call=65, status=Timed Out, exitreason='Couldn't unmount /srv/vps-fs; trying cleanup with KILL',
>     last-rc-change='Tue Jul 14 11:23:46 2020', queued=0ms, exec=60011ms
> 
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> The problem seems to come from the fact that the mount point (/srv/vps-fs) is busy (probably the CTDB lock file), but what I don't understand is why the server that was not rebooted (vps-fs-03) needs to remount an already mounted file system when the other node comes back online.

You likely need the interleave=true option on your clones. Otherwise
pacemaker tries to restart all dependent resources everywhere when a
clone resource changes state on one node.
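
For example (a sketch only, using the clone names from your status
output; untested):

user $ sudo pcs resource meta dlm-clone interleave=true
user $ sudo pcs resource meta VPSFSClone interleave=true
user $ sudo pcs resource meta VPSFSMount-clone interleave=true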

To say more, logs are needed, as well as the ordering constraints for
your resources.
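
You can dump the constraints with:

user $ sudo pcs constraint show --full

For reference, a DRBD + GFS2 stack is typically ordered along these
lines (a sketch only; your ids and constraints may differ):

user $ sudo pcs constraint order start dlm-clone then VPSFSMount-clone
user $ sudo pcs constraint order promote VPSFSClone then start VPSFSMount-clone
user $ sudo pcs constraint colocation add VPSFSMount-clone with master VPSFSClone INFINITY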

> 
> I've checked the 'ocf:heartbeat:Filesystem' documentation but nothing seemed to help. The only thing I did was to change the following:
> 
> user $ sudo pcs resource update VPSFSMount fast_stop="no" op monitor timeout="60"
> 
> However, this didn't help. Google doesn't give me much help either (but maybe I'm not searching for the right thing).
> 
> Thank you in advance for any pointer!
> 
> 
> Kr,
> 
> Gregory


