[ClusterLabs] Automatic restart of Pacemaker after reboot and filesystem unmount problem

Grégory Sacré gregory.sacre at s-clinica.com
Wed Jul 15 04:07:38 EDT 2020


>> Online: [ vps-fs-04 ]
>> 
>> Full list of resources:
>> 
>> stonith_vps-fs (stonith:external/ssh): Started vps-fs-03
> 
> ssh fencing is not suitable for a production environment. ssh cannot fence an unreachable node, and that is exactly when you need to fence a node. There is little point in fencing healthy nodes (except in rare cases).
> 
> ssh may help if the interconnect breaks and you use a different network that stays up, but it cannot cope with total node loss.

For the moment I just want a PoC for the cluster part; proper fencing will be done and tested at a later stage.
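
Just to sketch where I expect that to go: something IPMI-based per node, roughly along the lines below. The addresses and credentials are placeholders (not our actual setup), and the right fence agent ultimately depends on the hardware or hypervisor.

# placeholders below; the correct fence agent depends on the platform
user $ sudo pcs stonith create fence_vps-fs-03 fence_ipmilan ip="<ipmi-addr-03>" username="<user>" password="<pass>" lanplus=1 pcmk_host_list="vps-fs-03"
user $ sudo pcs stonith create fence_vps-fs-04 fence_ipmilan ip="<ipmi-addr-04>" username="<user>" password="<pass>" lanplus=1 pcmk_host_list="vps-fs-04"
user $ sudo pcs constraint location fence_vps-fs-03 avoids vps-fs-03
user $ sudo pcs constraint location fence_vps-fs-04 avoids vps-fs-04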

[...]

>> The problem seems to come from the fact that the mount point (/srv/vps-fs) is busy (probably the CTDB lock file), but what I don't understand is why the server that was not rebooted (vps-fs-03) needs to remount an already mounted file system when the other node comes back online.
> 
> You likely need the interleave=true option on the clones. Otherwise Pacemaker tries to restart all dependent resources everywhere when a clone resource changes state on one node.
> 
> To say more, logs are needed, as well as the ordering constraints for your resources.

That seems to have done it, thanks!
With interleave=true added, the cluster no longer forces a remount after the second node comes back online.
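
For reference, setting it was roughly the following (clone IDs taken from the status output quoted below; adjust to your own IDs):

user $ sudo pcs resource meta dlm-clone interleave=true
user $ sudo pcs resource meta VPSFSClone interleave=true
user $ sudo pcs resource meta VPSFSMount-clone interleave=true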


Kr,

Gregory

-----Original Message-----
From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei Borzenkov
Sent: 14 July 2020 18:41
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] Automatic restart of Pacemaker after reboot and filesystem unmount problem

On 14.07.2020 14:56, Grégory Sacré wrote:
> Dear all,
> 
> 
> I'm pretty new to Pacemaker so I must be missing something but I cannot find it in the documentation.
> 
> I'm setting up a Samba file server cluster with DRBD and Pacemaker. Here are the relevant pcs commands for the mount part:
> 
> user $ sudo pcs cluster cib fs_cfg
> user $ sudo pcs -f fs_cfg resource create VPSFSMount Filesystem device="/dev/drbd1" directory="/srv/vps-fs" fstype="gfs2" "options=acl,noatime"
>   Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')
> 
> It all works fine, here is an extract of the pcs status command:
> 
> user $ sudo pcs status
> Cluster name: vps-fs
> Stack: corosync
> Current DC: vps-fs-04 (version 1.1.18-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 14 11:13:55 2020
> Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04
> 
> 2 nodes configured
> 7 resources configured
> 
> Online: [ vps-fs-03 vps-fs-04 ]
> 
> Full list of resources:
> 
> stonith_vps-fs (stonith:external/ssh): Started vps-fs-04
> Clone Set: dlm-clone [dlm]
>      Started: [ vps-fs-03 vps-fs-04 ]
> Master/Slave Set: VPSFSClone [VPSFS]
>      Masters: [ vps-fs-03 vps-fs-04 ]
> Clone Set: VPSFSMount-clone [VPSFSMount]
>      Started: [ vps-fs-03 vps-fs-04 ]
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> I can start CTDB (the Samba cluster manager) manually and it's fine. However, CTDB shares a lock file between both nodes, which is located on the shared mount point.
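> 
> (Eventually I'd also like the cluster to manage CTDB itself, so that it gets stopped before the mount; I assume something along the lines of the ocf:heartbeat:CTDB agent, e.g.:
> 
> # lock-file path below is only an example
> user $ sudo pcs -f fs_cfg resource create VPSFSCTDB CTDB ctdb_recovery_lock="/srv/vps-fs/ctdb/.ctdb.lock" clone
> user $ sudo pcs -f fs_cfg constraint order VPSFSMount-clone then VPSFSCTDB-clone
> user $ sudo pcs -f fs_cfg constraint colocation add VPSFSCTDB-clone with VPSFSMount-clone
> 
> but that's not in place yet.)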
> 
> The problem starts the moment I reboot one of the servers (vps-fs-04) and Pacemaker (and Corosync) start automatically upon boot (I'm talking about an unexpected reboot, not a maintenance reboot, which I haven't tried yet).
> After the reboot, the server (vps-fs-04) comes back online and rejoins the cluster, but the one that wasn't rebooted has an issue with the mount resource:
> 
> user $ sudo pcs status
> Cluster name: vps-fs
> Stack: corosync
> Current DC: vps-fs-03 (version 1.1.18-2b07d5c5a9) - partition with quorum
> Last updated: Tue Jul 14 11:33:44 2020
> Last change: Tue Jul 14 10:31:36 2020 by root via cibadmin on vps-fs-04
> 
> 2 nodes configured
> 7 resources configured
> 
> Node vps-fs-03: UNCLEAN (online)

Your node was not fenced. I wonder how Pacemaker should handle this situation.

> Online: [ vps-fs-04 ]
> 
> Full list of resources:
> 
> stonith_vps-fs (stonith:external/ssh): Started vps-fs-03

ssh fencing is not suitable for a production environment. ssh cannot fence an unreachable node, and that is exactly when you need to fence a node. There is little point in fencing healthy nodes (except in rare cases).

ssh may help if the interconnect breaks and you use a different network that stays up, but it cannot cope with total node loss.

> Clone Set: dlm-clone [dlm]
>      Started: [ vps-fs-03 vps-fs-04 ]
> Master/Slave Set: VPSFSClone [VPSFS]
>      Masters: [ vps-fs-03 ]
>      Slaves: [ vps-fs-04 ]
> Clone Set: VPSFSMount-clone [VPSFSMount]
>      VPSFSMount (ocf::heartbeat:Filesystem):    FAILED vps-fs-03
>      Stopped: [ vps-fs-04 ]
> 
> Failed Actions:
> * VPSFSMount_stop_0 on vps-fs-03 'unknown error' (1): call=65, status=Timed Out, exitreason='Couldn't unmount /srv/vps-fs; trying cleanup with KILL',
>     last-rc-change='Tue Jul 14 11:23:46 2020', queued=0ms, exec=60011ms
> 
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> The problem seems to come from the fact that the mount point (/srv/vps-fs) is busy (probably the CTDB lock file), but what I don't understand is why the server that was not rebooted (vps-fs-03) needs to remount an already mounted file system when the other node comes back online.

You likely need the interleave=true option on the clones. Otherwise Pacemaker tries to restart all dependent resources everywhere when a clone resource changes state on one node.

To say more, logs are needed, as well as the ordering constraints for your resources.
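
For example, something like the following would show the constraints and collect the logs (the time window and output path here are just examples):

user $ sudo pcs constraint --full
user $ sudo crm_report --from "2020-07-14 11:00:00" /tmp/vps-fs-report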

> 
> I've checked the 'ocf:heartbeat:Filesystem' documentation but nothing seemed to help. The only thing I did was to change the following:
> 
> user $ sudo pcs resource update VPSFSMount fast_stop="no" op monitor timeout="60"
> 
> However, this didn't help, and Google doesn't give me much either (but maybe I'm not searching for the right thing).
> 
> Thank you in advance for any pointer!
> 
> 
> Kr,
> 
> Gregory
> 

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

