[ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)

ArekW arkaduis at gmail.com
Wed Jul 12 05:06:47 EDT 2017


Hi,
The problem was due to a bad stonith configuration. The configuration quoted
below is otherwise an example of a working Active/Active NFS setup.
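
In case it helps anyone searching the archives: the fix was on the fencing
side, not in the resource layout. A fence_vbox definition of the kind that
has to be correct for the cluster to recover from a hard node failure looks
roughly like this (the VirtualBox host address, SSH user, key path and VM
names are placeholders, not my real values; "pcs stonith describe fence_vbox"
lists the exact parameters):

# pcs stonith create vbox-fencing fence_vbox \
      ipaddr=<address of the VirtualBox host> \
      login=<ssh user on that host> \
      identity_file=/root/.ssh/id_rsa \
      pcmk_host_map="nfsnode1:<VM name of node1>;nfsnode2:<VM name of node2>"

<verify fencing: the targeted VM should really get power-cycled>
# pcs stonith fence nfsnode2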

Regards,
Arek

2017-07-10 12:59 GMT+02:00 ArekW <arkaduis at gmail.com>:

> Hi,
> I've created a 2-node active-active HA cluster with an NFS resource. The
> resources are active on both nodes. The cluster passes a failover test with
> the pcs standby command, but does not survive a "real" node shutdown.
>
> Test scenario with cluster standby (the exact commands are sketched right
> after the list):
> - start cluster
> - mount nfs share on client1
> - start copy file from client1 to nfs share
> - during the copy, put node1/node2 into standby mode (pcs cluster standby
> nfsnode2)
> - the copy continues
> - unstandby node1/node2
> - the copy continues and the storage re-syncs (DRBD)
> - the copy finishes with no errors
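>
> For reference, the standby part of that test boils down to roughly these
> commands (share path and node names as used in this setup):
>
> # mount -t nfs4 10.0.2.7:/ /mnt/nfsshare        <on client1>
> # rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/
> # pcs cluster standby nfsnode2                  <on any cluster node>
> # pcs cluster unstandby nfsnode2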
>
> I can put the nodes into and out of standby many times and it keeps working.
> The problem begins when I do a "true" failover test by hard-shutting down one
> of the nodes. Test results:
> - start cluster
> - mount nfs share on client1
> - start copy file from client1 to nfs share
> - during the copy, shut down node2 by stopping the node's virtual machine
> (hard stop)
> - the system hangs:
>
> <Start copying a file on client1>
> # rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/
>
> <everything works OK; there is a temp file .testfile.dat.9780fH>
>
> [root at nfsnode1 nfs]# ls -lah
> total 9,8M
> drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
> drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
> -rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
> -rw-r----- 1 root root    0 07-10 11:07 .rmtab
> -rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH
>
> [root at nfsnode1 nfs]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:07:29 2017          Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing   (stonith:fence_vbox):   Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started nfsnode2
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Started: [ nfsnode1 nfsnode2 ]
>
> <Hard poweroff vm machine: nfsnode2>
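> (With VirtualBox this kind of hard failure can be simulated from the
> hypervisor host with something like "VBoxManage controlvm nfsnode2 poweroff";
> the argument is the VM's registered name, which is not necessarily the same
> as the cluster node name.)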
>
> [root at nfsnode1 nfs]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:07:43 2017          Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Node nfsnode2: UNCLEAN (offline)
> Online: [ nfsnode1 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Storage    (ocf::linbit:drbd):     Master nfsnode2 (UNCLEAN)
>      Masters: [ nfsnode1 ]
>  Clone Set: dlm-clone [dlm]
>      dlm        (ocf::pacemaker:controld):      Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>  vbox-fencing   (stonith:fence_vbox):   Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started nfsnode2
> (UNCLEAN)
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      StorageFS  (ocf::heartbeat:Filesystem):    Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>  Clone Set: WebSite-clone [WebSite]
>      WebSite    (ocf::heartbeat:apache):        Started nfsnode2 (UNCLEAN)
>      Started: [ nfsnode1 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Resource Group: nfs-group:1
>          nfs    (ocf::heartbeat:nfsserver):     Started nfsnode2 (UNCLEAN)
>          nfs-export     (ocf::heartbeat:exportfs):      Started nfsnode2
> (UNCLEAN)
>      Started: [ nfsnode1 ]
>
> <ssh console hangs on client1>
> [root at nfsnode1 nfs]# ls -lah
> <nothing happens>
>
> <drbd status is ok in this situation>
> [root at nfsnode1 ~]# drbdadm status
> storage role:Primary
>   disk:UpToDate
>   nfsnode2 connection:Connecting
>
> <the nfs export is still active on node1>
> [root at nfsnode1 ~]# exportfs
> /mnt/drbd/nfs   10.0.2.0/255.255.255.0
>
> <After ssh'ing to client1, the NFS mount is not accessible>
> login as: root
> root at 127.0.0.1's password:
> Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2
> # cd /mnt/
> # ls
> <console hangs>
>
> # mount
> 10.0.2.7:/ on /mnt/nfsshare type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.20,local_lock=none,addr=10.0.2.7)
>
> <Power on vm machine nfsnode2>
> <After nfsnode2 boots, the console on nfsnode1 starts responding again, but
> the copy does not proceed>
> <The temp file is visible but not active>
> [root at nfsnode1 ~]# ls -lah
> total 9,8M
> drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
> drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
> -rw-r--r-- 1 root root    9 07-10 08:20 client1.txt
> -rw-r----- 1 root root    0 07-10 11:16 .rmtab
> -rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH
>
> <Copying at client1 hangs>
>
> <Cluster status:>
> [root at nfsnode1 ~]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:17:19 2017          Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing   (stonith:fence_vbox):   Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Stopped
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      Started: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Started: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Resource Group: nfs-group:0
>          nfs    (ocf::heartbeat:nfsserver):     Started nfsnode1
>          nfs-export     (ocf::heartbeat:exportfs):      FAILED nfsnode1
>      Stopped: [ nfsnode2 ]
>
> Failed Actions:
> * nfs-export_monitor_30000 on nfsnode1 'unknown error' (1): call=61,
> status=Timed Out, exitreason='none',
>     last-rc-change='Mon Jul 10 11:11:50 2017', queued=0ms, exec=0ms
> * vbox-fencing_monitor_60000 on nfsnode1 'unknown error' (1): call=22,
> status=Error, exitreason='none',
>     last-rc-change='Mon Jul 10 11:06:41 2017', queued=0ms, exec=11988ms
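>
> (The vbox-fencing monitor failure above suggests the fence device itself was
> not working. A quick way to check fencing outside of a real failure is
> something like "pcs stonith fence nfsnode2" or "stonith_admin --reboot
> nfsnode2", verifying that the nfsnode2 VM really gets power-cycled.)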
>
> <Try to cleanup>
>
> # pcs resource cleanup
> # pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:20:38 2017          Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Masters: [ nfsnode1 ]
>      Stopped: [ nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing   (stonith:fence_vbox):   Stopped
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Stopped
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Stopped
>  Clone Set: StorageFS-clone [StorageFS]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Stopped: [ nfsnode1 nfsnode2 ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> <Reboot of both nfsnode1 and nfsnode2>
> <After reboot:>
>
> # pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:24:10 2017          Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Slaves: [ nfsnode2 ]
>      Stopped: [ nfsnode1 ]
>  Clone Set: dlm-clone [dlm]
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing   (stonith:fence_vbox):   Stopped
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Stopped
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Stopped
>  Clone Set: StorageFS-clone [StorageFS]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      Stopped: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Stopped: [ nfsnode1 nfsnode2 ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> <Eventually the cluster was recovered after:>
> pcs cluster stop --all
> <Solve drbd split-brain>
> pcs cluster start --all
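>
> (The usual manual DRBD split-brain recovery, with the resource named
> "storage" as here, is roughly: on the node whose changes are to be discarded
> run "drbdadm disconnect storage", "drbdadm secondary storage" and
> "drbdadm connect --discard-my-data storage", then "drbdadm connect storage"
> on the surviving node if it is not already waiting for the connection.)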
>
> client1 could not be rebooted with 'reboot', due to the hung mount (I
> presume). It had to be hard-rebooted via the VirtualBox hypervisor.
> What's wrong with this configuration? I can send the CIB configuration if
> necessary.
>
> ---------------
> Full cluster configuration (working state):
>
> # pcs status --full
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (1) (version 1.1.15-11.el7_3.5-e174ec8) - partition
> with quorum
> Last updated: Mon Jul 10 12:44:03 2017          Last change: Mon Jul 10
> 11:37:13 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 (1) nfsnode2 (2) ]
>
> Full list of resources:
>
>  Master/Slave Set: StorageClone [Storage]
>      Storage    (ocf::linbit:drbd):     Master nfsnode1
>      Storage    (ocf::linbit:drbd):     Master nfsnode2
>      Masters: [ nfsnode1 nfsnode2 ]
>  Clone Set: dlm-clone [dlm]
>      dlm        (ocf::pacemaker:controld):      Started nfsnode1
>      dlm        (ocf::pacemaker:controld):      Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>  vbox-fencing   (stonith:fence_vbox):   Started nfsnode1
>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
>      ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started nfsnode2
>      ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started nfsnode1
>  Clone Set: StorageFS-clone [StorageFS]
>      StorageFS  (ocf::heartbeat:Filesystem):    Started nfsnode1
>      StorageFS  (ocf::heartbeat:Filesystem):    Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: WebSite-clone [WebSite]
>      WebSite    (ocf::heartbeat:apache):        Started nfsnode1
>      WebSite    (ocf::heartbeat:apache):        Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>  Clone Set: nfs-group-clone [nfs-group]
>      Resource Group: nfs-group:0
>          nfs    (ocf::heartbeat:nfsserver):     Started nfsnode1
>          nfs-export     (ocf::heartbeat:exportfs):      Started nfsnode1
>      Resource Group: nfs-group:1
>          nfs    (ocf::heartbeat:nfsserver):     Started nfsnode2
>          nfs-export     (ocf::heartbeat:exportfs):      Started nfsnode2
>      Started: [ nfsnode1 nfsnode2 ]
>
> Node Attributes:
> * Node nfsnode1 (1):
>     + master-Storage                    : 10000
> * Node nfsnode2 (2):
>     + master-Storage                    : 10000
>
> Migration Summary:
> * Node nfsnode1 (1):
> * Node nfsnode2 (2):
>
> PCSD Status:
>   nfsnode1: Online
>   nfsnode2: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> # pcs resource --full
>  Master: StorageClone
>   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2
> clone-node-max=1
>   Resource: Storage (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=storage
>    Operations: start interval=0s timeout=240 (Storage-start-interval-0s)
>                promote interval=0s timeout=90 (Storage-promote-interval-0s)
>                demote interval=0s timeout=90 (Storage-demote-interval-0s)
>                stop interval=0s timeout=100 (Storage-stop-interval-0s)
>                monitor interval=60s (Storage-monitor-interval-60s)
>  Clone: dlm-clone
>   Meta Attrs: clone-max=2 clone-node-max=1
>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>    Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
>                stop interval=0s timeout=100 (dlm-stop-interval-0s)
>                monitor interval=60s (dlm-monitor-interval-60s)
>  Clone: ClusterIP-clone
>   Meta Attrs: clona-node-max=2 clone-max=2 globally-unique=true
> clone-node-max=2
>   Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip
>    Meta Attrs: resource-stickiness=0
>    Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>                stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>                monitor interval=5s (ClusterIP-monitor-interval-5s)
>  Clone: StorageFS-clone
>   Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
>    Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
>                stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
>                monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
>  Clone: WebSite-clone
>   Resource: WebSite (class=ocf provider=heartbeat type=apache)
>    Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
>    Operations: start interval=0s timeout=40s (WebSite-start-interval-0s)
>                stop interval=0s timeout=60s (WebSite-stop-interval-0s)
>                monitor interval=1min (WebSite-monitor-interval-1min)
>  Clone: nfs-group-clone
>   Meta Attrs: interleave=true
>   Group: nfs-group
>    Resource: nfs (class=ocf provider=heartbeat type=nfsserver)
>     Attributes: nfs_ip=10.0.2.7 nfs_no_notify=true
>     Operations: start interval=0s timeout=40 (nfs-start-interval-0s)
>                 stop interval=0s timeout=20s (nfs-stop-interval-0s)
>                 monitor interval=30s (nfs-monitor-interval-30s)
>    Resource: nfs-export (class=ocf provider=heartbeat type=exportfs)
>     Attributes: clientspec=10.0.2.0/255.255.255.0
> options=rw,sync,no_root_squash directory=/mnt/drbd/nfs fsid=0
>     Operations: start interval=0s timeout=40 (nfs-export-start-interval-0s)
>                 stop interval=0s timeout=120 (nfs-export-stop-interval-0s)
>                 monitor interval=30s (nfs-export-monitor-interval-30s)
>
> # pcs constraint --full
> Location Constraints:
> Ordering Constraints:
>   start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
> (id:order-ClusterIP-WebSite-mandatory)
>   promote StorageClone then start StorageFS-clone (kind:Mandatory)
> (id:order-StorageClone-StorageFS-mandatory)
>   start StorageFS-clone then start WebSite-clone (kind:Mandatory)
> (id:order-StorageFS-WebSite-mandatory)
>   start dlm-clone then start StorageFS-clone (kind:Mandatory)
> (id:order-dlm-clone-StorageFS-mandatory)
>   start StorageFS-clone then start nfs-group-clone (kind:Mandatory)
> (id:order-StorageFS-clone-nfs-group-clone-mandatory)
> Colocation Constraints:
>   WebSite-clone with ClusterIP-clone (score:INFINITY)
> (id:colocation-WebSite-ClusterIP-INFINITY)
>   StorageFS-clone with StorageClone (score:INFINITY)
> (with-rsc-role:Master) (id:colocation-StorageFS-StorageClone-INFINITY)
>   WebSite-clone with StorageFS-clone (score:INFINITY)
> (id:colocation-WebSite-StorageFS-INFINITY)
>   StorageFS-clone with dlm-clone (score:INFINITY)
> (id:colocation-StorageFS-dlm-clone-INFINITY)
>   StorageFS-clone with nfs-group-clone (score:INFINITY)
> (id:colocation-StorageFS-clone-nfs-group-clone-INFINITY)
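>
> For reference, ordering and colocation constraints of this form are created
> with the usual pcs syntax, for example:
>
> # pcs constraint order promote StorageClone then start StorageFS-clone
> # pcs constraint colocation add StorageFS-clone with master StorageClone INFINITY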
>
>