[ClusterLabs] Active-Active NFS cluster failover test - system hangs (VirtualBox)
ArekW
arkaduis at gmail.com
Wed Jul 12 05:06:47 EDT 2017
Hi,
The problem was due to a bad stonith configuration. With that fixed, the
configuration quoted below is an example of a working Active/Active NFS
configuration.
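
For anyone hitting the same problem: the fence_vbox stonith resource should
have roughly the shape below. The host address, login, key path and VM names
are placeholders rather than my real values, and parameter names can differ
between fence-agents versions, so check "pcs stonith describe fence_vbox"
first:

# pcs stonith describe fence_vbox
# pcs stonith create vbox-fencing fence_vbox \
      ipaddr=10.0.2.2 login=vboxuser identity_file=/root/.ssh/id_rsa \
      pcmk_host_map="nfsnode1:nfsnode1-vm;nfsnode2:nfsnode2-vm" \
      op monitor interval=60s
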
Regards,
Arek
2017-07-10 12:59 GMT+02:00 ArekW <arkaduis at gmail.com>:
> Hi,
> I've created a 2-node active-active HA cluster with an NFS resource. The
> resources are active on both nodes. The cluster passes a failover test with
> the pcs standby command but does not work when a "real" node shutdown occurs.
>
> Test scenario with cluster standby:
> - start cluster
> - mount nfs share on client1
> - start copy file from client1 to nfs share
> - during the copy, put node1/node2 into standby mode (pcs cluster standby
> nfsnode2; sketched after this list)
> - the copy continues
> - unstandby node1/node2
> - the copy continues and the storage re-syncs (drbd)
> - the copy finishes with no errors
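>
> The standby round-trip itself is nothing more than, roughly:
>
> # pcs cluster standby nfsnode2
> <check that the copy keeps running on client1>
> # pcs cluster unstandby nfsnode2
> # drbdadm status        <wait for the drbd resync to finish>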
>
> I can standby and unstandby the cluster many times and it works. The
> problem begins when I do a "true" failover test by hard-shutting down one
> of the nodes. Test results:
> - start cluster
> - mount nfs share on client1
> - start copy file from client1 to nfs share
> - during the copy, shut down node2 by stopping the node's virtual machine
> (hard stop)
> - the system hangs:
>
> <Start copy file at client1>
> # rsync -a --bwlimit=2000 /root/testfile.dat /mnt/nfsshare/
>
> <Everything works OK. There is a temp file .testfile.dat.9780fH>
>
> [root at nfsnode1 nfs]# ls -lah
> razem 9,8M
> drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
> drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
> -rw-r--r-- 1 root root 9 07-10 08:20 client1.txt
> -rw-r----- 1 root root 0 07-10 11:07 .rmtab
> -rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH
>
> [root at nfsnode1 nfs]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:07:29 2017 Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
> Master/Slave Set: StorageClone [Storage]
> Masters: [ nfsnode1 nfsnode2 ]
> Clone Set: dlm-clone [dlm]
> Started: [ nfsnode1 nfsnode2 ]
> vbox-fencing (stonith:fence_vbox): Started nfsnode1
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
> ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
> ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
> Clone Set: StorageFS-clone [StorageFS]
> Started: [ nfsnode1 nfsnode2 ]
> Clone Set: WebSite-clone [WebSite]
> Started: [ nfsnode1 nfsnode2 ]
> Clone Set: nfs-group-clone [nfs-group]
> Started: [ nfsnode1 nfsnode2 ]
>
> <Hard power-off of the nfsnode2 VM>
>
> [root at nfsnode1 nfs]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:07:43 2017 Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Node nfsnode2: UNCLEAN (offline)
> Online: [ nfsnode1 ]
>
> Full list of resources:
>
> Master/Slave Set: StorageClone [Storage]
> Storage (ocf::linbit:drbd): Master nfsnode2 (UNCLEAN)
> Masters: [ nfsnode1 ]
> Clone Set: dlm-clone [dlm]
> dlm (ocf::pacemaker:controld): Started nfsnode2 (UNCLEAN)
> Started: [ nfsnode1 ]
> vbox-fencing (stonith:fence_vbox): Started nfsnode1
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
> ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
> (UNCLEAN)
> ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
> Clone Set: StorageFS-clone [StorageFS]
> StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2 (UNCLEAN)
> Started: [ nfsnode1 ]
> Clone Set: WebSite-clone [WebSite]
> WebSite (ocf::heartbeat:apache): Started nfsnode2 (UNCLEAN)
> Started: [ nfsnode1 ]
> Clone Set: nfs-group-clone [nfs-group]
> Resource Group: nfs-group:1
> nfs (ocf::heartbeat:nfsserver): Started nfsnode2 (UNCLEAN)
> nfs-export (ocf::heartbeat:exportfs): Started nfsnode2
> (UNCLEAN)
> Started: [ nfsnode1 ]
>
> <ssh console hangs on client1>
> [root at nfsnode1 nfs]# ls -lah
> <nothing happens>
>
> <drbd status is ok in this situation>
> [root at nfsnode1 ~]# drbdadm status
> storage role:Primary
> disk:UpToDate
> nfsnode2 connection:Connecting
>
> <the nfs export is still active on node1>
> [root at nfsnode1 ~]# exportfs
> /mnt/drbd/nfs 10.0.2.0/255.255.255.0
>
> <After sshing to client1, the NFS mount is not accessible>
> login as: root
> root at 127.0.0.1's password:
> Last login: Mon Jul 10 07:48:17 2017 from 10.0.2.2
> # cd /mnt/
> # ls
> <console hangs>
>
> # mount
> 10.0.2.7:/ on /mnt/nfsshare type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.20,local_lock=none,addr=10.0.2.7)
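>
> <For reference, the share on client1 was mounted with something along the
> lines of the command below; reconstructed from the mount output above, so
> the exact options used may have differed>
> # mount -t nfs -o vers=4.0 10.0.2.7:/ /mnt/nfsshare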
>
> <Power on the nfsnode2 VM>
> <After nfsnode2 boots, the console on nfsnode1 starts responding, but the
> copy is not proceeding>
> <The temp file is visible but not active>
> [root at nfsnode1 ~]# ls -lah
> razem 9,8M
> drwxr-xr-x 2 root root 3,8K 07-10 11:07 .
> drwxr-xr-x 4 root root 3,8K 07-10 08:20 ..
> -rw-r--r-- 1 root root 9 07-10 08:20 client1.txt
> -rw-r----- 1 root root 0 07-10 11:16 .rmtab
> -rw------- 1 root root 9,8M 07-10 11:07 .testfile.dat.9780fH
>
> <Copying at client1 hangs>
>
> <Cluster status:>
> [root at nfsnode1 ~]# pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:17:19 2017 Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
> Master/Slave Set: StorageClone [Storage]
> Masters: [ nfsnode1 ]
> Stopped: [ nfsnode2 ]
> Clone Set: dlm-clone [dlm]
> Started: [ nfsnode1 nfsnode2 ]
> vbox-fencing (stonith:fence_vbox): Started nfsnode1
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
> ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
> Clone Set: StorageFS-clone [StorageFS]
> Started: [ nfsnode1 ]
> Stopped: [ nfsnode2 ]
> Clone Set: WebSite-clone [WebSite]
> Started: [ nfsnode1 ]
> Stopped: [ nfsnode2 ]
> Clone Set: nfs-group-clone [nfs-group]
> Resource Group: nfs-group:0
> nfs (ocf::heartbeat:nfsserver): Started nfsnode1
> nfs-export (ocf::heartbeat:exportfs): FAILED nfsnode1
> Stopped: [ nfsnode2 ]
>
> Failed Actions:
> * nfs-export_monitor_30000 on nfsnode1 'unknown error' (1): call=61,
> status=Timed Out, exitreason='none',
> last-rc-change='Mon Jul 10 11:11:50 2017', queued=0ms, exec=0ms
> * vbox-fencing_monitor_60000 on nfsnode1 'unknown error' (1): call=22,
> status=Error, exitreason='none',
> last-rc-change='Mon Jul 10 11:06:41 2017', queued=0ms, exec=11988ms
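>
> <Side note: fencing can also be exercised by hand, independent of a real
> node failure, with the command below; the nfsnode2 VM should get hard
> power-cycled by the VirtualBox host if the fence_vbox setup is correct>
> # pcs stonith fence nfsnode2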
>
> <Try to cleanup>
>
> # pcs resource cleanup
> # pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:20:38 2017 Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
> Master/Slave Set: StorageClone [Storage]
> Masters: [ nfsnode1 ]
> Stopped: [ nfsnode2 ]
> Clone Set: dlm-clone [dlm]
> Started: [ nfsnode1 nfsnode2 ]
> vbox-fencing (stonith:fence_vbox): Stopped
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
> ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
> Clone Set: StorageFS-clone [StorageFS]
> Stopped: [ nfsnode1 nfsnode2 ]
> Clone Set: WebSite-clone [WebSite]
> Stopped: [ nfsnode1 nfsnode2 ]
> Clone Set: nfs-group-clone [nfs-group]
> Stopped: [ nfsnode1 nfsnode2 ]
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> <Reboot of both nfsnode1 and nfsnode2>
> <After reboot:>
>
> # pcs status
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode2 (version 1.1.15-11.el7_3.5-e174ec8) - partition with
> quorum
> Last updated: Mon Jul 10 11:24:10 2017 Last change: Mon Jul 10
> 10:28:12 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 nfsnode2 ]
>
> Full list of resources:
>
> Master/Slave Set: StorageClone [Storage]
> Slaves: [ nfsnode2 ]
> Stopped: [ nfsnode1 ]
> Clone Set: dlm-clone [dlm]
> Started: [ nfsnode1 nfsnode2 ]
> vbox-fencing (stonith:fence_vbox): Stopped
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
> ClusterIP:0 (ocf::heartbeat:IPaddr2): Stopped
> ClusterIP:1 (ocf::heartbeat:IPaddr2): Stopped
> Clone Set: StorageFS-clone [StorageFS]
> Stopped: [ nfsnode1 nfsnode2 ]
> Clone Set: WebSite-clone [WebSite]
> Stopped: [ nfsnode1 nfsnode2 ]
> Clone Set: nfs-group-clone [nfs-group]
> Stopped: [ nfsnode1 nfsnode2 ]
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> <Eventually the cluster was recovered after:>
> pcs cluster stop --all
> <Solve drbd split-brain>
> pcs cluster start --all
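> <Resolving the split-brain is the standard manual DRBD procedure, roughly -
> on the node whose changes are to be discarded:>
> # drbdadm disconnect storage
> # drbdadm secondary storage
> # drbdadm connect --discard-my-data storage
> <and on the surviving node, if its connection is stuck in StandAlone:>
> # drbdadm connect storage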
>
> client1 could not be rebooted with 'reboot', presumably because of the hung
> mount. It had to be rebooted the hard way from the VirtualBox hypervisor.
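> (The hard reboot from the hypervisor is just something along the lines of
> the command below; "client1" here is whatever the VM is called in
> VirtualBox.)
> # VBoxManage controlvm client1 poweroff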
> What's wrong with this configuration? I can send the CIB configuration if
> necessary.
>
> ---------------
> Full cluster configuration (working state):
>
> # pcs status --full
> Cluster name: nfscluster
> Stack: corosync
> Current DC: nfsnode1 (1) (version 1.1.15-11.el7_3.5-e174ec8) - partition
> with quorum
> Last updated: Mon Jul 10 12:44:03 2017 Last change: Mon Jul 10
> 11:37:13 2017 by root via crm_attribute on nfsnode1
>
> 2 nodes and 15 resources configured
>
> Online: [ nfsnode1 (1) nfsnode2 (2) ]
>
> Full list of resources:
>
> Master/Slave Set: StorageClone [Storage]
> Storage (ocf::linbit:drbd): Master nfsnode1
> Storage (ocf::linbit:drbd): Master nfsnode2
> Masters: [ nfsnode1 nfsnode2 ]
> Clone Set: dlm-clone [dlm]
> dlm (ocf::pacemaker:controld): Started nfsnode1
> dlm (ocf::pacemaker:controld): Started nfsnode2
> Started: [ nfsnode1 nfsnode2 ]
> vbox-fencing (stonith:fence_vbox): Started nfsnode1
> Clone Set: ClusterIP-clone [ClusterIP] (unique)
> ClusterIP:0 (ocf::heartbeat:IPaddr2): Started nfsnode2
> ClusterIP:1 (ocf::heartbeat:IPaddr2): Started nfsnode1
> Clone Set: StorageFS-clone [StorageFS]
> StorageFS (ocf::heartbeat:Filesystem): Started nfsnode1
> StorageFS (ocf::heartbeat:Filesystem): Started nfsnode2
> Started: [ nfsnode1 nfsnode2 ]
> Clone Set: WebSite-clone [WebSite]
> WebSite (ocf::heartbeat:apache): Started nfsnode1
> WebSite (ocf::heartbeat:apache): Started nfsnode2
> Started: [ nfsnode1 nfsnode2 ]
> Clone Set: nfs-group-clone [nfs-group]
> Resource Group: nfs-group:0
> nfs (ocf::heartbeat:nfsserver): Started nfsnode1
> nfs-export (ocf::heartbeat:exportfs): Started nfsnode1
> Resource Group: nfs-group:1
> nfs (ocf::heartbeat:nfsserver): Started nfsnode2
> nfs-export (ocf::heartbeat:exportfs): Started nfsnode2
> Started: [ nfsnode1 nfsnode2 ]
>
> Node Attributes:
> * Node nfsnode1 (1):
> + master-Storage : 10000
> * Node nfsnode2 (2):
> + master-Storage : 10000
>
> Migration Summary:
> * Node nfsnode1 (1):
> * Node nfsnode2 (2):
>
> PCSD Status:
> nfsnode1: Online
> nfsnode2: Online
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> # pcs resource --full
> Master: StorageClone
> Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=2
> clone-node-max=1
> Resource: Storage (class=ocf provider=linbit type=drbd)
> Attributes: drbd_resource=storage
> Operations: start interval=0s timeout=240 (Storage-start-interval-0s)
> promote interval=0s timeout=90 (Storage-promote-interval-0s)
> demote interval=0s timeout=90 (Storage-demote-interval-0s)
> stop interval=0s timeout=100 (Storage-stop-interval-0s)
> monitor interval=60s (Storage-monitor-interval-60s)
> Clone: dlm-clone
> Meta Attrs: clone-max=2 clone-node-max=1
> Resource: dlm (class=ocf provider=pacemaker type=controld)
> Operations: start interval=0s timeout=90 (dlm-start-interval-0s)
> stop interval=0s timeout=100 (dlm-stop-interval-0s)
> monitor interval=60s (dlm-monitor-interval-60s)
> Clone: ClusterIP-clone
> Meta Attrs: clona-node-max=2 clone-max=2 globally-unique=true
> clone-node-max=2
> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.0.2.7 cidr_netmask=32 clusterip_hash=sourceip
> Meta Attrs: resource-stickiness=0
> Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> monitor interval=5s (ClusterIP-monitor-interval-5s)
> Clone: StorageFS-clone
> Resource: StorageFS (class=ocf provider=heartbeat type=Filesystem)
> Attributes: device=/dev/drbd1 directory=/mnt/drbd fstype=gfs2
> Operations: start interval=0s timeout=60 (StorageFS-start-interval-0s)
> stop interval=0s timeout=60 (StorageFS-stop-interval-0s)
> monitor interval=20 timeout=40 (StorageFS-monitor-interval-20)
> Clone: WebSite-clone
> Resource: WebSite (class=ocf provider=heartbeat type=apache)
> Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
> Operations: start interval=0s timeout=40s (WebSite-start-interval-0s)
> stop interval=0s timeout=60s (WebSite-stop-interval-0s)
> monitor interval=1min (WebSite-monitor-interval-1min)
> Clone: nfs-group-clone
> Meta Attrs: interleave=true
> Group: nfs-group
> Resource: nfs (class=ocf provider=heartbeat type=nfsserver)
> Attributes: nfs_ip=10.0.2.7 nfs_no_notify=true
> Operations: start interval=0s timeout=40 (nfs-start-interval-0s)
> stop interval=0s timeout=20s (nfs-stop-interval-0s)
> monitor interval=30s (nfs-monitor-interval-30s)
> Resource: nfs-export (class=ocf provider=heartbeat type=exportfs)
> Attributes: clientspec=10.0.2.0/255.255.255.0
> options=rw,sync,no_root_squash directory=/mnt/drbd/nfs fsid=0
> Operations: start interval=0s timeout=40 (nfs-export-start-interval-0s)
> stop interval=0s timeout=120 (nfs-export-stop-interval-0s)
> monitor interval=30s (nfs-export-monitor-interval-30s)
>
> # pcs constraint --full
> Location Constraints:
> Ordering Constraints:
> start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
> (id:order-ClusterIP-WebSite-mandatory)
> promote StorageClone then start StorageFS-clone (kind:Mandatory)
> (id:order-StorageClone-StorageFS-mandatory)
> start StorageFS-clone then start WebSite-clone (kind:Mandatory)
> (id:order-StorageFS-WebSite-mandatory)
> start dlm-clone then start StorageFS-clone (kind:Mandatory)
> (id:order-dlm-clone-StorageFS-mandatory)
> start StorageFS-clone then start nfs-group-clone (kind:Mandatory)
> (id:order-StorageFS-clone-nfs-group-clone-mandatory)
> Colocation Constraints:
> WebSite-clone with ClusterIP-clone (score:INFINITY)
> (id:colocation-WebSite-ClusterIP-INFINITY)
> StorageFS-clone with StorageClone (score:INFINITY)
> (with-rsc-role:Master) (id:colocation-StorageFS-StorageClone-INFINITY)
> WebSite-clone with StorageFS-clone (score:INFINITY)
> (id:colocation-WebSite-StorageFS-INFINITY)
> StorageFS-clone with dlm-clone (score:INFINITY)
> (id:colocation-StorageFS-dlm-clone-INFINITY)
> StorageFS-clone with nfs-group-clone (score:INFINITY)
> (id:colocation-StorageFS-clone-nfs-group-clone-INFINITY)
>
>