[ClusterLabs] HALVM monitor action fail on slave node. Possible bug?

emmanuel segura emi2fast at gmail.com
Fri Apr 13 09:54:30 EDT 2018


The first thing you need to configure is stonith, because you have this
constraint: "constraint order promote DrbdResClone then start HALVM".
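
For example (only a sketch: it assumes the nodes are KVM guests, which the
vda/vdb virtio disks suggest, and that fence_virtd is already configured on
the hypervisor with libvirt domain names matching the node names):

    # fence_xvm comes from the fence-virt package on the guests,
    # fence_virtd runs on the virtualization host
    pcs stonith create fence_pcmk1 fence_xvm port="pcmk1" pcmk_host_list="pcmk1"
    pcs stonith create fence_pcmk2 fence_xvm port="pcmk2" pcmk_host_list="pcmk2"
    pcs property set stonith-enabled=true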

To recover and promote drbd to master when you crash a node, configure the
drbd fencing handler.
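
A minimal sketch, e.g. in the common section of /etc/drbd.d/global_common.conf
(the helper scripts ship with drbd84-utils):

    disk {
        fencing resource-only;
    }
    handlers {
        # puts a constraint on the Master role while the peer is outdated
        # and removes it again after the resync finishes
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

With real stonith in place you may prefer "fencing resource-and-stonith;"
instead.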

Pacemaker runs a one-time monitor (probe) on every node to discover the
current state of each resource, so seeing the monitor on both nodes is normal.
To test why the monitor fails, use ocf-tester.
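
Something along these lines (an illustration only; note that ocf-tester really
starts and stops the resource, so run it while the resource is not under
cluster control, on the node where the volume group can be activated):

    ocf-tester -n HALVM -o volgrpname=havolumegroup \
        /usr/lib/ocf/resource.d/heartbeat/LVM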

2018-04-13 15:29 GMT+02:00 Marco Marino <marino.mrc at gmail.com>:

> Hello, I'm trying to configure a simple 2 node cluster with drbd and HALVM
> (ocf:heartbeat:LVM), but I have a problem that I'm not able to solve, so I
> decided to write this long post. I really need to understand what I'm doing
> and where I'm going wrong.
> More precisely, I'm configuring a pacemaker cluster with 2 nodes and only
> one drbd resource. Here all operations:
>
> - System configuration
>     hostnamectl set-hostname pcmk[12]
>     yum update -y
>     yum install vim wget git -y
>     vim /etc/sysconfig/selinux  -> permissive mode
>     systemctl disable firewalld
>     reboot
>
> - Network configuration
>     [pcmk1]
>     nmcli connection modify corosync ipv4.method manual ipv4.addresses 192.168.198.201/24 ipv6.method ignore connection.autoconnect yes
>     nmcli connection modify replication ipv4.method manual ipv4.addresses 192.168.199.201/24 ipv6.method ignore connection.autoconnect yes
>     [pcmk2]
>     nmcli connection modify corosync ipv4.method manual ipv4.addresses 192.168.198.202/24 ipv6.method ignore connection.autoconnect yes
>     nmcli connection modify replication ipv4.method manual ipv4.addresses 192.168.199.202/24 ipv6.method ignore connection.autoconnect yes
>
>     ssh-keygen -t rsa
>     ssh-copy-id root at pcmk[12]
>     scp /etc/hosts root at pcmk2:/etc/hosts
>
> - Drbd Repo configuration and drbd installation
>     rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
>     rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
>     yum update -y
>     yum install drbd84-utils kmod-drbd84 -y
>
> - Drbd Configuration:
>     Creating a new partition on top of /dev/vdb -> /dev/vdb1 of type "Linux" (83)
>     [/etc/drbd.d/global_common.conf]
>     usage-count no;
>     [/etc/drbd.d/myres.res]
>     resource myres {
>         on pcmk1 {
>                 device /dev/drbd0;
>                 disk /dev/vdb1;
>                 address 192.168.199.201:7789;
>                 meta-disk internal;
>         }
>         on pcmk2 {
>                 device /dev/drbd0;
>                 disk /dev/vdb1;
>                 address 192.168.199.202:7789;
>                 meta-disk internal;
>         }
>     }
>
>     scp /etc/drbd.d/myres.res root at pcmk2:/etc/drbd.d/myres.res
>     systemctl start drbd <-- only for test. The service is disabled at
> boot!
>     drbdadm create-md myres
>     drbdadm up myres
>     drbdadm primary --force myres
>
> - LVM Configuration
>     [root at pcmk1 ~]# lsblk
>     NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
>     sr0          11:0    1 1024M  0 rom
>     vda         252:0    0   20G  0 disk
>     ├─vda1      252:1    0    1G  0 part /boot
>     └─vda2      252:2    0   19G  0 part
>       ├─cl-root 253:0    0   17G  0 lvm  /
>       └─cl-swap 253:1    0    2G  0 lvm  [SWAP]
>     vdb         252:16   0    8G  0 disk
>     └─vdb1      252:17   0    8G  0 part  <--- /dev/vdb1 is the partition I'd like to use as backing device for drbd
>       └─drbd0   147:0    0    8G  0 disk
>
>     [/etc/lvm/lvm.conf]
>     write_cache_state = 0
>     use_lvmetad = 0
>     filter = [ "a|drbd.*|", "a|vda.*|", "r|.*|" ]
>
>     Disabling lvmetad service
>     systemctl disable lvm2-lvmetad.service
>     systemctl disable lvm2-lvmetad.socket
>     reboot
>
> - Creating volume group and logical volume
>     systemctl start drbd (both nodes)
>     drbdadm primary myres
>     pvcreate /dev/drbd0
>     vgcreate havolumegroup /dev/drbd0
>     lvcreate -n c-vol1 -L1G havolumegroup
>     [root at pcmk1 ~]# lvs
>         LV     VG            Attr       LSize   Pool Origin Data%  Meta% Move Log Cpy%Sync Convert
>         root   cl            -wi-ao---- <17.00g
>         swap   cl            -wi-ao----   2.00g
>         c-vol1 havolumegroup -wi-a-----   1.00g
>
>
> - Cluster Configuration
>     yum install pcs fence-agents-all -y
>     systemctl enable pcsd
>     systemctl start pcsd
>     echo redhat | passwd --stdin hacluster
>     pcs cluster auth pcmk1 pcmk2
>     pcs cluster setup --name ha_cluster pcmk1 pcmk2
>     pcs cluster start --all
>     pcs cluster enable --all
>     pcs property set stonith-enabled=false    <--- Just for test!!!
>     pcs property set no-quorum-policy=ignore
>
> - Drbd resource configuration
>     pcs cluster cib drbd_cfg
>     pcs -f drbd_cfg resource create DrbdRes ocf:linbit:drbd drbd_resource=myres op monitor interval=60s
>     pcs -f drbd_cfg resource master DrbdResClone DrbdRes master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>     [root at pcmk1 ~]# pcs -f drbd_cfg resource show
>      Master/Slave Set: DrbdResClone [DrbdRes]
>          Stopped: [ pcmk1 pcmk2 ]
>     [root at pcmk1 ~]#
>
>     Testing the failover with a forced shutoff of pcmk1. When pcmk1 comes
> back up, drbd is slave, but the logical volume is not active on pcmk2. So I
> need HALVM.
>     [root at pcmk2 ~]# lvs
>       LV     VG            Attr       LSize   Pool Origin Data%  Meta% Move Log Cpy%Sync Convert
>       root   cl            -wi-ao---- <17.00g
>       swap   cl            -wi-ao----   2.00g
>       c-vol1 havolumegroup -wi-------   1.00g
>
>     [root at pcmk2 ~]#
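>
>     In other words, what HALVM (ocf:heartbeat:LVM) has to do on the active
> node is essentially:
>         vgchange -ay havolumegroup
>         lvs havolumegroup    <-- c-vol1 should show the 'a' (active) flag again
> and this is what I want the cluster to automate.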
>
>
>
> - Lvm resource and constraints
>     pcs cluster cib lvm_cfg
>     pcs -f lvm_cfg resource create HALVM ocf:heartbeat:LVM volgrpname=havolumegroup
>     pcs -f lvm_cfg constraint colocation add HALVM with master DrbdResClone INFINITY
>     pcs -f lvm_cfg constraint order promote DrbdResClone then start HALVM
>
>     [root at pcmk1 ~]# pcs -f lvm_cfg constraint
>     Location Constraints:
>     Ordering Constraints:
>       promote DrbdResClone then start HALVM (kind:Mandatory)
>     Colocation Constraints:
>       HALVM with DrbdResClone (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master)
>     Ticket Constraints:
>     [root at pcmk1 ~]#
>
>
>     [root at pcmk1 ~]# pcs status
>     Cluster name: ha_cluster
>     Stack: corosync
>     Current DC: pcmk2 (version 1.1.16-12.el7_4.8-94ff4df) - partition with
> quorum
>     Last updated: Fri Apr 13 15:12:49 2018
>     Last change: Fri Apr 13 15:05:18 2018 by root via cibadmin on pcmk1
>
>     2 nodes configured
>     2 resources configured
>
>     Online: [ pcmk1 pcmk2 ]
>
>     Full list of resources:
>
>      Master/Slave Set: DrbdResClone [DrbdRes]
>          Masters: [ pcmk2 ]
>          Slaves: [ pcmk1 ]
>
>     Daemon Status:
>       corosync: active/enabled
>       pacemaker: active/enabled
>       pcsd: active/enabled
>
>     #########[PUSHING NEW CONFIGURATION]#########
>     [root at pcmk1 ~]# pcs cluster cib-push lvm_cfg
>     CIB updated
>     [root at pcmk1 ~]# pcs status
>     Cluster name: ha_cluster
>     Stack: corosync
>     Current DC: pcmk2 (version 1.1.16-12.el7_4.8-94ff4df) - partition with
> quorum
>     Last updated: Fri Apr 13 15:12:57 2018
>     Last change: Fri Apr 13 15:12:55 2018 by root via cibadmin on pcmk1
>
>     2 nodes configured
>     3 resources configured
>
>     Online: [ pcmk1 pcmk2 ]
>
>     Full list of resources:
>
>      Master/Slave Set: DrbdResClone [DrbdRes]
>          Masters: [ pcmk2 ]
>          Slaves: [ pcmk1 ]
>      HALVM    (ocf::heartbeat:LVM):    Started pcmk2
>
>     Failed Actions:
>     * HALVM_monitor_0 on pcmk1 'unknown error' (1): call=13,
> status=complete, exitreason='LVM Volume havolumegroup is not available',
>         last-rc-change='Fri Apr 13 15:12:56 2018', queued=0ms, exec=52ms
>
>
>     Daemon Status:
>       corosync: active/enabled
>       pacemaker: active/enabled
>       pcsd: active/enabled
>     [root at pcmk1 ~]#
>
>
>     ##########[TRYING TO CLEANUP RESOURCE CONFIGURATION]##################
>     [root at pcmk1 ~]# pcs resource cleanup
>     Waiting for 1 replies from the CRMd. OK
>     [root at pcmk1 ~]# pcs status
>     Cluster name: ha_cluster
>     Stack: corosync
>     Current DC: pcmk2 (version 1.1.16-12.el7_4.8-94ff4df) - partition with
> quorum
>     Last updated: Fri Apr 13 15:13:18 2018
>     Last change: Fri Apr 13 15:12:55 2018 by root via cibadmin on pcmk1
>
>     2 nodes configured
>     3 resources configured
>
>     Online: [ pcmk1 pcmk2 ]
>
>     Full list of resources:
>
>      Master/Slave Set: DrbdResClone [DrbdRes]
>          Masters: [ pcmk2 ]
>          Slaves: [ pcmk1 ]
>      HALVM    (ocf::heartbeat:LVM):    Started pcmk2
>
>     Failed Actions:
>     * HALVM_monitor_0 on pcmk1 'unknown error' (1): call=26,
> status=complete, exitreason='LVM Volume havolumegroup is not available',
>         last-rc-change='Fri Apr 13 15:13:17 2018', queued=0ms, exec=113ms
>
>
>     Daemon Status:
>       corosync: active/enabled
>       pacemaker: active/enabled
>       pcsd: active/enabled
>     [root at pcmk1 ~]#
> #########################################################
> some details about packages and versions:
> [root at pcmk1 ~]# rpm -qa | grep pacem
> pacemaker-cluster-libs-1.1.16-12.el7_4.8.x86_64
> pacemaker-libs-1.1.16-12.el7_4.8.x86_64
> pacemaker-1.1.16-12.el7_4.8.x86_64
> pacemaker-cli-1.1.16-12.el7_4.8.x86_64
> [root at pcmk1 ~]# rpm -qa | grep coro
> corosynclib-2.4.0-9.el7_4.2.x86_64
> corosync-2.4.0-9.el7_4.2.x86_64
> [root at pcmk1 ~]# rpm -qa | grep drbd
> drbd84-utils-9.1.0-1.el7.elrepo.x86_64
> kmod-drbd84-8.4.10-1_2.el7_4.elrepo.x86_64
> [root at pcmk1 ~]# cat /etc/redhat-release
> CentOS Linux release 7.4.1708 (Core)
> [root at pcmk1 ~]# uname -r
> 3.10.0-693.21.1.el7.x86_64
> [root at pcmk1 ~]#
> ##############################################################
>
>
> So it seems to me that the problem is that the "monitor" action of the
> ocf:heartbeat:LVM resource is executed on both nodes even though I configured
> specific colocation and ordering constraints. I don't know where the problem
> is, but I need to understand how to solve the issue. If possible, I invite
> someone to reproduce the configuration and, possibly, the issue. It seems
> like a bug, but obviously I'm not sure. What worries me is that it should be
> Pacemaker that decides where and when a resource starts, so there is probably
> something wrong in my constraints configuration. I'm sorry for this long post.
> Thank you,
> Marco
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^

