[ClusterLabs] Need help to enable hot switch of iSCSI (tgtd) under two node Pacemaker + DRBD 9.0 under CentOS 7.5 in ESXi 6.5 Environment
LiFeng Zhang
zhang at linux-systeme.de
Wed Oct 24 02:59:01 EDT 2018
Hello Friends,
A further thought about this situation:
I want to run tgtd as a cluster service on top of a "primary/secondary" DRBD
resource. Since DRBD is only active on the primary node, tgtd cannot start
successfully on the secondary node.
So should I, and if so how do I, configure a primary/secondary tgtd service?
Or which Pacemaker feature should I use so that tgtd only starts on the one
node where DRBD is primary?
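I am guessing at something like the following; this is only a sketch, reusing
the resource names from my configuration below (ipstor0Clone for the DRBD
master/slave resource, p_iSCSI for the iSCSI group), so please correct me if
the approach itself is wrong:
------------------------------------------------------------------------
# colocate the iSCSI group with the DRBD Master role, so the group may
# only run on the node where DRBD is currently primary
pcs constraint colocation add p_iSCSI with master ipstor0Clone INFINITY
# and only start the group after the DRBD resource has been promoted
pcs constraint order promote ipstor0Clone then start p_iSCSI
------------------------------------------------------------------------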
Thank you very much in advance for any suggestions.
Best Regards
Lifeng
> Dear Andrei Borzenkov,
>
> Thank you very much for your answer. I've checked the logs all along, but
> there is nothing helpful in them, just a bunch of heartbeat messages.
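> (I mainly looked through the Corosync/Pacemaker logs, assuming the default
> CentOS 7 locations plus the logfile set in corosync.conf, with something
> like:
> ------------------------------------------------------------------------
> # search the cluster log for messages from the iSCSI resource agents
> grep -i -e iSCSITarget -e iSCSILogicalUnit /var/log/cluster/corosync.log
> # or look at the Pacemaker journal directly
> journalctl -u pacemaker | grep -i iscsi
> ------------------------------------------------------------------------
> but, as said, found nothing helpful there.)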
>
> Anyway, I've read the book "CentOS High Availability" (Packt, 2015), got
> some new ideas and tried them out; the situation is now somewhat different.
>
> ------------------------------------------------------------------------
> pcs resource create p_iSCSITarget ocf:heartbeat:iSCSITarget implementation="tgt" iqn="iqn.2018-08.s-ka.local:disk" tid="1"
> pcs resource create p_iSCSILogicalUnit ocf:heartbeat:iSCSILogicalUnit implementation="tgt" target_iqn="iqn.2018-08.s-ka.local:disk" lun="10" path="/dev/drbd/by-disk/vg0/ipstor0"
> pcs resource group add p_iSCSI ClusterIP p_iSCSITarget p_iSCSILogicalUnit
> pcs constraint colocation set ClusterIP p_iSCSITarget p_iSCSILogicalUnit
> ------------------------------------------------------------------------
>
>
> The difference from the previous version is here: I use the IQN
> "iqn.2018-08.s-ka.local:disk" instead of "iqn.2018-08.s-ka.local:disk.1";
> the trailing ".1" probably refers to the "tid".
>
> Now I have a new problem: the resources and tgtd do start, but although I
> set a "colocation constraint", Pacemaker keeps trying to start tgtd on the
> other node as well.
> How do I solve this? Thank you all in advance!
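> (My own guess, for what it is worth: the colocation set shown below only
> keeps ClusterIP, p_iSCSITarget and p_iSCSILogicalUnit together; it does not
> tie them to the DRBD Master role, so Pacemaker is presumably still free to
> try the other node. And after adding a constraint against the Master role I
> assume the recorded start failures have to be cleared as well, e.g.:
> ------------------------------------------------------------------------
> # clear the failed start actions so Pacemaker re-evaluates placement
> pcs resource cleanup p_iSCSITarget
> pcs resource cleanup p_iSCSILogicalUnit
> ------------------------------------------------------------------------
> Is that the right direction?)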
>
> Here is the output of "pcs status":
> ------------------------------------------------------------------------
> [root at drbd0 /]# pcs status
> Cluster name: cluster1
> Stack: corosync
> Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9)
> - partition with quorum
> Last updated: Wed Oct 24 08:43:29 2018
> Last change: Wed Oct 24 08:43:24 2018 by root via cibadmin on
> drbd0-ha.s-ka.local
>
> 2 nodes configured
> 5 resources configured
>
> Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>
> Full list of resources:
>
> Master/Slave Set: ipstor0Clone [ipstor0]
> Masters: [ drbd0-ha.s-ka.local ]
> Slaves: [ drbd1-ha.s-ka.local ]
> Resource Group: p_iSCSI
> ClusterIP (ocf::heartbeat:IPaddr2): Started drbd0-ha.s-ka.local
> p_iSCSITarget (ocf::heartbeat:iSCSITarget): Started
> drbd0-ha.s-ka.local
> p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Started
> drbd0-ha.s-ka.local
>
> Failed Actions:
> * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
> call=32, status=complete, exitreason='',
> last-rc-change='Wed Oct 24 08:37:25 2018', queued=0ms, exec=23ms
> * p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error'
> (1): call=38, status=complete, exitreason='',
> last-rc-change='Wed Oct 24 08:37:55 2018', queued=0ms, exec=28ms
>
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
> [root at drbd0 /]
>
>
> [root at drbd0 /]# pcs constraint show --full
> Location Constraints:
> Ordering Constraints:
> Colocation Constraints:
> Resource Sets:
> set ClusterIP p_iSCSITarget p_iSCSILogicalUnit
> (id:pcs_rsc_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit) setoptions
> score=INFINITY
> (id:pcs_rsc_colocation_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit)
> Ticket Constraints:
> [root at drbd0 /]#
> ------------------------------------------------------------------------
>
> Best Regards
> Lifeng
>
> On 2018/10/19 06:02, Andrei Borzenkov wrote:
>> On 16.10.2018 15:29, LiFeng Zhang wrote:
>>> Hi, all dear friends,
>>>
>>> I need your help to enable hot switchover of iSCSI under a
>>> Pacemaker/Corosync cluster, which exports an iSCSI device backed by a
>>> two-node DRBD replication.
>>>
>>> I've got the Pacemaker/Corosync cluster working and the DRBD replication
>>> working as well, but it is stuck at iSCSI. I can manually start tgtd on
>>> one node, so that the VCSA recognizes the iSCSI disk, creates a
>>> VMFS/storage object on it, and I can then create a test VM on that VMFS.
>>>
>>> But when I switch the DRBD Primary/Secondary roles, the test VM keeps
>>> running while its underlying disk becomes read-only. As far as I know,
>>> tgtd should be handled by Pacemaker so that it automatically starts on the
>>> current DRBD Primary, but in my setup it sadly is NOT.
>>>
>> Pacemaker only handles resources that were started by Pacemaker.
>> According to your output below, in all cases the resource was stopped from
>> Pacemaker's point of view, and all of Pacemaker's attempts to start the
>> resource failed. You should troubleshoot why they failed. This requires
>> knowledge of the specific resource agent; sadly I am not familiar with the
>> iSCSI target agents. The Pacemaker logs may include more information from
>> the resource agent than just "unknown error".
>>
>>> I've tried all kinds of resources/manuals/documents, but they are all
>>> mixed with extra information, other systems, or other software versions.
>>>
>>> One of my best references (the closest configuration to mine) is this
>>> URL: https://nnc3.com/mags/LJ_1994-2014/LJ/217/11275.html
>>>
>>> The difference between my setup and this article, I think, is that I don't
>>> have an LVM volume but only a raw iSCSI disk, and that I have to translate
>>> the CRM commands into PCS commands.
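>>> (A made-up example of the kind of translation I mean, assuming I read both
>>> tools correctly; "p_ip_example" is just an illustrative name:
>>> ------------------------------------------------------------------------
>>> # crm style, as used in such articles:
>>> #   crm configure primitive p_ip_example ocf:heartbeat:IPaddr2 \
>>> #       params ip=192.168.95.48 cidr_netmask=32
>>> # my rough pcs equivalent:
>>> pcs resource create p_ip_example ocf:heartbeat:IPaddr2 \
>>>     ip=192.168.95.48 cidr_netmask=32
>>> ------------------------------------------------------------------------
>>> I hope I translated the constraints correctly as well.)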
>>>
>>> But after I "copied" the configuration from this article, my cluster could
>>> not start anymore. I tried removing the LVM resource (which had caused a
>>> "device not found" error), but the resource group still cannot start, and
>>> Pacemaker gives no explicit "reason".
>>>
>>>
>>> *1*. The whole setup runs on a two-node ESXi 6.5 cluster, with a VCSA
>>> installed on one of the ESXi hosts.
>>>
>>> I have attached a simple diagram, which may describe the deployment
>>> better.
>>>
>>> 2. Starting point:
>>>
>>> The involved hosts are all mapped through local DNS, which also includes
>>> the floating VIP; the local domain is s-ka.local:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> firewall: fw01.s-ka.local. IN A 192.168.95.249
>>>
>>> vcsa: vc01.s-ka.local. IN A 192.168.95.30
>>> esxi: esx01.s-ka.local. IN A 192.168.95.5
>>> esxi: esx02.s-ka.local. IN A 192.168.95.7
>>>
>>> drbd: drbd0.s-ka.local. IN A 192.168.95.45
>>> drbd: drbd1.s-ka.local. IN A 192.168.95.47
>>> vip: ipstor0.s-ka.local. IN A 192.168.95.48
>>>
>>> heartbeat: drbd0-ha.s-ka.local. IN A 192.168.96.45
>>> heartbeat: drbd1-ha.s-ka.local. IN A 192.168.96.47
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> Both DRBD servers run CentOS 7.5; the installed packages are as follows:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 ~]# cat /etc/centos-release
>>> CentOS Linux release 7.5.1804 (Core)
>>>
>>> [root at drbd0 ~]# uname -a
>>> Linux drbd0.s-ka.local 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16
>>> 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> [root at drbd1 ~]# yum list installed|grep pacemaker
>>> pacemaker.x86_64 1.1.18-11.el7_5.3 @updates
>>> pacemaker-cli.x86_64 1.1.18-11.el7_5.3 @updates
>>> pacemaker-cluster-libs.x86_64 1.1.18-11.el7_5.3 @updates
>>> pacemaker-libs.x86_64 1.1.18-11.el7_5.3 @updates
>>>
>>> [root at drbd1 ~]# yum list installed|grep coro
>>> corosync.x86_64 2.4.3-2.el7_5.1 @updates
>>> corosynclib.x86_64 2.4.3-2.el7_5.1 @updates
>>>
>>> [root at drbd1 ~]# yum list installed|grep drbd
>>> drbd90-utils.x86_64 9.3.1-1.el7.elrepo @elrepo
>>> kmod-drbd90.x86_64 9.0.14-1.el7_5.elrepo @elrepo
>>>
>>> [root at drbd1 ~]# yum list installed|grep -i scsi
>>> lsscsi.x86_64 0.27-6.el7 @anaconda
>>> scsi-target-utils.x86_64 1.0.55-4.el7 @epel
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> 3. configurations
>>>
>>> 3.1 First, the DRBD configuration
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd1 ~]# cat /etc/drbd.conf
>>> # You can find an example in /usr/share/doc/drbd.../drbd.conf.example
>>>
>>> include "drbd.d/global_common.conf";
>>> include "drbd.d/*.res";
>>>
>>> [root at drbd1 ~]# cat /etc/drbd.d/r0.res
>>> resource iscsivg01 {
>>> protocol C;
>>> device /dev/drbd0;
>>> disk /dev/vg0/ipstor0;
>>> flexible-meta-disk internal;
>>> on drbd0.s-ka.local {
>>> #volume 0 {
>>> #device /dev/drbd0;
>>> #disk /dev/vg0/ipstor0;
>>> #flexible-meta-disk internal;
>>> #}
>>> address 192.168.96.45:7788;
>>> }
>>> on drbd1.s-ka.local {
>>> #volume 0 {
>>> #device /dev/drbd0;
>>> #disk /dev/vg0/ipstor0;
>>> #flexible-meta-disk internal;
>>> #}
>>> address 192.168.96.47:7788;
>>> }
>>> }
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.2 Then the DRBD device
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd1 ~]# lsblk
>>> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
>>> sda 8:0 0 25G 0 disk
>>> ├─sda1 8:1 0 1G 0 part /boot
>>> └─sda2 8:2 0 24G 0 part
>>> ├─centos-root 253:0 0 22G 0 lvm /
>>> └─centos-swap 253:1 0 2G 0 lvm [SWAP]
>>> sdb 8:16 0 500G 0 disk
>>> └─sdb1 8:17 0 500G 0 part
>>> └─vg0-ipstor0 253:2 0 500G 0 lvm
>>> └─drbd0 147:0 0 500G 1 disk
>>> sr0 11:0 1 1024M 0 rom
>>>
>>> [root at drbd1 ~]# tree /dev/drbd
>>> drbd/ drbd0
>>> [root at drbd1 ~]# tree /dev/drbd
>>> /dev/drbd
>>> ├── by-disk
>>> │ └── vg0
>>> │ └── ipstor0 -> ../../../drbd0
>>> └── by-res
>>> └── iscsivg01
>>> └── 0 -> ../../../drbd0
>>>
>>> 4 directories, 2 files
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.3 DRBD status
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd1 ~]# drbdadm status
>>> iscsivg01 role:Secondary
>>> disk:UpToDate
>>> drbd0.s-ka.local role:Primary
>>> peer-disk:UpToDate
>>>
>>> [root at drbd0 ~]# drbdadm status
>>> iscsivg01 role:Primary
>>> disk:UpToDate
>>> drbd1.s-ka.local role:Secondary
>>> peer-disk:UpToDate
>>>
>>> [root at drbd0 ~]# cat /proc/drbd
>>> version: 9.0.14-1 (api:2/proto:86-113)
>>> GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by mockbuild@,
>>> 2018-05-04 03:32:42
>>> Transports (api:16): tcp (9.0.14-1)
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.4 Corosync configuration
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# cat /etc/corosync/corosync.conf
>>> totem {
>>> version: 2
>>> cluster_name: cluster1
>>> secauth: off
>>> transport: udpu
>>> }
>>>
>>> nodelist {
>>> node {
>>> ring0_addr: drbd0-ha.s-ka.local
>>> nodeid: 1
>>> }
>>>
>>> node {
>>> ring0_addr: drbd1-ha.s-ka.local
>>> nodeid: 2
>>> }
>>> }
>>>
>>> quorum {
>>> provider: corosync_votequorum
>>> two_node: 1
>>> }
>>>
>>> logging {
>>> to_logfile: yes
>>> logfile: /var/log/cluster/corosync.log
>>> to_syslog: yes
>>> }
>>>
>>> ------------------------------------------------------------------------
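>>> (If it matters, ring and quorum state can also be checked with the
>>> standard tools, e.g.:
>>> ------------------------------------------------------------------------
>>> # state of the totem ring(s)
>>> corosync-cfgtool -s
>>> # quorum / vote information
>>> corosync-quorumtool -s
>>> ------------------------------------------------------------------------
>>> )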
>>>
>>>
>>> 3.5 Corosync status:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# systemctl status corosync
>>> ● corosync.service - Corosync Cluster Engine
>>> Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled;
>>> vendor preset: disabled)
>>> Active: active (running) since Sun 2018-10-14 02:58:01 CEST; 2 days ago
>>> Docs: man:corosync
>>> man:corosync.conf
>>> man:corosync_overview
>>> Process: 1095 ExecStart=/usr/share/corosync/corosync start
>>> (code=exited, status=0/SUCCESS)
>>> Main PID: 1167 (corosync)
>>> CGroup: /system.slice/corosync.service
>>> └─1167 corosync
>>>
>>> Oct 14 02:58:00 drbd0.s-ka.local corosync[1167]: [MAIN ] Completed
>>> service synchronization, ready to provide service.
>>> Oct 14 02:58:01 drbd0.s-ka.local corosync[1095]: Starting Corosync
>>> Cluster Engine (corosync): [ OK ]
>>> Oct 14 02:58:01 drbd0.s-ka.local systemd[1]: Started Corosync Cluster
>>> Engine.
>>> Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]: [TOTEM ] A new
>>> membership (192.168.96.45:384) was formed. Members left: 2
>>> Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]: [QUORUM] Members[1]: 1
>>> Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]: [MAIN ] Completed
>>> service synchronization, ready to provide service.
>>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [TOTEM ] A new
>>> membership (192.168.96.45:388) was formed. Members joined: 2
>>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [CPG ] downlist
>>> left_list: 0 received in state 0
>>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [QUORUM] Members[2]: 1 2
>>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [MAIN ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.6 tgtd configuration:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# cat /etc/tgt/targets.conf
>>> # This is a sample config file for tgt-admin.
>>> #
>>> # The "#" symbol disables the processing of a line.
>>>
>>> # Set the driver. If not specified, defaults to "iscsi".
>>> default-driver iscsi
>>>
>>> # Set iSNS parameters, if needed
>>> #iSNSServerIP 192.168.111.222
>>> #iSNSServerPort 3205
>>> #iSNSAccessControl On
>>> #iSNS On
>>>
>>> # Continue if tgtadm exits with non-zero code (equivalent of
>>> # --ignore-errors command line option)
>>> #ignore-errors yes
>>>
>>>
>>> <target iqn.2018-08.s-ka.local:disk.1>
>>> lun 10
>>> backing-store /dev/drbd0
>>> initiator-address 192.168.96.0/24
>>> initiator-address 192.168.95.0/24
>>> target-address 192.168.95.48
>>> </target>
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> 3.7 tgtd is disabled on both servers and can only be started successfully
>>> on the current DRBD Primary node.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Secondary Node:
>>>
>>> [root at drbd1 ~]# systemctl status tgtd
>>> ● tgtd.service - tgtd iSCSI target daemon
>>> Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
>>> vendor preset: disabled)
>>> Active: inactive (dead)
>>> [root at drbd1 ~]# systemctl restart tgtd
>>> Job for tgtd.service failed because the control process exited with
>>> error code. See "systemctl status tgtd.service" and "journalctl -xe" for
>>> details.
>>>
>>>
>>> Primary Node:
>>>
>>> [root at drbd0 corosync]# systemctl status tgtd
>>> ● tgtd.service - tgtd iSCSI target daemon
>>> Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
>>> vendor preset: disabled)
>>> Active: inactive (dead)
>>> [root at drbd0 corosync]# systemctl restart tgtd
>>> [root at drbd0 corosync]# systemctl status tgtd
>>> ● tgtd.service - tgtd iSCSI target daemon
>>> Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
>>> vendor preset: disabled)
>>> Active: active (running) since Tue 2018-10-16 14:09:47 CEST; 2min 29s
>>> ago
>>> Process: 22300 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
>>> --name State -v ready (code=exited, status=0/SUCCESS)
>>> Process: 22272 ExecStartPost=/usr/sbin/tgt-admin -e -c $TGTD_CONFIG
>>> (code=exited, status=0/SUCCESS)
>>> Process: 22271 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
>>> --name State -v offline (code=exited, status=0/SUCCESS)
>>> Process: 22270 ExecStartPost=/bin/sleep 5 (code=exited, status=0/SUCCESS)
>>> Main PID: 22269 (tgtd)
>>> CGroup: /system.slice/tgtd.service
>>> └─22269 /usr/sbin/tgtd -f
>>>
>>> Oct 16 14:09:42 drbd0.s-ka.local systemd[1]: Starting tgtd iSCSI target
>>> daemon...
>>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: iser_ib_init(3436)
>>> Failed to initialize RDMA; load kernel modules?
>>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
>>> work_timer_start(146) use timer_fd based scheduler
>>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
>>> bs_init_signalfd(267) could not open backing-store module directory
>>> /usr/lib64/tgt/backing-store
>>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: bs_init(386) use
>>> signalfd notification
>>> Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: device_mgmt(246)
>>> sz:16 params:path=/dev/drbd0
>>> Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: bs_thread_open(408) 16
>>> Oct 16 14:09:47 drbd0.s-ka.local systemd[1]: Started tgtd iSCSI target
>>> daemon.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.8 Up to this point everything was working, but after I switched the
>>> DRBD Primary node it stopped working (the file system of the test VM
>>> became read-only).
>>>
>>> So I changed the pcs configuration according to the previously mentioned
>>> article:
>>>
>>> ------------------------------------------------------------------------
>>>
>>>> pcs resource create p_iscsivg01 ocf:heartbeat:LVM volgrpname="vg0" op monitor interval="30"
>>>
>>>> pcs resource group add p_iSCSI p_iscsivg01 p_iSCSITarget p_iSCSILogicalUnit ClusterIP
>>>
>>>> pcs constraint order start ipstor0Clone then start p_iSCSI then start ipstor0Clone:Master
>>>
>>>
>>> [root at drbd0 ~]# pcs status
>>> Cluster name: cluster1
>>> Stack: corosync
>>> Current DC: drbd0-ha.s-ka.local (version
>>> 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
>>> Last updated: Sun Oct 14 01:38:18 2018
>>> Last change: Sun Oct 14 01:37:58 2018 by root via cibadmin on
>>> drbd0-ha.s-ka.local
>>>
>>> 2 nodes configured
>>> 6 resources configured
>>>
>>> Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ipstor0Clone [ipstor0]
>>> Masters: [ drbd0-ha.s-ka.local ]
>>> Slaves: [ drbd1-ha.s-ka.local ]
>>> Resource Group: p_iSCSI
>>> p_iscsivg01 (ocf::heartbeat:LVM): Stopped
>>> p_iSCSITarget (ocf::heartbeat:iSCSITarget): Stopped
>>> p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Stopped
>>> ClusterIP (ocf::heartbeat:IPaddr2): Stopped
>>>
>>> Failed Actions:
>>> * p_iSCSILogicalUnit_start_0 on drbd0-ha.s-ka.local 'unknown error'
>>> (1): call=42, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=28ms
>>> * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>>> call=40, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=23ms
>>> * p_iscsivg01_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>>> call=48, status=complete, exitreason='Volume group [iscsivg01] does not
>>> exist or contains error! Volume group "iscsivg01" not found',
>>> last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=47ms
>>> * p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error'
>>> (1): call=41, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=31ms
>>> * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>>> call=39, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=24ms
>>> * p_iscsivg01_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>>> call=47, status=complete, exitreason='Volume group [iscsivg01] does not
>>> exist or contains error! Volume group "iscsivg01" not found',
>>> last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=50ms
>>>
>>>
>>> Daemon Status:
>>> corosync: active/enabled
>>> pacemaker: active/enabled
>>> pcsd: active/enabled
>>> [root at drbd0 ~]#
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> 3.9 Because of the "device not found" error I removed the LVM resource;
>>> it now looks like this.
>>>
>>> I actually also switched the path between /dev/drbd/by-disk and
>>> /dev/drbd/by-res, but that had no effect.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# pcs status
>>> Cluster name: cluster1
>>> Stack: corosync
>>> Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
>>> partition with quorum
>>> Last updated: Tue Oct 16 14:18:09 2018
>>> Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
>>> drbd0-ha.s-ka.local
>>>
>>> 2 nodes configured
>>> 5 resources configured
>>>
>>> Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ipstor0Clone [ipstor0]
>>> Masters: [ drbd0-ha.s-ka.local ]
>>> Slaves: [ drbd1-ha.s-ka.local ]
>>> Resource Group: p_iSCSI
>>> p_iSCSITarget (ocf::heartbeat:iSCSITarget): Stopped
>>> p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Stopped
>>> ClusterIP (ocf::heartbeat:IPaddr2): Stopped
>>>
>>> Failed Actions:
>>> * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>>> call=12, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
>>> * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>>> call=12, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms
>>>
>>>
>>> Daemon Status:
>>> corosync: active/enabled
>>> pacemaker: active/enabled
>>> pcsd: active/enabled
>>> [root at drbd0 corosync]#
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.10 I've tried "pcs resource debug-start xxx --full" on the DRBD Primary
>>> node:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# pcs resource debug-start p_iSCSI --full
>>> Error: unable to debug-start a group, try one of the group's resource(s)
>>> (p_iSCSITarget,p_iSCSILogicalUnit,ClusterIP)
>>>
>>> [root at drbd0 corosync]# pcs resource debug-start p_iSCSITarget --full
>>> Operation start for p_iSCSITarget (ocf:heartbeat:iSCSITarget) returned:
>>> 'ok' (0)
>>> > stderr: DEBUG: p_iSCSITarget start : 0
>>>
>>> [root at drbd0 corosync]# pcs resource debug-start p_iSCSILogicalUnit --full
>>> Operation start for p_iSCSILogicalUnit (ocf:heartbeat:iSCSILogicalUnit)
>>> returned: 'unknown error' (1)
>>> > stderr: ERROR: tgtadm: this logical unit number already exists
>>>
>>> [root at drbd0 corosync]# pcs resource debug-start ClusterIP --full
>>> Operation start for ClusterIP (ocf:heartbeat:IPaddr2) returned: 'ok' (0)
>>> > stderr: INFO: Adding inet address 192.168.95.48/32 with broadcast
>>> address 192.168.95.255 to device ens192
>>> > stderr: INFO: Bringing device ens192 up
>>> > stderr: INFO: /usr/libexec/heartbeat/send_arp -i 200 -c 5 -p
>>> /var/run/resource-agents/send_arp-192.168.95.48 -I ens192 -m auto
>>> 192.168.95.48
>>> [root at drbd0 corosync]#
>>>
>>> ------------------------------------------------------------------------
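>>> (My suspicion: the "this logical unit number already exists" error comes
>>> from the LUN that tgtd already created from /etc/tgt/targets.conf when I
>>> started it by hand, so the resource agent cannot add LUN 10 a second time.
>>> Something like the following should show what the running tgtd currently
>>> exports:
>>> ------------------------------------------------------------------------
>>> # list all targets and LUNs known to the running tgtd
>>> tgtadm --lld iscsi --mode target --op show
>>> ------------------------------------------------------------------------
>>> )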
>>>
>>> 3.11 As you can see there are errors, but "p_iSCSITarget" was started
>>> successfully by debug-start. Still, "pcs status" shows it as "Stopped":
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# pcs status
>>> Cluster name: cluster1
>>> Stack: corosync
>>> Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
>>> partition with quorum
>>> Last updated: Tue Oct 16 14:22:38 2018
>>> Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
>>> drbd0-ha.s-ka.local
>>>
>>> 2 nodes configured
>>> 5 resources configured
>>>
>>> Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ipstor0Clone [ipstor0]
>>> Masters: [ drbd0-ha.s-ka.local ]
>>> Slaves: [ drbd1-ha.s-ka.local ]
>>> Resource Group: p_iSCSI
>>> p_iSCSITarget (ocf::heartbeat:iSCSITarget): Stopped
>>> p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Stopped
>>> ClusterIP (ocf::heartbeat:IPaddr2): Stopped
>>>
>>> Failed Actions:
>>> * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>>> call=12, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
>>> * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>>> call=12, status=complete, exitreason='',
>>> last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms
>>>
>>>
>>> Daemon Status:
>>> corosync: active/enabled
>>> pacemaker: active/enabled
>>> pcsd: active/enabled
>>> [root at drbd0 corosync]#
>>>
>>> ------------------------------------------------------------------------
>>>
>>> 3.12 The pcs config is:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> [root at drbd0 corosync]# pcs config
>>> Cluster Name: cluster1
>>> Corosync Nodes:
>>> drbd0-ha.s-ka.local drbd1-ha.s-ka.local
>>> Pacemaker Nodes:
>>> drbd0-ha.s-ka.local drbd1-ha.s-ka.local
>>>
>>> Resources:
>>> Master: ipstor0Clone
>>> Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1
>>> clone-node-max=1
>>> Resource: ipstor0 (class=ocf provider=linbit type=drbd)
>>> Attributes: drbd_resource=iscsivg01
>>> Operations: demote interval=0s timeout=90 (ipstor0-demote-interval-0s)
>>> monitor interval=60s (ipstor0-monitor-interval-60s)
>>> notify interval=0s timeout=90 (ipstor0-notify-interval-0s)
>>> promote interval=0s timeout=90 (ipstor0-promote-interval-0s)
>>> reload interval=0s timeout=30 (ipstor0-reload-interval-0s)
>>> start interval=0s timeout=240 (ipstor0-start-interval-0s)
>>> stop interval=0s timeout=100 (ipstor0-stop-interval-0s)
>>> Group: p_iSCSI
>>> Resource: p_iSCSITarget (class=ocf provider=heartbeat type=iSCSITarget)
>>> Attributes: implementation=tgt iqn=iqn.2018-08.s-ka.local:disk.1 tid=1
>>> Operations: monitor interval=30 timeout=60
>>> (p_iSCSITarget-monitor-interval-30)
>>> start interval=0 timeout=60 (p_iSCSITarget-start-interval-0)
>>> stop interval=0 timeout=60 (p_iSCSITarget-stop-interval-0)
>>> Resource: p_iSCSILogicalUnit (class=ocf provider=heartbeat
>>> type=iSCSILogicalUnit)
>>> Attributes: implementation=tgt lun=10
>>> path=/dev/drbd/by-disk/vg0/ipstor0 target_iqn=iqn.2018-08.s-ka.local:disk.1
>>> Operations: monitor interval=30 timeout=60
>>> (p_iSCSILogicalUnit-monitor-interval-30)
>>> start interval=0 timeout=60
>>> (p_iSCSILogicalUnit-start-interval-0)
>>> stop interval=0 timeout=60
>>> (p_iSCSILogicalUnit-stop-interval-0)
>>> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>> Attributes: cidr_netmask=32 ip=192.168.95.48
>>> Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>> start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>
>>> Stonith Devices:
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>> start ipstor0Clone then start p_iSCSI (kind:Mandatory)
>>> Colocation Constraints:
>>> Ticket Constraints:
>>>
>>> Alerts:
>>> No alerts defined
>>>
>>> Resources Defaults:
>>> migration-threshold: 1
>>> Operations Defaults:
>>> No defaults set
>>>
>>> Cluster Properties:
>>> cluster-infrastructure: corosync
>>> cluster-name: cluster1
>>> dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
>>> have-watchdog: false
>>> last-lrm-refresh: 1539474248
>>> no-quorum-policy: ignore
>>> stonith-enabled: false
>>>
>>> Quorum:
>>> Options:
>>> [root at drbd0 corosync]#
>>>
>>> ------------------------------------------------------------------------
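>>> (Side note: I know stonith-enabled=false and no-quorum-policy=ignore are
>>> only meant for testing. I assume that for production I would need a fence
>>> agent such as fence_vmware_soap against the vCenter, roughly like the
>>> following, where the login, password and VM names are only placeholders:
>>> ------------------------------------------------------------------------
>>> # placeholder credentials and guessed VM names (drbd0/drbd1)
>>> pcs stonith create vmfence fence_vmware_soap \
>>>     ipaddr=vc01.s-ka.local login=<vcenter-user> passwd=<vcenter-password> \
>>>     ssl_insecure=1 \
>>>     pcmk_host_map="drbd0-ha.s-ka.local:drbd0;drbd1-ha.s-ka.local:drbd1"
>>> pcs property set stonith-enabled=true
>>> ------------------------------------------------------------------------
>>> Is that right?)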
>>>
>>>
>>> 4. So I am at a loss and don't know what to do next. Should I just dive
>>> into Pacemaker's source code??
>>>
>>> I hope to get feedback or tips from you. Thank you very much in
>>> advance :)
>>>
>>>
>>> Best Regards
>>>
>>> Zhang
>>>
>>>
>>>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org