[ClusterLabs] Need help to enable hot switch of iSCSI (tgtd) under two node Pacemaker + DRBD 9.0 under CentOS 7.5 in ESXi 6.5 Environment

Wed Oct 24 14:18:34 UTC 2018

Hi,

First question: why did you use tgt instead of LIO? LIO is more common nowadays.

And you need colocation and ordering constraints.
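
For the DRBD-backed tgt setup described later in this thread, the two constraints might look like the sketch below. It assumes the pcs 0.9.x syntax shipped with CentOS 7 and reuses the resource names ipstor0Clone and p_iSCSI from the original post; the DRY_RUN switch and the run helper are illustrative additions, not part of pcs.

```shell
#!/bin/sh
# Sketch only: tie the iSCSI group to the node where DRBD is Master.
# Assumes pcs 0.9.x (CentOS 7) syntax; resource names are taken from the
# original post. DRY_RUN and run() are hypothetical helpers.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

add_drbd_constraints() {
    drbd_ms="$1"   # master/slave DRBD resource, e.g. ipstor0Clone
    group="$2"     # iSCSI resource group, e.g. p_iSCSI

    # Keep the group on the node where DRBD is promoted.
    run pcs constraint colocation add "$group" with master "$drbd_ms" INFINITY
    # Start the group only after DRBD has been promoted.
    run pcs constraint order promote "$drbd_ms" then start "$group"
}

add_drbd_constraints ipstor0Clone p_iSCSI
```

By default this only prints the two commands; run it with DRY_RUN=0 to apply them. Ordering on "promote" rather than "start" matters here: a plain "start ipstor0Clone then start p_iSCSI" is already satisfied on a node where DRBD is still Secondary.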

Here is my config with lio-t (I suspect something is wrong in the resource-agents, but I will take a deeper look tomorrow):

pcs config
Cluster Name: zfs-vmstorage
Corosync Nodes:
 zfs-serv3 zfs-serv4
Pacemaker Nodes:
 zfs-serv3 zfs-serv4

Resources:
 Resource: ha-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=192.168.2.10 cidr_netmask=24 nic=bond0
  Meta Attrs: target-role=Started 
  Operations: start interval=0s timeout=20s (ha-ip-start-0s)
              stop interval=0s timeout=20s (ha-ip-stop-0s)
              monitor interval=10s timeout=20s (ha-ip-monitor-10s)
 Resource: vm_storage (class=ocf provider=heartbeat type=ZFS)
  Attributes: pool=vm_storage importargs="-d /dev/disk/by-vdev/"
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5s timeout=30s (vm_storage-monitor-5s)
              start interval=0s timeout=90 (vm_storage-start-0s)
              stop interval=0s timeout=90 (vm_storage-stop-0s)
 Resource: iscsi-server (class=ocf provider=heartbeat type=iSCSITarget)
  Attributes: implementation=lio-t iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. portals=192.168.2.10:3260 allowed_initiators="iqn.1998-01.com.vmware:brainslug9-7548xxxx iqn.1998-01.com.vmware:brainslug8-058xxxx iqn.1998-01.com.vmware:brainslug7-592bxxxx iqn.1998-01.com.vmware:brainslug10-5564cxxxx"
 Resource: iscsi-lun0 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
  Attributes: implementation=lio-t target_iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. lun=0 path=/dev/zvol/vm_storage/zfs-vol1
 Resource: iscsi-lun1 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
  Attributes: implementation=lio-t target_iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. lun=1 path=/dev/zvol/vm_storage/zfs-vol2
 Resource: iscsi-lun2 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
  Attributes: implementation=lio-t target_iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. lun=2 path=/dev/zvol/vm_storage/zfs-vol3

Stonith Devices:
 Resource: resIPMI-zfs4 (class=stonith type=external/ipmi)
  Attributes: hostname=zfs-serv4 ipaddr=172.xx.xx.xx userid=USER passwd=SECRET interface=lan priv=OPERATOR pcmk_delay_max=20
  Operations: monitor interval=60s (resIPMI-zfs4-monitor-60s)
 Resource: resIPMI-zfs3 (class=stonith type=external/ipmi)
  Attributes: hostname=zfs-serv3 ipaddr=172.xx.xx.xx userid=user passwd=SECRET interface=lan priv=OPERATOR pcmk_delay_max=20
  Operations: monitor interval=60s (resIPMI-zfs3-monitor-60s)
Fencing Levels:

Location Constraints:
  Resource: resIPMI-zfs3
    Disabled on: zfs-serv3 (score:-INFINITY) (id:location-resIPMI-zfs3-zfs-serv3--INFINITY)
  Resource: resIPMI-zfs4
    Disabled on: zfs-serv4 (score:-INFINITY) (id:location-resIPMI-zfs4-zfs-serv4--INFINITY)
Ordering Constraints:
  Resource Sets:
    set ha-ip iscsi-lun0 iscsi-lun1 iscsi-lun2 iscsi-server vm_storage action=stop (id:pcs_rsc_order_set_ha-ip_iscsi-server_vm_storage-1) setoptions symmetrical=false (id:pcs_rsc_order_set_ha-ip_iscsi-server_vm_storage)
    set vm_storage iscsi-server iscsi-lun0 iscsi-lun1 iscsi-lun2 ha-ip action=start (id:pcs_rsc_order_set_iscsi-server_vm_storage_ha-ip-1) setoptions symmetrical=false (id:pcs_rsc_order_set_iscsi-server_vm_storage_ha-ip)
Colocation Constraints:
  Resource Sets:
    set ha-ip vm_storage iscsi-server iscsi-lun0 iscsi-lun1 iscsi-lun2 (id:pcs_rsc_colocation_set_ha-ip_vm_storage_iscsi-server-1) setoptions score=INFINITY (id:pcs_rsc_colocation_set_ha-ip_vm_storage_iscsi-server)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: zfs-vmstorage
 dc-version: 1.1.16-94ff4df
 have-watchdog: false
 last-lrm-refresh: 1540199247
 no-quorum-policy: stop
 stonith-enabled: true

Quorum:
  Options:

Hello Friends, 

Some further thoughts about this situation:

I want to run tgtd as a cluster service on top of a primary/secondary DRBD. Since the DRBD device is only active on the primary node, tgtd cannot start successfully on the secondary node.
So should I configure a primary/secondary tgtd service, and if so, how?

Or which Pacemaker feature should I use so that tgtd starts on only one node?

Thank you very much in advance for any suggestions.
Best Regards 
Lifeng

Hello Dear Andrei Borzenkov, 

Thank you very much for your answer. I've been checking the logs all the time, but there is nothing helpful in them, just a bunch of heartbeat messages.

Anyway, I've read the book "Packt - CentOS High Availability", published in 2015, got some new ideas, and tried them out. The situation has now changed.

--------------------
pcs resource create p_iSCSITarget ocf:heartbeat:iSCSITarget implementation="tgt" iqn="iqn.2018-08.s-ka.local:disk" tid="1"
pcs resource create p_iSCSILogicalUnit ocf:heartbeat:iSCSILogicalUnit implementation="tgt" target_iqn="iqn.2018-08.s-ka.local:disk" lun="10" path="/dev/drbd/by-disk/vg0/ipstor0"
pcs resource group add p_iSCSI ClusterIP p_iSCSITarget p_iSCSILogicalUnit
pcs constraint colocation set ClusterIP p_iSCSITarget p_iSCSILogicalUnit

--------------------

The difference from the previous version is that I now use the iqn "iqn.2018-08.s-ka.local:disk" instead of "iqn.2018-08.s-ka.local:disk.1"; the trailing ".1" may denote the "tid".

Now I have a new problem: the resources and tgtd are started, but although I set a colocation constraint, Pacemaker always tries to start tgtd on the other node.
How do I solve this? Thank you all in advance!

Here is the output from "pcs status":
--------------------
[root at drbd0 /]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 24 08:43:29 2018
Last change: Wed Oct 24 08:43:24 2018 by root via cibadmin on drbd0-ha.s-ka.local

2 nodes configured
5 resources configured

Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]

Full list of resources:

 Master/Slave Set: ipstor0Clone [ipstor0]
     Masters: [ drbd0-ha.s-ka.local ]
     Slaves: [ drbd1-ha.s-ka.local ]
 Resource Group: p_iSCSI
     ClusterIP    (ocf::heartbeat:IPaddr2):    Started drbd0-ha.s-ka.local
     p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Started drbd0-ha.s-ka.local
     p_iSCSILogicalUnit    (ocf::heartbeat:iSCSILogicalUnit):    Started drbd0-ha.s-ka.local

Failed Actions:
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1): call=32, status=complete, exitreason='',
    last-rc-change='Wed Oct 24 08:37:25 2018', queued=0ms, exec=23ms
* p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error' (1): call=38, status=complete, exitreason='',
    last-rc-change='Wed Oct 24 08:37:55 2018', queued=0ms, exec=28ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root at drbd0 /]# pcs constraint show --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
  Resource Sets:
    set ClusterIP p_iSCSITarget p_iSCSILogicalUnit (id:pcs_rsc_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit) setoptions score=INFINITY (id:pcs_rsc_colocation_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit)
Ticket Constraints:
[root at drbd0 /]#

--------------------

Best Regards 
Lifeng

On 2018/10/19 06:02, Andrei Borzenkov wrote:

On 16.10.2018 15:29, LiFeng Zhang wrote:
Hi, all dear friends,

I need your help to enable hot switching of iSCSI under a
Pacemaker/Corosync cluster, which has an iSCSI device based on two-node
DRBD replication.

I've got the Pacemaker/Corosync cluster working and DRBD replication
also working, but I am stuck at iSCSI. I can manually start tgtd on one
node, so that the VCSA can recognize the iSCSI disk and create a
VMFS/StorageObject on it, and then I can create a test VM on that VMFS.

But when I switch the DRBD Primary/Secondary roles, the test VM keeps
running but the underlying disk becomes read-only. As far as I know,
tgtd should be handled by Pacemaker so that it automatically starts on
the Primary DRBD instance, but in my situation it sadly does NOT.

Pacemaker only handles resources that were started by Pacemaker.
According to your output below, in all cases the resource was stopped
from Pacemaker's point of view and all of Pacemaker's attempts to start
it failed. You should troubleshoot why they failed. This requires
knowledge of the specific resource agent; sadly, I am not familiar with
the iSCSI target. The Pacemaker logs may include more information from
the resource agent than just "unknown reason".
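
One way to look for the resource agent's own messages is to filter the cluster log by resource name. The sketch below assumes that Pacemaker writes to the same /var/log/cluster/corosync.log that is set in the corosync.conf quoted further down, and it uses the resource names from the pcs output; the helper name agent_log is hypothetical.

```shell
#!/bin/sh
# Sketch: show the log lines that mention a given resource, so that
# messages from the resource agent are not buried in membership chatter.
# agent_log is a hypothetical helper; the default path is the logfile
# set in the corosync.conf shown in this thread.
agent_log() {
    resource="$1"
    logfile="${2:-/var/log/cluster/corosync.log}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # -F: match the resource name literally, not as a regex.
    grep -F -- "$resource" "$logfile"
}

# Example: agent_log p_iSCSITarget | tail -n 50
```

Running it on both nodes around the time of a failed start may show the tgtadm error text that "pcs status" reduces to 'unknown error'.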

I've tried all kinds of resources, manuals, and documents, but they are
all mixed with extra information for other systems and other software
versions.

My BEST reference (the closest configuration to mine) is this
URL: https://nnc3.com/mags/LJ_1994-2014/LJ/217/11275.html[1]

The difference between my setup and this article, I think, is that I
don't have an LVM volume but only a raw iSCSI disk, and I have to
translate CRM commands into PCS commands.

But after I "copied" the configuration from this article, my cluster
could not start anymore. I tried removing the LVM resource (which caused
a "device not found" error), but the resource group still can't start,
without any explicit "reason" from Pacemaker.

1. The whole configuration runs on a two-node ESXi 6.5 cluster, which
has a VCSA installed on one ESXi host.

I have a simple diagram in the attachment, which may show the deployment
better.

2. Starting point:

The involved hosts are all mapped through local DNS, which also
includes the floating VIP; the local domain is s-ka.local:

------------------------------------------------------------------------

firewall:    fw01.s-ka.local.        IN    A    192.168.95.249

vcsa:    vc01.s-ka.local.        IN    A    192.168.95.30
esxi:     esx01.s-ka.local.        IN    A    192.168.95.5
esxi:     esx02.s-ka.local.        IN    A    192.168.95.7

drbd:    drbd0.s-ka.local.        IN    A    192.168.95.45
drbd:    drbd1.s-ka.local.        IN    A    192.168.95.47
vip:      ipstor0.s-ka.local.        IN    A    192.168.95.48

heartbeat:    drbd0-ha.s-ka.local.    IN    A    192.168.96.45
heartbeat:    drbd1-ha.s-ka.local.    IN    A    192.168.96.47

------------------------------------------------------------------------

Both DRBD servers run CentOS 7.5; the installed packages are listed here:

------------------------------------------------------------------------

[root at drbd0 ~]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)

[root at drbd0 ~]# uname -a
Linux drbd0.s-ka.local 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16
16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[root at drbd1 ~]# yum list installed|grep pacemaker
pacemaker.x86_64 1.1.18-11.el7_5.3              @updates
pacemaker-cli.x86_64 1.1.18-11.el7_5.3              @updates
pacemaker-cluster-libs.x86_64 1.1.18-11.el7_5.3              @updates
pacemaker-libs.x86_64 1.1.18-11.el7_5.3              @updates

[root at drbd1 ~]# yum list installed|grep coro
corosync.x86_64 2.4.3-2.el7_5.1                @updates
corosynclib.x86_64 2.4.3-2.el7_5.1                @updates

[root at drbd1 ~]# yum list installed|grep drbd
drbd90-utils.x86_64 9.3.1-1.el7.elrepo             @elrepo
kmod-drbd90.x86_64 9.0.14-1.el7_5.elrepo          @elrepo

[root at drbd1 ~]# yum list installed|grep -i scsi
lsscsi.x86_64 0.27-6.el7                     @anaconda
scsi-target-utils.x86_64 1.0.55-4.el7                   @epel

------------------------------------------------------------------------

3. configurations

3.1 First, the DRBD configuration

------------------------------------------------------------------------

[root at drbd1 ~]# cat /etc/drbd.conf
# You can find an example in /usr/share/doc/drbd.../drbd.conf.example

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

[root at drbd1 ~]# cat /etc/drbd.d/r0.res
resource iscsivg01 {
  protocol C;
  device /dev/drbd0;
  disk /dev/vg0/ipstor0;
  flexible-meta-disk internal;
  on drbd0.s-ka.local {
    #volume 0 {
      #device /dev/drbd0;
      #disk /dev/vg0/ipstor0;
      #flexible-meta-disk internal;
    #}
    address 192.168.96.45:7788;
  }
  on drbd1.s-ka.local {
    #volume 0 {
      #device /dev/drbd0;
      #disk /dev/vg0/ipstor0;
      #flexible-meta-disk internal;
    #}
    address 192.168.96.47:7788;
  }
}

------------------------------------------------------------------------

3.2 Then the DRBD device

------------------------------------------------------------------------

[root at drbd1 ~]# lsblk
NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda               8:0    0   25G  0 disk
├─sda1            8:1    0    1G  0 part /boot
└─sda2            8:2    0   24G  0 part
  ├─centos-root 253:0    0   22G  0 lvm  /
  └─centos-swap 253:1    0    2G  0 lvm  [SWAP]
sdb               8:16   0  500G  0 disk
└─sdb1            8:17   0  500G  0 part
  └─vg0-ipstor0 253:2    0  500G  0 lvm
    └─drbd0     147:0    0  500G  1 disk
sr0              11:0    1 1024M  0 rom

[root at drbd1 ~]# tree /dev/drbd
/dev/drbd
├── by-disk
│   └── vg0
│       └── ipstor0 -> ../../../drbd0
└── by-res
    └── iscsivg01
        └── 0 -> ../../../drbd0

4 directories, 2 files

------------------------------------------------------------------------

3.3 DRBD status

------------------------------------------------------------------------

[root at drbd1 ~]# drbdadm status
iscsivg01 role:Secondary
  disk:UpToDate
  drbd0.s-ka.local role:Primary
    peer-disk:UpToDate

[root at drbd0 ~]# drbdadm status
iscsivg01 role:Primary
  disk:UpToDate
  drbd1.s-ka.local role:Secondary
    peer-disk:UpToDate

[root at drbd0 ~]# cat /proc/drbd
version: 9.0.14-1 (api:2/proto:86-113)
GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by mockbuild@,
2018-05-04 03:32:42
Transports (api:16): tcp (9.0.14-1)

------------------------------------------------------------------------

3.4 Corosync configuration

------------------------------------------------------------------------

[root at drbd0 corosync]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: cluster1
    secauth: off
    transport: udpu
}

nodelist {
    node {
        ring0_addr: drbd0-ha.s-ka.local
        nodeid: 1
    }

    node {
        ring0_addr: drbd1-ha.s-ka.local
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

------------------------------------------------------------------------

3.5 Corosync status:

------------------------------------------------------------------------

[root at drbd0 corosync]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled;
vendor preset: disabled)
   Active: active (running) since Sun 2018-10-14 02:58:01 CEST; 2 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 1095 ExecStart=/usr/share/corosync/corosync start
(code=exited, status=0/SUCCESS)
 Main PID: 1167 (corosync)
   CGroup: /system.slice/corosync.service
           └─1167 corosync

Oct 14 02:58:00 drbd0.s-ka.local corosync[1167]:  [MAIN  ] Completed
service synchronization, ready to provide service.
Oct 14 02:58:01 drbd0.s-ka.local corosync[1095]: Starting Corosync
Cluster Engine (corosync): [  OK  ]
Oct 14 02:58:01 drbd0.s-ka.local systemd[1]: Started Corosync Cluster
Engine.
Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]:  [TOTEM ] A new
membership (192.168.96.45:384) was formed. Members left: 2
Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]:  [QUORUM] Members[1]: 1
Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]:  [MAIN  ] Completed
service synchronization, ready to provide service.
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [TOTEM ] A new
membership (192.168.96.45:388) was formed. Members joined: 2
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [CPG   ] downlist
left_list: 0 received in state 0
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [QUORUM] Members[2]: 1 2
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [MAIN  ] Completed
service synchronization, ready to provide service.

------------------------------------------------------------------------

3.6 tgtd configuration:

------------------------------------------------------------------------

[root at drbd0 corosync]# cat /etc/tgt/targets.conf
# This is a sample config file for tgt-admin.
#
# The "#" symbol disables the processing of a line.

# Set the driver. If not specified, defaults to "iscsi".
default-driver iscsi

# Set iSNS parameters, if needed
#iSNSServerIP 192.168.111.222
#iSNSServerPort 3205
#iSNSAccessControl On
#iSNS On

# Continue if tgtadm exits with non-zero code (equivalent of
# --ignore-errors command line option)
#ignore-errors yes

<target iqn.2018-08.s-ka.local:disk.1>
    lun 10
    backing-store /dev/drbd0
    initiator-address 192.168.96.0/24
    initiator-address 192.168.95.0/24
    target-address 192.168.95.48
</target>

------------------------------------------------------------------------

3.7 tgtd has been disabled on both servers; it can only be started on
the current Primary DRBD node.

------------------------------------------------------------------------

Secondary Node:

[root at drbd1 ~]# systemctl status tgtd
● tgtd.service - tgtd iSCSI target daemon
   Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
vendor preset: disabled)
   Active: inactive (dead)
[root at drbd1 ~]# systemctl restart tgtd
Job for tgtd.service failed because the control process exited with
error code. See "systemctl status tgtd.service" and "journalctl -xe" for
details.

Primary Node:

[root at drbd0 corosync]# systemctl status tgtd
● tgtd.service - tgtd iSCSI target daemon
   Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
vendor preset: disabled)
   Active: inactive (dead)
[root at drbd0 corosync]# systemctl restart tgtd
[root at drbd0 corosync]# systemctl status  tgtd
● tgtd.service - tgtd iSCSI target daemon
   Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
vendor preset: disabled)
   Active: active (running) since Tue 2018-10-16 14:09:47 CEST; 2min 29s
ago
  Process: 22300 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
--name State -v ready (code=exited, status=0/SUCCESS)
  Process: 22272 ExecStartPost=/usr/sbin/tgt-admin -e -c $TGTD_CONFIG
(code=exited, status=0/SUCCESS)
  Process: 22271 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
--name State -v offline (code=exited, status=0/SUCCESS)
  Process: 22270 ExecStartPost=/bin/sleep 5 (code=exited, status=0/SUCCESS)
 Main PID: 22269 (tgtd)
   CGroup: /system.slice/tgtd.service
           └─22269 /usr/sbin/tgtd -f

Oct 16 14:09:42 drbd0.s-ka.local systemd[1]: Starting tgtd iSCSI target
daemon...
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: iser_ib_init(3436)
Failed to initialize RDMA; load kernel modules?
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
work_timer_start(146) use timer_fd based scheduler
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
bs_init_signalfd(267) could not open backing-store module directory
/usr/lib64/tgt/backing-store
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: bs_init(386) use
signalfd notification
Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: device_mgmt(246)
sz:16 params:path=/dev/drbd0
Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: bs_thread_open(408) 16
Oct 16 14:09:47 drbd0.s-ka.local systemd[1]: Started tgtd iSCSI target
daemon.

------------------------------------------------------------------------

3.8 Up to this point everything was working, but when I switched the
DRBD Primary node, it no longer worked (the file system of the test node
became read-only).

So I changed the pcs configuration according to the previously mentioned
article:

------------------------------------------------------------------------

pcs resource create p_iscsivg01 ocf:heartbeat:LVM volgrpname="vg0" op monitor interval="30"

pcs resource group add p_iSCSI p_iscsivg01 p_iSCSITarget p_iSCSILogicalUnit ClusterIP

pcs constraint order start ipstor0Clone then start p_iSCSI then start ipstor0Clone:Master

[root at drbd0 ~]# pcs status
    Cluster name: cluster1
    Stack: corosync
    Current DC: drbd0-ha.s-ka.local (version
1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
    Last updated: Sun Oct 14 01:38:18 2018
    Last change: Sun Oct 14 01:37:58 2018 by root via cibadmin on
drbd0-ha.s-ka.local

    2 nodes configured
    6 resources configured

    Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]

    Full list of resources:

     Master/Slave Set: ipstor0Clone [ipstor0]
         Masters: [ drbd0-ha.s-ka.local ]
         Slaves: [ drbd1-ha.s-ka.local ]
     Resource Group: p_iSCSI
         p_iscsivg01    (ocf::heartbeat:LVM):    Stopped
         p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Stopped
         p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit):    Stopped
         ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped

    Failed Actions:
    * p_iSCSILogicalUnit_start_0 on drbd0-ha.s-ka.local 'unknown error'
(1): call=42, status=complete, exitreason='',
        last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=28ms
    * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=40, status=complete, exitreason='',
        last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=23ms
    * p_iscsivg01_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=48, status=complete, exitreason='Volume group [iscsivg01] does not
exist or contains error!   Volume group "iscsivg01" not found',
        last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=47ms
    * p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error'
(1): call=41, status=complete, exitreason='',
        last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=31ms
    * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=39, status=complete, exitreason='',
        last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=24ms
    * p_iscsivg01_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=47, status=complete, exitreason='Volume group [iscsivg01] does not
exist or contains error!   Volume group "iscsivg01" not found',
        last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=50ms

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
    [root at drbd0 ~]#

------------------------------------------------------------------------

3.9 Because of the "device not found" error, I removed the LVM resource;
it looks like this now:

(I also switched the path between /dev/drbd/by-disk and
/dev/drbd/by-res, but it had no effect.)

------------------------------------------------------------------------

[root at drbd0 corosync]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
partition with quorum
Last updated: Tue Oct 16 14:18:09 2018
Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
drbd0-ha.s-ka.local

2 nodes configured
5 resources configured

Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]

Full list of resources:

 Master/Slave Set: ipstor0Clone [ipstor0]
     Masters: [ drbd0-ha.s-ka.local ]
     Slaves: [ drbd1-ha.s-ka.local ]
 Resource Group: p_iSCSI
     p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Stopped
     p_iSCSILogicalUnit    (ocf::heartbeat:iSCSILogicalUnit):  Stopped
     ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped

Failed Actions:
* p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
    last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
    last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root at drbd0 corosync]#

------------------------------------------------------------------------

3.10 I've tried "pcs resource debug-start xxx --full" on the DRBD
Primary node:

------------------------------------------------------------------------

[root at drbd0 corosync]# pcs resource debug-start p_iSCSI --full
Error: unable to debug-start a group, try one of the group's resource(s)
(p_iSCSITarget,p_iSCSILogicalUnit,ClusterIP)

[root at drbd0 corosync]# pcs resource debug-start p_iSCSITarget --full
Operation start for p_iSCSITarget (ocf:heartbeat:iSCSITarget) returned:
'ok' (0)
 >  stderr: DEBUG: p_iSCSITarget start : 0

[root at drbd0 corosync]# pcs resource debug-start p_iSCSILogicalUnit --full
Operation start for p_iSCSILogicalUnit (ocf:heartbeat:iSCSILogicalUnit)
returned: 'unknown error' (1)
 >  stderr: ERROR: tgtadm: this logical unit number already exists

[root at drbd0 corosync]# pcs resource debug-start ClusterIP --full
Operation start for ClusterIP (ocf:heartbeat:IPaddr2) returned: 'ok' (0)
 >  stderr: INFO: Adding inet address 192.168.95.48/32 with broadcast
address 192.168.95.255 to device ens192
 >  stderr: INFO: Bringing device ens192 up
 >  stderr: INFO: /usr/libexec/heartbeat/send_arp -i 200 -c 5 -p
/var/run/resource-agents/send_arp-192.168.95.48 -I ens192 -m auto
192.168.95.48
[root at drbd0 corosync]#

------------------------------------------------------------------------
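
The "this logical unit number already exists" error suggests that LUN 10 was already defined in the running tgtd, possibly left over from the manually started tgtd and the targets.conf shown in 3.6; that cause is an assumption, not something the output proves. A minimal sketch for checking it, assuming the "LUN: <number>" lines that `tgtadm --lld iscsi --mode target --op show` prints for each target; the helper name lun_defined is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a LUN number appears in tgtadm's target listing.
# lun_defined is a hypothetical helper; it reads the listing on stdin and
# assumes lines of the form "LUN: <number>".
lun_defined() {
    lun="$1"
    grep -Eq "^[[:space:]]*LUN:[[:space:]]*${lun}[[:space:]]*\$"
}

# Example (needs a running tgtd):
#   tgtadm --lld iscsi --mode target --op show | lun_defined 10 \
#       && echo "LUN 10 is already defined in tgtd"
```

If the LUN is already there before Pacemaker starts the resource, the definition in /etc/tgt/targets.conf and the iSCSILogicalUnit resource are probably both trying to create it.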

3.11 As you can see, there are errors, but "p_iSCSITarget" was started
successfully. However, "pcs status" still shows "Stopped".

------------------------------------------------------------------------

[root at drbd0 corosync]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
partition with quorum
Last updated: Tue Oct 16 14:22:38 2018
Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
drbd0-ha.s-ka.local

2 nodes configured
5 resources configured

Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]

Full list of resources:

 Master/Slave Set: ipstor0Clone [ipstor0]
     Masters: [ drbd0-ha.s-ka.local ]
     Slaves: [ drbd1-ha.s-ka.local ]
 Resource Group: p_iSCSI
     p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Stopped
     p_iSCSILogicalUnit    (ocf::heartbeat:iSCSILogicalUnit): Stopped
     ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped

Failed Actions:
* p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
    last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
    last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root at drbd0 corosync]#

------------------------------------------------------------------------
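
Note that "pcs resource debug-start" runs the agent directly and does not update the cluster's view, so the old entries under "Failed Actions" stay until they are cleared. With migration-threshold set to 1 (as in the config below), a single failure also keeps the resource off that node until then. A sketch of inspecting and clearing that history, assuming pcs 0.9.x syntax; DRY_RUN and the run helper are illustrative additions, not part of pcs.

```shell
#!/bin/sh
# Sketch only: show and clear recorded failures for the given resources.
# Assumes pcs 0.9.x (CentOS 7) syntax; DRY_RUN and run() are hypothetical.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clear_failures() {
    for res in "$@"; do
        # Show how many failures the cluster has recorded.
        run pcs resource failcount show "$res"
        # Forget the failure history so the cluster retries the start.
        run pcs resource cleanup "$res"
    done
}

clear_failures p_iSCSITarget p_iSCSILogicalUnit
```

Anything started by hand or via debug-start should be stopped again before the cleanup, so that Pacemaker begins from a state it actually knows about.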

3.12 The pcs config is:

------------------------------------------------------------------------

[root at drbd0 corosync]# pcs config
Cluster Name: cluster1
Corosync Nodes:
 drbd0-ha.s-ka.local drbd1-ha.s-ka.local
Pacemaker Nodes:
 drbd0-ha.s-ka.local drbd1-ha.s-ka.local

Resources:
 Master: ipstor0Clone
  Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1
clone-node-max=1
  Resource: ipstor0 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=iscsivg01
   Operations: demote interval=0s timeout=90 (ipstor0-demote-interval-0s)
               monitor interval=60s (ipstor0-monitor-interval-60s)
               notify interval=0s timeout=90 (ipstor0-notify-interval-0s)
               promote interval=0s timeout=90 (ipstor0-promote-interval-0s)
               reload interval=0s timeout=30 (ipstor0-reload-interval-0s)
               start interval=0s timeout=240 (ipstor0-start-interval-0s)
               stop interval=0s timeout=100 (ipstor0-stop-interval-0s)
 Group: p_iSCSI
  Resource: p_iSCSITarget (class=ocf provider=heartbeat type=iSCSITarget)
   Attributes: implementation=tgt iqn=iqn.2018-08.s-ka.local:disk.1 tid=1
   Operations: monitor interval=30 timeout=60
(p_iSCSITarget-monitor-interval-30)
               start interval=0 timeout=60 (p_iSCSITarget-start-interval-0)
               stop interval=0 timeout=60 (p_iSCSITarget-stop-interval-0)
  Resource: p_iSCSILogicalUnit (class=ocf provider=heartbeat
type=iSCSILogicalUnit)
   Attributes: implementation=tgt lun=10
path=/dev/drbd/by-disk/vg0/ipstor0 target_iqn=iqn.2018-08.s-ka.local:disk.1
   Operations: monitor interval=30 timeout=60
(p_iSCSILogicalUnit-monitor-interval-30)
               start interval=0 timeout=60
(p_iSCSILogicalUnit-start-interval-0)
               stop interval=0 timeout=60
(p_iSCSILogicalUnit-stop-interval-0)
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=32 ip=192.168.95.48
   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start ipstor0Clone then start p_iSCSI (kind:Mandatory)
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 migration-threshold: 1
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster1
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1539474248
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:
[root at drbd0 corosync]#

------------------------------------------------------------------------

4. So I am out of ideas and don't know what to do. Maybe just dive into
Pacemaker's source code?

I hope to get some feedback or tips from you. Thank you very much in
advance :)

Best Regards

Zhang

_______________________________________________
Users mailing list: Users at clusterlabs.org[2]
https://lists.clusterlabs.org/mailman/listinfo/users[3]

Project Home: http://www.clusterlabs.org[4]
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf[5]
Bugs: http://bugs.clusterlabs.org[6]


--------
[1] https://nnc3.com/mags/LJ_1994-2014/LJ/217/11275.html
[2] mailto:Users at clusterlabs.org
[3] https://lists.clusterlabs.org/mailman/listinfo/users
[4] http://www.clusterlabs.org
[5] http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
[6] http://bugs.clusterlabs.org