[ClusterLabs] Need help to enable hot switch of iSCSI (tgtd) under two node Pacemaker + DRBD 9.0 under CentOS 7.5 in ESXi 6.5 Environment

LiFeng Zhang zhang at linux-systeme.de
Wed Oct 24 06:53:10 UTC 2018


Hello Dear Andrei Borzenkov,

Thank you very much for your answer. I've been checking the logs the whole
time, but there is nothing helpful in them, just a bunch of heartbeat messages.

Anyway, I've read the book "CentOS High Availability" (Packt, 2015), got
some new ideas and tried them out, and the situation is now somewhat
different.

------------------------------------------------------------------------
pcs resource create p_iSCSITarget ocf:heartbeat:iSCSITarget \
    implementation="tgt" iqn="iqn.2018-08.s-ka.local:disk" tid="1"
pcs resource create p_iSCSILogicalUnit ocf:heartbeat:iSCSILogicalUnit \
    implementation="tgt" target_iqn="iqn.2018-08.s-ka.local:disk" lun="10" \
    path="/dev/drbd/by-disk/vg0/ipstor0"
pcs resource group add p_iSCSI ClusterIP p_iSCSITarget p_iSCSILogicalUnit
pcs constraint colocation set ClusterIP p_iSCSITarget p_iSCSILogicalUnit
------------------------------------------------------------------------
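
For completeness: on whichever node currently holds the group, the target
and LUN that these two agents create can be inspected with the stock tgt
tooling (the agents drive tgtadm underneath when implementation="tgt"):

------------------------------------------------------------------------
# show all targets, LUNs and bound initiators known to the running tgtd
tgtadm --lld iscsi --mode target --op show
------------------------------------------------------------------------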


The difference from the previous version: I now use the IQN
"iqn.2018-08.s-ka.local:disk" instead of "iqn.2018-08.s-ka.local:disk.1";
the trailing ".1" probably stood for the "tid".

Now I have a new problem: the resources and tgtd do start, but although I
set a colocation constraint, Pacemaker keeps trying to start tgtd on the
other node as well.
How do I solve this? Thank you all in advance!
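
My own guess so far: the colocation set above only ties the three group
members to each other, which the group definition already implies, but
nothing ties the group to the Master role of the DRBD clone, so Pacemaker
is free to try it on either node. If I read the usual DRBD/Pacemaker
examples correctly, the missing pieces would look roughly like the
following (untested sketch, using the resource names from the status
output below):

------------------------------------------------------------------------
# place the iSCSI group only where the DRBD resource is promoted to Master
pcs constraint colocation add p_iSCSI with master ipstor0Clone INFINITY
# and start it only after the promotion has happened
pcs constraint order promote ipstor0Clone then start p_iSCSI
------------------------------------------------------------------------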

Here is the output from "pcs status":
------------------------------------------------------------------------
[root at drbd0 /]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
partition with quorum
Last updated: Wed Oct 24 08:43:29 2018
Last change: Wed Oct 24 08:43:24 2018 by root via cibadmin on 
drbd0-ha.s-ka.local

2 nodes configured
5 resources configured

Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]

Full list of resources:

  Master/Slave Set: ipstor0Clone [ipstor0]
      Masters: [ drbd0-ha.s-ka.local ]
      Slaves: [ drbd1-ha.s-ka.local ]
  Resource Group: p_iSCSI
      ClusterIP    (ocf::heartbeat:IPaddr2):    Started drbd0-ha.s-ka.local
      p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Started 
drbd0-ha.s-ka.local
      p_iSCSILogicalUnit    (ocf::heartbeat:iSCSILogicalUnit): Started 
drbd0-ha.s-ka.local

Failed Actions:
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1): 
call=32, status=complete, exitreason='',
     last-rc-change='Wed Oct 24 08:37:25 2018', queued=0ms, exec=23ms
* p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error' (1): 
call=38, status=complete, exitreason='',
     last-rc-change='Wed Oct 24 08:37:55 2018', queued=0ms, exec=28ms


Daemon Status:
   corosync: active/enabled
   pacemaker: active/enabled
   pcsd: active/enabled
[root at drbd0 /]


[root at drbd0 /]# pcs constraint show --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
   Resource Sets:
     set ClusterIP p_iSCSITarget p_iSCSILogicalUnit 
(id:pcs_rsc_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit) setoptions 
score=INFINITY 
(id:pcs_rsc_colocation_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit)
Ticket Constraints:
[root at drbd0 /]#
------------------------------------------------------------------------
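
One more thing I notice in the output above: the resource defaults still
set migration-threshold=1 (see the pcs config quoted further down), so a
single failed start is already enough to ban a resource from a node until
the failure is cleared. If that is part of the picture, the failure
records can be inspected and reset with something like this (sketch, not
yet tried here):

------------------------------------------------------------------------
# show the accumulated fail counts of the two iSCSI resources
pcs resource failcount show p_iSCSITarget
pcs resource failcount show p_iSCSILogicalUnit
# clear the failed-action history so Pacemaker retries the placement
pcs resource cleanup p_iSCSITarget
pcs resource cleanup p_iSCSILogicalUnit
------------------------------------------------------------------------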

Best Regards
Lifeng

On 2018/10/19 06:02, Andrei Borzenkov wrote:
> On 16.10.2018 15:29, LiFeng Zhang wrote:
>> Hi, all dear friends,
>>
>> I need your help to enable hot switchover of iSCSI under a
>> Pacemaker/Corosync cluster, which exports an iSCSI device backed by a
>> two-node DRBD replication.
>>
>> I've got the Pacemaker/Corosync cluster working and DRBD replication
>> working as well, but it is stuck at iSCSI: I can manually start tgtd on
>> one node, so the VCSA can recognize the iSCSI disk and create a
>> VMFS/storage object on it, and then I can create a test VM on that VMFS.
>>
>> But when I switch the DRBD Primary/Secondary, the test VM keeps running,
>> but its underlying disk becomes read-only. As far as I know, tgtd should
>> be handled by Pacemaker so that it automatically starts on the DRBD
>> Primary node, but in my situation it sadly does NOT.
>>
> Pacemaker only handles resources that were started by Pacemaker.
> According to your output below, in all cases the resource was stopped
> from Pacemaker's point of view, and all of Pacemaker's attempts to start
> the resource failed. You should troubleshoot why they failed. This
> requires knowledge of the specific resource agent; sadly, I am not
> familiar with the iSCSI target agents. The Pacemaker logs may include
> more information from the resource agent than just "unknown error".
>
>> I've tried all kinds of resources/manuals/documents, but they are all
>> mixed with extra information: other systems, other software versions.
>>
>> My BEST reference (the configuration closest to mine) is this
>> URL: https://nnc3.com/mags/LJ_1994-2014/LJ/217/11275.html
>>
>> The difference between my setup and this article, I think, is that I
>> don't have an LVM volume but only a raw iSCSI disk, and that I have to
>> translate the CRM commands into PCS commands.
>>
>> But after I "copied" the configuration from this article, my cluster
>> could not start anymore. I tried removing the LVM resource (which had
>> caused a "device not found" error), but the resource group still can't
>> start, without any explicit "reason" from Pacemaker.
>>
>>
>> *1*. The whole setup runs on a two-node ESXi 6.5 cluster, which has a
>> VCSA installed on one of the ESXi hosts.
>>
>> I have attached a simple diagram, which may describe the deployment
>> better.
>>
>> 2. Starting point:
>>
>> The involved hosts are all mapped through local DNS, which also
>> includes the floating VIP; the local domain is s-ka.local:
>>
>> ------------------------------------------------------------------------
>>
>> firewall:    fw01.s-ka.local.        IN    A    192.168.95.249
>>
>> vcsa:    vc01.s-ka.local.        IN    A    192.168.95.30
>> esxi:     esx01.s-ka.local.        IN    A    192.168.95.5
>> esxi:     esx02.s-ka.local.        IN    A    192.168.95.7
>>
>> drbd:    drbd0.s-ka.local.        IN    A    192.168.95.45
>> drbd:    drbd1.s-ka.local.        IN    A    192.168.95.47
>> vip:      ipstor0.s-ka.local.        IN    A    192.168.95.48
>>
>> heartbeat:    drbd0-ha.s-ka.local.    IN    A    192.168.96.45
>> heartbeat:    drbd1-ha.s-ka.local.    IN    A    192.168.96.47
>>
>> ------------------------------------------------------------------------
>>
>>
>> Both DRBD servers run CentOS 7.5; the installed packages are as follows:
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 ~]# cat /etc/centos-release
>> CentOS Linux release 7.5.1804 (Core)
>>
>> [root at drbd0 ~]# uname -a
>> Linux drbd0.s-ka.local 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16
>> 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root at drbd1 ~]# yum list installed|grep pacemaker
>> pacemaker.x86_64 1.1.18-11.el7_5.3              @updates
>> pacemaker-cli.x86_64 1.1.18-11.el7_5.3              @updates
>> pacemaker-cluster-libs.x86_64 1.1.18-11.el7_5.3              @updates
>> pacemaker-libs.x86_64 1.1.18-11.el7_5.3              @updates
>>
>> [root at drbd1 ~]# yum list installed|grep coro
>> corosync.x86_64 2.4.3-2.el7_5.1                @updates
>> corosynclib.x86_64 2.4.3-2.el7_5.1                @updates
>>
>> [root at drbd1 ~]# yum list installed|grep drbd
>> drbd90-utils.x86_64 9.3.1-1.el7.elrepo             @elrepo
>> kmod-drbd90.x86_64 9.0.14-1.el7_5.elrepo          @elrepo
>>
>> [root at drbd1 ~]# yum list installed|grep -i scsi
>> lsscsi.x86_64 0.27-6.el7                     @anaconda
>> scsi-target-utils.x86_64 1.0.55-4.el7                   @epel
>>
>> ------------------------------------------------------------------------
>>
>>
>> 3. configurations
>>
>> 3.1 OK, first the DRBD configuration
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd1 ~]# cat /etc/drbd.conf
>> # You can find an example in /usr/share/doc/drbd.../drbd.conf.example
>>
>> include "drbd.d/global_common.conf";
>> include "drbd.d/*.res";
>>
>> [root at drbd1 ~]# cat /etc/drbd.d/r0.res
>> resource iscsivg01 {
>>    protocol C;
>>    device /dev/drbd0;
>>    disk /dev/vg0/ipstor0;
>>    flexible-meta-disk internal;
>>    on drbd0.s-ka.local {
>>      #volume 0 {
>>        #device /dev/drbd0;
>>        #disk /dev/vg0/ipstor0;
>>        #flexible-meta-disk internal;
>>      #}
>>      address 192.168.96.45:7788;
>>    }
>>    on drbd1.s-ka.local {
>>      #volume 0 {
>>        #device /dev/drbd0;
>>        #disk /dev/vg0/ipstor0;
>>        #flexible-meta-disk internal;
>>      #}
>>      address 192.168.96.47:7788;
>>    }
>> }
>>
>> ------------------------------------------------------------------------
>>
>> 3.2 Then the DRBD device
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd1 ~]# lsblk
>> NAME            MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
>> sda               8:0    0   25G  0 disk
>> ├─sda1            8:1    0    1G  0 part /boot
>> └─sda2            8:2    0   24G  0 part
>>    ├─centos-root 253:0    0   22G  0 lvm  /
>>    └─centos-swap 253:1    0    2G  0 lvm  [SWAP]
>> sdb               8:16   0  500G  0 disk
>> └─sdb1            8:17   0  500G  0 part
>>    └─vg0-ipstor0 253:2    0  500G  0 lvm
>>      └─drbd0     147:0    0  500G  1 disk
>> sr0              11:0    1 1024M  0 rom
>>
>> [root at drbd1 ~]# tree /dev/drbd
>> drbd/  drbd0
>> [root at drbd1 ~]# tree /dev/drbd
>> /dev/drbd
>> ├── by-disk
>> │   └── vg0
>> │       └── ipstor0 -> ../../../drbd0
>> └── by-res
>>      └── iscsivg01
>>          └── 0 -> ../../../drbd0
>>
>> 4 directories, 2 files
>>
>> ------------------------------------------------------------------------
>>
>> 3.3 DRBD status
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd1 ~]# drbdadm status
>> iscsivg01 role:Secondary
>>    disk:UpToDate
>>    drbd0.s-ka.local role:Primary
>>      peer-disk:UpToDate
>>
>> [root at drbd0 ~]# drbdadm status
>> iscsivg01 role:Primary
>>    disk:UpToDate
>>    drbd1.s-ka.local role:Secondary
>>      peer-disk:UpToDate
>>
>> [root at drbd0 ~]# cat /proc/drbd
>> version: 9.0.14-1 (api:2/proto:86-113)
>> GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by mockbuild@,
>> 2018-05-04 03:32:42
>> Transports (api:16): tcp (9.0.14-1)
>>
>> ------------------------------------------------------------------------
>>
>> 3.4 Corosync configuration
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# cat /etc/corosync/corosync.conf
>> totem {
>>      version: 2
>>      cluster_name: cluster1
>>      secauth: off
>>      transport: udpu
>> }
>>
>> nodelist {
>>      node {
>>          ring0_addr: drbd0-ha.s-ka.local
>>          nodeid: 1
>>      }
>>
>>      node {
>>          ring0_addr: drbd1-ha.s-ka.local
>>          nodeid: 2
>>      }
>> }
>>
>> quorum {
>>      provider: corosync_votequorum
>>      two_node: 1
>> }
>>
>> logging {
>>      to_logfile: yes
>>      logfile: /var/log/cluster/corosync.log
>>      to_syslog: yes
>> }
>>
>> ------------------------------------------------------------------------
>>
>>
>> 3.5 Corosync status:
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# systemctl status corosync
>> ● corosync.service - Corosync Cluster Engine
>>     Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled;
>> vendor preset: disabled)
>>     Active: active (running) since Sun 2018-10-14 02:58:01 CEST; 2 days ago
>>       Docs: man:corosync
>>             man:corosync.conf
>>             man:corosync_overview
>>    Process: 1095 ExecStart=/usr/share/corosync/corosync start
>> (code=exited, status=0/SUCCESS)
>>   Main PID: 1167 (corosync)
>>     CGroup: /system.slice/corosync.service
>>             └─1167 corosync
>>
>> Oct 14 02:58:00 drbd0.s-ka.local corosync[1167]:  [MAIN  ] Completed
>> service synchronization, ready to provide service.
>> Oct 14 02:58:01 drbd0.s-ka.local corosync[1095]: Starting Corosync
>> Cluster Engine (corosync): [  OK  ]
>> Oct 14 02:58:01 drbd0.s-ka.local systemd[1]: Started Corosync Cluster
>> Engine.
>> Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]:  [TOTEM ] A new
>> membership (192.168.96.45:384) was formed. Members left: 2
>> Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]:  [QUORUM] Members[1]: 1
>> Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]:  [MAIN  ] Completed
>> service synchronization, ready to provide service.
>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [TOTEM ] A new
>> membership (192.168.96.45:388) was formed. Members joined: 2
>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [CPG   ] downlist
>> left_list: 0 received in state 0
>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [QUORUM] Members[2]: 1 2
>> Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]:  [MAIN  ] Completed
>> service synchronization, ready to provide service.
>>
>> ------------------------------------------------------------------------
>>
>> 3.6 tgtd configuration:
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# cat /etc/tgt/targets.conf
>> # This is a sample config file for tgt-admin.
>> #
>> # The "#" symbol disables the processing of a line.
>>
>> # Set the driver. If not specified, defaults to "iscsi".
>> default-driver iscsi
>>
>> # Set iSNS parameters, if needed
>> #iSNSServerIP 192.168.111.222
>> #iSNSServerPort 3205
>> #iSNSAccessControl On
>> #iSNS On
>>
>> # Continue if tgtadm exits with non-zero code (equivalent of
>> # --ignore-errors command line option)
>> #ignore-errors yes
>>
>>
>> <target iqn.2018-08.s-ka.local:disk.1>
>>      lun 10
>>      backing-store /dev/drbd0
>>      initiator-address 192.168.96.0/24
>>      initiator-address 192.168.95.0/24
>>      target-address 192.168.95.48
>> </target>
>>
>> ------------------------------------------------------------------------
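
Note added in this reply: this static <target> definition is also loaded
by tgtd itself at startup, via the "tgt-admin -e -c $TGTD_CONFIG"
ExecStartPost shown in 3.7 below, which would explain the "logical unit
number already exists" error in 3.10. If the target and LUN are meant to
be created by the cluster agents, targets.conf should probably be reduced
to something like the following sketch:

------------------------------------------------------------------------
# /etc/tgt/targets.conf - let the cluster agents define targets and LUNs
default-driver iscsi

# no static <target> section here; p_iSCSITarget and p_iSCSILogicalUnit
# create the target and LUN at runtime through tgtadm
------------------------------------------------------------------------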
>>
>>
>> 3.7 tgtd has been disabled on both servers and is only startable on the
>> current DRBD Primary node.
>>
>> ------------------------------------------------------------------------
>>
>> Secondary Node:
>>
>> [root at drbd1 ~]# systemctl status tgtd
>> ● tgtd.service - tgtd iSCSI target daemon
>>     Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
>> vendor preset: disabled)
>>     Active: inactive (dead)
>> [root at drbd1 ~]# systemctl restart tgtd
>> Job for tgtd.service failed because the control process exited with
>> error code. See "systemctl status tgtd.service" and "journalctl -xe" for
>> details.
>>
>>
>> Primary Node:
>>
>> [root at drbd0 corosync]# systemctl status tgtd
>> ● tgtd.service - tgtd iSCSI target daemon
>>     Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
>> vendor preset: disabled)
>>     Active: inactive (dead)
>> [root at drbd0 corosync]# systemctl restart tgtd
>> [root at drbd0 corosync]# systemctl status  tgtd
>> ● tgtd.service - tgtd iSCSI target daemon
>>     Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
>> vendor preset: disabled)
>>     Active: active (running) since Tue 2018-10-16 14:09:47 CEST; 2min 29s
>> ago
>>    Process: 22300 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
>> --name State -v ready (code=exited, status=0/SUCCESS)
>>    Process: 22272 ExecStartPost=/usr/sbin/tgt-admin -e -c $TGTD_CONFIG
>> (code=exited, status=0/SUCCESS)
>>    Process: 22271 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
>> --name State -v offline (code=exited, status=0/SUCCESS)
>>    Process: 22270 ExecStartPost=/bin/sleep 5 (code=exited, status=0/SUCCESS)
>>   Main PID: 22269 (tgtd)
>>     CGroup: /system.slice/tgtd.service
>>             └─22269 /usr/sbin/tgtd -f
>>
>> Oct 16 14:09:42 drbd0.s-ka.local systemd[1]: Starting tgtd iSCSI target
>> daemon...
>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: iser_ib_init(3436)
>> Failed to initialize RDMA; load kernel modules?
>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
>> work_timer_start(146) use timer_fd based scheduler
>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
>> bs_init_signalfd(267) could not open backing-store module directory
>> /usr/lib64/tgt/backing-store
>> Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: bs_init(386) use
>> signalfd notification
>> Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: device_mgmt(246)
>> sz:16 params:path=/dev/drbd0
>> Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: bs_thread_open(408) 16
>> Oct 16 14:09:47 drbd0.s-ka.local systemd[1]: Started tgtd iSCSI target
>> daemon.
>>
>> ------------------------------------------------------------------------
>>
>> 3.8 Up to this point everything was working, but when I switched the
>> DRBD Primary node, it stopped working (the filesystem of the test VM
>> became read-only).
>>
>> So I changed the pcs configuration according to the previously
>> mentioned article:
>>
>> ------------------------------------------------------------------------
>>
>>> pcs resource create p_iscsivg01 ocf:heartbeat:LVM volgrpname="vg0" op monitor interval="30"
>>
>>> pcs resource group add p_iSCSI p_iscsivg01 p_iSCSITarget p_iSCSILogicalUnit ClusterIP
>>
>>> pcs constraint order start ipstor0Clone then start p_iSCSI then start ipstor0Clone:Master
>>
>>
>> [root at drbd0 ~]# pcs status
>>      Cluster name: cluster1
>>      Stack: corosync
>>      Current DC: drbd0-ha.s-ka.local (version
>> 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
>>      Last updated: Sun Oct 14 01:38:18 2018
>>      Last change: Sun Oct 14 01:37:58 2018 by root via cibadmin on
>> drbd0-ha.s-ka.local
>>
>>      2 nodes configured
>>      6 resources configured
>>
>>      Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>>
>>      Full list of resources:
>>
>>       Master/Slave Set: ipstor0Clone [ipstor0]
>>           Masters: [ drbd0-ha.s-ka.local ]
>>           Slaves: [ drbd1-ha.s-ka.local ]
>>       Resource Group: p_iSCSI
>>           p_iscsivg01    (ocf::heartbeat:LVM):    Stopped
>>           p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Stopped
>>           p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit):    Stopped
>>           ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped
>>
>>      Failed Actions:
>>      * p_iSCSILogicalUnit_start_0 on drbd0-ha.s-ka.local 'unknown error'
>> (1): call=42, status=complete, exitreason='',
>>          last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=28ms
>>      * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>> call=40, status=complete, exitreason='',
>>          last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=23ms
>>      * p_iscsivg01_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>> call=48, status=complete, exitreason='Volume group [iscsivg01] does not
>> exist or contains error!   Volume group "iscsivg01" not found',
>>          last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=47ms
>>      * p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error'
>> (1): call=41, status=complete, exitreason='',
>>          last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=31ms
>>      * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>> call=39, status=complete, exitreason='',
>>          last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=24ms
>>      * p_iscsivg01_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>> call=47, status=complete, exitreason='Volume group [iscsivg01] does not
>> exist or contains error!   Volume group "iscsivg01" not found',
>>          last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=50ms
>>
>>
>>      Daemon Status:
>>        corosync: active/enabled
>>        pacemaker: active/enabled
>>        pcsd: active/enabled
>>      [root at drbd0 ~]#
>>
>> ------------------------------------------------------------------------
>>
>>
>> 3.9 Because of the "device not found" error I removed the LVM resource;
>> it now looks like this:
>>
>> (I also switched the path between /dev/drbd/by-disk and
>> /dev/drbd/by-res, but with no effect.)
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# pcs status
>> Cluster name: cluster1
>> Stack: corosync
>> Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
>> partition with quorum
>> Last updated: Tue Oct 16 14:18:09 2018
>> Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
>> drbd0-ha.s-ka.local
>>
>> 2 nodes configured
>> 5 resources configured
>>
>> Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>>
>> Full list of resources:
>>
>>   Master/Slave Set: ipstor0Clone [ipstor0]
>>       Masters: [ drbd0-ha.s-ka.local ]
>>       Slaves: [ drbd1-ha.s-ka.local ]
>>   Resource Group: p_iSCSI
>>       p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Stopped
>>       p_iSCSILogicalUnit    (ocf::heartbeat:iSCSILogicalUnit):  Stopped
>>       ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped
>>
>> Failed Actions:
>> * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>> call=12, status=complete, exitreason='',
>>      last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
>> * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>> call=12, status=complete, exitreason='',
>>      last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms
>>
>>
>> Daemon Status:
>>    corosync: active/enabled
>>    pacemaker: active/enabled
>>    pcsd: active/enabled
>> [root at drbd0 corosync]#
>>
>> ------------------------------------------------------------------------
>>
>> 3.10 I've tried "pcs resource debug-start xxx --full" on the DRBD
>> Primary node:
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# pcs resource debug-start p_iSCSI --full
>> Error: unable to debug-start a group, try one of the group's resource(s)
>> (p_iSCSITarget,p_iSCSILogicalUnit,ClusterIP)
>>
>> [root at drbd0 corosync]# pcs resource debug-start p_iSCSITarget --full
>> Operation start for p_iSCSITarget (ocf:heartbeat:iSCSITarget) returned:
>> 'ok' (0)
>>   >  stderr: DEBUG: p_iSCSITarget start : 0
>>
>> [root at drbd0 corosync]# pcs resource debug-start p_iSCSILogicalUnit --full
>> Operation start for p_iSCSILogicalUnit (ocf:heartbeat:iSCSILogicalUnit)
>> returned: 'unknown error' (1)
>>   >  stderr: ERROR: tgtadm: this logical unit number already exists
>>
>> [root at drbd0 corosync]# pcs resource debug-start ClusterIP --full
>> Operation start for ClusterIP (ocf:heartbeat:IPaddr2) returned: 'ok' (0)
>>   >  stderr: INFO: Adding inet address 192.168.95.48/32 with broadcast
>> address 192.168.95.255 to device ens192
>>   >  stderr: INFO: Bringing device ens192 up
>>   >  stderr: INFO: /usr/libexec/heartbeat/send_arp -i 200 -c 5 -p
>> /var/run/resource-agents/send_arp-192.168.95.48 -I ens192 -m auto
>> 192.168.95.48
>> [root at drbd0 corosync]#
>>
>> ------------------------------------------------------------------------
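
Another note added in this reply: the "logical unit number already
exists" message suggests tgtd already had LUN 10 configured when the
agent tried to add it, most likely from the static targets.conf loaded at
tgtd startup as mentioned above. What the running tgtd currently exports
can be checked, and a stale target removed, roughly like this:

------------------------------------------------------------------------
# list all targets and LUNs the running tgtd currently exports
tgt-admin --show
# remove a stale, statically defined target so the agents can recreate it
# (only safe while no initiator is logged in to it)
tgtadm --lld iscsi --mode target --op delete --tid 1
------------------------------------------------------------------------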
>>
>> 3.11 As you can see, there are errors, but "p_iSCSITarget" was started
>> successfully. Yet "pcs status" still shows it as "Stopped":
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# pcs status
>> Cluster name: cluster1
>> Stack: corosync
>> Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
>> partition with quorum
>> Last updated: Tue Oct 16 14:22:38 2018
>> Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
>> drbd0-ha.s-ka.local
>>
>> 2 nodes configured
>> 5 resources configured
>>
>> Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
>>
>> Full list of resources:
>>
>>   Master/Slave Set: ipstor0Clone [ipstor0]
>>       Masters: [ drbd0-ha.s-ka.local ]
>>       Slaves: [ drbd1-ha.s-ka.local ]
>>   Resource Group: p_iSCSI
>>       p_iSCSITarget    (ocf::heartbeat:iSCSITarget):    Stopped
>>       p_iSCSILogicalUnit    (ocf::heartbeat:iSCSILogicalUnit): Stopped
>>       ClusterIP    (ocf::heartbeat:IPaddr2):    Stopped
>>
>> Failed Actions:
>> * p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
>> call=12, status=complete, exitreason='',
>>      last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
>> * p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
>> call=12, status=complete, exitreason='',
>>      last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms
>>
>>
>> Daemon Status:
>>    corosync: active/enabled
>>    pacemaker: active/enabled
>>    pcsd: active/enabled
>> [root at drbd0 corosync]#
>>
>> ------------------------------------------------------------------------
>>
>> 3.12 The pcs config is:
>>
>> ------------------------------------------------------------------------
>>
>> [root at drbd0 corosync]# pcs config
>> Cluster Name: cluster1
>> Corosync Nodes:
>>   drbd0-ha.s-ka.local drbd1-ha.s-ka.local
>> Pacemaker Nodes:
>>   drbd0-ha.s-ka.local drbd1-ha.s-ka.local
>>
>> Resources:
>>   Master: ipstor0Clone
>>    Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1
>> clone-node-max=1
>>    Resource: ipstor0 (class=ocf provider=linbit type=drbd)
>>     Attributes: drbd_resource=iscsivg01
>>     Operations: demote interval=0s timeout=90 (ipstor0-demote-interval-0s)
>>                 monitor interval=60s (ipstor0-monitor-interval-60s)
>>                 notify interval=0s timeout=90 (ipstor0-notify-interval-0s)
>>                 promote interval=0s timeout=90 (ipstor0-promote-interval-0s)
>>                 reload interval=0s timeout=30 (ipstor0-reload-interval-0s)
>>                 start interval=0s timeout=240 (ipstor0-start-interval-0s)
>>                 stop interval=0s timeout=100 (ipstor0-stop-interval-0s)
>>   Group: p_iSCSI
>>    Resource: p_iSCSITarget (class=ocf provider=heartbeat type=iSCSITarget)
>>     Attributes: implementation=tgt iqn=iqn.2018-08.s-ka.local:disk.1 tid=1
>>     Operations: monitor interval=30 timeout=60
>> (p_iSCSITarget-monitor-interval-30)
>>                 start interval=0 timeout=60 (p_iSCSITarget-start-interval-0)
>>                 stop interval=0 timeout=60 (p_iSCSITarget-stop-interval-0)
>>    Resource: p_iSCSILogicalUnit (class=ocf provider=heartbeat
>> type=iSCSILogicalUnit)
>>     Attributes: implementation=tgt lun=10
>> path=/dev/drbd/by-disk/vg0/ipstor0 target_iqn=iqn.2018-08.s-ka.local:disk.1
>>     Operations: monitor interval=30 timeout=60
>> (p_iSCSILogicalUnit-monitor-interval-30)
>>                 start interval=0 timeout=60
>> (p_iSCSILogicalUnit-start-interval-0)
>>                 stop interval=0 timeout=60
>> (p_iSCSILogicalUnit-stop-interval-0)
>>    Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>     Attributes: cidr_netmask=32 ip=192.168.95.48
>>     Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
>>                 start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>                 stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>>    start ipstor0Clone then start p_iSCSI (kind:Mandatory)
>> Colocation Constraints:
>> Ticket Constraints:
>>
>> Alerts:
>>   No alerts defined
>>
>> Resources Defaults:
>>   migration-threshold: 1
>> Operations Defaults:
>>   No defaults set
>>
>> Cluster Properties:
>>   cluster-infrastructure: corosync
>>   cluster-name: cluster1
>>   dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
>>   have-watchdog: false
>>   last-lrm-refresh: 1539474248
>>   no-quorum-policy: ignore
>>   stonith-enabled: false
>>
>> Quorum:
>>    Options:
>> [root at drbd0 corosync]#
>>
>> ------------------------------------------------------------------------
>>
>>
>> 4. So I am out of ideas and don't know what to do any more; should I
>> just dive into Pacemaker's source code??
>>
>> I hope to get some feedback or tips from you. Thank you very much in
>> advance :)
>>
>>
>> Best Regards
>>
>> Zhang
>>
>>
>>

