[ClusterLabs] Need help to enable hot switch of iSCSI (tgtd) under two node Pacemaker + DRBD 9.0 under CentOS 7.5 in ESXi 6.5 Environment
Stefan K
shadow_7 at gmx.net
Wed Oct 24 10:18:34 EDT 2018
Hi,
first question: why did you use tgt instead of LIO? LIO is more common nowadays.
You also need colocation and ordering constraints.
Here is my config with lio-t (though I suspect something is wrong in the resource agents; I will take a deeper look tomorrow):
pcs config
Cluster Name: zfs-vmstorage
Corosync Nodes:
zfs-serv3 zfs-serv4
Pacemaker Nodes:
zfs-serv3 zfs-serv4
Resources:
Resource: ha-ip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.2.10 cidr_netmask=24 nic=bond0
Meta Attrs: target-role=Started
Operations: start interval=0s timeout=20s (ha-ip-start-0s)
stop interval=0s timeout=20s (ha-ip-stop-0s)
monitor interval=10s timeout=20s (ha-ip-monitor-10s)
Resource: vm_storage (class=ocf provider=heartbeat type=ZFS)
Attributes: pool=vm_storage importargs="-d /dev/disk/by-vdev/"
Meta Attrs: target-role=Started
Operations: monitor interval=5s timeout=30s (vm_storage-monitor-5s)
start interval=0s timeout=90 (vm_storage-start-0s)
stop interval=0s timeout=90 (vm_storage-stop-0s)
Resource: iscsi-server (class=ocf provider=heartbeat type=iSCSITarget)
Attributes: implementation=lio-t iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. portals=192.168.2.10:3260 allowed_initiators="iqn.1998-01.com.vmware:brainslug9-7548xxxx iqn.1998-01.com.vmware:brainslug8-058xxxx iqn.1998-01.com.vmware:brainslug7-592bxxxx iqn.1998-01.com.vmware:brainslug10-5564cxxxx"
Resource: iscsi-lun0 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
Attributes: implementation=lio-t target_iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. lun=0 path=/dev/zvol/vm_storage/zfs-vol1
Resource: iscsi-lun1 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
Attributes: implementation=lio-t target_iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. lun=1 path=/dev/zvol/vm_storage/zfs-vol2
Resource: iscsi-lun2 (class=ocf provider=heartbeat type=iSCSILogicalUnit)
Attributes: implementation=lio-t target_iqn=iqn.2003-01.org.linux-iscsi.vm-storage.x8664:sn.1decabcxxxx. lun=2 path=/dev/zvol/vm_storage/zfs-vol3
Stonith Devices:
Resource: resIPMI-zfs4 (class=stonith type=external/ipmi)
Attributes: hostname=zfs-serv4 ipaddr=172.xx.xx.xx userid=USER passwd=SECRET interface=lan priv=OPERATOR pcmk_delay_max=20
Operations: monitor interval=60s (resIPMI-zfs4-monitor-60s)
Resource: resIPMI-zfs3 (class=stonith type=external/ipmi)
Attributes: hostname=zfs-serv3 ipaddr=172.xx.xx.xx userid=user passwd=SECRET interface=lan priv=OPERATOR pcmk_delay_max=20
Operations: monitor interval=60s (resIPMI-zfs3-monitor-60s)
Fencing Levels:
Location Constraints:
Resource: resIPMI-zfs3
Disabled on: zfs-serv3 (score:-INFINITY) (id:location-resIPMI-zfs3-zfs-serv3--INFINITY)
Resource: resIPMI-zfs4
Disabled on: zfs-serv4 (score:-INFINITY) (id:location-resIPMI-zfs4-zfs-serv4--INFINITY)
Ordering Constraints:
Resource Sets:
set ha-ip iscsi-lun0 iscsi-lun1 iscsi-lun2 iscsi-server vm_storage action=stop (id:pcs_rsc_order_set_ha-ip_iscsi-server_vm_storage-1) setoptions symmetrical=false (id:pcs_rsc_order_set_ha-ip_iscsi-server_vm_storage)
set vm_storage iscsi-server iscsi-lun0 iscsi-lun1 iscsi-lun2 ha-ip action=start (id:pcs_rsc_order_set_iscsi-server_vm_storage_ha-ip-1) setoptions symmetrical=false (id:pcs_rsc_order_set_iscsi-server_vm_storage_ha-ip)
Colocation Constraints:
Resource Sets:
set ha-ip vm_storage iscsi-server iscsi-lun0 iscsi-lun1 iscsi-lun2 (id:pcs_rsc_colocation_set_ha-ip_vm_storage_iscsi-server-1) setoptions score=INFINITY (id:pcs_rsc_colocation_set_ha-ip_vm_storage_iscsi-server)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
resource-stickiness: 100
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: zfs-vmstorage
dc-version: 1.1.16-94ff4df
have-watchdog: false
last-lrm-refresh: 1540199247
no-quorum-policy: stop
stonith-enabled: true
Quorum:
Options:
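The ordering and colocation constraint sets shown in the config above can be created with pcs commands roughly like the following (a sketch against pcs 0.9.x as shipped with CentOS 7, using the resource names from the config):

```shell
# Stop top-down: IP first, then LUNs, then target, then the pool
pcs constraint order set ha-ip iscsi-lun0 iscsi-lun1 iscsi-lun2 iscsi-server vm_storage \
    action=stop setoptions symmetrical=false
# Start bottom-up: pool first, then target, then LUNs, then the IP
pcs constraint order set vm_storage iscsi-server iscsi-lun0 iscsi-lun1 iscsi-lun2 ha-ip \
    action=start setoptions symmetrical=false
# Keep everything on the same node
pcs constraint colocation set ha-ip vm_storage iscsi-server iscsi-lun0 iscsi-lun1 iscsi-lun2 \
    setoptions score=INFINITY
```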
Hello Friends,
a further thought about this situation:
I want to run tgtd as a cluster service on a primary/secondary DRBD device.
Since the DRBD device is only active on the primary node, tgtd cannot
start on the secondary node.
So how should I configure a primary/secondary tgtd service?
Or, which Pacemaker feature should I use so that tgtd starts on only one node?
Thank you very much in advance for any suggestions.
Best Regards
Lifeng
Hello Dear Andrei Borzenkov,
Thank you very much for your answer. I have checked the logs the whole
time, but there is nothing helpful, just a bunch of heartbeat messages.
Anyway, I read the book "Packt - CentOS High Availability" (published in
2015), got some new ideas, and tried them out; the situation is now
somewhat different.
--------------------
pcs resource create p_iSCSITarget ocf:heartbeat:iSCSITarget implementation="tgt" iqn="iqn.2018-08.s-ka.local:disk" tid="1"
pcs resource create p_iSCSILogicalUnit ocf:heartbeat:iSCSILogicalUnit implementation="tgt" target_iqn="iqn.2018-08.s-ka.local:disk" lun="10" path="/dev/drbd/by-disk/vg0/ipstor0"
pcs resource group add p_iSCSI ClusterIP p_iSCSITarget p_iSCSILogicalUnit
pcs constraint colocation set ClusterIP p_iSCSITarget p_iSCSILogicalUnit
--------------------
The difference from the previous version is here: I use the iqn
"iqn.2018-08.s-ka.local:disk" instead of "iqn.2018-08.s-ka.local:disk.1";
the trailing ".1" probably meant the "tid".
Now I have a new problem: although the resources and tgtd are started and
I set a colocation constraint, Pacemaker always tries to start tgtd on the
other node as well.
How do I solve this? Thank you all in advance!
here the output from "pcs status":
--------------------
[root at drbd0 /]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 24 08:43:29 2018
Last change: Wed Oct 24 08:43:24 2018 by root via cibadmin on drbd0-ha.s-ka.local

2 nodes configured
5 resources configured

Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]

Full list of resources:

 Master/Slave Set: ipstor0Clone [ipstor0]
     Masters: [ drbd0-ha.s-ka.local ]
     Slaves: [ drbd1-ha.s-ka.local ]
 Resource Group: p_iSCSI
     ClusterIP (ocf::heartbeat:IPaddr2): Started drbd0-ha.s-ka.local
     p_iSCSITarget (ocf::heartbeat:iSCSITarget): Started drbd0-ha.s-ka.local
     p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Started drbd0-ha.s-ka.local

Failed Actions:
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1): call=32, status=complete, exitreason='', last-rc-change='Wed Oct 24 08:37:25 2018', queued=0ms, exec=23ms
* p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error' (1): call=38, status=complete, exitreason='', last-rc-change='Wed Oct 24 08:37:55 2018', queued=0ms, exec=28ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root at drbd0 /]#

[root at drbd0 /]# pcs constraint show --full
Location Constraints:
Ordering Constraints:
Colocation Constraints:
  Resource Sets:
    set ClusterIP p_iSCSITarget p_iSCSILogicalUnit (id:pcs_rsc_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit) setoptions score=INFINITY (id:pcs_rsc_colocation_set_ClusterIP_p_iSCSITarget_p_iSCSILogicalUnit)
Ticket Constraints:
[root at drbd0 /]#
--------------------
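One common way to keep the iSCSI group off the DRBD secondary is to tie the group to the master role explicitly, rather than only colocating the group members with each other — something along these lines (a sketch in pcs 0.9 syntax, using the resource names from the status output above):

```shell
# Start the iSCSI group only after the DRBD resource has been promoted...
pcs constraint order promote ipstor0Clone then start p_iSCSI
# ...and run it only on the node where ipstor0Clone is currently master
pcs constraint colocation add p_iSCSI with master ipstor0Clone INFINITY
```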
Best Regards
Lifeng
On 2018/10/19 06:02, Andrei Borzenkov wrote:
On 16.10.2018 15:29, LiFeng Zhang wrote:
Hi, all dear friends,
I need your help to enable hot switchover of iSCSI under a
Pacemaker/Corosync cluster, which has an iSCSI device based on a two-node
DRBD replication.
I've got the Pacemaker/Corosync cluster working and DRBD replication
working too, but it is stuck at iSCSI. I can manually start tgtd on one
node, so that the VCSA recognizes the iSCSI disk and creates a
VMFS/storage object on it, and then I can create a test VM on that VMFS.
But when I switch the DRBD primary/secondary roles, the test VM keeps
running, but its underlying disk becomes read-only. As far as I know, tgtd
should be handled by Pacemaker so that it automatically starts on the
primary DRBD instance, but in my setup it sadly does NOT.
Pacemaker only manages resources that were started by Pacemaker.
According to your output below, in all cases the resource was stopped from
Pacemaker's point of view, and all of Pacemaker's attempts to start the
resource failed. You should troubleshoot why they failed. This requires
knowledge of the specific resource agent; sadly I am not familiar with the
iSCSI target one. The Pacemaker logs may include more information from the
resource agent than just "unknown error".
I've tried all kinds of resources/manuals/documents, but they are all
mixed with extra information, other systems, other software versions.
One of my BEST references (the closest configuration to mine) is this
URL: https://nnc3.com/mags/LJ_1994-2014/LJ/217/11275.html[1]
The differences between my setup and this article are, I think, that I
don't have an LVM volume but only a raw iSCSI disk, and that I have to
translate the CRM commands into PCS commands.
But after I "copied" the configuration from this article, my cluster
cannot start anymore. I've tried removing the LVM resource (which caused a
"device not found" error), but the resource group still can't start,
without any explicit "reason" from Pacemaker.
*1*. The whole configuration runs on a two-node ESXi 6.5 cluster, which
has a VCSA installed on one ESXi host.
I have attached a simple diagram, which may describe the deployment
better.
2. starting point:
The involved hosts are all mapped through local DNS, which also includes
the floating VIP; the local domain is s-ka.local:
------------------------------------------------------------------------
firewall: fw01.s-ka.local. IN A 192.168.95.249
vcsa: vc01.s-ka.local. IN A 192.168.95.30
esxi: esx01.s-ka.local. IN A 192.168.95.5
esxi: esx02.s-ka.local. IN A 192.168.95.7
drbd: drbd0.s-ka.local. IN A 192.168.95.45
drbd: drbd1.s-ka.local. IN A 192.168.95.47
vip: ipstor0.s-ka.local. IN A 192.168.95.48
heartbeat: drbd0-ha.s-ka.local. IN A 192.168.96.45
heartbeat: drbd1-ha.s-ka.local. IN A 192.168.96.47
------------------------------------------------------------------------
Both DRBD servers run CentOS 7.5; the installed packages are:
------------------------------------------------------------------------
[root at drbd0 ~]# cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)
[root at drbd0 ~]# uname -a
Linux drbd0.s-ka.local 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16
16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root at drbd1 ~]# yum list installed|grep pacemaker
pacemaker.x86_64 1.1.18-11.el7_5.3 @updates
pacemaker-cli.x86_64 1.1.18-11.el7_5.3 @updates
pacemaker-cluster-libs.x86_64 1.1.18-11.el7_5.3 @updates
pacemaker-libs.x86_64 1.1.18-11.el7_5.3 @updates
[root at drbd1 ~]# yum list installed|grep coro
corosync.x86_64 2.4.3-2.el7_5.1 @updates
corosynclib.x86_64 2.4.3-2.el7_5.1 @updates
[root at drbd1 ~]# yum list installed|grep drbd
drbd90-utils.x86_64 9.3.1-1.el7.elrepo @elrepo
kmod-drbd90.x86_64 9.0.14-1.el7_5.elrepo @elrepo
[root at drbd1 ~]# yum list installed|grep -i scsi
lsscsi.x86_64 0.27-6.el7 @anaconda
scsi-target-utils.x86_64 1.0.55-4.el7 @epel
------------------------------------------------------------------------
3. configurations
3.1 First, the DRBD configuration
------------------------------------------------------------------------
[root at drbd1 ~]# cat /etc/drbd.conf
# You can find an example in /usr/share/doc/drbd.../drbd.conf.example
include "drbd.d/global_common.conf";
include "drbd.d/*.res";
[root at drbd1 ~]# cat /etc/drbd.d/r0.res
resource iscsivg01 {
protocol C;
device /dev/drbd0;
disk /dev/vg0/ipstor0;
flexible-meta-disk internal;
on drbd0.s-ka.local {
#volume 0 {
#device /dev/drbd0;
#disk /dev/vg0/ipstor0;
#flexible-meta-disk internal;
#}
address 192.168.96.45:7788;
}
on drbd1.s-ka.local {
#volume 0 {
#device /dev/drbd0;
#disk /dev/vg0/ipstor0;
#flexible-meta-disk internal;
#}
address 192.168.96.47:7788;
}
}
------------------------------------------------------------------------
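For completeness, the usual DRBD 9 bring-up for such a resource looks roughly like this (a sketch; the first two commands run on both nodes, the forced initial promotion on one node only):

```shell
# On both nodes: write metadata and bring the resource up
drbdadm create-md iscsivg01
drbdadm up iscsivg01
# On ONE node only: force the initial promotion so the first full sync can start
drbdadm primary --force iscsivg01
# Watch the sync progress and role state
drbdadm status iscsivg01
```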
3.2 then the drbd device
------------------------------------------------------------------------
[root at drbd1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 25G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 24G 0 part
├─centos-root 253:0 0 22G 0 lvm /
└─centos-swap 253:1 0 2G 0 lvm [SWAP]
sdb 8:16 0 500G 0 disk
└─sdb1 8:17 0 500G 0 part
└─vg0-ipstor0 253:2 0 500G 0 lvm
└─drbd0 147:0 0 500G 1 disk
sr0 11:0 1 1024M 0 rom
[root at drbd1 ~]# tree /dev/drbd
/dev/drbd
├── by-disk
│ └── vg0
│ └── ipstor0 -> ../../../drbd0
└── by-res
└── iscsivg01
└── 0 -> ../../../drbd0
4 directories, 2 files
------------------------------------------------------------------------
3.3 DRBD status
------------------------------------------------------------------------
[root at drbd1 ~]# drbdadm status
iscsivg01 role:Secondary
disk:UpToDate
drbd0.s-ka.local role:Primary
peer-disk:UpToDate
[root at drbd0 ~]# drbdadm status
iscsivg01 role:Primary
disk:UpToDate
drbd1.s-ka.local role:Secondary
peer-disk:UpToDate
[root at drbd0 ~]# cat /proc/drbd
version: 9.0.14-1 (api:2/proto:86-113)
GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by mockbuild@,
2018-05-04 03:32:42
Transports (api:16): tcp (9.0.14-1)
------------------------------------------------------------------------
3.4 Corosync configuration
------------------------------------------------------------------------
[root at drbd0 corosync]# cat /etc/corosync/corosync.conf
totem {
version: 2
cluster_name: cluster1
secauth: off
transport: udpu
}
nodelist {
node {
ring0_addr: drbd0-ha.s-ka.local
nodeid: 1
}
node {
ring0_addr: drbd1-ha.s-ka.local
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
------------------------------------------------------------------------
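On CentOS 7 such a corosync.conf is normally generated by pcs rather than written by hand — roughly like this (a sketch, assuming the hacluster user is already set up on both nodes):

```shell
# Authenticate pcsd between the nodes (prompts for the hacluster password)
pcs cluster auth drbd0-ha.s-ka.local drbd1-ha.s-ka.local -u hacluster
# Generate corosync.conf (udpu transport, two_node quorum) and distribute it
pcs cluster setup --name cluster1 drbd0-ha.s-ka.local drbd1-ha.s-ka.local --transport udpu
# Start corosync and pacemaker on both nodes
pcs cluster start --all
```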
3.5 Corosync status:
------------------------------------------------------------------------
[root at drbd0 corosync]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled;
vendor preset: disabled)
Active: active (running) since Sun 2018-10-14 02:58:01 CEST; 2 days ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Process: 1095 ExecStart=/usr/share/corosync/corosync start
(code=exited, status=0/SUCCESS)
Main PID: 1167 (corosync)
CGroup: /system.slice/corosync.service
└─1167 corosync
Oct 14 02:58:00 drbd0.s-ka.local corosync[1167]: [MAIN ] Completed
service synchronization, ready to provide service.
Oct 14 02:58:01 drbd0.s-ka.local corosync[1095]: Starting Corosync
Cluster Engine (corosync): [ OK ]
Oct 14 02:58:01 drbd0.s-ka.local systemd[1]: Started Corosync Cluster
Engine.
Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]: [TOTEM ] A new
membership (192.168.96.45:384) was formed. Members left: 2
Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]: [QUORUM] Members[1]: 1
Oct 14 10:46:03 drbd0.s-ka.local corosync[1167]: [MAIN ] Completed
service synchronization, ready to provide service.
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [TOTEM ] A new
membership (192.168.96.45:388) was formed. Members joined: 2
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [CPG ] downlist
left_list: 0 received in state 0
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [QUORUM] Members[2]: 1 2
Oct 14 10:46:22 drbd0.s-ka.local corosync[1167]: [MAIN ] Completed
service synchronization, ready to provide service.
------------------------------------------------------------------------
3.6 tgtd configuration:
------------------------------------------------------------------------
[root at drbd0 corosync]# cat /etc/tgt/targets.conf
# This is a sample config file for tgt-admin.
#
# The "#" symbol disables the processing of a line.
# Set the driver. If not specified, defaults to "iscsi".
default-driver iscsi
# Set iSNS parameters, if needed
#iSNSServerIP 192.168.111.222
#iSNSServerPort 3205
#iSNSAccessControl On
#iSNS On
# Continue if tgtadm exits with non-zero code (equivalent of
# --ignore-errors command line option)
#ignore-errors yes
<target iqn.2018-08.s-ka.local:disk.1>
lun 10
backing-store /dev/drbd0
initiator-address 192.168.96.0/24
initiator-address 192.168.95.0/24
target-address 192.168.95.48
</target>
------------------------------------------------------------------------
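Note that the iSCSITarget/iSCSILogicalUnit resource agents do not read targets.conf; with implementation=tgt they drive tgtadm directly, roughly equivalent to the following runtime calls (a sketch mirroring the targets.conf above):

```shell
# Create target tid 1 with the IQN from targets.conf
tgtadm --lld iscsi --op new --mode target --tid 1 \
    --targetname iqn.2018-08.s-ka.local:disk.1
# Attach the DRBD device as LUN 10
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 10 \
    --backing-store /dev/drbd0
# Allow the initiator networks
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.95.0/24
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.96.0/24
```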
3.7 tgtd is disabled on both servers and can only be started on the
current DRBD primary node.
------------------------------------------------------------------------
Secondary Node:
[root at drbd1 ~]# systemctl status tgtd
● tgtd.service - tgtd iSCSI target daemon
Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
vendor preset: disabled)
Active: inactive (dead)
[root at drbd1 ~]# systemctl restart tgtd
Job for tgtd.service failed because the control process exited with
error code. See "systemctl status tgtd.service" and "journalctl -xe" for
details.
Primary Node:
[root at drbd0 corosync]# systemctl status tgtd
● tgtd.service - tgtd iSCSI target daemon
Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
vendor preset: disabled)
Active: inactive (dead)
[root at drbd0 corosync]# systemctl restart tgtd
[root at drbd0 corosync]# systemctl status tgtd
● tgtd.service - tgtd iSCSI target daemon
Loaded: loaded (/usr/lib/systemd/system/tgtd.service; disabled;
vendor preset: disabled)
Active: active (running) since Tue 2018-10-16 14:09:47 CEST; 2min 29s
ago
Process: 22300 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
--name State -v ready (code=exited, status=0/SUCCESS)
Process: 22272 ExecStartPost=/usr/sbin/tgt-admin -e -c $TGTD_CONFIG
(code=exited, status=0/SUCCESS)
Process: 22271 ExecStartPost=/usr/sbin/tgtadm --op update --mode sys
--name State -v offline (code=exited, status=0/SUCCESS)
Process: 22270 ExecStartPost=/bin/sleep 5 (code=exited, status=0/SUCCESS)
Main PID: 22269 (tgtd)
CGroup: /system.slice/tgtd.service
└─22269 /usr/sbin/tgtd -f
Oct 16 14:09:42 drbd0.s-ka.local systemd[1]: Starting tgtd iSCSI target
daemon...
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: iser_ib_init(3436)
Failed to initialize RDMA; load kernel modules?
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
work_timer_start(146) use timer_fd based scheduler
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd:
bs_init_signalfd(267) could not open backing-store module directory
/usr/lib64/tgt/backing-store
Oct 16 14:09:42 drbd0.s-ka.local tgtd[22269]: tgtd: bs_init(386) use
signalfd notification
Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: device_mgmt(246)
sz:16 params:path=/dev/drbd0
Oct 16 14:09:47 drbd0.s-ka.local tgtd[22269]: tgtd: bs_thread_open(408) 16
Oct 16 14:09:47 drbd0.s-ka.local systemd[1]: Started tgtd iSCSI target
daemon.
------------------------------------------------------------------------
3.8 Up to this point everything was working, but when I switched the DRBD
primary node, it stopped working (the file system of the test VM became
read-only).
So I changed the pcs configuration according to the previously mentioned
article:
------------------------------------------------------------------------
pcs resource create p_iscsivg01 ocf:heartbeat:LVM volgrpname="vg0" op monitor interval="30"
pcs resource group add p_iSCSI p_iscsivg01 p_iSCSITarget p_iSCSILogicalUnit ClusterIP
pcs constraint order start ipstor0Clone then start p_iSCSI then start ipstor0Clone:Master
[root at drbd0 ~]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version
1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sun Oct 14 01:38:18 2018
Last change: Sun Oct 14 01:37:58 2018 by root via cibadmin on
drbd0-ha.s-ka.local
2 nodes configured
6 resources configured
Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
Full list of resources:
Master/Slave Set: ipstor0Clone [ipstor0]
Masters: [ drbd0-ha.s-ka.local ]
Slaves: [ drbd1-ha.s-ka.local ]
Resource Group: p_iSCSI
p_iscsivg01 (ocf::heartbeat:LVM): Stopped
p_iSCSITarget (ocf::heartbeat:iSCSITarget): Stopped
p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Stopped
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Failed Actions:
* p_iSCSILogicalUnit_start_0 on drbd0-ha.s-ka.local 'unknown error'
(1): call=42, status=complete, exitreason='',
last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=28ms
* p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=40, status=complete, exitreason='',
last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=23ms
* p_iscsivg01_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=48, status=complete, exitreason='Volume group [iscsivg01] does not
exist or contains error! Volume group "iscsivg01" not found',
last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=47ms
* p_iSCSILogicalUnit_start_0 on drbd1-ha.s-ka.local 'unknown error'
(1): call=41, status=complete, exitreason='',
last-rc-change='Sun Oct 14 01:20:38 2018', queued=0ms, exec=31ms
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=39, status=complete, exitreason='',
last-rc-change='Sun Oct 14 00:54:36 2018', queued=0ms, exec=24ms
* p_iscsivg01_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=47, status=complete, exitreason='Volume group [iscsivg01] does not
exist or contains error! Volume group "iscsivg01" not found',
last-rc-change='Sun Oct 14 01:32:49 2018', queued=0ms, exec=50ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root at drbd0 ~]#
------------------------------------------------------------------------
3.9 Because of the "device not found" error, I removed the LVM resource;
it looks like this now.
I actually also switched between /dev/drbd/by-disk and /dev/drbd/by-res,
but with no effect:
------------------------------------------------------------------------
[root at drbd0 corosync]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
partition with quorum
Last updated: Tue Oct 16 14:18:09 2018
Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
drbd0-ha.s-ka.local
2 nodes configured
5 resources configured
Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
Full list of resources:
Master/Slave Set: ipstor0Clone [ipstor0]
Masters: [ drbd0-ha.s-ka.local ]
Slaves: [ drbd1-ha.s-ka.local ]
Resource Group: p_iSCSI
p_iSCSITarget (ocf::heartbeat:iSCSITarget): Stopped
p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Stopped
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Failed Actions:
* p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root at drbd0 corosync]#
------------------------------------------------------------------------
3.10 I've tried "pcs resource debug-start xxx --full" on the DRBD
primary node:
------------------------------------------------------------------------
[root at drbd0 corosync]# pcs resource debug-start p_iSCSI --full
Error: unable to debug-start a group, try one of the group's resource(s)
(p_iSCSITarget,p_iSCSILogicalUnit,ClusterIP)
[root at drbd0 corosync]# pcs resource debug-start p_iSCSITarget --full
Operation start for p_iSCSITarget (ocf:heartbeat:iSCSITarget) returned:
'ok' (0)
> stderr: DEBUG: p_iSCSITarget start : 0
[root at drbd0 corosync]# pcs resource debug-start p_iSCSILogicalUnit --full
Operation start for p_iSCSILogicalUnit (ocf:heartbeat:iSCSILogicalUnit)
returned: 'unknown error' (1)
> stderr: ERROR: tgtadm: this logical unit number already exists
[root at drbd0 corosync]# pcs resource debug-start ClusterIP --full
Operation start for ClusterIP (ocf:heartbeat:IPaddr2) returned: 'ok' (0)
> stderr: INFO: Adding inet address 192.168.95.48/32 with broadcast
address 192.168.95.255 to device ens192
> stderr: INFO: Bringing device ens192 up
> stderr: INFO: /usr/libexec/heartbeat/send_arp -i 200 -c 5 -p
/var/run/resource-agents/send_arp-192.168.95.48 -I ens192 -m auto
192.168.95.48
[root at drbd0 corosync]#
------------------------------------------------------------------------
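The "logical unit number already exists" error suggests leftover state from a target started outside Pacemaker; the running tgtd's view can be inspected and cleaned with tgtadm, for example (a sketch, using the tid/lun from the config above):

```shell
# Show all targets, LUNs and bound initiators known to the running tgtd
tgtadm --lld iscsi --op show --mode target
# If a stale LUN/target is left over from a manual start, delete it
tgtadm --lld iscsi --op delete --mode logicalunit --tid 1 --lun 10
tgtadm --lld iscsi --op delete --mode target --tid 1
```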
3.11 As you can see, there are errors, but "p_iSCSITarget" was
successfully started. However, "pcs status" still shows it as "Stopped":
------------------------------------------------------------------------
[root at drbd0 corosync]# pcs status
Cluster name: cluster1
Stack: corosync
Current DC: drbd0-ha.s-ka.local (version 1.1.18-11.el7_5.3-2b07d5c5a9) -
partition with quorum
Last updated: Tue Oct 16 14:22:38 2018
Last change: Sun Oct 14 02:06:36 2018 by root via cibadmin on
drbd0-ha.s-ka.local
2 nodes configured
5 resources configured
Online: [ drbd0-ha.s-ka.local drbd1-ha.s-ka.local ]
Full list of resources:
Master/Slave Set: ipstor0Clone [ipstor0]
Masters: [ drbd0-ha.s-ka.local ]
Slaves: [ drbd1-ha.s-ka.local ]
Resource Group: p_iSCSI
p_iSCSITarget (ocf::heartbeat:iSCSITarget): Stopped
p_iSCSILogicalUnit (ocf::heartbeat:iSCSILogicalUnit): Stopped
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Failed Actions:
* p_iSCSITarget_start_0 on drbd0-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
last-rc-change='Sun Oct 14 02:58:04 2018', queued=1ms, exec=58ms
* p_iSCSITarget_start_0 on drbd1-ha.s-ka.local 'unknown error' (1):
call=12, status=complete, exitreason='',
last-rc-change='Sun Oct 14 10:47:06 2018', queued=0ms, exec=22ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root at drbd0 corosync]#
------------------------------------------------------------------------
3.12 the pcs config is:
------------------------------------------------------------------------
[root at drbd0 corosync]# pcs config
Cluster Name: cluster1
Corosync Nodes:
drbd0-ha.s-ka.local drbd1-ha.s-ka.local
Pacemaker Nodes:
drbd0-ha.s-ka.local drbd1-ha.s-ka.local
Resources:
Master: ipstor0Clone
Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1
clone-node-max=1
Resource: ipstor0 (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=iscsivg01
Operations: demote interval=0s timeout=90 (ipstor0-demote-interval-0s)
monitor interval=60s (ipstor0-monitor-interval-60s)
notify interval=0s timeout=90 (ipstor0-notify-interval-0s)
promote interval=0s timeout=90 (ipstor0-promote-interval-0s)
reload interval=0s timeout=30 (ipstor0-reload-interval-0s)
start interval=0s timeout=240 (ipstor0-start-interval-0s)
stop interval=0s timeout=100 (ipstor0-stop-interval-0s)
Group: p_iSCSI
Resource: p_iSCSITarget (class=ocf provider=heartbeat type=iSCSITarget)
Attributes: implementation=tgt iqn=iqn.2018-08.s-ka.local:disk.1 tid=1
Operations: monitor interval=30 timeout=60
(p_iSCSITarget-monitor-interval-30)
start interval=0 timeout=60 (p_iSCSITarget-start-interval-0)
stop interval=0 timeout=60 (p_iSCSITarget-stop-interval-0)
Resource: p_iSCSILogicalUnit (class=ocf provider=heartbeat
type=iSCSILogicalUnit)
Attributes: implementation=tgt lun=10
path=/dev/drbd/by-disk/vg0/ipstor0 target_iqn=iqn.2018-08.s-ka.local:disk.1
Operations: monitor interval=30 timeout=60
(p_iSCSILogicalUnit-monitor-interval-30)
start interval=0 timeout=60
(p_iSCSILogicalUnit-start-interval-0)
stop interval=0 timeout=60
(p_iSCSILogicalUnit-stop-interval-0)
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=32 ip=192.168.95.48
Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
start interval=0s timeout=20s (ClusterIP-start-interval-0s)
stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
start ipstor0Clone then start p_iSCSI (kind:Mandatory)
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
migration-threshold: 1
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: cluster1
dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
have-watchdog: false
last-lrm-refresh: 1539474248
no-quorum-policy: ignore
stonith-enabled: false
Quorum:
Options:
[root at drbd0 corosync]#
------------------------------------------------------------------------
4. So I am out of ideas and don't know what to do. Should I just dive into
pacemaker's source code??
I hope to get feedback or tips from you. Thank you very much in
advance :)
Best Regards
Zhang
_______________________________________________
Users mailing list: Users at clusterlabs.org[2]
https://lists.clusterlabs.org/mailman/listinfo/users[3]
Project Home: http://www.clusterlabs.org[4]
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf[5]
Bugs: http://bugs.clusterlabs.org[6]
--------
[1] https://nnc3.com/mags/LJ_1994-2014/LJ/217/11275.html
[2] mailto:Users at clusterlabs.org
[3] https://lists.clusterlabs.org/mailman/listinfo/users
[4] http://www.clusterlabs.org
[5] http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
[6] http://bugs.clusterlabs.org