[ClusterLabs] Pacemaker not starting ISCSI LUNs and Targets

John Keates john at keates.nl
Mon Aug 21 17:19:58 EDT 2017


Hi,

I have a strange issue where LIO-T based iSCSI targets and LUNs simply don’t work most of the time. They either don’t start, or they bounce between nodes until there are no more nodes left to try.
The less-than-useful information in the logs looks like this:

Aug 21 22:49:06 [10531] storage-1-prod    pengine:  warning: check_migration_threshold:	Forcing iscsi0-target away from storage-1-prod after 1000000 failures (max=1000000)

Aug 21 22:54:47 storage-1-prod crmd[2757]:   notice: Result of start operation for ip-iscsi0-vlan40 on storage-1-prod: 0 (ok)
Aug 21 22:54:47 storage-1-prod iSCSITarget(iscsi0-target)[5427]: WARNING: Configuration parameter "tid" is not supported by the iSCSI implementation and will be ignored.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Parameter auto_add_default_portal is now 'false'.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: INFO: Created target iqn.2017-08.acccess.net:prod-1-ha. Created TPG 1.
Aug 21 22:54:48 storage-1-prod iSCSITarget(iscsi0-target)[5427]: ERROR: This Target already exists in configFS
Aug 21 22:54:48 storage-1-prod crmd[2757]:   notice: Result of start operation for iscsi0-target on storage-1-prod: 1 (unknown error)
Aug 21 22:54:49 storage-1-prod iSCSITarget(iscsi0-target)[5536]: INFO: Deleted Target iqn.2017-08.access.net:prod-1-ha.
Aug 21 22:54:49 storage-1-prod crmd[2757]:   notice: Result of stop operation for iscsi0-target on storage-1-prod: 0 (ok)

Now, the unknown error actually seems to be a targetcli type of error: "This Target already exists in configFS". Checking with targetcli shows zero configured items on either node.
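Since targetcli reports nothing while the agent claims the target already exists, one way to cross-check is to look at configfs directly. The following is only a minimal sketch: it assumes the kernel target's usual configfs location, /sys/kernel/config/target/iscsi, and the function name list_configfs_targets is made up for illustration.

```shell
#!/bin/sh
# List the iSCSI target IQNs that currently exist in LIO's configfs tree.
# Assumption: configfs is mounted at /sys/kernel/config. The optional
# argument only exists so the function can be pointed at another directory.
list_configfs_targets() {
    root="${1:-/sys/kernel/config/target/iscsi}"
    # No iSCSI fabric directory means nothing is configured.
    [ -d "$root" ] || return 0
    for entry in "$root"/iqn.*; do
        # An unmatched glob stays literal, so keep real directories only.
        [ -d "$entry" ] && basename "$entry"
    done
    return 0
}
```

Running `list_configfs_targets` on each node right after a failed start would show whether a target entry was left behind that targetcli does not report.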
Manually starting the target gives:


john at storage-1-prod:~$ sudo pcs resource debug-start iscsi0-target
Error performing operation: Operation not permitted
Operation start for iscsi0-target (ocf:heartbeat:iSCSITarget) returned 1
 >  stderr: WARNING: Configuration parameter "tid" is not supported by the iSCSI implementation and will be ignored.
 >  stderr: INFO: Parameter auto_add_default_portal is now 'false'.
 >  stderr: INFO: Created target iqn.2017-08.access.net:prod-1-ha. Created TPG 1.
 >  stderr: ERROR: This Target already exists in configFS

But now targetcli at least shows the target, while crm status still shows it as stopped.
Manually starting the LUNs gives:


john at storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun0
Operation start for iscsi0-lun0 (ocf:heartbeat:iSCSILogicalUnit) returned 0
 >  stderr: INFO: Created block storage object iscsi0-lun0 using /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-root.
 >  stderr: INFO: Created LUN 0.
 >  stderr: DEBUG: iscsi0-lun0 start : 0
john at storage-1-prod:~$ sudo pcs resource debug-start iscsi0-lun1
Operation start for iscsi0-lun1 (ocf:heartbeat:iSCSILogicalUnit) returned 0
 >  stderr: INFO: Created block storage object iscsi0-lun1 using /dev/zvol/iscsipool0/iscsi/net.access.prod-1-ha-swap.
 >  stderr: /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit: line 378: /sys/kernel/config/target/core/iblock_0/iscsi0-lun1/wwn/vpd_unit_serial: No such file or directory
 >  stderr: INFO: Created LUN 1.
 >  stderr: DEBUG: iscsi0-lun1 start : 0

So the second LUN seems to get some bad parameters from the iSCSILogicalUnit script. Checking with targetcli, however, shows both LUNs and the target up and running.
Checking again with crm status (and pcs status) shows all three resources still stopped. Since LUNs are colocated with the target and the target still has fail counts, I clear them with:

sudo pcs resource cleanup iscsi0-target

Now the LUNs and target are all active in crm status / pcs status. But it’s quite a manual process to get this to work! I’m thinking either my configuration is bad or there is some bug somewhere in targetcli / LIO or the iSCSI heartbeat script.
On top of all the manual work, it still breaks on any action: a move, failover, reboot, etc. instantly breaks it. Everything else (the underlying ZFS pool, the DRBD device, the IPv4 addresses, etc.) moves just fine; only the iSCSI resources are problematic.
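For reference, the manual sequence described above can be written down as a small script. This is only a sketch of the workaround, not a fix: the helper names run and recover_iscsi and the DRY_RUN switch are made up for illustration, and it uses nothing beyond the pcs commands already shown in this mail.

```shell
#!/bin/sh
# Replays the manual sequence from this mail: debug-start the target,
# debug-start each LUN, then clear the target's fail count.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_iscsi() {
    target="$1"
    shift
    # The target's debug-start returned 1 above even though the target
    # ended up in configfs, so its exit status is deliberately ignored.
    run sudo pcs resource debug-start "$target" || true
    for lun in "$@"; do
        run sudo pcs resource debug-start "$lun" || return 1
    done
    run sudo pcs resource cleanup "$target"
}
```

With the resource names from this cluster, the call would be `recover_iscsi iscsi0-target iscsi0-lun0 iscsi0-lun1`; prefixing it with DRY_RUN=1 only prints the four pcs commands.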

Concrete questions:

- Is my config bad?
- Is there a known issue with iSCSI? (I have only found old references about ordering)

I have attached the output of crm config show as cib.txt, and the status output after a fresh boot of both nodes is:

Current DC: storage-2-prod (version 1.1.16-94ff4df) - partition with quorum
Last updated: Mon Aug 21 22:55:05 2017
Last change: Mon Aug 21 22:36:23 2017 by root via cibadmin on storage-1-prod

2 nodes configured
21 resources configured

Online: [ storage-1-prod storage-2-prod ]

Full list of resources:

 ip-iscsi0-vlan10	(ocf::heartbeat:IPaddr2):	Started storage-1-prod
 ip-iscsi0-vlan20	(ocf::heartbeat:IPaddr2):	Started storage-1-prod
 ip-iscsi0-vlan30	(ocf::heartbeat:IPaddr2):	Started storage-1-prod
 ip-iscsi0-vlan40	(ocf::heartbeat:IPaddr2):	Started storage-1-prod
 Master/Slave Set: drbd_master_slave0 [drbd_disk0]
     Masters: [ storage-1-prod ]
     Slaves: [ storage-2-prod ]
 Master/Slave Set: drbd_master_slave1 [drbd_disk1]
     Masters: [ storage-2-prod ]
     Slaves: [ storage-1-prod ]
 ip-iscsi1-vlan10	(ocf::heartbeat:IPaddr2):	Started storage-2-prod
 ip-iscsi1-vlan20	(ocf::heartbeat:IPaddr2):	Started storage-2-prod
 ip-iscsi1-vlan30	(ocf::heartbeat:IPaddr2):	Started storage-2-prod
 ip-iscsi1-vlan40	(ocf::heartbeat:IPaddr2):	Started storage-2-prod
 st-storage-1-prod	(stonith:meatware):	Started storage-2-prod
 st-storage-2-prod	(stonith:meatware):	Started storage-1-prod
 zfs-iscsipool0	(ocf::heartbeat:ZFS):	Started storage-1-prod
 zfs-iscsipool1	(ocf::heartbeat:ZFS):	Started storage-2-prod
 iscsi0-lun0	(ocf::heartbeat:iSCSILogicalUnit):	Stopped
 iscsi0-lun1	(ocf::heartbeat:iSCSILogicalUnit):	Stopped
 iscsi0-target	(ocf::heartbeat:iSCSITarget):	Stopped
 Clone Set: dlm-clone [dlm]
     Started: [ storage-1-prod storage-2-prod ]

Failed Actions:
* iscsi0-target_start_0 on storage-2-prod 'unknown error' (1): call=99, status=complete, exitreason='none',
    last-rc-change='Mon Aug 21 22:54:49 2017', queued=0ms, exec=954ms
* iscsi0-target_start_0 on storage-1-prod 'unknown error' (1): call=98, status=complete, exitreason='none',
    last-rc-change='Mon Aug 21 22:54:47 2017', queued=0ms, exec=1062ms

Regards,
John
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cib.txt
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20170821/c4d5a8a2/attachment-0002.txt>

