[ClusterLabs] Major problem with iSCSITarget resource on top of DRBD M/S resource.

Alex Crow alex at nanogherkin.com
Sun Sep 27 11:02:07 EDT 2015

On 27/09/15 15:54, Digimer wrote:
> On 27/09/15 10:40 AM, Alex Crow wrote:
>> Hi List,
>>
>> I'm trying to set up a failover iSCSI storage system for oVirt using a
>> self-hosted engine. I've set up DRBD in Master-Slave for two iSCSI
>> targets, one for the self-hosted engine and one for the VMs. I had this
>> all working perfectly, then after trying to move the engine's LUN to the
>> opposite host, all hell broke loose. The VMS LUN is still fine, starts
> I'm guessing no fencing?

Hi Digimer,

No, but I've tried turning off one machine and still had no success 
running as a single node :-(
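For what it's worth, when a target insists on starting on the wrong node it can help to first confirm what each node believes the DRBD state is before chasing constraints. A quick sketch (the DRBD resource names `engine` and `vms` are guesses based on the Pacemaker resource IDs; adjust to whatever is in /etc/drbd.d/):

```shell
# On each node, ask DRBD directly which role it holds for each resource.
# Seeing "Primary/Secondary" on the node you expect to be master confirms
# Pacemaker's constraints are being evaluated against the right state.
drbdadm role engine
drbdadm role vms

# Cross-check against Pacemaker's own view of the cluster:
crm_mon -1
```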

>
>> and migrates as it should. However the engine LUN always seems to try to
>> launch the target on the host that is *NOT* the master of the DRBD
>> resource. My constraints look fine, and should be self-explanatory about
>> which is which:
>>
>> [root@granby ~]# pcs constraint --full
>> Location Constraints:
>> Ordering Constraints:
>>    promote drbd-vms-iscsi then start iscsi-vms-ip (kind:Mandatory)
>> (id:vm_iscsi_ip_after_drbd)
>>    start iscsi-vms-target then start iscsi-vms-lun (kind:Mandatory)
>> (id:vms_lun_after_target)
>>    promote drbd-vms-iscsi then start iscsi-vms-target (kind:Mandatory)
>> (id:vms_target_after_drbd)
>>    promote drbd-engine-iscsi then start iscsi-engine-ip (kind:Mandatory)
>> (id:ip_after_drbd)
>>    start iscsi-engine-target then start iscsi-engine-lun (kind:Mandatory)
>> (id:lun_after_target)
>>    promote drbd-engine-iscsi then start iscsi-engine-target
>> (kind:Mandatory) (id:target_after_drbd)
>> Colocation Constraints:
>>    iscsi-vms-ip with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
>> (with-rsc-role:Master) (id:vms_ip-with-drbd)
>>    iscsi-vms-lun with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
>> (with-rsc-role:Master) (id:vms_lun-with-drbd)
>>    iscsi-vms-target with drbd-vms-iscsi (score:INFINITY)
>> (rsc-role:Started) (with-rsc-role:Master) (id:vms_target-with-drbd)
>>    iscsi-engine-ip with drbd-engine-iscsi (score:INFINITY)
>> (rsc-role:Started) (with-rsc-role:Master) (id:ip-with-drbd)
>>    iscsi-engine-lun with drbd-engine-iscsi (score:INFINITY)
>> (rsc-role:Started) (with-rsc-role:Master) (id:lun-with-drbd)
>>    iscsi-engine-target with drbd-engine-iscsi (score:INFINITY)
>> (rsc-role:Started) (with-rsc-role:Master) (id:target-with-drbd)
>>
>> But see this from pcs status: the iSCSI target has FAILED on glenrock,
>> yet the DRBD master is on granby:
>>
>> [root@granby ~]# pcs status
>> Cluster name: storage
>> Last updated: Sun Sep 27 15:30:08 2015
>> Last change: Sun Sep 27 15:20:58 2015
>> Stack: cman
>> Current DC: glenrock - partition with quorum
>> Version: 1.1.11-97629de
>> 2 Nodes configured
>> 10 Resources configured
>>
>>
>> Online: [ glenrock granby ]
>>
>> Full list of resources:
>>
>>   Master/Slave Set: drbd-vms-iscsi [drbd-vms]
>>       Masters: [ glenrock ]
>>       Slaves: [ granby ]
>>   iscsi-vms-target    (ocf::heartbeat:iSCSITarget): Started glenrock
>>   iscsi-vms-lun    (ocf::heartbeat:iSCSILogicalUnit): Started glenrock
>>   iscsi-vms-ip    (ocf::heartbeat:IPaddr2):    Started glenrock
>>   Master/Slave Set: drbd-engine-iscsi [drbd-engine]
>>       Masters: [ granby ]
>>       Slaves: [ glenrock ]
>>   iscsi-engine-target    (ocf::heartbeat:iSCSITarget): FAILED glenrock
>> (unmanaged)
>>   iscsi-engine-ip    (ocf::heartbeat:IPaddr2):    Stopped
>>   iscsi-engine-lun    (ocf::heartbeat:iSCSILogicalUnit): Stopped
>>
>> Failed actions:
>>      iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
>> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
>> queued=0ms, exec=10003ms
>>      iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
>> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
>> queued=0ms, exec=10003ms
>>
>> I have tried various combinations of pcs resource clear and cleanup, but
>> they all result in the same outcome, apart from some occasions when
>> one or other of the two hosts suddenly reboots!
>>
>> Here is a log right after a "pcs resource cleanup" - first on the master
>> for the DRBD m/s resource:
>> [root@granby ~]# pcs resource cleanup; tail -f /var/log/messages
>> All resources/stonith devices successfully cleaned up
>> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
>> granby-drbd-engine_monitor_0:117 [ \n ]
>> Sep 27 15:33:42 granby attrd[3356]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (true)
>> Sep 27 15:33:42 granby attrd[3356]:   notice: attrd_perform_update: Sent
>> update 54: probe_complete=true
>> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
>> Operation drbd-engine_monitor_10000: master (node=granby, call=131,
>> rc=8, cib-update=83, confirmed=false)
>> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
>> granby-drbd-engine_monitor_10000:131 [ \n ]
>> Sep 27 15:33:42 granby crmd[3358]:   notice: process_lrm_event:
>> Operation drbd-vms_monitor_20000: ok (node=granby, call=130, rc=0,
>> cib-update=84, confirmed=false)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: do_lrm_invoke: Forcing the
>> status of all resources to be redetected
>> Sep 27 15:34:46 granby attrd[3356]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (<null>)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> granby-drbd-engine_monitor_10000:131 [ \n ]
>> Sep 27 15:34:46 granby attrd[3356]:   notice: attrd_perform_update: Sent
>> delete 61: node=granby, attr=probe_complete, id=<n/a>, set=(null),
>> section=status
>> Sep 27 15:34:46 granby attrd[3356]:   notice: attrd_perform_update: Sent
>> delete 63: node=granby, attr=probe_complete, id=<n/a>, set=(null),
>> section=status
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation iscsi-vms-target_monitor_0: not running (node=granby,
>> call=150, rc=7, cib-update=94, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation iscsi-vms-lun_monitor_0: not running (node=granby, call=154,
>> rc=7, cib-update=95, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation iscsi-engine-target_monitor_0: not running (node=granby,
>> call=167, rc=7, cib-update=96, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation iscsi-engine-lun_monitor_0: not running (node=granby,
>> call=175, rc=7, cib-update=97, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation iscsi-vms-ip_monitor_0: not running (node=granby, call=158,
>> rc=7, cib-update=98, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation iscsi-engine-ip_monitor_0: not running (node=granby, call=171,
>> rc=7, cib-update=99, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation drbd-vms_monitor_0: ok (node=granby, call=146, rc=0,
>> cib-update=100, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> Operation drbd-engine_monitor_0: master (node=granby, call=163, rc=8,
>> cib-update=101, confirmed=true)
>> Sep 27 15:34:46 granby crmd[3358]:   notice: process_lrm_event:
>> granby-drbd-engine_monitor_0:163 [ \n ]
>> Sep 27 15:34:46 granby attrd[3356]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (true)
>> Sep 27 15:34:46 granby attrd[3356]:   notice: attrd_perform_update: Sent
>> update 67: probe_complete=true
>> Sep 27 15:34:47 granby crmd[3358]:   notice: process_lrm_event:
>> Operation drbd-vms_monitor_20000: ok (node=granby, call=176, rc=0,
>> cib-update=102, confirmed=false)
>> Sep 27 15:34:47 granby crmd[3358]:   notice: process_lrm_event:
>> Operation drbd-engine_monitor_10000: master (node=granby, call=177,
>> rc=8, cib-update=103, confirmed=false)
>> Sep 27 15:34:47 granby crmd[3358]:   notice: process_lrm_event:
>> granby-drbd-engine_monitor_10000:177 [ \n ]
>>
>> Now on the slave:
>>
>>
>> [root@glenrock ~]# pcs resource cleanup; tail -f /var/log/messages
>> All resources/stonith devices successfully cleaned up
>> Sep 27 15:34:57 glenrock pengine[3365]:  warning: unpack_rsc_op_failure:
>> Processing failed op stop for iscsi-engine-target on glenrock: unknown
>> error (1)
>> Sep 27 15:34:57 glenrock pengine[3365]:  warning:
>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>> after 1000000 failures (max=1000000)
>> Sep 27 15:34:57 glenrock pengine[3365]:   notice: process_pe_message:
>> Calculated Transition 50: /var/lib/pacemaker/pengine/pe-input-533.bz2
>> Sep 27 15:34:57 glenrock crmd[3366]:   notice: run_graph: Transition 50
>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-533.bz2): Complete
>> Sep 27 15:34:57 glenrock crmd[3366]:   notice: do_state_transition:
>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: do_lrm_invoke: Forcing
>> the status of all resources to be redetected
>> Sep 27 15:35:49 glenrock attrd[3364]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (<null>)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> glenrock-drbd-vms_monitor_10000:278 [ \n ]
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: qb_ipcs_event_sendv:
>> new_event_notification (3366-853-16): Broken pipe (32)
> Something is starting to go wrong here

Yes, it does look like it. But why does the other iSCSI resource work 
fine while this one has seemingly gone mad?
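One plausible explanation, as a hedged sketch rather than a definitive answer: the glenrock log shows a failed *stop* operation, which sets an INFINITY failcount ("Forcing iscsi-engine-target away from glenrock after 1000000 failures"), and a pinned failcount can leave the resource with nowhere valid to run even though the colocation constraints are correct. On this pcs version, something like the following might be worth trying (the 60s stop timeout is an arbitrary example value, chosen only because the log shows the stop timing out at the 10s default):

```shell
# Show and clear the failcount that is pinning iscsi-engine-target
# away from glenrock after the failed stop.
pcs resource failcount show iscsi-engine-target
pcs resource cleanup iscsi-engine-target

# The stop timed out after 10000ms while the RA was still retrying
# target removal; a longer stop timeout gives tgtadm more time to
# tear the target down before lrmd kills the agent.
pcs resource update iscsi-engine-target op stop timeout=60s
```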

>
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: do_state_transition:
>> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Sep 27 15:35:49 glenrock attrd[3364]:   notice: attrd_perform_update:
>> Sent delete 195: node=glenrock, attr=probe_complete, id=<n/a>,
>> set=(null), section=status
>> Sep 27 15:35:49 glenrock attrd[3364]:   notice: attrd_perform_update:
>> Sent delete 197: node=glenrock, attr=probe_complete, id=<n/a>,
>> set=(null), section=status
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: unpack_config: On loss
>> of CCM Quorum: Ignore
>> Sep 27 15:35:49 glenrock pengine[3365]:  warning:
>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>> after 1000000 failures (max=1000000)
> The node should probably have been fenced at this point.

Agreed.

>
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: LogActions: Start
>> drbd-vms:0#011(glenrock)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: LogActions: Start
>> drbd-vms:1#011(granby)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: LogActions: Start
>> drbd-engine:0#011(glenrock)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: LogActions: Start
>> drbd-engine:1#011(granby)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: process_pe_message:
>> Calculated Transition 51: /var/lib/pacemaker/pengine/pe-input-534.bz2
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 4: monitor drbd-vms:0_monitor_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 13: monitor drbd-vms:1_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 14: monitor iscsi-vms-target_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 5: monitor iscsi-vms-target_monitor_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 15: monitor iscsi-vms-lun_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 6: monitor iscsi-vms-lun_monitor_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 16: monitor iscsi-vms-ip_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 7: monitor iscsi-vms-ip_monitor_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 8: monitor drbd-engine:0_monitor_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 17: monitor drbd-engine:1_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 18: monitor iscsi-engine-target_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 9: monitor iscsi-engine-target_monitor_0 on glenrock
>> (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 19: monitor iscsi-engine-ip_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 10: monitor iscsi-engine-ip_monitor_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 20: monitor iscsi-engine-lun_monitor_0 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 11: monitor iscsi-engine-lun_monitor_0 on glenrock
>> (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-vms-target_monitor_0: ok (node=glenrock, call=305, rc=0,
>> cib-update=377, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 5
>> (iscsi-vms-target_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>> Error
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: abort_transition_graph:
>> Transition aborted by iscsi-vms-target_monitor_0 'create' on (null):
>> Event failed (magic=0:0;5:51:7:4ac61a75-532d-45c4-b07c-eee753699adc,
>> cib=0.374.121, source=match_graph_event:344, 0)
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 5
>> (iscsi-vms-target_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>> Error
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-vms-lun_monitor_0: ok (node=glenrock, call=309, rc=0,
>> cib-update=378, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-engine-target_monitor_0: ok (node=glenrock, call=322,
>> rc=0, cib-update=379, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 6
>> (iscsi-vms-lun_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 6
>> (iscsi-vms-lun_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 9
>> (iscsi-engine-target_monitor_0) on glenrock failed (target: 7 vs. rc:
>> 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 9
>> (iscsi-engine-target_monitor_0) on glenrock failed (target: 7 vs. rc:
>> 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-engine-lun_monitor_0: not running (node=glenrock,
>> call=330, rc=7, cib-update=380, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-engine-ip_monitor_0: not running (node=glenrock,
>> call=326, rc=7, cib-update=381, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-vms-ip_monitor_0: ok (node=glenrock, call=313, rc=0,
>> cib-update=382, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 13
>> (drbd-vms:1_monitor_0) on granby failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 13
>> (drbd-vms:1_monitor_0) on granby failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 17
>> (drbd-engine:1_monitor_0) on granby failed (target: 7 vs. rc: 8): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 17
>> (drbd-engine:1_monitor_0) on granby failed (target: 7 vs. rc: 8): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 12: probe_complete probe_complete-granby on granby -
>> no waiting
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 7
>> (iscsi-vms-ip_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 7
>> (iscsi-vms-ip_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: abort_transition_graph:
>> Transition aborted by status-granby-probe_complete, probe_complete=true:
>> Transient attribute change (create cib=0.374.135,
>> source=te_update_diff:391,
>> path=/cib/status/node_state[@id='granby']/transient_attributes[@id='granby']/instance_attributes[@id='status-granby'],
>> 0)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation drbd-engine_monitor_0: ok (node=glenrock, call=318, rc=0,
>> cib-update=383, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation drbd-vms_monitor_0: master (node=glenrock, call=301, rc=8,
>> cib-update=384, confirmed=true)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> glenrock-drbd-vms_monitor_0:301 [ \n ]
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 8
>> (drbd-engine:0_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 8
>> (drbd-engine:0_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 4
>> (drbd-vms:0_monitor_0) on glenrock failed (target: 7 vs. rc: 8): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:  warning: status_from_rc: Action 4
>> (drbd-vms:0_monitor_0) on glenrock failed (target: 7 vs. rc: 8): Error
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 3: probe_complete probe_complete-glenrock on glenrock
>> (local) - no waiting
>> Sep 27 15:35:49 glenrock attrd[3364]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (true)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: run_graph: Transition 51
>> (Complete=24, Pending=0, Fired=0, Skipped=9, Incomplete=10,
>> Source=/var/lib/pacemaker/pengine/pe-input-534.bz2): Stopped
>> Sep 27 15:35:49 glenrock attrd[3364]:   notice: attrd_perform_update:
>> Sent update 201: probe_complete=true
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: unpack_config: On loss
>> of CCM Quorum: Ignore
>> Sep 27 15:35:49 glenrock pengine[3365]:  warning:
>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>> after 1000000 failures (max=1000000)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: LogActions: Move
>> iscsi-engine-target#011(Started glenrock -> granby)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: process_pe_message:
>> Calculated Transition 52: /var/lib/pacemaker/pengine/pe-input-535.bz2
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: unpack_config: On loss
>> of CCM Quorum: Ignore
>> Sep 27 15:35:49 glenrock pengine[3365]:  warning:
>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>> after 1000000 failures (max=1000000)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: LogActions: Move
>> iscsi-engine-target#011(Started glenrock -> granby)
>> Sep 27 15:35:49 glenrock pengine[3365]:   notice: process_pe_message:
>> Calculated Transition 53: /var/lib/pacemaker/pengine/pe-input-536.bz2
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 7: monitor drbd-vms_monitor_20000 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 12: monitor drbd-vms_monitor_10000 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 39: monitor iscsi-vms-target_monitor_10000 on glenrock
>> (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 42: monitor iscsi-vms-lun_monitor_10000 on glenrock
>> (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 45: monitor iscsi-vms-ip_monitor_10000 on glenrock
>> (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 50: monitor drbd-engine_monitor_10000 on granby
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 53: monitor drbd-engine_monitor_20000 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: te_rsc_command:
>> Initiating action 78: stop iscsi-engine-target_stop_0 on glenrock (local)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-vms-target_monitor_10000: ok (node=glenrock, call=332,
>> rc=0, cib-update=387, confirmed=false)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-vms-lun_monitor_10000: ok (node=glenrock, call=333,
>> rc=0, cib-update=388, confirmed=false)
>> Sep 27 15:35:49 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation iscsi-vms-ip_monitor_10000: ok (node=glenrock, call=334, rc=0,
>> cib-update=389, confirmed=false)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation drbd-vms_monitor_10000: master (node=glenrock, call=331, rc=8,
>> cib-update=390, confirmed=false)
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> glenrock-drbd-vms_monitor_10000:331 [ \n ]
>> Sep 27 15:35:49 glenrock crmd[3366]:   notice: process_lrm_event:
>> Operation drbd-engine_monitor_20000: ok (node=glenrock, call=335, rc=0,
>> cib-update=391, confirmed=false)
>> Sep 27 15:35:50 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:51 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:52 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:53 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:54 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:55 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:56 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:57 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:58 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>> Sep 27 15:35:59 glenrock lrmd[3363]:  warning: child_timeout_callback:
>> iscsi-engine-target_stop_0 process (PID 1079) timed out
>> Sep 27 15:35:59 glenrock lrmd[3363]:  warning: operation_finished:
>> iscsi-engine-target_stop_0:1079 - timed out after 10000ms
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock lrmd[3363]:   notice: operation_finished:
>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>> Sep 27 15:35:59 glenrock crmd[3366]:    error: process_lrm_event:
>> Operation iscsi-engine-target_stop_0: Timed Out (node=glenrock,
>> call=336, timeout=10000ms)
>> Sep 27 15:35:59 glenrock crmd[3366]:   notice: process_lrm_event:
>> glenrock-iscsi-engine-target_stop_0:336 [ tgtadm: can't find the
>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>> target\ntgtadm: can't find the target\n ]
>> Sep 27 15:35:59 glenrock crmd[3366]:  warning: status_from_rc: Action 78
>> (iscsi-engine-target_stop_0) on glenrock failed (target: 0 vs. rc: 1):
>> Error
>> Sep 27 15:35:59 glenrock crmd[3366]:  warning: update_failcount:
>> Updating failcount for iscsi-engine-target on glenrock after failed
>> stop: rc=1 (update=INFINITY, time=1443364559)
>> Sep 27 15:35:59 glenrock crmd[3366]:   notice: abort_transition_graph:
>> Transition aborted by iscsi-engine-target_stop_0 'modify' on (null):
>> Event failed (magic=2:1;78:53:0:4ac61a75-532d-45c4-b07c-eee753699adc,
>> cib=0.374.146, source=match_graph_event:344, 0)
>> Sep 27 15:35:59 glenrock attrd[3364]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: last-failure-iscsi-engine-target
>> (1443364559)
>> Sep 27 15:35:59 glenrock crmd[3366]:  warning: update_failcount:
>> Updating failcount for iscsi-engine-target on glenrock after failed
>> stop: rc=1 (update=INFINITY, time=1443364559)
>> Sep 27 15:35:59 glenrock crmd[3366]:  warning: status_from_rc: Action 78
>> (iscsi-engine-target_stop_0) on glenrock failed (target: 0 vs. rc: 1):
>> Error
>> Sep 27 15:35:59 glenrock crmd[3366]:  warning: update_failcount:
>> Updating failcount for iscsi-engine-target on glenrock after failed
>> stop: rc=1 (update=INFINITY, time=1443364559)
>> Sep 27 15:35:59 glenrock attrd[3364]:   notice: attrd_perform_update:
>> Sent update 203: last-failure-iscsi-engine-target=1443364559
>> Sep 27 15:35:59 glenrock crmd[3366]:  warning: update_failcount:
>> Updating failcount for iscsi-engine-target on glenrock after failed
>> stop: rc=1 (update=INFINITY, time=1443364559)
>> Sep 27 15:35:59 glenrock attrd[3364]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: last-failure-iscsi-engine-target
>> (1443364559)
>> Sep 27 15:35:59 glenrock crmd[3366]:   notice: run_graph: Transition 53
>> (Complete=8, Pending=0, Fired=0, Skipped=3, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-536.bz2): Stopped
>> Sep 27 15:35:59 glenrock attrd[3364]:   notice: attrd_perform_update:
>> Sent update 205: last-failure-iscsi-engine-target=1443364559
>> Sep 27 15:35:59 glenrock attrd[3364]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: last-failure-iscsi-engine-target
>> (1443364559)
>> Sep 27 15:35:59 glenrock attrd[3364]:   notice: attrd_perform_update:
>> Sent update 207: last-failure-iscsi-engine-target=1443364559
>> Sep 27 15:35:59 glenrock pengine[3365]:   notice: unpack_config: On loss
>> of CCM Quorum: Ignore
>> Sep 27 15:35:59 glenrock pengine[3365]:  warning: unpack_rsc_op_failure:
>> Processing failed op stop for iscsi-engine-target on glenrock: unknown
>> error (1)
>> Sep 27 15:35:59 glenrock pengine[3365]:  warning: unpack_rsc_op_failure:
>> Processing failed op stop for iscsi-engine-target on glenrock: unknown
>> error (1)
>> Sep 27 15:35:59 glenrock pengine[3365]:  warning:
>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>> after 1000000 failures (max=1000000)
>> Sep 27 15:35:59 glenrock pengine[3365]:   notice: process_pe_message:
>> Calculated Transition 54: /var/lib/pacemaker/pengine/pe-input-537.bz2
>> Sep 27 15:35:59 glenrock crmd[3366]:   notice: run_graph: Transition 54
>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-537.bz2): Complete
>> Sep 27 15:35:59 glenrock crmd[3366]:   notice: do_state_transition:
>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>
>> I've even deleted all the resources for the engine side and started again,
>> but the same still happens! My oVirt engine is down all this time even
>> though the big "vms" LUN has been absolutely fine.
> Something is going wrong, obviously. You will want to fix that, but I am
> not familiar with this RA so I can't be the one. I will say though that
> having configured, working stonith will help in cases like this because
> it would fence off the node that failed. (assuming you've configured it
> to fence on a resource failure like this).
>
> At the core, fencing is to put a node that is in an unknown state into a
> known state. This concept can be extended to the HA services... If the
> service goes into an unknown state that it can't recover from, it's
> valid to fence the node.

Yes, I'm familiar with fencing as we already run RHEV. This is a POC of 
oVirt 3.5 for a small satellite office.
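Since the thread keeps coming back to fencing: a minimal stonith sketch for a two-node cluster like this, assuming IPMI-capable BMCs (the device names, IP addresses, and credentials below are placeholders, not taken from this setup):

```shell
# Hypothetical IPMI fencing, one device per node; pcmk_host_list ties
# each stonith device to the node it is allowed to fence.
pcs stonith create fence-granby fence_ipmilan \
    pcmk_host_list=granby ipaddr=192.0.2.10 \
    login=admin passwd=secret lanplus=1
pcs stonith create fence-glenrock fence_ipmilan \
    pcmk_host_list=glenrock ipaddr=192.0.2.11 \
    login=admin passwd=secret lanplus=1

# Keep each fence device off the node it is meant to fence.
pcs constraint location fence-granby avoids granby
pcs constraint location fence-glenrock avoids glenrock

pcs property set stonith-enabled=true
```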

>> Also it's really odd that the iSCSITarget resource is not available in
>> CentOS 6.x; I had to copy the resource script from an FTP site, ugh!
> So there is no guarantee of compatibility. Have you considered CentOS 7?
> With corosync v2 and no more cman plugin, it might help.

That would be my next step. But ideally I'd like to get this working 
first if I can, as we have some conservative elements here who think 7 
is "too new".

Thanks

Alex
