[ClusterLabs] Major problem with iSCSITarget resource on top of DRBD M/S resource.
Digimer
lists at alteeve.ca
Sun Sep 27 15:06:08 UTC 2015
On 27/09/15 11:02 AM, Alex Crow wrote:
>
>
> On 27/09/15 15:54, Digimer wrote:
>> On 27/09/15 10:40 AM, Alex Crow wrote:
>>> Hi List,
>>>
>>> I'm trying to set up a failover iSCSI storage system for oVirt using a
>>> self-hosted engine. I've set up DRBD in Master-Slave for two iSCSI
>>> targets, one for the self-hosted engine and one for the VMs. I had this
>>> all working perfectly; then, after trying to move the engine's LUN to the
>>> opposite host, all hell broke loose. The VMs LUN is still fine; it starts
>> I'm guessing no fencing?
>
> Hi Digimer,
>
> No, but I've tried turning off one machine and still had no success
> running as a single node :-(
You *must* have working fencing in any case, so now strikes me as a
fantastic time to add it. Turning a node off by hand is not a substitute.
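To give you a rough idea of the shape of it - this is only a sketch; the
agent, BMC addresses and credentials below are placeholders for whatever
out-of-band interface your hardware actually has:

  # One stonith device per node; each must be able to kill the *other* node.
  pcs stonith create fence-granby fence_ipmilan pcmk_host_list="granby" \
      ipaddr="10.20.30.1" login="admin" passwd="secret" lanplus=1 \
      op monitor interval=60s
  pcs stonith create fence-glenrock fence_ipmilan pcmk_host_list="glenrock" \
      ipaddr="10.20.30.2" login="admin" passwd="secret" lanplus=1 \
      op monitor interval=60s
  pcs property set stonith-enabled=true

On a cman-based stack you also want cluster.conf to hand fencing off to
pacemaker via fence_pcmk, so both layers agree on who does the killing;
the RHEL 6 pacemaker docs cover that part.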
>>> and migrates as it should. However the engine LUN always seems to try to
>>> launch the target on the host that is *NOT* the master of the DRBD
>>> resource. My constraints look fine, and should be self-explanatory about
>>> which is which:
>>>
>>> [root at granby ~]# pcs constraint --full
>>> Location Constraints:
>>> Ordering Constraints:
>>> promote drbd-vms-iscsi then start iscsi-vms-ip (kind:Mandatory)
>>> (id:vm_iscsi_ip_after_drbd)
>>> start iscsi-vms-target then start iscsi-vms-lun (kind:Mandatory)
>>> (id:vms_lun_after_target)
>>> promote drbd-vms-iscsi then start iscsi-vms-target (kind:Mandatory)
>>> (id:vms_target_after_drbd)
>>> promote drbd-engine-iscsi then start iscsi-engine-ip (kind:Mandatory)
>>> (id:ip_after_drbd)
>>> start iscsi-engine-target then start iscsi-engine-lun
>>> (kind:Mandatory)
>>> (id:lun_after_target)
>>> promote drbd-engine-iscsi then start iscsi-engine-target
>>> (kind:Mandatory) (id:target_after_drbd)
>>> Colocation Constraints:
>>> iscsi-vms-ip with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
>>> (with-rsc-role:Master) (id:vms_ip-with-drbd)
>>> iscsi-vms-lun with drbd-vms-iscsi (score:INFINITY) (rsc-role:Started)
>>> (with-rsc-role:Master) (id:vms_lun-with-drbd)
>>> iscsi-vms-target with drbd-vms-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:vms_target-with-drbd)
>>> iscsi-engine-ip with drbd-engine-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:ip-with-drbd)
>>> iscsi-engine-lun with drbd-engine-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:lun-with-drbd)
>>> iscsi-engine-target with drbd-engine-iscsi (score:INFINITY)
>>> (rsc-role:Started) (with-rsc-role:Master) (id:target-with-drbd)
>>>
>>> But see this from pcs status: the iSCSI target has FAILED on glenrock,
>>> but the DRBD master is on granby!
>>>
>>> [root at granby ~]# pcs status
>>> Cluster name: storage
>>> Last updated: Sun Sep 27 15:30:08 2015
>>> Last change: Sun Sep 27 15:20:58 2015
>>> Stack: cman
>>> Current DC: glenrock - partition with quorum
>>> Version: 1.1.11-97629de
>>> 2 Nodes configured
>>> 10 Resources configured
>>>
>>>
>>> Online: [ glenrock granby ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: drbd-vms-iscsi [drbd-vms]
>>> Masters: [ glenrock ]
>>> Slaves: [ granby ]
>>> iscsi-vms-target (ocf::heartbeat:iSCSITarget): Started glenrock
>>> iscsi-vms-lun (ocf::heartbeat:iSCSILogicalUnit): Started glenrock
>>> iscsi-vms-ip (ocf::heartbeat:IPaddr2): Started glenrock
>>> Master/Slave Set: drbd-engine-iscsi [drbd-engine]
>>> Masters: [ granby ]
>>> Slaves: [ glenrock ]
>>> iscsi-engine-target (ocf::heartbeat:iSCSITarget): FAILED glenrock
>>> (unmanaged)
>>> iscsi-engine-ip (ocf::heartbeat:IPaddr2): Stopped
>>> iscsi-engine-lun (ocf::heartbeat:iSCSILogicalUnit): Stopped
>>>
>>> Failed actions:
>>> iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
>>> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
>>> queued=0ms, exec=10003ms
>>> iscsi-engine-target_stop_0 on glenrock 'unknown error' (1):
>>> call=177, status=Timed Out, last-rc-change='Sun Sep 27 15:20:59 2015',
>>> queued=0ms, exec=10003ms
>>>
>>> I have tried various combinations of pcs resource clear and cleanup,
>>> which all result in the same outcome - apart from some occasions when
>>> one or other of the two hosts suddenly reboots!
>>>
>>> Here is a log right after a "pcs resource cleanup" - first on the master
>>> for the DRBD m/s resource:
>>> [root at granby ~]# pcs resource cleanup; tail -f /var/log/messages
>>> All resources/stonith devices successfully cleaned up
>>> Sep 27 15:33:42 granby crmd[3358]: notice: process_lrm_event:
>>> granby-drbd-engine_monitor_0:117 [ \n ]
>>> Sep 27 15:33:42 granby attrd[3356]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (true)
>>> Sep 27 15:33:42 granby attrd[3356]: notice: attrd_perform_update: Sent
>>> update 54: probe_complete=true
>>> Sep 27 15:33:42 granby crmd[3358]: notice: process_lrm_event:
>>> Operation drbd-engine_monitor_10000: master (node=granby, call=131,
>>> rc=8, cib-update=83, confirmed=false)
>>> Sep 27 15:33:42 granby crmd[3358]: notice: process_lrm_event:
>>> granby-drbd-engine_monitor_10000:131 [ \n ]
>>> Sep 27 15:33:42 granby crmd[3358]: notice: process_lrm_event:
>>> Operation drbd-vms_monitor_20000: ok (node=granby, call=130, rc=0,
>>> cib-update=84, confirmed=false)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: do_lrm_invoke: Forcing the
>>> status of all resources to be redetected
>>> Sep 27 15:34:46 granby attrd[3356]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (<null>)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> granby-drbd-engine_monitor_10000:131 [ \n ]
>>> Sep 27 15:34:46 granby attrd[3356]: notice: attrd_perform_update: Sent
>>> delete 61: node=granby, attr=probe_complete, id=<n/a>, set=(null),
>>> section=status
>>> Sep 27 15:34:46 granby attrd[3356]: notice: attrd_perform_update: Sent
>>> delete 63: node=granby, attr=probe_complete, id=<n/a>, set=(null),
>>> section=status
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation iscsi-vms-target_monitor_0: not running (node=granby,
>>> call=150, rc=7, cib-update=94, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation iscsi-vms-lun_monitor_0: not running (node=granby, call=154,
>>> rc=7, cib-update=95, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation iscsi-engine-target_monitor_0: not running (node=granby,
>>> call=167, rc=7, cib-update=96, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation iscsi-engine-lun_monitor_0: not running (node=granby,
>>> call=175, rc=7, cib-update=97, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation iscsi-vms-ip_monitor_0: not running (node=granby, call=158,
>>> rc=7, cib-update=98, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation iscsi-engine-ip_monitor_0: not running (node=granby, call=171,
>>> rc=7, cib-update=99, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation drbd-vms_monitor_0: ok (node=granby, call=146, rc=0,
>>> cib-update=100, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> Operation drbd-engine_monitor_0: master (node=granby, call=163, rc=8,
>>> cib-update=101, confirmed=true)
>>> Sep 27 15:34:46 granby crmd[3358]: notice: process_lrm_event:
>>> granby-drbd-engine_monitor_0:163 [ \n ]
>>> Sep 27 15:34:46 granby attrd[3356]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (true)
>>> Sep 27 15:34:46 granby attrd[3356]: notice: attrd_perform_update: Sent
>>> update 67: probe_complete=true
>>> Sep 27 15:34:47 granby crmd[3358]: notice: process_lrm_event:
>>> Operation drbd-vms_monitor_20000: ok (node=granby, call=176, rc=0,
>>> cib-update=102, confirmed=false)
>>> Sep 27 15:34:47 granby crmd[3358]: notice: process_lrm_event:
>>> Operation drbd-engine_monitor_10000: master (node=granby, call=177,
>>> rc=8, cib-update=103, confirmed=false)
>>> Sep 27 15:34:47 granby crmd[3358]: notice: process_lrm_event:
>>> granby-drbd-engine_monitor_10000:177 [ \n ]
>>>
>>> Now on the slave:
>>>
>>>
>>> [root at glenrock ~]# pcs resource cleanup; tail -f /var/log/messages
>>> All resources/stonith devices successfully cleaned up
>>> Sep 27 15:34:57 glenrock pengine[3365]: warning: unpack_rsc_op_failure:
>>> Processing failed op stop for iscsi-engine-target on glenrock: unknown
>>> error (1)
>>> Sep 27 15:34:57 glenrock pengine[3365]: warning:
>>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>>> after 1000000 failures (max=1000000)
>>> Sep 27 15:34:57 glenrock pengine[3365]: notice: process_pe_message:
>>> Calculated Transition 50: /var/lib/pacemaker/pengine/pe-input-533.bz2
>>> Sep 27 15:34:57 glenrock crmd[3366]: notice: run_graph: Transition 50
>>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>> Source=/var/lib/pacemaker/pengine/pe-input-533.bz2): Complete
>>> Sep 27 15:34:57 glenrock crmd[3366]: notice: do_state_transition:
>>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: do_lrm_invoke: Forcing
>>> the status of all resources to be redetected
>>> Sep 27 15:35:49 glenrock attrd[3364]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (<null>)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> glenrock-drbd-vms_monitor_10000:278 [ \n ]
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: qb_ipcs_event_sendv:
>>> new_event_notification (3366-853-16): Broken pipe (32)
>> Something is starting to go wrong here
>
> Yes, it does look like it. But why does the other iSCSI resource work
> fine while this one has seemingly gone mad?
Not sure. Without fencing, once things go sideways, you enter an
undefined state. One of the major benefits of *working* fencing is that
it makes debugging issues easier.
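One thing I would check by hand on both nodes, since the stop errors
further down show tgtadm complaining it "can't find the target", is what
tgtd actually has configured versus what pacemaker thinks it has. Assuming
the RA is driving the tgt implementation (which those errors suggest),
something like:

  service tgtd status
  tgtadm --lld iscsi --mode target --op show

If tgtd and pacemaker disagree about whether that target exists, the stop
can keep retrying until it times out, which looks a lot like the 10-second
stop timeouts in your status output.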
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: do_state_transition:
>>> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> Sep 27 15:35:49 glenrock attrd[3364]: notice: attrd_perform_update:
>>> Sent delete 195: node=glenrock, attr=probe_complete, id=<n/a>,
>>> set=(null), section=status
>>> Sep 27 15:35:49 glenrock attrd[3364]: notice: attrd_perform_update:
>>> Sent delete 197: node=glenrock, attr=probe_complete, id=<n/a>,
>>> set=(null), section=status
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: unpack_config: On loss
>>> of CCM Quorum: Ignore
>>> Sep 27 15:35:49 glenrock pengine[3365]: warning:
>>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>>> after 1000000 failures (max=1000000)
>> The node should probably have been fenced at this point.
>
> Agreed.
Then add it.
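Worth knowing: once stonith is configured and enabled, a failed stop (like
the iscsi-engine-target stop timeout later in your log) is escalated to a
node fence by default, instead of leaving the resource sitting there
unmanaged. You can also spell that out per operation if you want it
explicit - sketch only, the timeout value is just an example:

  pcs resource update iscsi-engine-target op stop interval=0 timeout=30s on-fail=fence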
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: LogActions: Start
>>> drbd-vms:0#011(glenrock)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: LogActions: Start
>>> drbd-vms:1#011(granby)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: LogActions: Start
>>> drbd-engine:0#011(glenrock)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: LogActions: Start
>>> drbd-engine:1#011(granby)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: process_pe_message:
>>> Calculated Transition 51: /var/lib/pacemaker/pengine/pe-input-534.bz2
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 4: monitor drbd-vms:0_monitor_0 on glenrock (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 13: monitor drbd-vms:1_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 14: monitor iscsi-vms-target_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 5: monitor iscsi-vms-target_monitor_0 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 15: monitor iscsi-vms-lun_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 6: monitor iscsi-vms-lun_monitor_0 on glenrock (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 16: monitor iscsi-vms-ip_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 7: monitor iscsi-vms-ip_monitor_0 on glenrock (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 8: monitor drbd-engine:0_monitor_0 on glenrock (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 17: monitor drbd-engine:1_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 18: monitor iscsi-engine-target_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 9: monitor iscsi-engine-target_monitor_0 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 19: monitor iscsi-engine-ip_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 10: monitor iscsi-engine-ip_monitor_0 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 20: monitor iscsi-engine-lun_monitor_0 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 11: monitor iscsi-engine-lun_monitor_0 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-vms-target_monitor_0: ok (node=glenrock, call=305, rc=0,
>>> cib-update=377, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 5
>>> (iscsi-vms-target_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>>> Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: abort_transition_graph:
>>> Transition aborted by iscsi-vms-target_monitor_0 'create' on (null):
>>> Event failed (magic=0:0;5:51:7:4ac61a75-532d-45c4-b07c-eee753699adc,
>>> cib=0.374.121, source=match_graph_event:344, 0)
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 5
>>> (iscsi-vms-target_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>>> Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-vms-lun_monitor_0: ok (node=glenrock, call=309, rc=0,
>>> cib-update=378, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-engine-target_monitor_0: ok (node=glenrock, call=322,
>>> rc=0, cib-update=379, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 6
>>> (iscsi-vms-lun_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>>> Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 6
>>> (iscsi-vms-lun_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>>> Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 9
>>> (iscsi-engine-target_monitor_0) on glenrock failed (target: 7 vs. rc:
>>> 0): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 9
>>> (iscsi-engine-target_monitor_0) on glenrock failed (target: 7 vs. rc:
>>> 0): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-engine-lun_monitor_0: not running (node=glenrock,
>>> call=330, rc=7, cib-update=380, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-engine-ip_monitor_0: not running (node=glenrock,
>>> call=326, rc=7, cib-update=381, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-vms-ip_monitor_0: ok (node=glenrock, call=313, rc=0,
>>> cib-update=382, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 13
>>> (drbd-vms:1_monitor_0) on granby failed (target: 7 vs. rc: 0): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 13
>>> (drbd-vms:1_monitor_0) on granby failed (target: 7 vs. rc: 0): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 17
>>> (drbd-engine:1_monitor_0) on granby failed (target: 7 vs. rc: 8): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 17
>>> (drbd-engine:1_monitor_0) on granby failed (target: 7 vs. rc: 8): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 12: probe_complete probe_complete-granby on granby -
>>> no waiting
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 7
>>> (iscsi-vms-ip_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 7
>>> (iscsi-vms-ip_monitor_0) on glenrock failed (target: 7 vs. rc: 0): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: abort_transition_graph:
>>> Transition aborted by status-granby-probe_complete, probe_complete=true:
>>> Transient attribute change (create cib=0.374.135,
>>> source=te_update_diff:391,
>>> path=/cib/status/node_state[@id='granby']/transient_attributes[@id='granby']/instance_attributes[@id='status-granby'],
>>>
>>> 0)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation drbd-engine_monitor_0: ok (node=glenrock, call=318, rc=0,
>>> cib-update=383, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation drbd-vms_monitor_0: master (node=glenrock, call=301, rc=8,
>>> cib-update=384, confirmed=true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> glenrock-drbd-vms_monitor_0:301 [ \n ]
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 8
>>> (drbd-engine:0_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>>> Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 8
>>> (drbd-engine:0_monitor_0) on glenrock failed (target: 7 vs. rc: 0):
>>> Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 4
>>> (drbd-vms:0_monitor_0) on glenrock failed (target: 7 vs. rc: 8): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: warning: status_from_rc: Action 4
>>> (drbd-vms:0_monitor_0) on glenrock failed (target: 7 vs. rc: 8): Error
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 3: probe_complete probe_complete-glenrock on glenrock
>>> (local) - no waiting
>>> Sep 27 15:35:49 glenrock attrd[3364]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (true)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: run_graph: Transition 51
>>> (Complete=24, Pending=0, Fired=0, Skipped=9, Incomplete=10,
>>> Source=/var/lib/pacemaker/pengine/pe-input-534.bz2): Stopped
>>> Sep 27 15:35:49 glenrock attrd[3364]: notice: attrd_perform_update:
>>> Sent update 201: probe_complete=true
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: unpack_config: On loss
>>> of CCM Quorum: Ignore
>>> Sep 27 15:35:49 glenrock pengine[3365]: warning:
>>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>>> after 1000000 failures (max=1000000)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: LogActions: Move
>>> iscsi-engine-target#011(Started glenrock -> granby)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: process_pe_message:
>>> Calculated Transition 52: /var/lib/pacemaker/pengine/pe-input-535.bz2
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: unpack_config: On loss
>>> of CCM Quorum: Ignore
>>> Sep 27 15:35:49 glenrock pengine[3365]: warning:
>>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>>> after 1000000 failures (max=1000000)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: LogActions: Move
>>> iscsi-engine-target#011(Started glenrock -> granby)
>>> Sep 27 15:35:49 glenrock pengine[3365]: notice: process_pe_message:
>>> Calculated Transition 53: /var/lib/pacemaker/pengine/pe-input-536.bz2
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 7: monitor drbd-vms_monitor_20000 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 12: monitor drbd-vms_monitor_10000 on glenrock (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 39: monitor iscsi-vms-target_monitor_10000 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 42: monitor iscsi-vms-lun_monitor_10000 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 45: monitor iscsi-vms-ip_monitor_10000 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 50: monitor drbd-engine_monitor_10000 on granby
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 53: monitor drbd-engine_monitor_20000 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: te_rsc_command:
>>> Initiating action 78: stop iscsi-engine-target_stop_0 on glenrock
>>> (local)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-vms-target_monitor_10000: ok (node=glenrock, call=332,
>>> rc=0, cib-update=387, confirmed=false)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-vms-lun_monitor_10000: ok (node=glenrock, call=333,
>>> rc=0, cib-update=388, confirmed=false)
>>> Sep 27 15:35:49 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation iscsi-vms-ip_monitor_10000: ok (node=glenrock, call=334, rc=0,
>>> cib-update=389, confirmed=false)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation drbd-vms_monitor_10000: master (node=glenrock, call=331, rc=8,
>>> cib-update=390, confirmed=false)
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> glenrock-drbd-vms_monitor_10000:331 [ \n ]
>>> Sep 27 15:35:49 glenrock crmd[3366]: notice: process_lrm_event:
>>> Operation drbd-engine_monitor_20000: ok (node=glenrock, call=335, rc=0,
>>> cib-update=391, confirmed=false)
>>> Sep 27 15:35:50 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:51 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:52 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:53 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:54 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:55 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:56 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:57 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:58 glenrock iSCSITarget(iscsi-engine-target)[1079]:
>>> WARNING: Failed to remove target iqn.2015-09.integralife.net, retrying.
>>> Sep 27 15:35:59 glenrock lrmd[3363]: warning: child_timeout_callback:
>>> iscsi-engine-target_stop_0 process (PID 1079) timed out
>>> Sep 27 15:35:59 glenrock lrmd[3363]: warning: operation_finished:
>>> iscsi-engine-target_stop_0:1079 - timed out after 10000ms
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock lrmd[3363]: notice: operation_finished:
>>> iscsi-engine-target_stop_0:1079:stderr [ tgtadm: can't find the target ]
>>> Sep 27 15:35:59 glenrock crmd[3366]: error: process_lrm_event:
>>> Operation iscsi-engine-target_stop_0: Timed Out (node=glenrock,
>>> call=336, timeout=10000ms)
>>> Sep 27 15:35:59 glenrock crmd[3366]: notice: process_lrm_event:
>>> glenrock-iscsi-engine-target_stop_0:336 [ tgtadm: can't find the
>>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>>> target\ntgtadm: can't find the target\ntgtadm: can't find the
>>> target\ntgtadm: can't find the target\n ]
>>> Sep 27 15:35:59 glenrock crmd[3366]: warning: status_from_rc: Action 78
>>> (iscsi-engine-target_stop_0) on glenrock failed (target: 0 vs. rc: 1):
>>> Error
>>> Sep 27 15:35:59 glenrock crmd[3366]: warning: update_failcount:
>>> Updating failcount for iscsi-engine-target on glenrock after failed
>>> stop: rc=1 (update=INFINITY, time=1443364559)
>>> Sep 27 15:35:59 glenrock crmd[3366]: notice: abort_transition_graph:
>>> Transition aborted by iscsi-engine-target_stop_0 'modify' on (null):
>>> Event failed (magic=2:1;78:53:0:4ac61a75-532d-45c4-b07c-eee753699adc,
>>> cib=0.374.146, source=match_graph_event:344, 0)
>>> Sep 27 15:35:59 glenrock attrd[3364]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-iscsi-engine-target
>>> (1443364559)
>>> Sep 27 15:35:59 glenrock crmd[3366]: warning: update_failcount:
>>> Updating failcount for iscsi-engine-target on glenrock after failed
>>> stop: rc=1 (update=INFINITY, time=1443364559)
>>> Sep 27 15:35:59 glenrock crmd[3366]: warning: status_from_rc: Action 78
>>> (iscsi-engine-target_stop_0) on glenrock failed (target: 0 vs. rc: 1):
>>> Error
>>> Sep 27 15:35:59 glenrock crmd[3366]: warning: update_failcount:
>>> Updating failcount for iscsi-engine-target on glenrock after failed
>>> stop: rc=1 (update=INFINITY, time=1443364559)
>>> Sep 27 15:35:59 glenrock attrd[3364]: notice: attrd_perform_update:
>>> Sent update 203: last-failure-iscsi-engine-target=1443364559
>>> Sep 27 15:35:59 glenrock crmd[3366]: warning: update_failcount:
>>> Updating failcount for iscsi-engine-target on glenrock after failed
>>> stop: rc=1 (update=INFINITY, time=1443364559)
>>> Sep 27 15:35:59 glenrock attrd[3364]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-iscsi-engine-target
>>> (1443364559)
>>> Sep 27 15:35:59 glenrock crmd[3366]: notice: run_graph: Transition 53
>>> (Complete=8, Pending=0, Fired=0, Skipped=3, Incomplete=0,
>>> Source=/var/lib/pacemaker/pengine/pe-input-536.bz2): Stopped
>>> Sep 27 15:35:59 glenrock attrd[3364]: notice: attrd_perform_update:
>>> Sent update 205: last-failure-iscsi-engine-target=1443364559
>>> Sep 27 15:35:59 glenrock attrd[3364]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-iscsi-engine-target
>>> (1443364559)
>>> Sep 27 15:35:59 glenrock attrd[3364]: notice: attrd_perform_update:
>>> Sent update 207: last-failure-iscsi-engine-target=1443364559
>>> Sep 27 15:35:59 glenrock pengine[3365]: notice: unpack_config: On loss
>>> of CCM Quorum: Ignore
>>> Sep 27 15:35:59 glenrock pengine[3365]: warning: unpack_rsc_op_failure:
>>> Processing failed op stop for iscsi-engine-target on glenrock: unknown
>>> error (1)
>>> Sep 27 15:35:59 glenrock pengine[3365]: warning: unpack_rsc_op_failure:
>>> Processing failed op stop for iscsi-engine-target on glenrock: unknown
>>> error (1)
>>> Sep 27 15:35:59 glenrock pengine[3365]: warning:
>>> common_apply_stickiness: Forcing iscsi-engine-target away from glenrock
>>> after 1000000 failures (max=1000000)
>>> Sep 27 15:35:59 glenrock pengine[3365]: notice: process_pe_message:
>>> Calculated Transition 54: /var/lib/pacemaker/pengine/pe-input-537.bz2
>>> Sep 27 15:35:59 glenrock crmd[3366]: notice: run_graph: Transition 54
>>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>> Source=/var/lib/pacemaker/pengine/pe-input-537.bz2): Complete
>>> Sep 27 15:35:59 glenrock crmd[3366]: notice: do_state_transition:
>>> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>>
>>> I've even deleted all resources for the engine stuff and started again,
>>> but the same still happens! My oVirt engine is down all this time even
>>> though the big "vms" lun has been absolutely fine.
>> Something is going wrong, obviously. You will want to fix that, but I am
>> not familiar with this RA, so I can't be the one to debug it. I will say,
>> though, that having configured, working stonith will help in cases like
>> this because it would fence off the failed node (assuming you've
>> configured it to fence on a resource failure like this).
>>
>> At the core, fencing is to put a node that is in an unknown state into a
>> known state. This concept can be extended to the HA services... If the
>> service goes into an unknown state that it can't recover from, it's
>> valid to fence the node.
>
> Yes, I'm familiar with fencing as we already run RHEV. This is a POC of
> Ovirt 3.5 for a small satellite office.
The software always behaves as if it is critical, PoC or not. Besides,
what good is a PoC that doesn't replicate the function of the final
product?
>>> Also it's really odd that the iSCSITarget resource is not available in
>>> CentOS 6.x; I had to copy the resource script from an FTP site, ugh!
>> So there is no guarantee of compatibility. Have you considered CentOS 7?
>> With corosync v2 and no more cman plugin, it might help.
>
> That would be my next step. But ideally I'd like to get this working
> first if I can, as we have some conservative elements here that think 7
> is "too new".
I would agree with that assessment. I still use RHEL6 myself, though I
also know I am one of the more conservative/paranoid cluster people. :)
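On the hand-copied RA above: it would be worth double-checking which
resource-agents build you have and that the copied script is actually the
one pcs is loading, just to rule out a stale or partial copy. For example:

  rpm -q resource-agents
  ls /usr/lib/ocf/resource.d/heartbeat/ | grep -i iscsi
  pcs resource describe ocf:heartbeat:iSCSITarget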
> Thanks
>
> Alex
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?