[ClusterLabs] Resources suddenly get target-role="stopped"

Ken Gaillot kgaillot at redhat.com
Mon Dec 7 20:28:22 UTC 2015


On 12/07/2015 02:03 AM, Klechomir wrote:
> Hi Ken,
> The comments are in the text.
> 
> On 4.12.2015 19:06, Ken Gaillot wrote:
>> On 12/04/2015 10:22 AM, Klechomir wrote:
>>> Hi list,
>>> My issue is the following:
>>>
>>> I have very stable cluster, using Corosync 2.1.0.26 and Pacemaker 1.1.8
>>> (observed the same problem with Corosync 2.3.5  & Pacemaker 1.1.13-rc3)
>>>
>>> Bumped on this issue when started playing with VirtualDomain resources,
>>> but this seems to be unrelated to the RA.
>>>
>>> The problem is that, without apparent reason, a resource gets
>>> target-role="Stopped". This happens after a (successful) migration,
>>> after a failover, or after a VM restart.
>>>
>>> My tests showed that changing the resource name fixes this problem, but
>>> this seems to be a temporary workaround.
>>>
>>> The resource configuration is:
>>> primitive VMA_VM1 ocf:heartbeat:VirtualDomain \
>>>          params config="/NFSvolumes/CDrive1/VM1/VM1.xml" \
>>>                 hypervisor="qemu:///system" migration_transport="tcp" \
>>>          meta allow-migrate="true" target-role="Started" \
>>>          op start interval="0" timeout="120s" \
>>>          op stop interval="0" timeout="120s" \
>>>          op monitor interval="10" timeout="30" depth="0" \
>>>          utilization cpu="1" hv_memory="925"
>>> order VM_VM1_after_Filesystem_CDrive1 inf: Filesystem_CDrive1 VMA_VM1
>>>
>>> Here is the log from one such stop, after successful migration with "crm
>>> migrate resource VMA_VM1":
>>>
>>> Dec 04 15:18:22 [3818929] CLUSTER-1       crmd:    debug: cancel_op:
>>> Cancelling op 5564 for VMA_VM1 (VMA_VM1:5564)
>>> Dec 04 15:18:22 [4434] CLUSTER-1       lrmd:     info:
>>> cancel_recurring_action:         Cancelling operation
>>> VMA_VM1_monitor_10000
>>> Dec 04 15:18:23 [3818929] CLUSTER-1       crmd:    debug: cancel_op:
>>> Op 5564 for VMA_VM1 (VMA_VM1:5564): cancelled
>>> Dec 04 15:18:23 [3818929] CLUSTER-1       crmd:    debug:
>>> do_lrm_rsc_op:        Performing
>>> key=351:199:0:fb6e486a-023a-4b44-83cf-4c0c208a0f56
>>> op=VMA_VM1_migrate_to_0
>>> VirtualDomain(VMA_VM1)[1797698]:        2015/12/04_15:18:23 DEBUG:
>>> Virtual domain VM1 is currently running.
>>> VirtualDomain(VMA_VM1)[1797698]:        2015/12/04_15:18:23 INFO: VM1:
>>> Starting live migration to CLUSTER-2 (using virsh
>>> --connect=qemu:///system --quiet migrate --live  VM1
>>> qemu+tcp://CLUSTER-2/system ).
>>> Dec 04 15:18:24 [3818929] CLUSTER-1       crmd:     info:
>>> process_lrm_event:    LRM operation VMA_VM1_monitor_10000 (call=5564,
>>> status=1, cib-update=0, confirmed=false) Cancelled
>>> Dec 04 15:18:24 [3818929] CLUSTER-1       crmd:    debug:
>>> update_history_cache:         Updating history for 'VMA_VM1' with
>>> monitor op
>>> VirtualDomain(VMA_VM1)[1797698]:        2015/12/04_15:18:26 INFO: VM1:
>>> live migration to CLUSTER-2 succeeded.
>>> Dec 04 15:18:26 [4434] CLUSTER-1       lrmd:    debug:
>>> operation_finished:      VMA_VM1_migrate_to_0:1797698 - exited with rc=0
>>> Dec 04 15:18:26 [4434] CLUSTER-1       lrmd:   notice:
>>> operation_finished:      VMA_VM1_migrate_to_0:1797698 [
>>> 2015/12/04_15:18:23 INFO: VM1: Starting live migration to CLUSTER-2
>>> (using virsh --connect=qemu:///system --quiet migrate --live  VM1
>>> qemu+tcp://CLUSTER-2/system ). ]
>>> Dec 04 15:18:26 [4434] CLUSTER-1       lrmd:   notice:
>>> operation_finished:      VMA_VM1_migrate_to_0:1797698 [
>>> 2015/12/04_15:18:26 INFO: VM1: live migration to CLUSTER-2 succeeded. ]
>>> Dec 04 15:18:27 [3818929] CLUSTER-1       crmd:    debug:
>>> create_operation_update:      do_update_resource: Updating resouce
>>> VMA_VM1 after complete migrate_to op (interval=0)
>>> Dec 04 15:18:27 [3818929] CLUSTER-1       crmd:   notice:
>>> process_lrm_event:    LRM operation VMA_VM1_migrate_to_0 (call=5697,
>>> rc=0, cib-update=89, confirmed=true) ok
>>> Dec 04 15:18:27 [3818929] CLUSTER-1       crmd:    debug:
>>> update_history_cache:         Updating history for 'VMA_VM1' with
>>> migrate_to op
>>> Dec 04 15:18:31 [3818929] CLUSTER-1       crmd:    debug: cancel_op:
>>> Operation VMA_VM1:5564 already cancelled
>>> Dec 04 15:18:31 [3818929] CLUSTER-1       crmd:    debug:
>>> do_lrm_rsc_op:        Performing
>>> key=225:200:0:fb6e486a-023a-4b44-83cf-4c0c208a0f56 op=VMA_VM1_stop_0
>>> VirtualDomain(VMA_VM1)[1798719]:        2015/12/04_15:18:31 DEBUG:
>>> Virtual domain VM1 is not running:  failed to get domain 'vm1' domain
>>> not found: no domain with matching name 'vm1'
>> This looks like the problem. Configuration error?
> 
> As far as I checked, this is a harmless bug in the VirtualDomain RA. It
> downcases the output of the "virsh domain info" command so it can
> parse the status easily, which prevents matching the domain name.
> In any case, this error doesn't affect the RA's functionality; here it
> just finds out that the resource is already stopped. My big concern is
> why it is stopped.

I'm not sure whether it's the cause of your main issue, but it's not
harmless. It could report back to Pacemaker that a domain is not running
when it really is.

The RA is not lowercasing the domain name, but the domain status
("running", "shut off", etc.). The RA gets the domain name from the
<name> tag in the XML file. I'm not sure where the lowercase "vm1" is
coming from.
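To illustrate the distinction, here is a minimal shell sketch (hypothetical, not the RA's actual code) of why lowercasing only the status string is harmless, while lowercasing the whole virsh output would break the name lookup:

```shell
#!/bin/sh
# Hypothetical illustration only -- not the real VirtualDomain RA logic.
# Simulated `virsh` status output for a domain named "VM1":
raw_status="Running"
domain_name="VM1"   # in the RA this comes from the <name> tag of the XML

# Lowercasing just the status keeps the comparison case-insensitive:
status=$(printf '%s' "$raw_status" | tr 'A-Z' 'a-z')

case "$status" in
    running|paused) echo "domain $domain_name is active" ;;
    "shut off")     echo "domain $domain_name is stopped" ;;
esac

# Lowercasing the name as well is what would make virsh look for a
# nonexistent domain, as in the "no domain with matching name 'vm1'" error:
echo "virsh would search for: $(printf '%s' "$domain_name" | tr 'A-Z' 'a-z')"
```

Either way, the RA must pass the name to virsh exactly as it appears in the domain XML.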

It looks like another node was the DC at this time; its logs will be
more interesting for figuring out why each action was initiated.
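The transition that scheduled the stop can be read straight out of the op key in your logs. A small sketch of decoding it (the action:transition:rc:uuid layout is my reading of the key format; double-check against your Pacemaker version):

```shell
#!/bin/sh
# Decode the key from the VMA_VM1_stop_0 line in the log above.
# Assumed layout: action-id:transition-id:expected-rc:transition-uuid
key="225:200:0:fb6e486a-023a-4b44-83cf-4c0c208a0f56"

IFS=: read -r action transition rc uuid <<EOF
$key
EOF

echo "stop was action $action of transition $transition (uuid $uuid)"
```

Searching the DC's logs for that transition number should show the policy engine's reasoning for scheduling the stop.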

>>> VirtualDomain(VMA_VM1)[1798719]:        2015/12/04_15:18:31 INFO: Domain
>>> VM1 already stopped.
>>> Dec 04 15:18:31 [4434] CLUSTER-1       lrmd:    debug:
>>> operation_finished:      VMA_VM1_stop_0:1798719 - exited with rc=0
>>> Dec 04 15:18:31 [4434] CLUSTER-1       lrmd:   notice:
>>> operation_finished:      VMA_VM1_stop_0:1798719 [ 2015/12/04_15:18:31
>>> INFO: Domain VM1 already stopped. ]
>>> Dec 04 15:18:32 [3818929] CLUSTER-1       crmd:    debug:
>>> create_operation_update:      do_update_resource: Updating resouce
>>> VMA_VM1 after complete stop op (interval=0)
>>> Dec 04 15:18:32 [3818929] CLUSTER-1       crmd:   notice:
>>> process_lrm_event:    LRM operation VMA_VM1_stop_0 (call=5701, rc=0,
>>> cib-update=90, confirmed=true) ok
>>> Dec 04 15:18:32 [3818929] CLUSTER-1       crmd:    debug:
>>> update_history_cache:         Updating history for 'VMA_VM1' with
>>> stop op
>>> Dec 04 15:20:58 [3818929] CLUSTER-1       crmd:    debug:
>>> create_operation_update:      build_active_RAs: Updating resouce VMA_VM1
>>> after complete stop op (interval=0)
>>> Dec 04 15:20:58 [3818929] CLUSTER-1       crmd:    debug:
>>> create_operation_update:      build_active_RAs: Updating resouce VMA_VM1
>>> after complete monitor op (interval=0)
>>> Dec 04 15:23:31 [1833996] CLUSTER-1 crm_resource:    debug:
>>> process_orphan_resource:    Detected orphan resource VMA_VM1 on
>>> CLUSTER-2
>>> Any suggestions are welcome.
>>>
>>> Best regards,
>>> Klecho




