[ClusterLabs] VM failure during shutdown
Emmanuel Gelati
emi2fast at gmail.com
Tue Jun 26 12:19:53 EDT 2018
I think that you need:

  pcs resource create windows_VM_res VirtualDomain \
      hypervisor="qemu:///system" \
      config="/opt/sigma_vms/xml_definitions/windows_VM.xml" \
      meta target-role=Stopped

This way, Pacemaker doesn't start the resource until you have your
constraints in place and enable it.
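If StorageDRBD_SigmaVMs is the master/slave (ms) resource for the DRBD
device (an assumption on my part; use your actual ms resource id), the
constraints should tie the VM to its Master role, and the VM can then be
enabled. A rough, untested sketch:

  pcs constraint colocation add windows_VM_res with master StorageDRBD_SigmaVMs INFINITY
  pcs constraint order promote StorageDRBD_SigmaVMs then start windows_VM_res
  pcs resource enable windows_VM_res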
2018-06-26 17:24 GMT+02:00 Vaggelis Papastavros <psvaggelis at gmail.com>:
> Many thanks for the excellent answer,
>
> Ken, after investigating the log files:
>
> In our environment we have two DRBD partitions, one for customer_vms and
> one for sigma_vms.
>
> For customer_vms the active node is node2, and for sigma_vms the active
> node is node1.
>
> [root@sgw-01 drbd.d]# drbdadm status
>
> customer_vms role:Secondary
> disk:UpToDate
> sgw-02 role:Primary
> peer-disk:UpToDate
>
> sigma_vms role:Primary
> disk:UpToDate
> sgw-02 role:Secondary
> peer-disk:UpToDate
>
> When I create a new VM *I can't force the resource creation* to take
> place on a specific node; the cluster places the resource spontaneously
> on one of the two nodes (if that node happens to be the DRBD Primary then
> it is OK, otherwise Pacemaker raises a failure for the node).
>
> My solution is the following:
>
>   pcs resource create windows_VM_res VirtualDomain \
>       hypervisor="qemu:///system" \
>       config="/opt/sigma_vms/xml_definitions/windows_VM.xml"
>
>
> (The cluster arbitrarily tries to place the above resource on node 2,
> which is currently the Secondary for the corresponding partition.
> Personally, I assume that the VirtualDomain agent should be able to read
> the correct disk location from the XML definition and then find the
> correct DRBD node.)
>
> pcs constraint colocation add windows_VM_res with StorageDRBD_SigmaVMs
> INFINITY
>
> pcs constraint order start StorageDRBD_SigmaVMs_rers then start windows_VM
>
> pcs resource cleanup windows_VM_res
>
> After the above steps the VM is located on the correct node and everything
> is ok.
>
>
> *Is my approach correct?*
>
>
> Your opinion would be valuable,
>
> Sincerely
>
>
>
> On 06/25/2018 07:15 PM, Ken Gaillot wrote:
>
> On Mon, 2018-06-25 at 09:47 -0500, Ken Gaillot wrote:
>
> On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:
>
> Dear friends,
>
> We have the following configuration:
>
> CentOS 7, pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD, and
> stonith enabled with APC PDU devices.
>
> I have a Windows VM configured as a cluster resource with the following
> attributes:
>
> Resource: WindowSentinelOne_res (class=ocf provider=heartbeat type=VirtualDomain)
>   Attributes: hypervisor=qemu:///system
>               config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.xml
>               migration_transport=ssh
>   Utilization: cpu=8 hv_memory=8192
>   Operations: start interval=0s timeout=120s (WindowSentinelOne_res-start-interval-0s)
>               stop interval=0s timeout=120s (WindowSentinelOne_res-stop-interval-0s)
>               monitor interval=10s timeout=30s (WindowSentinelOne_res-monitor-interval-10s)
>
> Under some circumstances (which I am trying to identify) the VM fails and
> disappears from virsh list --all, and Pacemaker reports the VM as stopped.
>
> If I run pcs resource cleanup windows_wm everything is OK, but I can't
> identify the reason for the failure.
>
> For example, when I shut down the VM (from within Windows) the cluster
> reports the following:
>
> WindowSentinelOne_res (ocf::heartbeat:VirtualDomain): Started sgw-02
> (failure ignored)
>
> Failed Actions:
> * WindowSentinelOne_res_monitor_10000 on sgw-02 'not running' (7):
> call=67, status=complete, exitreason='none',
> last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms,
> exec=0ms.
>
>
> My questions are:
>
> 1) Why is the VM shutdown reported as a failed action by the cluster? It's
> a legitimate operation during the VM life cycle.
>
> Pacemaker has no way of knowing that the VM was intentionally shut
> down, vs crashed.
>
> When some resource is managed by the cluster, all starts and stops of
> the resource have to go through the cluster. You can either set
> target-role=Stopped in the resource configuration, or if it's a temporary
> issue (e.g. rebooting for some OS updates), you could set
> is-managed=false to take it out of cluster control, do the work, then set
> is-managed=true again.
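> With pcs that would look something like this (using the resource name
> from your configuration):
>
>   # keep the resource stopped, but still under cluster control
>   pcs resource disable WindowSentinelOne_res
>   pcs resource enable WindowSentinelOne_res
>
>   # or take it out of cluster management temporarily (is-managed=false/true)
>   pcs resource unmanage WindowSentinelOne_res
>   # ... do the maintenance ...
>   pcs resource manage WindowSentinelOne_res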
>
> Also, a nice feature is that you can use rules to set a maintenance
> window ahead of time (especially helpful if the person who maintains
> the cluster isn't the same person who needs to do the VM updates). For
> example, you could set a rule that the resource's is-managed option
> will be false from 9pm to midnight on Fridays. See:
> http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#idm140583511697312
>
> particularly the parts about time/date expressions and using rules to
> control resource options.
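> A simpler but cluster-wide alternative, if you just need an ad-hoc window
> rather than a scheduled rule, is maintenance mode; note that it affects
> every resource in the cluster, not just this VM:
>
>   pcs property set maintenance-mode=true
>   # ... do the maintenance ...
>   pcs property set maintenance-mode=false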
>
>
> 2) Why is the resource sometimes marked as stopped (while the VM is
> healthy) and needs a cleanup?
>
> That's a problem. If the VM is truly healthy, it sounds like there's an
> issue with the resource agent. You'd have to look at the logs to see if
> it gave any more information (e.g. if it's a timeout, raising the
> timeout might be sufficient).
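> If it does turn out to be a monitor timeout, raising it is a one-liner
> with pcs, something along these lines (60s is just an example value):
>
>   pcs resource update WindowSentinelOne_res op monitor interval=10s timeout=60s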
>
>
> 3) I can't understand the corosync logs... during the VM shutdown the
> corosync log shows the following:
>
> FYI, the system log will have the most important messages. corosync.log
> will additionally have info-level messages -- potentially helpful but
> definitely difficult to follow.
>
>
> Jun 25 07:41:37 [5140] sgw-02 crmd: info: process_lrm_event:
> Result of monitor operation for WindowSentinelOne_res on sgw-02:
> 7 (not running) | call=67 key=WindowSentinelOne_res_monitor_10000
> confirmed=false cib-update=36
>
> This is really the only important message. It says that a recurring
> monitor on the WindowSentinelOne_res resource on node sgw-02 exited
> with status code 7 (which means the resource agent thinks the resource
> is not running).
>
> 'key=WindowSentinelOne_res_monitor_10000' is how pacemaker identifies
> resource agent actions. The format is
> <resource-name>_<action-name>_<action-interval-in-milliseconds>
>
> This is the only information Pacemaker will get from the resource
> agent. To investigate more deeply, you'll have to check for log
> messages from the agent itself.
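> On CentOS 7 the agent's messages normally land in the system log, so
> something like the following should pull them out (log location and
> message tags can vary by setup):
>
>   grep VirtualDomain /var/log/messages
>   grep WindowSentinelOne_res /var/log/messages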
>
>
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_process_request: Forwarding cib_modify operation for
> section
> status to all (origin=local/crmd/36)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> Diff: --- 0.4704.67 2
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> Diff: +++ 0.4704.68 (null)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> + /cib: @num_updates=68
> Jun 25 07:41:37 [5130] sgw-02 cib: info: cib_perform_op:
> + /cib/status/node_state[@id='2']: @crm-debug-origin=do_update_resource
> Jun 25 07:41:37 [5130] sgw-02 cib: info: cib_perform_op:
> ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='WindowSentinelOne_res']:
> <lrm_rsc_op id="WindowSentinelOne_res_last_failure_0"
> operation_key="WindowSentinelOne_res_monitor_10000"
> operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
> transition-key="84:3:0:f910c793-a714-4e24-80d1-b0ec66275491"
> transition-magic="0:7;84:3:0:f910c793-a714-4e24-80d1-b0ec66275491"
> on_node="sgw-02" cal
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=sgw-02/crmd/36, version=0.4704.68)
>
> You can usually ignore the 'cib' messages. This just means Pacemaker
> recorded the result on disk.
>
>
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: attrd_peer_update:
> Setting fail-count-WindowSentinelOne_res[sgw-02]: (null) -> 1 from sgw-01
>
> Since the cluster expected the resource to be running, this result is a
> failure. Failures are counted using special node attributes that start
> with "fail-count-". This is what Pacemaker uses to determine if a
> resource has reached its migration-threshold.
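> If you want to inspect or clear those counters, pcs offers something like:
>
>   pcs resource failcount show WindowSentinelOne_res
>   pcs resource failcount reset WindowSentinelOne_res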
>
>
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: write_attribute:
> Sent update 10 with 1 changes for fail-count-WindowSentinelOne_res,
> id=<n/a>, set=(null)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_process_request: Forwarding cib_modify operation for
> section
> status to all (origin=local/attrd/10)
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: attrd_peer_update:
> Setting last-failure-WindowSentinelOne_res[sgw-02]: (null) -> 1529912497
> from sgw-01
>
> Similarly, the time the failure occurred is stored in a 'last-failure-'
> node attribute, which Pacemaker uses to determine if a resource has
> reached its failure-timeout.
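> Both migration-threshold and failure-timeout are per-resource meta
> attributes, so they can be set with something like (the values here are
> only an example):
>
>   pcs resource meta WindowSentinelOne_res migration-threshold=3 failure-timeout=10min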
>
>
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: write_attribute:
> Sent update 11 with 1 changes for last-failure-WindowSentinelOne_res,
> id=<n/a>, set=(null)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_process_request: Forwarding cib_modify operation for
> section
> status to all (origin=local/attrd/11)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> Diff: --- 0.4704.68 2
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> Diff: +++ 0.4704.69 (null)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> + /cib: @num_updates=69
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:
> <nvpair id="status-2-fail-count-WindowSentinelOne_res" name="fail-count-WindowSentinelOne_res" value="1"/>
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=sgw-02/attrd/10, version=0.4704.69)
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: attrd_cib_callback:
> Update 10 for fail-count-WindowSentinelOne_res: OK (0)
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: attrd_cib_callback:
> Update 10 for fail-count-WindowSentinelOne_res[sgw-02]=1: OK (0)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> Diff: --- 0.4704.69 2
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> Diff: +++ 0.4704.70 (null)
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> + /cib: @num_updates=70
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_perform_op:
> ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:
> <nvpair id="status-2-last-failure-WindowSentinelOne_res" name="last-failure-WindowSentinelOne_res" value="1529912497"/>
> Jun 25 07:41:37 [5130] sgw-02 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=sgw-02/attrd/11, version=0.4704.70)
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: attrd_cib_callback:
> Update 11 for last-failure-WindowSentinelOne_res: OK (0)
> Jun 25 07:41:37 [5137] sgw-02 attrd: info: attrd_cib_callback:
> Update 11 for last-failure-WindowSentinelOne_res[sgw-02]=1529912497: OK (0)
> Jun 25 07:41:42 [5130] sgw-02 cib: info:
> cib_process_ping:
> Reporting our current digest to sgw-01:
> 3e27415fcb003ef3373b47ffa6c5f358
> for 0.4704.70 (0x7faac1729720 0)
>
> Sincerely,
>
> Vaggelis Papastavros
>
>
>
>
>