<div dir="ltr">unsubscribe<br></div><div class="gmail_extra"><br><div class="gmail_quote">2015-04-08 8:22 GMT-07:00  <span dir="ltr"><<a href="mailto:pacemaker-request@oss.clusterlabs.org" target="_blank">pacemaker-request@oss.clusterlabs.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send Pacemaker mailing list submissions to<br>

        <a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a><br>

<br>

To subscribe or unsubscribe via the World Wide Web, visit<br>

        <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

or, via email, send a message with subject or body 'help' to<br>

        <a href="mailto:pacemaker-request@oss.clusterlabs.org">pacemaker-request@oss.clusterlabs.org</a><br>

<br>

You can reach the person managing the list at<br>

        <a href="mailto:pacemaker-owner@oss.clusterlabs.org">pacemaker-owner@oss.clusterlabs.org</a><br>

<br>

When replying, please edit your Subject line so it is more specific<br>

than "Re: Contents of Pacemaker digest..."<br>

<br>

<br>

Today's Topics:<br>

<br>

   1. Re: update cib after fence (Michael Schwartzkopff)<br>

   2. Cluster with two STONITH devices (Jorge Lopes)<br>

<br>

<br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Wed, 08 Apr 2015 16:53:38 +0200<br>

From: Michael Schwartzkopff <<a href="mailto:ms@sys4.de">ms@sys4.de</a>><br>

To: The Pacemaker cluster resource manager<br>

        <<a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a>><br>

Subject: Re: [Pacemaker] update cib after fence<br>

Message-ID: <3983598.UMelJT2BHb@nb003><br>

Content-Type: text/plain; charset="iso-8859-1"<br>

<br>

On Wednesday, 8 April 2015, at 15:03:48, <a href="mailto:philipp.achmueller@arz.at">philipp.achmueller@arz.at</a> wrote:<br>

> hi,<br>

><br>

> how do I clean up the CIB on a node after an unexpected system halt?<br>

> The failed node still thinks it is running the VirtualDomain resource, which is<br>

> already running on another node in the cluster (successful takeover).<br>

><br>

> executing "pcs cluster start" -<br>

> ....<br>

> Apr  8 13:41:10 lnx0083a daemon:info lnx0083a<br>

> VirtualDomain(lnx0106a)[20360]: INFO: Virtual domain lnx0106a currently<br>

> has no state, retrying.<br>

> Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a<br>

> VirtualDomain(lnx0106a)[20360]: ERROR: Virtual domain lnx0106a has no<br>

> state during stop operation, bailing out.<br>

> Apr  8 13:41:12 lnx0083a daemon:info lnx0083a<br>

> VirtualDomain(lnx0106a)[20360]: INFO: Issuing forced shutdown (destroy)<br>

> request for domain lnx0106a.<br>

> Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a<br>

> VirtualDomain(lnx0106a)[20360]: ERROR: forced stop failed<br>

> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice:<br>

> operation_finished: lnx0106a_stop_0:20360:stderr [ error: failed to<br>

> connect to the hypervisor error: end of file while reading data: :<br>

> input/output error ]<br>

> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice:<br>

> operation_finished: lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced<br>

> stop failed ]<br>

> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice:<br>

> process_lrm_event: Operation lnx0106a_stop_0: unknown error<br>

> (node=lnx0083a, call=131, rc=1, cib-update=43, confirmed=true)<br>

> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice:<br>

> process_lrm_event: lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect<br>

> to the hypervisor error: end of file while reading data: : input/output<br>

> error\nocf-exit-reason:forced stop failed\n ]<br>

> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:<br>

> warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed<br>

> (target: 0 vs. rc: 1): Error<br>

> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:<br>

> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a<br>

> after failed stop: rc=1 (update=INFINITY, time=1428493272)<br>

> Apr  8 13:41:12 lnx0083b daemon:notice lnx0083b crmd[18244]:   notice:<br>

> abort_transition_graph: Transition aborted by lnx0106a_stop_0 'modify' on<br>

> lnx0083a: Event failed<br>

> (magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> cib=1.499.624, source=match_graph_event:350, 0)<br>

> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:<br>

> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a<br>

> after failed stop: rc=1 (update=INFINITY, time=1428493272)<br>

> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:<br>

> warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed<br>

> (target: 0 vs. rc: 1): Error<br>

> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:<br>

> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a<br>

> after failed stop: rc=1 (update=INFINITY, time=1428493272)<br>

> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:<br>

> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a<br>

> after failed stop: rc=1 (update=INFINITY, time=1428493272)<br>

> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:<br>

> warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on<br>

> lnx0083a: unknown error (1)<br>

> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:<br>

> warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on<br>

> lnx0083a: unknown error (1)<br>

> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:<br>

> warning: pe_fence_node: Node lnx0083a will be fenced because of resource<br>

> failure(s)<br>

> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:<br>

> warning: common_apply_stickiness: Forcing lnx0106a away from lnx0083a<br>

> after 1000000 failures (max=3)<br>

> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:<br>

> warning: stage6: Scheduling Node lnx0083a for STONITH<br>

> Apr  8 13:41:17 lnx0083b daemon:notice lnx0083b pengine[18243]:   notice:<br>

> native_stop_constraints: Stop of failed resource lnx0106a is implicit<br>

> after lnx0083a is fenced<br>

> ....<br>

><br>

> Node is fenced..<br>

><br>

> log from corosync.log:<br>

> ...<br>

> Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: mcp_read_config:<br>

> Configured corosync to accept connections from group 2035: OK (1)<br>

> Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: main:    Starting<br>

> Pacemaker 1.1.12 (Build: 4ed91da):  agent-manpages ascii-docs ncurses<br>

> libqb-logging libqb-ipc lha-fencing upstart nagios  corosync-native atomic-attrd<br>

> libesmtp acls<br>

> ....<br>

> Apr 08 13:16:04 [23690] lnx0083a        cib:     info: cib_perform_op:  +<br>

> /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id=<br>

> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:<br>

> @operation_key=lnx0106a_stop_0, @operation=stop,<br>

> @transition-key=106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @transition-magic=0:0;106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @call-id=538, @last-run=1428491757, @last-rc-change=1428491757,<br>

> @exec-time=7686<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute:<br>

> Sent update 40 with 3 changes for fail-count-vm-lnx0106a, id=<n/a>,<br>

> set=(null)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute:<br>

> Sent update 45 with 3 changes for fail-count-lnx0106a, id=<n/a>,<br>

> set=(null)<br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                     <lrm_resource id="lnx0106a" type="VirtualDomain"<br>

> class="ocf" provider="heartbeat"><br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                       <lrm_rsc_op id="lnx0106a_last_0"<br>

> operation_key="lnx0106a_monitor_0" operation="monitor"<br>

> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"<br>

> transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"<br>

> last-run="1427965815" last-rc-change="1427965815" exec-time="8<br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                     <lrm_resource id="lnx0106a" type="VirtualDomain"<br>

> class="ocf" provider="heartbeat"><br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                       <lrm_rsc_op id="lnx0106a_last_failure_0"<br>

> operation_key="lnx0106a_migrate_to_0" operation="migrate_to"<br>

> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"<br>

> transition-key="112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:1;112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0129a" call-id="444" rc-code="1" op-status="0" interval="0"<br>

> last-run="1427973596" last-rc-change="1427<br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                       <lrm_rsc_op id="lnx0106a_last_0"<br>

> operation_key="lnx0106a_stop_0" operation="stop"<br>

> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"<br>

> transition-key="113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:0;113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0129a" call-id="546" rc-code="0" op-status="0" interval="0"<br>

> last-run="1428403880" last-rc-change="1428403880" exec-time="2<br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                       <lrm_rsc_op id="lnx0106a_monitor_30000"<br>

> operation_key="lnx0106a_monitor_30000" operation="monitor"<br>

> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"<br>

> transition-key="47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:0;47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0129a" call-id="436" rc-code="0" op-status="0"<br>

> interval="30000" last-rc-change="1427965985" exec-time="1312<br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                     <lrm_resource id="lnx0106a" type="VirtualDomain"<br>

> class="ocf" provider="heartbeat"><br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                       <lrm_rsc_op id="lnx0106a_last_0"<br>

> operation_key="lnx0106a_start_0" operation="start"<br>

> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"<br>

> transition-key="110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:0;110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0129b" call-id="539" rc-code="0" op-status="0" interval="0"<br>

> last-run="1428491780" last-rc-change="1428491780" exec-tim<br>

> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                       <lrm_rsc_op id="lnx0106a_monitor_30000"<br>

> operation_key="lnx0106a_monitor_30000" operation="monitor"<br>

> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"<br>

> transition-key="89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:0;89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0129b" call-id="540" rc-code="0" op-status="0"<br>

> interval="30000" last-rc-change="1428491810" exec-time="12<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 40 for fail-count-vm-lnx0106a: OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 40 for fail-count-vm-lnx0106a[lnx0129a]=(null): OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 40 for fail-count-vm-lnx0106a[lnx0129b]=(null): OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 40 for fail-count-vm-lnx0106a[lnx0083b]=(null): OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 45 for fail-count-lnx0106a: OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 45 for fail-count-lnx0106a[lnx0129a]=(null): OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 45 for fail-count-lnx0106a[lnx0129b]=(null): OK (0)<br>

> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:<br>

>         Update 45 for fail-count-lnx0106a[lnx0083b]=(null): OK (0)<br>

> Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                                       <lrm_resource id="lnx0106a"<br>

> type="VirtualDomain" class="ocf" provider="heartbeat"><br>

> Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                                         <lrm_rsc_op id="lnx0106a_last_0"<br>

> operation_key="lnx0106a_monitor_0" operation="monitor"<br>

> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"<br>

> transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"<br>

> last-run="1427965815" last-rc-change="142796<br>

> Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info:<br>

> process_lrmd_get_rsc_info:      Resource 'lnx0106a' not found (27 active<br>

> resources)<br>

> Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info:<br>

> process_lrmd_rsc_register:      Added 'lnx0106a' to the rsc list (28<br>

> active resources)<br>

> Apr 08 13:41:07 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op:<br>

> Performing key=65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc<br>

> op=lnx0106a_monitor_0<br>

> Apr 08 13:41:08 [14233] lnx0083a       crmd:   notice: process_lrm_event:<br>

> Operation lnx0106a_monitor_0: not running (node=lnx0083a, call=114, rc=7,<br>

> cib-update=34, confirmed=true)<br>

> Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:  <lrm_resource<br>

> id="lnx0106a" type="VirtualDomain" class="ocf" provider="heartbeat"/><br>

> Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                                                                <lrm_rsc_op<br>

> id="lnx0106a_last_failure_0" operation_key="lnx0106a_monitor_0"<br>

> operation="monitor" crm-debug-origin="do_update_resource"<br>

> crm_feature_set="3.0.9"<br>

> transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"<br>

> last-ru<br>

> Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

>                                                                <lrm_rsc_op<br>

> id="lnx0106a_last_0" operation_key="lnx0106a_monitor_0"<br>

> operation="monitor" crm-debug-origin="do_update_resource"<br>

> crm_feature_set="3.0.9"<br>

> transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"<br>

> on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"<br>

> last-run="14284<br>

> Apr 08 13:41:09 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op:<br>

> Performing key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc<br>

> op=lnx0106a_stop_0<br>

> Apr 08 13:41:09 [14230] lnx0083a       lrmd:     info: log_execute:<br>

> executing - rsc:lnx0106a action:stop call_id:131<br>

> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:09 INFO: Virtual domain<br>

> lnx0106a currently has no state, retrying.<br>

> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:10 INFO: Virtual domain<br>

> lnx0106a currently has no state, retrying.<br>

> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: Virtual domain<br>

> lnx0106a has no state during stop operation, bailing out.<br>

> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 INFO: Issuing forced<br>

> shutdown (destroy) request for domain lnx0106a.<br>

> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: forced stop<br>

> failed<br>

> Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice: operation_finished:<br>

>         lnx0106a_stop_0:20360:stderr [ error: failed to connect to the<br>

> hypervisor error: end of file while reading data: : input/output error ]<br>

> Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice: operation_finished:<br>

>         lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed<br>

> ]<br>

> Apr 08 13:41:12 [14230] lnx0083a       lrmd:     info: log_finished:<br>

> finished - rsc:lnx0106a action:stop call_id:131 pid:20360 exit-code:1<br>

> exec-time:2609ms queue-time:0ms<br>

> Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event:<br>

> Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1,<br>

> cib-update=43, confirmed=true)<br>

> Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event:<br>

> lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor<br>

> error: end of file while reading data: : input/output<br>

> error\nocf-exit-reason:forced stop failed\n ]<br>

> Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  +<br>

> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id=<br>

> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_failure_0']:<br>

> @operation_key=lnx0106a_stop_0, @operation=stop,<br>

> @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @call-id=131, @rc-code=1, @last-run=1428493269,<br>

> @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop<br>

> Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  +<br>

> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id=<br>

> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:<br>

> @operation_key=lnx0106a_stop_0, @operation=stop,<br>

> @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @call-id=131, @rc-code=1, @last-run=1428493269,<br>

> @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop<br>

> failed<br>

> Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update:<br>

> Setting fail-count-lnx0106a[lnx0083a]: (null) -> INFINITY from lnx0083b<br>

> Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update:<br>

> Setting last-failure-lnx0106a[lnx0083a]: (null) -> 1428493272 from<br>

> lnx0083b<br>

> Apr 08 13:41:12 [14228] lnx0083a        cib:<br>

> info: cib_perform_op:  ++<br>

> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attri<br>

> butes[@id='status-1']: <nvpair id="status-1-fail-count-lnx0106a"<br>

> name="fail-count-lnx0106a" value="INFINITY"/><br>

> Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  ++<br>

> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attri<br>

> butes[@id='status-1']: <nvpair id="status-1-last-failure-lnx0106a"<br>

> name="last-failure-lnx0106a" value="1428493272"/><br>

> Apr 08 13:41:17 [14228] lnx0083a        cib:     info: cib_perform_op:  +<br>

> /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id=<br>

> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:<br>

> @operation_key=lnx0106a_stop_0, @operation=stop,<br>

> @transition-key=106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @transition-magic=0:0;106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,<br>

> @call-id=542, @last-run=1428493269, @last-rc-change=1428493269,<br>

> @exec-time=7645<br>

> ...<br>

><br>

> any ideas?<br>

><br>

> thank you!<br>

> Philipp<br>

<br>

It will get the latest version of the CIB after it connects to the cluster<br>

again.<br>

<br>

The CIB carries a version number (admin_epoch.epoch.num_updates, shown as<br>

cib=1.499.624 in the log above), so every node can decide whether it has the<br>

latest version of the CIB or should fetch it from another node.<br>
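<br>
A minimal sketch of that decision, assuming the version is the dotted triple that the log shows as cib=1.499.624 and that the highest triple wins. The helper names and the version reported for lnx0083a are made up for illustration. This is not Pacemaker code.<br>
<pre>
```python
# Sketch: pick the node holding the newest CIB by comparing version triples.
# Assumption: a version is written "admin_epoch.epoch.num_updates",
# as in "cib=1.499.624" from the log. Helper names are illustrative only.


def parse_cib_version(text):
    """Turn '1.499.624' into the integer tuple (1, 499, 624)."""
    admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
    return (admin_epoch, epoch, num_updates)


def newest_cib(versions_by_node):
    """Return the node whose CIB version triple is the highest."""
    return max(
        versions_by_node,
        key=lambda node: parse_cib_version(versions_by_node[node]),
    )


if __name__ == "__main__":
    # The value for lnx0083a is invented; a rejoining node would be behind.
    reported = {"lnx0083a": "1.499.10", "lnx0083b": "1.499.624"}
    print(newest_cib(reported))  # prints lnx0083b
```
</pre>
Comparing the parts as integers matters: as plain strings, "1.9.0" would sort after "1.10.0".<br>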

<br>

Kind regards,<br>

<br>

Michael Schwartzkopff<br>

<br>

--<br>

[*] sys4 AG<br>

<br>

<a href="http://sys4.de" target="_blank">http://sys4.de</a>, <a href="tel:%2B49%20%2889%29%2030%2090%2046%2064" value="+498930904664">+49 (89) 30 90 46 64</a>, <a href="tel:%2B49%20%28162%29%20165%200044" value="+491621650044">+49 (162) 165 0044</a><br>

Franziskanerstraße 15, 81669 München<br>

<br>

Registered office: München, district court (Amtsgericht) München: HRB 199263<br>

Executive board: Patrick Ben Koetter, Marc Schiffbauer<br>

Chairman of the supervisory board: Florian Kirstein<br>

<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Wed, 8 Apr 2015 16:20:50 +0100<br>

From: Jorge Lopes <<a href="mailto:jmclopes@gmail.com">jmclopes@gmail.com</a>><br>

To: <a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a><br>

Subject: [Pacemaker] Cluster with two STONITH devices<br>

Message-ID:<br>

        <CAASpg58BJLxNSt3JjyNqeuAEPX1bbgyGHTD_CeP=<a href="mailto:zXna1eY5jw@mail.gmail.com">zXna1eY5jw@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Hi all.<br>

<br>

I'm having difficulties orchestrating two STONITH devices in my cluster. I<br>

have been struggling with this over the past few days and I need some help, please.<br>

<br>

A simplified version of my cluster and its goals is as follows:<br>

- The cluster has two physical servers, each with two nodes (VMware virtual<br>

machines): overall, there are 4 nodes in this simplified version.<br>

- There are two resource groups: group-cluster-a and group-cluster-b.<br>

- To achieve a good CPU balance in the physical servers, the cluster is<br>

asymmetric, with one group running in one server and the other group<br>

running on the other server.<br>

- If the VM on one host becomes unusable, then its resources are started<br>

on its sister VM deployed on the other physical host.<br>

- If one physical host becomes unusable, then all resources are started<br>

on the other physical host.<br>

- Two STONITH levels are used to fence the problematic nodes.<br>

<br>

The resources have the following behavior:<br>

- If the resource monitor detects a problem, then Pacemaker tries to<br>

restart the resource in the same node.<br>

- If it fails, then STONITH takes place (vcenter reboots the VM) and<br>

Pacemaker starts the resource in the sister VM present in the other<br>

physical host.<br>

- If restarting the VM fails, I want to power off the physical server and<br>

Pacemaker will start all resources in the other physical host.<br>

<br>

<br>

The HA stack is:<br>

Ubuntu 14.04 (the node OS, which is a virtualized guest running in VMware<br>

ESXi 5.5)<br>

Pacemaker 1.1.12<br>

Corosync  2.3.4<br>

CRM 2.1.2<br>

<br>

The 4 nodes are:<br>

cluster-a-1<br>

cluster-a-2<br>

cluster-b-1<br>

cluster-b-2<br>

<br>

The relevant configuration is:<br>

<br>

property symmetric-cluster=false<br>

property stonith-enabled=true<br>

property no-quorum-policy=stop<br>

<br>

group group-cluster-a vip-cluster-a docker-web<br>

location loc-group-cluster-a-1 group-cluster-a inf: cluster-a-1<br>

location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2<br>

<br>

group group-cluster-b vip-cluster-b docker-srv<br>

location loc-group-cluster-b-1 group-cluster-b 500: cluster-b-1<br>

location loc-group-cluster-b-2 group-cluster-b inf: cluster-b-2<br>

<br>

<br>

# stonith vcenter definitions for host 1<br>

# run in any of the host2 VM<br>

primitive stonith-vcenter-host1 stonith:external/vcenter \<br>

  params \<br>

    VI_SERVER="192.168.40.20" \<br>

    VI_CREDSTORE="/etc/vicredentials.xml" \<br>

    HOSTLIST="cluster-a-1=cluster-a-1;cluster-a-2=cluster-a-2" \<br>

    RESETPOWERON="1" \<br>

  priority="2" \<br>

  pcmk_host_check="static-list" \<br>

  pcmk_host_list="cluster-a-1 cluster-a-2" \<br>

  op monitor interval="60s"<br>

<br>

location loc1-stonith-vcenter-host1 stonith-vcenter-host1 500: cluster-b-1<br>

location loc2-stonith-vcenter-host1 stonith-vcenter-host1 501: cluster-b-2<br>

<br>

# stonith vcenter definitions for host 2<br>

# run in any of the host1 VM<br>

primitive stonith-vcenter-host2 stonith:external/vcenter \<br>

  params \<br>

    VI_SERVER="192.168.40.21" \<br>

    VI_CREDSTORE="/etc/vicredentials.xml" \<br>

    HOSTLIST="cluster-b-1=cluster-b-1;cluster-b-2=cluster-b-2" \<br>

    RESETPOWERON="1" \<br>

  priority="2" \<br>

  pcmk_host_check="static-list" \<br>

  pcmk_host_list="cluster-b-1 cluster-b-2" \<br>

  op monitor interval="60s"<br>

<br>

location loc1-stonith-vcenter-host2 stonith-vcenter-host2 500: cluster-a-1<br>

location loc2-stonith-vcenter-host2 stonith-vcenter-host2 501: cluster-a-2<br>

<br>

<br>

# stonith IPMI definitions for host 1 (DELL with iDRAC 7 enterprise<br>

# interface at 192.168.40.15)<br>

# run in any of the host2 VM<br>

primitive stonith-ipmi-host1 stonith:external/ipmi \<br>

    params hostname="host1" ipaddr="192.168.40.15" userid="root" \<br>

        passwd="mypassword" interface="lanplus" \<br>

    priority="1" \<br>

    pcmk_host_check="static-list" \<br>

    pcmk_host_list="cluster-a-1 cluster-a-2" \<br>

    op start interval="0" timeout="60s" requires="nothing" \<br>

    op monitor interval="3600s" timeout="20s" requires="nothing"<br>

<br>

location loc1-stonith-ipmi-host1 stonith-ipmi-host1 500: cluster-b-1<br>

location loc2-stonith-ipmi-host1 stonith-ipmi-host1 501: cluster-b-2<br>

<br>

<br>

# stonith IPMI definitions for host 2 (DELL with iDRAC 7 enterprise<br>

# interface at 192.168.40.16)<br>

# run in any of the host1 VM<br>

primitive stonith-ipmi-host2 stonith:external/ipmi \<br>

    params hostname="host2" ipaddr="192.168.40.16" userid="root" \<br>

        passwd="mypassword" interface="lanplus" \<br>

    priority="1" \<br>

    pcmk_host_check="static-list" \<br>

    pcmk_host_list="cluster-b-1 cluster-b-2" \<br>

    op start interval="0" timeout="60s" requires="nothing" \<br>

    op monitor interval="3600s" timeout="20s" requires="nothing"<br>

<br>

location loc1-stonith-ipmi-host2 stonith-ipmi-host2 500: cluster-a-1<br>

location loc2-stonith-ipmi-host2 stonith-ipmi-host2 501: cluster-a-2<br>

<br>

<br>

What is working:<br>

- When an error is detected in a resource, the resource restarts on the<br>

same node, as expected.<br>

- With the STONITH external/ipmi resource *stopped*, a failure on one node<br>

makes vCenter reboot it, and the resources start on the sister node.<br>

<br>

<br>

What is not so good:<br>

- When vCenter reboots one node, the resources start on the other node<br>

as expected, but they return to the original node as soon as it comes back<br>

online. This causes a bit of ping-pong, and I think it is a consequence of<br>

how the locations are defined. Any suggestions to avoid this? After a<br>

resource has moved to another node, I would prefer that it stays there<br>

instead of returning to the original node. I can think of playing with<br>

the resource affinity scores; is this the way it should be done?<br>
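<br>
For illustration, the move back can be modelled with simple score arithmetic. The sketch below is not Pacemaker's placement code. It assumes that a group counts as one resource, that stickiness is added only on the node currently running it, and that scores are capped at INFINITY (1000000). The function names and the tie-breaking rule are made up for this example.<br>
<pre>
```python
# Sketch of how a location preference and stickiness add up per node.
# Assumptions (not taken from Pacemaker source): scores are capped at
# INFINITY, stickiness counts only on the node that currently runs the
# resource, and ties are resolved in favour of the current node.

INFINITY = 1000000


def add_scores(a, b):
    """Add two non-negative scores, capping the sum at INFINITY."""
    return min(a + b, INFINITY)


def preferred_node(location_scores, current_node, stickiness):
    """Return the node with the highest total score."""
    totals = {}
    for node, score in location_scores.items():
        bonus = stickiness if node == current_node else 0
        totals[node] = add_scores(score, bonus)
    # Second key element breaks ties in favour of the current node.
    return max(totals, key=lambda node: (totals[node], node == current_node))


if __name__ == "__main__":
    # Scores from the configuration above: inf on cluster-a-1, 500 on cluster-a-2.
    as_configured = {"cluster-a-1": INFINITY, "cluster-a-2": 500}
    print(preferred_node(as_configured, "cluster-a-2", 1000))

    # Hypothetical variant with a finite preference for cluster-a-1.
    finite = {"cluster-a-1": 1000, "cluster-a-2": 500}
    print(preferred_node(finite, "cluster-a-2", 1000))
```
</pre>
Under these assumptions, the first call picks cluster-a-1 because no finite stickiness can outweigh an inf: preference, while the second keeps the group on cluster-a-2 (500 + 1000 beats 1000).<br>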

<br>

What is wrong:<br>

Let's consider this scenario.<br>

I have a set of resources provided by a docker agent. My test consists of<br>

stopping the docker service on the node cluster-a-1, which makes the docker<br>

agent return OCF_ERR_INSTALLED to Pacemaker (this is a change I made to<br>

the docker agent, compared to the GitHub repository version). With the<br>

IPMI STONITH resource stopped, this leads to a restart of node cluster-a-1,<br>

which is expected.<br>

<br>

But with the IPMI STONITH resource started, I notice erratic behavior:<br>

- Sometimes, the resources on the node cluster-a-1 are stopped and no<br>

STONITH happens. Also, the resources are not moved to the node cluster-a-2.<br>

In this situation, if I manually restart the node cluster-a-1 (virtual<br>

machine restart), then the IPMI STONITH takes place and restarts the<br>

corresponding physical server.<br>

- Sometimes, the IPMI STONITH starts before the vCenter STONITH, which is<br>

not expected because the vCenter STONITH has higher priority.<br>

<br>

I might have something wrong in my stonith definition, but I can't figure<br>

out what.<br>

Any idea how to correct this?<br>
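<br>
The escalation I am after (vCenter first, IPMI only if that fails) can be sketched as ordered fencing levels, which is the idea behind Pacemaker's fencing topology. The code below is only a model: it assumes a level succeeds when every device in it succeeds, and run_device is a hypothetical callable standing in for the real fence agent.<br>
<pre>
```python
# Sketch of level-by-level fencing escalation.
# Assumptions: each level is a list of device names, a level succeeds
# only if every device in it succeeds, and `run_device` is a
# hypothetical stand-in for invoking the real fence agent.


def fence(levels, run_device):
    """Try levels in order; return the 1-based level that worked, or None."""
    for index, devices in enumerate(levels, start=1):
        # all() stops at the first failing device, abandoning this level.
        if all(run_device(device) for device in devices):
            return index
    return None


if __name__ == "__main__":
    levels = [["stonith-vcenter-host1"], ["stonith-ipmi-host1"]]
    # Pretend the vCenter reboot fails and the IPMI action succeeds.
    outcome = {"stonith-vcenter-host1": False, "stonith-ipmi-host1": True}
    print(fence(levels, outcome.get))  # prints 2
```
</pre>
In this model the IPMI device is never invoked when the vCenter level succeeds, which is the ordering described above.<br>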

<br>

And how can I set external/ipmi to power off the physical host, instead of<br>

rebooting it?<br>

<br>

------------------------------<br>

<br>

<br>

<br>

End of Pacemaker Digest, Vol 89, Issue 2<br>

****************************************<br>

</blockquote></div><br></div>