[Pacemaker] Pacemaker Digest, Vol 89, Issue 2

Maxim Fedchishin sevmax at gmail.com
Wed Apr 8 11:50:10 EDT 2015


unsubscribe

2015-04-08 8:22 GMT-07:00 <pacemaker-request at oss.clusterlabs.org>:

> Send Pacemaker mailing list submissions to
>         pacemaker at oss.clusterlabs.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> or, via email, send a message with subject or body 'help' to
>         pacemaker-request at oss.clusterlabs.org
>
> You can reach the person managing the list at
>         pacemaker-owner at oss.clusterlabs.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Pacemaker digest..."
>
>
> Today's Topics:
>
>    1. Re: update cib after fence (Michael Schwartzkopff)
>    2. Cluster with two STONITH devices (Jorge Lopes)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 08 Apr 2015 16:53:38 +0200
> From: Michael Schwartzkopff <ms at sys4.de>
> To: The Pacemaker cluster resource manager
>         <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] update cib after fence
> Message-ID: <3983598.UMelJT2BHb at nb003>
> Content-Type: text/plain; charset="iso-8859-1"
>
> On Wednesday, 8 April 2015 at 15:03:48, philipp.achmueller at arz.at wrote:
> > hi,
> >
> > how do I clean up the CIB on a node after an unexpected system halt?
> > the failed node still thinks it is running a VirtualDomain resource, which
> > is already running on another node in the cluster (successful takeover):
> >
> > executing "pcs cluster start" -
> > ....
> > Apr  8 13:41:10 lnx0083a daemon:info lnx0083a
> > VirtualDomain(lnx0106a)[20360]: INFO: Virtual domain lnx0106a currently
> > has no state, retrying.
> > Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a
> > VirtualDomain(lnx0106a)[20360]: ERROR: Virtual domain lnx0106a has no
> > state during stop operation, bailing out.
> > Apr  8 13:41:12 lnx0083a daemon:info lnx0083a
> > VirtualDomain(lnx0106a)[20360]: INFO: Issuing forced shutdown (destroy)
> > request for domain lnx0106a.
> > Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a
> > VirtualDomain(lnx0106a)[20360]: ERROR: forced stop failed
> > Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice:
> > operation_finished: lnx0106a_stop_0:20360:stderr [ error: failed to
> > connect to the hypervisor error: end of file while reading data: :
> > input/output error ]
> > Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice:
> > operation_finished: lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced
> > stop failed ]
> > Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice:
> > process_lrm_event: Operation lnx0106a_stop_0: unknown error
> > (node=lnx0083a, call=131, rc=1, cib-update=43, confirmed=true)
> > Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice:
> > process_lrm_event: lnx0083a-lnx0106a_stop_0:131 [ error: failed to
> connect
> > to the hypervisor error: end of file while reading data: : input/output
> > error\nocf-exit-reason:forced stop failed\n ]
> > Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> > warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed
> > (target: 0 vs. rc: 1): Error
> > Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> > after failed stop: rc=1 (update=INFINITY, time=1428493272)
> > Apr  8 13:41:12 lnx0083b daemon:notice lnx0083b crmd[18244]:   notice:
> > abort_transition_graph: Transition aborted by lnx0106a_stop_0 'modify' on
> > lnx0083a: Event failed
> > (magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > cib=1.499.624, source=match_graph_event:350, 0)
> > Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> > after failed stop: rc=1 (update=INFINITY, time=1428493272)
> > Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> > warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed
> > (target: 0 vs. rc: 1): Error
> > Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> > after failed stop: rc=1 (update=INFINITY, time=1428493272)
> > Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> > after failed stop: rc=1 (update=INFINITY, time=1428493272)
> > Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> > warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on
> > lnx0083a: unknown error (1)
> > Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> > warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on
> > lnx0083a: unknown error (1)
> > Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> > warning: pe_fence_node: Node lnx0083a will be fenced because of resource
> > failure(s)
> > Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> > warning: common_apply_stickiness: Forcing lnx0106a away from lnx0083a
> > after 1000000 failures (max=3)
> > Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> > warning: stage6: Scheduling Node lnx0083a for STONITH
> > Apr  8 13:41:17 lnx0083b daemon:notice lnx0083b pengine[18243]:   notice:
> > native_stop_constraints: Stop of failed resource lnx0106a is implicit
> > after lnx0083a is fenced
> > ....
> >
> > Node is fenced..
> >
> > log from corosync.log:
> > ...
> > Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: mcp_read_config:
> > Configured corosync to accept connections from group 2035: OK (1)
> > Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: main:    Starting
> > Pacemaker 1.1.12 (Build: 4ed91da):  agent-manpages ascii-docs ncurses
> > libqb-logging libqb-ip
> > c lha-fencing upstart nagios  corosync-native atomic-attrd libesmtp acls
> > ....
> > Apr 08 13:16:04 [23690] lnx0083a        cib:     info: cib_perform_op:  +
> >
> /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id=
> > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
> > @operation_key=lnx0106a_stop_0, @operation=stop,
> > @transition-key=106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @transition-magic=0:0;106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @call-id=538, @last-run=1428491757, @last-rc-change=1428491757,
> > @exec-time=7686
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute:
> > Sent update 40 with 3 changes for fail-count-vm-lnx0106a, id=<n/a>,
> > set=(null)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute:
> > Sent update 45 with 3 changes for fail-count-lnx0106a, id=<n/a>,
> > set=(null)
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                     <lrm_resource id="lnx0106a" type="VirtualDomain"
> > class="ocf" provider="heartbeat">
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                       <lrm_rsc_op id="lnx0106a_last_0"
> > operation_key="lnx0106a_monitor_0" operation="monitor"
> > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"
> > transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"
> > last-run="1427965815" last-rc-change="1427965815" exec-time="8
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                     <lrm_resource id="lnx0106a" type="VirtualDomain"
> > class="ocf" provider="heartbeat">
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                       <lrm_rsc_op id="lnx0106a_last_failure_0"
> > operation_key="lnx0106a_migrate_to_0" operation="migrate_to"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> > transition-key="112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:1;112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0129a" call-id="444" rc-code="1" op-status="0" interval="0"
> > last-run="1427973596" last-rc-change="1427
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                       <lrm_rsc_op id="lnx0106a_last_0"
> > operation_key="lnx0106a_stop_0" operation="stop"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> > transition-key="113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:0;113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0129a" call-id="546" rc-code="0" op-status="0" interval="0"
> > last-run="1428403880" last-rc-change="1428403880" exec-time="2
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                       <lrm_rsc_op id="lnx0106a_monitor_30000"
> > operation_key="lnx0106a_monitor_30000" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> > transition-key="47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:0;47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0129a" call-id="436" rc-code="0" op-status="0"
> > interval="30000" last-rc-change="1427965985" exec-time="1312
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                     <lrm_resource id="lnx0106a" type="VirtualDomain"
> > class="ocf" provider="heartbeat">
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                       <lrm_rsc_op id="lnx0106a_last_0"
> > operation_key="lnx0106a_start_0" operation="start"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> > transition-key="110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:0;110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0129b" call-id="539" rc-code="0" op-status="0" interval="0"
> > last-run="1428491780" last-rc-change="1428491780" exec-tim
> > Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                       <lrm_rsc_op id="lnx0106a_monitor_30000"
> > operation_key="lnx0106a_monitor_30000" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> > transition-key="89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:0;89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0129b" call-id="540" rc-code="0" op-status="0"
> > interval="30000" last-rc-change="1428491810" exec-time="12
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 40 for fail-count-vm-lnx0106a: OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 40 for fail-count-vm-lnx0106a[lnx0129a]=(null): OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 40 for fail-count-vm-lnx0106a[lnx0129b]=(null): OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 40 for fail-count-vm-lnx0106a[lnx0083b]=(null): OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 45 for fail-count-lnx0106a: OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 45 for fail-count-lnx0106a[lnx0129a]=(null): OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 45 for fail-count-lnx0106a[lnx0129b]=(null): OK (0)
> > Apr 08 13:41:04 [14231] lnx0083a      attrd:     info:
> attrd_cib_callback:
> >         Update 45 for fail-count-lnx0106a[lnx0083b]=(null): OK (0)
> > Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                                       <lrm_resource id="lnx0106a"
> > type="VirtualDomain" class="ocf" provider="heartbeat">
> > Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >                                         <lrm_rsc_op id="lnx0106a_last_0"
> > operation_key="lnx0106a_monitor_0" operation="monitor"
> > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"
> > transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"
> > last-run="1427965815" last-rc-change="142796
> > Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info:
> > process_lrmd_get_rsc_info:      Resource 'lnx0106a' not found (27 active
> > resources)
> > Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info:
> > process_lrmd_rsc_register:      Added 'lnx0106a' to the rsc list (28
> > active resources)
> > Apr 08 13:41:07 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op:
> > Performing key=65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc
> > op=lnx0106a_monitor_0
> > Apr 08 13:41:08 [14233] lnx0083a       crmd:   notice: process_lrm_event:
> > Operation lnx0106a_monitor_0: not running (node=lnx0083a, call=114, rc=7,
> > cib-update=34, confirmed=true)
> > Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:
> <lrm_resource
> > id="lnx0106a" type="VirtualDomain" class="ocf" provider="heartbeat"/>
> > Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >
> <lrm_rsc_op
> > id="lnx0106a_last_failure_0" operation_key="lnx0106a_monitor_0"
> > operation="monitor" crm-debug-origin="do_update_resource"
> > crm_feature_set="3.0.9"
> > transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"
> > last-ru
> > Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >
> <lrm_rsc_op
> > id="lnx0106a_last_0" operation_key="lnx0106a_monitor_0"
> > operation="monitor" crm-debug-origin="do_update_resource"
> > crm_feature_set="3.0.9"
> > transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> > on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"
> > last-run="14284
> > Apr 08 13:41:09 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op:
> > Performing key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc
> > op=lnx0106a_stop_0
> > Apr 08 13:41:09 [14230] lnx0083a       lrmd:     info: log_execute:
> > executing - rsc:lnx0106a action:stop call_id:131
> > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:09 INFO: Virtual domain
> > lnx0106a currently has no state, retrying.
> > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:10 INFO: Virtual domain
> > lnx0106a currently has no state, retrying.
> > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: Virtual domain
> > lnx0106a has no state during stop operation, bailing out.
> > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 INFO: Issuing forced
> > shutdown (destroy) request for domain lnx0106a.
> > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: forced stop
> > failed
> > Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice:
> operation_finished:
> >         lnx0106a_stop_0:20360:stderr [ error: failed to connect to the
> > hypervisor error: end of file while reading data: : input/output error ]
> > Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice:
> operation_finished:
> >         lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed
> > ]
> > Apr 08 13:41:12 [14230] lnx0083a       lrmd:     info: log_finished:
> > finished - rsc:lnx0106a action:stop call_id:131 pid:20360 exit-code:1
> > exec-time:2609ms queue-time:0ms
> > Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event:
> > Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1,
> > cib-update=43, confirmed=true)
> > Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event:
> > lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor
> > error: end of file while reading data: : input/output
> > error\nocf-exit-reason:forced stop failed\n ]
> > Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  +
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id=
> > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_failure_0']:
> > @operation_key=lnx0106a_stop_0, @operation=stop,
> > @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @call-id=131, @rc-code=1, @last-run=1428493269,
> > @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
> > Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  +
> >
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id=
> > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
> > @operation_key=lnx0106a_stop_0, @operation=stop,
> > @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @call-id=131, @rc-code=1, @last-run=1428493269,
> > @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
> > failed
> > Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update:
> > Setting fail-count-lnx0106a[lnx0083a]: (null) -> INFINITY from lnx0083b
> > Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update:
> > Setting last-failure-lnx0106a[lnx0083a]: (null) -> 1428493272 from
> > lnx0083b
> > Apr 08 13:41:12 [14228] lnx0083a        cib:
> > info: cib_perform_op:  ++
> >
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attri
> > butes[@id='status-1']: <nvpair id="status-1-fail-count-lnx0106a"
> > name="fail-count-lnx0106a" value="INFINITY"/>
> > Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:
> ++
> >
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attri
> > butes[@id='status-1']: <nvpair id="status-1-last-failure-lnx0106a"
> > name="last-failure-lnx0106a" value="1428493272"/>
> > Apr 08 13:41:17 [14228] lnx0083a        cib:     info: cib_perform_op:  +
> >
> /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id=
> > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
> > @operation_key=lnx0106a_stop_0, @operation=stop,
> > @transition-key=106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @transition-magic=0:0;106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> > @call-id=542, @last-run=1428493269, @last-rc-change=1428493269,
> > @exec-time=7645
> > ...
> >
> > any ideas?
> >
> > thank you!
> > Philipp
>
> The node will get the latest version of the CIB after it connects to the
> cluster again.
>
> The CIB carries version counters (and a last-written timestamp), so every
> node can decide whether it has the latest version of the CIB or should
> fetch it from another node.
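>
> If the stale history for lnx0106a lingers on the rejoined node, it can be
> checked and cleared by hand; a minimal sketch using the standard tools
> (resource and node names taken from the logs above):
>
>   # compare the CIB version counters on the nodes; the
>   # admin_epoch/epoch/num_updates attributes should match once synced
>   cibadmin --query | head -n 1
>
>   # clear the failed stop history for the resource on the recovered node
>   crm_resource --cleanup --resource lnx0106a --node lnx0083a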
>
> Kind regards,
>
> Michael Schwartzkopff
>
> --
> [*] sys4 AG
>
> http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
> Franziskanerstraße 15, 81669 München
>
> Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
> Vorstand: Patrick Ben Koetter, Marc Schiffbauer
> Aufsichtsratsvorsitzender: Florian Kirstein
>
> ------------------------------
>
> Message: 2
> Date: Wed, 8 Apr 2015 16:20:50 +0100
> From: Jorge Lopes <jmclopes at gmail.com>
> To: pacemaker at oss.clusterlabs.org
> Subject: [Pacemaker] Cluster with two STONITH devices
> Message-ID:
>         <CAASpg58BJLxNSt3JjyNqeuAEPX1bbgyGHTD_CeP=
> zXna1eY5jw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi all.
>
> I'm having difficulties orchestrating two STONITH devices in my cluster. I
> have been struggling with this for the past few days and I need some help,
> please.
>
> A simplified version of my cluster and its goals is as follows:
> - The cluster has two physical servers, each hosting two nodes (VMware
> virtual machines): overall, there are 4 nodes in this simplified version.
> - There are two resource groups: group-cluster-a and group-cluster-b.
> - To achieve a good CPU balance across the physical servers, the cluster is
> asymmetric, with one group running on one server and the other group
> running on the other server.
> - If a VM on one host becomes unusable, then its resources are started on
> its sister VM deployed on the other physical host.
> - If one physical host becomes unusable, then all resources are started on
> the other physical host.
> - Two STONITH levels are used to fence the problematic nodes (see the
> sketch after this list).
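>
> For reference (this is not part of the configuration below), two ordered
> fence levels per node are usually declared explicitly, for example with a
> crmsh fencing_topology section ("pcs stonith level add" serves the same
> purpose); a sketch using the node and STONITH resource names from this
> post, exact syntax depending on the crmsh version:
>
>   fencing_topology \
>     cluster-a-1: stonith-vcenter-host1 stonith-ipmi-host1 \
>     cluster-a-2: stonith-vcenter-host1 stonith-ipmi-host1 \
>     cluster-b-1: stonith-vcenter-host2 stonith-ipmi-host2 \
>     cluster-b-2: stonith-vcenter-host2 stonith-ipmi-host2
>
> Each device listed after a node is tried as its own level, in order, so the
> vCenter device would be attempted before the IPMI device.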
>
> The resources have the following behavior:
> - If the resource monitor detects a problem, then Pacemaker tries to
> restart the resource on the same node.
> - If that fails, then STONITH takes place (vCenter reboots the VM) and
> Pacemaker starts the resource on the sister VM on the other physical host.
> - If restarting the VM fails, I want to power off the physical server so
> that Pacemaker starts all resources on the other physical host.
>
>
> The HA stack is:
> Ubuntu 14.04 (the node OS, which is a virtualized guest running on VMware
> ESXi 5.5)
> Pacemaker 1.1.12
> Corosync  2.3.4
> CRM 2.1.2
>
> The 4 nodes are:
> cluster-a-1
> cluster-a-2
> cluster-b-1
> cluster-b-2
>
> The relevant configuration is:
>
> property symmetric-cluster=false
> property stonith-enabled=true
> property no-quorum-policy=stop
>
> group group-cluster-a vip-cluster-a docker-web
> location loc-group-cluster-a-1 group-cluster-a inf: cluster-a-1
> location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2
>
> group group-cluster-b vip-cluster-b docker-srv
> location loc-group-cluster-b-1 group-cluster-b 500: cluster-b-1
> location loc-group-cluster-b-2 group-cluster-b inf: cluster-b-2
>
>
> # stonith vcenter definitions for host 1
> # run in any of the host2 VM
> primitive stonith-vcenter-host1 stonith:external/vcenter \
>   params \
>     VI_SERVER="192.168.40.20" \
>     VI_CREDSTORE="/etc/vicredentials.xml" \
>     HOSTLIST="cluster-a-1=cluster-a-1;cluster-a-2=cluster-a-2" \
>     RESETPOWERON="1" \
>   priority="2" \
>   pcmk_host_check="static-list" \
>   pcmk_host_list="cluster-a-1 cluster-a-2" \
>   op monitor interval="60s"
>
> location loc1-stonith-vcenter-host1 stonith-vcenter-host1 500: cluster-b-1
> location loc2-stonith-vcenter-host1 stonith-vcenter-host1 501: cluster-b-2
>
> # stonith vcenter definitions for host 2
> # run in any of the host1 VM
> primitive stonith-vcenter-host2 stonith:external/vcenter \
>   params \
>     VI_SERVER="192.168.40.21" \
>     VI_CREDSTORE="/etc/vicredentials.xml" \
>     HOSTLIST="cluster-b-1=cluster-b-1;cluster-b-2=cluster-b-2" \
>     RESETPOWERON="1" \
>   priority="2" \
>   pcmk_host_check="static-list" \
>   pcmk_host_list="cluster-b-1 cluster-b-2" \
>   op monitor interval="60s"
>
> location loc1-stonith-vcenter-host2 stonith-vcenter-host2 500: cluster-a-1
> location loc2-stonith-vcenter-host2 stonith-vcenter-host2 501: cluster-a-2
>
>
> # stonith IPMI definitions for host 1 (DELL with iDRAC 7 enterprise
> interface at 192.168.40.15)
> # run in any of the host2 VM
> primitive stonith-ipmi-host1 stonith:external/ipmi \
>     params hostname="host1" ipaddr="192.168.40.15" userid="root"
> passwd="mypassword" interface="lanplus" \
>     priority="1" \
>     pcmk_host_check="static-list" \
>     pcmk_host_list="cluster-a-1 cluster-a-2" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> location loc1-stonith-ipmi-host1 stonith-ipmi-host1 500: cluster-b-1
> location loc2-stonith-ipmi-host1 stonith-ipmi-host1 501: cluster-b-2
>
>
> # stonith IPMI definitions for host 2 (DELL with iDRAC 7 enterprise
> interface at 192.168.40.16)
> # run in any of the host1 VM
> primitive stonith-ipmi-host2 stonith:external/ipmi \
>     params hostname="host2" ipaddr="192.168.40.16" userid="root"
> passwd="mypassword" interface="lanplus" \
>     priority="1" \
>     pcmk_host_check="static-list" \
>     pcmk_host_list="cluster-b-1 cluster-b-2" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> location loc1-stonith-ipmi-host2 stonith-ipmi-host2 500: cluster-a-1
> location loc2-stonith-ipmi-host2 stonith-ipmi-host2 501: cluster-a-2
>
>
> What is working:
> - When an error is detected in a resource, the resource restarts on the
> same node, as expected.
> - With the STONITH external/ipmi resource *stopped*, a failure on one node
> makes vCenter reboot it and the resources start on the sister node.
>
>
> What is not so good:
> - When vCenter reboots a node, the resources start on the other node as
> expected, but they return to the original node as soon as it comes back
> online. This causes a bit of ping-pong, and I think it is a consequence of
> how the locations are defined. Any suggestion to avoid this? After a
> resource has been moved to another node, I would prefer that it stay there
> instead of returning to the original node. I can think of playing with the
> resource affinity scores - is that the way it should be done?
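>
> A common pattern for this (a sketch with hypothetical scores, not the only
> way) is to give each group a finite preference for its home node instead of
> inf:, plus a resource stickiness larger than the gap between the two
> location scores:
>
>   location loc-group-cluster-a-1 group-cluster-a 1000: cluster-a-1
>   location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2
>   rsc_defaults resource-stickiness=1000
>
> With an INFINITY score on the home node, as configured above, no finite
> stickiness can hold a group on the sister node once the home node returns.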
>
> What is wrong:
> Let's consider this scenario.
> I have a set of resources provided by a docker agent. My test consists of
> stopping the docker service on the node cluster-a-1, which makes the docker
> agent return OCF_ERR_INSTALLED to Pacemaker (this is a change I made to the
> docker agent, compared to the GitHub repository version). With the IPMI
> STONITH resource stopped, this leads to the node cluster-a-1 being
> restarted, which is expected.
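>
> For context, the change described amounts to something like the following
> in the agent's status check - a hypothetical sketch, not the actual patch:
>
>   if ! docker info > /dev/null 2>&1; then
>       ocf_log err "docker daemon is not running"
>       return $OCF_ERR_INSTALLED
>   fi
>
> Keep in mind that OCF_ERR_INSTALLED is a hard error, so Pacemaker bans the
> resource from that node rather than retrying it there.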
>
> But with the IPMI STONITH resource started, I notice erratic behavior:
> - Sometimes, the resources on the node cluster-a-1 are stopped and no
> STONITH happens. The resources are also not moved to the node cluster-a-2.
> In this situation, if I manually restart the node cluster-a-1 (virtual
> machine restart), then the IPMI STONITH takes place and restarts the
> corresponding physical server.
> - Sometimes, the IPMI STONITH fires before the vCenter STONITH, which is
> not expected because the vCenter STONITH has higher priority.
>
> I might have something wrong in my stonith definition, but I can't figure
> out what.
> Any idea how to correct this?
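>
> For reference, the cluster's view of the fencing setup can be inspected
> with the standard read-only commands (shown here only as a pointer, nothing
> specific to this configuration):
>
>   crm_verify -L -V                  # sanity-check the live configuration
>   stonith_admin --list-registered   # devices registered with stonith-ng
>   stonith_admin --list cluster-a-1  # devices able to fence a given node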
>
> And how can I set external/ipmi to power off the physical host, instead of
> rebooting it?
>
> ------------------------------
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>
> End of Pacemaker Digest, Vol 89, Issue 2
> ****************************************
>