[Pacemaker] update cib after fence

Michael Schwartzkopff ms at sys4.de
Wed Apr 8 10:53:38 EDT 2015


Am Mittwoch, 8. April 2015, 15:03:48 schrieb philipp.achmueller at arz.at:
> hi,
> 
> how do I clean up the CIB on a node after an unexpected system halt?
> The failed node still thinks it is running a VirtualDomain resource that
> is already running on another node in the cluster (successful takeover):
> 
> executing "pcs cluster start" -
> ....
> Apr  8 13:41:10 lnx0083a daemon:info lnx0083a
> VirtualDomain(lnx0106a)[20360]: INFO: Virtual domain lnx0106a currently
> has no state, retrying.
> Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a
> VirtualDomain(lnx0106a)[20360]: ERROR: Virtual domain lnx0106a has no
> state during stop operation, bailing out.
> Apr  8 13:41:12 lnx0083a daemon:info lnx0083a
> VirtualDomain(lnx0106a)[20360]: INFO: Issuing forced shutdown (destroy)
> request for domain lnx0106a.
> Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a
> VirtualDomain(lnx0106a)[20360]: ERROR: forced stop failed
> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice:
> operation_finished: lnx0106a_stop_0:20360:stderr [ error: failed to
> connect to the hypervisor error: end of file while reading data: :
> input/output error ]
> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice:
> operation_finished: lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced
> stop failed ]
> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice:
> process_lrm_event: Operation lnx0106a_stop_0: unknown error
> (node=lnx0083a, call=131, rc=1, cib-update=43, confirmed=true)
> Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice:
> process_lrm_event: lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect
> to the hypervisor error: end of file while reading data: : input/output
> error\nocf-exit-reason:forced stop failed\n ]
> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed
> (target: 0 vs. rc: 1): Error
> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> after failed stop: rc=1 (update=INFINITY, time=1428493272)
> Apr  8 13:41:12 lnx0083b daemon:notice lnx0083b crmd[18244]:   notice:
> abort_transition_graph: Transition aborted by lnx0106a_stop_0 'modify' on
> lnx0083a: Event failed
> (magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> cib=1.499.624, source=match_graph_event:350, 0)
> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> after failed stop: rc=1 (update=INFINITY, time=1428493272)
> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed
> (target: 0 vs. rc: 1): Error
> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> after failed stop: rc=1 (update=INFINITY, time=1428493272)
> Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
> warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
> after failed stop: rc=1 (update=INFINITY, time=1428493272)
> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on
> lnx0083a: unknown error (1)
> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on
> lnx0083a: unknown error (1)
> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> warning: pe_fence_node: Node lnx0083a will be fenced because of resource
> failure(s)
> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> warning: common_apply_stickiness: Forcing lnx0106a away from lnx0083a
> after 1000000 failures (max=3)
> Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
> warning: stage6: Scheduling Node lnx0083a for STONITH
> Apr  8 13:41:17 lnx0083b daemon:notice lnx0083b pengine[18243]:   notice:
> native_stop_constraints: Stop of failed resource lnx0106a is implicit
> after lnx0083a is fenced
> ....
> 
> Node is fenced...
> 
> log from corosync.log:
> ...
> Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: mcp_read_config:
> Configured corosync to accept connections from group 2035: OK (1)
> Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: main:    Starting
> Pacemaker 1.1.12 (Build: 4ed91da):  agent-manpages ascii-docs ncurses
> libqb-logging libqb-ip
> c lha-fencing upstart nagios  corosync-native atomic-attrd libesmtp acls
> ....
> Apr 08 13:16:04 [23690] lnx0083a        cib:     info: cib_perform_op:  +
> /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id=
> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
> @operation_key=lnx0106a_stop_0, @operation=stop,
> @transition-key=106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @transition-magic=0:0;106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @call-id=538, @last-run=1428491757, @last-rc-change=1428491757,
> @exec-time=7686
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute:
> Sent update 40 with 3 changes for fail-count-vm-lnx0106a, id=<n/a>,
> set=(null)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute:
> Sent update 45 with 3 changes for fail-count-lnx0106a, id=<n/a>,
> set=(null)
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                     <lrm_resource id="lnx0106a" type="VirtualDomain"
> class="ocf" provider="heartbeat">
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                       <lrm_rsc_op id="lnx0106a_last_0"
> operation_key="lnx0106a_monitor_0" operation="monitor"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"
> transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"
> last-run="1427965815" last-rc-change="1427965815" exec-time="8
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                     <lrm_resource id="lnx0106a" type="VirtualDomain"
> class="ocf" provider="heartbeat">
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                       <lrm_rsc_op id="lnx0106a_last_failure_0"
> operation_key="lnx0106a_migrate_to_0" operation="migrate_to"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:1;112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0129a" call-id="444" rc-code="1" op-status="0" interval="0"
> last-run="1427973596" last-rc-change="1427
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                       <lrm_rsc_op id="lnx0106a_last_0"
> operation_key="lnx0106a_stop_0" operation="stop"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:0;113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0129a" call-id="546" rc-code="0" op-status="0" interval="0"
> last-run="1428403880" last-rc-change="1428403880" exec-time="2
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                       <lrm_rsc_op id="lnx0106a_monitor_30000"
> operation_key="lnx0106a_monitor_30000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:0;47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0129a" call-id="436" rc-code="0" op-status="0"
> interval="30000" last-rc-change="1427965985" exec-time="1312
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                     <lrm_resource id="lnx0106a" type="VirtualDomain"
> class="ocf" provider="heartbeat">
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                       <lrm_rsc_op id="lnx0106a_last_0"
> operation_key="lnx0106a_start_0" operation="start"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:0;110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0129b" call-id="539" rc-code="0" op-status="0" interval="0"
> last-run="1428491780" last-rc-change="1428491780" exec-tim
> Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                       <lrm_rsc_op id="lnx0106a_monitor_30000"
> operation_key="lnx0106a_monitor_30000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
> transition-key="89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:0;89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0129b" call-id="540" rc-code="0" op-status="0"
> interval="30000" last-rc-change="1428491810" exec-time="12
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 40 for fail-count-vm-lnx0106a: OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 40 for fail-count-vm-lnx0106a[lnx0129a]=(null): OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 40 for fail-count-vm-lnx0106a[lnx0129b]=(null): OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 40 for fail-count-vm-lnx0106a[lnx0083b]=(null): OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 45 for fail-count-lnx0106a: OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 45 for fail-count-lnx0106a[lnx0129a]=(null): OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 45 for fail-count-lnx0106a[lnx0129b]=(null): OK (0)
> Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback:
>         Update 45 for fail-count-lnx0106a[lnx0083b]=(null): OK (0)
> Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                                       <lrm_resource id="lnx0106a"
> type="VirtualDomain" class="ocf" provider="heartbeat">
> Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                                         <lrm_rsc_op id="lnx0106a_last_0"
> operation_key="lnx0106a_monitor_0" operation="monitor"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"
> transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"
> last-run="1427965815" last-rc-change="142796
> Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info:
> process_lrmd_get_rsc_info:      Resource 'lnx0106a' not found (27 active
> resources)
> Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info:
> process_lrmd_rsc_register:      Added 'lnx0106a' to the rsc list (28
> active resources)
> Apr 08 13:41:07 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op:
> Performing key=65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc
> op=lnx0106a_monitor_0
> Apr 08 13:41:08 [14233] lnx0083a       crmd:   notice: process_lrm_event:
> Operation lnx0106a_monitor_0: not running (node=lnx0083a, call=114, rc=7,
> cib-update=34, confirmed=true)
> Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:  <lrm_resource
> id="lnx0106a" type="VirtualDomain" class="ocf" provider="heartbeat"/>
> Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                                                                <lrm_rsc_op
> id="lnx0106a_last_failure_0" operation_key="lnx0106a_monitor_0"
> operation="monitor" crm-debug-origin="do_update_resource"
> crm_feature_set="3.0.9"
> transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"
> last-ru
> Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
>                                                                <lrm_rsc_op
> id="lnx0106a_last_0" operation_key="lnx0106a_monitor_0"
> operation="monitor" crm-debug-origin="do_update_resource"
> crm_feature_set="3.0.9"
> transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
> on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"
> last-run="14284
> Apr 08 13:41:09 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op:
> Performing key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc
> op=lnx0106a_stop_0
> Apr 08 13:41:09 [14230] lnx0083a       lrmd:     info: log_execute:
> executing - rsc:lnx0106a action:stop call_id:131
> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:09 INFO: Virtual domain
> lnx0106a currently has no state, retrying.
> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:10 INFO: Virtual domain
> lnx0106a currently has no state, retrying.
> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: Virtual domain
> lnx0106a has no state during stop operation, bailing out.
> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 INFO: Issuing forced
> shutdown (destroy) request for domain lnx0106a.
> VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: forced stop
> failed
> Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice: operation_finished:
>         lnx0106a_stop_0:20360:stderr [ error: failed to connect to the
> hypervisor error: end of file while reading data: : input/output error ]
> Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice: operation_finished:
>         lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed
> ]
> Apr 08 13:41:12 [14230] lnx0083a       lrmd:     info: log_finished:
> finished - rsc:lnx0106a action:stop call_id:131 pid:20360 exit-code:1
> exec-time:2609ms queue-time:0ms
> Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event:
> Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1,
> cib-update=43, confirmed=true)
> Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event:
> lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor
> error: end of file while reading data: : input/output
> error\nocf-exit-reason:forced stop failed\n ]
> Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id=
> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_failure_0']:
> @operation_key=lnx0106a_stop_0, @operation=stop,
> @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @call-id=131, @rc-code=1, @last-run=1428493269,
> @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
> Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id=
> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
> @operation_key=lnx0106a_stop_0, @operation=stop,
> @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @call-id=131, @rc-code=1, @last-run=1428493269,
> @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
> failed
> Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update:
> Setting fail-count-lnx0106a[lnx0083a]: (null) -> INFINITY from lnx0083b
> Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update:
> Setting last-failure-lnx0106a[lnx0083a]: (null) -> 1428493272 from
> lnx0083b
> Apr 08 13:41:12 [14228] lnx0083a        cib:  
> info: cib_perform_op:  ++
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attri
> butes[@id='status-1']: <nvpair id="status-1-fail-count-lnx0106a"
> name="fail-count-lnx0106a" value="INFINITY"/>
> Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  ++
> /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attri
> butes[@id='status-1']: <nvpair id="status-1-last-failure-lnx0106a"
> name="last-failure-lnx0106a" value="1428493272"/>
> Apr 08 13:41:17 [14228] lnx0083a        cib:     info: cib_perform_op:  +
> /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id=
> 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
> @operation_key=lnx0106a_stop_0, @operation=stop,
> @transition-key=106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @transition-magic=0:0;106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
> @call-id=542, @last-run=1428493269, @last-rc-change=1428493269,
> @exec-time=7645
> ...
> 
> any ideas?
> 
> thank you!
> Philipp

It will get the latest version of the CIB after it reconnects to the cluster.

The CIB carries a version (admin_epoch, epoch, num_updates), so every node 
can decide whether it already has the latest version of the CIB or should 
fetch it from another node.
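If the stale operation history and failcount still block the resource after 
the node rejoins, they can be cleared by hand. A rough sketch, assuming the 
resource and node names from the logs above (exact syntax varies between 
Pacemaker/pcs versions):

```shell
# Clear the failed stop from the CIB status section for this resource
# on the failed node (also resets the failcount there):
crm_resource --cleanup --resource lnx0106a --node lnx0083a

# Equivalent with pcs:
pcs resource cleanup lnx0106a
```

After the cleanup, the policy engine re-probes the resource and should see 
that it is only active on the node it was recovered to.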

Mit freundlichen Grüßen,

Michael Schwartzkopff

-- 
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein

