[ClusterLabs] FW: update cib after fence
philipp.achmueller at arz.at
Wed Apr 8 13:11:50 UTC 2015
hi,

how do I clean up the CIB on a node after an unexpected system halt? The
failed node still thinks it is running a VirtualDomain resource that has in
fact already been taken over successfully by another node in the cluster.
Running "pcs cluster start" on the failed node gives:
....
Apr 8 13:41:10 lnx0083a daemon:info lnx0083a
VirtualDomain(lnx0106a)[20360]: INFO: Virtual domain lnx0106a currently
has no state, retrying.
Apr 8 13:41:12 lnx0083a daemon:err|error lnx0083a
VirtualDomain(lnx0106a)[20360]: ERROR: Virtual domain lnx0106a has no
state during stop operation, bailing out.
Apr 8 13:41:12 lnx0083a daemon:info lnx0083a
VirtualDomain(lnx0106a)[20360]: INFO: Issuing forced shutdown (destroy)
request for domain lnx0106a.
Apr 8 13:41:12 lnx0083a daemon:err|error lnx0083a
VirtualDomain(lnx0106a)[20360]: ERROR: forced stop failed
Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]: notice:
operation_finished: lnx0106a_stop_0:20360:stderr [ error: failed to
connect to the hypervisor error: end of file while reading data: :
input/output error ]
Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]: notice:
operation_finished: lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced
stop failed ]
Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]: notice:
process_lrm_event: Operation lnx0106a_stop_0: unknown error
(node=lnx0083a, call=131, rc=1, cib-update=43, confirmed=true)
Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]: notice:
process_lrm_event: lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect
to the hypervisor error: end of file while reading data: : input/output
error\nocf-exit-reason:forced stop failed\n ]
Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed
(target: 0 vs. rc: 1): Error
Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr 8 13:41:12 lnx0083b daemon:notice lnx0083b crmd[18244]: notice:
abort_transition_graph: Transition aborted by lnx0106a_stop_0 'modify' on
lnx0083a: Event failed
(magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
cib=1.499.624, source=match_graph_event:350, 0)
Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed
(target: 0 vs. rc: 1): Error
Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]:
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on
lnx0083a: unknown error (1)
Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on
lnx0083a: unknown error (1)
Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
warning: pe_fence_node: Node lnx0083a will be fenced because of resource
failure(s)
Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
warning: common_apply_stickiness: Forcing lnx0106a away from lnx0083a
after 1000000 failures (max=3)
Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]:
warning: stage6: Scheduling Node lnx0083a for STONITH
Apr 8 13:41:17 lnx0083b daemon:notice lnx0083b pengine[18243]: notice:
native_stop_constraints: Stop of failed resource lnx0106a is implicit
after lnx0083a is fenced
....
The node is then fenced.

Log from corosync.log:
...
Apr 08 13:41:00 [14226] lnx0083a pacemakerd: notice: mcp_read_config:
Configured corosync to accept connections from group 2035: OK (1)
Apr 08 13:41:00 [14226] lnx0083a pacemakerd: notice: main: Starting
Pacemaker 1.1.12 (Build: 4ed91da): agent-manpages ascii-docs ncurses
libqb-logging libqb-ip
c lha-fencing upstart nagios corosync-native atomic-attrd libesmtp acls
....
Apr 08 13:16:04 [23690] lnx0083a cib: info: cib_perform_op: +
/cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
@operation_key=lnx0106a_stop_0, @operation=stop,
@transition-key=106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@transition-magic=0:0;106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@call-id=538, @last-run=1428491757, @last-rc-change=1428491757,
@exec-time=7686
Apr 08 13:41:04 [14231] lnx0083a attrd: info: write_attribute:
Sent update 40 with 3 changes for fail-count-vm-lnx0106a, id=<n/a>,
set=(null)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: write_attribute:
Sent update 45 with 3 changes for fail-count-lnx0106a, id=<n/a>,
set=(null)
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_resource id="lnx0106a" type="VirtualDomain"
class="ocf" provider="heartbeat">
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_last_0"
operation_key="lnx0106a_monitor_0" operation="monitor"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"
transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"
last-run="1427965815" last-rc-change="1427965815" exec-time="8
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_resource id="lnx0106a" type="VirtualDomain"
class="ocf" provider="heartbeat">
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_last_failure_0"
operation_key="lnx0106a_migrate_to_0" operation="migrate_to"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
transition-key="112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:1;112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0129a" call-id="444" rc-code="1" op-status="0" interval="0"
last-run="1427973596" last-rc-change="1427
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_last_0"
operation_key="lnx0106a_stop_0" operation="stop"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
transition-key="113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:0;113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0129a" call-id="546" rc-code="0" op-status="0" interval="0"
last-run="1428403880" last-rc-change="1428403880" exec-time="2
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_monitor_30000"
operation_key="lnx0106a_monitor_30000" operation="monitor"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
transition-key="47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:0;47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0129a" call-id="436" rc-code="0" op-status="0"
interval="30000" last-rc-change="1427965985" exec-time="1312
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_resource id="lnx0106a" type="VirtualDomain"
class="ocf" provider="heartbeat">
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_last_0"
operation_key="lnx0106a_start_0" operation="start"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
transition-key="110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:0;110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0129b" call-id="539" rc-code="0" op-status="0" interval="0"
last-run="1428491780" last-rc-change="1428491780" exec-tim
Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_monitor_30000"
operation_key="lnx0106a_monitor_30000" operation="monitor"
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9"
transition-key="89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:0;89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0129b" call-id="540" rc-code="0" op-status="0"
interval="30000" last-rc-change="1428491810" exec-time="12
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 40 for fail-count-vm-lnx0106a: OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 40 for fail-count-vm-lnx0106a[lnx0129a]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 40 for fail-count-vm-lnx0106a[lnx0129b]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 40 for fail-count-vm-lnx0106a[lnx0083b]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 45 for fail-count-lnx0106a: OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 45 for fail-count-lnx0106a[lnx0129a]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 45 for fail-count-lnx0106a[lnx0129b]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a attrd: info: attrd_cib_callback:
Update 45 for fail-count-lnx0106a[lnx0083b]=(null): OK (0)
Apr 08 13:41:05 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_resource id="lnx0106a"
type="VirtualDomain" class="ocf" provider="heartbeat">
Apr 08 13:41:05 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op id="lnx0106a_last_0"
operation_key="lnx0106a_monitor_0" operation="monitor"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9"
transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0"
last-run="1427965815" last-rc-change="142796
Apr 08 13:41:07 [14230] lnx0083a lrmd: info:
process_lrmd_get_rsc_info: Resource 'lnx0106a' not found (27 active
resources)
Apr 08 13:41:07 [14230] lnx0083a lrmd: info:
process_lrmd_rsc_register: Added 'lnx0106a' to the rsc list (28
active resources)
Apr 08 13:41:07 [14233] lnx0083a crmd: info: do_lrm_rsc_op:
Performing key=65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc
op=lnx0106a_monitor_0
Apr 08 13:41:08 [14233] lnx0083a crmd: notice: process_lrm_event:
Operation lnx0106a_monitor_0: not running (node=lnx0083a, call=114, rc=7,
cib-update=34, confirmed=true)
Apr 08 13:41:08 [14228] lnx0083a cib: info: cib_perform_op: ++
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources: <lrm_resource
id="lnx0106a" type="VirtualDomain" class="ocf" provider="heartbeat"/>
Apr 08 13:41:08 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op
id="lnx0106a_last_failure_0" operation_key="lnx0106a_monitor_0"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.9"
transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"
last-ru
Apr 08 13:41:08 [14228] lnx0083a cib: info: cib_perform_op: ++
<lrm_rsc_op
id="lnx0106a_last_0" operation_key="lnx0106a_monitor_0"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.9"
transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc"
on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0"
last-run="14284
Apr 08 13:41:09 [14233] lnx0083a crmd: info: do_lrm_rsc_op:
Performing key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc
op=lnx0106a_stop_0
Apr 08 13:41:09 [14230] lnx0083a lrmd: info: log_execute:
executing - rsc:lnx0106a action:stop call_id:131
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:09 INFO: Virtual domain
lnx0106a currently has no state, retrying.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:10 INFO: Virtual domain
lnx0106a currently has no state, retrying.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: Virtual domain
lnx0106a has no state during stop operation, bailing out.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 INFO: Issuing forced
shutdown (destroy) request for domain lnx0106a.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: forced stop
failed
Apr 08 13:41:12 [14230] lnx0083a lrmd: notice: operation_finished:
lnx0106a_stop_0:20360:stderr [ error: failed to connect to the
hypervisor error: end of file while reading data: : input/output error ]
Apr 08 13:41:12 [14230] lnx0083a lrmd: notice: operation_finished:
lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed
]
Apr 08 13:41:12 [14230] lnx0083a lrmd: info: log_finished:
finished - rsc:lnx0106a action:stop call_id:131 pid:20360 exit-code:1
exec-time:2609ms queue-time:0ms
Apr 08 13:41:12 [14233] lnx0083a crmd: notice: process_lrm_event:
Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1,
cib-update=43, confirmed=true)
Apr 08 13:41:12 [14233] lnx0083a crmd: notice: process_lrm_event:
lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor
error: end of file while reading data: : input/output
error\nocf-exit-reason:forced stop failed\n ]
Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_failure_0']:
@operation_key=lnx0106a_stop_0, @operation=stop,
@transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@call-id=131, @rc-code=1, @last-run=1428493269,
@last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
@operation_key=lnx0106a_stop_0, @operation=stop,
@transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@call-id=131, @rc-code=1, @last-run=1428493269,
@last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
failed
Apr 08 13:41:12 [14231] lnx0083a attrd: info: attrd_peer_update:
Setting fail-count-lnx0106a[lnx0083a]: (null) -> INFINITY from lnx0083b
Apr 08 13:41:12 [14231] lnx0083a attrd: info: attrd_peer_update:
Setting last-failure-lnx0106a[lnx0083a]: (null) -> 1428493272 from
lnx0083b
Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: ++
/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']:
<nvpair id="status-1-fail-count-lnx0106a" name="fail-count-lnx0106a"
value="INFINITY"/>
Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: ++
/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']:
<nvpair id="status-1-last-failure-lnx0106a" name="last-failure-lnx0106a"
value="1428493272"/>
Apr 08 13:41:17 [14228] lnx0083a cib: info: cib_perform_op: +
/cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']:
@operation_key=lnx0106a_stop_0, @operation=stop,
@transition-key=106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@transition-magic=0:0;106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc,
@call-id=542, @last-run=1428493269, @last-rc-change=1428493269,
@exec-time=7645
...
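For reference, this is the kind of cleanup I would have expected to use here
(a sketch only; the resource and node names are the ones from the logs above):

    # re-probe lnx0106a on all nodes and drop its stale operation history
    pcs resource cleanup lnx0106a

    # or, scoped to the rebooted node, with the plain Pacemaker CLI:
    crm_resource --cleanup --resource lnx0106a --node lnx0083a

Is that the right approach after a fence, or does the node's status section
need to be cleared some other way before running "pcs cluster start"?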
any ideas?
thank you!
Philipp