[ClusterLabs] WG: update cib after fence

philipp.achmueller at arz.at philipp.achmueller at arz.at
Wed Apr 8 09:11:50 EDT 2015


hi,

how to cleanup cib from node after unexpected system halt?
failed node still thinks of running VirtualDomain resource, which is 
already running on other node in cluster(sucessful takeover:

executing "pcs cluster start" - 
....
Apr  8 13:41:10 lnx0083a daemon:info lnx0083a 
VirtualDomain(lnx0106a)[20360]: INFO: Virtual domain lnx0106a currently 
has no state, retrying.
Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a 
VirtualDomain(lnx0106a)[20360]: ERROR: Virtual domain lnx0106a has no 
state during stop operation, bailing out.
Apr  8 13:41:12 lnx0083a daemon:info lnx0083a 
VirtualDomain(lnx0106a)[20360]: INFO: Issuing forced shutdown (destroy) 
request for domain lnx0106a.
Apr  8 13:41:12 lnx0083a daemon:err|error lnx0083a 
VirtualDomain(lnx0106a)[20360]: ERROR: forced stop failed
Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice: 
operation_finished: lnx0106a_stop_0:20360:stderr [ error: failed to 
connect to the hypervisor error: end of file while reading data: : 
input/output error ]
Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]:   notice: 
operation_finished: lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced 
stop failed ]
Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice: 
process_lrm_event: Operation lnx0106a_stop_0: unknown error 
(node=lnx0083a, call=131, rc=1, cib-update=43, confirmed=true)
Apr  8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]:   notice: 
process_lrm_event: lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect 
to the hypervisor error: end of file while reading data: : input/output 
error\nocf-exit-reason:forced stop failed\n ]
Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: 
warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed 
(target: 0 vs. rc: 1): Error
Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: 
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a 
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr  8 13:41:12 lnx0083b daemon:notice lnx0083b crmd[18244]:   notice: 
abort_transition_graph: Transition aborted by lnx0106a_stop_0 'modify' on 
lnx0083a: Event failed 
(magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
cib=1.499.624, source=match_graph_event:350, 0)
Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: 
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a 
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: 
warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed 
(target: 0 vs. rc: 1): Error
Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: 
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a 
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr  8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: 
warning: update_failcount: Updating failcount for lnx0106a on lnx0083a 
after failed stop: rc=1 (update=INFINITY, time=1428493272)
Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: 
warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on 
lnx0083a: unknown error (1)
Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: 
warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on 
lnx0083a: unknown error (1)
Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: 
warning: pe_fence_node: Node lnx0083a will be fenced because of resource 
failure(s)
Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: 
warning: common_apply_stickiness: Forcing lnx0106a away from lnx0083a 
after 1000000 failures (max=3)
Apr  8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: 
warning: stage6: Scheduling Node lnx0083a for STONITH
Apr  8 13:41:17 lnx0083b daemon:notice lnx0083b pengine[18243]:   notice: 
native_stop_constraints: Stop of failed resource lnx0106a is implicit 
after lnx0083a is fenced
....

Node is fenced..

log from corosync.log:
...
Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: mcp_read_config:   
Configured corosync to accept connections from group 2035: OK (1)
Apr 08 13:41:00 [14226] lnx0083a pacemakerd:   notice: main:    Starting 
Pacemaker 1.1.12 (Build: 4ed91da):  agent-manpages ascii-docs ncurses 
libqb-logging libqb-ip
c lha-fencing upstart nagios  corosync-native atomic-attrd libesmtp acls
....
Apr 08 13:16:04 [23690] lnx0083a        cib:     info: cib_perform_op:  + 
/cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']: 
 @operation_key=lnx0106a_stop_0, @operation=stop, 
@transition-key=106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@transition-magic=0:0;106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@call-id=538, @last-run=1428491757, @last-rc-change=1428491757, 
@exec-time=7686
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute: 
Sent update 40 with 3 changes for fail-count-vm-lnx0106a, id=<n/a>, 
set=(null)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: write_attribute: 
Sent update 45 with 3 changes for fail-count-lnx0106a, id=<n/a>, 
set=(null)
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                    <lrm_resource id="lnx0106a" type="VirtualDomain" 
class="ocf" provider="heartbeat">
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                      <lrm_rsc_op id="lnx0106a_last_0" 
operation_key="lnx0106a_monitor_0" operation="monitor" 
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9" 
transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0" 
last-run="1427965815" last-rc-change="1427965815" exec-time="8
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                    <lrm_resource id="lnx0106a" type="VirtualDomain" 
class="ocf" provider="heartbeat">
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                      <lrm_rsc_op id="lnx0106a_last_failure_0" 
operation_key="lnx0106a_migrate_to_0" operation="migrate_to" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" 
transition-key="112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:1;112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0129a" call-id="444" rc-code="1" op-status="0" interval="0" 
last-run="1427973596" last-rc-change="1427
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                      <lrm_rsc_op id="lnx0106a_last_0" 
operation_key="lnx0106a_stop_0" operation="stop" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" 
transition-key="113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:0;113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0129a" call-id="546" rc-code="0" op-status="0" interval="0" 
last-run="1428403880" last-rc-change="1428403880" exec-time="2
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                      <lrm_rsc_op id="lnx0106a_monitor_30000" 
operation_key="lnx0106a_monitor_30000" operation="monitor" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" 
transition-key="47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:0;47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0129a" call-id="436" rc-code="0" op-status="0" 
interval="30000" last-rc-change="1427965985" exec-time="1312
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                    <lrm_resource id="lnx0106a" type="VirtualDomain" 
class="ocf" provider="heartbeat">
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                      <lrm_rsc_op id="lnx0106a_last_0" 
operation_key="lnx0106a_start_0" operation="start" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" 
transition-key="110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:0;110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0129b" call-id="539" rc-code="0" op-status="0" interval="0" 
last-run="1428491780" last-rc-change="1428491780" exec-tim
Apr 08 13:41:04 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                      <lrm_rsc_op id="lnx0106a_monitor_30000" 
operation_key="lnx0106a_monitor_30000" operation="monitor" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" 
transition-key="89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:0;89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0129b" call-id="540" rc-code="0" op-status="0" 
interval="30000" last-rc-change="1428491810" exec-time="12
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 40 for fail-count-vm-lnx0106a: OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 40 for fail-count-vm-lnx0106a[lnx0129a]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 40 for fail-count-vm-lnx0106a[lnx0129b]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 40 for fail-count-vm-lnx0106a[lnx0083b]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 45 for fail-count-lnx0106a: OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 45 for fail-count-lnx0106a[lnx0129a]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 45 for fail-count-lnx0106a[lnx0129b]=(null): OK (0)
Apr 08 13:41:04 [14231] lnx0083a      attrd:     info: attrd_cib_callback: 
        Update 45 for fail-count-lnx0106a[lnx0083b]=(null): OK (0)
Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                                      <lrm_resource id="lnx0106a" 
type="VirtualDomain" class="ocf" provider="heartbeat">
Apr 08 13:41:05 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                                        <lrm_rsc_op id="lnx0106a_last_0" 
operation_key="lnx0106a_monitor_0" operation="monitor" 
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9" 
transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0" 
last-run="1427965815" last-rc-change="142796
Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info: 
process_lrmd_get_rsc_info:      Resource 'lnx0106a' not found (27 active 
resources)
Apr 08 13:41:07 [14230] lnx0083a       lrmd:     info: 
process_lrmd_rsc_register:      Added 'lnx0106a' to the rsc list (28 
active resources)
Apr 08 13:41:07 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op: 
Performing key=65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc 
op=lnx0106a_monitor_0
Apr 08 13:41:08 [14233] lnx0083a       crmd:   notice: process_lrm_event: 
Operation lnx0106a_monitor_0: not running (node=lnx0083a, call=114, rc=7, 
cib-update=34, confirmed=true)
Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:  <lrm_resource 
id="lnx0106a" type="VirtualDomain" class="ocf" provider="heartbeat"/>
Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                                                               <lrm_rsc_op 
id="lnx0106a_last_failure_0" operation_key="lnx0106a_monitor_0" 
operation="monitor" crm-debug-origin="do_update_resource" 
crm_feature_set="3.0.9" 
transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0" 
last-ru
Apr 08 13:41:08 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
                                                               <lrm_rsc_op 
id="lnx0106a_last_0" operation_key="lnx0106a_monitor_0" 
operation="monitor" crm-debug-origin="do_update_resource" 
crm_feature_set="3.0.9" 
transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" 
on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0" 
last-run="14284
Apr 08 13:41:09 [14233] lnx0083a       crmd:     info: do_lrm_rsc_op: 
Performing key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc 
op=lnx0106a_stop_0
Apr 08 13:41:09 [14230] lnx0083a       lrmd:     info: log_execute: 
executing - rsc:lnx0106a action:stop call_id:131
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:09 INFO: Virtual domain 
lnx0106a currently has no state, retrying.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:10 INFO: Virtual domain 
lnx0106a currently has no state, retrying.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: Virtual domain 
lnx0106a has no state during stop operation, bailing out.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 INFO: Issuing forced 
shutdown (destroy) request for domain lnx0106a.
VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: forced stop 
failed
Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice: operation_finished: 
        lnx0106a_stop_0:20360:stderr [ error: failed to connect to the 
hypervisor error: end of file while reading data: : input/output error ]
Apr 08 13:41:12 [14230] lnx0083a       lrmd:   notice: operation_finished: 
        lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed 
]
Apr 08 13:41:12 [14230] lnx0083a       lrmd:     info: log_finished: 
finished - rsc:lnx0106a action:stop call_id:131 pid:20360 exit-code:1 
exec-time:2609ms queue-time:0ms
Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event: 
Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1, 
cib-update=43, confirmed=true)
Apr 08 13:41:12 [14233] lnx0083a       crmd:   notice: process_lrm_event: 
lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor 
error: end of file while reading data: : input/output 
error\nocf-exit-reason:forced stop failed\n ]
Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  + 
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_failure_0']: 
 @operation_key=lnx0106a_stop_0, @operation=stop, 
@transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@call-id=131, @rc-code=1, @last-run=1428493269, 
@last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop
Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  + 
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']: 
 @operation_key=lnx0106a_stop_0, @operation=stop, 
@transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@call-id=131, @rc-code=1, @last-run=1428493269, 
@last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop 
failed
Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update: 
Setting fail-count-lnx0106a[lnx0083a]: (null) -> INFINITY from lnx0083b
Apr 08 13:41:12 [14231] lnx0083a      attrd:     info: attrd_peer_update: 
Setting last-failure-lnx0106a[lnx0083a]: (null) -> 1428493272 from 
lnx0083b
Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']: 
 <nvpair id="status-1-fail-count-lnx0106a" name="fail-count-lnx0106a" 
value="INFINITY"/>
Apr 08 13:41:12 [14228] lnx0083a        cib:     info: cib_perform_op:  ++ 
/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']: 
 <nvpair id="status-1-last-failure-lnx0106a" name="last-failure-lnx0106a" 
value="1428493272"/>
Apr 08 13:41:17 [14228] lnx0083a        cib:     info: cib_perform_op:  + 
/cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']: 
 @operation_key=lnx0106a_stop_0, @operation=stop, 
@transition-key=106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@transition-magic=0:0;106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, 
@call-id=542, @last-run=1428493269, @last-rc-change=1428493269, 
@exec-time=7645
...

any ideas?

thank you!
Philipp


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20150408/74e91592/attachment-0002.html>


More information about the Users mailing list