[Pacemaker] After pacemaker is stopped on DC, it gets fenced because last 'stop' operation is lost

Tue Aug 16 05:21:09 EDT 2011

Hi,

I recently discovered that when I stop pacemaker on DC node (for
upgrade), that node is shortly fenced by a new DC.

Fencing is caused by not-stopped (from the PoV of new DC) clone instance
(dlm in my case) which was actually stopped:

Aug 16 08:56:29 v03-a pengine: [2083]: WARN: custom_action: Action
dlm:2_stop_0 on mgmt01 is unrunnable (offline)
Aug 16 08:56:29 v03-a pengine: [2083]: WARN: custom_action: Marking node
mgmt01 unclean
Aug 16 08:56:29 v03-a pengine: [2083]: WARN: stage6: Scheduling Node
mgmt01 for STONITH

Actually, that stop op was correctly done:

Aug 16 08:56:09 mgmt01 pengine: [1465]: notice: LogActions: Stop
dlm:2#011(mgmt01)
Aug 16 08:56:23 mgmt01 crmd: [1466]: info: te_rsc_command: Initiating
action 113: stop dlm:2_stop_0 on mgmt01 (local)
Aug 16 08:56:23 mgmt01 crmd: [1466]: info: do_lrm_rsc_op: Performing
key=113:6:0:99d2326c-7ad7-4e74-92ac-417303ef94a6 op=dlm:2_stop_0 )
Aug 16 08:56:23 mgmt01 lrmd: [1463]: info: rsc:dlm:2:167: stop
Aug 16 08:56:23 mgmt01 lrmd: [1463]: info: RA output:
(dlm:2:stop:stderr) dlm_controld.pcmk: no process killed
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: process_lrm_event: LRM
operation dlm:2_stop_0 (call=167, rc=0, cib-update=248, confirmed=true) ok
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: match_graph_event: Action
dlm:2_stop_0 (113) confirmed on mgmt01 (rc=0)

dlm stop is one of last operations (or very last) performed during
pacemaker shutdown, so I suspect that result of that operation is just
not propagated over the cluster. So right after lines above I see
following (sorry for many wrapped lines):

Aug 16 08:56:24 mgmt01 crmd: [1466]: info: te_pseudo_action: Pseudo
action 117 fired and confirmed
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: te_crm_command: Executing
crm-event (337): do_shutdown on mgmt01
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: te_crm_command: crm-event
(337) is a local shutdown
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: te_pseudo_action: Pseudo
action 83 fired and confirmed
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: run_graph:
====================================================
Aug 16 08:56:24 mgmt01 crmd: [1466]: notice: run_graph: Transition 6
(Complete=41, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-82.bz2): Complete
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: te_graph_trigger: Transition
6 is now complete
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_STOPPING [ input=I_STOP
cause=C_FSA_INTERNAL origin=notify_crmd ]
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_dc_release: DC role released
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: pe_connection_destroy:
Connection to the Policy Engine released
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_te_control: Transitioner
is now inactive
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_shutdown: Disconnecting
STONITH...
Aug 16 08:56:24 mgmt01 crmd: [1466]: info:
tengine_stonith_connection_destroy: Fencing daemon disconnected
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_lrm_control: Disconnected
from the LRM
Aug 16 08:56:24 mgmt01 crmd: [1466]: notice: terminate_ais_connection:
Disconnecting from AIS
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_ha_control: Disconnected
from OpenAIS
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_cib_control: Disconnecting CIB
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: crmd_cib_connection_destroy:
Connection to the CIB terminated...
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_exit: Performing A_EXIT_0
- gracefully exiting the CRMd
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: free_mem: Dropping
I_RELEASE_SUCCESS: [ state=S_STOPPING cause=C_FSA_INTERNAL
origin=do_dc_release ]
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: free_mem: Dropping
I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: crm_xml_cleanup: Cleaning up
memory from libxml2
Aug 16 08:56:24 mgmt01 crmd: [1466]: info: do_exit: [crmd] stopped (0)
Aug 16 08:56:24 mgmt01 corosync[1116]:   [pcmk  ] info: pcmk_ipc_exit:
Client crmd (conn=0xca83b0, async-conn=0xca83b0) left
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_child_exit:
Child process crmd exited (pid=1466, rc=0)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: info: update_node_processes:
Node mgmt01 now has process list: 00000000000000000000000000111112 (was
00000000000000000000000000111312)
Aug 16 08:56:24 mgmt01 stonith-ng: [1461]: info: crm_update_peer: Node
mgmt01: id=1543767306 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000111112 (new)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
crmd confirmed stopped
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: stop_child: Stopping
pengine: Sent -15 to process 1465
Aug 16 08:56:24 mgmt01 corosync[1116]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.crmd failed: ipc delivery
failed (rc=-2)
Aug 16 08:56:24 mgmt01 corosync[1116]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.crmd failed: ipc delivery
failed (rc=-2)
Aug 16 08:56:24 mgmt01 corosync[1116]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.crmd failed: ipc delivery
failed (rc=-2)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_child_exit:
Child process pengine exited (pid=1465, rc=0)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: info: update_node_processes:
Node mgmt01 now has process list: 00000000000000000000000000101112 (was
00000000000000000000000000111112)
Aug 16 08:56:24 mgmt01 stonith-ng: [1461]: info: crm_update_peer: Node
mgmt01: id=1543767306 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000101112 (new)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
pengine confirmed stopped
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: stop_child: Stopping
attrd: Sent -15 to process 1464
Aug 16 08:56:24 mgmt01 attrd: [1464]: notice: main: Exiting...
Aug 16 08:56:24 mgmt01 corosync[1116]:   [pcmk  ] info: pcmk_ipc_exit:
Client attrd (conn=0xc9dce0, async-conn=0xc9dce0) left
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_child_exit:
Child process attrd exited (pid=1464, rc=0)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: info: update_node_processes:
Node mgmt01 now has process list: 00000000000000000000000000100112 (was
00000000000000000000000000101112)
Aug 16 08:56:24 mgmt01 stonith-ng: [1461]: info: crm_update_peer: Node
mgmt01: id=1543767306 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000100112 (new)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
attrd confirmed stopped
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: stop_child: Stopping
lrmd: Sent -15 to process 1463
Aug 16 08:56:24 mgmt01 lrmd: [1463]: info: lrmd is shutting down
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_child_exit:
Child process lrmd exited (pid=1463, rc=0)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: info: update_node_processes:
Node mgmt01 now has process list: 00000000000000000000000000100102 (was
00000000000000000000000000100112)
Aug 16 08:56:24 mgmt01 stonith-ng: [1461]: info: crm_update_peer: Node
mgmt01: id=1543767306 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000100102 (new)
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
lrmd confirmed stopped
Aug 16 08:56:24 mgmt01 pacemakerd: [1457]: notice: stop_child: Stopping
cib: Sent -15 to process 1462
Aug 16 08:56:24 mgmt01 cib: [1462]: info: cib_process_readwrite: We are
now in R/O mode
Aug 16 08:56:24 mgmt01 cib: [1462]: WARN: send_ipc_message: IPC Channel
to 1466 is not connected
Aug 16 08:56:24 mgmt01 cib: [1462]: WARN: send_via_callback_channel:
Delivery of reply to client 1466/d688f643-42e7-4240-b163-39fae08bf740 failed
Aug 16 08:56:24 mgmt01 cib: [1462]: WARN: do_local_notify: A-Sync reply
to crmd failed: reply failed
Aug 16 08:56:24 mgmt01 cib: [1462]: info: crm_signal_dispatch: Invoking
handler for signal 15: Terminated
Aug 16 08:56:24 mgmt01 cib: [1462]: info: cib_shutdown: Disconnected 3
clients
Aug 16 08:56:24 mgmt01 cib: [1462]: info: crm_update_peer: Node mgmt01:
id=1543767306 state=member addr=r(0) ip(10.5.4.220)  votes=1 born=80548
seen=80564 proc=00000000000000000000000000111112 (new)
Aug 16 08:56:24 mgmt01 cib: [1462]: info: crm_update_peer: Node mgmt01:
id=1543767306 state=member addr=r(0) ip(10.5.4.220)  votes=1 born=80548
seen=80564 proc=00000000000000000000000000101112 (new)
Aug 16 08:56:24 mgmt01 cib: [1462]: info: crm_update_peer: Node mgmt01:
id=1543767306 state=member addr=r(0) ip(10.5.4.220)  votes=1 born=80548
seen=80564 proc=00000000000000000000000000100112 (new)
Aug 16 08:56:24 mgmt01 cib: [1462]: info: crm_update_peer: Node mgmt01:
id=1543767306 state=member addr=r(0) ip(10.5.4.220)  votes=1 born=80548
seen=80564 proc=00000000000000000000000000100102 (new)
Aug 16 08:56:24 mgmt01 cib: [1462]: info: cib_process_disconnect: All
clients disconnected...
Aug 16 08:56:24 mgmt01 cib: [1462]: info: initiate_exit: Sending
disconnect notification to 3 peers...
Aug 16 08:56:25 mgmt01 cib: [1462]: WARN: cib_process_diff: Diff
0.4787.60 -> 0.4787.61 not applied to 0.4787.61: current "num_updates"
is greater than required
Aug 16 08:56:25 mgmt01 cib: [1462]: info: apply_xml_diff: Digest
mis-match: expected baf0dd99bb1bf6e68274cfee15fe6770, calculated
dc26695f7b534d6ad6380ada73d7fd2a
Aug 16 08:56:25 mgmt01 cib: [1462]: notice: cib_process_diff: Diff
0.4787.61 -> 0.4787.62 not applied to 0.4787.61: Failed application of
an update diff
Aug 16 08:56:25 mgmt01 cib: [1462]: info: cib_server_process_diff:
Requesting re-sync from peer
Aug 16 08:56:25 mgmt01 cib: [1462]: notice: cib_server_process_diff: Not
applying diff 0.4787.62 -> 0.4787.63 (sync in progress)
Aug 16 08:56:25 mgmt01 cib: [1462]: info: cib_replace_notify: Replaced:
-1.-1.-1 -> 0.4787.63 from v03-a
Aug 16 08:56:26 mgmt01 corosync[1116]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.attrd failed: ipc delivery
failed (rc=-2)
Aug 16 08:56:27 mgmt01 corosync[1116]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.attrd failed: ipc delivery
failed (rc=-2)
Aug 16 08:56:28 mgmt01 corosync[1116]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.attrd failed: ipc delivery
failed (rc=-2)
Aug 16 08:56:29 mgmt01 stonith-ng: [1461]: info: stonith_command:
Processed st_query from v03-a: rc=0
Aug 16 08:56:30 mgmt01 cib: [1462]: WARN: G_SIG_dispatch: Dispatch
function for SIGCHLD was delayed 4250 ms (> 100 ms) before being called
(GSource: 0x6b2f10)
Aug 16 08:56:30 mgmt01 cib: [1462]: info: G_SIG_dispatch: started at
429604086 should have started at 429603661
Aug 16 08:56:30 mgmt01 cib: [1462]: notice: cib_force_exit: Forcing exit!
Aug 16 08:56:30 mgmt01 cib: [1462]: info: cib_ha_connection_destroy:
Heartbeat disconnection complete... exiting
Aug 16 08:56:30 mgmt01 cib: [1462]: info: cib_ha_connection_destroy:
Exiting...
Aug 16 08:56:30 mgmt01 cib: [1462]: info: crm_xml_cleanup: Cleaning up
memory from libxml2
Aug 16 08:56:30 mgmt01 cib: [1462]: info: main: Done
Aug 16 08:56:30 mgmt01 corosync[1116]:   [pcmk  ] info: pcmk_ipc_exit:
Client cib (conn=0xca3040, async-conn=0xca3040) left
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: WARN: G_SIG_dispatch:
Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before
being called (GSource: 0x1c52850)
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: info: G_SIG_dispatch: started
at 429604189 should have started at 429604089
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: notice: pcmk_child_exit:
Child process cib exited (pid=1462, rc=0)
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: info: update_node_processes:
Node mgmt01 now has process list: 00000000000000000000000000100002 (was
00000000000000000000000000100102)
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
cib confirmed stopped
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: notice: stop_child: Stopping
stonith-ng: Sent -15 to process 1461
Aug 16 08:56:31 mgmt01 stonith-ng: [1461]: info: crm_update_peer: Node
mgmt01: id=1543767306 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000100002 (new)
Aug 16 08:56:31 mgmt01 stonith-ng: [1461]: info: crm_signal_dispatch:
Invoking handler for signal 15: Terminated
Aug 16 08:56:31 mgmt01 stonith-ng: [1461]: info: stonith_shutdown:
Terminating with  0 clients
Aug 16 08:56:31 mgmt01 corosync[1116]:   [pcmk  ] info: pcmk_ipc_exit:
Client stonith-ng (conn=0xc989a0, async-conn=0xc989a0) left
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: notice: pcmk_child_exit:
Child process stonith-ng exited (pid=1461, rc=0)
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: info: update_node_processes:
Node mgmt01 now has process list: 00000000000000000000000000000002 (was
00000000000000000000000000100002)
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
stonith-ng confirmed stopped
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: notice: pcmk_shutdown_worker:
Shutdown complete
Aug 16 08:56:31 mgmt01 pacemakerd: [1457]: info: main: Exiting pacemakerd

This is 100% reproducible in my setup, so I can take hb_report if this
is needed.

Thank you,
Vladislav