[Pacemaker] Reply: Re: stonith sbd problem

philipp.achmueller at arz.at
Wed Aug 11 05:48:17 EDT 2010


I removed the clone and set the global cluster property stonith-timeout.
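
For anyone reproducing this, a minimal sketch of setting that property with the crm shell. The 120s value is only an assumed placeholder, not the value used in this test, and the run/set_stonith_timeout helpers are hypothetical names. By default the command is only printed (DRY_RUN=1), so nothing is changed on the cluster by accident.

```shell
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set the cluster-wide stonith-timeout property via the crm shell.
# $1 is the timeout value, e.g. 120s (assumed placeholder).
set_stonith_timeout() {
    run crm configure property "stonith-timeout=$1"
}

set_stonith_timeout 120s
```

Running it with DRY_RUN=0 would apply the change. Which value is appropriate probably depends on the sbd timeouts of the device rather than on how long the nodes take to boot.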

The nodes need about 3-5 minutes to start up after they get "shot".

I did some more tests and found that if the node running the sbd_fence 
resource gets "shot", the remaining node sees the stonith resource as 
online on both nodes (although one of the cluster nodes has been stonithed).

crm_mon:
sbd_fence       (stonith:external/sbd): Started [ lnx0047a lnx0047b ]

Looking through /var/log/messages:

Aug 11 11:24:25 lnx0047a pengine: [20618]: info: determine_online_status: Node lnx0047a is online
Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: pe_fence_node: Node lnx0047b will be fenced because it is un-expectedly down
Aug 11 11:24:25 lnx0047a pengine: [20618]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=pending, expected=member
Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: determine_online_status: Node lnx0047b is unclean
Aug 11 11:24:25 lnx0047a pengine: [20618]: ERROR: native_add_running: Resource stonith::external/sbd:sbd_fence appears to be active on 2 nodes
...
Aug 11 11:24:26 lnx0047a sbd: [22315]: info: lnx0047b owns slot 0
Aug 11 11:24:26 lnx0047a sbd: [22315]: info: Writing reset to node slot lnx0047b
Aug 11 11:24:26 lnx0047a sbd: [22318]: info: lnx0047b owns slot 0
Aug 11 11:24:26 lnx0047a sbd: [22318]: info: Writing reset to node slot lnx0047b
Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR: remote_op_query_timeout: Query 37724c6f-191f-407f-ad24-68028d2b6573 for lnx0047b timed out
Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR: remote_op_timeout: Action reboot (37724c6f-191f-407f-ad24-68028d2b6573) for lnx0047b timed out
Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: remote_op_done: Notifing clients of 37724c6f-191f-407f-ad24-68028d2b6573 (reboot of lnx0047b from 11ea7c1e-6034-48e1-b616-a10c92e53a1d by (null)): 0, rc=-7
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: log_data_element: tengine_stonith_callback: StonithOp <remote-op state="0" st_target="lnx0047b" st_op="reboot" />
Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: stonith_notify_client: Sending st_fence-notification to client 20619/15310d8c-6640-4799-8655-10d125b467bd
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_callback: Stonith operation 75/17:74:0:40ea951f-0c79-43af-8adb-adf8d6defe63: Operation timed out (-7)
Aug 11 11:24:28 lnx0047a crmd: [20619]: ERROR: tengine_stonith_callback: Stonith of lnx0047b failed (-7)... aborting transition.
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: abort_transition_graph: tengine_stonith_callback:402 - Triggered transition abort (complete=0) : Stonith failed
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort action done superceeded by restart
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_notify: Peer lnx0047b was terminated (reboot) by (null) for lnx0047a (ref=37724c6f-191f-407f-ad24-68028d2b6573): Operation timed out
Aug 11 11:24:28 lnx0047a crmd: [20619]: info: run_graph: ====================================================
Aug 11 11:24:28 lnx0047a crmd: [20619]: notice: run_graph: Transition 74 (Complete=5, Pending=0, Fired=0, Skipped=5, Incomplete=1, Source=/var/lib/pengine/pe-error-942.bz2): Stopped
...

These entries repeat indefinitely until I manually stop and start the 
sbd_fence resource.
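
To get a rough idea of how often the failed fencing attempt repeats, the matching crmd messages can be counted in the syslog file. This is only a sketch: count_stonith_failures is a hypothetical helper name, and the default log path is assumed to be /var/log/messages as above.

```shell
# Count "Stonith of <node> failed" messages in a syslog file.
# $1 = name of the node being fenced
# $2 = log file (assumed default: /var/log/messages)
count_stonith_failures() {
    node="$1"
    logfile="${2:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c "Stonith of $node failed" "$logfile"
}
```

For example, count_stonith_failures lnx0047b would print the number of failed attempts against lnx0047b logged so far. A count that keeps growing would be consistent with the transition being retried in a loop.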

------------
I am still not sure why resource lnx0101a will not start on the remaining node... 
----------------