[Pacemaker] problem: stonith executes stonith command remote on the dead host and not local

Thu Mar 29 06:24:21 UTC 2012

On Tue, Mar 6, 2012 at 5:12 AM, Thomas Boernert <tb at tbits.net> wrote:
> Hi again,
>
> Should I open a bug report about this issue?

Which distro is this?  Any chance to try 1.1.7?
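
For reference, the dc-version string in the configuration quoted below already carries a version and a package release tag. A minimal Python sketch for splitting it apart follows. The function name parse_dc_version and the three-field layout (version, release, commit hash) are assumptions made for this illustration, not a Pacemaker API. The ".el6" tag conventionally marks an Enterprise Linux 6 package build, which hints at the distro but does not confirm it.

```python
import re

# dc-version value copied from the quoted configuration.
DC_VERSION = "1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558"


def parse_dc_version(value):
    """Split '<version>-<release>-<hash>' into its parts (assumed layout)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)-([^-]+)-([0-9a-f]+)", value)
    if match is None:
        raise ValueError(f"unexpected dc-version format: {value!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return {"version": version, "release": match.group(2), "commit": match.group(3)}


info = parse_dc_version(DC_VERSION)
print(info["version"], info["release"])
# Tuple comparison orders versions component by component.
print("older than 1.1.7:", info["version"] < (1, 1, 7))
```

The comparison only looks at the upstream version number; a distro package may carry backported fixes that the version alone does not show.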

> Thanks
>
> Thomas
>
> Thomas Börnert wrote on 02.03.2012 at 12:06:
>
>
>> Hi List,
>>
>> My problem is that stonith tries to execute the fence command on the remote
>> (dead) host and not on the local machine :-(. This ends in a timeout.
>>
>> Some facts:
>> - 2-node cluster with 2 Dell servers
>> - each server has its own DRAC card
>> - pacemaker 1.1.6
>> - heartbeat 3.0.4
>> - corosync 1.4.1
>>
>> node1 should fence node2 if node2 is dead and
>> node2 should fence node1 if node1 is dead.
>>
>> It works fine manually with the stonith script
>> fence_drac5 ....
>>
>> my config
>> <---------------------------------- snip -------------------------------->
>> node node1 \
>>     attributes standby="off"
>> node node2 \
>>     attributes standby="off"
>> primitive httpd ocf:heartbeat:apache \
>>     params configfile="/etc/httpd/conf/httpd.conf" port="80" \
>>     op start interval="0" timeout="60s" \
>>     op monitor interval="5s" timeout="20s" \
>>     op stop interval="0" timeout="60s"
>> primitive node1-stonith stonith:fence_drac5 \
>>     params ipaddr="192.168.1.101" login="root" passwd="1234" action="reboot" secure="true" cmd_prompt="admin1->" power_wait="300" pcmk_host_list="node1"
>> primitive node2-stonith stonith:fence_drac5 \
>>     params ipaddr="192.168.1.102" login="root" passwd="1234" action="reboot" secure="true" cmd_prompt="admin1->" power_wait="300" pcmk_host_list="node2"
>> primitive nodeIP ocf:heartbeat:IPaddr2 \
>>     op monitor interval="60" timeout="20" \
>>     params ip="192.168.1.10" cidr_netmask="24" nic="eth0:0" broadcast="192.168.1.255"
>> primitive nodeIParp ocf:heartbeat:SendArp \
>>     params ip="192.168.1.10" nic="eth0:0"
>> group WebServices nodeIP nodeIParp httpd
>> location node1-stonith-log node1-stonith -inf: node1
>> location node2-stonith-log node2-stonith -inf: node2
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
>>     cluster-infrastructure="openais" \
>>     expected-quorum-votes="2" \
>>     stonith-enabled="true" \
>>     no-quorum-policy="ignore" \
>>     last-lrm-refresh="1330685786"
>> <---------------------------------- snip -------------------------------->
>>
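
The two location constraints in the configuration above keep each fencing device off the node it is meant to fence. A small Python sketch of that check follows, using plain dictionaries transcribed from the configuration. The names DEVICES, BANS, NODES and fencers_for are made up for this illustration and are not part of any Pacemaker tool.

```python
# Target node(s) per fencing device, taken from pcmk_host_list.
DEVICES = {"node1-stonith": ["node1"], "node2-stonith": ["node2"]}
# Nodes each device may not run on, taken from the -inf location constraints.
BANS = {"node1-stonith": {"node1"}, "node2-stonith": {"node2"}}
NODES = ["node1", "node2"]


def fencers_for(target):
    """Return (device, node) pairs that could fence `target` from another node."""
    pairs = []
    for device, targets in DEVICES.items():
        if target not in targets:
            continue  # this device does not cover the target
        for node in NODES:
            if node != target and node not in BANS.get(device, set()):
                pairs.append((device, node))
    return pairs


for target in NODES:
    print(target, "can be fenced by", fencers_for(target))
```

Under these assumptions each node has exactly one candidate: node1-stonith running on node2, and node2-stonith running on node1, which matches the intent stated in the quoted message.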
>> [root at node2 ~]# stonith_admin -l node1
>>  node1-stonith
>> 1 devices found
>>
>> It seems OK.
>>
>> Now I try:
>>
>> [root at node2 ~]# stonith_admin -V -F node1
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: main: Create
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_command
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: get_stonith_token: Obtained registration token: 6258828b-4b19-472f-9256-8da36fe87962
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_callback
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: get_stonith_token: Obtained registration token: 6266ebb8-2112-4378-a00c-3eaff47c9a9d
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: stonith_api_signon: Connection to STONITH successful
>> stonith_admin[5685]: 2012/03/02_13:00:44 debug: main: Connect: 0
>> Command failed: Operation timed out
>> stonith_admin[5685]: 2012/03/02_13:00:56 debug: stonith_api_signoff: Signing out of the STONITH Service
>> stonith_admin[5685]: 2012/03/02_13:00:56 debug: main: Disconnect: -8
>> stonith_admin[5685]: 2012/03/02_13:00:56 debug: main: Destroy
>>
>> The log on node2 shows:
>>
>> <----------------------------------------------- snip --------------------------------------->
>> Mar  2 13:00:58 node2 crmd: [2665]: info: te_fence_node: Executing reboot fencing operation (21) on node1 (timeout=60000)
>> Mar  2 13:00:58 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 3325df94-8d59-4c00-a37e-be31e79f7503
>> Mar  2 13:00:58 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
>> <----------------------------------------------- snip --------------------------------------->
>>
>> Why is it run remotely, on the dead host?
>>
>> Thanks
>>
>> Thomas
>>
>> the complete log
>>
>> <----------------------------------------------- snip --------------------------------------->
>> Mar  2 13:00:44 node2 stonith_admin: [5685]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
>> Mar  2 13:00:44 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation off for node1: 7d8beca4-1853-44fd-9bb2-4015b080c37b
>> Mar  2 13:00:44 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
>> Mar  2 13:00:46 node2 stonith-ng: [2660]: ERROR: remote_op_query_timeout: Query 561e89af-6f5a-45cb-adc2-45389940f1db for node1 timed out
>> Mar  2 13:00:46 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action reboot (561e89af-6f5a-45cb-adc2-45389940f1db) for node1 timed out
>> Mar  2 13:00:46 node2 stonith-ng: [2660]: info: remote_op_done: Notifing clients of 561e89af-6f5a-45cb-adc2-45389940f1db (reboot of node1 from 8231841e-3537-44a9-8870-899d0d846c42 by (null)): 0, rc=-8
>> Mar  2 13:00:46 node2 stonith-ng: [2660]: info: stonith_notify_client: Sending st_fence-notification to client 2665/ff16ec78-3634-444c-88a6-275ce79eec6b
>> Mar  2 13:00:46 node2 crmd: [2665]: info: tengine_stonith_callback: StonithOp
>> Mar  2 13:00:46 node2 crmd: [2665]: info: tengine_stonith_callback: Stonith operation 798/21:815:0:d274c31a-571b-4e22-b453-1c151a8871b1: Operation timed out (-8)
>> Mar  2 13:00:46 node2 crmd: [2665]: ERROR: tengine_stonith_callback: Stonith of node1 failed (-8)... aborting transition.
>> Mar  2 13:00:46 node2 crmd: [2665]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
>> Mar  2 13:00:46 node2 crmd: [2665]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
>> Mar  2 13:00:46 node2 crmd: [2665]: info: update_abort_priority: Abort action done superceeded by restart
>> Mar  2 13:00:46 node2 crmd: [2665]: ERROR: tengine_stonith_notify: Peer node1 could not be terminated (reboot) by  for node2 (ref=561e89af-6f5a-45cb-adc2-45389940f1db): Operation timed out
>> Mar  2 13:00:46 node2 crmd: [2665]: info: run_graph: ====================================================
>> Mar  2 13:00:46 node2 crmd: [2665]: notice: run_graph: Transition 815 (Complete=3, Pending=0, Fired=0, Skipped=14, Incomplete=0, Source=/var/lib/pengine/pe-warn-39.bz2): Stopped
>> Mar  2 13:00:46 node2 crmd: [2665]: info: te_graph_trigger: Transition 815 is now complete
>> Mar  2 13:00:46 node2 crmd: [2665]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Mar  2 13:00:46 node2 crmd: [2665]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Mar  2 13:00:46 node2 crmd: [2665]: info: do_pe_invoke: Query 1271: Requesting the current CIB: S_POLICY_ENGINE
>> Mar  2 13:00:46 node2 crmd: [2665]: info: do_pe_invoke_callback: Invoking the PE: query=1271, ref=pe_calc-dc-1330689646-1028, seq=404, quorate=0
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: unpack_config: On loss of CCM Quorum: Ignore
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: determine_online_status: Node node1 is unclean
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIParp_last_failure_0 found resource nodeIParp active on node2
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: unpack_rsc_op: Operation node1-stonith_last_failure_0 found resource node1-stonith active on node2
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIParp_last_failure_0 found resource nodeIParp active on node1
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIP_last_failure_0 found resource nodeIP active on node1
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: unpack_rsc_op: Operation httpd_last_failure_0 found resource httpd active on node1
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Action nodeIP_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: RecurringOp:  Start recurring monitor (60s) for nodeIP on node2
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Action nodeIParp_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Action httpd_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: RecurringOp:  Start recurring monitor (5s) for httpd on node2
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Action node2-stonith_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: stage6: Scheduling Node node1 for STONITH
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: LogActions: Move    nodeIP#011(Started node1 -> node2)
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: LogActions: Move    nodeIParp#011(Started node1 -> node2)
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: LogActions: Move    httpd#011(Started node1 -> node2)
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: LogActions: Leave   node1-stonith#011(Started node2)
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: LogActions: Stop    node2-stonith#011(node1)
>> Mar  2 13:00:46 node2 pengine: [2664]: WARN: process_pe_message: Transition 816: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-39.bz2
>> Mar  2 13:00:46 node2 pengine: [2664]: notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
>> Mar  2 13:00:46 node2 crmd: [2665]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Mar  2 13:00:46 node2 crmd: [2665]: info: unpack_graph: Unpacked transition 816: 17 actions in 17 synapses
>> Mar  2 13:00:46 node2 crmd: [2665]: info: do_te_invoke: Processing graph 816 (ref=pe_calc-dc-1330689646-1028) derived from /var/lib/pengine/pe-warn-39.bz2
>> Mar  2 13:00:46 node2 crmd: [2665]: info: te_pseudo_action: Pseudo action 18 fired and confirmed
>> Mar  2 13:00:46 node2 crmd: [2665]: info: te_pseudo_action: Pseudo action 19 fired and confirmed
>> Mar  2 13:00:46 node2 crmd: [2665]: info: te_fence_node: Executing reboot fencing operation (21) on node1 (timeout=60000)
>> Mar  2 13:00:46 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 07f10c9c-b33e-41b4-8781-fb32eb850bd2
>> Mar  2 13:00:46 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
>> Mar  2 13:00:52 node2 stonith-ng: [2660]: ERROR: remote_op_query_timeout: Query 07f10c9c-b33e-41b4-8781-fb32eb850bd2 for node1 timed out
>> Mar  2 13:00:52 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action reboot (07f10c9c-b33e-41b4-8781-fb32eb850bd2) for node1 timed out
>> Mar  2 13:00:52 node2 stonith-ng: [2660]: info: remote_op_done: Notifing clients of 07f10c9c-b33e-41b4-8781-fb32eb850bd2 (reboot of node1 from 8231841e-3537-44a9-8870-899d0d846c42 by (null)): 0, rc=-8
>> Mar  2 13:00:52 node2 stonith-ng: [2660]: info: stonith_notify_client: Sending st_fence-notification to client 2665/ff16ec78-3634-444c-88a6-275ce79eec6b
>> Mar  2 13:00:52 node2 crmd: [2665]: info: tengine_stonith_callback: StonithOp
>> Mar  2 13:00:52 node2 crmd: [2665]: info: tengine_stonith_callback: Stonith operation 799/21:816:0:d274c31a-571b-4e22-b453-1c151a8871b1: Operation timed out (-8)
>> Mar  2 13:00:52 node2 crmd: [2665]: ERROR: tengine_stonith_callback: Stonith of node1 failed (-8)... aborting transition.
>> Mar  2 13:00:52 node2 crmd: [2665]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
>> Mar  2 13:00:52 node2 crmd: [2665]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
>> Mar  2 13:00:52 node2 crmd: [2665]: info: update_abort_priority: Abort action done superceeded by restart
>> Mar  2 13:00:52 node2 crmd: [2665]: ERROR: tengine_stonith_notify: Peer node1 could not be terminated (reboot) by  for node2 (ref=07f10c9c-b33e-41b4-8781-fb32eb850bd2): Operation timed out
>> Mar  2 13:00:52 node2 crmd: [2665]: info: run_graph: ====================================================
>> Mar  2 13:00:52 node2 crmd: [2665]: notice: run_graph: Transition 816 (Complete=3, Pending=0, Fired=0, Skipped=14, Incomplete=0, Source=/var/lib/pengine/pe-warn-39.bz2): Stopped
>> Mar  2 13:00:52 node2 crmd: [2665]: info: te_graph_trigger: Transition 816 is now complete
>> Mar  2 13:00:52 node2 crmd: [2665]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Mar  2 13:00:52 node2 crmd: [2665]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Mar  2 13:00:52 node2 crmd: [2665]: info: do_pe_invoke: Query 1272: Requesting the current CIB: S_POLICY_ENGINE
>> Mar  2 13:00:52 node2 crmd: [2665]: info: do_pe_invoke_callback: Invoking the PE: query=1272, ref=pe_calc-dc-1330689652-1029, seq=404, quorate=0
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: unpack_config: On loss of CCM Quorum: Ignore
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: determine_online_status: Node node1 is unclean
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIParp_last_failure_0 found resource nodeIParp active on node2
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: unpack_rsc_op: Operation node1-stonith_last_failure_0 found resource node1-stonith active on node2
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIParp_last_failure_0 found resource nodeIParp active on node1
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIP_last_failure_0 found resource nodeIP active on node1
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: unpack_rsc_op: Operation httpd_last_failure_0 found resource httpd active on node1
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Action nodeIP_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: RecurringOp:  Start recurring monitor (60s) for nodeIP on node2
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Action nodeIParp_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Action httpd_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: RecurringOp:  Start recurring monitor (5s) for httpd on node2
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Action node2-stonith_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: stage6: Scheduling Node node1 for STONITH
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: LogActions: Move    nodeIP#011(Started node1 -> node2)
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: LogActions: Move    nodeIParp#011(Started node1 -> node2)
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: LogActions: Move    httpd#011(Started node1 -> node2)
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: LogActions: Leave   node1-stonith#011(Started node2)
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: LogActions: Stop    node2-stonith#011(node1)
>> Mar  2 13:00:52 node2 pengine: [2664]: WARN: process_pe_message: Transition 817: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-39.bz2
>> Mar  2 13:00:52 node2 pengine: [2664]: notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
>> Mar  2 13:00:52 node2 crmd: [2665]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Mar  2 13:00:52 node2 crmd: [2665]: info: unpack_graph: Unpacked transition 817: 17 actions in 17 synapses
>> Mar  2 13:00:52 node2 crmd: [2665]: info: do_te_invoke: Processing graph 817 (ref=pe_calc-dc-1330689652-1029) derived from /var/lib/pengine/pe-warn-39.bz2
>> Mar  2 13:00:52 node2 crmd: [2665]: info: te_pseudo_action: Pseudo action 18 fired and confirmed
>> Mar  2 13:00:52 node2 crmd: [2665]: info: te_pseudo_action: Pseudo action 19 fired and confirmed
>> Mar  2 13:00:52 node2 crmd: [2665]: info: te_fence_node: Executing reboot fencing operation (21) on node1 (timeout=60000)
>> Mar  2 13:00:52 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: a4ebce93-0eee-43dd-b610-0115e62b0285
>> Mar  2 13:00:52 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
>> Mar  2 13:00:56 node2 stonith-ng: [2660]: ERROR: remote_op_query_timeout: Query 7d8beca4-1853-44fd-9bb2-4015b080c37b for node1 timed out
>> Mar  2 13:00:56 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action off (7d8beca4-1853-44fd-9bb2-4015b080c37b) for node1 timed out
>> Mar  2 13:00:56 node2 stonith-ng: [2660]: info: remote_op_done: Notifing clients of 7d8beca4-1853-44fd-9bb2-4015b080c37b (off of node1 from 6258828b-4b19-472f-9256-8da36fe87962 by (null)): 0, rc=-8
>> Mar  2 13:00:56 node2 stonith-ng: [2660]: info: stonith_notify_client: Sending st_fence-notification to client 2665/ff16ec78-3634-444c-88a6-275ce79eec6b
>> Mar  2 13:00:56 node2 crmd: [2665]: ERROR: tengine_stonith_notify: Peer node1 could not be terminated (off) by  for node2 (ref=7d8beca4-1853-44fd-9bb2-4015b080c37b): Operation timed out
>> Mar  2 13:00:58 node2 stonith-ng: [2660]: ERROR: remote_op_query_timeout: Query a4ebce93-0eee-43dd-b610-0115e62b0285 for node1 timed out
>> Mar  2 13:00:58 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action reboot (a4ebce93-0eee-43dd-b610-0115e62b0285) for node1 timed out
>> Mar  2 13:00:58 node2 stonith-ng: [2660]: info: remote_op_done: Notifing clients of a4ebce93-0eee-43dd-b610-0115e62b0285 (reboot of node1 from 8231841e-3537-44a9-8870-899d0d846c42 by (null)): 0, rc=-8
>> Mar  2 13:00:58 node2 stonith-ng: [2660]: info: stonith_notify_client: Sending st_fence-notification to client 2665/ff16ec78-3634-444c-88a6-275ce79eec6b
>> Mar  2 13:00:58 node2 crmd: [2665]: info: tengine_stonith_callback: StonithOp state="0" st_target="node1" st_op="reboot" />
>> Mar  2 13:00:58 node2 crmd: [2665]: info: tengine_stonith_callback: Stonith operation 800/21:817:0:d274c31a-571b-4e22-b453-1c151a8871b1: Operation timed out (-8)
>> Mar  2 13:00:58 node2 crmd: [2665]: ERROR: tengine_stonith_callback: Stonith of node1 failed (-8)... aborting transition.
>> Mar  2 13:00:58 node2 crmd: [2665]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
>> Mar  2 13:00:58 node2 crmd: [2665]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
>> Mar  2 13:00:58 node2 crmd: [2665]: info: update_abort_priority: Abort action done superceeded by restart
>> Mar  2 13:00:58 node2 crmd: [2665]: ERROR: tengine_stonith_notify: Peer node1 could not be terminated (reboot) by  for node2 (ref=a4ebce93-0eee-43dd-b610-0115e62b0285): Operation timed out
>> Mar  2 13:00:58 node2 crmd: [2665]: info: run_graph: ====================================================
>> Mar  2 13:00:58 node2 crmd: [2665]: notice: run_graph: Transition 817 (Complete=3, Pending=0, Fired=0, Skipped=14, Incomplete=0, Source=/var/lib/pengine/pe-warn-39.bz2): Stopped
>> Mar  2 13:00:58 node2 crmd: [2665]: info: te_graph_trigger: Transition 817 is now complete
>> Mar  2 13:00:58 node2 crmd: [2665]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Mar  2 13:00:58 node2 crmd: [2665]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Mar  2 13:00:58 node2 crmd: [2665]: info: do_pe_invoke: Query 1273: Requesting the current CIB: S_POLICY_ENGINE
>> Mar  2 13:00:58 node2 crmd: [2665]: info: do_pe_invoke_callback: Invoking the PE: query=1273, ref=pe_calc-dc-1330689658-1030, seq=404, quorate=0
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: unpack_config: On loss of CCM Quorum: Ignore
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: determine_online_status: Node node1 is unclean
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIParp_last_failure_0 found resource nodeIParp active on node2
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: unpack_rsc_op: Operation node1-stonith_last_failure_0 found resource node1-stonith active on node2
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIParp_last_failure_0 found resource nodeIParp active on node1
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: unpack_rsc_op: Operation nodeIP_last_failure_0 found resource nodeIP active on node1
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: unpack_rsc_op: Operation httpd_last_failure_0 found resource httpd active on node1
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Action nodeIP_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: RecurringOp:  Start recurring monitor (60s) for nodeIP on node2
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Action nodeIParp_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Action httpd_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: RecurringOp:  Start recurring monitor (5s) for httpd on node2
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Action node2-stonith_stop_0 on node1 is unrunnable (offline)
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: custom_action: Marking node node1 unclean
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: stage6: Scheduling Node node1 for STONITH
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: LogActions: Move    nodeIP#011(Started node1 -> node2)
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: LogActions: Move    nodeIParp#011(Started node1 -> node2)
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: LogActions: Move    httpd#011(Started node1 -> node2)
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: LogActions: Leave   node1-stonith#011(Started node2)
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: LogActions: Stop    node2-stonith#011(node1)
>> Mar  2 13:00:58 node2 pengine: [2664]: WARN: process_pe_message: Transition 818: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-39.bz2
>> Mar  2 13:00:58 node2 pengine: [2664]: notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
>> Mar  2 13:00:58 node2 crmd: [2665]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Mar  2 13:00:58 node2 crmd: [2665]: info: unpack_graph: Unpacked transition 818: 17 actions in 17 synapses
>> Mar  2 13:00:58 node2 crmd: [2665]: info: do_te_invoke: Processing graph 818 (ref=pe_calc-dc-1330689658-1030) derived from /var/lib/pengine/pe-warn-39.bz2
>> Mar  2 13:00:58 node2 crmd: [2665]: info: te_pseudo_action: Pseudo action 18 fired and confirmed
>> Mar  2 13:00:58 node2 crmd: [2665]: info: te_pseudo_action: Pseudo action 19 fired and confirmed
>> Mar  2 13:00:58 node2 crmd: [2665]: info: te_fence_node: Executing reboot fencing operation (21) on node1 (timeout=60000)
>> Mar  2 13:00:58 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 3325df94-8d59-4c00-a37e-be31e79f7503
>> Mar  2 13:00:58 node2 stonith-ng: [2638]: info: stonith_command: Processed st_query from node2: rc=0
>>
>>
>> <----------------------------------------------- snip --------------------------------------->
>>
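
To see how long each fencing request waited before giving up, the initiate_remote_stonith_op and remote_op_timeout entries in the log above can be paired by operation ID. The following Python sketch does that for four entries taken from the log (written here as single lines). The regular expressions, the helper names and the year 2012 prepended to the timestamps are assumptions of this example, chosen to fit the entries shown here rather than Pacemaker logs in general.

```python
import re
from datetime import datetime

LOG = """\
Mar  2 13:00:44 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation off for node1: 7d8beca4-1853-44fd-9bb2-4015b080c37b
Mar  2 13:00:46 node2 stonith-ng: [2660]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 07f10c9c-b33e-41b4-8781-fb32eb850bd2
Mar  2 13:00:52 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action reboot (07f10c9c-b33e-41b4-8781-fb32eb850bd2) for node1 timed out
Mar  2 13:00:56 node2 stonith-ng: [2660]: ERROR: remote_op_timeout: Action off (7d8beca4-1853-44fd-9bb2-4015b080c37b) for node1 timed out
"""

STAMP = r"(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})"
UUID = r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
START = re.compile(STAMP + r" .*Initiating remote operation (\w+) for \S+: " + UUID)
TIMEOUT = re.compile(STAMP + r" .*remote_op_timeout: Action \w+ \(" + UUID + r"\)")


def parse_stamp(text):
    # Syslog stamps carry no year; 2012 is assumed from the date of the mail.
    return datetime.strptime("2012 " + " ".join(text.split()), "%Y %b %d %H:%M:%S")


def timeout_durations(log):
    """Map operation ID -> (action, seconds between start and timeout)."""
    started, result = {}, {}
    for line in log.splitlines():
        if (m := START.match(line)):
            started[m.group(3)] = (m.group(2), parse_stamp(m.group(1)))
        elif (m := TIMEOUT.match(line)):
            op_id = m.group(2)
            if op_id in started:
                action, begin = started[op_id]
                elapsed = (parse_stamp(m.group(1)) - begin).total_seconds()
                result[op_id] = (action, elapsed)
    return result


durations = timeout_durations(LOG)
for op_id, (action, seconds) in durations.items():
    print(f"{op_id[:8]} {action}: timed out after {seconds:.0f}s")
```

On these entries the reboot request gives up after 6 seconds and the manual off request after 12, both well below the 60000 ms timeout that crmd logs for the operation. The remote_op_query_timeout entries suggest the request stalls while stonith-ng is still querying for a usable device, though the log alone does not say why.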
>>
>
>
> --
> Best regards
>
> Thomas Börnert
> Managing Partner
> Senior IT Consultant & Manager
> BSI-licensed auditor for ISO 27001
>
> TBits.net GmbH, Seeweg 6, 73553 Alfdorf, Germany
> phone: +49 (0)7172 18391-0, fax: +49 (0)7172 18391-99
> Key fingerprint = 8602 2EF5 78FD 3C04 B148  2506 5D4F 6A49 E4E2 9D15
> Managing Director: Thomas Börnert, Stuttgart District Court HRB 281836
> VAT ID: DE 207 740 994
>
>