[ClusterLabs] GlusterFS all apache instances stopped after failover

Thu May 7 13:05:48 UTC 2015

Comments inline ...

----- Original Message -----
> Hi,
> 
> have a config with 2 nodes where glusterfs is being mounted to root dir
> of apache. After producing a kernel panic by echo c >
> /proc/sysrq-trigger on node2, node 2 is turned off by stonith. ping
> should work further on node1 but it stops since apache is stopped on
> node1 as well. So why is apache stopped on node1, has someone any ideas.
> Below the logs
> following setup: pacemaker 1.1.7, corosync 1.4
> 
> 
> 
> Online: [ vm-1 vm-2 ]
> 
>   Clone Set: cl_gluster_mnt [p_gluster_mnt]
>       Started: [ vm-1 vm-2 ]
>   Clone Set: cl_apache [p_apache]
>       Started: [ vm-1 vm-2 ]
>   Clone Set: cl_IP [IP] (unique)
>       IP:0       (ocf::heartbeat:IPaddr2):       Started vm-1
>       IP:1       (ocf::heartbeat:IPaddr2):       Started vm-2
> p_fence_N1      (stonith:external/libvirt):     Started vm-2
> p_fence_N2      (stonith:external/libvirt):     Started vm-1
> 
> root at vm-1:~# crm configure show
> node vm-1 \
>      attributes standby="off"
> node vm-2 \
>      attributes standby="off"
> primitive IP ocf:heartbeat:IPaddr2 \
>      params ip="192.168.122.200" nic="eth0"
> clusterip_hash="sourceip-sourceport" \
>      op monitor interval="10s"
> primitive p_apache ocf:heartbeat:apache \
>      params configfile="/etc/apache2/apache2.conf"
> statusurl="http://localhost/server-status" \
>      op monitor interval="60" timeout="20" \
>      op start interval="0" timeout="40s" start-delay="0" \
>      meta is-managed="true"
> primitive p_fence_N1 stonith:external/libvirt \
>      params hostlist="vm-1:N1"
> hypervisor_uri="qemu+tcp://192.168.122.1/system" pcmk_reboot_action="off" \
>      op monitor interval="60" \
>      meta target-role="Started"
> primitive p_fence_N2 stonith:external/libvirt \
>      params hostlist="vm-2:N2"
> hypervisor_uri="qemu+tcp://192.168.122.1/system" pcmk_reboot_action="off" \
>      op monitor interval="60"
> primitive p_gluster_mnt ocf:heartbeat:Filesystem \
>      params device="localhost:/gvolrep" directory="/var/www/html"
> fstype="glusterfs" \
>      op monitor interval="10"
> clone cl_IP IP \
>      meta globally-unique="true" clone-max="2" clone-node-max="2" \
>      params resource-stickiness="0"
> clone cl_apache p_apache \
>      meta target-role="Started"
> clone cl_gluster_mnt p_gluster_mnt \
>      meta target-role="Started"
> location l_fence_N1 p_fence_N1 -inf: vm-1
> location l_fence_N2 p_fence_N2 -inf: vm-2
> colocation c_apache_gluster inf: cl_IP cl_gluster_mnt
> colocation c_ip_apache inf: cl_apache cl_IP
> order o_apache inf: cl_gluster_mnt cl_IP cl_apache
> property $id="cib-bootstrap-options" \
>      dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>      cluster-infrastructure="openais" \
>      expected-quorum-votes="2" \
>      stonith-enabled="true" \
>      no-quorum-policy="ignore" \
>      last-lrm-refresh="1430996556"
> rsc_defaults $id="rsc-options" \
>      resource-stickiness="100"
> op_defaults $id="op-options" \
>      timeout="240s"
> 
> ############################
> root at vm-1:~# tail -f /var/log/syslog
> May  7 13:19:58 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
> action 27: monitor p_apache:1_monitor_60000 on vm-2
> May  7 13:19:58 vm-1 lrmd: [3036]: info: operation monitor[54] on
> p_apache:0 for client 3039: pid 27310 exited with return code 0
> May  7 13:19:58 vm-1 crmd: [3039]: info: process_lrm_event: LRM
> operation p_apache:0_monitor_60000 (call=54, rc=0, cib-update=322,
> confirmed=false) ok
> May  7 13:19:58 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 30
> (Complete=34, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-1906.bz2): Complete
> May  7 13:19:58 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> May  7 13:20:12 vm-1 stonith-ng: [3035]: info: stonith_command:
> Processed st_execute from lrmd: rc=-1
> May  7 13:20:12 vm-1 external/libvirt[27429]: [27444]: notice:
> qemu+tcp://192.168.122.1/system: Running hypervisor: QEMU 2.0.0
> May  7 13:20:13 vm-1 stonith: [27422]: info: external/libvirt device OK.
> May  7 13:20:13 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: Performing: stonith -t external/libvirt -S
> May  7 13:20:13 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: success:  0
> May  7 13:21:08 vm-1 corosync[2971]:   [TOTEM ] A processor failed,
> forming new configuration.
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] notice:
> pcmk_peer_update: Transitional membership event on ring 7024: memb=1,
> new=0, lost=1
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: pcmk_peer_update:
> memb: vm-1 2138745024
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: pcmk_peer_update:
> lost: vm-2 930785472
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] notice:
> pcmk_peer_update: Stable membership event on ring 7024: memb=1, new=0,
> lost=0
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: pcmk_peer_update:
> MEMB: vm-1 2138745024
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info:
> ais_mark_unseen_peer_dead: Node vm-2 was not seen in the previous transition
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: update_member:
> Node 930785472/vm-2 is now: lost
> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info:
> send_member_notification: Sending membership update 7024 to 2 children
> May  7 13:21:12 vm-1 cib: [3034]: notice: ais_dispatch_message:
> Membership 7024: quorum lost
> May  7 13:21:12 vm-1 cib: [3034]: info: crm_update_peer: Node vm-2:
> id=930785472 state=lost (new) addr=r(0) ip(192.168.122.183)  votes=1
> born=7020 seen=7020 proc=00000000000000000000000000111312
> May  7 13:21:12 vm-1 crmd: [3039]: notice: ais_dispatch_message:
> Membership 7024: quorum lost
> May  7 13:21:12 vm-1 crmd: [3039]: info: ais_status_callback: status:
> vm-2 is now lost (was member)
> May  7 13:21:12 vm-1 corosync[2971]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> May  7 13:21:12 vm-1 crmd: [3039]: info: crm_update_peer: Node vm-2:
> id=930785472 state=lost (new) addr=r(0) ip(192.168.122.183)  votes=1
> born=7020 seen=7020 proc=00000000000000000000000000111312
> May  7 13:21:12 vm-1 corosync[2971]:   [CPG   ] chosen downlist: sender
> r(0) ip(192.168.122.127) ; members(old:2 left:1)
> May  7 13:21:12 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_modify for section nodes (origin=local/crmd/323,
> version=0.502.55): ok (rc=0)
> May  7 13:21:12 vm-1 corosync[2971]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> May  7 13:21:12 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_modify for section cib (origin=local/crmd/325,
> version=0.502.57): ok (rc=0)
> May  7 13:21:12 vm-1 crmd: [3039]: info: crmd_ais_dispatch: Setting
> expected votes to 2
> May  7 13:21:12 vm-1 crmd: [3039]: WARN: match_down_event: No match for
> shutdown action on vm-2
> May  7 13:21:12 vm-1 crmd: [3039]: info: te_update_diff:
> Stonith/shutdown of vm-2 not matched
> May  7 13:21:12 vm-1 crmd: [3039]: info: abort_transition_graph:
> te_update_diff:234 - Triggered transition abort (complete=1,
> tag=node_state, id=vm-2, magic=NA, cib=0.502.56) : Node failure
> May  7 13:21:12 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> May  7 13:21:12 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_modify for section crm_config (origin=local/crmd/327,
> version=0.502.58): ok (rc=0)
> May  7 13:21:12 vm-1 pengine: [3038]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: pe_fence_node: Node vm-2
> will be fenced because it is un-expectedly down
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: determine_online_status:
> Node vm-2 is unclean
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
> p_gluster_mnt:1_stop_0 on vm-2 is unrunnable (offline)
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
> vm-2 unclean
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
> p_apache:1_stop_0 on vm-2 is unrunnable (offline)
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
> vm-2 unclean
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
> IP:1_stop_0 on vm-2 is unrunnable (offline)
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
> vm-2 unclean
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
> p_fence_N1_stop_0 on vm-2 is unrunnable (offline)
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
> vm-2 unclean
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: stage6: Scheduling Node vm-2
> for STONITH
> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Stop
> p_gluster_mnt:1#011(vm-2)
> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Restart
> p_apache:0#011(Started vm-1)
> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Stop
> p_apache:1#011(vm-2)
> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Move
> IP:1#011(Started vm-2 -> vm-1)
> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Stop
> p_fence_N1#011(vm-2)
> May  7 13:21:12 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> May  7 13:21:12 vm-1 crmd: [3039]: info: do_te_invoke: Processing graph
> 31 (ref=pe_calc-dc-1430997672-228) derived from
> /var/lib/pengine/pe-warn-661.bz2
> May  7 13:21:12 vm-1 crmd: [3039]: notice: te_fence_node: Executing
> reboot fencing operation (36) on vm-2 (timeout=60000)
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
> initiate_remote_stonith_op: Initiating remote operation reboot for vm-2:
> 9a25d1de-b42d-44b1-9d49-368f36fa57fd
> May  7 13:21:12 vm-1 pengine: [3038]: WARN: process_pe_message:
> Transition 31: WARNINGs found during PE processing. PEngine Input stored
> in: /var/lib/pengine/pe-warn-661.bz2
> May  7 13:21:12 vm-1 pengine: [3038]: notice: process_pe_message:
> Configuration WARNINGs found during PE processing.  Please run
> "crm_verify -L" to identify issues.

The above is a problem. It means there is an error in your configuration.

> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
> can_fence_host_with_device: Refreshing port list for p_fence_N2
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
> can_fence_host_with_device: p_fence_N2 can fence vm-2: dynamic-list
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: call_remote_stonith:
> Requesting that vm-1 perform op reboot vm-2
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
> can_fence_host_with_device: p_fence_N2 can fence vm-2: dynamic-list
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: stonith_fence: Found 1
> matching devices for 'vm-2'
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: stonith_command:
> Processed st_fence from vm-1: rc=-1
> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: make_args: Substituting
> action 'off' for requested operation 'reboot'
> May  7 13:21:12 vm-1 external/libvirt[27944]: [27957]: notice: Domain N2
> was stopped
> May  7 13:21:13 vm-1 stonith-ng: [3035]: notice: log_operation:
> Operation 'reboot' [27936] (call 0 from
> c5982dd8-7639-49be-a9f9-c8488bea6091) for host 'vm-2' with device
> 'p_fence_N2' returned: 0
> May  7 13:21:13 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: Performing: stonith -t external/libvirt -T off vm-2
> May  7 13:21:13 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: success: vm-2 0
> May  7 13:21:13 vm-1 stonith-ng: [3035]: notice: remote_op_done:
> Operation reboot of vm-2 by vm-1 for
> vm-1[c5982dd8-7639-49be-a9f9-c8488bea6091]: OK
> May  7 13:21:13 vm-1 crmd: [3039]: info: tengine_stonith_callback:
> StonithOp <st-reply st_origin="stonith_construct_async_reply"
> t="stonith-ng" st_op="reboot"
> st_remote_op="9a25d1de-b42d-44b1-9d49-368f36fa57fd"
> st_clientid="c5982dd8-7639-49be-a9f9-c8488bea6091" st_target="vm-2"
> st_device_action="st_fence" st_callid="0" st_callopt="0" st_rc="0"
> st_output="Performing: stonith -t external/libvirt -T off
> vm-2#012success: vm-2 0#012" src="vm-1" seq="16" state="2" />
> May  7 13:21:13 vm-1 crmd: [3039]: info: erase_status_tag: Deleting
> xpath: //node_state[@uname='vm-2']/lrm
> May  7 13:21:13 vm-1 crmd: [3039]: info: erase_status_tag: Deleting
> xpath: //node_state[@uname='vm-2']/transient_attributes
> May  7 13:21:13 vm-1 crmd: [3039]: notice: crmd_peer_update: Status
> update: Client vm-2/crmd now has status [offline] (DC=true)
> May  7 13:21:13 vm-1 crmd: [3039]: notice: tengine_stonith_notify: Peer
> vm-2 was terminated (reboot) by vm-1 for vm-1: OK
> (ref=9a25d1de-b42d-44b1-9d49-368f36fa57fd)
> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May  7 13:21:13 vm-1 crmd: [3039]: info: abort_transition_graph:
> do_te_invoke:169 - Triggered transition abort (complete=0) : Peer Halt
> May  7 13:21:13 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 31
> (Complete=3, Pending=0, Fired=0, Skipped=15, Incomplete=5,
> Source=/var/lib/pengine/pe-warn-661.bz2): Stopped
> May  7 13:21:13 vm-1 crmd: [3039]: info: abort_transition_graph:
> do_te_invoke:169 - Triggered transition abort (complete=1) : Peer Halt
> May  7 13:21:13 vm-1 crmd: [3039]: info: join_make_offer: Making join
> offers based on membership 7024
> May  7 13:21:13 vm-1 crmd: [3039]: info: do_dc_join_offer_all: join-11:
> Waiting on 1 outstanding join acks
> May  7 13:21:13 vm-1 crmd: [3039]: info: update_dc: Set DC to vm-1 (3.0.6)
> May  7 13:21:13 vm-1 crmd: [3039]: info: cib_fencing_updated: Fencing
> update 329 for vm-2: complete
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_delete for section //node_state[@uname='vm-2']/lrm
> (origin=local/crmd/330, version=0.502.60): ok (rc=0)
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_delete for section
> //node_state[@uname='vm-2']/transient_attributes (origin=local/crmd/331,
> version=0.502.61): ok (rc=0)
> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May  7 13:21:13 vm-1 crmd: [3039]: info: do_dc_join_finalize: join-11:
> Syncing the CIB from vm-1 to the rest of the cluster
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_sync for section 'all' (origin=local/crmd/334,
> version=0.502.62): ok (rc=0)
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_modify for section nodes (origin=local/crmd/335,
> version=0.502.63): ok (rc=0)
> May  7 13:21:13 vm-1 crmd: [3039]: info: do_dc_join_ack: join-11:
> Updating node state to member for vm-1
> May  7 13:21:13 vm-1 crmd: [3039]: info: erase_status_tag: Deleting
> xpath: //node_state[@uname='vm-1']/lrm
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_delete for section //node_state[@uname='vm-1']/lrm
> (origin=local/crmd/336, version=0.502.64): ok (rc=0)
> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> May  7 13:21:13 vm-1 crmd: [3039]: info: abort_transition_graph:
> do_te_invoke:162 - Triggered transition abort (complete=1) : Peer Cancelled
> May  7 13:21:13 vm-1 attrd: [3037]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> May  7 13:21:13 vm-1 attrd: [3037]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-p_apache:0 (1430996422)
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_modify for section nodes (origin=local/crmd/338,
> version=0.502.66): ok (rc=0)
> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
> complete: op cib_modify for section cib (origin=local/crmd/340,
> version=0.502.68): ok (rc=0)
> May  7 13:21:13 vm-1 attrd: [3037]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> May  7 13:21:13 vm-1 pengine: [3038]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> May  7 13:21:13 vm-1 pengine: [3038]: notice: LogActions: Restart
> p_apache:0#011(Started vm-1)
> May  7 13:21:13 vm-1 pengine: [3038]: notice: LogActions: Start
> IP:1#011(vm-1)
> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> May  7 13:21:13 vm-1 crmd: [3039]: info: do_te_invoke: Processing graph
> 32 (ref=pe_calc-dc-1430997673-233) derived from
> /var/lib/pengine/pe-input-1907.bz2
> May  7 13:21:13 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
> action 14: stop p_apache:0_stop_0 on vm-1 (local)
> May  7 13:21:13 vm-1 lrmd: [3036]: info: cancel_op: operation
> monitor[54] on p_apache:0 for client 3039, its parameters:
> CRM_meta_start_delay=[0] CRM_meta_timeout=[20000]
> CRM_meta_name=[monitor] crm_feature_set=[3.0.6]
> CRM_meta_clone_node_max=[1] configfile=[/etc/apache2/apache2.conf]
> CRM_meta_clone=[0] CRM_meta_interval=[60000] CRM_meta_clone_max=[2]
> CRM_meta_notify=[false] statusurl=[http://localhost/server-status]
> CRM_meta_globally_unique=[false]  cancelled
> May  7 13:21:13 vm-1 lrmd: [3036]: info: rsc:p_apache:0 stop[55] (pid 27992)
> May  7 13:21:13 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
> action 22: start IP:1_start_0 on vm-1 (local)
> May  7 13:21:13 vm-1 lrmd: [3036]: info: rsc:IP:1 start[56] (pid 27993)
> May  7 13:21:13 vm-1 crmd: [3039]: info: process_lrm_event: LRM
> operation p_apache:0_monitor_60000 (call=54, status=1, cib-update=0,
> confirmed=true) Cancelled
> May  7 13:21:13 vm-1 pengine: [3038]: notice: process_pe_message:
> Transition 32: PEngine Input stored in: /var/lib/pengine/pe-input-1907.bz2
> May  7 13:21:13 vm-1 stonith-ng: [3035]: info: stonith_command:
> Processed st_execute from lrmd: rc=-1
> May  7 13:21:13 vm-1 external/libvirt[28069]: [28084]: notice:
> qemu+tcp://192.168.122.1/system: Running hypervisor: QEMU 2.0.0
> May  7 13:21:14 vm-1 IPaddr2[27993]: INFO: /usr/lib/heartbeat/send_arp
> -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.122.200 eth0
> 192.168.122.200 8bbd5f0b558f not_used not_used
> May  7 13:21:14 vm-1 lrmd: [3036]: info: operation start[56] on IP:1 for
> client 3039: pid 27993 exited with return code 0
> May  7 13:21:14 vm-1 crmd: [3039]: info: process_lrm_event: LRM
> operation IP:1_start_0 (call=56, rc=0, cib-update=342, confirmed=true) ok
> May  7 13:21:14 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
> action 23: monitor IP:1_monitor_10000 on vm-1 (local)
> May  7 13:21:14 vm-1 lrmd: [3036]: info: rsc:IP:1 monitor[57] (pid 28104)
> May  7 13:21:14 vm-1 lrmd: [3036]: info: operation monitor[57] on IP:1
> for client 3039: pid 28104 exited with return code 0
> May  7 13:21:14 vm-1 crmd: [3039]: info: process_lrm_event: LRM
> operation IP:1_monitor_10000 (call=57, rc=0, cib-update=343,
> confirmed=false) ok
> May  7 13:21:14 vm-1 lrmd: [3036]: info: RA output:
> (p_apache:0:stop:stderr) /usr/lib/ocf/resource.d//heartbeat/apache: 440:
> kill:
> May  7 13:21:14 vm-1 lrmd: [3036]: info: RA output:
> (p_apache:0:stop:stderr) No such process
> May  7 13:21:14 vm-1 lrmd: [3036]: info: RA output:
> (p_apache:0:stop:stderr)
> May  7 13:21:14 vm-1 apache[27992]: INFO: Killing apache PID 27235
> May  7 13:21:14 vm-1 apache[27992]: INFO: apache stopped.
> May  7 13:21:14 vm-1 lrmd: [3036]: info: operation stop[55] on
> p_apache:0 for client 3039: pid 27992 exited with return code 0
> May  7 13:21:14 vm-1 crmd: [3039]: info: process_lrm_event: LRM
> operation p_apache:0_stop_0 (call=55, rc=0, cib-update=344,
> confirmed=true) ok
> May  7 13:21:14 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
> action 15: start p_apache:0_start_0 on vm-1 (local)
> May  7 13:21:14 vm-1 lrmd: [3036]: info: rsc:p_apache:0 start[58] (pid
> 28166)
> May  7 13:21:14 vm-1 stonith: [28062]: info: external/libvirt device OK.
> May  7 13:21:14 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: Performing: stonith -t external/libvirt -S
> May  7 13:21:14 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: success:  0
> May  7 13:21:54 vm-1 lrmd: [3036]: WARN: p_apache:0:start process (PID
> 28166) timed out (try 1).  Killing with signal SIGTERM (15).
> May  7 13:21:54 vm-1 lrmd: [3036]: WARN: operation start[58] on
> p_apache:0 for client 3039: pid 28166 timed out
> May  7 13:21:54 vm-1 crmd: [3039]: ERROR: process_lrm_event: LRM
> operation p_apache:0_start_0 (58) Timed Out (timeout=40000ms)

This is another problem. Apache didn't start within 40s. Check apache's logs to see if there are any hints as to what went wrong.

If you stop pacemaker on both nodes and shut one of them down, can you manually bring up the resources on the other node? It's always good to verify that everything works without Pacemaker first. If you can bring up everything without Pacemaker, then perhaps the configuration error from above is related, or maybe your constraints need tweaking.

> May  7 13:21:54 vm-1 crmd: [3039]: WARN: status_from_rc: Action 15
> (p_apache:0_start_0) on vm-1 failed (target: 0 vs. rc: -2): Error
> May  7 13:21:54 vm-1 crmd: [3039]: WARN: update_failcount: Updating
> failcount for p_apache:0 on vm-1 after failed start: rc=-2
> (update=INFINITY, time=1430997714)
> May  7 13:21:54 vm-1 crmd: [3039]: info: abort_transition_graph:
> match_graph_event:277 - Triggered transition abort (complete=0,
> tag=lrm_rsc_op, id=p_apache:0_last_failure_0,
> magic=2:-2;15:32:0:da4997d2-70ea-45a7-9617-42963ea29e42, cib=0.502.74) :
> Event failed
> May  7 13:21:54 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 32
> (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0,
> Source=/var/lib/pengine/pe-input-1907.bz2): Stopped
> May  7 13:21:54 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-p_apache:0 (INFINITY)
> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_perform_update: Sent
> update 94: fail-count-p_apache:0=INFINITY
> May  7 13:21:54 vm-1 pengine: [3038]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> May  7 13:21:54 vm-1 pengine: [3038]: WARN: unpack_rsc_op: Processing
> failed op p_apache:0_last_failure_0 on vm-1: unknown exec error (-2)
> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-p_apache:0 (1430997714)
> May  7 13:21:54 vm-1 pengine: [3038]: notice: LogActions: Recover
> p_apache:0#011(Started vm-1)
> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_perform_update: Sent
> update 96: last-failure-p_apache:0=1430997714
> May  7 13:21:54 vm-1 crmd: [3039]: info: abort_transition_graph:
> te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
> id=status-vm-1-fail-count-p_apache.0, name=fail-count-p_apache:0,
> value=INFINITY, magic=NA, cib=0.502.75) : Transient attribute: update
> May  7 13:21:54 vm-1 crmd: [3039]: info: handle_response: pe_calc
> calculation pe_calc-dc-1430997714-238 is obsolete
> May  7 13:21:54 vm-1 crmd: [3039]: info: abort_transition_graph:
> te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
> id=status-vm-1-last-failure-p_apache.0, name=last-failure-p_apache:0,
> value=1430997714, magic=NA, cib=0.502.76) : Transient attribute: update
> May  7 13:21:54 vm-1 pengine: [3038]: notice: process_pe_message:
> Transition 33: PEngine Input stored in: /var/lib/pengine/pe-input-1908.bz2
> May  7 13:21:54 vm-1 pengine: [3038]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> May  7 13:21:54 vm-1 pengine: [3038]: WARN: unpack_rsc_op: Processing
> failed op p_apache:0_last_failure_0 on vm-1: unknown exec error (-2)
> May  7 13:21:54 vm-1 pengine: [3038]: WARN: common_apply_stickiness:
> Forcing cl_apache away from vm-1 after 1000000 failures (max=1000000)
> May  7 13:21:54 vm-1 pengine: [3038]: WARN: common_apply_stickiness:
> Forcing cl_apache away from vm-1 after 1000000 failures (max=1000000)
> May  7 13:21:54 vm-1 pengine: [3038]: notice: LogActions: Stop
> p_apache:0#011(vm-1)
> May  7 13:21:54 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> May  7 13:21:54 vm-1 crmd: [3039]: info: do_te_invoke: Processing graph
> 34 (ref=pe_calc-dc-1430997714-239) derived from
> /var/lib/pengine/pe-input-1909.bz2
> May  7 13:21:54 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
> action 4: stop p_apache:0_stop_0 on vm-1 (local)
> May  7 13:21:54 vm-1 lrmd: [3036]: info: rsc:p_apache:0 stop[59] (pid 28650)
> May  7 13:21:54 vm-1 pengine: [3038]: notice: process_pe_message:
> Transition 34: PEngine Input stored in: /var/lib/pengine/pe-input-1909.bz2
> May  7 13:21:54 vm-1 apache[28650]: INFO: apache is not running.
> May  7 13:21:54 vm-1 apache[28650]: INFO: apache children were signalled
> (SIGTERM)
> May  7 13:21:56 vm-1 apache[28650]: INFO: apache children were signalled
> (SIGHUP)
> May  7 13:21:57 vm-1 apache[28650]: INFO: apache children were signalled
> (SIGKILL)
> May  7 13:21:58 vm-1 lrmd: [3036]: info: operation stop[59] on
> p_apache:0 for client 3039: pid 28650 exited with return code 0
> May  7 13:21:58 vm-1 crmd: [3039]: info: process_lrm_event: LRM
> operation p_apache:0_stop_0 (call=59, rc=0, cib-update=349,
> confirmed=true) ok
> May  7 13:21:58 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 34
> (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-1909.bz2): Complete
> May  7 13:21:58 vm-1 crmd: [3039]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> May  7 13:22:14 vm-1 stonith-ng: [3035]: info: stonith_command:
> Processed st_execute from lrmd: rc=-1
> May  7 13:22:15 vm-1 external/libvirt[28943]: [28958]: notice:
> qemu+tcp://192.168.122.1/system: Running hypervisor: QEMU 2.0.0
> May  7 13:22:16 vm-1 stonith: [28935]: info: external/libvirt device OK.
> May  7 13:22:16 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: Performing: stonith -t external/libvirt -S
> May  7 13:22:16 vm-1 stonith-ng: [3035]: info: log_operation:
> p_fence_N2: success:  0
> 
> Online: [ vm-1 ]
> OFFLINE: [ vm-2 ]
> 
>   Clone Set: cl_gluster_mnt [p_gluster_mnt]
>       Started: [ vm-1 ]
>       Stopped: [ p_gluster_mnt:1 ]
>   Clone Set: cl_IP [IP] (unique)
>       IP:0       (ocf::heartbeat:IPaddr2):       Started vm-1
>       IP:1       (ocf::heartbeat:IPaddr2):       Started vm-1
> p_fence_N2      (stonith:external/libvirt):     Started vm-1
> 
> Failed actions:
>      p_apache:0_start_0 (node=vm-1, call=58, rc=-2, status=Timed Out):
> unknown exec error
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

-- 
-- Ken Gaillot <kgaillot at redhat.com>