[ClusterLabs] GlusterFS all apache instances stopped after failover

Ken Gaillot kgaillot at redhat.com
Thu May 7 13:45:07 EDT 2015


On 05/07/2015 12:29 PM, lukas wrote:
> Hi,
> 
> crm_verify -LV on node1 (the active one) doesn't show anything after
> failover. If I do a resource cleanup and start the resource (crm
> resource start cl_apache) it works without problems. But bringing it up
> manually is no solution...
> 
> If I restart both machines I get the following output in apache.error.log
> [Thu May 07 18:11:36 2015] [notice] mod_bw : Memory Allocated 48 bytes
> (each conf takes 48 bytes)
> [Thu May 07 18:11:36 2015] [notice] mod_bw : Version 0.92 - Initialized
> [1 Confs]
> [Thu May 07 18:11:36 2015] [notice] Apache/2.2.22 (Debian) configured --
> resuming normal operations
> [Thu May 07 18:11:36 2015] [error] (2)No such file or directory:
> Couldn't bind unix domain socket /var/run/apache2/cgisock.3551
> [Thu May 07 18:11:37 2015] [crit] cgid daemon failed to initialize
>
> But the resources are up and everything works as expected. Apache is
> only managed by pacemaker (not started at bootup).

Yes, that looks unrelated to your original issue, but it is something to
look at.

Most likely /var/run/apache2 does not exist. /var/run is usually a
memory-backed filesystem, meaning it is erased on every reboot. Normally
the distribution's init script recreates that subdirectory at boot, but
of course here the init script is not used.

Hopefully someone more familiar with apache resources can give a good
solution; my first instinct would be to modify the resource agent to
create /var/run/apache2 at start if it doesn't already exist.

The logs above imply that any CGI scripts on your site likely won't work
until that issue is fixed.
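
A minimal sketch of that tweak, as a shell helper the apache agent's
start action could call first. The path, the www-data ownership, and the
commented call site are Debian-flavoured assumptions for illustration,
not the agent's actual code:

```shell
#!/bin/sh
# Sketch: make sure apache's runtime directory exists before starting it.
# The default path and www-data ownership are Debian assumptions; adjust.

ensure_rundir() {
    rundir="${1:-/var/run/apache2}"
    if [ ! -d "$rundir" ]; then
        mkdir -p "$rundir" || return 1
        # chown fails silently when not run as root; tolerable for a sketch
        chown www-data:www-data "$rundir" 2>/dev/null || true
    fi
}

# Hypothetical call site inside the agent's start action:
# ensure_rundir /var/run/apache2 || exit "$OCF_ERR_GENERIC"
```

An alternative that avoids patching the agent is to create the directory
from the boot sequence itself (e.g. an rc.local entry), so it already
exists before pacemaker ever starts apache.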

> Do you have any suggestions for the constraints?
> 
> On 2015-05-07 15:05, Ken Gaillot wrote:
>> Comments inline ...
>>
>> ----- Original Message -----
>>> Hi,
>>>
>>> I have a two-node config where a GlusterFS volume is mounted as
>>> apache's document root. After producing a kernel panic on node2 with
>>> echo c > /proc/sysrq-trigger, node2 is turned off by stonith. The
>>> service should keep running on node1, but it stops because apache is
>>> stopped on node1 as well. Why is apache stopped on node1? Does anyone
>>> have ideas? Logs below.
>>> Setup: pacemaker 1.1.7, corosync 1.4
>>>
>>>
>>>
>>> Online: [ vm-1 vm-2 ]
>>>
>>>    Clone Set: cl_gluster_mnt [p_gluster_mnt]
>>>        Started: [ vm-1 vm-2 ]
>>>    Clone Set: cl_apache [p_apache]
>>>        Started: [ vm-1 vm-2 ]
>>>    Clone Set: cl_IP [IP] (unique)
>>>        IP:0       (ocf::heartbeat:IPaddr2):       Started vm-1
>>>        IP:1       (ocf::heartbeat:IPaddr2):       Started vm-2
>>> p_fence_N1      (stonith:external/libvirt):     Started vm-2
>>> p_fence_N2      (stonith:external/libvirt):     Started vm-1
>>>
>>> root at vm-1:~# crm configure show
>>> node vm-1 \
>>>       attributes standby="off"
>>> node vm-2 \
>>>       attributes standby="off"
>>> primitive IP ocf:heartbeat:IPaddr2 \
>>>       params ip="192.168.122.200" nic="eth0"
>>> clusterip_hash="sourceip-sourceport" \
>>>       op monitor interval="10s"
>>> primitive p_apache ocf:heartbeat:apache \
>>>       params configfile="/etc/apache2/apache2.conf"
>>> statusurl="http://localhost/server-status" \
>>>       op monitor interval="60" timeout="20" \
>>>       op start interval="0" timeout="40s" start-delay="0" \
>>>       meta is-managed="true"
>>> primitive p_fence_N1 stonith:external/libvirt \
>>>       params hostlist="vm-1:N1"
>>> hypervisor_uri="qemu+tcp://192.168.122.1/system"
>>> pcmk_reboot_action="off" \
>>>       op monitor interval="60" \
>>>       meta target-role="Started"
>>> primitive p_fence_N2 stonith:external/libvirt \
>>>       params hostlist="vm-2:N2"
>>> hypervisor_uri="qemu+tcp://192.168.122.1/system"
>>> pcmk_reboot_action="off" \
>>>       op monitor interval="60"
>>> primitive p_gluster_mnt ocf:heartbeat:Filesystem \
>>>       params device="localhost:/gvolrep" directory="/var/www/html"
>>> fstype="glusterfs" \
>>>       op monitor interval="10"
>>> clone cl_IP IP \
>>>       meta globally-unique="true" clone-max="2" clone-node-max="2" \
>>>       params resource-stickiness="0"
>>> clone cl_apache p_apache \
>>>       meta target-role="Started"
>>> clone cl_gluster_mnt p_gluster_mnt \
>>>       meta target-role="Started"
>>> location l_fence_N1 p_fence_N1 -inf: vm-1
>>> location l_fence_N2 p_fence_N2 -inf: vm-2
>>> colocation c_apache_gluster inf: cl_IP cl_gluster_mnt
>>> colocation c_ip_apache inf: cl_apache cl_IP
>>> order o_apache inf: cl_gluster_mnt cl_IP cl_apache
>>> property $id="cib-bootstrap-options" \
>>>       dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>>>       cluster-infrastructure="openais" \
>>>       expected-quorum-votes="2" \
>>>       stonith-enabled="true" \
>>>       no-quorum-policy="ignore" \
>>>       last-lrm-refresh="1430996556"
>>> rsc_defaults $id="rsc-options" \
>>>       resource-stickiness="100"
>>> op_defaults $id="op-options" \
>>>       timeout="240s"
>>>
>>> ############################
>>> root at vm-1:~# tail -f /var/log/syslog
>>> May  7 13:19:58 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
>>> action 27: monitor p_apache:1_monitor_60000 on vm-2
>>> May  7 13:19:58 vm-1 lrmd: [3036]: info: operation monitor[54] on
>>> p_apache:0 for client 3039: pid 27310 exited with return code 0
>>> May  7 13:19:58 vm-1 crmd: [3039]: info: process_lrm_event: LRM
>>> operation p_apache:0_monitor_60000 (call=54, rc=0, cib-update=322,
>>> confirmed=false) ok
>>> May  7 13:19:58 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 30
>>> (Complete=34, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>> Source=/var/lib/pengine/pe-input-1906.bz2): Complete
>>> May  7 13:19:58 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> May  7 13:20:12 vm-1 stonith-ng: [3035]: info: stonith_command:
>>> Processed st_execute from lrmd: rc=-1
>>> May  7 13:20:12 vm-1 external/libvirt[27429]: [27444]: notice:
>>> qemu+tcp://192.168.122.1/system: Running hypervisor: QEMU 2.0.0
>>> May  7 13:20:13 vm-1 stonith: [27422]: info: external/libvirt device OK.
>>> May  7 13:20:13 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: Performing: stonith -t external/libvirt -S
>>> May  7 13:20:13 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: success:  0
>>> May  7 13:21:08 vm-1 corosync[2971]:   [TOTEM ] A processor failed,
>>> forming new configuration.
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] notice:
>>> pcmk_peer_update: Transitional membership event on ring 7024: memb=1,
>>> new=0, lost=1
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: pcmk_peer_update:
>>> memb: vm-1 2138745024
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: pcmk_peer_update:
>>> lost: vm-2 930785472
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] notice:
>>> pcmk_peer_update: Stable membership event on ring 7024: memb=1, new=0,
>>> lost=0
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: pcmk_peer_update:
>>> MEMB: vm-1 2138745024
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info:
>>> ais_mark_unseen_peer_dead: Node vm-2 was not seen in the previous
>>> transition
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info: update_member:
>>> Node 930785472/vm-2 is now: lost
>>> May  7 13:21:12 vm-1 corosync[2971]:   [pcmk  ] info:
>>> send_member_notification: Sending membership update 7024 to 2 children
>>> May  7 13:21:12 vm-1 cib: [3034]: notice: ais_dispatch_message:
>>> Membership 7024: quorum lost
>>> May  7 13:21:12 vm-1 cib: [3034]: info: crm_update_peer: Node vm-2:
>>> id=930785472 state=lost (new) addr=r(0) ip(192.168.122.183)  votes=1
>>> born=7020 seen=7020 proc=00000000000000000000000000111312
>>> May  7 13:21:12 vm-1 crmd: [3039]: notice: ais_dispatch_message:
>>> Membership 7024: quorum lost
>>> May  7 13:21:12 vm-1 crmd: [3039]: info: ais_status_callback: status:
>>> vm-2 is now lost (was member)
>>> May  7 13:21:12 vm-1 corosync[2971]:   [TOTEM ] A processor joined or
>>> left the membership and a new membership was formed.
>>> May  7 13:21:12 vm-1 crmd: [3039]: info: crm_update_peer: Node vm-2:
>>> id=930785472 state=lost (new) addr=r(0) ip(192.168.122.183)  votes=1
>>> born=7020 seen=7020 proc=00000000000000000000000000111312
>>> May  7 13:21:12 vm-1 corosync[2971]:   [CPG   ] chosen downlist: sender
>>> r(0) ip(192.168.122.127) ; members(old:2 left:1)
>>> May  7 13:21:12 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_modify for section nodes (origin=local/crmd/323,
>>> version=0.502.55): ok (rc=0)
>>> May  7 13:21:12 vm-1 corosync[2971]:   [MAIN  ] Completed service
>>> synchronization, ready to provide service.
>>> May  7 13:21:12 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_modify for section cib (origin=local/crmd/325,
>>> version=0.502.57): ok (rc=0)
>>> May  7 13:21:12 vm-1 crmd: [3039]: info: crmd_ais_dispatch: Setting
>>> expected votes to 2
>>> May  7 13:21:12 vm-1 crmd: [3039]: WARN: match_down_event: No match for
>>> shutdown action on vm-2
>>> May  7 13:21:12 vm-1 crmd: [3039]: info: te_update_diff:
>>> Stonith/shutdown of vm-2 not matched
>>> May  7 13:21:12 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> te_update_diff:234 - Triggered transition abort (complete=1,
>>> tag=node_state, id=vm-2, magic=NA, cib=0.502.56) : Node failure
>>> May  7 13:21:12 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>>> May  7 13:21:12 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_modify for section crm_config (origin=local/crmd/327,
>>> version=0.502.58): ok (rc=0)
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: unpack_config: On loss of
>>> CCM Quorum: Ignore
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: pe_fence_node: Node vm-2
>>> will be fenced because it is un-expectedly down
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: determine_online_status:
>>> Node vm-2 is unclean
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
>>> p_gluster_mnt:1_stop_0 on vm-2 is unrunnable (offline)
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
>>> vm-2 unclean
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
>>> p_apache:1_stop_0 on vm-2 is unrunnable (offline)
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
>>> vm-2 unclean
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
>>> IP:1_stop_0 on vm-2 is unrunnable (offline)
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
>>> vm-2 unclean
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Action
>>> p_fence_N1_stop_0 on vm-2 is unrunnable (offline)
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: custom_action: Marking node
>>> vm-2 unclean
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: stage6: Scheduling Node vm-2
>>> for STONITH
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Stop
>>> p_gluster_mnt:1#011(vm-2)
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Restart
>>> p_apache:0#011(Started vm-1)
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Stop
>>> p_apache:1#011(vm-2)
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Move
>>> IP:1#011(Started vm-2 -> vm-1)
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: LogActions: Stop
>>> p_fence_N1#011(vm-2)
>>> May  7 13:21:12 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>>> cause=C_IPC_MESSAGE origin=handle_response ]
>>> May  7 13:21:12 vm-1 crmd: [3039]: info: do_te_invoke: Processing graph
>>> 31 (ref=pe_calc-dc-1430997672-228) derived from
>>> /var/lib/pengine/pe-warn-661.bz2
>>> May  7 13:21:12 vm-1 crmd: [3039]: notice: te_fence_node: Executing
>>> reboot fencing operation (36) on vm-2 (timeout=60000)
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
>>> initiate_remote_stonith_op: Initiating remote operation reboot for vm-2:
>>> 9a25d1de-b42d-44b1-9d49-368f36fa57fd
>>> May  7 13:21:12 vm-1 pengine: [3038]: WARN: process_pe_message:
>>> Transition 31: WARNINGs found during PE processing. PEngine Input stored
>>> in: /var/lib/pengine/pe-warn-661.bz2
>>> May  7 13:21:12 vm-1 pengine: [3038]: notice: process_pe_message:
>>> Configuration WARNINGs found during PE processing.  Please run
>>> "crm_verify -L" to identify issues.
>> The above is a problem. It means there is an error in your configuration.
>>
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
>>> can_fence_host_with_device: Refreshing port list for p_fence_N2
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
>>> can_fence_host_with_device: p_fence_N2 can fence vm-2: dynamic-list
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: call_remote_stonith:
>>> Requesting that vm-1 perform op reboot vm-2
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info:
>>> can_fence_host_with_device: p_fence_N2 can fence vm-2: dynamic-list
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: stonith_fence: Found 1
>>> matching devices for 'vm-2'
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: stonith_command:
>>> Processed st_fence from vm-1: rc=-1
>>> May  7 13:21:12 vm-1 stonith-ng: [3035]: info: make_args: Substituting
>>> action 'off' for requested operation 'reboot'
>>> May  7 13:21:12 vm-1 external/libvirt[27944]: [27957]: notice: Domain N2
>>> was stopped
>>> May  7 13:21:13 vm-1 stonith-ng: [3035]: notice: log_operation:
>>> Operation 'reboot' [27936] (call 0 from
>>> c5982dd8-7639-49be-a9f9-c8488bea6091) for host 'vm-2' with device
>>> 'p_fence_N2' returned: 0
>>> May  7 13:21:13 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: Performing: stonith -t external/libvirt -T off vm-2
>>> May  7 13:21:13 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: success: vm-2 0
>>> May  7 13:21:13 vm-1 stonith-ng: [3035]: notice: remote_op_done:
>>> Operation reboot of vm-2 by vm-1 for
>>> vm-1[c5982dd8-7639-49be-a9f9-c8488bea6091]: OK
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: tengine_stonith_callback:
>>> StonithOp <st-reply st_origin="stonith_construct_async_reply"
>>> t="stonith-ng" st_op="reboot"
>>> st_remote_op="9a25d1de-b42d-44b1-9d49-368f36fa57fd"
>>> st_clientid="c5982dd8-7639-49be-a9f9-c8488bea6091" st_target="vm-2"
>>> st_device_action="st_fence" st_callid="0" st_callopt="0" st_rc="0"
>>> st_output="Performing: stonith -t external/libvirt -T off
>>> vm-2#012success: vm-2 0#012" src="vm-1" seq="16" state="2" />
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: erase_status_tag: Deleting
>>> xpath: //node_state[@uname='vm-2']/lrm
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: erase_status_tag: Deleting
>>> xpath: //node_state[@uname='vm-2']/transient_attributes
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: crmd_peer_update: Status
>>> update: Client vm-2/crmd now has status [offline] (DC=true)
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: tengine_stonith_notify: Peer
>>> vm-2 was terminated (reboot) by vm-1 for vm-1: OK
>>> (ref=9a25d1de-b42d-44b1-9d49-368f36fa57fd)
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_TRANSITION_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN
>>> cause=C_FSA_INTERNAL origin=check_join_state ]
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> do_te_invoke:169 - Triggered transition abort (complete=0) : Peer Halt
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 31
>>> (Complete=3, Pending=0, Fired=0, Skipped=15, Incomplete=5,
>>> Source=/var/lib/pengine/pe-warn-661.bz2): Stopped
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> do_te_invoke:169 - Triggered transition abort (complete=1) : Peer Halt
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: join_make_offer: Making join
>>> offers based on membership 7024
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: do_dc_join_offer_all: join-11:
>>> Waiting on 1 outstanding join acks
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: update_dc: Set DC to vm-1
>>> (3.0.6)
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: cib_fencing_updated: Fencing
>>> update 329 for vm-2: complete
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_delete for section //node_state[@uname='vm-2']/lrm
>>> (origin=local/crmd/330, version=0.502.60): ok (rc=0)
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_delete for section
>>> //node_state[@uname='vm-2']/transient_attributes (origin=local/crmd/331,
>>> version=0.502.61): ok (rc=0)
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
>>> cause=C_FSA_INTERNAL origin=check_join_state ]
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: do_dc_join_finalize: join-11:
>>> Syncing the CIB from vm-1 to the rest of the cluster
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_sync for section 'all' (origin=local/crmd/334,
>>> version=0.502.62): ok (rc=0)
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_modify for section nodes (origin=local/crmd/335,
>>> version=0.502.63): ok (rc=0)
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: do_dc_join_ack: join-11:
>>> Updating node state to member for vm-1
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: erase_status_tag: Deleting
>>> xpath: //node_state[@uname='vm-1']/lrm
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_delete for section //node_state[@uname='vm-1']/lrm
>>> (origin=local/crmd/336, version=0.502.64): ok (rc=0)
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED
>>> cause=C_FSA_INTERNAL origin=check_join_state ]
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> do_te_invoke:162 - Triggered transition abort (complete=1) : Peer
>>> Cancelled
>>> May  7 13:21:13 vm-1 attrd: [3037]: notice: attrd_local_callback:
>>> Sending full refresh (origin=crmd)
>>> May  7 13:21:13 vm-1 attrd: [3037]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-p_apache:0 (1430996422)
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_modify for section nodes (origin=local/crmd/338,
>>> version=0.502.66): ok (rc=0)
>>> May  7 13:21:13 vm-1 cib: [3034]: info: cib_process_request: Operation
>>> complete: op cib_modify for section cib (origin=local/crmd/340,
>>> version=0.502.68): ok (rc=0)
>>> May  7 13:21:13 vm-1 attrd: [3037]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (true)
>>> May  7 13:21:13 vm-1 pengine: [3038]: notice: unpack_config: On loss of
>>> CCM Quorum: Ignore
>>> May  7 13:21:13 vm-1 pengine: [3038]: notice: LogActions: Restart
>>> p_apache:0#011(Started vm-1)
>>> May  7 13:21:13 vm-1 pengine: [3038]: notice: LogActions: Start
>>> IP:1#011(vm-1)
>>> May  7 13:21:13 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>>> cause=C_IPC_MESSAGE origin=handle_response ]
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: do_te_invoke: Processing graph
>>> 32 (ref=pe_calc-dc-1430997673-233) derived from
>>> /var/lib/pengine/pe-input-1907.bz2
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
>>> action 14: stop p_apache:0_stop_0 on vm-1 (local)
>>> May  7 13:21:13 vm-1 lrmd: [3036]: info: cancel_op: operation
>>> monitor[54] on p_apache:0 for client 3039, its parameters:
>>> CRM_meta_start_delay=[0] CRM_meta_timeout=[20000]
>>> CRM_meta_name=[monitor] crm_feature_set=[3.0.6]
>>> CRM_meta_clone_node_max=[1] configfile=[/etc/apache2/apache2.conf]
>>> CRM_meta_clone=[0] CRM_meta_interval=[60000] CRM_meta_clone_max=[2]
>>> CRM_meta_notify=[false] statusurl=[http://localhost/server-status]
>>> CRM_meta_globally_unique=[false]  cancelled
>>> May  7 13:21:13 vm-1 lrmd: [3036]: info: rsc:p_apache:0 stop[55] (pid
>>> 27992)
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
>>> action 22: start IP:1_start_0 on vm-1 (local)
>>> May  7 13:21:13 vm-1 lrmd: [3036]: info: rsc:IP:1 start[56] (pid 27993)
>>> May  7 13:21:13 vm-1 crmd: [3039]: info: process_lrm_event: LRM
>>> operation p_apache:0_monitor_60000 (call=54, status=1, cib-update=0,
>>> confirmed=true) Cancelled
>>> May  7 13:21:13 vm-1 pengine: [3038]: notice: process_pe_message:
>>> Transition 32: PEngine Input stored in:
>>> /var/lib/pengine/pe-input-1907.bz2
>>> May  7 13:21:13 vm-1 stonith-ng: [3035]: info: stonith_command:
>>> Processed st_execute from lrmd: rc=-1
>>> May  7 13:21:13 vm-1 external/libvirt[28069]: [28084]: notice:
>>> qemu+tcp://192.168.122.1/system: Running hypervisor: QEMU 2.0.0
>>> May  7 13:21:14 vm-1 IPaddr2[27993]: INFO: /usr/lib/heartbeat/send_arp
>>> -i 200 -r 5 -p /var/run/resource-agents/send_arp-192.168.122.200 eth0
>>> 192.168.122.200 8bbd5f0b558f not_used not_used
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: operation start[56] on IP:1 for
>>> client 3039: pid 27993 exited with return code 0
>>> May  7 13:21:14 vm-1 crmd: [3039]: info: process_lrm_event: LRM
>>> operation IP:1_start_0 (call=56, rc=0, cib-update=342,
>>> confirmed=true) ok
>>> May  7 13:21:14 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
>>> action 23: monitor IP:1_monitor_10000 on vm-1 (local)
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: rsc:IP:1 monitor[57] (pid
>>> 28104)
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: operation monitor[57] on IP:1
>>> for client 3039: pid 28104 exited with return code 0
>>> May  7 13:21:14 vm-1 crmd: [3039]: info: process_lrm_event: LRM
>>> operation IP:1_monitor_10000 (call=57, rc=0, cib-update=343,
>>> confirmed=false) ok
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: RA output:
>>> (p_apache:0:stop:stderr) /usr/lib/ocf/resource.d//heartbeat/apache: 440:
>>> kill:
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: RA output:
>>> (p_apache:0:stop:stderr) No such process
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: RA output:
>>> (p_apache:0:stop:stderr)
>>> May  7 13:21:14 vm-1 apache[27992]: INFO: Killing apache PID 27235
>>> May  7 13:21:14 vm-1 apache[27992]: INFO: apache stopped.
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: operation stop[55] on
>>> p_apache:0 for client 3039: pid 27992 exited with return code 0
>>> May  7 13:21:14 vm-1 crmd: [3039]: info: process_lrm_event: LRM
>>> operation p_apache:0_stop_0 (call=55, rc=0, cib-update=344,
>>> confirmed=true) ok
>>> May  7 13:21:14 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
>>> action 15: start p_apache:0_start_0 on vm-1 (local)
>>> May  7 13:21:14 vm-1 lrmd: [3036]: info: rsc:p_apache:0 start[58] (pid
>>> 28166)
>>> May  7 13:21:14 vm-1 stonith: [28062]: info: external/libvirt device OK.
>>> May  7 13:21:14 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: Performing: stonith -t external/libvirt -S
>>> May  7 13:21:14 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: success:  0
>>> May  7 13:21:54 vm-1 lrmd: [3036]: WARN: p_apache:0:start process (PID
>>> 28166) timed out (try 1).  Killing with signal SIGTERM (15).
>>> May  7 13:21:54 vm-1 lrmd: [3036]: WARN: operation start[58] on
>>> p_apache:0 for client 3039: pid 28166 timed out
>>> May  7 13:21:54 vm-1 crmd: [3039]: ERROR: process_lrm_event: LRM
>>> operation p_apache:0_start_0 (58) Timed Out (timeout=40000ms)
>> This is another problem. Apache didn't start within 40s. Check
>> apache's logs to see if there are any hints as to what went wrong.
>>
>> If you stop pacemaker on both nodes and shut one of them down, can you
>> manually bring up the resources on the other node? It's always good to
>> verify that everything works without Pacemaker first. If you can bring
>> up everything without Pacemaker, then perhaps the configuration error
>> from above is related, or maybe your constraints need tweaking.
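
For reference, the manual bring-up suggested above would be roughly the
following sequence, with the values taken from the posted configuration.
This is a dry-run sketch that only prints the commands (they all need
root); the /24 prefix is an assumption, since the CIB sets no
cidr_netmask, and apache2ctl is the Debian front end:

```shell
#!/bin/sh
# Dry-run sketch: print the manual bring-up steps instead of running them.
# Values mirror the posted CIB.
DEVICE="localhost:/gvolrep"      # p_gluster_mnt device
MNT="/var/www/html"              # p_gluster_mnt directory
VIP="192.168.122.200"            # IPaddr2 ip
NIC="eth0"                       # IPaddr2 nic
CONF="/etc/apache2/apache2.conf" # p_apache configfile

cat <<EOF
mount -t glusterfs $DEVICE $MNT
ip addr add $VIP/24 dev $NIC        # /24 is an assumed netmask
apache2ctl -f $CONF -k start
curl -fs http://localhost/server-status   # same URL the monitor polls
EOF
```

If every step succeeds by hand but the clustered start still times out,
the problem is more likely in the constraints or the agent's environment
than in apache itself.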
>>
>>> May  7 13:21:54 vm-1 crmd: [3039]: WARN: status_from_rc: Action 15
>>> (p_apache:0_start_0) on vm-1 failed (target: 0 vs. rc: -2): Error
>>> May  7 13:21:54 vm-1 crmd: [3039]: WARN: update_failcount: Updating
>>> failcount for p_apache:0 on vm-1 after failed start: rc=-2
>>> (update=INFINITY, time=1430997714)
>>> May  7 13:21:54 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> match_graph_event:277 - Triggered transition abort (complete=0,
>>> tag=lrm_rsc_op, id=p_apache:0_last_failure_0,
>>> magic=2:-2;15:32:0:da4997d2-70ea-45a7-9617-42963ea29e42, cib=0.502.74) :
>>> Event failed
>>> May  7 13:21:54 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 32
>>> (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0,
>>> Source=/var/lib/pengine/pe-input-1907.bz2): Stopped
>>> May  7 13:21:54 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: fail-count-p_apache:0 (INFINITY)
>>> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_perform_update: Sent
>>> update 94: fail-count-p_apache:0=INFINITY
>>> May  7 13:21:54 vm-1 pengine: [3038]: notice: unpack_config: On loss of
>>> CCM Quorum: Ignore
>>> May  7 13:21:54 vm-1 pengine: [3038]: WARN: unpack_rsc_op: Processing
>>> failed op p_apache:0_last_failure_0 on vm-1: unknown exec error (-2)
>>> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: last-failure-p_apache:0 (1430997714)
>>> May  7 13:21:54 vm-1 pengine: [3038]: notice: LogActions: Recover
>>> p_apache:0#011(Started vm-1)
>>> May  7 13:21:54 vm-1 attrd: [3037]: notice: attrd_perform_update: Sent
>>> update 96: last-failure-p_apache:0=1430997714
>>> May  7 13:21:54 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
>>> id=status-vm-1-fail-count-p_apache.0, name=fail-count-p_apache:0,
>>> value=INFINITY, magic=NA, cib=0.502.75) : Transient attribute: update
>>> May  7 13:21:54 vm-1 crmd: [3039]: info: handle_response: pe_calc
>>> calculation pe_calc-dc-1430997714-238 is obsolete
>>> May  7 13:21:54 vm-1 crmd: [3039]: info: abort_transition_graph:
>>> te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
>>> id=status-vm-1-last-failure-p_apache.0, name=last-failure-p_apache:0,
>>> value=1430997714, magic=NA, cib=0.502.76) : Transient attribute: update
>>> May  7 13:21:54 vm-1 pengine: [3038]: notice: process_pe_message:
>>> Transition 33: PEngine Input stored in:
>>> /var/lib/pengine/pe-input-1908.bz2
>>> May  7 13:21:54 vm-1 pengine: [3038]: notice: unpack_config: On loss of
>>> CCM Quorum: Ignore
>>> May  7 13:21:54 vm-1 pengine: [3038]: WARN: unpack_rsc_op: Processing
>>> failed op p_apache:0_last_failure_0 on vm-1: unknown exec error (-2)
>>> May  7 13:21:54 vm-1 pengine: [3038]: WARN: common_apply_stickiness:
>>> Forcing cl_apache away from vm-1 after 1000000 failures (max=1000000)
>>> May  7 13:21:54 vm-1 pengine: [3038]: WARN: common_apply_stickiness:
>>> Forcing cl_apache away from vm-1 after 1000000 failures (max=1000000)
>>> May  7 13:21:54 vm-1 pengine: [3038]: notice: LogActions: Stop
>>> p_apache:0#011(vm-1)
>>> May  7 13:21:54 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>>> cause=C_IPC_MESSAGE origin=handle_response ]
>>> May  7 13:21:54 vm-1 crmd: [3039]: info: do_te_invoke: Processing graph
>>> 34 (ref=pe_calc-dc-1430997714-239) derived from
>>> /var/lib/pengine/pe-input-1909.bz2
>>> May  7 13:21:54 vm-1 crmd: [3039]: info: te_rsc_command: Initiating
>>> action 4: stop p_apache:0_stop_0 on vm-1 (local)
>>> May  7 13:21:54 vm-1 lrmd: [3036]: info: rsc:p_apache:0 stop[59] (pid
>>> 28650)
>>> May  7 13:21:54 vm-1 pengine: [3038]: notice: process_pe_message:
>>> Transition 34: PEngine Input stored in:
>>> /var/lib/pengine/pe-input-1909.bz2
>>> May  7 13:21:54 vm-1 apache[28650]: INFO: apache is not running.
>>> May  7 13:21:54 vm-1 apache[28650]: INFO: apache children were signalled
>>> (SIGTERM)
>>> May  7 13:21:56 vm-1 apache[28650]: INFO: apache children were signalled
>>> (SIGHUP)
>>> May  7 13:21:57 vm-1 apache[28650]: INFO: apache children were signalled
>>> (SIGKILL)
>>> May  7 13:21:58 vm-1 lrmd: [3036]: info: operation stop[59] on
>>> p_apache:0 for client 3039: pid 28650 exited with return code 0
>>> May  7 13:21:58 vm-1 crmd: [3039]: info: process_lrm_event: LRM
>>> operation p_apache:0_stop_0 (call=59, rc=0, cib-update=349,
>>> confirmed=true) ok
>>> May  7 13:21:58 vm-1 crmd: [3039]: notice: run_graph: ==== Transition 34
>>> (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>>> Source=/var/lib/pengine/pe-input-1909.bz2): Complete
>>> May  7 13:21:58 vm-1 crmd: [3039]: notice: do_state_transition: State
>>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> May  7 13:22:14 vm-1 stonith-ng: [3035]: info: stonith_command:
>>> Processed st_execute from lrmd: rc=-1
>>> May  7 13:22:15 vm-1 external/libvirt[28943]: [28958]: notice:
>>> qemu+tcp://192.168.122.1/system: Running hypervisor: QEMU 2.0.0
>>> May  7 13:22:16 vm-1 stonith: [28935]: info: external/libvirt device OK.
>>> May  7 13:22:16 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: Performing: stonith -t external/libvirt -S
>>> May  7 13:22:16 vm-1 stonith-ng: [3035]: info: log_operation:
>>> p_fence_N2: success:  0
>>>
>>> Online: [ vm-1 ]
>>> OFFLINE: [ vm-2 ]
>>>
>>>    Clone Set: cl_gluster_mnt [p_gluster_mnt]
>>>        Started: [ vm-1 ]
>>>        Stopped: [ p_gluster_mnt:1 ]
>>>    Clone Set: cl_IP [IP] (unique)
>>>        IP:0       (ocf::heartbeat:IPaddr2):       Started vm-1
>>>        IP:1       (ocf::heartbeat:IPaddr2):       Started vm-1
>>> p_fence_N2      (stonith:external/libvirt):     Started vm-1
>>>
>>> Failed actions:
>>>       p_apache:0_start_0 (node=vm-1, call=58, rc=-2, status=Timed Out):
>>> unknown exec error
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
> 
> 




