[Pacemaker] Failed start of a resource after a Debian upgrading

Tue Jan 24 17:52:20 EST 2012

Hi,

On Tue, Jan 24, 2012 at 06:31:54PM +0100, Michal Vyoral wrote:
> Hello,
> we had a cluster of two nodes both running Debian 5.0, each with two resources
> IPaddr2 and apache managed by pacemaker 1.0.9.1. After an upgrading of
> one node from Debian 5.0 to 6.0 we have a problem to start the
> apache resource on the upgraded node. Here are the details:
> 
> Versions of heartbeat and pacemaker before the upgrade:
> pr-iso1:~# dpkg -l pacemaker heartbeat
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
> ||/ Name           Version        Description
> +++-==============-==============-============================================
> ii  heartbeat      1:3.0.3-2~bpo5 Subsystem for High-Availability Linux
> ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager
> 
> Versions of heartbeat and pacemaker after the upgrade:
> pr-iso2:~# dpkg -l pacemaker heartbeat
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> ||/ Name           Version        Description
> +++-==============-==============-============================================
> ii  heartbeat      1:3.0.3-2      Subsystem for High-Availability Linux
> ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager
> 
> Status of the resources on the upgraded node:
> pr-iso2:~# crm_mon
> ============
> Last updated: Tue Jan 24 10:14:12 2012
> Stack: Heartbeat
> Current DC: pr-iso2 (511079a9-0f71-4537-bdf9-07714b454441) - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pr-iso2 ]
> OFFLINE: [ pr-iso1 ]
> 
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso2
> 
> Failed actions:
>     RTWeb_start_0 (node=pr-iso2, call=7, rc=1, status=complete): unknown error
> 
> Status of the resources on the non upgraded node:
> pr-iso1:~# crm_mon
> ============
> Last updated: Tue Jan 24 17:08:22 2012
> Stack: Heartbeat
> Current DC: pr-iso1 (014268aa-f234-4789-b4a1-0053cf4e61b9) - partition with quor
> um
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, unknown expected votes
> 2 Resources configured.
> ============
> 
> Online: [ pr-iso1 pr-iso2 ]
> 
> ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso1
> RTWeb   (ocf::heartbeat:apache):        Started pr-iso1
> 
> Configuration of the resources:
> pr-iso1:~# crm configure show
> node $id="014268aa-f234-4789-b4a1-0053cf4e61b9" pr-iso1
> node $id="511079a9-0f71-4537-bdf9-07714b454441" pr-iso2
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="10.5.75.83" cidr_netmask="24" \
>         op monitor interval="30s"
> primitive RTWeb ocf:heartbeat:apache \
>         params configfile="/etc/apache2/apache2.conf" \
>         op monitor interval="1min" \
>         meta target-role="Started" is-managed="true"
> colocation website-with-ip inf: RTWeb ClusterIP
> order rtweb_after_clustrip inf: ClusterIP RTWeb
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
>         cluster-infrastructure="Heartbeat" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1327399494"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
> 
> Records in the /var/log/ha-log related to RTWeb resource:
> pr-iso2:~# grep RTWeb /var/log/ha-log
> Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=7:76:7:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_monitor_0 )
> Jan 24 10:04:56 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:4: probe
> Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_monitor_0 (call=4, rc=7, cib-update=13, confirmed=true) not running
> Jan 24 10:12:48 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=11:77:0:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_start_0 )
> Jan 24 10:12:48 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:7: start

After this message there should be a bit more (look for "apache"
or "lrmd"). Next resource agents are going to log the resource
name too (RTWeb in this case). If you cannot find anything here,
then the answer must be in the apache logs.

Thanks,

Dejan

> Jan 24 10:12:50 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_start_0 (call=7, rc=1, cib-update=16, confirmed=true) unknown error
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-RTWeb (INFINITY)
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 20: fail-count-RTWeb=INFINITY
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: find_hash_entry: Creating hash entry for last-failure-RTWeb
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-RTWeb (1327399969)
> Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 23: last-failure-RTWeb=1327399969
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-RTWeb (INFINITY)
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 27: fail-count-RTWeb=INFINITY
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-RTWeb (1327399969)
> Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 30: last-failure-RTWeb=1327399969
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing failed op RTWeb_start_0 on pr-iso2: unknown error (1)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    (ocf::heartbeat:apache):        Started pr-iso2 FAILED
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: RecurringOp:  Start recurring monitor (60s) for RTWeb on pr-iso2
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: LogActions: Recover resource RTWeb     (Started pr-iso2)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing failed op RTWeb_start_0 on pr-iso2: unknown error (1)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    (ocf::heartbeat:apache):        Started pr-iso2 FAILED
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: get_failcount: RTWeb has failed INFINITY times on pr-iso2
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: common_apply_stickiness: Forcing RTWeb away from pr-iso2 after 1000000 failures (max=1000000)
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: native_merge_weights: ClusterIP: Rolling back scores from RTWeb
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: native_color: Resource RTWeb cannot run anywhere
> Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: LogActions: Stop resource RTWeb        (pr-iso2)
> Jan 24 10:13:08 pr-iso2 crmd: [6130]: info: te_rsc_command: Initiating action 1: stop RTWeb_stop_0 on pr-iso2 (local)
> Jan 24 10:13:08 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=1:1:0:1f379174-a1dd-413f-8108-466a22efdbd0 op=RTWeb_stop_0 )
> Jan 24 10:13:08 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:8: stop
> Jan 24 10:13:09 pr-iso2 lrmd: [6127]: info: RA output: (RTWeb:stop:stderr) kill: 830: No such process
> Jan 24 10:13:09 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_stop_0 (call=8, rc=0, cib-update=38, confirmed=true) ok
> Jan 24 10:13:09 pr-iso2 crmd: [6130]: info: match_graph_event: Action RTWeb_stop_0 (1) confirmed on pr-iso2 (rc=0)
> Jan 24 10:14:51 pr-iso2 attrd: [6129]: WARN: attrd_cib_callback: Update 20 for fail-count-RTWeb=INFINITY failed: Remote node did not respond
> Jan 24 10:14:51 pr-iso2 attrd: [6129]: WARN: attrd_cib_callback: Update 23 for last-failure-RTWeb=1327399969 failed: Remote node did not respond
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing failed op RTWeb_start_0 on pr-iso2: unknown error (1)
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    (ocf::heartbeat:apache):        Stopped
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: get_failcount: RTWeb has failed INFINITY times on pr-iso2
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: WARN: common_apply_stickiness: Forcing RTWeb away from pr-iso2 after 1000000 failures (max=1000000)
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: native_merge_weights: ClusterIP: Rolling back scores from RTWeb
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: native_color: Resource RTWeb cannot run anywhere
> Jan 24 10:15:01 pr-iso2 pengine: [9579]: notice: LogActions: Leave resource RTWeb       (Stopped)
> Jan 24 10:26:22 pr-iso2 crmd: [9912]: info: do_lrm_rsc_op: Performing key=8:2:7:2df45cf3-886f-4815-861e-079a2542e882 op=RTWeb_monitor_0 )
> Jan 24 10:26:22 pr-iso2 lrmd: [9909]: info: rsc:RTWeb:3: probe
> Jan 24 10:26:22 pr-iso2 crmd: [9912]: info: process_lrm_event: LRM operation RTWeb_monitor_0 (call=3, rc=7, cib-update=8, confirmed=true) not running
> 
> How could we solve this problem? Thank you.
> 
> Best regards,
> Michal Vyoral
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org