[Pacemaker] Failed start of a resource after a Debian upgrading

Michal Vyoral michal.vyoral at chmi.cz
Tue Jan 24 12:31:54 EST 2012


Hello,
we had a cluster of two nodes both running Debian 5.0, each with two resources
IPaddr2 and apache managed by pacemaker 1.0.9.1. After an upgrading of
one node from Debian 5.0 to 6.0 we have a problem to start the
apache resource on the upgraded node. Here are the details:

Versions of heartbeat and pacemaker before the upgrade:
pr-iso1:~# dpkg -l pacemaker heartbeat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name           Version        Description
+++-==============-==============-============================================
ii  heartbeat      1:3.0.3-2~bpo5 Subsystem for High-Availability Linux
ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager

Versions of heartbeat and pacemaker after the upgrade:
pr-iso2:~# dpkg -l pacemaker heartbeat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version        Description
+++-==============-==============-============================================
ii  heartbeat      1:3.0.3-2      Subsystem for High-Availability Linux
ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager

Status of the resources on the upgraded node:
pr-iso2:~# crm_mon
============
Last updated: Tue Jan 24 10:14:12 2012
Stack: Heartbeat
Current DC: pr-iso2 (511079a9-0f71-4537-bdf9-07714b454441) - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pr-iso2 ]
OFFLINE: [ pr-iso1 ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso2

Failed actions:
    RTWeb_start_0 (node=pr-iso2, call=7, rc=1, status=complete): unknown error

Status of the resources on the non upgraded node:
pr-iso1:~# crm_mon
============
Last updated: Tue Jan 24 17:08:22 2012
Stack: Heartbeat
Current DC: pr-iso1 (014268aa-f234-4789-b4a1-0053cf4e61b9) - partition with quor
um
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ pr-iso1 pr-iso2 ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso1
RTWeb   (ocf::heartbeat:apache):        Started pr-iso1

Configuration of the resources:
pr-iso1:~# crm configure show
node $id="014268aa-f234-4789-b4a1-0053cf4e61b9" pr-iso1
node $id="511079a9-0f71-4537-bdf9-07714b454441" pr-iso2
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.5.75.83" cidr_netmask="24" \
        op monitor interval="30s"
primitive RTWeb ocf:heartbeat:apache \
        params configfile="/etc/apache2/apache2.conf" \
        op monitor interval="1min" \
        meta target-role="Started" is-managed="true"
colocation website-with-ip inf: RTWeb ClusterIP
order rtweb_after_clustrip inf: ClusterIP RTWeb
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        last-lrm-refresh="1327399494"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Records in the /var/log/ha-log related to RTWeb resource:
pr-iso2:~# grep RTWeb /var/log/ha-log
Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=7:76:7:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_monitor_0 )
Jan 24 10:04:56 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:4: probe
Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_monitor_0 (call=4, rc=7, cib-update=13, confirmed=true) not running
Jan 24 10:12:48 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=11:77:0:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_start_0 )
Jan 24 10:12:48 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:7: start
Jan 24 10:12:50 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_start_0 (call=7, rc=1, cib-update=16, confirmed=true) unknown error
Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-RTWeb (INFINITY)
Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 20: fail-count-RTWeb=INFINITY
Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: find_hash_entry: Creating hash entry for last-failure-RTWeb
Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-RTWeb (1327399969)
Jan 24 10:12:51 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 23: last-failure-RTWeb=1327399969
Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-RTWeb (INFINITY)
Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 27: fail-count-RTWeb=INFINITY
Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-RTWeb (1327399969)
Jan 24 10:13:08 pr-iso2 attrd: [6129]: info: attrd_perform_update: Sent update 30: last-failure-RTWeb=1327399969
Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing failed op RTWeb_start_0 on pr-iso2: unknown error (1)
Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    (ocf::heartbeat:apache):        Started pr-iso2 FAILED
Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: RecurringOp:  Start recurring monitor (60s) for RTWeb on pr-iso2
Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: LogActions: Recover resource RTWeb     (Started pr-iso2)
Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing failed op RTWeb_start_0 on pr-iso2: unknown error (1)
Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    (ocf::heartbeat:apache):        Started pr-iso2 FAILED
Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: get_failcount: RTWeb has failed INFINITY times on pr-iso2
Jan 24 10:13:08 pr-iso2 pengine: [9579]: WARN: common_apply_stickiness: Forcing RTWeb away from pr-iso2 after 1000000 failures (max=1000000)
Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: native_merge_weights: ClusterIP: Rolling back scores from RTWeb
Jan 24 10:13:08 pr-iso2 pengine: [9579]: info: native_color: Resource RTWeb cannot run anywhere
Jan 24 10:13:08 pr-iso2 pengine: [9579]: notice: LogActions: Stop resource RTWeb        (pr-iso2)
Jan 24 10:13:08 pr-iso2 crmd: [6130]: info: te_rsc_command: Initiating action 1: stop RTWeb_stop_0 on pr-iso2 (local)
Jan 24 10:13:08 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=1:1:0:1f379174-a1dd-413f-8108-466a22efdbd0 op=RTWeb_stop_0 )
Jan 24 10:13:08 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:8: stop
Jan 24 10:13:09 pr-iso2 lrmd: [6127]: info: RA output: (RTWeb:stop:stderr) kill: 830: No such process
Jan 24 10:13:09 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_stop_0 (call=8, rc=0, cib-update=38, confirmed=true) ok
Jan 24 10:13:09 pr-iso2 crmd: [6130]: info: match_graph_event: Action RTWeb_stop_0 (1) confirmed on pr-iso2 (rc=0)
Jan 24 10:14:51 pr-iso2 attrd: [6129]: WARN: attrd_cib_callback: Update 20 for fail-count-RTWeb=INFINITY failed: Remote node did not respond
Jan 24 10:14:51 pr-iso2 attrd: [6129]: WARN: attrd_cib_callback: Update 23 for last-failure-RTWeb=1327399969 failed: Remote node did not respond
Jan 24 10:15:01 pr-iso2 pengine: [9579]: WARN: unpack_rsc_op: Processing failed op RTWeb_start_0 on pr-iso2: unknown error (1)
Jan 24 10:15:01 pr-iso2 pengine: [9579]: notice: native_print: RTWeb    (ocf::heartbeat:apache):        Stopped
Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: get_failcount: RTWeb has failed INFINITY times on pr-iso2
Jan 24 10:15:01 pr-iso2 pengine: [9579]: WARN: common_apply_stickiness: Forcing RTWeb away from pr-iso2 after 1000000 failures (max=1000000)
Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: native_merge_weights: ClusterIP: Rolling back scores from RTWeb
Jan 24 10:15:01 pr-iso2 pengine: [9579]: info: native_color: Resource RTWeb cannot run anywhere
Jan 24 10:15:01 pr-iso2 pengine: [9579]: notice: LogActions: Leave resource RTWeb       (Stopped)
Jan 24 10:26:22 pr-iso2 crmd: [9912]: info: do_lrm_rsc_op: Performing key=8:2:7:2df45cf3-886f-4815-861e-079a2542e882 op=RTWeb_monitor_0 )
Jan 24 10:26:22 pr-iso2 lrmd: [9909]: info: rsc:RTWeb:3: probe
Jan 24 10:26:22 pr-iso2 crmd: [9912]: info: process_lrm_event: LRM operation RTWeb_monitor_0 (call=3, rc=7, cib-update=8, confirmed=true) not running

How could we solve this problem? Thank you.

Best regards,
Michal Vyoral





More information about the Pacemaker mailing list