[Pacemaker] Failed start of a resource after a Debian upgrade

Wed Jan 25 11:53:26 EST 2012

On Wed, Jan 25, 2012 at 04:35:39PM +0100, Michal Vyoral wrote:
> Hi Dejan,
> 
> On Tue, Jan 24, 2012 at 11:52:20PM +0100, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Tue, Jan 24, 2012 at 06:31:54PM +0100, Michal Vyoral wrote:
> > > Hello,
> > > we had a cluster of two nodes both running Debian 5.0, each with two resources,
> > > IPaddr2 and apache, managed by pacemaker 1.0.9.1. After upgrading
> > > one node from Debian 5.0 to 6.0, the apache resource fails to start
> > > on the upgraded node. Here are the details:
> > > 
> > > Versions of heartbeat and pacemaker before the upgrade:
> > > pr-iso1:~# dpkg -l pacemaker heartbeat
> > > Desired=Unknown/Install/Remove/Purge/Hold
> > > | Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
> > > |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
> > > ||/ Name           Version        Description
> > > +++-==============-==============-============================================
> > > ii  heartbeat      1:3.0.3-2~bpo5 Subsystem for High-Availability Linux
> > > ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager
> > > 
> > > Versions of heartbeat and pacemaker after the upgrade:
> > > pr-iso2:~# dpkg -l pacemaker heartbeat
> > > Desired=Unknown/Install/Remove/Purge/Hold
> > > | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
> > > |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
> > > ||/ Name           Version        Description
> > > +++-==============-==============-============================================
> > > ii  heartbeat      1:3.0.3-2      Subsystem for High-Availability Linux
> > > ii  pacemaker      1.0.9.1+hg1562 HA cluster resource manager
> > > 
> > > Status of the resources on the upgraded node:
> > > pr-iso2:~# crm_mon
> > > ============
> > > Last updated: Tue Jan 24 10:14:12 2012
> > > Stack: Heartbeat
> > > Current DC: pr-iso2 (511079a9-0f71-4537-bdf9-07714b454441) - partition with quorum
> > > Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> > > 2 Nodes configured, unknown expected votes
> > > 2 Resources configured.
> > > ============
> > > 
> > > Online: [ pr-iso2 ]
> > > OFFLINE: [ pr-iso1 ]
> > > 
> > > ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso2
> > > 
> > > Failed actions:
> > >     RTWeb_start_0 (node=pr-iso2, call=7, rc=1, status=complete): unknown error
> > > 
> > > Status of the resources on the non-upgraded node:
> > > pr-iso1:~# crm_mon
> > > ============
> > > Last updated: Tue Jan 24 17:08:22 2012
> > > Stack: Heartbeat
> > > Current DC: pr-iso1 (014268aa-f234-4789-b4a1-0053cf4e61b9) - partition with quorum
> > > Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> > > 2 Nodes configured, unknown expected votes
> > > 2 Resources configured.
> > > ============
> > > 
> > > Online: [ pr-iso1 pr-iso2 ]
> > > 
> > > ClusterIP       (ocf::heartbeat:IPaddr2):       Started pr-iso1
> > > RTWeb   (ocf::heartbeat:apache):        Started pr-iso1
> > > 
> > > Configuration of the resources:
> > > pr-iso1:~# crm configure show
> > > node $id="014268aa-f234-4789-b4a1-0053cf4e61b9" pr-iso1
> > > node $id="511079a9-0f71-4537-bdf9-07714b454441" pr-iso2
> > > primitive ClusterIP ocf:heartbeat:IPaddr2 \
> > >         params ip="10.5.75.83" cidr_netmask="24" \
> > >         op monitor interval="30s"
> > > primitive RTWeb ocf:heartbeat:apache \
> > >         params configfile="/etc/apache2/apache2.conf" \
> > >         op monitor interval="1min" \
> > >         meta target-role="Started" is-managed="true"
> > > colocation website-with-ip inf: RTWeb ClusterIP
> > > order rtweb_after_clustrip inf: ClusterIP RTWeb
> > > property $id="cib-bootstrap-options" \
> > >         dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
> > >         cluster-infrastructure="Heartbeat" \
> > >         stonith-enabled="false" \
> > >         last-lrm-refresh="1327399494"
> > > rsc_defaults $id="rsc-options" \
> > >         resource-stickiness="100"
> > > 
> > > Records in the /var/log/ha-log related to RTWeb resource:
> > > pr-iso2:~# grep RTWeb /var/log/ha-log
> > > Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=7:76:7:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_monitor_0 )
> > > Jan 24 10:04:56 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:4: probe
> > > Jan 24 10:04:56 pr-iso2 crmd: [6130]: info: process_lrm_event: LRM operation RTWeb_monitor_0 (call=4, rc=7, cib-update=13, confirmed=true) not running
> > > Jan 24 10:12:48 pr-iso2 crmd: [6130]: info: do_lrm_rsc_op: Performing key=11:77:0:41cbad9d-9090-4aba-bd6a-bf171077c74b op=RTWeb_start_0 )
> > > Jan 24 10:12:48 pr-iso2 lrmd: [6127]: info: rsc:RTWeb:7: start
> > 
> > After this message there should be a bit more (look for "apache"
> > or "lrmd"). The next release of the resource agents will log the
> > resource name too (RTWeb in this case). If you cannot find anything
> > here, then the answer must be in the apache logs.
> > 
> > Thanks,
> > 
> > Dejan
> 
> Yes, you are right: here are two more lines after the previous line:
> 
>   apache[9454]:   2012/01/24_10:12:49 INFO: apache not running
>   apache[9454]:   2012/01/24_10:12:49 INFO: waiting for apache /etc/apache2/apache2.conf to come up

That's all?

> There are no records in /var/log/apache2/error.log that give any clue, see:
> 
>   pr-iso2:/var/log/apache2# cat error.log
>   [Tue Jan 24 11:12:50 2012] [notice] Apache/2.2.16 (Debian) PHP/5.3.3-7+squeeze3 with Suhosin-Patch mod_perl/2.0.4 Perl/v5.10.1 configured -- resuming normal operations
>   [Tue Jan 24 11:13:08 2012] [notice] caught SIGTERM, shutting down
>   [Wed Jan 25 13:09:02 2012] [notice] Apache/2.2.16 (Debian) PHP/5.3.3-7+squeeze3 with Suhosin-Patch mod_perl/2.0.4 Perl/v5.10.1 configured -- resuming normal operations
>   [Wed Jan 25 13:09:21 2012] [notice] caught SIGTERM, shutting down
> 
> Note one interesting thing: our nodes should use UTC, but after the upgrade
> we noticed that the time on the upgraded node was our local time (= UTC + 1).
> I have set the system time back to UTC, but Apache still uses local time in the log.
> 
> We tried to start Apache on the upgraded node on its own:
> 
> 1. we modified the file /etc/apache2/ports2.conf so that
> Apache listens on the physical address
> 2. we ran the command '/etc/init.d/apache2 start'
> 3. we downloaded an index.html page
> 
> Here is the record in the error log:
> 
>  [Wed Jan 25 13:28:11 2012] [notice] Apache/2.2.16 (Debian) PHP/5.3.3-7+squeeze3 with Suhosin-Patch mod_perl/2.0.4 Perl/v5.10.1 configured -- resuming normal operations
>  [Wed Jan 25 13:28:27 2012] [warn] [client 10.5.77.29] incomplete redirection target of '/rt/' for URI '/' modified to 'http://10.5.75.82/rt/'
> 
> So Apache can run on its own.
> 
> Before the upgrade we made some minor changes to apache2.conf
> on the active node, but not on the passive node. We have reverted
> the changes, but the resource still fails; see the tail of the ha-log
> on the upgraded node:

[...]
> Jan 25 14:04:18 pr-iso2 pengine: [16392]: info: get_failcount: RTWeb has failed INFINITY times on pr-iso2

You need to clean up the resource: crm resource cleanup RTWeb

Otherwise, I really cannot say what's wrong with your apache, but
it's definitely resource specific. You can leave out the cluster
and try to resolve the issue using ocf-tester. Also, it is
necessary that the apache status module is enabled.
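
The steps above could be sketched roughly as follows. This is only an
illustration under some assumptions: the agent path is the usual OCF
location (/usr/lib/ocf/resource.d/heartbeat/apache), a2enmod is the
Debian helper for enabling Apache modules, and the run() wrapper is a
hypothetical helper that only prints the commands unless RUN=1 is set.
Apache should be stopped before ocf-tester is run, since the tester
starts and stops the resource itself.

```shell
#!/bin/sh
# Sketch: exercise the apache resource agent outside the cluster,
# then clear the failcount. Paths below are assumptions, adjust as needed.

RSC=RTWeb
CONF=/etc/apache2/apache2.conf
AGENT=${OCF_ROOT:-/usr/lib/ocf}/resource.d/heartbeat/apache

# Hypothetical helper: print each command by default (dry run);
# execute it only when RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Enable the Apache status module (Debian helper).
run a2enmod status

# Test the agent directly, passing the same configfile parameter
# that the RTWeb primitive uses.
run ocf-tester -n "$RSC" -o configfile="$CONF" "$AGENT"

# Once the agent passes, clear the failed start so the cluster retries.
run crm resource cleanup "$RSC"
```

Running the script without RUN=1 just lists the three commands, so they
can be reviewed before anything is changed on the node.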

Thanks,

Dejan