[Pacemaker] Strange failover behaviour SOLVED

Wed Nov 30 13:46:54 UTC 2011

Hi Andreas,

Thank you for your answer.

I didn't know about the stability issues with DRBD 8.4.
The reason I compiled everything myself is that the versions shipped with CentOS 6
had some problems as well, such as heartbeat processes taking up 100% CPU. I don't have this now.

I did try switching to the ocf way of stopping/starting httpd, and this seems to work perfectly.
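
For the archive, a minimal sketch of what such an OCF-based primitive might look like in crm syntax. The values are assumptions, not a copy of the working configuration: configfile is the stock CentOS path, statusurl is based on the server-status URL from the original post, and the op timeouts are simply carried over from the lsb:httpd primitive quoted below.

```
# Assumed values: configfile and statusurl are illustrative only
primitive apache2 ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
               statusurl="http://localhost/server-status" \
        op monitor interval="10" timeout="30" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120"
```

As far as I know, the agent's monitor action fetches the status URL rather than only looking at the pid, so it should report success only once httpd actually answers.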

Thanks again, another problem solved.

Hans

-----Original Message-----
From: Andreas Kurz [mailto:andreas at hastexo.com] 
Sent: Wednesday, November 30, 2011 00:12
To: pacemaker at oss.clusterlabs.org
Subject: Re: [Pacemaker] Strange failover behaviour

On 11/29/2011 07:14 PM, Hans Lammerts wrote:
> Hi there,
>
> I have something strange I would like the community to give its
> opinion on. I can’t figure out what is going wrong.
>
> I have a 2 node cluster (nodes named cl1 and cl2). On this cluster I’m
> running MySQL, Apache, and Zarafa. Both nodes run CentOS 6.
>
> I have downloaded all latest sources for DRBD, Cluster Glue, Resource
> Agents, Heartbeat and Pacemaker and compiled them. Everything seems
> to be OK.

BTW ... no need to compile Pacemaker/Glue/Agents ... they are shipped with CentOS 6 ... and use DRBD 8.4.0 only for test setups; there are some known stability issues.

>
> I believe my Pacemaker setup to be OK, but I may be mistaken. Will
> attach the config below.
>
> What I experience when I do a failover from cl1 to cl2 is that MySQL
> and Zarafa fail over without any problems, but httpd seems to get
> into a loop of starting and stopping.
>
> The error that is displayed is this:
>
> apache2_monitor_10000 (node=cl2, call=502, rc=7, status=complete): not running
>

the cluster and apache logs should give you good hints on the problem ...

>
> If I remember to set the failcount of the apache2 resource to 0, httpd
> will eventually start after quite a number of retries:
>
> [root@cl2 httpd]# crm resource failcount apache2 show cl2
> scope=status  name=fail-count-apache2 value=69
>
> If I forget to reset the failcount (something you should not need to
> do), the failcount will reach infinity at some time in the future, and
> httpd won’t start. The number of times Pacemaker retries is also
> different every time.
>
> Wait, it gets stranger…
>
> Putting cl1 online again, the failback is initiated, and this goes
> without any problems. So, it looks like the problems reside only on the
> second cluster half. The hardware of cl2 is different from cl1, and
> it is the slower machine of the two.
>
> Yes, I made very sure every configuration file is the same on both nodes.
>
> And yes, I made sure the server-status section in httpd.conf is
> uncommented, as is the ExtendedStatus directive. Doing a wget -O -
> http://localhost/server-status?auto works

you are using the lsb script ... this does a simple pid check, at least on the SL 6.1 test machines in my lab. Have you tried the ocf RA?
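
One way to see what the init script reports is to run its status action by hand and look at the exit code. A small sketch, assuming the usual /etc/init.d/httpd path; the helper names are made up for illustration, and the code meanings are those the LSB spec defines for the status action.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of Pacemaker
# or of the httpd init script).

# Translate an LSB "status" exit code into its meaning per the LSB spec.
lsb_status_meaning() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, pid file exists" ;;
        2) echo "dead, lock file exists" ;;
        3) echo "not running" ;;
        *) echo "unknown" ;;
    esac
}

# Run "<script> status" and print the exit code with its meaning.
check_lsb_status() {
    rc=0
    "$1" status >/dev/null 2>&1 || rc=$?
    echo "rc=$rc ($(lsb_status_meaning "$rc"))"
}

# Usage on a cluster node (path assumed):
#   check_lsb_status /etc/init.d/httpd
```

If the script only checks the pid, a 0 here says nothing about whether httpd actually answers requests, so it is worth comparing the result with the wget test while the resource is starting.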

> perfectly.
>
> Can anyone please tell me what the problem could be here?

Dig through your logs ... or hire someone to do it for you ;-)

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

>
> Thanks.
>
> Versioninfo:
>
> CentOS 6.0
> DRBD 8.4.0
> Glue 1.0.8
> Resource agents 3.9.2
> Heartbeat 3.0.5
> Pacemaker 1.0.11
>
> Pacemaker config:
>
> node $id="62b94e0a-532f-4f99-acdb-57d6052a5635" cl1 \
>         attributes standby="on"
> node $id="7444dfb4-2c9b-4130-83c4-c0cd3d7ec006" cl2 \
>         attributes standby="off"
> primitive apache2 lsb:httpd \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120" \
>         meta target-role="Started"
> primitive drbd_http ocf:linbit:drbd \
>         params drbd_resource="http" \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="100" \
>         op monitor interval="59s" role="Master" timeout="30s" \
>         op monitor interval="60s" role="Slave" timeout="30s"
> primitive drbd_mysql ocf:linbit:drbd \
>         params drbd_resource="mysql" \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="100" \
>         op monitor interval="59s" role="Master" timeout="30s" \
>         op monitor interval="60s" role="Slave" timeout="30s"
> primitive drbd_zarafa ocf:linbit:drbd \
>         params drbd_resource="zarafa" \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="100" \
>         op monitor interval="59s" role="Master" timeout="30s" \
>         op monitor interval="60s" role="Slave" timeout="30s"
> primitive http_fs ocf:heartbeat:Filesystem \
>         params device="/dev/drbd1" directory="/var/www/html" fstype="ext4" options="noatime" \
>         op monitor interval="30s"
> primitive http_ip ocf:heartbeat:IPaddr2 \
>         params ip="192.168.2.50" cidr_netmask="24" nic="eth0" \
>         op monitor interval="30s"
> primitive mysql_fs ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext4" options="noatime" \
>         op monitor interval="30s"
> primitive mysql_ip ocf:heartbeat:IPaddr2 \
>         params ip="192.168.2.30" cidr_netmask="24" nic="eth0" \
>         op monitor interval="30s"
> primitive mysqld lsb:mysqld \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-dagent lsb:zarafa-dagent \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-gateway lsb:zarafa-gateway \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-ical lsb:zarafa-ical \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-licensed lsb:zarafa-licensed \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-monitor lsb:zarafa-monitor \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-server lsb:zarafa-server \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa-spooler lsb:zarafa-spooler \
>         op monitor interval="10" timeout="30" \
>         op start interval="0" timeout="120" \
>         op stop interval="0" timeout="120"
> primitive zarafa_fs ocf:heartbeat:Filesystem \
>         params device="/dev/drbd2" directory="/var/lib/zarafa" fstype="ext4" options="noatime" \
>         op monitor interval="30s"
> primitive zarafa_ip ocf:heartbeat:IPaddr2 \
>         params ip="192.168.2.40" cidr_netmask="24" nic="eth0" \
>         op monitor interval="30s"
> group HTTP http_fs http_ip apache2 \
>         meta target-role="Started"
> group MYSQL mysql_fs mysql_ip mysqld \
>         meta target-role="Started"
> group ZARAFA zarafa_fs zarafa_ip zarafa-server zarafa-spooler zarafa-dagent zarafa-licensed zarafa-monitor zarafa-gateway zarafa-ical \
>         meta target-role="Started"
> ms ms_drbd_http drbd_http \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> ms ms_drbd_mysql drbd_mysql \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> ms ms_drbd_zarafa drbd_zarafa \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> location cli-prefer-HTTP HTTP \
>         rule $id="cli-prefer-rule-HTTP" inf: #uname eq cl1
> location cli-prefer-MYSQL MYSQL \
>         rule $id="cli-prefer-rule-MYSQL" inf: #uname eq cl1
> location cli-prefer-ZARAFA ZARAFA \
>         rule $id="cli-prefer-rule-ZARAFA" inf: #uname eq cl1
> colocation http_on_drbd inf: HTTP ms_drbd_http:Master
> colocation mysql_on_drbd inf: MYSQL ms_drbd_mysql:Master
> colocation zarafa_on_drbd inf: ZARAFA ms_drbd_zarafa:Master
> order http_after_drbd inf: ms_drbd_http:promote HTTP:start
> order mysql_after_drbd inf: ms_drbd_mysql:promote MYSQL:start
> order zarafa_after_drbd inf: ms_drbd_zarafa:promote ZARAFA:start
> order zarafa_after_mysql inf: MYSQL:start ZARAFA:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.11-9af47ddebcad19e35a61b2a20301dc038018e8e8" \
>         cluster-infrastructure="Heartbeat" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
