[Pacemaker] Please Help - frequent cleanup is required for the resources on failover condition

Dejan Muhamedagic dejanmm at fastmail.fm
Mon Aug 17 07:35:19 EDT 2009


Hi,

On Mon, Aug 17, 2009 at 07:38:39AM +0530, Abhin GS wrote:
> Hello,
> 
> Node1 was choked due to a big "messages" file. We fixed that problem
> on node1, then ran an update for SLES11; all the required patches
> were installed properly (service openais stop was done before
> patching). We purposely switched off node2 during this exercise to
> avoid any complications.
> 
> After the update, we brought the system back online (node2 was still
> kept off) and saw that the machine was refusing to function. Analysis
> found that the update had changed the contents of /etc/hosts: the
> node1 entry had been removed from node1's hosts file for some reason.
> Pacemaker showed all services as down even after that fix. A couple
> of reboots did not help. I have attached a forensics report (cib and
> messages) of node1 in node1.tar.

Everybody'd be better off using hb_report :)
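
For instance (a minimal sketch; the time window and destination below
are only placeholders, adjust them to the incident you want captured):

  # collect logs, the CIB and PE inputs from both nodes for the
  # period around the failed fencing
  hb_report -f "2009/08/16 15:00" -t "2009/08/16 17:00" /tmp/hbr-20090816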

> After our enthusiasm levels went down, we switched off node1 and
> brought node2 online. We were happy to see things working well on it
> (we had adjusted the timings - no cleanup was required - though we
> tested this for only one reboot), except for the message that node1
> is offline. We copied node2's cib using cibadmin -Q, switched it off
> and switched on node1 for cib injection.
> 
> On node1 we cleared the Pacemaker config using cibadmin -E --force,
> then injected the cib (after increasing the epoch values) using
> cibadmin -U -x cib.xml. A service openais restart revealed the
> wonderful fact that node1 was still behaving the same way: no green
> signal except node1 as DC.
> 
> Heartbroken, we did a forensic evidence collection, switched off
> node1 and brought node2 online for further study of its remaining
> files. Voila - node2 came up showing all red: no services running,
> and the only green I could see was node2 as DC. Anyway, the forensics
> were done. Files are attached herewith for your kind perusal.
> 
> Severely broken, there was no more energy left in us for this 5-week
> effort to bring up an HA cluster that will run postgres and apache on
> a virtual IP. We decided to switch off the MSA array after switching
> off node2 (we had lost hope in node1 earlier).

Your fencing (stonith) doesn't work:

Aug 16 16:14:45 node1 stonithd: [3870]: ERROR: Failed to STONITH 
the node node2: optype=POWEROFF, op_result=TIMEOUT

You'll find a bunch of similar messages. The cluster won't make
any progress if it has to fence a node but can't.

The timeout for the stonith resources (st1/st2) is set to a very low 5
seconds. Make it at least 1 minute.
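
With the crm shell that could look roughly like this (a sketch only:
st1/st2 keep whatever agent and parameters they already have, only the
operation timeouts are raised, and the property values are suggestions):

  # open each stonith primitive and raise its op timeouts, e.g. to
  #   op start timeout="60s" \
  #   op monitor interval="3600s" timeout="60s"
  crm configure edit st1
  crm configure edit st2

  # give the cluster itself enough time for fencing to complete
  crm configure property stonith-timeout=60s

  # and, as Andrew suggested in the quoted thread below, raise the
  # default operation timeout well above 20s
  crm configure property default-action-timeout=120s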

Thanks,

Dejan

> Ten minutes later the MSA was brought online, then node2, then
> node1. Node2 became DC and all is green. I really did not understand
> what went wrong, when, and where. I tried to look in the log but was
> not able to understand anything (lack of confidence after multiple
> failures).
> 
> One observation, which could be right or wrong: node1 will fail to
> function properly if node2 is not available, and vice versa. Node1
> now has the latest patches, but node2 is still untouched; we didn't
> have the heart to run the update on node2 after experiencing the
> node1 affair.
> 
> Please throw some light on our mystery HA project.
> 
> Thank you in advance.
> 
> Take care, 
> 
> Abhin 
> 
> 
> On Thu, 2009-08-13 at 14:16 +0200, Andrew Beekhof wrote: 
> > First thing I'd do is fix this:
> > 
> > Aug  8 13:47:13 node1 cib: [3894]: ERROR: write_xml_file: Cannot write
> > output to /var/lib/heartbeat/crm/cib.XLiyUG: No space left on device
> > (28)
> > 
> > then I'd increase the timeouts:
> > 
> > Aug  8 13:39:42 node2 crmd: [3803]: ERROR: process_lrm_event: LRM
> > operation fs:1_stop_0 (18) Timed Out (timeout=20000ms)
> > Aug  8 13:45:16 node2 crmd: [3692]: ERROR: process_lrm_event: LRM
> > operation postgres_start_0 (15) Timed Out (timeout=20000ms)
> > Aug  8 13:48:57 node2 crmd: [3692]: ERROR: process_lrm_event: LRM
> > operation fs:0_stop_0 (23) Timed Out (timeout=20000ms)
> > Aug  8 13:53:06 node2 crmd: [3765]: ERROR: process_lrm_event: LRM
> > operation postgres_start_0 (14) Timed Out (timeout=20000ms)
> > 
> > Try setting default-action-timeout to something higher than 20s
> > 
> > On Wed, Aug 12, 2009 at 11:54 AM, Abhin.G.S - DEUCN<deucn at inmail.sk> wrote:
> > >
> > > Hello Andrew,
> > >
> > > On behalf of Ajith, i'm sending you the details.
> > >
> > > /var/log/messages of node2 (truncated) = http://deucn.com/messages_new
> > >
> > > Attachments :
> > >
> > > 1> CIB.xml
> > >
> > > 2> extract of /var/log/messages of node1
> > >
> > > 3> complete /var/log/messages of node2 in zip format
> > >
> > > Please help us.
> > >
> > > Thank you,
> > >
> > > Warm Regards
> > >
> > > Abhin.G.S
> > > ---- Original message ----
> > > From: Andrew Beekhof <andrew at beekhof.net>
> > > To: pacemaker at oss.clusterlabs.org
> > > Date: 8/12/2009 12:49:00 PM
> > > Subject: Re: [Pacemaker] Please Help - frequent cleanup is required for the
> > > resources on failover condition
> > >
> > > On Sun, Aug 9, 2009 at 4:41 PM, Ajith Kumar<ajith.kgs.hk at gmail.com> wrote:
> > >> Hello Everyone,
> > >>
> > >> I was working on a project to create a test cluster using Pacemaker
> > >> on SUSE 11. With the kind help of lmb and beekhof @ #linux-cluster
> > >> I was finally able to put up a two-node cluster using HP ML350 G5
> > >> servers, each with two HBAs connected to an MSA2012fcdc.
> > >>
> > >> The cluster resources, both apache2 and postgresql, require a
> > >> cleanup every time I boot the cluster (this is a test cluster,
> > >> switched off at the end of the day, or when I see the level of
> > >> madness in me cross the barrier), on a simulated failover (by
> > >> putting the other node into standby), or when I pull the NIC cable
> > >> of one node. The IP address and stonith were working fine as
> > >> planned, but the big boys - apache2 and postgresql - are having
> > >> trouble, and I always have to run a cleanup.
> > >>
> > >> I would like to attach the log file (/var/log/messages), but it
> > >> is 3.2GB in size
> > >
> > > Limit the contents to just one instance of the problem and use bzip2.
> > >
> > >> and has a lot of repeated entries, which I did not find relevant.
> > >
> > > Actually, it's the only thing that is relevant.
> > >
> > > _______________________________________________
> > > Pacemaker mailing list
> > > Pacemaker at oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > ------------------------------------
> > > Abhin.G.S
> > > =========
> > > +91-9895-525880 | +91-471-2437189
> > > D E U C N ® | http://www.deucn.com
> > > ------------------------------------
> > >
> 

> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker




