[Pacemaker] R: Frequent SBD triggered server reboots

Tue May 7 04:43:55 EDT 2013

Hello Andrea

I think you need to consider what Lars told you (upgrade to SP2), or
maybe you can try using a different LUN for sbd and use ionice to
put the sbd process in the realtime I/O scheduling class.
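
A minimal sketch of the ionice idea, assuming the daemon's process name is
"sbd" and that pidof and ionice (util-linux) are installed. The function names
are made up for illustration. As far as I know the realtime class only has an
effect with an I/O scheduler that honours priorities (CFQ, for example), and
setting it needs root.

```python
import subprocess


def sbd_pids(name="sbd"):
    """Return the PIDs reported by pidof for `name` (assumed process name)."""
    result = subprocess.run(["pidof", name], capture_output=True, text=True)
    # pidof prints nothing when no process matches, so this yields [].
    return [int(pid) for pid in result.stdout.split()]


def ionice_realtime_argv(pids, level=0):
    """Build an ionice command line: class 1 is realtime, level 0..7 (0 = highest)."""
    if not pids:
        raise ValueError("no PIDs given")
    if not 0 <= level <= 7:
        raise ValueError("level must be between 0 and 7")
    return ["ionice", "-c", "1", "-n", str(level), "-p"] + [str(p) for p in pids]


def set_realtime_io(pids, level=0):
    """Run ionice on the given PIDs; raises CalledProcessError on failure."""
    subprocess.run(ionice_realtime_argv(pids, level), check=True)


# Usage (as root): set_realtime_io(sbd_pids())
```

The setting does not survive a restart of sbd, so it would have to be
reapplied whenever the cluster stack is restarted.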


2013/5/7 andrea cuozzo <andrea.cuozzo at sysma.it>

> Hi,
>
> Here are three logs from the last server watchdog-driven reboot on Friday
> evening (not that I want you to actually dig into them, it's just to update
> this thread with my new findings), with SBD watchdog timeout set to 20
> seconds.
>
> 1) sar.txt is the output of sar -d -p 2 (disk statistics, pretty printed,
> sampled every two seconds), starting right before the reboot
>
> 2) messages.txt is an extract of the server's /var/log/messages starting
> right before the reboot, with QLogic driver, SCSI layer and SBD verbose
> logging enabled
>
> 3) cpu1.txt is the output of sar -P ALL 2 (CPU statistics sampled every
> two seconds), filtered by cpu #1, starting right before the reboot
>
> sda is the local drive, sdb and sdc are the same single SAN LUN as seen by
> the two FC ports of the server, san is the LUN multipath alias, san_part1
> is the SBD partition, san_part2 is the Oracle partition.
>
> sar.txt shows that somewhere between 17:46:44 and 17:46:46 all reads and
> writes to/from the san LUN drop to zero, for both the SBD and Oracle
> partitions, right until the 17th second of the SBD countdown, at which time
> something (3.88 wr/s) seems to get written to the Oracle partition.
> %util jumps to 100%, as does %iowait (from cpu1.txt), on 3 of the 24 CPU
> cores this server has (the ones Oracle and SBD were using at the time, I
> suppose).
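
The check described in the paragraph above (how long a device shows zero I/O
compared with the SBD watchdog timeout) could be automated roughly like this.
It is only a sketch: it assumes the sar -d output has already been parsed into
(time, device, reads/s, writes/s) tuples, and the function names and sample
values are invented for illustration, not taken from sar.txt.

```python
from datetime import datetime


def longest_stall(samples, device):
    """Find the longest run of consecutive zero-I/O samples for `device`.

    `samples` is an iterable of (time "HH:MM:SS", device, reads_per_s,
    writes_per_s) tuples in chronological order. Returns (start, end) as
    datetime objects, or None if the device never went idle. Runs that
    cross midnight are not handled.
    """
    best = None
    run_start = run_end = None
    for stamp, dev, reads, writes in samples:
        if dev != device:
            continue
        when = datetime.strptime(stamp, "%H:%M:%S")
        if reads == 0 and writes == 0:
            if run_start is None:
                run_start = when
            run_end = when
            if best is None or (run_end - run_start) > (best[1] - best[0]):
                best = (run_start, run_end)
        else:
            run_start = run_end = None
    return best


def stall_seconds(stall):
    """Seconds between the first and last zero-I/O sample of a stall."""
    return (stall[1] - stall[0]).total_seconds()


# Made-up samples at a two-second interval, for illustration only.
example = [
    ("12:00:00", "san_part1", 5.0, 3.0),
    ("12:00:02", "san_part1", 0.0, 0.0),
    ("12:00:02", "san_part2", 0.0, 0.0),
    ("12:00:04", "san_part1", 0.0, 0.0),
    ("12:00:06", "san_part1", 0.0, 0.0),
    ("12:00:08", "san_part1", 0.0, 3.5),
]
stall = longest_stall(example, "san_part1")
```

Since sar only samples every two seconds, the real stall can be almost two
sampling intervals longer than the value computed here, so it makes sense to
leave some margin when comparing it with the 20-second watchdog timeout.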
>
> messages.txt shows this QLogic driver message at 17:46:44, which is
> different from the rest of the QLogic messages:
>
> May  3 17:46:44 server1 kernel: [66588.156113] qla2xxx [0000:11:00.1]-5816:2: Discard RND Frame -- 1006 02c1 0000.
>
> Since I started facing these problems, I have collected gigabytes of
> /var/log/messages from these servers. The QLogic driver writes an
> occasional "dropped frame(s) detected" message during normal server
> operations, but it never writes this "Discard RND Frame" message unless
> an unwanted reboot follows right after. No SCSI-layer read or write
> communication on sdb and sdc gets recorded by the kernel afterwards, except
> for a couple of "device ready" commands. All this information has already
> been shared with the SAN department.
>
> Yesterday the SAN department made a parameter configuration change on
> the two Brocade switches (multipath worked smoothly on the servers,
> switching paths back and forth as the respective switches were restarted).
> I hope this fixes the problem; otherwise we might investigate the switch
> port configuration change described in the following link, as it seems to
> apply to our current configuration (8Gb FC, Brocade switches, lots of
> er_bad_os port errors, fill word port mode currently set to 1, and random
> server problems):
>
>
> http://loopbackconnector.com/2013/02/14/brocade-8-gb-how-to-talk-when-idle-portcfgfillword/
>
> andrea
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 3 May 2013 10:17:12 +0200
> From: Lars Marowsky-Bree <lmb at suse.com>
> To: The Pacemaker cluster resource manager
>         <pacemaker at oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Frequent SBD triggered server reboots
> Message-ID: <20130503081712.GE3705 at suse.de>
> Content-Type: text/plain; charset=iso-8859-1
>
> On 2013-05-03T02:49:54, andrea cuozzo <andrea.cuozzo at sysma.it> wrote:
>
> > Unfortunately the OS and SP version for the Oracle project these
> > clusters belong to were decided several layers over my head. I'll make
> > a point of pushing for the upgrade to SP2 anyway, I might get lucky.
>
> Good luck with that!
>
> > the SAN department investigate their side of the problem, I'll take a
> > look at trying a different stonith resource; all servers involved
> > have some kind of IBM management console. Thanks for your answers to
> > my questions and for your time, very much appreciated.
>
> You're missing out on many further fixes since SP1 went out of support.
> Not just to sbd, but everything, from kernel to pacemaker to glibc and
> back.
>
> Since support is obviously irrelevant to your management, you could
> consider recompiling sbd from source if you were so inclined, though.
>
>
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their
> mistakes." -- Oscar Wilde


-- 
this is my life and I live it for as long as God wills