[Pacemaker] Bug? Resources running with realtime priority - possibly causing monitor timeouts

Tue Oct 1 13:22:12 EDT 2013

Hi,

On Tue, Oct 01, 2013 at 11:07:35AM +0200, Joschi Brauchle wrote:
> Hello everyone,
> 
> on two (recently upgraded) SLES11SP3 machines, we are running an
> active/passive NFS fileserver and several other high availability
> services using corosync + pacemaker (see version numbers below).
> 
> We are having severe problems with resource monitors timing out
> during our system backup at night, where the active machine is under
> high IO load. These problems did not exist under SLES11SP1, from
> which we just upgraded some days ago.
> 
> 
> After some diagnosis, it turns out that actually all cluster
> resources which are started by pacemaker are running with realtime
> priority, which includes our backup service. This seems not to be
> correct!
> 
> 
> See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":
> ------------
>  RR      1  41 corosync
>  RR      1  41  \_ cib
>  RR      1  41  \_ stonithd
>  RR      1  41  \_ lrmd
>  RR      1  41  \_ attrd
>  RR      1  41  \_ pengine
>  RR      1  41  \_ crmd
>  RR      1  41  \_ mgmtd
>  RR      1  41 krb5kdc
>  RR      1  41 slapd
>  RR      1  41 cupsd
>  RR      1  41 rpc.svcgssd
>  RR      1  41 rpc.gssd
>  RR      1  41 rpc.idmapd
>  RR      1  41 rpc.mountd
>  RR      1  41 rpc.statd
>  RR      1  41 rpc.rquotad
>  RR      1  41 httpd2-prefork
>  RR      1  41  \_ httpd2-prefork
>  RR      1  41  \_ httpd2-prefork
>  RR      1  41  \_ httpd2-prefork
>  RR      1  41  \_ httpd2-prefork
>  RR      1  41  \_ httpd2-prefork
>  RR      1  41  \_ httpd2-prefork
>  RR      1  41 dsmcad
> ------------
> Clearly, corosync itself **plus all cluster services** (like cups,
> slapd, httpd2) are running with realtime priority (process class
> being "RR").

Oops. Looks like neither corosync nor lrmd reset the priority and
scheduler for their children.
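
For illustration, here is a minimal sketch of the kind of reset a realtime parent would normally do between fork and exec. It is in Python for brevity, whereas corosync and lrmd are written in C. The function names are hypothetical and this is not the actual fix, only an outline of the idea under the assumption of a Linux system:

```python
import os


def reset_to_default_scheduling(pid=0):
    """Put `pid` (0 = the calling process) back on the default policy.

    Hypothetical helper: SCHED_OTHER is the normal time-sharing class,
    and it only accepts a static priority of 0.
    """
    os.sched_setscheduler(pid, os.SCHED_OTHER, os.sched_param(0))


def spawn_with_default_scheduling(argv):
    """Fork, drop any inherited realtime policy in the child, then exec.

    Returns the child's exit code, or -1 if it did not exit normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child: the scheduling policy is inherited across fork() and
        # exec(), so it has to be reset explicitly before the exec.
        try:
            reset_to_default_scheduling()
            os.execvp(argv[0], argv)
        finally:
            # Only reached if the reset or the exec failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Calling spawn_with_default_scheduling(["ps", "-o", "cls,rtprio,pri,comm"]) from an RR process should then show the child in class TS rather than RR.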

> As far as we remember from SLES11SP1, the resources were not running
> in realtime priority there. Hence, this looks like a bug in the more
> recent pacemaker/corosync version?!?

Looks like it. Could you please open a support call?

Thanks,

Dejan

> We suspect that the backup software "dsmcad" running in realtime
> priority causes the monitors to time out, as the system is under
> heavy IO load and may not respond in time for the monitors.
> 
> 
> More details about our setup:
> ------------
> # hb_report -V
> cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)
> # zypper -q if cluster-glue pacemaker corosync
> Information for package cluster-glue:
> 
> Repository: SLE11-HAE-SP3-Pool
> Name: cluster-glue
> Version: 1.0.11-0.15.28
> Arch: x86_64
> ...
> Information for package pacemaker:
> 
> Repository: SLE11-HAE-SP3-Pool
> Name: pacemaker
> Version: 1.1.9-0.19.102
> Arch: x86_64
> ...
> Information for package corosync:
> 
> Repository: SLE11-HAE-SP3-Pool
> Name: corosync
> Version: 1.4.5-0.18.15
> Arch: x86_64
> ------------
> 
> I can provide further information on request. We would be glad
> for any hints or suggestions on how to fix this problem.
> 
> Best regards,
> J Brauchle
> 
