[Pacemaker] RES: Reboot of cluster members with heavy load on filesystem.

Sun Feb 10 20:41:31 EST 2013

Hi Andrew,

Thank you very much for your hints.

> > Hi.
> >
> > We are running two clusters of two machines each. We are using DRBD + OCFS2 to provide the
> > shared filesystem.

[snip]

> >
> > The clusters run fine under normal load, except when backing up
> > files or optimizing the databases. At those times we get a huge increase in data flowing from
> > mysqldump to the backup resource or from the resource mounted on /export.
> > Sometimes when performing the backup or optimizing the database (done
> > only on the mysql cluster), Pacemaker declares a node dead (but
> > it is not)
> 
> Well you know that, but it doesn't :)
> It just knows it can't talk to its peer anymore - which it has to treat as a failure.
> 
> > and starts the recovery process. When that happens we end up with both
> > machines getting restarted and, most of the time, with a database crash
> > :-(
> >
> > As you can see below, just about 30 seconds after the dump starts on diana the problem happens.
> > ----------------------------------------------------------------

[snip]

> > 04:27:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> > Feb  6 04:28:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> > Feb  6 04:29:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> > Feb  6 04:30:01 diana /USR/SBIN/CRON[1257]: (root) CMD (/root/scripts/bkp_database_diario.sh)
> > Feb  6 04:30:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> > Feb  6 04:31:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> > Feb  6 04:31:42 diana lrmd: [2919]: WARN: ip_intranet:0:monitor process (PID 1902) timed out (try 1).  Killing with signal SIGTERM (15).
> 
> I'd increase the timeout here. Or put pacemaker into maintenance mode (where it will not act on
> failures) while you do the backups - but that's more dangerous.
> 
> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] CLM CONFIGURATION CHANGE
> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] New Configuration:
> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Left:
> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Joined:
> >
> 
> This appears to be the (almost) root of your problem.
> The load is starving corosync of CPU (or possibly network bandwidth) and it can no longer talk to its
> peer.
> Corosync then informs pacemaker, which initiates recovery.
> 
> I'd start by tuning some of your timeout values in corosync.conf
> 

It must be the CPU, because I can see it reach 100% usage on the Cacti graph.
Also, we have two rings for corosync: one affected by the data flow at backup time and another with free bandwidth.

This is the totem section of my configuration.

totem {
        version:        2
        token:          5000
        token_retransmits_before_loss_const: 10
        join:           60
        consensus:      6000
        vsftype:        none
        max_messages:   20
        clear_node_high_bit: yes
        secauth:        off
        threads:        0
        rrp_mode: active 
        interface {
                ringnumber: 0
                bindnetaddr: 10.10.1.0
                mcastaddr: 226.94.1.1
                mcastport: 5406
                ttl: 1
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.10.10.0
                mcastaddr: 226.94.1.1
                mcastport: 5406
                ttl: 1
        }
}

Can you kindly point out which timers/counters I should adjust?
What are reasonable values for them? I got scared by this warning: "It is not recommended to alter this value without guidance
from the corosync community."
Are there any benefits to changing the rrp_mode from active to passive? Should it be done on both hosts?
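
To make the question concrete, here is a minimal sketch of the kind of change under discussion, assuming the rule from the corosync.conf(5) man page that consensus must be at least 1.2 * token. The values 10000 and 12000 are illustrative assumptions only; they have not been tested on these machines and are not a recommendation. Only the two lines that differ from the configuration above are shown.

```
totem {
        # Hypothetical values for discussion only.
        # token was 5000: wait longer before declaring the token lost.
        token:          10000
        # consensus was 6000: kept at 1.2 * token.
        consensus:      12000
}
```

If something like this were applied, the same totem settings would presumably have to be present in corosync.conf on both nodes, with corosync restarted for them to take effect.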

> > ----------------------------------------------------------------
> >
> > Feb  6 04:30:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
> > Feb  6 04:31:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
> > Feb  6 04:31:41 apolo corosync[2848]:  [TOTEM ] A processor failed, forming new configuration.
> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
> > Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0, lost=1

[snip]

> >
> > After many more log lines, apolo asks diana to reboot, and some time after that it gets rebooted too.
> > We had an old cluster with heartbeat, where DRBD used to cause this, but now it looks like
> > Pacemaker is the culprit.
> >
> > Here is my Pacemaker and DRBD configuration
> > http://www2.connection.com.br/cbastos/pacemaker/crm_config
> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_common.setup
> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res
> >
> > And more detailed logs
> > http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
> > http://www2.connection.com.br/cbastos/pacemaker/reboot_diana
> >

Best regards,
Carlos.