[Pacemaker] RES: Reboot of cluster members with heavy load on filesystem.

Dan Frincu df.cluster at gmail.com
Mon Feb 11 05:25:58 EST 2013


Hi,

On Mon, Feb 11, 2013 at 12:21 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Mon, Feb 11, 2013 at 12:41 PM, Carlos Xavier
> <cbastos at connection.com.br> wrote:
>> Hi Andrew,
>>
>> thank you very much for your hints.
>>
>>> > Hi.
>>> >
>>> > We are running two clusters, each composed of two machines. We are using DRBD + OCFS2 to provide the common
>>> > filesystem.
>>
>> [snip]
>>
>>> >
>>> > The clusters run nicely under normal load, except when doing backups of
>>> > files or optimizing the databases. At these times we get a huge amount of data coming from the
>>> > mysqldump to the backup resource or from the resource mounted on /export.
>>> > Sometimes when performing the backup or optimizing the database (done
>>> > just on the mysql cluster), Pacemaker declares a node dead (but
>>> > it's not)
>>>
>>> Well you know that, but it doesn't :)
>>> It just knows it can't talk to its peer anymore - which it has to treat as a failure.
>>>
>>> > and starts the recovery process. When that happens we end up with two
>>> > machines getting restarted, and most of the time with a database crash
>>> > :-(
>>> >
>>> > As you can see below, just about 30 seconds after the dump starts on diana, the problem happens.
>>> > ----------------------------------------------------------------
>>
>> [snip]
>>
>>> > 04:27:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:28:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:29:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:30:01 diana /USR/SBIN/CRON[1257]: (root) CMD (/root/scripts/bkp_database_diario.sh)
>>> > Feb  6 04:30:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:42 diana lrmd: [2919]: WARN: ip_intranet:0:monitor process (PID 1902) timed out (try 1).  Killing with signal SIGTERM (15).
>>>
>>> I'd increase the timeout here. Or put Pacemaker into maintenance mode (where it will not act on
>>> failures) while you do the backups - but that's more dangerous.
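
For reference, a rough sketch of both options using the crm shell (assuming crmsh is in use here; the resource name is taken from the log above, and the 60s timeout is only an illustrative value):

    # Option 1: raise the timeout on the "op monitor" line of the primitive
    # that timed out, e.g. to timeout="60s" (pick a value that fits your load)
    crm configure edit ip_intranet

    # Option 2: keep Pacemaker from acting on failures during the backup window
    crm configure property maintenance-mode=true
    # ... run the backup / optimize ...
    crm configure property maintenance-mode=false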
>>>
>>> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] CLM CONFIGURATION CHANGE
>>> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] New Configuration:
>>> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
>>> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Left:
>>> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
>>> > Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Joined:
>>> >
>>>
>>> This appears to be the (almost) root of your problem.
>>> The load is starving corosync of CPU (or possibly network bandwidth) and it can no longer talk to its
>>> peer.
>>> Corosync then informs Pacemaker, which initiates recovery.
>>>
>>> I'd start by tuning some of your timeout values in corosync.conf
>>>
>>
>> It should be the CPU, because I can see it going to 100% usage on the Cacti graph.
>> Also, we have two rings for corosync: one affected by the data flow at backup time, and another with free bandwidth.
>>
>> This is the totem section of my configuration.
>>
>> totem {
>>         version:        2
>>         token:          5000
>>         token_retransmits_before_loss_const: 10
>>         join:           60
>>         consensus:      6000
>>         vsftype:        none
>>         max_messages:   20
>>         clear_node_high_bit: yes
>>         secauth:        off
>>         threads:        0
>>         rrp_mode: active
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 10.10.1.0
>>                 mcastaddr: 226.94.1.1
>>                 mcastport: 5406
>>                 ttl: 1
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 10.10.10.0
>>                 mcastaddr: 226.94.1.1
>>                 mcastport: 5406
>>                 ttl: 1
>>         }
>> }
>>
>> Can you kindly point out which timer/counter I should play with?
>
> I would start by making these higher, perhaps double them and see what
> effect it has.
>
>         token:          5000
>         token_retransmits_before_loss_const: 10
>
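
For what it's worth, doubled they would look like this in the totem section (a starting point to experiment with, not a tested recommendation):

        token:          10000
        token_retransmits_before_loss_const: 20

One thing to keep an eye on: if I remember the corosync.conf man page correctly, consensus must be at least 1.2 * token, so with token at 10000 the current consensus of 6000 is too low and would need to become 12000 or more.
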
>> What are reasonable values for them? I was scared by this warning: "It is not recommended to alter this value without guidance
>> from the corosync community."
>> Is there any benefit to changing the rrp_mode from active to passive?

rrp_mode: passive is better tested than active. That's the only real benefit.
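
If you do try passive, it is a single line in the totem section (kept identical on both nodes):

        rrp_mode: passive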

>
> Not something I've played with, sorry.
>
>> Should it be done on both hosts?
>
> It should be the same, I would imagine.
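
Agreed - keep corosync.conf identical on both nodes. I believe the totem timers are only read when corosync starts, so restart corosync on each node after editing, and then you can verify that both rings are healthy on each node with:

    corosync-cfgtool -s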
>
>>
>>> > ----------------------------------------------------------------
>>> >
>>> > Feb  6 04:30:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
>>> > Feb  6 04:31:41 apolo corosync[2848]:  [TOTEM ] A processor failed, forming new configuration.
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
>>> > Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0, lost=1
>>
>> [snip]
>>
>>> >
>>> > After a lot of logging, apolo asks diana to reboot, and some time after that it gets rebooted too.
>>> > We had an old cluster with Heartbeat, and DRBD used to cause this on that system, but now it looks like
>>> > Pacemaker is the culprit.
>>> >
>>> > Here is my Pacemaker and DRBD configuration
>>> > http://www2.connection.com.br/cbastos/pacemaker/crm_config
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_common.setup
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
>>> > http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res
>>> >
>>> > And more detailed logs
>>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
>>> > http://www2.connection.com.br/cbastos/pacemaker/reboot_diana
>>> >
>>
>> Best regards,
>> Carlos.
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 
Dan Frincu
CCNA, RHCE



