[Pacemaker] Reboot of cluster members with heavy load on filesystem.

Andrew Beekhof andrew at beekhof.net
Wed Feb 6 18:07:09 EST 2013


On Thu, Feb 7, 2013 at 5:58 AM, Carlos Xavier <cbastos at connection.com.br> wrote:
> Hi.
>
> We are running two clusters, each composed of two machines, using DRBD + OCFS2 to provide the shared filesystem.
> The versions of the components in use are:
> OpenSuse 12.1
> pacemaker-1.1.6-2.1.2.x86_64
> corosync-1.4.1-2.1.3.x86_64
> openais-1.1.4-9.7.3.x86_64
> ocfs2-tools-1.8.0-4.1.2.x86_64
> kernel downgraded to 2.6.37
> 4x Dell R610
>
> We use one of the clusters to run our MySQL database and the other as our web server. On the web cluster we have two resources
> mounted on /export and /backup.
>
> The clusters run fine under normal load, except when backing up files or optimizing the databases. At those times a huge
> amount of data flows from mysqldump to the backup resource, or out of the resource mounted on /export.
> Sometimes while performing the backup or optimizing the database (done only on the mysql cluster), Pacemaker declares a node dead
> (but it's not)

Well, you know that, but Pacemaker doesn't :)
It just knows it can't talk to its peer anymore - which it has to
treat as a failure.

> and starts the recovery process. When that happens we end up with both machines getting restarted, and most of the time
> with a crashed database :-(
>
> As you can see below, just about 30 seconds after the dump starts on diana, the problem happens.
> ----------------------------------------------------------------
> Feb  6 04:22:40 diana stonith-ng: [2917]: info: log_operation: resIPMI-1: Performing: stonith -t external/ipmi -S
> Feb  6 04:22:40 diana stonith-ng: [2917]: info: log_operation: resIPMI-1: success:  0
> Feb  6 04:22:40 diana stonith-ng: [2917]: info: stonith_device_execute: Nothing to do for resIPMI-1
> Feb  6 04:23:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:24:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:25:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:26:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:27:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:28:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:29:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:30:01 diana /USR/SBIN/CRON[1257]: (root) CMD (/root/scripts/bkp_database_diario.sh)
> Feb  6 04:30:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:31:31 diana lrmd: [2919]: info: RA output: (httpd:1:monitor:stderr) redirecting to systemctl
> Feb  6 04:31:42 diana lrmd: [2919]: WARN: ip_intranet:0:monitor process (PID 1902) timed out (try 1).  Killing with signal SIGTERM
> (15).

I'd increase the timeout here. Or put Pacemaker into maintenance mode
(where it will not act on failures) while you do the backups - but
that's more dangerous. See the sketch below.
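
Roughly along these lines (crm shell syntax; the agent, params and
timeout values are illustrative assumptions, not taken from your
configuration - adjust them to match your actual primitive):

    # In "crm configure edit", raise the monitor timeout on the
    # resource that timed out in your logs, e.g.:
    primitive ip_intranet ocf:heartbeat:IPaddr2 \
        params ip="..." \
        op monitor interval="30s" timeout="90s"

    # Or wrap the backup job in maintenance mode, during which
    # Pacemaker ignores monitor results instead of recovering:
    crm configure property maintenance-mode=true
    /root/scripts/bkp_database_diario.sh
    crm configure property maintenance-mode=false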

> Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] CLM CONFIGURATION CHANGE
> Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] New Configuration:
> Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
> Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Left:
> Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
> Feb  6 04:31:47 diana corosync[2902]:  [CLM   ] Members Joined:
>

This appears to be the (almost) root of your problem.
The load is starving corosync of CPU (or possibly network bandwidth)
and it can no longer talk to its peer.
Corosync then informs Pacemaker, which initiates recovery.

I'd start by tuning some of your timeout values in corosync.conf, for example:
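
Something like this in the totem section (the numbers are illustrative
starting points, not tuned values; corosync 1.4 defaults to a 1000 ms
token, and consensus must be at least 1.2 * token):

    totem {
        # ms to wait for the token before declaring a processor failed
        token: 10000

        # retransmit attempts before forming a new configuration
        token_retransmits_before_loss_const: 10

        # ms to wait for consensus before starting a new membership round
        consensus: 12000
    }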

> ----------------------------------------------------------------
>
> Feb  6 04:30:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
> Feb  6 04:31:32 apolo lrmd: [2855]: info: RA output: (httpd:0:monitor:stderr) redirecting to systemctl
> Feb  6 04:31:41 apolo corosync[2848]:  [TOTEM ] A processor failed, forming new configuration.
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.2) r(1) ip(10.10.10.9)
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0,
> lost=1
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: memb: apolo 16845322
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: lost: diana 33622538
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 304: memb=1, new=0, lost=0
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: MEMB: apolo 16845322
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: ais_mark_unseen_peer_dead: Node diana was not seen in the previous transition
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: update_member: Node 33622538/diana is now: lost
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: send_member_notification: Sending membership update 304 to 4 children
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 304: memb=1, new=0,
> lost=1
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: memb: apolo 16845322
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: lost: diana 33622538
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] CLM CONFIGURATION CHANGE
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] New Configuration:
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] #011r(0) ip(10.10.1.1) r(1) ip(10.10.10.8)
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Left:
> Feb  6 04:31:47 apolo corosync[2848]:  [CLM   ] Members Joined:
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 304: memb=1, new=0, lost=0
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: pcmk_peer_update: MEMB: apolo 16845322
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: ais_mark_unseen_peer_dead: Node diana was not seen in the previous transition
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: update_member: Node 33622538/diana is now: lost
> Feb  6 04:31:47 apolo corosync[2848]:  [pcmk  ] info: send_member_notification: Sending membership update 304 to 4 children
> Feb  6 04:31:47 apolo cib: [2854]: notice: ais_dispatch_message: Membership 304: quorum lost
> ----------------------------------------------------------------
>
> After lots of logging, apolo asks diana to reboot, and some time after that it gets rebooted too.
> We had an old cluster with Heartbeat and DRBD that used to do this, but now it looks like Pacemaker is the culprit.
>
> Here is my Pacemaker and DRBD configuration
> http://www2.connection.com.br/cbastos/pacemaker/crm_config
> http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/global_common.setup
> http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/backup.res
> http://www2.connection.com.br/cbastos/pacemaker/drbd_conf/export.res
>
> And more detailed logs
> http://www2.connection.com.br/cbastos/pacemaker/reboot_apolo
> http://www2.connection.com.br/cbastos/pacemaker/reboot_diana
>
>
> Please, is there a way to fix this? Maybe some timing adjustments can help?
> I have no idea where to start.
>
> Best Regards,
> Carlos.
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



