[Pacemaker] About failover node in a deep swap

Pentarh Udi pentarh at gmail.com
Wed Feb 9 11:47:15 EST 2011


I noticed that Pacemaker does not correctly fail over nodes under heavy load,
when they go into deep swap or heavy IO.

I configured several nodes running Apache with MaxClients set high enough to
push a node into swap, put some heavy PHP scripts on them (Wordpress ^_^) and
then ran heavy webserver benchmarks.
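
Roughly this kind of prefork setup, the numbers here are only an illustration
of "big enough to exhaust RAM", not my exact values:

    # prefork MPM: every client gets its own httpd process, so with mod_php
    # a high MaxClients can easily exceed the physical RAM of the node
    <IfModule prefork.c>
        StartServers          8
        MinSpareServers       5
        MaxSpareServers      20
        ServerLimit         512
        MaxClients          512
        MaxRequestsPerChild 4000
    </IfModule>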

When a node goes into deep swap, its load average climbs into the thousands
and the node is effectively stunned (but pings are okay), yet for some reason
Pacemaker does not mark the node as failed and does not migrate resources
away.

Even worse: under certain conditions Pacemaker does start to migrate resources
away, but they then fail to start on the other nodes (while under normal
conditions they start fine):

httpd_start_0 (node=node1, call=32, rc=1, status=complete): unknown error
httpd_start_0 (node=node2, call=43, rc=1, status=complete): unknown error

Sometimes there is a timeout error, sometimes there are no errors at all, but
the result is that the resources are down.

In this setup ocf::heartbeat:apache runs in a group with
ocf::heartbeat:IPaddr2, so maybe Pacemaker failed to stop IPaddr2 and
therefore can't move ocf::heartbeat:apache, since they are in the same group.
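
The group looks roughly like the crm shell sketch below (resource names, the
IP address and the timeouts are placeholders, not my exact configuration):

    # virtual IP the webserver listens on
    primitive cluster_ip ocf:heartbeat:IPaddr2 \
            params ip=192.168.1.100 cidr_netmask=24 \
            op monitor interval=30s
    # apache itself
    primitive httpd ocf:heartbeat:apache \
            params configfile=/etc/httpd/conf/httpd.conf \
            op monitor interval=30s timeout=60s \
            op start timeout=60s \
            op stop timeout=60s
    # members start in the listed order and stop in reverse
    group webserver cluster_ip httpd

As far as I understand, a stop operation that hangs or fails on the
swapped-out node blocks the whole group from being moved, which would match
what I see.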

Is this "normal" corosync behavior, or am I doing something wrong? 90% of my
"down conditions" are heavy load, and corosync does not handle this in my
case.

-- 
Regards, Pentarh Udi

