I noticed that Pacemaker does not correctly fail over nodes under heavy load, when they go into deep swap or heavy IO.

I configured two or more nodes running Apache with MaxClients set high enough to swap out the node, put some heavy PHP scripts on them (Wordpress ^_^), and then ran heavy webserver benchmarks.
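For reference, the prefork tuning looks something like this (the exact numbers are illustrative, not my real values; the point is that MaxClients is far more than the RAM can serve without swapping):

    <IfModule prefork.c>
        StartServers          8
        MinSpareServers       5
        MaxSpareServers      20
        ServerLimit         512
        # deliberately oversized so the benchmark pushes the box into swap
        MaxClients          512
        MaxRequestsPerChild 4000
    </IfModule>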
When the node goes into deep swap, load averages go into the thousands and it is effectively stunned (though pings are okay), yet for some reason Pacemaker does not mark the node as failed and does not migrate resources away.
Even worse: under certain conditions Pacemaker does start to migrate resources away, but they fail to start on the other nodes (while under normal conditions they start fine):

httpd_start_0 (node=node1, call=32, rc=1, status=complete): unknown error
httpd_start_0 (node=node2, call=43, rc=1, status=complete): unknown error

Sometimes there is a timeout error, sometimes there are no errors at all, but the result is that the resources are down.
In this case ocf::heartbeat:apache runs in a group with ocf::heartbeat:IPaddr2, so maybe Pacemaker fails to stop IPaddr2 and therefore cannot move ocf::heartbeat:apache, since they are in the same group.
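For completeness, the group is configured roughly like this in the crm shell (the IP, netmask and configfile are placeholders, not my real values):

    # virtual IP that Apache listens on
    primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.0.100" cidr_netmask="24" \
        op monitor interval="30s"
    # the Apache instance itself
    primitive WebServer ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval="30s"
    # group members start in order and stop in reverse,
    # so a stuck member blocks the whole group from moving
    group WebGroup ClusterIP WebServer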
Is this "normal" corosync behavior, or am I doing something wrong? 90% of my "down conditions" are heavy load, and corosync does not handle this in my case.

-- 
Regards, Pentarh Udi