<div><div><div class="gmail_quote"><div>Hey everyone,</div><div> We had an interesting issue happen the other night on one of our clusters. A resource attempted to start on an unauthorized node (and failed), which caused the real resource, already running on a different node, to become orphaned and subsequently shut down.</div>

Some background:
We're running Pacemaker 1.0.12 and Corosync 1.2.7 on CentOS 5.8 x64.

The cluster has 3 members:
pgsql1c & pgsql1d are physical machines running dual Xeon X5650s with 32 GB of RAM.
dbquorum is a VM running on VMware ESX on HP blade hardware.

The two physical machines are configured as master/slave PostgreSQL servers; the VM is only there for quorum and should never run any resources. The full crm configuration is available in this zip (as a link to allow the email to post correctly):
<div><a href="https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip">https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip</a></div>

On the dbquorum VM we got the following log message:
Jun 07 03:11:10 corosync [TOTEM ] Process pause detected for 598 ms, flushing membership messages.

After this, it appears that even though the Cluster-Postgres-Server-1 and Postgres-IP-1 resources are only set up to run on pgsql1c/d, the dbquorum box somehow tried to start them:

WARN: unpack_rsc_op: Processing failed op Postgres-Server-1:0_monitor_0 on dbquorum.example.com: unknown error (1)
WARN: unpack_rsc_op: Processing failed op Postgres-IP-1_monitor_0 on dbquorum.example.com: unknown error (1)
info: find_clone: Internally renamed Postgres-Server-1:0 on pgsql1c.example.com to Postgres-Server-1:1
info: find_clone: Internally renamed Postgres-Server-1:1 on pgsql1d.example.com to Postgres-Server-1:2 (ORPHAN)
WARN: process_rsc_state: Detected active orphan Postgres-Server-1:2 running on pgsql1d.example.com
ERROR: native_add_running: Resource ocf::IPaddr2:Postgres-IP-1 appears to be active on 2 nodes.
WARN: See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
notice: native_print: Postgres-IP-1	(ocf::heartbeat:IPaddr2) Started FAILED
notice: native_print:	0 : dbquorum.example.com
notice: native_print:	1 : pgsql1d.example.com
notice: clone_print: Master/Slave Set: Cluster-Postgres-Server-1
notice: native_print: Postgres-Server-1:0	(ocf::custom:pgsql):	Slave dbquorum.example.com FAILED
notice: native_print: Postgres-Server-1:2	(ocf::custom:pgsql):	ORPHANED Master pgsql1d.example.com
notice: short_print: Slaves: [ pgsql1c.example.com ]
ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1c.example.com] = 100
ERROR: common_apply_stickiness: Postgres-Server-1:0[pgsql1d.example.com] = 100
ERROR: clone_color: Postgres-Server-1:0 is running on dbquorum.example.com which isn't allowed
info: native_color: Stopping orphan resource Postgres-Server-1:2

The stopping of the orphaned resource caused our master to stop; luckily the slave was correctly promoted to master and we had no outage.

There seem to be several things that went wrong here:
1. The VM pause - searching around, I found some posts relating this pause message to VMs. We've since raised the priority of our dbquorum box on the VM host. The other posts talk about the token option in totem, but we haven't set that, so it should be at its default of 1000 ms, and it seems unlikely that changing it would have made any difference in this situation (a rough sketch of what we could set is after this list). We also looked at the physical VM host and couldn't see anything at the time that would explain the pause.
2. The quorum machine tried to start resources it is not authorized for - symmetric-cluster is set to false and there is no location entry for that node/resource, so why would it try to start them? (We're wondering whether explicit bans would help; see the crm sketch after this list.)
3. The two machines that stayed up got corrupted when the third came back - the two primary machines never lost quorum, so when the third machine came back and told them it was now the PostgreSQL master, why would they believe it, and then shut down the proper master that they should know full well is the true one? I would have expected the dbquorum machine's changes to be rejected by the other two nodes, which had quorum.
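
For point 1, this is roughly the corosync.conf fragment we'd be touching if we did decide to raise the token timeout; the 5000 ms value below is purely illustrative, not something we've tested or are running:

# totem fragment (corosync 1.x) - value shown is an example only
totem {
    version: 2
    # token timeout in milliseconds; the corosync default is 1000, which is
    # effectively what we're running today since we never set it explicitly
    token: 5000
}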
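
For point 2, even though symmetric-cluster=false should already keep resources off dbquorum, we're considering adding explicit -inf location constraints as a belt-and-braces measure. A rough crm configure sketch of what we have in mind (constraint IDs are made up; the resource names are the ones from our config). We're not sure this would have stopped the initial monitor_0 probes, which is part of the question:

# crm configure: explicitly forbid the quorum node from running the postgres resources
location loc-ban-pgsql-on-dbquorum Cluster-Postgres-Server-1 -inf: dbquorum.example.com
location loc-ban-pgip-on-dbquorum Postgres-IP-1 -inf: dbquorum.example.com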

The logs and config are in this zip: https://www.dropbox.com/s/tc4hz22lw738y06/Jun7Logs.zip ... pgsql1d was the DC at the time of the issue.

If anyone has any ideas as to why this happened, and/or changes we can make to our config to prevent it from happening again, that would be great.

Thanks!