<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Even if that is set, we need to verify that the resources are, indeed,<br>
NOT running where they shouldn't be; remember, it is our job to ensure<br>
that the configured policy is enforced. So, we probe them everywhere to<br>
ensure they are indeed not around, and stop them if we find them.<br>
</blockquote>
<br></div>
Again, WHY do you need to verify things which cannot happen by setup? If some resource cannot, REALLY CANNOT exist on a node, and administrator can confirm this, why rely on network, cluster stack, resource agents, electricity in power outlet, etc. to verify that 2+2 is still 4?</blockquote>
<div><br></div><div>I don't want to step on any toes here (mainly because me stepping on somebody's toes, without that person wearing a pair of steel-toe-capped boots, would leave them toeless), but I've been watching the ranting go on and on, and I feel something is missing from the picture: namely, an example of why checking for resources on passive nodes is a good thing. I haven't seen one so far.</div>
<div><br></div><div>Case in point: a service depends on running a process from a disk resource, and someone gets the idea to "cluster the service" and run it on shared storage. The process reads its configuration from a directory not shared with the other node(s) (e.g. /etc/something), then connects to the shared storage and uses it. Let's assume for a second that the shared storage runs on a DRBD dual-primary setup (nothing against that, great software), that the process has direct access to the underlying shared disk (no DLM, no cluster filesystem), and that the setup comprises two nodes, with the restriction that only one of the two nodes may access the data at any given point in time; otherwise, concurrent access to the shared storage would compromise the data.</div>
<div><br></div><div>So the thought arises of using cluster software to maintain a high-availability setup. Enter Pacemaker. Now, the software providing the service has an init script whose only purpose is to start, stop, restart and show the status of the process running on the local disk (it was never meant for use on shared storage). Pacemaker gladly takes care of providing a highly available setup by using that init script, provided it's LSB compliant.</div>
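As a side note, the part of the LSB contract that Pacemaker's probes depend on boils down to the exit code of the "status" action: 0 means running, 3 means stopped. A minimal sketch, assuming a hypothetical service ("myservice") and an invented pidfile path:

```shell
#!/bin/sh
# Minimal sketch of the LSB "status" action that Pacemaker probes with.
# "myservice" and the pidfile location are hypothetical examples.
PIDFILE="${PIDFILE:-/tmp/myservice.pid}"

status() {
    # LSB mandates: exit 0 = service running, exit 3 = service not running.
    # Probes on passive nodes expect to get 3 back from this action.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return 0
    fi
    return 3
}

status
rc=$?
echo "status exit code: $rc"
```

An init script that prints text but always exits 0, for example, would make Pacemaker believe the service is running everywhere, which is exactly the kind of non-compliance that breaks this setup.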
<div><br></div><div>Advantages:</div><div>- no RA is required; the init script will do</div><div>- no advanced multi-state OCF-compliant script is required to monitor whether the process is running on N nodes, keep track of where it's supposed to run, perform the appropriate monitoring and signalling for promoting, demoting and migrating, manage which node accesses which part of the shared storage so concurrent access is prevented, run a two-phase commit, set and release locks, and so on and so forth</div>
<div>- less administration overhead for someone not at clustering-guru level (this last one is very important)</div><div><br></div><div>OK, so far it sounds perfect. But what happens if someone starts the service on the secondary/passive node? Whether by user error, or by upgrading the software (which re-enables its automatic startup at the given runlevel) and then restarting the secondary node (a common practice when performing upgrades in a cluster environment), and so on. If Pacemaker did not check all the nodes for whether the service is active => epic fail. Its state-based model, in which it maintains a desired state of the resources and performs the necessary actions to bring the cluster to that state, is what saves us from the "epic fail" moment.</div>
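To make the "state-based model" concrete, here is a toy sketch, not Pacemaker's actual code: the node names and the two lists are invented for illustration. The idea is simply to compare where probing actually found the resource active against where the configured policy allows it, and schedule a stop everywhere else:

```shell
#!/bin/sh
# Toy model of probe reconciliation -- NOT Pacemaker's real logic.
# Node names and the two lists below are invented for illustration.
ACTIVE_ON="node1 node2"   # where probing actually found the resource
ALLOWED_ON="node1"        # where the configured policy allows it to run

actions=""
for node in $ACTIVE_ON; do
    case " $ALLOWED_ON " in
        *" $node "*) ;;                      # active where allowed: fine
        *) actions="$actions stop:$node" ;;  # "too active": stop it here
    esac
done
echo "planned actions:$actions"
```

With the invented lists above, the resource is found on node2 where it isn't allowed, so a stop is planned there; skipping the probe on node2 would mean never noticing it in the first place.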
<div><br></div><div>This example, although simple, is a very common occurrence. That's why this behaviour is documented and implemented in Pacemaker (<a href="http://www.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active">http://www.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active</a>), as opposed to the feature of not checking for a resource on a node in an asymmetric cluster, which has been brought to this mailing list as a request by one person and voted in favor of by another (so far, that's 2).</div>
<div><br></div><div>I'm trying to be impartial here, although my experience may bias me in favor of Pacemaker, but here's a thought: it's a free world, and we all have freedom of speech, which I'm also exercising at the moment. Want something done? Do it yourself; patches are being accepted. Don't have the time? Ask people for their help, politely, wait for them to reply, then kindly ask again (and prayers are heard: Steven Dake released a patch for automatic redundant ring recovery &gt;&gt; <a href="http://www.mail-archive.com/openais@lists.linux-foundation.org/msg06072.html">http://www.mail-archive.com/openais@lists.linux-foundation.org/msg06072.html</a> &lt;&lt;, thank you Steven). Want something done fast? Pay some developers to do it for you; say, the folks over at <a href="http://www.linbit.com">www.linbit.com</a> wouldn't mind some sponsorship (and I'm not affiliated with them in any way; believe it or not, I'm actually doing this without external incentives, out of the kindness of my heart, so to speak).</div>
<div><br></div><div>And of course, the largest part of the effort, the part that goes into getting things done and having a functional piece of software that covers most use cases, is usually forgotten. Because hey, you only spent all those hard hours making sure the software works: implementing feature after feature, with the most commonly used ones as the main target, cross-testing various hardware platforms and operating systems, and writing endless pages of documentation, so that the community would benefit from the first open-source cluster stack and associated software that can compete head-to-head with pricey commercial clustering solutions on the market. But since your software doesn't know how to cook cordon bleu, it's not worth considering for a long-term relationship; better to part ways now, because it just won't work between the two of you.</div>
<div><br></div><div>Clearly, many more rants will probably follow this posting, but I'm OK with that; everyone has the right to express themselves, whether right or wrong (a relative matter anyway). And please forgive me if I did step on any toes; it was not my intention.</div>
<div> </div><div>Regards,<br>Dan</div><div><br></div></div>-- <br>Dan Frincu<div>CCNA, RHCE</div><br>