[Pacemaker] new node causes spurious evil

Matthew O'Connor matt at ecsorl.com
Mon May 14 11:08:22 EDT 2012


Hi!  Thanks for your reply!  That makes perfect sense.

Thanks again!!
-- Matt

On 5/14/2012 10:44 AM, David Vossel wrote:
> ----- Original Message -----
>> From: "Matthew O'Connor" <matt at ecsorl.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Friday, May 11, 2012 11:49:04 PM
>> Subject: [Pacemaker] new node causes spurious evil
>>
>> My question: why will a node that is not allowed to start a resource
>> attempt to run a monitor on that resource?  Is there a way to change
>> this behavior?  (The specific monitors in question are
>> ocf:heartbeat:iSCSITarget and ocf:heartbeat:iSCSILogicalUnit.)
>>
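
A minimal pair of resource definitions of the kind under discussion
might look roughly like this in crm shell syntax; the IQN, LUN, and
device path below are invented for illustration:

    primitive p_iscsi_tgt ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2012-05.com.example:storage.ds" \
        op monitor interval="10s" timeout="30s"
    primitive p_iscsi_lun ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2012-05.com.example:storage.ds" \
               lun="1" path="/dev/drbd0" \
        op monitor interval="10s" timeout="30s"
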
>> The details:
>> I have two nodes, ds01 and ds02, running and happy, but when I add a
>> third node, gw05, things start falling apart.  I've configured an
>> asymmetric opt-in cluster per the documentation, with explicit rules
>> about what can start where.  ds01 and ds02 are configured with a
>> variety of resources; gw05 is not configured with any - it's
>> effectively a blank node.
>>
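
An asymmetric opt-in cluster of this shape is typically built by
disabling symmetric-cluster and then opting resources in with positive
location scores.  A rough sketch in crm shell syntax, with invented
group names and scores:

    crm configure property symmetric-cluster="false"
    # Opt the group in on the two storage nodes only.
    crm configure location loc-grp1-ds01 grp_iscsi1 200: ds01
    crm configure location loc-grp1-ds02 grp_iscsi1 100: ds02
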
>> With ds01 and ds02 running and in a stable state with their
>> resources, bringing gw05 online (even in standby mode) causes many
>> things to fall apart.  First, gw05 reports a monitor error for a
>> resource that was never supposed to run there.  That resource
>> belonged to a group that was alive and well on ds01; the group died,
>> but one of its members was left running on ds01 (?!).  Nothing could
>> be migrated to ds02 or away from gw05.  After running "service
>> pacemaker stop" and doing a resource cleanup on the group from one of
>> the remaining ds?? nodes, everything went back to normal.
>>
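
The recovery described above corresponds to something like the
following (group name invented):

    # On gw05: take the misbehaving node out of the cluster.
    service pacemaker stop

    # On ds01 or ds02: clear the failed probe/monitor history so the
    # cluster recomputes placement for the group.
    crm resource cleanup grp_iscsi1
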
>> (I've simplified the details here - the actual configuration is
>> slightly more complex, with two resource groups instead of one.  Both
>> groups die: one completely, while the other leaves a dangling
>> IP-address resource on the node it started on.  gw05 never starts
>> anything, and isn't supposed to, but it's the one reporting the
>> errors and evidently killing the resources.)
>>
>> Now, I've tried location statements to explicitly exclude gw05 from
>> starting any of the resources it's complaining about, and used
>> copious order and colocation statements, to no avail.  The kicker:
>> when I finally gave in and installed one "missing" package (which
>> should not have been required on gw05), the monitor worked again and
>> things stopped failing.
> When a node joins, the cluster probes it: it runs a one-time monitor
> action for every configured resource to verify that the resource
> isn't already running on that node.  The cluster expects the monitor
> to report "not running" (OCF_NOT_RUNNING), but since the resource
> agent - or something it depends on - isn't installed, you see an
> error instead.  Once you install the resource agents on the node that
> isn't running any resources, the cluster can verify your resources
> are not running there.  Hope that makes sense.  Just install the
> resource agents everywhere and you should be good.
>
> -- Vossel
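
In OCF terms that one-time monitor is the "probe", and the agent's
exit code is what the cluster keys on.  A stripped-down sketch of
monitor logic, shaped like the iscsitarget case but not the actual
agent:

    #!/bin/sh
    # Minimal OCF-style monitor sketch (illustration only).
    : ${OCF_ROOT=/usr/lib/ocf}
    . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

    tgt_monitor() {
        # Without the iscsitarget kernel module there is no way to
        # tell whether the target is running, so the agent must report
        # an error rather than a clean "not running".
        [ -e /proc/net/iet/volume ] || return $OCF_ERR_INSTALLED
        if grep -q "$OCF_RESKEY_iqn" /proc/net/iet/volume; then
            return $OCF_SUCCESS      # 0: resource is active here
        fi
        return $OCF_NOT_RUNNING      # 7: cleanly not running
    }

    case "$1" in
        monitor) tgt_monitor; exit $? ;;
    esac

A probe that returns an error (here OCF_ERR_INSTALLED) instead of
OCF_NOT_RUNNING is recorded as a resource failure, which is evidently
what triggered the recovery seen on ds01 and ds02.
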
>
>
>> More specifics: the packages iscsitarget and iscsitarget-dkms were
>> required for gw05 to stop killing my resources.  I have an
>> ocf:heartbeat:iSCSITarget, an iSCSILogicalUnit, and a virtual IP
>> address in each of two groups.  ds01 and ds02 share the load for
>> these groups and are the ONLY nodes allowed to run them.  gw05 should
>> not even be trying to start these, let alone ANY resources/monitors
>> in those groups, IMO.  Using -inf location statements for both the
>> group and for the group members had no effect.  This effectively
>> suggests to me that any new node I bring into the cluster will need
>> these extra packages installed.
>>
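
For the record, exclusions like those described above would look
something like this in crm shell syntax (names invented):

    crm configure location loc-no-grp1-gw05 grp_iscsi1 -inf: gw05
    crm configure location loc-no-grp2-gw05 grp_iscsi2 -inf: gw05

Location scores only govern where a resource may run; probes are
executed on every online node regardless, which is why the -inf rules
had no effect here.  (Later Pacemaker releases added a
resource-discovery="never" option on location constraints to suppress
probing; at the time of this thread, installing the agents and their
dependencies everywhere was the practical answer.)
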
>> If this is an RTFM question, I apologize.  I've been reading the
>> manual, honestly, and this behavior totally bewilders me.  Would
>> setting is-managed="false" in the resource defaults help?  I'm loath
>> to add another step to the current "turn this resource on here"
>> chain.
>>
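
For what it's worth, the is-managed idea would be a one-liner:

    # Affects ALL resources: the cluster stops starting/stopping them.
    crm configure rsc_defaults is-managed="false"

But unmanaged resources are still probed and monitored, so this most
likely would not have prevented the failed probes on gw05.
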
>> Thanks!
>> -- Matt

-- 

Sincerely,
  Matthew O'Connor

-----------------------------------------------------------------
Sr. Software Engineer
PGP/GPG Key: 0x55F981C4
Fingerprint: E5DC A0F8 5A40 E4DA 2CE6 B5A2 014C 2CBF 55F9 81C4

Engineering and Computer Simulations, Inc.
11825 High Tech Ave Suite 250
Orlando, FL 32817

Tel:   407-823-9991 x315
Fax:   407-823-8299
Email: matt at ecsorl.com
Web:   www.ecsorl.com
-----------------------------------------------------------------




