[Pacemaker] RFC: What part of the XML configuration do you hate the most?

Fri Jun 27 08:18:08 EDT 2008

Hi,

just about topic 4) in this mail...

Andrew Beekhof <beekhof at gmail.com> writes:
>> 4) node fencing without the poweroff
>>   (this is a kind of a new feature request)
>>   Node fencing is just simple and good enough in most of our cases but
>>   we hesitate to use STONITH(poweroff/reboot) as the first action
>>   of a failure, because:
>>   - we want to shutdown the services gracefully as long as possible.
>>   - rebooting the failed node may lose the evidence of the
>>     real cause of a failure. We want to preserve it as possible
>>     to investigate it later and to ensure that the all problems are
>>     resolved.
>>
>>   We think that, ideally, when a resource failed the node would
>>   try to go to 'standby' state, and only when it failed it
>>   would escalate to STONITH to poweroff.
>
> The problem with this is that it directly (and negatively) impacts
> service availability.
> It is unsafe to start services elsewhere until they are confirmed dead
> on the existing node.
>
> So relying on manual shutdowns greatly increases failover time.

Right, but I think it depends on the application.

In the case of database applications such as pgsql or Oracle,
the dominant factor in failover time is the recovery time.
Shutting down a node in the middle of a transaction will cause a
rollback and will make the recovery time even longer.
We estimate 3-5 minutes at most for the recovery time in our configuration.

Another case is a Filesystem resource on shared storage.
If the filesystem was not unmounted cleanly, you should run fsck
before mounting it on the takeover node, for the safety of the data.
That can take a very long time, particularly if the filesystem
is very large, such as one used by a database.
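
A rough sketch of this arithmetic, to make the comparison concrete.
Only the 3-5 minute recovery estimate comes from the text above; every
other number and all function names are hypothetical placeholders.

```python
# Rough failover-time comparison; all values are in seconds.
# Only the 3-5 minute crash recovery estimate comes from the discussion;
# every other number below is a hypothetical placeholder.

def failover_after_poweroff(fence_s, fsck_s, crash_recovery_s, start_s):
    """Failover time when the failed node is powered off immediately."""
    return fence_s + fsck_s + crash_recovery_s + start_s


def failover_after_graceful_stop(stop_s, start_s):
    """Failover time when the services are stopped cleanly first,
    so no fsck and no crash recovery are needed on the takeover node."""
    return stop_s + start_s


poweroff = failover_after_poweroff(
    fence_s=10,            # hypothetical
    fsck_s=120,            # hypothetical, grows with filesystem size
    crash_recovery_s=180,  # lower end of the 3-5 minute estimate
    start_s=30,            # hypothetical
)
graceful = failover_after_graceful_stop(
    stop_s=60,             # hypothetical
    start_s=30,            # hypothetical
)
print(poweroff, graceful)
```

With these placeholder values the graceful stop comes out ahead, but the
result flips if stopping the services is slow or hangs, so the outcome
really does depend on the application.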

In addition, there may be a risk of data loss if the power
goes down suddenly.  Such risks may be small, but if there's
anything we can do to avoid or minimize them, then we want
to take those steps.

>
> One thing we used to do (but had to disable because we couldn't get it
> 100% right at the time) was move off the healthy resources before
> shooting the node.  I think resurrecting this feature is a better
> approach.

Yes, that sounds good to me.
One thing I'm wondering: if the cluster manager was able
to confirm that all the resources were stopped on the failed node,
the node does not necessarily need to be turned off, does it?
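
A minimal sketch of that escalation idea, assuming the failed node is
still responsive enough to report the result of each stop. This is not
how Pacemaker implements recovery; `recover_node` and the `stop_resource`
callback are hypothetical names used only for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"    # every stop was confirmed, nothing more to do
    FENCE = "fence"  # escalate to STONITH (poweroff/reboot)


def recover_node(resources, stop_resource):
    """Try to stop every resource; fence only if a stop cannot be confirmed.

    `stop_resource` is a hypothetical callback that returns True when the
    resource is confirmed stopped, and False on failure or timeout.
    """
    for rsc in resources:
        if not stop_resource(rsc):
            return Action.FENCE
    return Action.NONE


# Example: the second resource cannot be confirmed stopped, so we escalate.
print(recover_node(["pgsql", "fs"], lambda rsc: rsc != "fs"))
```

If the node does not respond at all, no stop can be confirmed and the
sketch falls through to fencing, which matches the safety concern quoted
above.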

>
>> 5) STONITH priority
>>   Another reason why we hesitate using STONITH is the "cross counter"
>>   problem when split-brain occurred.
>>   It would be great if we can tune so that a node with resources
>>   running is most likely to survive.
>>
>>
>> 6) node fencing when the connectivity failure is detected by pingd.
>>   Currently we have to have the pingd constraints for all resources.
>>   It would be helpful to simplify the config and the recovery
>>   operation if we could configure the behavior to be the same as
>>   for a resource failure.
>
> I think this could be easily done by creating a new mode for pingd - 
> such that it "fails" when all connectivity is lost.
> Then it would just be a matter of setting on_fail=fence for pingd's
> monitor op.
>
>> Regarding to 1)-b), 4) and 5), I and my colleagues think that they
>> are important and we're now studying how we can implement them.
>
> Please let me know if you come up with anything.
> I don't have any real objection to the concepts - well, except maybe 4).
>
> You might also want to add bugzilla enhancements for these so that we
> (or I) don't forget about them.

Thank you for the comments. I will do that later.
My colleagues and I will probably ask you about the
implementation details later.

Thanks,

Keisuke MORI
NTT DATA Intellilink Corporation