[ClusterLabs] stonithd/fenced filling up logs

Digimer lists at alteeve.ca
Wed Oct 5 13:13:22 EDT 2016


On 05/10/16 12:56 PM, Israel Brewster wrote:
<trim>
>>> Yeah, I don't want that. If one of the nodes enters an unknown state,
>>> I want the system to notify me so I can decide the proper course of
>>> action - I don't want it to simply shut down the other machine or
>>> something.
>>
>> You do, actually. If a node isn't readily disposable, you need to
>> rethink your HA strategy. The service you're protecting is what matters,
>> not the machine hosting it at any particular time.
> 
> True. My hesitation, however, stems not from losing the machine without
> warning (the ability to do so without consequence being one of the major
> selling points of HA), but rather from losing the diagnostic
> opportunities presented *while* the machine is misbehaving. I'm
> borderline obsessive about knowing what went wrong and why; if the
> machine is shut down before I have a chance to see what state it is in,
> my chances of being able to figure out what happened greatly diminish.
> 
> As you say, though, this is something I'll simply need to get over if I
> want real HA (see below).

If that is the case, you can explore using "fabric fencing". Power
fencing is safer (from a human error perspective) and more popular
because it often returns the node to a working state and restores
redundancy faster.

However, from a pure safety perspective, all the cluster software cares
about is that the target can't provide cluster services. So in fabric
fencing, what happens is that the target isn't powered off, but instead
isolated from the world by severing its network links. I actually wrote
a POC using SNMP + managed Ethernet switches to provide "cheap" fencing
for people who didn't have IPMI or switched PDUs. Basically, when the
fence fired, it would log into the switches and turn down all the ports
used by the target.
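
To make that concrete, here's a rough Python sketch of the idea. The
switch address, community string and port indexes are invented, and a
real fence agent would do a lot more sanity checking, but the mechanism
is just flipping IF-MIB::ifAdminStatus on the target's uplink ports:

#!/usr/bin/env python3
# Rough sketch of a fabric fence: turn down the switch ports feeding the
# target node via SNMP (IF-MIB::ifAdminStatus, 1 = up, 2 = down).
# The switch address, community and port indexes below are made up.
import subprocess
import sys

SWITCH = "10.20.0.5"           # hypothetical managed switch
COMMUNITY = "fence-rw"         # hypothetical SNMP write community
TARGET_PORTS = [3, 4]          # ifIndex values of the victim's uplinks
ADMIN_STATUS_OID = "1.3.6.1.2.1.2.2.1.7"   # IF-MIB::ifAdminStatus

def set_port(if_index, state):
    """Set ifAdminStatus for one port; state is 1 (up) or 2 (down)."""
    subprocess.run(
        ["snmpset", "-v2c", "-c", COMMUNITY, SWITCH,
         "%s.%d" % (ADMIN_STATUS_OID, if_index), "i", str(state)],
        check=True)

def get_port(if_index):
    """Read ifAdminStatus back so we can confirm the fence really worked."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH,
         "%s.%d" % (ADMIN_STATUS_OID, if_index)],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else "off"
    state = 2 if action == "off" else 1
    for port in TARGET_PORTS:
        set_port(port, state)
    # Only report success if every port really reports the requested
    # state (output may be "down", "down(2)" or "2" depending on MIBs).
    want = ("down", "2") if state == 2 else ("up", "1")
    ok = all(any(tok in get_port(p) for tok in want) for p in TARGET_PORTS)
    sys.exit(0 if ok else 1)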

The main concern here is that someone, or something, might restore
access without first clearing the node's state (usually via a reboot).
So I still don't recommend it, but if your concern about being able to
analyze the cause of the hang is strong enough, and if you take
appropriate care not to let the node back in without first rebooting it,
it is an option.

>> Further, the whole role of pacemaker is to know what to do when things
>> go wrong (which you validate with plenty of creative failure testing
>> pre-production). A good HA system is one you won't touch for a long
>> time, possibly over a year. You don't want to be relying on rusty memory
>> for what to do while people are breathing down your neck because the
>> service is down.
> 
> True, although that argument would hold more weight if I worked for a
> company where everyone wasn't quite so nice :-) We've had outages before
> (one of the reasons I started looking at HA), and everyone was like
> "Well, we can't do our jobs without it, so please let us know when it's
> back up. Have a good day!"

I'm here to help with technical issues. Politics I leave to others. ;)

>> Trust the HA stack to do the right job, and validate that via testing.
> 
> Yeah, my testing is somewhat lacking. Probably contributes to my lack of
> trust.

My rule of thumb is that, *at a minimum*, you should allocate 2 days for
testing for each day you spend implementing. HA is worthless without
full and careful testing.

>>>> This is also why I said that your hardware matters.
>>>> Do your nodes have IPMI? (or iRMC, iLO, DRAC, RSA, etc)?
>>>
>>> I *might* have IPMI. I know my newer servers do. I'll have to check
>>> on that.
>>
>> You can tell from the CLI. I've got a section on how to locate and
>> configure IPMI from the command line here:
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2#What_is_IPMI
>>
>> It should port to most any distro/version.
> 
> Looks like I'm out of luck on the IPMI front. Neither my application
> servers nor my database servers have IPMI ports. I'll have to talk to my
> boss about getting controllable power strips or the like (unless there
> are better options than just cutting the power).
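
Before writing IPMI off entirely, it can be worth probing from the OS;
some boards hide the BMC behind a shared (sideband) LAN port rather
than a dedicated one. A quick, purely illustrative sketch (needs root
plus the dmidecode and ipmitool packages):

#!/usr/bin/env python3
# Probe for an on-board BMC even when there's no obvious IPMI port.
import subprocess

def run(cmd):
    """Run a command and return its output, or None if it fails."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None

# SMBIOS type 38 is "IPMI Device Information"; no record means no BMC.
smbios = run(["dmidecode", "--type", "38"])
print("SMBIOS IPMI record found" if smbios and "IPMI" in smbios
      else "No IPMI record in SMBIOS")

# If a BMC exists, ipmitool can talk to it once the ipmi_si/ipmi_devintf
# kernel modules are loaded.
bmc = run(["ipmitool", "mc", "info"])
print(bmc if bmc else "ipmitool could not reach a BMC")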

APC brand switched PDUs, like the AP7900 (or your country's equivalent),
are excellent fence devices. Of course, if your servers have dual PSUs,
get dual PDUs. Alternatively, I know the Raritan brand works (I wrote an
agent for them).
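
For what it's worth, once you have a switched PDU, registering it as a
fence device is only a couple of commands. A rough sketch driving pcs
from a script, with invented addresses, credentials and outlet numbers;
check 'pcs stonith describe fence_apc_snmp' for the exact parameter
names your fence-agents version expects:

#!/usr/bin/env python3
# Sketch of registering an APC switched PDU as a stonith device via pcs.
# The PDU address, credentials and outlet mapping are all invented.
import subprocess

def pcs(*args):
    """Run a pcs command, raising if it fails."""
    subprocess.run(["pcs", *args], check=True)

pcs("stonith", "create", "fence_pdu", "fence_apc_snmp",
    "ip=10.20.0.10",                     # hypothetical PDU address
    "username=fencer", "password=secret",
    "pcmk_host_map=node1:1;node2:2")     # cluster node -> PDU outlet

# With dual PSUs, create a second device for the second PDU and tie the
# two together with fencing levels so both outlets must be cut.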

>>> So where fencing comes in would be for the situations where one
>>> machine *thinks* the other is unavailable, perhaps due to a network
>>> issue, but in fact the other machine is still up and running, I
>>> guess? That would make sense, but the thought of software simply
>>> taking over and shutting down one of my machines, without even
>>> consulting me first, doesn't sit well with me at all. Even a restart
>>> would be annoying - I typically like to see if I can figure out what
>>> is going on before restarting, since restarting often eliminates the
>>> symptoms that help diagnose problems.
>>
>> That is a classic example, but not the only one. Perhaps the target is
>> hung, but might recover later? You just don't know, and not knowing is
>> all you know, *until* you fence it.
> 
> ....or until I log onto the machine and take a look at what is going on :-)

Remember, in HA, the services matter, the nodes don't. Your focus on
diagnosing the nodes puts the services at risk. Nodes are disposable. If
one acts up, rebuild or replace it.

>> I can understand that it "doesn't sit well with you", but you need to
>> let that go. HA software is not like most other applications.
> 
> Understood. It might help if I knew there would be good documentation of
> the current state of the machine before the shutdown, but I don't know
> if that is even possible. So I guess I'll just have to get over it, be
happy that I didn't lose any services, and move on :-)

As mentioned, anything that tries to account for the node puts the
services you care about at risk. If a node enters an unknown state, it
must be dealt with promptly so that recovery can begin.

>> If a node gets shot, in pacemaker/corosync, there is always going to be
>> a reason for it. Your job is to sort out why, after the fact. The
>> important part is that your services continued to be available.
> 
> Gotcha. Makes sense.
> 
>> Note that you can bias which node wins in a case where both are alive
>> but something blocked comms. You do this by setting a delay on the fence
>> method for the node you prefer. So it works like this:
>>
>> Say node 1 is your primary node where your services normally live, and
>> node 2 is the backup. Something breaks comms and both declare the other
>> dead and both initiate a fence. Node 2 looks up how to fence node 1,
>> sees a delay and pauses for $delay seconds. Node 1 looks up how to fence
>> node 2, sees no delay and pulls the trigger right away. Node 2 will die
>> before it ever exits its delay.
> 
> I was wondering about that.
> 
> So in any case, I guess the next step here is to figure out how to do
> fencing properly, using controllable power strips or the like. Back to
> the drawing board!
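
On the delay trick quoted above: it's just one extra parameter on the
fence device that targets your preferred node. A hedged sketch, again
with invented device names and an arbitrary 15-second value (newer
pacemaker versions also offer pcmk_delay_base for the same purpose):

#!/usr/bin/env python3
# Sketch of biasing the fence race so node 1 survives if both nodes try
# to shoot each other at once. Device names and delay value are examples.
import subprocess

def pcs(*args):
    subprocess.run(["pcs", *args], check=True)

# fence_node1 is the device that *targets* node 1 (the preferred
# survivor): anyone trying to shoot node 1 waits out the delay first...
pcs("stonith", "update", "fence_node1", "delay=15")

# ...while the device targeting node 2 fires immediately, so in a fence
# race node 2 is dead long before its attempt on node 1 goes through.
pcs("stonith", "update", "fence_node2", "delay=0")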

https://alteeve.ca/w/AN!Cluster_Tutorial_2#A_Note_on_Patience

:)

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



