[ClusterLabs] stonithd/fenced filling up logs

Tue Oct 4 19:50:59 EDT 2016

On Oct 4, 2016, at 3:38 PM, Digimer <lists at alteeve.ca> wrote:
> 
> On 04/10/16 07:09 PM, Israel Brewster wrote:
>> On Oct 4, 2016, at 3:03 PM, Digimer <lists at alteeve.ca> wrote:
>>> 
>>> On 04/10/16 06:50 PM, Israel Brewster wrote:
>>>> On Oct 4, 2016, at 2:26 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>> 
>>>>> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>>>>>> I sent this a week ago, but never got a response, so I'm sending it
>>>>>> again in the hopes that it just slipped through the cracks. It seems to
>>>>>> me that this should just be a simple misconfiguration on my part
>>>>>> causing the issue, but I suppose it could be a bug as well.
>>>>>> 
>>>>>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>>>>>> 6.8. One cluster is simply sharing an IP, while the other one has
>>>>>> numerous services and IPs set up between the two machines in the
>>>>>> cluster. Both appear to be working fine. However, I was poking around
>>>>>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>>>>>> and fenced were using "significant" amounts of processing power - 25%
>>>>>> for corosync on the current primary node, with fenced and stonithd often
>>>>>> showing 1-2% (not horrible, but more than any other process). In looking
>>>>>> at my logs, I see that they are dumping messages like the following to
>>>>>> the messages log every second or two:
>>>>>> 
>>>>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
>>>>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done: Operation reboot of fai-dbs1 by fai-dbs2 for stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>>>>>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify: Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client stonith_admin.cman.15835
>>>>>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence fai-dbs2 (reset)
>>>>> 
>>>>> The above shows that CMAN is asking pacemaker to fence a node. Even
>>>>> though fencing is disabled in pacemaker itself, CMAN is configured to
>>>>> use pacemaker for fencing (fence_pcmk).
>>>> 
>>>> I never did any specific configuration of CMAN. Perhaps that's the
>>>> problem, and I missed some configuration steps during setup? I just
>>>> followed the directions at
>>>> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs
>>>> They disabled stonith in pacemaker via the
>>>> "pcs property set stonith-enabled=false" command. Are there separate CMAN
>>>> configs I need to set up to get everything copacetic? If so, can you point
>>>> me to some sort of guide/tutorial for that?
>>> 
>>> Disabling stonith is not possible in cman, and very ill-advised in
>>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>>> doesn't understand the role of fencing.
>>> 
>>> In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
>>> agent, as it should. So when something went wrong, corosync detected it
>>> and informed cman, which then asked pacemaker to fence the peer. With
>>> pacemaker not having stonith configured and enabled, it could do
>>> nothing. So pacemaker reported that the fence failed, and cman went into
>>> an infinite loop trying again and again to fence (as it should have).
>>> 
>>> You must configure stonith (exactly how depends on your hardware), then
>>> enable stonith in pacemaker.
>>> 
>> 
>> Gotcha. There is nothing special about the hardware; it's just two physical boxes connected to the network. So I guess I've got a choice of either a) live with the logging/load situation (since the system does work perfectly as-is other than the excessive logging), or b) spend some time researching stonith to figure out what it does and how to configure it. Thanks for the pointers.
> 
> The system is not working perfectly. Consider it like this: you're
> flying, and your landing gear is busted. You think everything is fine
> because you're not trying to land yet.

Ok, good analogy :-)

> 
> Fencing is needed to force a node that has entered an unknown state
> into a known state (usually 'off'). It does this by reaching out over
> some independent mechanism, like IPMI or a switched PDU, and forcing the
> target to shut down.

Yeah, I don't want that. If one of the nodes enters an unknown state, I want the system to notify me so I can decide the proper course of action - I don't want it to simply shut down the other machine or something.

> This is also why I said that your hardware matters.
> Do your nodes have IPMI (or iRMC, iLO, DRAC, RSA, etc.)?

I *might* have IPMI. I know my newer servers do. I'll have to check on that.

> 
> If you don't need to coordinate actions between the nodes, you don't
> need HA software, just run things everywhere all the time. If, however,
> you do need to coordinate actions, then you need fencing.

The coordination is, of course, the whole point - an IP/service/whatever runs on one machine, and should that machine become unavailable (for whatever reason), it automatically moves to the other machine. My services could, of course, run on both just fine, but that doesn't help with accessing said services - that still has to go to one or the other. 

So where fencing comes in would be for the situations where one machine *thinks* the other is unavailable, perhaps due to a network issue, but in fact the other machine is still up and running, I guess? That would make sense, but the thought of software simply taking over and shutting down one of my machines, without even consulting me first, doesn't sit well with me at all. Even a restart would be annoying - I typically like to see if I can figure out what is going on before restarting, since restarting often eliminates the symptoms that help diagnose problems.

Now if there is a version of fencing that simply e-mails/texts/whatever me and says "Ummm... something is wrong with that machine over there, you need to do something about it, because I can't guarantee operation otherwise", I could go for that. 
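
For illustration only, here is a minimal Python sketch of the kind of "tell me, don't act" notification described above. It is not a fence agent and it does not come from pacemaker, cman, or pcs; the function name build_node_alert and the example.com addresses are hypothetical. It only composes the message, so it does nothing to put the suspect node into a known state, which is the job fencing does. Actually sending the message (for example via smtplib) is left out.

```python
from email.message import EmailMessage


def build_node_alert(suspect_node, reporting_node, admin_addr, from_addr):
    """Compose (but do not send) an alert about a node in an unknown state.

    All parameter names here are illustrative assumptions, not part of
    any cluster tool's interface.
    """
    msg = EmailMessage()
    msg["Subject"] = f"[cluster] {suspect_node} is in an unknown state"
    msg["From"] = from_addr
    msg["To"] = admin_addr
    msg.set_content(
        f"{reporting_node} can no longer confirm the state of {suspect_node}.\n"
        "No automatic action was taken; manual intervention is required.\n"
    )
    return msg


if __name__ == "__main__":
    # Node names are taken from the log excerpt; addresses are placeholders.
    alert = build_node_alert(
        "fai-dbs2", "fai-dbs1", "admin@example.com", "cluster@example.com"
    )
    print(alert)
```

A message like this only reports the problem. The surviving node still cannot safely take over resources until someone confirms what the other node is doing.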
