[ClusterLabs] stonithd/fenced filling up logs

Wed Oct 5 18:01:08 UTC 2016

On Oct 5, 2016, at 9:38 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> On 10/05/2016 11:56 AM, Israel Brewster wrote:
>> 
>>>>>>>> I never did any specific configuring of CMAN. Perhaps that's the
>>>>>>>> problem? Did I miss some configuration steps on setup? I just
>>>>>>>> followed the directions here:
>>>>>>>> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
>>>>>>>> which disabled stonith in pacemaker via the
>>>>>>>> "pcs property set stonith-enabled=false" command. Are there
>>>>>>>> separate CMAN configs I need to do to get everything copacetic?
>>>>>>>> If so, can you point me to some sort of guide/tutorial for that?
> 
> If you ran "pcs cluster setup", it configured CMAN for you. Normally you
> don't need to modify those values, but you can see them in
> /etc/cluster/cluster.conf.

Good to know. So I'm probably OK on that front.

>> 
>> So in any case, I guess the next step here is to figure out how to do
>> fencing properly, using controllable power strips or the like. Back to
>> the drawing board!
> 
> It sounds like you're on the right track for fencing, but it may not be
> your best next step. Currently, your nodes are trying to fence each
> other endlessly, so if you get fencing working, one of them will
> succeed, and you just have a new problem. :-)
> 
> Check the logs for the earliest occurrence (after starting the cluster)
> of the "Requesting Pacemaker fence" message. Look back from that time in
> /var/log/messages, /var/log/cluster/*, and /var/log/pacemaker.log (not
> necessarily all will be present on your system) to try to figure out why
> it wants to fence.
> 
> One thing I noticed is that you're running CentOS 6.8, but your
> pacemaker version is 1.1.11. CentOS 6.8 shipped with 1.1.14, so maybe
> you partially upgraded your system from an earlier OS version? I'd try
> applying all updates (especially cman, libqb, corosync, and pacemaker).
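
For the log check suggested above, here is a minimal Python sketch that finds the first "Requesting Pacemaker fence" line in each log and prints the lines leading up to it. The helper name and the 30-line context window are assumptions for illustration. It also assumes the files are plain text and that the first match in each file belongs to the current cluster start; if the logs span several restarts, trim them first.

```python
import glob
from collections import deque

NEEDLE = "Requesting Pacemaker fence"
# Log locations named in the thread; not all will exist on every system.
LOG_GLOBS = ["/var/log/messages", "/var/log/cluster/*", "/var/log/pacemaker.log"]


def context_before_first_match(lines, needle=NEEDLE, before=30):
    """Return (line_number, preceding_lines) for the first line that
    contains needle, or None if no line matches."""
    recent = deque(maxlen=before)  # rolling window of the last `before` lines
    for number, line in enumerate(lines, start=1):
        if needle in line:
            return number, list(recent)
        recent.append(line.rstrip("\n"))
    return None


def main():
    for pattern in LOG_GLOBS:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path, errors="replace") as handle:
                    found = context_before_first_match(handle)
            except OSError as exc:  # unreadable file, directory, etc.
                print(f"{path}: skipped ({exc})")
                continue
            if found is None:
                print(f"{path}: no match")
                continue
            number, context = found
            print(f"{path}: first match at line {number}; preceding lines:")
            for line in context:
                print(f"    {line}")


if __name__ == "__main__":
    main()
```

Reading /var/log/messages usually requires root, so run it with sufficient privileges on each node and compare the output across nodes.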

I think what you're seeing is pacemaker on my primary DB server, which is still at CentOS 6.7. The other servers I've managed to update, but I haven't figured out a *good* HA solution for my DB server (PostgreSQL 9.4 running streaming replication with named replication slots). That is, I can fail over *relatively* easily (touch a file on the secondary, move the IP, and hope all the persistent DB connections reconnect without issue), but getting the demoted primary back up and running is more of a chore (the pg_rewind feature of PostgreSQL 9.5 looks to help with this, but I'm not up to 9.5 yet). As such, I haven't updated the primary DB server as much as some of the others.

Proper integration of the DB with pacemaker is something I need to look into again, but I took a stab at it when I was first setting up the application cluster, and didn't have much luck.

>>>> Now if there is a version of fencing that simply
>>>> e-mails/texts/whatever me and says "Ummm... something is wrong with
>>>> that machine over there, you need to do something about it, because I
>>>> can't guarantee operation otherwise", I could go for that. 
> 
> As digimer mentioned elsewhere, one variation is to use "fabric"
> fencing, i.e. cutting off all external access (disk and/or network) to
> the node. That leaves it up but unable to cause any trouble, so you can
> investigate.
> 
> If the disk is all local, or accessed over the network, then asking an
> intelligent switch to cut off network access is sufficient. If the disk
> is shared (e.g. iSCSI), then you need to cut it off, too.

All disks are local, which would simplify this option, especially considering that I don't have any remote power control options available at the moment. I mentioned getting switched PDUs to my boss, and he'll look into it, but thinks it might not fit into his budget. If I could simply down the proper ports on the Cisco switch(es) the machines are connected to, that could be a viable alternative without any additional hardware needed.
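
To make the "down the proper ports" idea concrete, here is a rough Python sketch that only builds the Cisco IOS command sequence for isolating a node. The node names, interface names, and the function itself are hypothetical placeholders. It does not connect to a switch, and it is not a fence agent: a real agent also has to deliver the commands, confirm the ports are actually down, and report the result to the cluster. I believe the fence-agents package ships an SNMP-based agent, fence_ifmib, that is meant for this kind of switch-port fencing, so that is probably the better route in practice.

```python
# Hypothetical mapping of cluster node -> switch interfaces it is plugged into.
# Replace with the real ports; every path to the node must be listed.
NODE_PORTS = {
    "node1": ["GigabitEthernet0/1"],
    "node2": ["GigabitEthernet0/2", "GigabitEthernet0/3"],
}


def port_commands(node, action="off", node_ports=NODE_PORTS):
    """Build the IOS commands that disable ("off") or re-enable ("on")
    every switch port belonging to the given node."""
    if action not in ("off", "on"):
        raise ValueError(f"unknown action: {action!r}")
    verb = "shutdown" if action == "off" else "no shutdown"
    commands = ["configure terminal"]
    for interface in node_ports[node]:
        commands += [f"interface {interface}", verb]
    commands.append("end")
    return commands


if __name__ == "__main__":
    for command in port_commands("node2", "off"):
        print(command)
```

The point of keeping the port list per node is that fabric fencing only works if every interface is cut; missing one leaves the node able to reach the network.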

Thanks!

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

> 
>>> No, that is not fencing.
>>> 
>>> -- 
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person without
>>> access to education?
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
