[ClusterLabs] stonithd/fenced filling up logs

Israel Brewster israel at ravnalaska.net
Tue Oct 4 22:50:08 UTC 2016


On Oct 4, 2016, at 2:26 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>> I sent this a week ago, but never got a response, so I'm sending it
>> again in the hopes that it just slipped through the cracks. It seems to
>> me that this should just be a simple mis-configuration on my part
>> causing the issue, but I suppose it could be a bug as well.
>> 
>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>> 6.8. One cluster is simply sharing an IP, while the other one has
>> numerous services and IP's set up between the two machines in the
>> cluster. Both appear to be working fine. However, I was poking around
>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>> and fenced were using "significant" amounts of processing power - 25%
>> for corosync on the current primary node, with fenced and stonithd often
>> showing 1-2% (not horrible, but more than any other process). In looking
>> at my logs, I see that they are dumping messages like the following to
>> the messages log every second or two:
>> 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>> Operation reboot of fai-dbs1 by fai-dbs2 for
>> stonith_admin.cman.15835 at fai-dbs2.c5161517: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>> stonith_admin.cman.15835
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>> fai-dbs2 (reset)
> 
> The above shows that CMAN is asking pacemaker to fence a node. Even
> though fencing is disabled in pacemaker itself, CMAN is configured to
> use pacemaker for fencing (fence_pcmk).

I never did any specific configuration of CMAN; perhaps that's the problem, and I missed some configuration steps during setup? I just followed the directions here: http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs, which disable stonith in pacemaker via the "pcs property set stonith-enabled=false" command. Are there separate CMAN configuration steps I need to do to get everything copacetic? If so, can you point me to some sort of guide/tutorial for that?
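
For reference, the /etc/cluster/cluster.conf that pcs generates on CentOS 6 looks roughly like the sketch below; I'm reconstructing it from the documentation rather than pasting mine verbatim, so the exact attribute and method names on my boxes may differ slightly:

    <cluster config_version="1" name="dbs_cluster">
      <cman expected_votes="1" two_node="1"/>
      <clusternodes>
        <clusternode name="fai-dbs1" nodeid="1">
          <fence>
            <method name="pcmk-method">
              <!-- cman hands fencing requests for this node off to pacemaker -->
              <device name="pcmk-redirect" port="fai-dbs1"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="fai-dbs2" nodeid="2">
          <fence>
            <method name="pcmk-method">
              <device name="pcmk-redirect" port="fai-dbs2"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <!-- fence_pcmk is only a redirector; pacemaker still needs real fence devices -->
        <fencedevice agent="fence_pcmk" name="pcmk-redirect"/>
      </fencedevices>
    </cluster>

If I'm reading that right, CMAN forwards its fencing requests to pacemaker, and with stonith-enabled=false pacemaker has no device to satisfy them with, which would explain the "No such device" errors above.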

> 
>> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
>> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
>> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
>> 'fai-dbs2' with device '(any)'
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with <any>
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:    error: remote_op_done:
>> Operation reboot of fai-dbs2 by fai-dbs1 for
>> stonith_admin.cman.15394 at fai-dbs1.bc3f5d73: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
>> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
>> stonith_admin.cman.15394
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
>> (reset) failed with rc=237
>> 
>> After seeing this one the one cluster, I checked the logs on the other
>> and sure enough I'm seeing the same thing there. As I mentioned, both
>> nodes in both clusters *appear* to be operating correctly. For example,
>> the output of "pcs status" on the small cluster is this:
>> 
>> [root at fai-dbs1 ~]# pcs status
>> Cluster name: dbs_cluster
>> Last updated: Tue Sep 27 08:59:44 2016
>> Last change: Thu Mar  3 06:11:00 2016
>> Stack: cman
>> Current DC: fai-dbs1 - partition with quorum
>> Version: 1.1.11-97629de
>> 2 Nodes configured
>> 1 Resources configured
>> 
>> 
>> Online: [ fai-dbs1 fai-dbs2 ]
>> 
>> Full list of resources:
>> 
>> virtual_ip    (ocf::heartbeat:IPaddr2):    Started fai-dbs1
>> 
>> And on the larger cluster, it has services running across both nodes of
>> the cluster, and I've been able to move stuff back and forth without
>> issue. Both nodes have the stonith-enabled property set to false, and
>> no-quorum-policy set to ignore (since they are only two nodes in the
>> cluster).
>> 
>> What could be causing the log messages? Is the CPU usage normal, or
>> might there be something I can do about that as well? Thanks.
> 
> It's not normal; most likely, the failed fencing is being retried endlessly.

That does appear to be what the logs indicate, yes.
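
If it's useful, something like the following should show how fast those attempts are piling up (path assumes the default syslog location these messages are already going to):

    grep -c 'Requesting Pacemaker fence' /var/log/messages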

> 
> You'll want to figure out why CMAN is asking for fencing.

Any hints as to how?

> You may have
> some sort of communication problem between the nodes (that might be a
> factor in corosync's CPU usage, too).

That doesn't *appear* to be the case, judging by pcs status (which shows both nodes online) and by my ability to move resources between the nodes without difficulty. But maybe CMAN itself is having issues? A firewall port not open, say, such that CMAN can't communicate even though everything else can?
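
When I get a chance I'll run through the basics on both nodes, something along these lines (5404/5405 are just the usual corosync UDP defaults, so that last check is only a rough one):

    cman_tool status                            # membership/quorum as cman sees it
    cman_tool nodes                             # per-node state from cman's point of view
    fence_tool ls                               # fence domain membership
    grep -i fence /var/log/cluster/fenced.log   # what fenced itself has been logging
    iptables -L -n | grep -E '5404|5405'        # make sure the corosync/cman ports aren't blocked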

> 
> Once that's straightened out, it's a good idea to actually configure and
> enable fencing :)

Probably, although my understanding is that this is less of an issue with a two-node cluster than with larger ones. That said, that's something to figure out another day. As you said: once that's straightened out :-)
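
For when I do get there: assuming real fence devices (IPMI, a switched PDU, etc.) are available, I gather the pacemaker side would be something along these lines, with fence_ipmilan purely as an example and the addresses/credentials below made up:

    # one fence device per node; pcmk_host_list tells pacemaker which node it can fence
    pcs stonith create fence-dbs1 fence_ipmilan pcmk_host_list="fai-dbs1" \
        ipaddr="192.168.1.101" login="admin" passwd="xxxx" lanplus="1" \
        op monitor interval=60s
    pcs stonith create fence-dbs2 fence_ipmilan pcmk_host_list="fai-dbs2" \
        ipaddr="192.168.1.102" login="admin" passwd="xxxx" lanplus="1" \
        op monitor interval=60s
    # then turn stonith back on
    pcs property set stonith-enabled=true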

Thanks!
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------
