[ClusterLabs] stonithd/fenced filling up logs

Tue Oct 4 18:26:45 EDT 2016

On 10/04/2016 11:31 AM, Israel Brewster wrote:
> I sent this a week ago, but never got a response, so I'm sending it
> again in the hopes that it just slipped through the cracks. It seems to
> me that this should just be a simple mis-configuration on my part
> causing the issue, but I suppose it could be a bug as well.
> 
> I have two two-node clusters set up using corosync/pacemaker on CentOS
> 6.8. One cluster is simply sharing an IP, while the other one has
> numerous services and IPs set up between the two machines in the
> cluster. Both appear to be working fine. However, I was poking around
> today, and I noticed that on the single IP cluster, corosync, stonithd,
> and fenced were using "significant" amounts of processing power - 25%
> for corosync on the current primary node, with fenced and stonithd often
> showing 1-2% (not horrible, but more than any other process). In looking
> at my logs, I see that they are dumping messages like the following to
> the messages log every second or two:
> 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
> Operation reboot of fai-dbs1 by fai-dbs2 for
> stonith_admin.cman.15835 at fai-dbs2.c5161517: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
> stonith_admin.cman.15835
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
> fai-dbs2 (reset)

The above shows that CMAN is asking pacemaker to fence a node. Even
though fencing is disabled in pacemaker itself, CMAN is configured to
use pacemaker for fencing (fence_pcmk).
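
To see how often CMAN is making these requests and which node it is targeting, the fence_pcmk lines can be tallied from the messages log. Below is a minimal Python sketch, not an official tool. It assumes one log entry per line (the excerpts in this mail are wrapped) and that the fence_pcmk wording matches the two lines quoted in this thread; the function name is made up for illustration.

```python
import re
from collections import Counter

# Patterns are modeled on the fence_pcmk lines quoted in this thread.
# Assumption: other versions may word these messages differently.
REQUEST = re.compile(
    r"fence_pcmk\[\d+\]: Requesting Pacemaker fence (\S+) \((\w+)\)")
FAILURE = re.compile(
    r"fence_pcmk\[\d+\]: Call to fence (\S+) \((\w+)\) failed with rc=(\d+)")


def summarize_fence_requests(lines):
    """Count CMAN-originated fence requests and failures per target node."""
    requested = Counter()   # target node -> number of requests
    failed = Counter()      # (target node, return code) -> number of failures
    for line in lines:
        match = REQUEST.search(line)
        if match:
            requested[match.group(1)] += 1
            continue
        match = FAILURE.search(line)
        if match:
            failed[(match.group(1), int(match.group(3)))] += 1
    return requested, failed


# The two fence_pcmk lines from the log excerpt, unwrapped to one line each.
sample = [
    "Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: "
    "Requesting Pacemaker fence fai-dbs2 (reset)",
    "Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: "
    "Call to fence fai-dbs2 (reset) failed with rc=237",
]
requested, failed = summarize_fence_requests(sample)
print(dict(requested))
print(dict(failed))
```

Running the same function over the real log (for example, the lines of /var/log/messages) on both nodes should show whether each node is repeatedly trying to fence the other.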

> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
> 'fai-dbs2' with device '(any)'
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with <any>
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:    error: remote_op_done:
> Operation reboot of fai-dbs2 by fai-dbs1 for
> stonith_admin.cman.15394 at fai-dbs1.bc3f5d73: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
> stonith_admin.cman.15394
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
> (reset) failed with rc=237
> 
> After seeing this on the one cluster, I checked the logs on the other
> and sure enough I'm seeing the same thing there. As I mentioned, both
> nodes in both clusters *appear* to be operating correctly. For example,
> the output of "pcs status" on the small cluster is this:
> 
> [root@fai-dbs1 ~]# pcs status
> Cluster name: dbs_cluster
> Last updated: Tue Sep 27 08:59:44 2016
> Last change: Thu Mar  3 06:11:00 2016
> Stack: cman
> Current DC: fai-dbs1 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured
> 1 Resources configured
> 
> 
> Online: [ fai-dbs1 fai-dbs2 ]
> 
> Full list of resources:
> 
>  virtual_ip (ocf::heartbeat:IPaddr2): Started fai-dbs1
> 
> And on the larger cluster, it has services running across both nodes of
> the cluster, and I've been able to move stuff back and forth without
> issue. Both nodes have the stonith-enabled property set to false, and
> no-quorum-policy set to ignore (since there are only two nodes in the
> cluster).
> 
> What could be causing the log messages? Is the CPU usage normal, or
> might there be something I can do about that as well? Thanks.

It's not normal; most likely, the failed fencing is being retried endlessly.
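
One rough way to confirm a retry loop is to measure the gaps between successive fence_pcmk requests for the same node. The Python sketch below is only an illustration under stated assumptions: one log entry per line, syslog timestamps without a year (so the year is passed in), and arbitrary thresholds (`max_gap`, `min_repeats`) that I picked for the example. The function names are hypothetical, and only the first sample line comes from the log above; the later timestamps are made-up test input.

```python
import re
from datetime import datetime

# Matches the "Requesting Pacemaker fence" line quoted in this thread and
# captures the syslog timestamp and the target node.
REQUEST = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ fence_pcmk\[\d+\]: "
    r"Requesting Pacemaker fence (\S+)")


def request_times(lines, target, year=2016):
    """Return timestamps of fence requests aimed at `target`."""
    times = []
    for line in lines:
        match = REQUEST.match(line)
        if match and match.group(2) == target:
            # Collapse the double space syslog uses for single-digit days.
            stamp = " ".join(match.group(1).split())
            times.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def retry_gaps(times):
    """Seconds between each request and the next one."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_like_retry_loop(gaps, max_gap=5.0, min_repeats=3):
    """Heuristic: several requests in a row, each within max_gap seconds."""
    return len(gaps) >= min_repeats and all(0 <= g <= max_gap for g in gaps)


# First line is from the thread; the rest are hypothetical follow-ups.
sample = [
    "Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence fai-dbs2 (reset)",
    "Sep 27 08:51:52 fai-dbs1 fence_pcmk[15401]: Requesting Pacemaker fence fai-dbs2 (reset)",
    "Sep 27 08:51:53 fai-dbs1 fence_pcmk[15409]: Requesting Pacemaker fence fai-dbs2 (reset)",
    "Sep 27 08:51:55 fai-dbs1 fence_pcmk[15417]: Requesting Pacemaker fence fai-dbs2 (reset)",
]
gaps = retry_gaps(request_times(sample, "fai-dbs2"))
print(gaps, looks_like_retry_loop(gaps))
```

If the real log shows the same short, steady gaps for hours on end, that points to a request that never succeeds and is simply being reissued.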

You'll want to figure out why CMAN is asking for fencing. You may have
some sort of communication problem between the nodes (that might be a
factor in corosync's CPU usage, too).

Once that's straightened out, it's a good idea to actually configure and
enable fencing :)

> 
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> -----------------------------------------------