[ClusterLabs] ask for help for a pacemaker problem

Thu Jul 26 12:37:55 EDT 2018

On Wed, 2018-07-25 at 23:43 +0800, 李培 wrote:
> Dear all
> 
> I have a problem when I use pacemaker.
> 
> The corosync.log on both nodes grows to 1 GB in about one hour.
> 
> On one node, named paas-controller-22-0-2-12, the corosync.log
> contains only one kind of message, as below:
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-12 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.11 -  not shutting down
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-12 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.11 -  not shutting down
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-12 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.11 -  not shutting down
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-12 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.11 -  not shutting 
> 
> 
> On the other node, named paas-controller-22-0-2-11, the corosync.log
> contains only one kind of message, as below:
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-11 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.12 -  not shutting down
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-11 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.12 -  not shutting down
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-11 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.12 -  not shutting down
> Jul 23 14:00:06 [15036] paas-controller-22-0-2-11 cib:  error:
> cib_process_shutdown_req: 
> Shutdown ACK from 22.0.2.12 -  not shutting 
> 
> It seems that the two nodes do not respond to each other's shutdown
> requests, so the message keeps being sent.
> 
> Have any of you ever encountered this issue?
> 
> How did it happen, and how can it be solved?
>  
> I am looking forward to hearing from you.
> 
> Thanks in advance.
> 
> Sincerely yours

This is interesting. At least one of the nodes should have an info-
level log message like "Shutdown REQ from ..." before these messages
start.

For this to happen, one of the nodes has to receive a shutdown request
from the other and acknowledge it with a reply. The node that sent the
request then somehow has no record of having sent it, so when the
acknowledgement arrives, it logs this message.

The funny (?) part is that the requesting node then replies to the
acknowledgement, and the other node (wrongly) treats that reply as an
acknowledgement of one of its own shutdown requests. It has none
outstanding, so it logs the same message and replies back. Infinite
loop :-/
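
To make the ping-pong concrete, here is a toy Python model of it. This is
only a sketch of the behaviour described above, not Pacemaker's actual
code: the Node class, the shutdown_requested flag, the simulate() helper
and the "request acknowledged" string are all made up for illustration.
Only the "not shutting down" text is taken from the logs quoted above.

```python
from collections import deque


class Node:
    """Toy stand-in for one node's cib process (hypothetical, not Pacemaker code)."""

    def __init__(self, name):
        self.name = name
        # True only if this node remembers having asked to shut down.
        self.shutdown_requested = False
        self.log = []

    def on_shutdown_ack(self, sender):
        """Handle a shutdown ACK; return True if a reply is sent back."""
        if self.shutdown_requested:
            # Hypothetical wording for the expected case.
            self.log.append(f"Shutdown ACK from {sender} - request acknowledged")
            return False
        self.log.append(f"Shutdown ACK from {sender} -  not shutting down")
        # Assumed faulty behaviour: the unexpected ACK is answered anyway.
        return True


def simulate(max_messages=6, requester_remembers=False):
    """Deliver at most max_messages ACKs between two nodes.

    Returns both logs and whether a message was still in flight at the end.
    """
    a, b = Node("22.0.2.11"), Node("22.0.2.12")
    a.shutdown_requested = requester_remembers
    nodes = {a.name: a, b.name: b}

    # b believes it received a shutdown request from a and acknowledges it.
    in_flight = deque([(b.name, a.name)])
    delivered = 0
    while in_flight and delivered < max_messages:
        sender, receiver = in_flight.popleft()
        delivered += 1
        if nodes[receiver].on_shutdown_ack(sender):
            in_flight.append((receiver, sender))
    return a.log, b.log, bool(in_flight)


if __name__ == "__main__":
    a_log, b_log, still_looping = simulate()
    for line in a_log + b_log:
        print(line)
    print("message still in flight:", still_looping)
```

In this model the exchange only stops because of the max_messages cap.
With requester_remembers=True the first ACK is accepted and nothing more
is sent, which is the normal case.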

I've opened a bug for the loop:

https://bugs.clusterlabs.org/show_bug.cgi?id=5361

However, an unanswered question is how the loop got started. One of the
nodes thought it received a shutdown request, but the other node didn't
think it sent one. That is a mystery here. If you can find the
"Shutdown REQ" message, the logs from both nodes around that time might
shed some light.
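
Since the logs are around 1 GB, a small script may be easier than paging
through them by hand. The following is just one possible way to locate
the first "Shutdown REQ" and "Shutdown ACK" lines; the first_matches()
helper is hypothetical, and the log path has to be supplied by you on the
command line, since it depends on your configuration.

```python
import sys


def first_matches(lines, patterns=("Shutdown REQ", "Shutdown ACK")):
    """Return {pattern: (line_number, line)} for the first line containing each pattern."""
    found = {}
    for number, line in enumerate(lines, start=1):
        for pattern in patterns:
            if pattern not in found and pattern in line:
                found[pattern] = (number, line.rstrip("\n"))
        if len(found) == len(patterns):
            break  # everything located, no need to read the rest of the file
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Iterating over the file object reads it line by line, so the whole
    # log is never held in memory at once.
    with open(sys.argv[1], errors="replace") as log_file:
        for pattern, (number, line) in first_matches(log_file).items():
            print(f"{pattern!r} first seen at line {number}: {line}")
```

Run it against the log on each node and compare the timestamps of the
first matches.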
-- 
Ken Gaillot <kgaillot at redhat.com>