[ClusterLabs] Pacemaker shutting down peer node

Ken Gaillot kgaillot at redhat.com
Fri Jun 16 12:43:50 EDT 2017


On 06/16/2017 11:21 AM, Jaz Khan wrote:
> Hi,
> 
> I have checked node ha-apex2.
> The log on that machine in /var/log/messages says "systemd: Power
> button pressed" and "Shutting down....", but this message appeared
> right when the ha-apex1 node scheduled the shutdown, only seconds apart.
> 
> It seems like the peer node (ha-apex1) sent some kind of power-off
> request and ha-apex2 obeyed it.
>  
> On node ha-apex1 it clearly says "Scheduling Node ha-apex2 for shutdown",
> which seems like it scheduled this task to be executed on the peer node.

That's the cluster's response to systemd's shutdown request.

Something in your system is triggering the "power button pressed" event.
I believe that message usually originates from /etc/acpi/powerbtn.sh.
(As an aside, it's usually a good idea to disable ACPI on servers.)
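
If you just want the OS to ignore the event, here's a minimal sketch
(assuming a systemd-based distribution where systemd-logind is what's
handling the button, as the "systemd: Power button pressed" message
suggests; adjust if acpid is what's reacting on your boxes):

    # /etc/systemd/logind.conf
    [Login]
    HandlePowerKey=ignore

    # apply the change, and keep acpid (if installed) from handling
    # the button on its own
    systemctl restart systemd-logind
    systemctl mask acpid    # only if nothing else on the box needs acpid

That only silences the OS reaction, of course; it doesn't tell you what
is generating the event in the first place.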

In my experience, "Power button pressed" usually means a real person
pushed a real power button. (My favorite time was when a hosting
provider labeled some physical machines incorrectly and kept rebooting a
server used by the company I worked for at the time, wondering why it
wasn't having the intended effect.) But I'm sure it's possible it's
being generated via IPMI or something.
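
If you want to rule the BMC in or out, its event log is worth a look.
As a rough sketch (assuming ipmitool is installed and the local
interface works on your hardware):

    # recent System Event Log entries -- look for power button
    # assertions around Jun 14 15:52
    ipmitool sel list | tail -n 20

    # current chassis state, including the last recorded power event
    ipmitool chassis status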

I don't think any cluster fence agents could be the cause because you
don't see any fencing messages in your logs, and fence agents should
always use a hard poweroff, not something that can be intercepted by the OS.
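
If you want to double-check that, something along these lines should
show whether any fence action was ever recorded for that node (exact
output varies by Pacemaker version):

    # fencing history for the node that went down
    stonith_admin --history ha-apex2

    # and/or search the logs around the time in question
    grep -i stonith /var/log/messages | grep 'Jun 14 15:5'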

> My servers are running in production, please help me out. I really do
> not want anything to happen to any of the nodes. I hope you understand
> the seriousness of this issue.
> 
> NOTE: This didn't happen only on this cluster group of nodes. It has
> also happened a few times on another cluster group of machines.
> 
> Look at these two messages from the ha-apex1 node.
> 
> Jun 14 15:52:23 apex1 pengine[18732]:  notice: Scheduling Node ha-apex2
> for shutdown
> 
> Jun 14 15:52:27 apex1 crmd[18733]:  notice: do_shutdown of peer ha-apex2
> is complete
> 
> 
> Best regards,
> Jaz
> 
> 
> 
> 
> 
>     Message: 1
>     Date: Thu, 15 Jun 2017 13:53:00 -0500
>     From: Ken Gaillot <kgaillot at redhat.com>
>     To: users at clusterlabs.org
>     Subject: Re: [ClusterLabs] Pacemaker shutting down peer node
>     Message-ID: <5d122183-2030-050d-3a8e-9c158fa5fb5d at redhat.com>
>     Content-Type: text/plain; charset=utf-8
> 
>     On 06/15/2017 12:38 AM, Jaz Khan wrote:
>     > Hi,
>     >
>     > I have been encountering this serious issue for the past couple of
>     > months. I really have no idea why Pacemaker sends a shutdown signal
>     > to the peer node and it goes down. This is very strange and I am
>     > very worried.
>     >
>     > This is not happening daily, but it reliably shows this behavior
>     > every few days.
>     >
>     > Version:
>     > Pacemaker 1.1.16
>     > Corosync 2.4.2
>     >
>     > Please help me out with this bug! Below is the log message.
>     >
>     >
>     >
>     > Jun 14 15:52:23 apex1 crmd[18733]:  notice: State transition S_IDLE ->
>     > S_POLICY_ENGINE
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: On loss of CCM Quorum: Ignore
>     >
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: Scheduling Node ha-apex2 for shutdown
> 
>     This is not a fencing, but a clean shutdown. Normally this only happens
>     in response to a user request.
> 
>     Check the logs on both nodes before this point, to try to see what was
>     the first indication that it would shut down.
> 
>     >
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: Move    vip#011(Started
>     > ha-apex2 -> ha-apex1)
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: Move
>     >  filesystem#011(Started ha-apex2 -> ha-apex1)
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: Move    samba#011(Started ha-apex2 -> ha-apex1)
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: Move
>     >  database#011(Started ha-apex2 -> ha-apex1)
>     > Jun 14 15:52:23 apex1 pengine[18732]:  notice: Calculated transition
>     > 1744, saving inputs in /var/lib/pacemaker/pengine/pe-input-123.bz2
>     > Jun 14 15:52:23 apex1 crmd[18733]:  notice: Initiating stop operation
>     > vip_stop_0 on ha-apex2
>     > Jun 14 15:52:23 apex1 crmd[18733]:  notice: Initiating stop operation
>     > samba_stop_0 on ha-apex2
>     > Jun 14 15:52:23 apex1 crmd[18733]:  notice: Initiating stop operation
>     > database_stop_0 on ha-apex2
>     > Jun 14 15:52:26 apex1 crmd[18733]:  notice: Initiating stop operation
>     > filesystem_stop_0 on ha-apex2
>     > Jun 14 15:52:27 apex1 kernel: drbd apexdata apex2.br: peer( Primary -> Secondary )
>     > Jun 14 15:52:27 apex1 crmd[18733]:  notice: Initiating start operation
>     > filesystem_start_0 locally on ha-apex1
>     >
>     > Jun 14 15:52:27 apex1 crmd[18733]:  notice: do_shutdown of peer ha-apex2 is complete
>     >
>     > Jun 14 15:52:27 apex1 attrd[18731]:  notice: Node ha-apex2 state is now lost
>     > Jun 14 15:52:27 apex1 attrd[18731]:  notice: Removing all ha-apex2
>     > attributes for peer loss
>     > Jun 14 15:52:27 apex1 attrd[18731]:  notice: Lost attribute writer ha-apex2
>     > Jun 14 15:52:27 apex1 attrd[18731]:  notice: Purged 1 peers with id=2
>     > and/or uname=ha-apex2 from the membership cache
>     > Jun 14 15:52:27 apex1 stonith-ng[18729]:  notice: Node ha-apex2 state is now lost
>     > Jun 14 15:52:27 apex1 stonith-ng[18729]:  notice: Purged 1 peers with
>     > id=2 and/or uname=ha-apex2 from the membership cache
>     > Jun 14 15:52:27 apex1 cib[18728]:  notice: Node ha-apex2 state is now lost
>     > Jun 14 15:52:27 apex1 cib[18728]:  notice: Purged 1 peers with id=2
>     > and/or uname=ha-apex2 from the membership cache
>     >
>     >
>     >
>     > Best regards,
>     > Jaz. K



