[ClusterLabs] 9 nodes pacemaker cluster setup non-DC nodes reboot parallelly

Wed Jul 24 14:41:11 UTC 2024

On Tue, 2024-07-16 at 16:05 +0000, S Sathish S wrote:
> Hi Ken,
> 
> Thank you for the quick response.
> 
> We checked the pacemaker logs and found that the pacemaker component
> received signal 15. After that, we ran pcs cluster start, and the
> pacemaker and corosync services started properly and the node
> rejoined the cluster.
> 
> Regarding the reboot question: in our application's pacemaker cluster,
> neither quorum nor fencing is configured. Below is the reboot
> procedure from our upgrade procedure, which is executed in parallel
> on all 9 nodes of the cluster. Is this a recommended way to reboot?
> 
>      1) Put the pacemaker cluster in maintenance mode.
>      2) Bring down the pacemaker cluster service using the commands
>         below:
>         # pcs cluster stop
>         # pcs cluster disable
>      3) Reboot.
>      4) Bring up the pacemaker cluster service.

That's fine. The disable command means the cluster services will not
start at boot, which I presume is intentional.

No quorum or fencing means you are at risk of service interruption and
possibly data unavailability or corruption, depending on what resources
you are running.

Without quorum, if one or more nodes are split from the cluster, each
side of the split will bring up all resources. The effect of that
varies by the type of resources. For example, with an IP address,
packets might be routed randomly to the two sides, rendering it
useless. With a database in single-primary mode, you will end up with
divergent data sets. And so on.
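
To make the split scenario concrete, here is a minimal sketch of the majority arithmetic, assuming one vote per node. The function names are hypothetical and this is a toy model, not Pacemaker or corosync code. It also shows how the earlier advice in this thread (reboot fewer than half of the nodes at a time) could be turned into batches if quorum were enabled.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority (one vote per node assumed)."""
    return total_votes // 2 + 1


def partitions_running_resources(partition_sizes, quorum_enforced):
    """Return the sizes of the partitions that would keep resources active."""
    total = sum(partition_sizes)
    if not quorum_enforced:
        # No quorum: every side of the split brings up all resources.
        return list(partition_sizes)
    needed = quorum_threshold(total)
    # With quorum, only a partition holding a majority may run resources.
    return [size for size in partition_sizes if size >= needed]


def reboot_batches(nodes):
    """Split nodes into batches small enough that the remaining nodes keep a majority."""
    needed = quorum_threshold(len(nodes))
    batch_size = len(nodes) - needed  # largest group that can be down at once
    if batch_size < 1:
        raise ValueError("cluster too small to reboot a node and keep quorum")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(1, 10)]  # hypothetical node names
    print(quorum_threshold(len(nodes)))
    print(partitions_running_resources([5, 4], quorum_enforced=False))
    print(partitions_running_resources([5, 4], quorum_enforced=True))
    print(reboot_batches(nodes))
```

With 9 nodes the threshold is 5, so at most 4 nodes can be down together. Each batch would still have to rejoin the cluster before the next one is rebooted.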

Without fencing, if a node is malfunctioning (high CPU load, a device
driver hanging, a flaky network card, etc.), the cluster may be unable
to communicate with it and will bring its resources up on other nodes.
The malfunctioning node is likely still running those resources and,
especially if it recovers, you may have problems similar to those of a
quorum split.
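
The same point as a toy model, assuming a single resource and hypothetical node names. It only illustrates the outcome described above and is not how Pacemaker's scheduler is implemented.

```python
def active_instances(owner, healthy_nodes, fencing_enabled):
    """Return the set of nodes running the resource after `owner` stops responding."""
    running = {owner}  # the unresponsive node may still be running the resource
    if fencing_enabled:
        # Fencing forcibly powers off or isolates the node before recovery,
        # so nothing is left running on it.
        running.discard(owner)
    # The cluster then recovers the resource on a healthy node.
    running.add(healthy_nodes[0])
    return running


if __name__ == "__main__":
    print(sorted(active_instances("node1", ["node2", "node3"], fencing_enabled=False)))
    print(sorted(active_instances("node1", ["node2", "node3"], fencing_enabled=True)))
```

Without fencing the resource ends up active on two nodes at once; with fencing only the recovery node runs it.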

> 
> 
> Regards,
> S Sathish S
> From: Ken Gaillot <kgaillot at redhat.com>
> Sent: Tuesday, July 16, 2024 7:53 PM
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Cc: S Sathish S <s.s.sathish at ericsson.com>
> Subject: Re: [ClusterLabs] 9 nodes pacemaker cluster setup non-DC
> nodes reboot parallelly
>  
> On Tue, 2024-07-16 at 11:18 +0000, S Sathish S via Users wrote:
> > Hi Team,
> >  
> > In our product we have a 9-node pacemaker cluster in which the
> > non-DC nodes are rebooted in parallel. Most nodes rejoin the cluster
> > properly, but on one node the pacemaker and corosync services did
> > not come up properly, with the error message below.
> >  
> > Error Message:
> > Error: error running crm_mon, is pacemaker running?
> >   crm_mon: Connection to cluster failed: Connection refused
> 
> All that indicates is that Pacemaker is not responding. You'd have to
> look at the system log and/or pacemaker.log from that time to find
> out more.
> 
> > 
> > Query: Is it recommended to reboot the non-DC nodes in parallel?
> 
> As long as they are cleanly rebooted, there should be no fencing or
> other actual problems. However, the cluster will lose quorum and have
> to stop all resources. If you reboot fewer than half of the nodes at
> one time and wait for them to rejoin before rebooting more, you would
> avoid that.
> 
> >  
> > Thanks and Regards,
> > S Sathish S
-- 
Ken Gaillot <kgaillot at redhat.com>