[ClusterLabs] corosync.service (and sbd.service) are not stopped on pacemaker shutdown when corosync-qdevice is used

Ken Gaillot kgaillot at redhat.com
Fri Aug 9 11:44:13 EDT 2019


On Fri, 2019-08-09 at 08:19 +0000, Roger Zhou wrote:
> 
> On 8/9/19 3:39 PM, Jan Friesse wrote:
> > Roger Zhou wrote:
> > > 
> > > On 8/9/19 2:27 PM, Roger Zhou wrote:
> > > > 
> > > > On 7/29/19 12:24 AM, Andrei Borzenkov wrote:
> > > > > corosync.service sets StopWhenUnneeded=yes, which normally
> > > > > stops it when pacemaker is shut down.
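> > > > >
> > > > > For context, StopWhenUnneeded= tells systemd to stop a unit as
> > > > > soon as no other active unit depends on it. The relevant
> > > > > excerpt (exact file contents may vary by distro) looks like:
> > > > >
> > > > >    [Unit]
> > > > >    StopWhenUnneeded=yes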
> > > 
> > > One more thought,
> > > 
> > > Does it make sense to add "RefuseManualStop=true" to
> > > pacemaker.service?
> > > The same for corosync-qdevice.service?
> > > 
> > > And "RefuseManualStart=true" to corosync.service?
> > 
> > I would say the short answer is no, but I would like to hear the
> > main idea behind this proposal.
> 
> It's more about the out-of-box user experience: guiding users in the
> most common field use cases to manage the whole cluster stack with
> the appropriate steps, namely:
> 
> - To start the stack: systemctl start pacemaker corosync-qdevice
> - To stop the stack: systemctl stop corosync.service
> 
> and to rule out error-prone assumptions:
> 
> With "RefuseManualStop=true" to pacemaker.service, sometimes(if not
> often),
> 
> - it prevents the wrong assumption/wish/impression that stopping
>    pacemaker also stops the whole cluster, corosync included
> 
> - it prevents users from forgetting the extra step needed to
>    actually stop corosync
> 
> - it prevents some ISVs from creating disruptive scripts that only
>    stop pacemaker and forget the other components
> 
> - being rejected in the first place naturally guides users to run
>    `systemctl stop corosync.service`
> 
> 
> And extending the same idea a little further:
> 
> - "RefuseManualStop=true" to corosync-qdevice.service
> - and "RefuseManualStart=true" to corosync.service

This definitely can be a pain point for users, but I think the higher-
level tools (crm, pcs, hawk, etc.) are a better place to do this. At
the individual project level, it's possible to run corosync alone
(rare, but I have seen messages on this list by users who do) and that
can be useful for testing as well.

The higher-level tools exist to hide the complexity from the end user,
and they can coordinate multiple pieces like booth, qdevice-only nodes,
etc. As time goes on, it seems like there are more and more such pieces
-- single-host native facilities like systemd probably won't ever be
able to grasp the entire puzzle.

As an example, newer pcs versions start corosync on all nodes first,
then pacemaker on all nodes, so that if there's a quorum, it's already
available when pacemaker starts. There's no way to do such multi-host
dependencies in systemd.
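
To illustrate the ordering only (this is not how pcs is implemented;
the node names and ssh access are assumptions), a higher-level tool
effectively does something like:

   # phase 1: corosync on every node, so quorum can form
   for node in node1 node2 node3; do
       ssh "$node" systemctl start corosync.service
   done
   # phase 2: only then pacemaker on every node
   for node in node1 node2 node3; do
       ssh "$node" systemctl start pacemaker.service
   done

That two-phase, cluster-wide sequencing is exactly what per-host unit
dependencies cannot express.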

The documentation could be improved too, for users who do want a lower-
level view.

> 
> Well, I do feel the corosync* services are less error-prone than
> pacemaker in this regard.
> 
> Thanks,
> Roger
> 
> 
> > 
> > Regards,
> >    Honza
> > 
> > > 
> > > @Jan, @Ken
> > > 
> > > What do you think?
> > > 
> > > Cheers,
> > > Roger
> > > 
> > > 
> > > > 
> > > > `systemctl stop corosync.service` is the right command to stop
> > > > the whole cluster stack.
> > > > 
> > > > It stops pacemaker and corosync-qdevice first, and stops SBD
> > > > too.
> > > > 
> > > > pacemaker.service: After=corosync.service
> > > > corosync-qdevice.service: After=corosync.service
> > > > sbd.service: PartOf=corosync.service
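> > > >
> > > > One way to double-check these relationships on a node (plain
> > > > systemctl; output will vary by distro and versions):
> > > >
> > > >    systemctl list-dependencies --reverse corosync.service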
> > > > 
> > > > On the reverse side, to start the cluster stack, use
> > > > 
> > > > systemctl start pacemaker.service corosync-qdevice
> > > > 
> > > > The asymmetry between the start and stop commands is slightly
> > > > confusing. So, openSUSE uses the consistent commands below:
> > > > 
> > > > crm cluster start
> > > > crm cluster stop
> > > > 
> > > > Cheers,
> > > > Roger
> > > > 
> > > > > Unfortunately, corosync-qdevice.service declares
> > > > > Requires=corosync.service and corosync-qdevice.service itself
> > > > > is *not*
> > > > > stopped when pacemaker.service is stopped. Which means
> > > > > corosync.service
> > > > > remains "needed" and is never stopped.
> > > > > 
> > > > > Also sbd.service (which is PartOf=corosync.service) remains
> > > > > running 
> > > > > as well.
> > > > > 
> > > > > The latter is really bad, as it means the sbd watchdog can
> > > > > kick in at any time when the user believes the cluster stack
> > > > > is safely stopped, in particular if qnetd is not accessible
> > > > > (think network reconfiguration).
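> > > > >
> > > > > A possible manual workaround (an untested sketch, using the
> > > > > unit names above) is to stop both leaf units explicitly:
> > > > >
> > > > >    systemctl stop pacemaker.service corosync-qdevice.service
> > > > >
> > > > > With neither of them needing corosync any more,
> > > > > StopWhenUnneeded=yes should let systemd stop corosync.service,
> > > > > and sbd.service with it via PartOf.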

-- 
Ken Gaillot <kgaillot at redhat.com>


