[ClusterLabs] Restoring network connection breaks cluster services

Jan Pokorný jpokorny at redhat.com
Mon Aug 12 16:30:38 EDT 2019


On 07/08/19 16:06 +0200, Momcilo Medic wrote:
> On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger <kwenning at redhat.com> wrote:
> 
>> On 8/7/19 12:26 PM, Momcilo Medic wrote:
>> 
>>> We have three node cluster that is setup to stop resources on lost
>>> quorum.  Failure (network going down) handling is done properly,
>>> but recovery doesn't seem to work.
>> 
>> What do you mean by 'network going down'?
>> Loss of link? Does the IP persist on the interface
>> in that case?
> 
> Yes, we simulate faulty cable by turning switch ports down and up.
> In such a case, the IP does not persist on the interface.
> 
>> That there are issues reconnecting the CPG-API sounds strange to me.
>> Already the fact that something has to be reconnected. I got it
>> that your nodes were persistently up during the
>> network-disconnection. Although I would have expected fencing to
>> kick in at least on those which are part of the non-quorate
>> cluster-partition.  Maybe a few words more on your scenario
>> (fencing-setup e.g.) would help to understand what is going on.
> 
> We don't use any fencing mechanisms, we rely on quorum to run the
> services.  In more detail, we run three node Linbit LINSTOR storage
> that is hyperconverged.  Meaning, we run clustered storage on the
> virtualization hypervisors.
> 
> We use pcs in order to have linstor-controller service in high
> availability mode.  Policy for no quorum is to stop the resources.
> 
> In such hyperconverged setup, we can't fence a node without impact.
> It may happen that network instability causes primary node to no
> longer be primary.  In that case, we don't want running VMs to go
> down with the ship, as there was no impact for them.
> 
> However, we would like to have high-availability of that service
> upon network restoration, without manual actions.

This spurred a train of thought that is admittedly not immediately
helpful in this case:

* * *

1. the word "converged" is a fitting word for how we'd like
   the cluster stack to appear (from the outside), but what we have
   is that some circumstances are not clearly articulated across the
   components, meaning that there's no way for users to express their
   preferences in simple terms and in a non-conflicting, unambiguous
   way when the realms of 2+ components combine
   -- high level tools like pcs may attempt to rectify that to some
   extent, but they fall short when there are no surfaces to glue (at
   least unambiguously, see also the parallel thread about shutting
   the cluster down in the presence of sbd)

   it seems to me that the very circumstance hit here is exactly one
   that the corosync authors deemed rare and obnoxious enough that,
   rather than indicating it up the chain for detached reasoning about
   the node's fate (which pacemaker normally performs), they decided
   to stop right there (and, in a well-behaved cluster configuration,
   hence ask to be fenced)

   all of this is actually sound, until one starts to make compromises
   as was done here: ditching the fencing (think: sanity assurance)
   layer and relying fully on no-quorum-policy=stop, naively thinking
   that one is 100% covered; but with a purely pacemaker hat on, we --
   the pacemaker dev&maint -- can't really give you such a guarantee,
   because we have no visibility into said "bail out" shortcuts that
   corosync makes for such rare circumstances -- you shall refer to
   the corosync documentation, but it's not covered there (man pages)
   AFAIK (if it was _all_ indicated to pacemaker, just the standard
   response on quorum loss could be carried out, without resorting to
   anything more drastic as happened here)
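
   For completeness, the quorum-only setup being discussed typically
   boils down to something along these lines (a minimal sketch using
   pcs; the resource name "linstor-controller" and the systemd agent
   are assumptions based on the description above, not the actual
   commands used in this cluster):

   ```
   # drop the sanity-assurance layer entirely (the compromise above)
   pcs property set stonith-enabled=false
   # stop all resources once the local partition loses quorum
   pcs property set no-quorum-policy=stop
   # hypothetical resource for the service mentioned in the thread
   pcs resource create linstor-controller systemd:linstor-controller \
       op monitor interval=30s
   ```

   The point of 1. is precisely that no-quorum-policy=stop alone
   cannot cover the path where corosync aborts itself instead of
   reporting a mere quorum loss to pacemaker.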


2. based on said missing explicit and clear inter-component signalling
   (1.) and the logs provided, it's fair to argue that pacemaker had
   an opportunity to see, barring said explicit API signalling, that
   corosync died, but then, the major assumed cases are:

   - corosync crashed or was explicitly killed (perhaps to test the
     claimed HA resiliency towards the outer world)

   - broken pacemaker-corosync communication consistency
     (did some messages fall through the cracks?)

   i.e., cluster-endangering scenarios, not something to keep alive
   at all costs; better to try to stabilize the environment first,
   not to speak of the chances of a "miracles awaiting" strategy


3. despite 2., there was a decision on systemd-enabled systems to
   actually pursue said "at all costs" approach (although implicitly
   mitigated when the restart cycles happen at a rapid pace; see the
   unit-file sketch after this list)

   - it's all then in the hands of slightly non-deterministic timing
     (whether the token loss timeout window is hit or missed, although
     perhaps not in this very case, if the state within the protocol
     is a clear indicator for the other corosync peers)
   
   - I'd actually assume pacemaker would be restarted in said
     scenario (unless one fiddled with the pacemaker service file,
     that is), and just prior to that, corosync would be forcibly
     started anew as well

   - is the problem then that such a short-circuit-revamped instance
     of the cluster stack remains in an UNCLEAN state, since there is
     no fencing that would delimit the perceived insane/sane modes of
     the node's operation?
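
   To illustrate the systemd angle of 3.: the restart-at-all-costs
   behaviour comes from the Restart= directive of the pacemaker unit,
   while the rapid-cycle mitigation is systemd's start rate limiting.
   A rough sketch of the relevant directives (illustrative drop-in
   only, not the unit file actually shipped by any distribution):

   ```
   # /etc/systemd/system/pacemaker.service.d/restart.conf (hypothetical)
   [Unit]
   # corosync is (re)started first whenever pacemaker gets started
   Requires=corosync.service
   After=corosync.service
   # give up if the unit fails more than 5 times within 25 seconds
   StartLimitIntervalSec=25
   StartLimitBurst=5

   [Service]
   # restart pacemaker on any abnormal exit, e.g. the status=107
   # visible in the journal excerpt below after the CPG connection died
   Restart=on-failure
   ```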

* * *

Anyway, with 1. above, I meant to point out that we are likely
doing something wrong design-wise when the exact state of the peer
component can _only_ be determined from indirect pointers rather
than being an integral (in-band) part of the protocol (except,
apparently, straight cut-offs), and, lacking any such granularity,
allowing/forcing just a catch-all best-effort means of local recovery.
Also, vaguely, sbd seems to be immersed in this
magic-oracle-rather-than-by-design-informed-part discrepancy even
deeper (just assuming, not standing on something firmer like periodic
inter-component persistent pulse-metering with cut-off detection and
timeout tolerance and so on...).

Or vice versa, where does the true authority for the behaviour belong?
Is it OK for lower level components to make autonomous decisions
without at least informing the higher level about what exactly is
going on, as we could observe here?

I think there is a lot of room for improvement, shooting for higher
"convergence" overall and a satisfying level of unified configuration
and documentation.  Use cases are apparently richer than a single
component may have in mind in isolation.

Alternatively, we should have settled on the original resolution (that
was subsequently relaxed again) for the 2.0 line of pacemaker, that
tested and working fencing is a must for cluster operation, not
because "you probably don't need a cluster stack otherwise", but
exactly because such rather unpredictable infrastructure level
subtleties will let people down even when they expected that their
use cases did not require fencing per se.

* * *

Bottom line: the status quo is suboptimal; one way forward is the
"converged" reasoning about the cluster as a whole, not per partes
as has become customary with development compartmentalization.

(Apparently, I credit the original usage of the word "converged"
with triggering this post.  I may have missed a lot of the picture,
but I did some fact checking.)

>> 
>> Klaus
>> 
>> 
>> What happens is, services crash when we re-enable network connection.
>> 
>> From journal:
>> 
>> ```
>> ...
>> Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync:
>> totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1'
>> failed.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
>> Main process exited, code=dumped, status=6/ABRT
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
>> Failed with result 'core-dump'.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
>> Main process exited, code=exited, status=107/n/a
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
>> Failed with result 'exit-code'.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker
>> High Availability Cluster Manager.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]:  warning:
>> new_event_notification (9102-9107-7): Bad file descriptor (9)
>> ...
>> ```
>> Pacemaker's log shows no relevant info.
>> 
>> This is from corosync's log:
>> 
>> ```
>> Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu       crmd:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu       crmd:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9102] itaftestkvmls02.dc.itaf.eu       lrmd:  warning:
>> qb_ipcs_event_sendv:    new_event_notification (9102-9107-7): Bad file
>> descriptor (9)
>> ```
>> 
>> Please let me know if you need any further info, I'll be more than
>> happy to provide it.
>> 
>> This is always reproducible in our environment:
>> Ubuntu 18.04.2
>> corosync 2.4.3-0ubuntu1.1
>> pcs 0.9.164-1
>> pacemaker 1.1.18-0ubuntu1.1

-- 
Jan (Poki)