[ClusterLabs] Restoring network connection breaks cluster services

Jan Friesse jfriesse at redhat.com
Tue Aug 13 01:59:53 EDT 2019


Momcilo
> On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger <kwenning at redhat.com> wrote:
> 
>> On 8/7/19 12:26 PM, Momcilo Medic wrote:
>>
>> We have a three-node cluster that is set up to stop resources on loss of quorum.
>> Failure handling (the network going down) works properly, but recovery
>> doesn't seem to work.
>>
>> What do you mean by 'network going down'?
>> Loss of link? Does the IP persist on the interface
>> in that case?
>>
> 
> Yes, we simulate a faulty cable by turning the switch ports down and up.
> In that case, the IP does not persist on the interface.

Which corosync version are you running? Corosync was really bad at handling 
ifdown (removal of the IP) until version 3 with knet, which solved the 
problem completely, and 2.4.5, where it is so-so for udpu (udp is still 
affected).
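
For reference, with corosync 3 the knet transport is selected in the totem 
section of corosync.conf. A minimal sketch (the cluster name and crypto 
settings are placeholders, not taken from your setup):

```
totem {
    version: 2
    cluster_name: mycluster
    transport: knet        # default transport in corosync 3
    crypto_cipher: aes256  # optional; knet supports traffic encryption
    crypto_hash: sha256
}
```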

The solution is either to upgrade corosync or to configure the system to keep the IP intact.
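
If upgrading is not an option, one way to keep the IP intact is to assign it 
statically instead of via DHCP, so no DHCP client withdraws the address when 
the link drops. A sketch for Ubuntu 18.04 with netplan (file name, interface 
name, and address are hypothetical; adjust to your environment):

```
# /etc/netplan/01-cluster.yaml (hypothetical)
network:
  version: 2
  ethernets:
    ens3:                            # hypothetical interface name
      dhcp4: no
      addresses: [192.168.1.10/24]   # hypothetical cluster address
```

Depending on how the interface is managed, the stack may still withdraw the 
address on carrier loss; you can check with `ip monitor address` while 
toggling the switch port. If the address disappears when the port goes down, 
you are in the affected case.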


Honza
> 
> 
>> That there are issues reconnecting the CPG API
>> sounds strange to me, starting with the fact that
>> something has to be reconnected at all. I understood
>> that your nodes stayed up throughout the
>> network disconnection, although I would have
>> expected fencing to kick in at least on the nodes
>> that are part of the non-quorate cluster partition.
>> A few more words about your scenario
>> (the fencing setup, for example) would help in
>> understanding what is going on.
>>
> 
> We don't use any fencing mechanisms; we rely on quorum to run the services.
> In more detail, we run a three-node Linbit LINSTOR storage setup that is
> hyperconverged, meaning we run the clustered storage on the virtualization
> hypervisors.
> 
> We use pcs to keep the linstor-controller service highly available.
> The no-quorum policy is to stop the resources.
> 
> In such a hyperconverged setup, we can't fence a node without impact.
> Network instability may cause the primary node to no longer be primary.
> In that case, we don't want the running VMs to go down with the ship, as
> they were not impacted.
> 
> However, we would like that service to become highly available again upon
> network restoration, without manual intervention.
> 
> 
>>
>> Klaus
>>
>>
>> What happens is that the services crash when we re-enable the network connection.
>>
>> From the journal:
>>
>> ```
>> ...
>> Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync:
>> totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1'
>> failed.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
>> Main process exited, code=dumped, status=6/ABRT
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: corosync.service:
>> Failed with result 'core-dump'.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]:    error:
>> Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
>> Main process exited, code=exited, status=107/n/a
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: pacemaker.service:
>> Failed with result 'exit-code'.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu systemd[1]: Stopped Pacemaker
>> High Availability Cluster Manager.
>> Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu lrmd[9102]:  warning:
>> new_event_notification (9102-9107-7): Bad file descriptor (9)
>> ...
>> ```
>> Pacemaker's log shows no relevant info.
>>
>> This is from corosync's log:
>>
>> ```
>> Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu       crmd:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd:    error:
>> pcmk_cpg_dispatch:      Connection to the CPG API failed: Library error (2)
>> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9087] itaftestkvmls02.dc.itaf.eu pacemakerd:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9107] itaftestkvmls02.dc.itaf.eu       crmd:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9104] itaftestkvmls02.dc.itaf.eu      attrd:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9100] itaftestkvmls02.dc.itaf.eu stonith-ng:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> qb_ipcs_us_withdraw:    withdrawing server sockets
>> Jul 12 00:27:33 [9098] itaftestkvmls02.dc.itaf.eu        cib:     info:
>> crm_xml_cleanup:        Cleaning up memory from libxml2
>> Jul 12 00:27:33 [9102] itaftestkvmls02.dc.itaf.eu       lrmd:  warning:
>> qb_ipcs_event_sendv:    new_event_notification (9102-9107-7): Bad file
>> descriptor (9)
>> ```
>>
>> Please let me know if you need any further info; I'll be more than happy
>> to provide it.
>>
>> This is always reproducible in our environment:
>> Ubuntu 18.04.2
>> corosync 2.4.3-0ubuntu1.1
>> pcs 0.9.164-1
>> pacemaker 1.1.18-0ubuntu1.1
>>
>> Kind regards,
>> Momo.
>>
> 
> 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 


