[ClusterLabs] Corosync/Pacemaker bug methinks! (Was: pacemaker won't start because duplicate node but can't remove dupe node because pacemaker won't start)

Thu Dec 19 11:02:11 EST 2019

On Thu, 2019-12-19 at 02:38 -0800, JC wrote:
> Hi Ken,
> 
> I took a little time away from the problem. Getting back to it now. I
> found that the corosync logs were not only in journalctl but also in
> /var/log/syslog. I think the logs in syslog are more interesting,
> though I haven’t actually done a thorough comparison. Nevertheless,
> I’m pasting what the logs in syslog say and am hoping there’s more
> interesting data here. The time signatures match perfectly here, too.

<snip>

> Dec 18 23:44:21 region-ctrl-2 corosync[2946]:   [TOTEM ] A new
> membership (192.168.99.225:120) was formed. Members joined:
> 1084777441    

Well that at least confirms that the ID is coming from corosync.
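
For what it's worth, the number is consistent with an auto-generated
ID. This is a sketch of the arithmetic, assuming the behavior described
in corosync.conf(5): when no nodeid is given, corosync uses the 32-bit
IPv4 address bound to ring 0, and clear_node_high_bit: yes zeroes the
top bit. Using the address from the membership line (192.168.99.225):

```latex
\begin{align*}
192 \cdot 2^{24} + 168 \cdot 2^{16} + 99 \cdot 2^{8} + 225 &= 3232261089 \\
3232261089 - 2^{31} = 3232261089 - 2147483648 &= 1084777441
\end{align*}
```

That matches the ID in the log exactly, so it was derived from the
node's address rather than taken from the nodelist.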

<snip>

> # cat /etc/corosync/corosync.conf 
> totem {
>     version: 2
>     cluster_name: maas-cluster
>     token: 3000
>     token_retransmits_before_loss_const: 10
>     clear_node_high_bit: yes
>     crypto_cipher: none
>     crypto_hash: none
> 
>     interface {
>         ringnumber: 0
>         bindnetaddr: 192.168.99.0
>         mcastport: 5405

Hmm, multicast? I bet your problems will go away if you switch to udpu.

>         ttl: 1
>     }
> }
> 
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: yes
>     to_syslog: yes
>     syslog_facility: daemon
>     debug: on
>     timestamp: on
> 
>     logger_subsys {
>         subsys: QUORUM
>         debug: on
>     }
> }
> 
> quorum {
>     provider: corosync_votequorum
>     expected_votes: 3
>     two_node: 1
> }
> 
> nodelist {
>     node {
>         ring0_addr: postgres-sb
>         nodeid: 3
>     }
> 
>     node {
>         ring0_addr: region-ctrl-1
>         nodeid: 1
>     }
> }

I know you've tried various things with the config, so I'm not sure
what happened when, but with only those two nodes listed explicitly and
multicast configured, it does make sense that the local node (which
isn't listed) would join with an auto-generated ID.

I would list all nodes explicitly in the nodelist, each with its own
nodeid, and switch to the udpu transport (transport: udpu in the totem
section).
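
A minimal sketch of that change, for illustration only. It assumes the
unlisted local node is region-ctrl-2 (the hostname in the log lines
above), that the name resolves on every node, and that nodeid 2 is
free; the third entry and its ID are my guesses, not something taken
from your setup.

```
totem {
    version: 2
    cluster_name: maas-cluster
    # unicast UDP instead of the default multicast transport
    transport: udpu
    # token, crypto_* and the other existing options stay unchanged
    # (omitted here for brevity)

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.0
        mcastport: 5405
    }
}

nodelist {
    node {
        ring0_addr: region-ctrl-1
        nodeid: 1
    }

    node {
        # assumed entry for the local node; name and ID are hypothetical
        ring0_addr: region-ctrl-2
        nodeid: 2
    }

    node {
        ring0_addr: postgres-sb
        nodeid: 3
    }
}
```

The same file would need to be in place on every node before corosync
is restarted.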
-- 
Ken Gaillot <kgaillot at redhat.com>