[ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?

Mon Mar 25 03:50:22 EDT 2019

Brian,

> On Fri, Mar 22, 2019 at 08:57:20AM +0100, Jan Friesse wrote:
>>> - If I manually set 'totem.token' to a higher value, am I responsible
>>>    for tracking the number of nodes in the cluster, to keep in
>>>    alignment with what Red Hat's page says?
>>
>> Nope. I've tried to explain what is really happening in the manpage
>> corosync.conf(5). totem.token and totem.token_coefficient are used in
>> the following formula:
> 
> I do see this under token_coefficient, thanks.
> 
>> Corosync uses runtime.config.token.
> 
> Cool; thanks.  Bumping up totem.token to 2000 got me over this hump.
> 
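
For reference, the calculation described under token_coefficient in corosync.conf(5) can be sketched roughly as below. This is only an illustration in Python, not corosync code: the function name is made up, and the defaults (token 1000 ms, token_coefficient 650 ms) are what I would expect from a corosync 2.x manpage, so please check your own.

```python
def effective_token_timeout(token_ms: int = 1000,
                            token_coefficient_ms: int = 650,
                            node_count: int = 2) -> int:
    """Return the token timeout (ms) corosync would use at runtime.

    Assumption: the coefficient only applies when the nodelist
    contains at least 3 nodes, as corosync.conf(5) describes.
    """
    if node_count < 3:
        # Small clusters use totem.token unchanged.
        return token_ms
    # Each node beyond the second adds one coefficient.
    return token_ms + (node_count - 2) * token_coefficient_ms


if __name__ == "__main__":
    for nodes in (2, 3, 5):
        print(nodes, effective_token_timeout(node_count=nodes))
```

So raising totem.token only moves the base value; the per-node part is still added automatically, and there is no need to track the node count by hand.
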
>>> - Under these conditions, when corosync exits, why does it do so
>>>    with a zero status? It seems to me that if it exited at all,
>>
>> That's a good question. How reproducible is the issue? Corosync
>> shouldn't "exit" with zero status.
> 
> If I leave totem.token set to default, 100% in my case.
> 
> I stand corrected; yesterday, it was 100%.  Today, I cannot reproduce
> this at all, even after reverting to the defaults.
> 

That's sad

> Here is a snippet of output from yesterday's experiments; this is
> based on a typescript capture file, so I apologize for the ANSI
> screen codes.
> 

Yep, np. Looks just fine.

> - by default, systemd doesn't report full log lines.
> 
> - by default, CentOS's config of systemd doesn't persist journaled
>    logs, so I can't directly review yesterday's efforts.
> 
> - and, it looks like I misinterpreted the 'exited' message; corosync
>    was enabled and running, but the 'Process' line doesn't report
>    on the 'corosync' process, but some systemd utility.
> 
> (Let me count the ways I'm coming to dislike systemd...)
> 
> I was able to recover logs from /var/log/messages, but other than
> the 'Consider token timeout increase' message, it looks hunky-dory.
> 
> With what I've since learned:
> 
> - I cannot explain why I can't reproduce the symptoms, even with
>    reverting to the defaults.
> 
> - And without being able to reproduce, I can't pursue why 'pcs
>    status cluster' was actually failing for me. :/
> 
> So, I appreciate your attention to this message, and I guess I'm
> off to further explore all of this.
> 
>    [root@node1 ~]# systemctl status corosync.service
>    ● corosync.service - Corosync Cluster Engine
>       Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>       Active: active (running) since Thu 2019-03-21 14:26:56 UTC; 1min 35s ago
>         Docs: man:corosync
>               man:corosync.conf
>               man:corosync_overview
>      Process: 5474 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
>     Main PID: 5490 (corosync)
>       CGroup: /system.slice/corosync.service
>               └─5490 corosync
> 
> 

As you can see, the corosync service unit in CentOS 7 executes an init
script, which launches corosync and waits until a connection to the local
IPC can be established. The IPC connection can be established only once
corosync is ready. The init script's timeout for IPC is 1 minute, and its
return code is 1 if the connection cannot be established. On success, the
init script returns 0. So ExecStart (the init script) having exited with
0/SUCCESS means corosync was started successfully, and it is running as
PID 5490.
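
The wait-and-return logic described above can be sketched roughly as follows. This is a Python illustration only; the real init script is a shell script, and wait_for_ipc, ipc_ready and the polling interval are hypothetical names and values chosen for the example. Only the 1 minute timeout and the 0/1 return codes come from the description above.

```python
import time
from typing import Callable


def wait_for_ipc(ipc_ready: Callable[[], bool],
                 timeout_s: float = 60.0,
                 interval_s: float = 0.5) -> int:
    """Poll until the IPC probe succeeds or the timeout expires.

    Returns 0 once ipc_ready() reports success, 1 on timeout.
    ipc_ready is a placeholder for whatever check the script does.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if ipc_ready():
            return 0  # daemon is up and accepting IPC connections
        if time.monotonic() >= deadline:
            return 1  # gave up: connection could not be established
        time.sleep(interval_s)


if __name__ == "__main__":
    # A probe that succeeds immediately stands in for a ready daemon.
    print(wait_for_ipc(lambda: True))
```

The point is that the exit status in the 'Process' line belongs to this short-lived wrapper, not to the corosync daemon itself.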

Regards,
   Honza
