[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Cluster timeout

Thu Mar 10 03:19:46 EST 2022

Hi Thierry!

Having a glance at the log, I wonder:
* Why is the start for pgsql_mail returning an "unknown error (1)"?
* Why is demote for drbd_pgsql:1 returning an "unknown error (1)"?
* Your DC (dvs47713) went offline

So the first action plan is:
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:0	(Master dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:1	(Stopped)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_mail	(Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_fs	(Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_lsb	(Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_vip	(Started dvs42832)

(BTW: You may want to limit the number of policy files kept)

As the cluster goes to IDLE mode, I must assume that you have no fencing
configured:
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:0	(Master dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:1	(Stopped)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_mail	(Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_fs	(Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_lsb	(Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_vip	(Started dvs42832)

The cluster seems unable to react until:
[22170] dvs42832 corosyncnotice  [MAIN  ] Completed service synchronization,
ready to provide service.

As the DC was not fenced, you have two of them:
Mar 09 09:26:22 [22179] dvs42832       crmd:  warning:
crmd_ha_msg_filter:	Another DC detected: dvs47713 (op=noop)

(Re-join after split brain is risky)

After the rejoin, the cluster handles the failure:
Mar 09 09:26:24 [22178] dvs42832    pengine:  warning:
unpack_rsc_op_failure:	Forcing drbd_pgsql:1 to stop after a failed
demote action
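
Warnings like the two above are easy to miss among the info lines. A minimal Python sketch for pulling them out follows. It assumes every entry sits on a single line in the real corosync.log (the mail wraps them) and that the layout matches the excerpts quoted here. The names WARNING and find_problems are made up for illustration.

```python
import re

# Assumed layout, taken from the excerpts in this mail:
# "Mar 09 09:26:22 [22179] dvs42832  crmd:  warning: crmd_ha_msg_filter:<TAB>message"
WARNING = re.compile(
    r"^(?P<time>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[\d+\] (?P<host>\S+)\s+"
    r"(?P<daemon>\w+):\s+(?P<level>warning|error):\s+"
    r"(?P<function>\w+):\s+(?P<message>.*)$"
)


def find_problems(lines):
    """Return (time, daemon, function, message) for warning/error entries."""
    problems = []
    for line in lines:
        match = WARNING.match(line.rstrip("\n"))
        if match:
            problems.append(
                (match["time"], match["daemon"], match["function"], match["message"])
            )
    return problems
```

Feeding it the open log file (for example `find_problems(open("corosync.log"))`, where the path is only a placeholder) lists the "Another DC detected" and failed-demote entries without the surrounding noise.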

So the next action plan is:
Mar 09 09:26:24 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:0	(Master dvs42832)
Mar 09 09:26:24 [22178] dvs42832    pengine:   notice: LogActions:	Demote 
drbd_pgsql:1	(Master -> Slave dvs47713)
Mar 09 09:26:24 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_mail	(Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832    pengine:   notice: LogActions:	Restart
pgsql_fs	(Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832    pengine:   notice: LogActions:	Restart
pgsql_lsb	(Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832    pengine:   notice: LogActions:	Restart
pgsql_vip	(Started dvs42832)

Then:
Mar 09 09:26:27 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:0	(Master dvs42832)
Mar 09 09:26:27 [22178] dvs42832    pengine:   notice: LogActions:	Stop   
drbd_pgsql:1	(dvs47713)
Mar 09 09:26:27 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_mail	(Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_fs	(Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832    pengine:   notice: LogActions:	Start  
pgsql_lsb	(dvs42832)
Mar 09 09:26:27 [22178] dvs42832    pengine:   notice: LogActions:	Start  
pgsql_vip	(dvs42832)

Then:
Mar 09 09:26:28 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:0	(Master dvs42832)
Mar 09 09:26:28 [22178] dvs42832    pengine:   notice: LogActions:	Start  
drbd_pgsql:1	(dvs47713)
Mar 09 09:26:28 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_mail	(Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_fs	(Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_lsb	(Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832    pengine:   notice: LogActions:	Start  
pgsql_vip	(dvs42832)

Eventually:
Mar 09 09:26:29 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:0	(Master dvs42832)
Mar 09 09:26:29 [22178] dvs42832    pengine:     info: LogActions:	Leave  
drbd_pgsql:1	(Slave dvs47713)
Mar 09 09:26:29 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_mail	(Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_fs	(Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_lsb	(Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832    pengine:     info: LogActions:	Leave  
pgsql_vip	(Started dvs42832)
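
The action plans quoted above can also be extracted mechanically. Below is a minimal Python sketch that groups the LogActions entries by timestamp and keeps only the resources that are not simply left alone. It assumes one entry per line in the real corosync.log (the mail wraps them) and the pengine layout shown in these excerpts. Grouping by timestamp is a simplification (two transitions computed within the same second would be merged), and the names LOG_ACTION, parse_action and changes_by_transition are invented for illustration.

```python
import re

# Assumed layout, taken from the excerpts in this mail:
# "Mar 09 09:26:24 [22178] dvs42832  pengine:  notice: LogActions:<TAB>Demote <TAB>drbd_pgsql:1<TAB>(Master -> Slave dvs47713)"
LOG_ACTION = re.compile(
    r"^(?P<time>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[\d+\] (?P<host>\S+)\s+"
    r"pengine:\s+\w+: LogActions:\s+(?P<action>\w+)\s+"
    r"(?P<resource>\S+)\s+\((?P<detail>.*)\)\s*$"
)


def parse_action(line):
    """Return (time, action, resource, detail), or None for other log lines."""
    match = LOG_ACTION.match(line)
    if match is None:
        return None
    return match["time"], match["action"], match["resource"], match["detail"]


def changes_by_transition(lines):
    """Map each timestamp to the planned actions other than 'Leave'."""
    plans = {}
    for line in lines:
        parsed = parse_action(line)
        if parsed is None:
            continue
        time, action, resource, detail = parsed
        plans.setdefault(time, [])  # keep fully idle plans visible as []
        if action != "Leave":
            plans[time].append((action, resource, detail))
    return plans
```

A plan that maps to an empty list is one where the cluster decided to do nothing, as at 09:26:03 and 09:26:29 above.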

So it seems you have three problems:
1) some resource operation failing
2) network problems
3) no fencing configured

Just adjusting some timeouts won't help much in this situation.

Regards,
Ulrich

>>> FLORAC Thierry <thierry.florac at onf.fr> wrote on 09.03.2022 at 18:24 in
message
<PR2P264MB076785E16FAD8F054D972C0EF50A9 at PR2P264MB0767.FRAP264.PROD.OUTLOOK.COM>:

> Here is an extract of "corosync.log"...
> 
> Thierry
> 
> ________________________________
> From: Users <users-bounces at clusterlabs.org> on behalf of Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de>
> Sent: Wednesday, 9 March 2022 17:13
> To: users at clusterlabs.org <users at clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Cluster timeout
> 
>>>> FLORAC Thierry <thierry.florac at onf.fr> wrote on 09.03.2022 at 16:56 in message
> <PR2P264MB07678DBB0517CB8C7695627CF50A9 at PR2P264MB0767.FRAP264.PROD.OUTLOOK.COM>:
> 
>>>>> FLORAC Thierry <thierry.florac at onf.fr> wrote on 09.03.2022 at 11:46 in message
>> <PR2P264MB076751671FC57F33B995F851F50A9 at PR2P264MB0767.FRAP264.PROD.OUTLOOK.COM>:
>>
>>> Hi,
>>>
>>> I manage an active/passive PostgreSQL cluster using DRBD, LVM, Pacemaker and
>>> Corosync on a Debian GNU/Linux operating system.
>>> Everything is OK, but my platform seems to be quite "sensitive" to small
>>> network timeouts which are generating a cluster migration start from active
>>> to passive node; generally, the process doesn't go through to the end: as
>>> soon as the connection is back again, the migration is cancelled and the
>>> database restarts!
>>
>> Could it be you run without fencing? Maybe show some logs!
>>
>> Logs are quite verbose and not very easy to understand...
>> What log would you need?
> 
> Those showing what happens when the network goes down, and what happens when
> the network comes up.
> Usually the DC writes some good "action summaries" (typically after
> "pacemaker-controld[7236]:  notice: State transition S_IDLE ->
> S_POLICY_ENGINE"). Those would be helpful.
> 
>>
>>> That should be OK but on the application side, some database connections (on
>>> a Java WildFly server) can become "invalid"! So I would like to avoid these
>>> migrations when this kind of small timeout occurs...
>>>
>>> So my question is: which cluster settings can I change to increase the
>>> timeout before starting a cluster migration?
>>>
>>> Best regards,
>>> Thierry
>>>
>>>
>>>
>>> Thierry Florac
>>> Resp. Pôle Architecture Applicative et Mobile
>>> DSI ‑ Dépt. Études et Solutions Tranverses
>>> 2, avenue de Saint‑Mandé ‑ 75570 Paris cedex 12
>>> Tél : 01 40 19 59 64
>>> www.onf.fr <https://www.onf.fr>
> 
> 
> 