[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Cluster timeout
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Mar 10 03:19:46 EST 2022
Hi Thierry!
Having a glance at the log, I wonder:
* Why is the start for pgsql_mail returning an "unknown error (1)"?
* Why is the demote for drbd_pgsql:1 returning an "unknown error (1)"?
* Also, your DC (dvs47713) went offline.
So the first action plan is:
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:1 (Stopped)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_mail (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_fs (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_lsb (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_vip (Started dvs42832)
(BTW: You may want to limit the number of policy files kept)
As the cluster goes to IDLE mode, I must assume that you have no fencing
configured:
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:1 (Stopped)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_mail (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_fs (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_lsb (Started dvs42832)
Mar 09 09:26:03 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_vip (Started dvs42832)
The cluster seems unable to react until:
[22170] dvs42832 corosync notice [MAIN ] Completed service synchronization,
ready to provide service.
As the DC was not fenced, you have two of them:
Mar 09 09:26:22 [22179] dvs42832 crmd: warning:
crmd_ha_msg_filter: Another DC detected: dvs47713 (op=noop)
(Re-join after split brain is risky)
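To check how often this has happened, one could scan the log for that warning. The following is only a minimal sketch, assuming the log lines look like the excerpt above; the function name `other_dcs` is made up for illustration and is not part of Pacemaker.

```python
import re

# Matches the crmd warning quoted above. \s+ tolerates the line wrap
# that mail clients insert into long log lines.
DUAL_DC = re.compile(
    r"crmd_ha_msg_filter:\s+Another DC detected:\s+(?P<node>\S+)"
)

def other_dcs(log_text):
    """Return the sorted names of nodes reported as a second DC."""
    return sorted({m["node"] for m in DUAL_DC.finditer(log_text)})
```

Any non-empty result means the cluster ran with two DCs at some point, which should not happen when fencing works.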
After rejoin, the cluster handles the failure:
Mar 09 09:26:24 [22178] dvs42832 pengine: warning:
unpack_rsc_op_failure: Forcing drbd_pgsql:1 to stop after a failed
demote action
So the next action plan is:
Mar 09 09:26:24 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Demote
drbd_pgsql:1 (Master -> Slave dvs47713)
Mar 09 09:26:24 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_mail (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart
pgsql_fs (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart
pgsql_lsb (Started dvs42832)
Mar 09 09:26:24 [22178] dvs42832 pengine: notice: LogActions: Restart
pgsql_vip (Started dvs42832)
Then:
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Stop
drbd_pgsql:1 (dvs47713)
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_mail (Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_fs (Started dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Start
pgsql_lsb (dvs42832)
Mar 09 09:26:27 [22178] dvs42832 pengine: notice: LogActions: Start
pgsql_vip (dvs42832)
Then:
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: notice: LogActions: Start
drbd_pgsql:1 (dvs47713)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_mail (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_fs (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_lsb (Started dvs42832)
Mar 09 09:26:28 [22178] dvs42832 pengine: notice: LogActions: Start
pgsql_vip (dvs42832)
Eventually:
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:0 (Master dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave
drbd_pgsql:1 (Slave dvs47713)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_mail (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_fs (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_lsb (Started dvs42832)
Mar 09 09:26:29 [22178] dvs42832 pengine: info: LogActions: Leave
pgsql_vip (Started dvs42832)
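Action plans like the ones quoted above are easier to compare when the "Leave" entries are filtered out. The following is a minimal sketch, assuming the pengine lines have the same layout as in these excerpts; the names `LOG_ACTION` and `summarize` are hypothetical helpers, not part of any Pacemaker tool.

```python
import re

# Matches pengine LogActions entries. \s+ tolerates the line wrap that
# mail clients insert between the action and the resource name.
LOG_ACTION = re.compile(
    r"(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+\[\d+\]\s+\S+\s+pengine:\s+"
    r"(?:info|notice):\s+LogActions:\s+(?P<action>\w+)\s+"
    r"(?P<resource>\S+)\s+\((?P<detail>[^)]*)\)"
)

def summarize(log_text):
    """Group LogActions entries by timestamp, dropping 'Leave' entries.

    Returns a dict mapping each timestamp to a list of
    (action, resource, detail) tuples; an empty list means the plan
    left every resource as it was.
    """
    plans = {}
    for m in LOG_ACTION.finditer(log_text):
        plan = plans.setdefault(m["ts"], [])
        if m["action"] != "Leave":
            plan.append((m["action"], m["resource"], m["detail"]))
    return plans
```

Fed with the excerpts above, this would list the Demote, Restart, Stop and Start actions per timestamp and show an empty plan for the last one, where everything is left as is.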
So it seems you have three problems:
1) some resource operation failing
2) network problems
3) no fencing configured
Just adjusting some timeouts won't help much in this situation.
Regards,
Ulrich
>>> FLORAC Thierry <thierry.florac at onf.fr> wrote on 09.03.2022 at 18:24 in
message
<PR2P264MB076785E16FAD8F054D972C0EF50A9 at PR2P264MB0767.FRAP264.PROD.OUTLOOK.COM>:
> Here is an extract of "corosync.log"...
>
> Thierry
>
> ________________________________
> From: Users <users-bounces at clusterlabs.org> on behalf of Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de>
> Sent: Wednesday, 9 March 2022 17:13
> To: users at clusterlabs.org <users at clusterlabs.org>
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Cluster timeout
>
>>>> FLORAC Thierry <thierry.florac at onf.fr> wrote on 09.03.2022 at 16:56 in
> message
> <PR2P264MB07678DBB0517CB8C7695627CF50A9 at PR2P264MB0767.FRAP264.PROD.OUTLOOK.COM>:
>
>>>>> FLORAC Thierry <thierry.florac at onf.fr> wrote on 09.03.2022 at 11:46 in
>> message
>> <PR2P264MB076751671FC57F33B995F851F50A9 at PR2P264MB0767.FRAP264.PROD.OUTLOOK.COM>:
>>
>>> Hi,
>>>
>>> I manage an active/passive PostgreSQL cluster using DRBD, LVM, Pacemaker and
>>> Corosync on a Debian GNU/Linux operating system.
>>> Everything is OK, but my platform seems to be quite "sensitive" to small
>>> network timeouts which are generating a cluster migration start from active
>>> to passive node; generally, the process doesn't go through to the end: as
>>> soon as the connection is back again, the migration is cancelled and the
>>> database restarts!
>>
>> Could it be you run without fencing? Maybe show some logs!
>>
>> Logs are quite verbose and not very easy to understand...
>> What log would you need?
>
> Those showing what happens when the network goes down, and what happens when
> the network comes up.
> Usually the DC writes some good "action summaries" (typically after
> "pacemaker-controld[7236]: notice: State transition S_IDLE ->
> S_POLICY_ENGINE"). Those would be helpful.
>
>>
>>> That should be OK but on the application side, some database connections (on
>>> a Java WildFly server) can become "invalid"! So I would like to avoid these
>>> migrations when this kind of small timeout occurs...
>>>
>>> So my question is: which cluster settings can I change to increase the
>>> timeout before starting a cluster migration?
>>>
>>> Best regards,
>>> Thierry
>>>
>>>
>>>
>>> Thierry Florac
>>> Resp. Pôle Architecture Applicative et Mobile
>>> DSI ‑ Dépt. Études et Solutions Tranverses
>>> 2, avenue de Saint‑Mandé ‑ 75570 Paris cedex 12
>>> Tél : 01 40 19 59 64
>>> www.onf.fr <https://www.onf.fr>