[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

Fri Jul 23 09:45:57 EDT 2021

On Fri, 23 Jul 2021 12:52:00 +0200
damiano giuliani <damianogiuliani87 at gmail.com> wrote:

> the query time isn't the problem, it is known that it takes its time. the
> network is a 10Gb/s bond, quite impossible to saturate with queries :=).

Everything is possible, it's just harder :)

[...]
> checking the logs again, what is not clear to me is the cause of the loss
> of quorum and the fencing of the node that followed.

As said before, according to the logs from the other nodes, ltaoperdbs02
stopped answering the TOTEM protocol, so it left the communication group. But
worse, it did so without saying goodbye properly:

  > [TOTEM ] Failed to receive the leave message. failed: 1 

From this exact time, the node is considered "unclean", meaning its state is
unknown. To resolve this, the cluster needs to fence it to force a
predictable state: OFF. So the reaction to the problem is sane.

Now, back to the starting point of this conversation: what happened? Logs
from the other nodes will probably not help, as they just witnessed a node
disappearing without any explanation.

Logs from ltaoperdbs02 might help, but the corosync log you sent stops at
00:38:44, almost 2 minutes before the fencing as reported by the other nodes:

  > Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: pe_fence_node:
  >    Cluster node ltaoperdbs02 will be fenced: peer is no longer part of
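
The size of that gap can be checked from the two timestamps quoted above. A
minimal sketch, assuming both timestamps belong to the same day (Jul 13) and
that the clocks of the nodes agree; the names below are only illustrative:

```python
from datetime import datetime

# Timestamps quoted in this thread (same day assumed).
LAST_LOCAL_LOG = "00:38:44"   # last corosync log line sent from ltaoperdbs02
FENCE_DECISION = "00:40:37"   # pe_fence_node warning logged on ltaoperdbs03


def gap_seconds(start: str, end: str) -> int:
    """Return the number of seconds between two HH:MM:SS times of one day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


print(gap_seconds(LAST_LOCAL_LOG, FENCE_DECISION))  # 113 seconds
```

That is 1 minute 53 seconds with nothing logged locally on ltaoperdbs02.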

> So the cluster works flawlessly, as expected: as soon as ltaoperdbs02
> became "unreachable", it formed a new quorum, fenced the lost node and
> promoted the new master.

Exactly.

> What I can't find out is WHY it happened.
> There is no useful information in the system logs, nor in the
> iDRAC motherboard logs.

I suppose that is because some logs were not synced to disk when the server
was fenced.

Either the server clocks were not synchronized (which I doubt), or you really
lost almost 2 minutes of logs.

> Is there a way to improve or configure a log system for a fenced / failed
> node?

Yes:

1. Set up rsyslog to export logs to some dedicated logging servers. Such
servers should receive and save logs from your clusters and other hardware
(network?) and keep them safe. You will not lose messages anymore.

2. Gather a lot of system metrics and keep them safe (e.g. export them using
pcp, collectd, etc). Metrics and visualization are important to cross-compare
with logs and pinpoint something behaving outside of the usual scope.
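
Once remote logging is in place, it is worth checking that a message sent
from a cluster node really lands on the logging server. A minimal sketch using
the Python standard library, assuming the server accepts syslog over UDP; the
function name, the "cluster-logtest" tag and the host name in the usage
comment are made up for the example:

```python
import logging
import logging.handlers
import socket


def send_test_message(host: str, port: int, text: str) -> None:
    """Send a single syslog message over UDP to host:port."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    # Hypothetical tag, only there to make the message easy to grep for.
    handler.setFormatter(logging.Formatter("cluster-logtest: %(message)s"))
    logger = logging.getLogger("cluster-logtest")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        logger.removeHandler(handler)
        handler.close()


# Usage (hypothetical server name, replace it with your own):
#   send_test_message("logs.example.com", 514, "test from ltaoperdbs02")
```

Then grep for "cluster-logtest" on the logging server. UDP gives no delivery
guarantee, so this only checks the path, not the reliability of the setup.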

Looking at your log, I still find your query times suspicious. I'm not
convinced they are the root cause; they might just be a symptom of something
going wrong there. A one-row INSERT taking 649.754ms is suspicious. Maybe it's
just a locking problem, maybe some CPU-bound PostGIS processing is involved,
maybe with some GIN or GiST indexes, but it's still suspicious considering the
server is oversized in terms of performance, as you stated...

And maybe the network or SAN had a hiccup and corosync was too sensitive to
it. Maybe check the retransmit and timeout parameters?
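
As a rough guide when reviewing those parameters: according to
corosync.conf(5), when the nodelist contains at least 3 nodes the real token
timeout is token + (number_of_nodes - 2) * token_coefficient, and consensus
defaults to 1.2 * token. A small sketch of that arithmetic; the 3-node count
and the 1000 ms / 650 ms values are assumptions for the example, so check the
man page of your corosync version and your own corosync.conf:

```python
def effective_token_ms(token_ms: int, nodes: int,
                       token_coefficient_ms: int = 650) -> int:
    """Token timeout actually used, following corosync.conf(5).

    The coefficient only applies when the nodelist has at least 3 nodes.
    """
    if nodes >= 3:
        return token_ms + (nodes - 2) * token_coefficient_ms
    return token_ms


def default_consensus_ms(token_ms: int) -> int:
    """Default consensus timeout: 1.2 * token (integer arithmetic)."""
    return token_ms * 12 // 10


# Assumed values: token = 1000 ms, 3 nodes, token_coefficient = 650 ms.
token = effective_token_ms(1000, nodes=3)
print(token)                        # 1650
print(default_consensus_ms(token))  # 1980
```

With values in that range, a stall of a couple of seconds on the node or on
the network is enough for the other nodes to declare the token lost.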

Regards,