[ClusterLabs] Re: [EXT] Pacemaker alerts log duplication.

Fri Jul 9 11:50:20 EDT 2021

On Thu, 2021-07-08 at 10:00 +0200, Ulrich Windl wrote:
> > > > Amol Shinde <amol.shinde at seagate.com> wrote on 08.07.2021 at
> > > > 08:58 in message
> <MW3PR20MB3385122EBAC1AB9282C3F91AE9199 at MW3PR20MB3385.namprd20.prod.outlook.com>:
> 
> > Hello everyone!!!
> > Hope you are doing well.
> > I need some help regarding Pacemaker alerts. I have a 36‑node cluster
> > set up with some IP and dummy resources. I have also deployed an alert
> > script for the cluster that monitors the nodes and resources and
> > generates alerts when events occur. The alert script is present on all
> > nodes and sends the captured alert to a Web‑UI using a message bus. So,
> > for example, when a node goes offline, Pacemaker triggers the alert
> > agent script on the other nodes in the cluster and logs the event as
> > "Node is lost". This message is then sent to the message bus by the
> > script.
> > 
> > The problem is that, since the alert is triggered on every node, the
> > agent script sends multiple duplicate log messages to the message bus.
> > Duplicate log messages from all the live nodes are reported to the
> > Web‑UI, thus clogging up the interface, making parsing through it
> > difficult, and ruining the user experience.
> > 
> > Is there any way in Pacemaker itself to have it call the agent on only
> > one node when an event occurs and log the message once, rather than
> > calling the agent on all live nodes in the cluster? For example, when
> > a node goes offline, the agent would be triggered on any one of the
> > live nodes, generating one log rather than multiple duplicate logs for
> > the same event.

Not currently.

It's not straightforward -- cluster partitions can happen in many ways
besides just one node leaving (splitting into two active partitions,
every node in its own partition, etc.). Pacemaker coordinates nodes
within a partition by electing a DC, but waiting on that coordination
before sending alerts could unnecessarily delay them.

Basically we decided that it's up to whatever is receiving the alerts
to de-duplicate them.

> If there were a cluster-wide "event ID" (e.g. a sequence number; I
> don't actually know whether one exists) and that event ID were passed
> to the alerting function, then you'd still create multiple events, but
> the backend could suppress multiple events about the same event ID.

No, there isn't. There's a CRM_alert_node_sequence passed to the agent,
but it's node-local, so it only lets the agent reliably detect the order
of alerts on a single node. It can't be used to match the same event
across nodes.
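
To illustrate what a node-local sequence can and cannot do, here is a
minimal Python sketch for the receiving side. It assumes each message
carries a "reporter" field (the node whose agent sent it) and a
"sequence" field (the integer value of CRM_alert_node_sequence). Both
field names are made up for this example, not anything Pacemaker
defines. Sequences are compared only within one reporter, never across
reporters.

```python
from collections import defaultdict


def order_per_node(messages):
    """Group alert messages by reporting node, ordered by node-local sequence.

    Each message is a dict with hypothetical keys:
      "reporter" - name of the node whose alert agent sent the message
      "sequence" - int(CRM_alert_node_sequence) as seen on that node
    """
    per_node = defaultdict(list)
    for msg in messages:
        per_node[msg["reporter"]].append(msg)
    # Sorting is only meaningful inside one node's list; the same sequence
    # number on two different nodes says nothing about the same event.
    return {
        node: sorted(msgs, key=lambda m: m["sequence"])
        for node, msgs in per_node.items()
    }
```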

A timestamp is also passed to the agent, both in a format specified by
the user and in seconds and microseconds since the epoch, so if the
clocks are closely synchronized, it should be feasible to de-duplicate
on the receiving end.
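
As a rough sketch of what receiver-side de-duplication by timestamp
could look like (in Python, purely illustrative): alerts describing the
same thing are collapsed if their timestamps fall within a short window.
The Alert fields, the Deduplicator class and the default window are
assumptions made for this example. The comments name the alert
environment variables each field is meant to be filled from
(CRM_alert_kind, CRM_alert_node, CRM_alert_desc,
CRM_alert_timestamp_epoch, CRM_alert_timestamp_usec); check the alerts
documentation for your Pacemaker version for what is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    reporter: str  # node whose agent sent the message (hypothetical field)
    kind: str      # from CRM_alert_kind, e.g. "node"
    subject: str   # from CRM_alert_node (or the resource name for resource alerts)
    desc: str      # from CRM_alert_desc, e.g. "lost"
    epoch: int     # from CRM_alert_timestamp_epoch (seconds)
    usec: int      # from CRM_alert_timestamp_usec (microseconds)


class Deduplicator:
    """Accept the first alert about an event; drop copies close in time."""

    def __init__(self, window_seconds=5.0):
        # The window must cover clock skew plus detection delay between nodes.
        self.window = window_seconds
        self._first_seen = {}  # event key -> timestamp of the accepted alert

    def accept(self, alert):
        # The reporter is deliberately not part of the key: copies of one
        # event differ only in which node reported them.
        key = (alert.kind, alert.subject, alert.desc)
        ts = alert.epoch + alert.usec / 1_000_000
        first = self._first_seen.get(key)
        if first is not None and abs(ts - first) <= self.window:
            return False  # duplicate of an alert already forwarded
        self._first_seen[key] = ts
        return True
```

The receiver would call accept() for each incoming message and forward
only those for which it returns True. How wide the window needs to be
depends on how closely the clocks are synchronized.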

> Regards,
> Ulrich
-- 
Ken Gaillot <kgaillot at redhat.com>