[ClusterLabs] Alert notes

Klaus Wenninger kwenning at redhat.com
Thu Jun 16 10:07:34 UTC 2016


On 06/16/2016 11:05 AM, Ferenc Wágner wrote:
> Klaus Wenninger <kwenning at redhat.com> writes:
>
>> On 06/15/2016 06:11 PM, Ferenc Wágner wrote:
>>
>>> Please find some random notes about my adventures testing the new alert
>>> system.
>>>
>>> The first alert example in the documentation has no recipient:
>>>
>>>     <alert id="my-alert" path="/path/to/my-script.sh" />
>>>
>>>     In the example above, the cluster will call my-script.sh for each
>>>     event.
>>>
>>> while the next section starts as:
>>>
>>>     Each alert may be configured with one or more recipients. The cluster
>>>     will call the agent separately for each recipient.
>> The goal of the first example is to be as simple as possible.
>> But of course it makes sense to mention that it is not compulsory
>> to add a recipient. And I guess it makes sense to point that out
>> as it is just ugly to think that you have to fake a recipient while
>> it wouldn't make any sense in your context.
> I agree.
>
>>> I think the default timestamp should contain date and time zone
>>> specification to make it unambiguous.
>> Idea was to have a trade-off between length and amount of information.
> I don't think it's worth saving a couple of bytes by dropping this
> information.  In many cases there will be some way to recover it (from
> SMTP headers or system logs), but that complicates things.
It wasn't about saving a few bytes of file size but rather about
keeping the output readable. If the timestamp fills your screen
you won't be able to read the actual information... have a look
at /var/log/messages...
The sole intention was to have a default that produces nice-looking
output together with the file example, to give people an impression
of what they could do with the feature.
>
> In a similar vein, keeping the sequence number around would simplify
> alert ordering and loss detection on the receiver side.  Especially with
> SNMP, where the transport is unreliable as well.
Nice idea... any OID in mind?
Unfortunately the sequence number we have right now as an environment
variable is not really fit for this purpose. It counts up with each and
every alert being sent on a single node. So if you have multiple alerts
configured you would see gaps that prevent you from using it for loss
detection.
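
To illustrate the gap problem, here is a toy model in plain Python.
It is not Pacemaker code; the names deliver and has_gaps and the alert
names are made up for this sketch, and it assumes the counter is bumped
once per alert call. With two alerts configured, each receiver sees only
every other number even though nothing was lost:

```python
from itertools import count

def deliver(num_events, alerts):
    """Simulate one node-wide sequence counter shared by all alerts.

    Returns a dict mapping each alert name to the sequence numbers
    its receiver would see if no message were lost in transit.
    """
    seq = count(1)                       # one counter for the whole node
    seen = {name: [] for name in alerts}
    for _ in range(num_events):
        for name in alerts:              # every configured alert is called per event
            seen[name].append(next(seq))
    return seen

def has_gaps(numbers):
    """True if consecutive numbers do not differ by exactly 1."""
    return any(b - a != 1 for a, b in zip(numbers, numbers[1:]))

seen = deliver(3, ["snmp", "file"])
print(seen)
# {'snmp': [1, 3, 5], 'file': [2, 4, 6]}
print({name: has_gaps(nums) for name, nums in seen.items()})
# {'snmp': True, 'file': True}
```

A receiver cannot tell these built-in gaps from real losses, so a
usable sequence number would have to be counted per alert (or per
recipient) rather than per node.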
>
>>> (BTW I'd prefer to run the alert scripts as a different user than the
>>> various Pacemaker components, but that would lead too far now.)
>> well, something we thought about already and a point where the
>> new feature breaks the ClusterMon-Interface.
>> Unfortunately the impact is quite high - crmd has dropped privileges -
>> but if the pain-level rises high enough ...
> There's very little room to do this.  You'd need to configure an alert
> user and group, and store them in the saved uid/gid set before dropping
> privileges for the crmd process.  Or use a separate daemon for sending
> alerts, which feels cleaner.
Yes, a second daemon was the idea. We don't want to give more rights
to crmd than it needs. Btw. the daemon is there already: lrmd ;-)
>>> The SNMP agent seems to have a problem with hrSystemDate, which should
>>> be an OCTETSTR with strict format, not some plain textual timestamp.
>>> But I haven't really looked into this yet.
>> Actually I had tried it with the snmptrap-tool coming with rhel-7.2
>> and it worked with the string given in the example.
>> Did you copy it 1-1? There is a typo in the document having the
>> double-quotes double. The format is strict and there are actually
>> 2 formats allowed - one with timezone and one without. The
>> format string given should match the latter.
> You are right.  The snmptrap tool does the string->binary conversion if
> it gets the correct format.  Otherwise, if the length matches, it does a
> plain cast to binary, interpreting for example 12:34:56.78 as
> 12594-58-51,52:58:53.54,.55:56.  Looks like the sample SNMP alert agent
> shouldn't let the users choose any timestamp-format but
> %Y-%m-%d,%H:%M:%S.%1N,%:z; unfortunately there's no way to enforce this
Well, generic vs. failsafe  ;-)
Of course one could introduce something like the metadata in RAs
to achieve things like that, but we wanted to keep it simple...
After all the scripts are just examples...and the timestamp-format
that should work is given in the header of the script...
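
For reference, the odd value quoted above can be reproduced by hand.
The sketch below is plain Python and not what snmptrap actually runs;
it simply unpacks raw bytes according to the DateAndTime layout from
SNMPv2-TC (2-byte year, then month, day, hour, minute, second,
deci-second, and optionally a direction character plus UTC-offset
hours and minutes) and prints them in the usual display form. The
function name is made up for illustration:

```python
import struct

def cast_to_date_and_time(raw: bytes) -> str:
    """Interpret raw bytes as an SNMPv2-TC DateAndTime value and render it."""
    if len(raw) == 8:       # variant without time zone
        y, mo, d, h, mi, s, ds = struct.unpack(">H6B", raw)
        return f"{y}-{mo}-{d},{h}:{mi}:{s}.{ds}"
    if len(raw) == 11:      # variant with time zone
        y, mo, d, h, mi, s, ds, sign, tz_h, tz_m = struct.unpack(">H6BcBB", raw)
        return (f"{y}-{mo}-{d},{h}:{mi}:{s}.{ds},"
                f"{sign.decode('latin-1')}{tz_h}:{tz_m}")
    raise ValueError("DateAndTime must be 8 or 11 octets long")

# "12:34:56.78" happens to be 11 characters, so it passes the length check
print(cast_to_date_and_time(b"12:34:56.78"))
# 12594-58-51,52:58:53.54,.55:56
```

So any 8- or 11-character timestamp string that is not in the expected
textual format ends up as a syntactically valid but meaningless date.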

> in the current design.  Maybe it would be more appropriate to get the
> timestamp from crmd as a high resolution (fractional) epoch all the
> time, and do the string conversion in the agents as necessary.  One
> could still control the format via instance_attributes where allowed.
> Or keep around the current mechanism as well to reduce code duplication
> in the agents.  Just some ideas...
Epoch was actually my first default...
An additional epoch value might be an interesting alternative...
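
A rough sketch of what the agent-side conversion could look like if
the agent were handed a fractional epoch. This is only an assumption
about a possible design, not existing Pacemaker behaviour, and the
function name is hypothetical. It produces the same layout as
date +'%Y-%m-%d,%H:%M:%S.%1N,%:z':

```python
from datetime import datetime, timedelta, timezone

def epoch_to_snmp_timestamp(epoch: float, tz: timezone = timezone.utc) -> str:
    """Render a fractional epoch as YYYY-MM-DD,HH:MM:SS.d,+hh:mm."""
    dt = datetime.fromtimestamp(epoch, tz=tz)
    deci = dt.microsecond // 100_000            # truncate to deci-seconds
    offset_min = int(dt.utcoffset().total_seconds()) // 60
    sign = "+" if offset_min >= 0 else "-"
    tz_h, tz_m = divmod(abs(offset_min), 60)
    return f"{dt:%Y-%m-%d,%H:%M:%S}.{deci},{sign}{tz_h:02d}:{tz_m:02d}"

print(epoch_to_snmp_timestamp(1466071654.5))
# 2016-06-16,10:07:34.5,+00:00
print(epoch_to_snmp_timestamp(1466071654.5, timezone(timedelta(hours=2))))
# 2016-06-16,12:07:34.5,+02:00
```

With the format fixed inside the agent, a user-supplied
timestamp-format could no longer break the SNMP conversion.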




