[ClusterLabs] Alert notes
wferi at niif.hu
Thu Jun 16 08:40:18 EDT 2016
Klaus Wenninger <kwenning at redhat.com> writes:
> On 06/16/2016 11:05 AM, Ferenc Wágner wrote:
>> Klaus Wenninger <kwenning at redhat.com> writes:
>>> On 06/15/2016 06:11 PM, Ferenc Wágner wrote:
>>>> I think the default timestamp should contain date and time zone
>>>> specification to make it unambiguous.
>>> Idea was to have a trade-off between length and amount of information.
>> I don't think it's worth saving a couple of bytes by dropping this
>> information. In many cases there will be some way to recover it (from
>> SMTP headers or system logs), but that complicates things.
> Wasn't about saving some bytes in the size of a file or so but
> rather to keep readability. If the timestamp fills your screen
> you won't be able to read the actual information...have a look
> at /var/log/messages...
> Pure intention was to have a default that creates a kind of nice-looking
> output together with the file-example to give people an impression
> what they could do with the feature.
I see. Incidentally, the file example is probably the one which would
profit most from having full timestamps. And some locking.
>> In a similar vein, keeping the sequence number around would simplify
>> alert ordering and loss detection on the receiver side. Especially with
>> SNMP, where the transport is unreliable as well.
> Nice idea... any OID in mind?
No. But you can always extend PACEMAKER-MIB.
> Unfortunately the sequence number we have right now as an environment
> variable is not really fit for this purpose. It counts up with each
> and every alert being sent on a single node. So if you have multiple
> alerts configured you would experience gaps that prevent you from
> using it as loss-detection.
I see, it isn't per alert, unfortunately. Still better than nothing.
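
A minimal Python sketch of the gap problem described above. It assumes,
purely for illustration, that two alerts are configured and that the
node-wide counter advances once per alert sent, so each receiver sees
only every other number. The function name and the sample numbers are
hypothetical.

```python
def missing_sequence_numbers(received):
    """Return the sequence numbers absent between the lowest and highest seen."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Hypothetical node-wide counter values for six alerts, alternating
# between two configured alerts A and B (assumed interleaving).
receiver_a = [1, 3, 5]
receiver_b = [2, 4, 6]

# Receiver A cannot tell these "gaps" from real losses:
print(missing_sequence_numbers(receiver_a))  # [2, 4]

# Only someone seeing both streams could confirm that nothing was lost:
print(missing_sequence_numbers(receiver_a + receiver_b))  # []
```

With a per-alert counter, the same check on a single receiver would flag
only genuinely lost alerts.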
>>>> (BTW I'd prefer to run the alert scripts as a different user than the
>>>> various Pacemaker components, but that would lead too far now.)
>>> well, something we thought about already and a point where the
>>> new feature breaks the ClusterMon-Interface.
>>> Unfortunately the impact is quite high - crmd has dropped privileges -
>>> but if the pain-level rises high enough ...
>> There's very little room to do this. You'd need to configure an alert
>> user and group, and store them in the saved uid/gid set before dropping
>> privileges for the crmd process. Or use a separate daemon for sending
>> alerts, which feels cleaner.
> Yes 2nd daemon was the idea. We don't want to give more rights
> to crmd than it needs. Btw. the daemon is there already: lrmd ;-)
It's running as root already, so at least no problem changing to any
user. And the default could be hacluster.
>> You are right. The snmptrap tool does the string->binary conversion if
>> it gets the correct format. Otherwise, if the length matches, it does a
>> plain cast to binary, interpreting for example 12:34:56.78 as
>> 12594-58-51,52:58:53.54,.55:56. Looks like the sample SNMP alert agent
>> shouldn't let the users choose any timestamp-format but
>> %Y-%m-%d,%H:%M:%S.%1N,%:z; unfortunately there's no way to enforce this
>> in the current design.
> Well, generic vs. failsafe ;-)
> Of course one could introduce something like the metadata in RAs
> to achieve things like that but we wanted to keep the ball flat...
> After all the scripts are just examples...and the timestamp-format
> that should work is given in the header of the script...
More emphasis would help, I think.
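
To make the plain cast quoted above concrete, here is a small Python
sketch that decodes 11 raw octets the way the SNMPv2 DateAndTime textual
convention lays them out (2-octet year, then month, day, hour, minute,
second, deciseconds, a direction character, and the UTC offset hours and
minutes). It only mimics the interpretation; it is not the snmptrap
code, and the function name is made up.

```python
import struct


def decode_date_and_time(raw: bytes) -> str:
    """Render 11 octets as an SNMPv2 DateAndTime display string."""
    if len(raw) != 11:
        raise ValueError("expected the 11-octet DateAndTime form")
    # Big-endian: 2-octet year, six single octets, one char, two single octets.
    (year, month, day, hour, minute, second,
     deci, direction, utc_hours, utc_minutes) = struct.unpack(">HBBBBBBcBB", raw)
    sign = direction.decode("ascii")
    return (f"{year}-{month}-{day},{hour}:{minute}:{second}.{deci},"
            f"{sign}{utc_hours}:{utc_minutes}")


# An 11-character time string that is not in the expected format is
# taken byte for byte:
print(decode_date_and_time(b"12:34:56.78"))  # 12594-58-51,52:58:53.54,.55:56
```

The characters "12" become the year 0x3132 = 12594, ":" becomes month 58,
and so on, which is why a wrong timestamp-format of matching length
yields nonsense rather than an error.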
>> Maybe it would be more appropriate to get the timestamp from crmd as
>> a high resolution (fractional) epoch all the time, and do the string
>> conversion in the agents as necessary. One could still control the
>> format via instance_attributes where allowed. Or keep around the
>> current mechanism as well to reduce code duplication in the agents.
>> Just some ideas...
> epoch was actually my first default ...
> additional epoch might be interesting alternative...
It would be useful. Actually, crm_time_format_hr() currently fails for
any format string ending with any %-escape but N. For example, "%Yx" is
formatted as "2016x", but "%Y" returns NULL. You can avoid fixing this
by providing a fractional epoch instead. :)
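
A rough Python sketch of what the agent-side conversion could look like
if the agent were handed a fractional epoch. How the epoch would reach
the agent is left open here; the function name and its parameters are
assumptions. It produces the layout %Y-%m-%d,%H:%M:%S.%1N,%:z, truncating
the fraction to deciseconds, and assumes a non-negative epoch.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def format_snmp_timestamp(epoch: str, tz: timezone = timezone.utc) -> str:
    """Format a fractional epoch string as YYYY-MM-DD,HH:MM:SS.d,+hh:mm."""
    value = Decimal(epoch)          # Decimal avoids float rounding surprises
    whole = int(value)              # whole seconds (epoch assumed >= 0)
    deci = int((value - whole) * 10)  # first fractional digit, truncated
    moment = datetime.fromtimestamp(whole, tz)
    offset_minutes = int(moment.utcoffset().total_seconds() // 60)
    sign = "+" if offset_minutes >= 0 else "-"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return f"{moment:%Y-%m-%d,%H:%M:%S}.{deci},{sign}{hours:02d}:{minutes:02d}"


# The send time of this mail, with a made-up fractional part:
edt = timezone(timedelta(hours=-4))
print(format_snmp_timestamp("1466080818.25", edt))  # 2016-06-16,08:40:18.2,-04:00
```

An agent that wants a different layout would then only have to change
the final format string.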