[ClusterLabs] Antw: [EXT] True time periods/CLOCK_MONOTONIC node vs. cluster wide (Was: Coming in Pacemaker 2.0.4: dependency on monotonic clock for systemd resources)

Thu Mar 12 03:22:50 EDT 2020

Hi!

Sorry for top-posting, but if you have NTP-synced your nodes, CLOCK_MONOTONIC
will not have much advantage over CLOCK_REALTIME, as both clocks will advance
at much the same rate and neither will "jump". IMHO the absence of jumps is
the main reason for using CLOCK_MONOTONIC (if the admin decides to adjust the
real-time clock).
So much for the theory. In practice the clock does jump, even with NTP,
especially if the node had been running for a long time, is not updating the
RTC, and then is fenced. The clock may then be off by minutes after boot, and
NTP is quite conservative when adjusting the time (at first it won't believe
that the clock is off that far; then, after some minutes, the clock will
actually jump. That's why some fast pre-ntpd adjustment is generally used.)
The point is (if you can sync your clocks to real-time at all): How long do
you want to wait for all your nodes to agree on some common time? Maybe
CLOCK_MONOTONIC could help here...

The other useful application is any sort of timeout or repeating timer that
should not be affected by adjustments of the real-time clock.
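
A minimal sketch of such a timeout in Python, assuming a caller-supplied
check() function (hypothetical, as are the names wait_until, timeout_s and
interval_s). The deadline is taken from time.monotonic(), so stepping the
real-time clock neither shortens nor extends the wait, and the remaining
timeout is recomputed before every sleep:

```python
import time


def wait_until(check, timeout_s, interval_s=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s has elapsed.

    clock and sleep are injectable only to make the function testable;
    by default the node-local monotonic clock is used.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

The deadline is only meaningful on the node (and within the boot) where it
was computed; it must not be stored and compared elsewhere.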

Regards,
Ulrich

>>> Jan Pokorný <jpokorny at redhat.com> wrote on 11.03.2020 at 23:43 in
message
<17407_1583966641_5E6969B1_17407_22_1_20200311224339.GA481 at redhat.com>:
> On 11/03/20 09:04 ‑0500, Ken Gaillot wrote:
>> On Wed, 2020‑03‑11 at 08:20 +0100, Ulrich Windl wrote:
>>> You only have to take care not to compare CLOCK_MONOTONIC
>>> timestamps between nodes or node restarts. 
>> 
>> Definitely :)
>> 
>> They are used only to calculate action queue and run durations
> 
> Both these ... from an isolated perspective of a single node only.
> E.g., run durations related to the one currently responsible to act
> upon the resource in some way (the "atomic" operation is always
> bound to the single host context and when retried or logically
> followed with another operation, it's measured anew on pertaining,
> perhaps different node).
> 
> I feel that's a rather important detail, and just recently this
> surface received some slight scratching on the conceptual level...
> 
> Current inability to synchronize measurements of CLOCK_MONOTONIC
> like notions of time amongst nodes (especially transfer from old,
> possibly failed DC to new DC, likely involving some admitted loss
> of preciseness ‑‑ mind you, cluster is never fully synchronous,
> you'd need the help of specialized HW for that) in as lossless
> way as possible is what I believe is the main show stopper for
> being able to accurately express the actual "availability score"
> for given resource or resource group ‑‑‑ yep, that famous number,
> the holy grail of anyone taking HA seriously ‑‑‑ while at the
> same time, something the cluster stack currently cannot readily
> present to users (despite it having all or most of the relevant
> information, just piecewise).
> 
> IOW, this sort of non‑localized measurement is what asks for
> emulation of cluster‑wide CLOCK_MONOTONIC‑like measurement, which is
> not that trivial if you think about it.  Sort of a corollary of what
> Ulrich said, because emulating that pushes you exactly in these waters
> of relating CLOCK_MONOTONIC measurements from different nodes
> together.
> 
>   Not to speak of evaluating whether any node is totally off in its
>   own CLOCK_MONOTONIC measurements and hence shall rather be fenced
>   as "brain damaged", and perhaps even using the measurements of the
>   nodes keeping up together to somehow calculate what's the average
>   rate of measured time progress so as to self‑maintain time‑bound
>   cluster‑wide integrity, which may just as well be important for
>   sbd(!).  (nope, this doesn't get anywhere close to near‑light
>   speed concerns, just imprecise HW and possibly implied/or
>   inter‑VM differences)
> 
> Perhaps cheapest way out would be to use NTP‑level algorithms to
> synchronize two CLOCK_MONOTONIC timers at the point the worker node
> for resource in question claimed "resource stopped", between this
> worker node and DC, so that the DC can synchronize again like that
> with a new worker node at the point in time when this new claims
> "resource started".  At that point, DC would have a rather accurate
> knowledge of how long this fail‑/move‑over, hence down‑time, lasted,
> hence being able to reflect it to the "availability score" equations.
> 
>   Hmm, no wonder that businesses with deep pockets and serious
>   synchronicity requirements across the globe resort to using atomic
>   clocks, incredibly precise CLOCK_MONOTONIC by default :‑)
> 
>> For most resource types those are optional (for reporting only), but
>> systemd resources require them (multiple status checks are usually
>> necessary to verify a start or stop worked, and we need to check the
>> remaining timeout each time).
> 
> Coincidentally, IIRC systemd alone strictly requires CLOCK_MONOTONIC
> (and we shall get a lot more strict as well to provide reasonable
> expectations to the users as mentioned recently[*]), so said
> requirement is just a logical extension without corner cases.
> 
> [*] https://lists.clusterlabs.org/pipermail/users/2019-November/026647.html

> 
> ‑‑ 
> Jan (Poki)
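
The "NTP-level" synchronization of two CLOCK_MONOTONIC timers suggested in
the quoted message could be sketched roughly as below. This is only an
illustration using the standard four-timestamp offset/delay estimate known
from NTP; it is not something Pacemaker implements, and all function names
and numbers are made up. It assumes symmetric network latency (the offset
error is bounded by half the measured delay) and ignores any difference in
clock rate between the nodes.

```python
def estimate_offset(t0, t1, t2, t3):
    """NTP-style estimate from one request/response exchange.

    t0: request sent      (local clock, e.g. the DC)
    t1: request received  (remote clock, e.g. the worker node)
    t2: response sent     (remote clock)
    t3: response received (local clock)
    Returns (offset, delay), where offset is "remote minus local".
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def to_local(remote_ts, offset):
    """Map a remote monotonic timestamp onto the local clock."""
    return remote_ts - offset


# Hypothetical numbers: the old worker's clock runs 5000 s ahead of the
# DC's, the new worker's clock 700 s behind; one-way latency is 0.25 s.
off_old, _ = estimate_offset(1000.0, 6000.25, 6000.5, 1000.75)
off_new, _ = estimate_offset(1030.0, 330.25, 330.5, 1030.75)

stopped_at = to_local(6000.0, off_old)  # "resource stopped" on old worker
started_at = to_local(342.0, off_new)   # "resource started" on new worker
downtime = started_at - stopped_at      # expressed on the DC's clock
```

With both events mapped onto the DC's own monotonic clock, the difference
is the down-time that would feed into an "availability score".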