[ClusterLabs] True time periods/CLOCK_MONOTONIC node vs. cluster wide (Was: Coming in Pacemaker 2.0.4: dependency on monotonic clock for systemd resources)

Jan Pokorný jpokorny at redhat.com
Wed Mar 11 18:43:39 EDT 2020


On 11/03/20 09:04 -0500, Ken Gaillot wrote:
> On Wed, 2020-03-11 at 08:20 +0100, Ulrich Windl wrote:
>> You only have to take care not to compare CLOCK_MONOTONIC
>> timestamps between nodes or node restarts. 
> 
> Definitely :)
> 
> They are used only to calculate action queue and run durations

Both of these ... and only from the isolated perspective of a single
node.  E.g., run durations relate to whichever node is currently
responsible for acting upon the resource in some way (the "atomic"
operation is always bound to a single host's context, and when it is
retried or logically followed by another operation, it is measured
anew on the pertaining, perhaps different, node).
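
To make the "single node only" point concrete, here is roughly what
such a per-node measurement boils down to (a minimal sketch, not
pacemaker's actual internals; all names are made up):

  #include <stdio.h>
  #include <time.h>

  /* seconds elapsed between two CLOCK_MONOTONIC stamps */
  static double elapsed_s(const struct timespec *a, const struct timespec *b)
  {
      return (b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) / 1e9;
  }

  int main(void)
  {
      struct timespec start, end;

      clock_gettime(CLOCK_MONOTONIC, &start);
      /* ... run the resource action on this node ... */
      clock_gettime(CLOCK_MONOTONIC, &end);

      printf("run duration: %.3f s (meaningful on this node only)\n",
             elapsed_s(&start, &end));
      return 0;
  }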

I feel that's a rather important detail, and just recently this
surface received some slight scratching on the conceptual level...

The current inability to synchronize measurements of CLOCK_MONOTONIC-
like notions of time amongst nodes in as lossless a way as possible
(especially the transfer from an old, possibly failed DC to a new DC,
likely involving some admitted loss of preciseness -- mind you, a
cluster is never fully synchronous, you'd need the help of specialized
HW for that) is what I believe is the main show stopper for being able
to accurately express the actual "availability score" for a given
resource or resource group --- yep, that famous number, the holy grail
of anyone taking HA seriously --- while at the same time, something
the cluster stack currently cannot readily present to users (despite
it having all or most of the relevant information, just piecewise).
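
To spell out what that number would reduce to once the piecewise
down-time intervals could be stitched together (an illustration only,
nothing like this is computed by pacemaker today):

  #include <stddef.h>

  /* availability of a resource over an observation window, given the
   * down-time intervals measured cluster-wide (hypothetical helper) */
  double availability(double window_s, const double *downtimes_s, size_t n)
  {
      double down_s = 0.0;
      for (size_t i = 0; i < n; i++)
          down_s += downtimes_s[i];
      return (window_s - down_s) / window_s;  /* e.g. 0.99999 ~ "five nines" */
  }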

IOW, this sort of non-localized measurement is what calls for an
emulation of a cluster-wide CLOCK_MONOTONIC-like measurement, which is
not that trivial if you think about it.  Sort of a corollary of what
Ulrich said, because emulating that pushes you exactly into these
waters of relating CLOCK_MONOTONIC measurements from different nodes
to each other.

  Not to speak of evaluating whether any node is totally off in its
  own CLOCK_MONOTONIC measurements and hence should rather be fenced
  as "brain damaged", and perhaps even using the measurements of the
  nodes that keep up together to somehow calculate the average rate
  of measured time progress, so as to self-maintain time-bound
  cluster-wide integrity, which may just as well be important for
  sbd(!).  (Nope, this doesn't get anywhere close to near-light-speed
  concerns, just imprecise HW and possibly implied or inter-VM
  differences.)
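
Just to sketch the flavour of such a self-check (purely hypothetical,
nothing of the sort exists in pacemaker or sbd): if each node reported
how much its CLOCK_MONOTONIC advanced over the same nominal interval,
an outlier could be flagged like this:

  #include <math.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* deltas_s[i] = how many seconds node i's CLOCK_MONOTONIC advanced
   * over the same nominal interval; flag a node if it deviates from
   * the cluster-average rate by more than tolerance (e.g. 0.01 = 1 %) */
  bool node_clock_suspect(const double *deltas_s, size_t n_nodes,
                          size_t node, double tolerance)
  {
      double sum = 0.0;
      for (size_t i = 0; i < n_nodes; i++)
          sum += deltas_s[i];
      double mean = sum / n_nodes;
      return fabs(deltas_s[node] - mean) / mean > tolerance;
  }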

Perhaps the cheapest way out would be to use NTP-level algorithms to
synchronize two CLOCK_MONOTONIC timers, between the DC and the worker
node for the resource in question, at the point this worker node
claimed "resource stopped", so that the DC can synchronize again like
that with a new worker node at the point in time when this new node
claims "resource started".  At that point, the DC would have rather
accurate knowledge of how long this fail-/move-over, hence down-time,
lasted, and hence be able to reflect it in the "availability score"
equations.
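
For the curious, the classic NTP arithmetic such a synchronization
would borrow looks roughly like this (a sketch only; the timestamps
and the exchange carrying them are hypothetical, all in seconds of the
respective node's CLOCK_MONOTONIC):

  /* t1: DC sends a probe, t2: worker receives it, t3: worker replies,
   * t4: DC receives the reply */
  typedef struct {
      double t1, t2, t3, t4;
  } probe_t;

  /* estimated offset of the worker's monotonic clock vs. the DC's */
  double estimate_offset(const probe_t *p)
  {
      return ((p->t2 - p->t1) + (p->t3 - p->t4)) / 2.0;
  }

  /* round-trip delay, usable as an error bound on the offset */
  double estimate_delay(const probe_t *p)
  {
      return (p->t4 - p->t1) - (p->t3 - p->t2);
  }

With such offsets in hand, the DC could map the worker-local "resource
stopped" and "resource started" stamps onto its own monotonic timeline
and subtract them directly.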

  Hmm, no wonder that businesses with deep pockets and serious
  synchronicity requirements across the globe resort to using atomic
  clocks, an incredibly precise CLOCK_MONOTONIC by default :-)

> For most resource types those are optional (for reporting only), but
> systemd resources require them (multiple status checks are usually
> necessary to verify a start or stop worked, and we need to check the
> remaining timeout each time).

Coincidentally, IIRC systemd alone strictly requires CLOCK_MONOTONIC
(and we shall get a lot more strict as well, to provide reasonable
expectations to the users, as mentioned recently[*]), so said
requirement is just a logical extension without corner cases.

[*] https://lists.clusterlabs.org/pipermail/users/2019-November/026647.html
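
Ken's point about the repeated status checks is also where the
monotonic requirement bites in practice: the remaining timeout budget
must not jump when the wall clock is stepped.  Roughly (a sketch with
made-up names, not pacemaker's implementation):

  #include <stdbool.h>
  #include <time.h>
  #include <unistd.h>

  extern bool unit_is_active(void);  /* stand-in for a systemd status query */

  /* poll until the unit is active or the operation timeout is used up */
  bool wait_for_start(long timeout_ms)
  {
      struct timespec t0, now;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (;;) {
          if (unit_is_active())
              return true;

          clock_gettime(CLOCK_MONOTONIC, &now);
          long elapsed_ms = (now.tv_sec - t0.tv_sec) * 1000
                            + (now.tv_nsec - t0.tv_nsec) / 1000000;
          if (elapsed_ms >= timeout_ms)
              return false;          /* operation timed out */

          usleep(250 * 1000);        /* check again shortly */
      }
  }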

-- 
Jan (Poki)