[ClusterLabs] Pacemaker log showing time mismatch after
Jan Pokorný
jpokorny at redhat.com
Mon Feb 11 17:49:58 EST 2019
On 11/02/19 15:03 -0600, Ken Gaillot wrote:
> On Fri, 2019-02-01 at 08:10 +0100, Jan Pokorný wrote:
>> On 28/01/19 09:47 -0600, Ken Gaillot wrote:
>>> On Mon, 2019-01-28 at 18:04 +0530, Dileep V Nair wrote:
>>> Pacemaker can handle the clock jumping forward, but not backward.
>>
>> I am rather surprised, are we not using monotonic time only, then?
>> If so, why?
>
> The scheduler runs on a single node (the DC) but must take as input the
> resource history (including timestamps) on all nodes. We need wall
> clock time to compare against time-based rules.
Yep, was aware of the troubles with this.
> Also, if we get two resource history entries from a node, we don't
> know if it rebooted in between, so a monotonic timestamp alone
> wouldn't be sufficient.
Ah, that's along the lines of Ulrich's response, I see it now, thanks.
I am not sure whether there could be a way around that using boot IDs
where the platform provides them (like the hashes in the case of
systemd), assuming an equality test is all that's needed, since both
histories cannot arrive at the same time and presumably the FIFO order
gets preserved. It should be under normal circumstances, i.e. no time
travel and sane, non-byzantine processes, and violations might even be
partially detected and acted upon (two differing histories without any
recollection that the node was fenced by the cluster? Fence it for
sure and for good measure!).
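
A minimal sketch of the boot ID equality test, under stated
assumptions: the HistoryEntry record and both comparison helpers are
hypothetical (Pacemaker's history entries carry no boot ID today), and
the only real interface used is the Linux boot_id file under /proc.
Boot IDs are opaque, so only equality is tested; ordering across boots
falls back on the FIFO arrival assumption above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Linux exposes a random per-boot UUID here; other platforms may not.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path: Path = BOOT_ID_PATH) -> Optional[str]:
    """Return the platform boot ID, or None when it is not provided."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


@dataclass(frozen=True)
class HistoryEntry:
    # Hypothetical record, for illustration only.
    boot_id: str
    monotonic_s: float
    operation: str


def rebooted_between(first: HistoryEntry, second: HistoryEntry) -> bool:
    """Equality test only: boot IDs have no meaningful ordering."""
    return first.boot_id != second.boot_id


def second_is_newer(first: HistoryEntry, second: HistoryEntry) -> bool:
    """'first' and 'second' name arrival order, not timestamps."""
    if rebooted_between(first, second):
        # Different boots: monotonic values are not comparable, so
        # trust the assumed FIFO arrival order.
        return True
    # Same boot: the monotonic clock alone decides.
    return second.monotonic_s > first.monotonic_s
```

Within one boot the monotonic timestamp decides; across boots nothing
but arrival order is left, which is exactly where the FIFO assumption
carries the weight.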
> However, it might be possible to store both time representations in the
> history (and possibly maintain some sort of cluster knowledge about
> monotonic clocks to compare them within and across nodes), and use one
> or the other depending on the context. I haven't tried to determine how
> feasible that would be, but it would be a major project.
Just thinking aloud: the DC node, once it has won the election, will
cause the other nodes (incl. newcomers during its tenure) to (re)set
the offset between the DC's and their own internal wall-clock time
(for time-based rules, should they ever become DC themselves). All
nodes will (re)establish their monotonic to wall-clock time conversion
based on these one-off inputs, and from that point on, they can operate
fully detached from the wall-clock time (so that changing the wall
clock will have no immediate effect on the cluster, unless
(re)harmonizing is desired via some configured event handler that
could reschedule the plan appropriately, or delay the sync until a
more suitable moment).
This indeed relies on pacemaker being used in a full-fledged cluster
mode, i.e., requiring quorum (not to mention fencing).
As a corollary, with such a scheme, time-based rules would only be
allowed when quorum is fully honoured (after all, it makes more sense
to employ cron or systemd timer units otherwise, since they all
use local time only, which is the right fit there, as opposed to
a distributed system).
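
The "detached" conversion could be sketched roughly as follows. This
is an illustration only, not anything Pacemaker implements: the
ClusterClock class and the notion of a sync message carrying the DC's
wall-clock time are assumptions, and message latency is ignored.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterClock:
    """Monotonic to wall-clock conversion fixed by a one-off DC input."""

    dc_wall_s: float     # DC wall-clock time from the (hypothetical) sync message
    local_mono_s: float  # local monotonic reading taken when it arrived

    @classmethod
    def sync_from_dc(cls, dc_wall_s: float) -> "ClusterClock":
        # Called once per election (or on explicit re-harmonization).
        return cls(dc_wall_s=dc_wall_s, local_mono_s=time.monotonic())

    def to_cluster_time(self, mono_s: float) -> float:
        """Convert a local monotonic reading to cluster wall-clock time."""
        return self.dc_wall_s + (mono_s - self.local_mono_s)

    def now(self) -> float:
        # Never consults time.time(), so a local wall-clock jump
        # (forward or backward) has no effect on the result.
        return self.to_cluster_time(time.monotonic())
```

Re-harmonizing then amounts to replacing the ClusterClock instance at
a moment of the cluster's choosing rather than whenever the wall clock
happens to change.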
>> We shall not need any explicit time synchronization across the nodes
>> since we are already backed by extended virtual synchrony from
>> corosync, even though it could introduce oddities when
>> time-based rules kick in.
>
> Pacemaker determines the state of a resource by replaying its resource
> history in the CIB. A history entry can be replaced only by a newer
> event. Thus if there's a start event in the history, and a stop result
> comes in, we have to know which one is newer to determine whether the
> resource is started or stopped.
But for that, the boot ID might be a sufficient determinant, since
a result (keyed with boot ID ABC) for an action never triggered under
boot ID XYZ, which is deemed "current" at the moment, means that
something is off, and fencing is perhaps the best choice.
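
As a rough sketch of that check, assuming the cluster records one
"current" boot ID per node when the node joins (the BootTracker class
and its method names are hypothetical, not Pacemaker API):

```python
from enum import Enum
from typing import Dict


class Verdict(Enum):
    ACCEPT = "accept"
    FENCE = "fence"


class BootTracker:
    """Tracks the boot ID the cluster deems current for each node."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}

    def node_joined(self, node: str, boot_id: str) -> None:
        # Assumption: a (re)joining node announces its boot ID.
        self._current[node] = boot_id

    def judge(self, node: str, result_boot_id: str) -> Verdict:
        """Accept a result only if it is keyed with the current boot ID."""
        current = self._current.get(node)
        if current is None or result_boot_id != current:
            # A result from a boot the cluster does not know as current:
            # something is off, so fencing is the conservative answer.
            return Verdict.FENCE
        return Verdict.ACCEPT
```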
> Something along those lines is likely the cause of:
>
> https://bugs.clusterlabs.org/show_bug.cgi?id=5246
I believe the "detached" scheme sketched above could solve this.
The problem is, the devil is in the details, so it can be unpredictably
hard to get it right in a distributed environment.
--
Jan (Poki)