September 2024
There are multiple monotonic clocks available to linux userspace through clock_gettime. Most notably: CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_BOOTTIME. These three clocks all give us the time since system startup in nanoseconds.
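As a quick illustration, all three can be sampled with the same clock_gettime call. The snippet below is just a minimal sketch of that, not taken from any of the programs referenced later in the article:

```c
#define _GNU_SOURCE /* make sure the Linux-specific clock IDs are visible */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec mono, raw, boot;

    /* Each call fills in seconds + nanoseconds since (roughly) system startup. */
    clock_gettime(CLOCK_MONOTONIC, &mono);
    clock_gettime(CLOCK_MONOTONIC_RAW, &raw);
    clock_gettime(CLOCK_BOOTTIME, &boot);

    printf("CLOCK_MONOTONIC:     %lld.%09ld\n", (long long)mono.tv_sec, mono.tv_nsec);
    printf("CLOCK_MONOTONIC_RAW: %lld.%09ld\n", (long long)raw.tv_sec, raw.tv_nsec);
    printf("CLOCK_BOOTTIME:      %lld.%09ld\n", (long long)boot.tv_sec, boot.tv_nsec);
    return 0;
}
```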
A quick refresher on what monotonic means: A clock is monotonic if we always see increasing time values from it. If we sample the clock two times in a row, the first sample will be less than (or maybe equal to, if the clock's resolution is low enough) the second sample. This is opposed to what is referred to as a wall clock or realtime clock, which can jump backwards e.g. because somebody decided it was showing the wrong time and went and adjusted it.
With there being at least three viable options for a monotonic clock in linux, which one should we choose?
We can start by reading the man page.
In a sense, one could argue that if we want to write a truly portable application, reading the man page is all we should be doing: Any behavior not described in the man page can't be relied upon. But often I know which system(s) I'm targeting, so I'm able to check the specifics of how the clocks behave on my system, and we'll do just that further down in the article.
But first: the man page.
The differences between the monotonic clocks, according to the man page, can be summarized as follows:
|                     | Is frequency corrected based on NTP? | Does the clock count time when the system is in suspend? |
|---------------------|--------------------------------------|----------------------------------------------------------|
| CLOCK_MONOTONIC_RAW |                                      |                                                          |
| CLOCK_MONOTONIC     | ✓                                    |                                                          |
| CLOCK_BOOTTIME      | ✓                                    | ✓                                                        |
When programming, there are cases where I know I want the behavior of CLOCK_BOOTTIME: If my program deals with monotonic time in a way which somehow is supposed to be in sync with the world outside the computer on which the program is running, I want a clock which "keeps counting" even when my computer is suspended.
I can imagine cases where CLOCK_MONOTONIC is what I would want: If I care about the amount of time my programs have been able to run, e.g. to implement some timeout, then I don't want that timeout to suddenly expire when my computer resumes from a long suspend. (I suspect that in those cases I'd actually rather want CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID, but those clocks will have to be the topic for another time.)
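To make the timeout example concrete, here is a minimal sketch (my own illustration, nothing more) where the clockid_t argument decides whether time spent in suspend counts towards the timeout:

```c
#include <stdbool.h>
#include <time.h>

/* Returns true once `timeout_sec` seconds have passed since `start`, which
 * must have been sampled with the same clock. With CLOCK_MONOTONIC, time
 * spent in suspend does not count towards the timeout; with CLOCK_BOOTTIME,
 * it does. */
static bool timed_out(clockid_t clk, const struct timespec *start, time_t timeout_sec)
{
    struct timespec now;
    clock_gettime(clk, &now);

    long long elapsed_ns = (long long)(now.tv_sec - start->tv_sec) * 1000000000LL
                         + (now.tv_nsec - start->tv_nsec);
    return elapsed_ns >= (long long)timeout_sec * 1000000000LL;
}
```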
For CLOCK_MONOTONIC_RAW, however, I struggle to imagine a use case. The documentation is too vague for it to be useful in portable applications. This email on the kernel mailing list suggests that it is used by NTP clients, but checking the source code for both chronyd and systemd-timesyncd I was unable to find any reference to it. Perhaps it is useful in cases where you know what platform you are targeting, and you have checked the implementation of CLOCK_MONOTONIC_RAW on that specific platform?
The main question I'm left with after reading the man page is why linux ships all three of these monotonic clocks. Wouldn't CLOCK_BOOTTIME be good enough for most usecases? Are there some important usecases I am missing? Do some of the clocks have implementation details not mentioned on the man page which make them undesirable?
As promised: To find out, we'll have to go beyond the man page.
We can measure two properties of the clock implementations fairly easily: the time it takes to call clock_gettime, and the effective resolution of the values it returns.
Measuring the time per call is easy: Do a bunch of calls and divide the total time taken by the number of calls.
We'll have to take the numbers with a grain of salt, due to out-of-order execution and pipelining on modern CPUs, but this number will at least give us some idea of what cost we are paying by getting a timing from the clock.
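A minimal sketch of that measurement (my own illustration, not the code from measure_gettime.c linked below) could look like this:

```c
#include <time.h>

/* Rough cost of a single clock_gettime call: sample the clock n times in a
 * tight loop and divide the elapsed time by n. */
static double ns_per_call(clockid_t clk, long n)
{
    struct timespec start, end, scratch;

    clock_gettime(clk, &start);
    for (long i = 0; i < n; i++)
        clock_gettime(clk, &scratch);
    clock_gettime(clk, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / n;
}
```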
Measuring the resolution may at first sound like a weird thing to do. Doesn't clock_gettime always return time in nanoseconds?
Yes! But keep in mind that this just sets an upper limit on resolution. We can still imagine a clock having lower resolution. E.g. if all returned values are multiples of 1 000 000, then the clock effectively has a resolution of 1 ms.
We can measure resolution by repeatedly sampling the clock twice in a row, and taking the smallest non-zero difference between times which we see.
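Again a minimal sketch of the idea, not the actual code of measure_gettime.c:

```c
#include <stdint.h>
#include <time.h>

/* Estimate the effective resolution of a clock: sample it in back-to-back
 * pairs and keep the smallest non-zero difference we ever observe. */
static int64_t smallest_step_ns(clockid_t clk, long iterations)
{
    int64_t best = INT64_MAX;

    for (long i = 0; i < iterations; i++) {
        struct timespec a, b;
        clock_gettime(clk, &a);
        clock_gettime(clk, &b);

        int64_t diff_ns = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL
                        + (b.tv_nsec - a.tv_nsec);
        if (diff_ns > 0 && diff_ns < best)
            best = diff_ns;
    }
    return best;
}
```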
I've written a small program to measure this for us: measure_gettime.c. Below are results from my computer and from the embedded device that happened to be closest to my desk at the time of writing.
|                     | Time per call | Resolution / smallest step |
|---------------------|---------------|----------------------------|
| CLOCK_MONOTONIC_RAW | 19.56 ns      | 10 ns                      |
| CLOCK_MONOTONIC     | 19.34 ns      | 9 ns                       |
| CLOCK_BOOTTIME      | 19.34 ns      | 9 ns                       |
| CLOCK_REALTIME      | 19.34 ns      | 9 ns                       |
Results on a Ryzen 5 5600 (x86_64), with kernel version 6.10.
|                     | Time per call | Resolution / smallest step |
|---------------------|---------------|----------------------------|
| CLOCK_MONOTONIC_RAW | 42.62 ns      | 125 ns                     |
| CLOCK_MONOTONIC     | 42.75 ns      | 124 ns                     |
| CLOCK_BOOTTIME      | 43.37 ns      | 124 ns                     |
| CLOCK_REALTIME      | 42.75 ns      | 124 ns                     |
Results on an i.MX8M (aarch64), with kernel version 6.1.
The difference in measured resolution between CLOCK_MONOTONIC_RAW and CLOCK_MONOTONIC is likely because CLOCK_MONOTONIC_RAW runs slower than the kernel assumed it would, making each step of the underlying clock appear larger when compared to the correctly scaled CLOCK_MONOTONIC. With that caveat in mind, all four clocks in the tables above have essentially identical measurements. (Though, as a side note, the program I wrote also measures these numbers for the other clocks. Some of those have significantly different values.)
Of course, if you depend on the time per call or resolution of any of the clocks, you'll have to run the measurements on your specific system.
However, at least with the two systems I tested, it seems like there is no reason to prefer one of the three monotonic clocks over the others based on either cost of calling the function or provided resolution.
CLOCK_BOOTTIME was added with kernel 2.6.39 in 2011, specifically to address the fact that many applications want monotonic time to include the time the system is suspended.
In the discussion leading up to the change, the question was raised whether, instead of adding a new clock, one should simply change the definition of CLOCK_MONOTONIC. One reason given against merging the clocks is that it would mess with the relationship between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW:
No. Lets not change it. CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW's relationship is tightly coupled, and applications that are tracking the amount of clock adjustment being done to the system require they keep their semantics.
This ended with CLOCK_BOOTTIME being added as a new clock.
In 2018, a change which unified CLOCK_MONOTONIC and CLOCK_BOOTTIME was merged and shipped in the release candidates for 4.17, but it was reverted before 4.17 became stable.
This time, a separate argument was cited for why the two clocks had to be kept separate: Various userspace services were relying on CLOCK_MONOTONIC not counting while the system was in suspend, and broke when the behavior was changed.
Linus sums up the commit and revert as follows:
we sadly had to revert the nice CLOCK_{MONOTONIC,BOOTTIME} unification
From this, we can learn two reasons for keeping CLOCK_MONOTONIC and CLOCK_BOOTTIME separate, though both of the reasons to a certain extent have to do with maintaining backwards compatibility:
Preserving the relationship between CLOCK_MONOTONIC and CLOCK_MONOTONIC_RAW.
(This raises the question of why CLOCK_MONOTONIC_RAW can't also count when the system is in suspend. I suspect this is motivated by how the underlying clocks provided by hardware behave.)
Not breaking the userspace services which rely on CLOCK_MONOTONIC not counting time spent in suspend.
(It is unclear whether this wouldn't have been fine if CLOCK_MONOTONIC had been defined to count time spent in suspend from the start.)
I also believe there is a general lesson about API compatibility here: Maintaining API compatibility goes beyond maintaining the same function signatures. Seemingly subtle details in the documentation for a function are also part of its API, and changing those details also constitutes a breaking API change. (Of course, it is likely that the same userspace breakage with the 4.17 release candidates would have occurred even if the documentation for CLOCK_MONOTONIC did not state that it does not count time when the system is suspended. In a sense, depending on how strict one wants to be about backward compatibility, one could argue that the API of a function extends to cover all observable behaviors of the function, whether documented or not...)
So far, it seems like the difference between CLOCK_MONOTONIC and CLOCK_BOOTTIME is mostly historical, and that for new software we likely want to use CLOCK_BOOTTIME. But maybe we can learn something more by looking at how the kernel implements clock_gettime?
Calls to clock_gettime go through the vDSO. In the "happy" case, this means that a call to clock_gettime runs entirely in userspace, avoiding the overhead of a syscall. This works by reading some high-resolution counter value available to userspace, and then scaling/offsetting the value depending on which counter is being queried.
We can illustrate it with the following diagram. Here, all the logic runs in userspace, using the vdso_data structure which is provided by the kernel.
The first thing to notice is that the diagram agrees with the documentation from the man page: CLOCK_MONOTONIC, CLOCK_BOOTTIME and CLOCK_REALTIME all count at the same frequency. They only differ in what offset is added in step (4).
Let's now look at the steps in some more detail.
Step (1) is reading the high-resolution counter. This is platform-dependent, and boils down to some inline assembly. See for example:
x86 and x64: arch/x86/include/asm/vdso/gettimeofday.h#L250-L274
64-bit ARM: arch/arm64/include/asm/vdso/gettimeofday.h#L69-L100
Steps (2), (3) and (4) are done inside do_hres (with the exception of the coarse clocks, which are a topic for another time). There is a lot going on in this function [2], as is typical of kernel code which handles many configurations across many architectures.
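Stripped of all of that configuration handling, the core of the computation is roughly the following. This is a simplified sketch of the idea only; the field names and the exact arithmetic are illustrative and do not match the real vdso_data layout.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Illustrative stand-in for the per-clock data exported through the vDSO. */
struct fake_vdso_clock {
    uint64_t cycle_last;          /* counter value at the last kernel update */
    uint64_t mult, shift;         /* scaling factors: counter cycles -> ns   */
    uint64_t base_sec, base_nsec; /* per-clock offset set up by the kernel   */
};

/* Sketch of the do_hres idea: take the counter delta since the last kernel
 * update, scale it to nanoseconds, and add the per-clock base time. */
static void fake_do_hres(const struct fake_vdso_clock *vd, uint64_t cycles,
                         struct timespec *ts)
{
    uint64_t ns = ((cycles - vd->cycle_last) * vd->mult) >> vd->shift; /* (2)+(3) */
    ns += vd->base_nsec;                                               /* (4)     */

    ts->tv_sec  = (time_t)(vd->base_sec + ns / NSEC_PER_SEC);
    ts->tv_nsec = (long)(ns % NSEC_PER_SEC);
}
```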
Since the basetimes values used in step (4) are what distinguish CLOCK_MONOTONIC and CLOCK_BOOTTIME, the obvious next question is where those values are coming from. To find this out, we have to look beyond what is happening in userspace.
In the kernel, update_vdso_data is what updates the vdso_data seen in the diagram above. In this function, we can see the variables used to set up basetimes (and cycle_last).
tk->tkr_mono.cycle_last provides cycle_last in step (2).
tk->xtime_sec (xtime refers to wall/realtime), tk->wall_to_monotonic and tk->monotonic_to_boot. These are summed up as necessary to form basetimes in step (4).
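Based on my reading of update_vdso_data, the per-clock offsets combine roughly as sketched below; this is my own summary in code form, not the kernel's code (which also tracks sub-nanosecond remainders):

```c
#include <stdint.h>

/* All values in nanoseconds, purely for illustration. */
struct fake_basetimes {
    int64_t realtime, monotonic, boottime;
};

/* How the three offsets relate, as far as I can tell:
 *   realtime  = xtime
 *   monotonic = xtime + wall_to_monotonic
 *   boottime  = xtime + wall_to_monotonic + monotonic_to_boot */
static struct fake_basetimes combine_offsets(int64_t xtime,
                                             int64_t wall_to_monotonic,
                                             int64_t monotonic_to_boot)
{
    struct fake_basetimes b;
    b.realtime  = xtime;
    b.monotonic = xtime + wall_to_monotonic;
    b.boottime  = xtime + wall_to_monotonic + monotonic_to_boot;
    return b;
}
```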
These values are in turn managed by kernel/time/timekeeping.c, where e.g. monotonic_to_boot gets incremented by the amount of time the system spent in suspend [3]. The time spent in suspend is calculated based on a different clock, which is configured to keep counting while the system is in suspend [4]. (The values can of course also be modified by userspace, e.g. when the NTP service adjusts the realtime clock.)
The topic of monotonic timers in kernel has interested me for a while. The functions seem deceptively simple on the surface, but once you start digging you quickly realize there is a lot of hidden detail.
We've learnt that the primary reason for CLOCK_MONOTONIC and CLOCK_BOOTTIME being separate is historical: Userspace came to rely on the specific behavior of CLOCK_MONOTONIC, at which point it became impossible to remove or change it. (Which is not to say that CLOCK_MONOTONIC might not have technical merit in some special cases.)
There are a lot of topics I've not covered in this article. In addition to the clocks discussed here, linux provides a variety of others with different properties. I've also limited my discussion to looking at the various clocks from the perspective of clock_gettime. But the same clock definitions are used by other APIs, such as timerfd_create. If we look under the hood there, will we see the same oddities of behavior? Another interesting question I've not addressed at all: How does it look if you are writing a driver? Do you have access to the same clocks as userspace, or are there other APIs with different constraints available to you?
I hope this article has answered some of your questions about clocks in linux, or at the very least that it has given you a starting point for your own investigations.
[1]
At least as far as I can tell this is true, though I struggle to find any location in either Intel/AMD's x86_64 documentation or in the aarch64 documentation which specifies this. Instead, I've written a small program to help me verify this experimentally: hardware_counter.c. This program shows that on my x86_64 system, rdtsc gets reset to zero after suspend, while on my aarch64 system, CNTVCT is paused while the system is in suspend.
[2]
One interesting thing which jumps out is that there is a big loop around most of the function body. The loop is an alternative to locking, and it's a really neat trick! The whole vdso_data structure can be updated concurrently while we are reading from it in userspace, and if that happens our whole computation might be messed up. But the implementation of do_hres goes ahead and assumes that a concurrent update won't happen. At the start of the loop, it reads out a "generation counter" (vd->seq), and at the end of the loop it reads it again and checks that the value hasn't changed. If the value has changed, the data structure must have changed while we were reading from it, and we retry. (There is one more detail: if the generation counter is odd, it indicates the structure is being written to, so the program spins until the counter becomes even again.)
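The reader side of that pattern looks roughly like this; a simplified sketch only, with the memory barriers and other details of the real kernel code omitted:

```c
#include <stdint.h>

/* Hypothetical data protected by a generation counter (seqcount pattern). */
struct seq_data {
    volatile uint32_t seq; /* odd while a writer is mid-update */
    uint64_t value;
};

/* Read `value` without taking a lock: retry if the generation counter was
 * odd (writer in progress) or changed while we were reading. */
static uint64_t seq_read(const struct seq_data *d)
{
    uint32_t start;
    uint64_t v;

    do {
        do {
            start = d->seq;
        } while (start & 1);      /* writer in progress: spin until even */
        v = d->value;             /* read the protected data             */
    } while (d->seq != start);    /* data changed underneath us: retry   */

    return v;
}
```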
[3]
cycle_last also gets updated in the suspend/resume handlers. This is to account for the fact that the system might run for some amount of time after the suspend handler and before the resume handler, during which time the high-resolution counter (rdtsc) will still be advancing. See here and here.
[4]
The clock used to count the amount of time the system was suspended will presumably have a different resolution from the high-resolution clock sampled for clock_gettime. Additionally, some inaccuracy will be introduced when switching between the two clocks. This should mean that some error is introduced into CLOCK_BOOTTIME whenever the system does a suspend cycle. I haven't yet taken the time to a) look into which clock is used during suspend and b) measure this error.