How long do short sleeps actually take?

There comes a time in the life of almost any C++ programmer when one of the various sleep functions rears its head. Most of the time the problem boils down to some kind of polling algorithm, for example waiting for a resource while letting other processes work in the meantime¹.

Predicting what happens with a sleep of a hundred milliseconds or more is usually fairly simple, even though such a sleep is not very accurate in general. This post concerns itself with the extreme low values, primarily zero and the lowest non-zero value the specific sleep function will accept.

Intuitively, a sleep of zero time means that the currently running thread of execution gives the scheduler the chance to schedule some other thread that may actually have better work to do, such as releasing the resource being waited for. This means that, for a system with low load, this sleep should usually take about the time of a context switch.

When choosing the smallest non-zero time, we can argue that the result should not be much different, but if both versions adhered to expectations, this article would be pretty darn useless…

Setup

The Windows and Linux experiments were conducted on a dual Intel Xeon X5680 system providing a whole bunch of cores. The OS X experiments were conducted on a 2.8 GHz Intel Core i7 "Macbook Pro (Retina, 13-inch, Late 2013)" providing 4 logical cores.

Everything was compiled for x64 and configured to represent a typical release build. The total system CPU load was usually in the range of 2-5%. All experiments were repeated at least 20 times in an interleaved fashion and 99.9% confidence intervals are given for each one. Where not otherwise noted, results are normalized to one execution of the sleep function.

All operating systems were "lived in", without any intentional changes to system clock resolution or similar mechanisms. Hopefully this represents the typical use case better than a virgin system fresh out of the box. Similarly, the systems were not put into a benchmark mode in which as many programs as possible are disabled. For example, they continuously played music and had a browser open, along with the editor in which I was writing this article.

The test program used to give the ground truth is:

cpp

// Ground truth: SLEEP only evaluates its argument and does not sleep.
#define SLEEP(x) static_cast<void>((x))
#include <chrono>
#include <iostream>

int main() {
	unsigned t = 0;
	auto start = ::std::chrono::steady_clock::now();
	for(unsigned i = 0; i < 3000; ++i) {
		SLEEP(1);
		t ^= i; // prevent overeager optimization
	}
	auto stop = ::std::chrono::steady_clock::now();
	auto elapsed = ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
	::std::cout << elapsed.count() << "\n"; // total time for all 3000 iterations
	::std::cerr << t << "\n";
}

The modifications for the individual sleep functions simply added any required headers and replaced the definition of the SLEEP macro with a version that invokes the appropriate sleep function instead. For example, the version relying on the C++11 sleep facilities is:

cpp

#include <thread>
// C++11 variant: SLEEP requests a sleep of x nanoseconds.
#define SLEEP(x) ::std::this_thread::sleep_for(::std::chrono::nanoseconds((x)))
#include <chrono>
#include <iostream>

int main() {
	unsigned t = 0;
	auto start = ::std::chrono::steady_clock::now();
	for(unsigned i = 0; i < 3000; ++i) {
		SLEEP(1);
		t ^= i; // prevent overeager optimization
	}
	auto stop = ::std::chrono::steady_clock::now();
	auto elapsed = ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
	::std::cout << elapsed.count() << "\n"; // total time for all 3000 iterations
	::std::cerr << t << "\n";
}

Be aware that the standard allows ::std::this_thread::sleep_for to block execution longer than requested, but not shorter. The standard also suggests that this function use a steady clock, which is why the benchmark code measures with ::std::chrono::steady_clock rather than a high-resolution clock.

Windows 10

All code for Windows 10 was compiled by Visual Studio 2015, with Visual C++ 19.00.23026.

For this OS, we will use two platform-specific sleep functions in addition to ::std::this_thread::sleep_for: Sleep and SleepEx (with its second parameter set to FALSE). Both functions are documented to behave essentially the same in this test: when given 0, they yield execution without sleeping, and when given 1, they take any time up to one system clock tick.

Since WINAPI functions only take arguments with millisecond resolution, ::std::this_thread::sleep_for will be measured in two variations: once with a nanosecond argument and once with a millisecond argument.

The target system had a system clock resolution of:

text

ClockRes v2.0 - View the system clock resolution
Copyright (C) 2009 Mark Russinovich
SysInternals - www.sysinternals.com

Maximum timer interval: 15.625 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.001 ms

With a ground truth of less than one nanosecond per iteration (980 ± 61 nanoseconds per 3 000), we will first look at the cases where the sleep functions were explicitly asked to perform a zero duration sleep:

::std::this_thread::sleep_for with 0 nanoseconds: 130 ± 1 ns
::std::this_thread::sleep_for with 0 milliseconds: 132 ± 5 ns
Sleep: 64 ± 1 ns
SleepEx: 69 ± 11 ns

As expected, there is a certain cost for yielding execution, clocking in at less than 150 ns per sleep. It should also not come as a big surprise that the C++ standard library function has a higher overhead than the direct WINAPI calls.

Now the results for the minimal non-zero argument:

::std::this_thread::sleep_for with 1 nanosecond: 1 535 253 ± 19 313 ns
::std::this_thread::sleep_for with 1 millisecond: 2 000 969 ± 194 ns
Sleep: 2 000 949 ± 135 ns
SleepEx: 2 000 911 ± 134 ns

All functions targeting a single millisecond yield the same result, hitting 2 milliseconds instead of one.

I was surprised by the result of ::std::this_thread::sleep_for when given a 1 nanosecond argument, as it only takes about ¾ of the time that either native solution requires for its smallest argument. It should be noted, however, that both its relative and absolute error are larger².

Concluding: Out of these alternatives, ::std::this_thread::sleep_for performs best in general, as its interface alleviates much of the pain associated with the older APIs. Still, Sleep/SleepEx offer better performance when only yielding execution.

Linux

The operating system used was an Arch Linux identifying its kernel release as 4.1.6-1-ARCH. All code was compiled using g++ version 5.2.0.

For this operating system, we will discuss three different native methods in addition to ::std::this_thread::sleep_for. The obvious choice is nanosleep³. Additionally, we will use the timeout of pselect⁴ and the timerfd facility. The timerfd functionality was tested in three distinct configurations: recreating the timerfd on every call, reusing one timerfd but letting it fire only once, and finally preparing the timerfd with an interval timer in advance. As all these timer APIs have nanosecond resolution, the chosen inputs will be 0 and 1 nanoseconds. Additionally, sched_yield is evaluated as a 0 ns sleep.

This operating system exhibits a ground truth of less than one nanosecond per iteration (604 ± 44 nanoseconds per 3 000).⁵

For the first set of benchmarks, in which the effect with a zero argument is evaluated, the timerfd family of timers will not be present, as their API makes this usage impossible⁶:

::std::this_thread::sleep_for: 0 ± 1 ns per iteration (612 ± 35 ns per 3000)
nanosleep: 498 577 ± 427 ns
pselect: 136 ± 7 ns
sched_yield: 164 ± 8 ns

Right off the bat: the time ::std::this_thread::sleep_for requires is not statistically distinguishable from the ground truth, and it is definitely not enough for a system call. It would seem that this case is handled completely in user space, thus not actually yielding execution at all.

Interestingly, pselect performs slightly better than sched_yield, which may be due to better optimized code, dumb luck, or because it does not actually yield execution - after all it is not primarily intended to yield execution, but to wait upon an event.⁷

Finally, nanosleep performs significantly worse than sched_yield, probably making it the wrong tool for yielding execution.

Going on, here are the results for a 1 nanosecond sleep:

::std::this_thread::sleep_for: 498 628 ± 263 ns
nanosleep: 498 693 ± 353 ns
pselect: 498 796 ± 398 ns
timerfd recreating: 4 819 ± 182 ns
timerfd reusing: 3 273 ± 255 ns
timerfd interval: 2 783 ± 163 ns

It seems that ::std::this_thread::sleep_for, nanosleep and pselect are provided by the same underlying mechanism, which is outperformed by roughly two orders of magnitude by the timerfd API. It can also be noticed that nanosleep seems to treat a 0 ns sleep the same as a 1 ns sleep, unlike the Windows sleep functions, which explicitly treat a zero argument as a yield only.

There is no real surprise in the relative performance of the timerfd variants themselves: The most general usage case is slowest (although still blazingly fast), with the reuse of the file descriptor saving a lot of work, and the switch to intervals making it faster yet, although it also becomes rather inflexible.

At this point it should be noted that the actual sleeping on the timerfd is done via read, meaning it is not guaranteed to yield execution, especially in the interval case, where the file descriptor may already be ready when read is invoked. Still, for this benchmark, I was able to verify using GNU Time 1.7 that about 3000 context switches do take place during the execution of the timerfd test in interval mode.

Concluding the Linux analysis: To yield execution, it seems safest to use sched_yield, even though it performs slightly worse than the pselect alternative. To perform short sleeps, the use of timerfd timers is far superior to all other variants, as a timerfd with minimal time returns two orders of magnitude quicker than nanosleep with any time.

OS X

The exact OS X version used for this test was 10.10.5, as El Capitan was not yet available at the time of writing. Keep in mind that this test was run on different hardware, which must be taken into account when comparing it to the Linux and Windows tests.

The test suite was fairly similar to the Linux one, but the timerfd suite had to be removed as that particular facility is not available on OS X.

This operating system also exhibits a ground truth of less than one nanosecond per iteration (477 ± 27 ns per 3 000).

Beginning with the zero-duration sleeps:

::std::this_thread::sleep_for: 4 ± 1 ns (10 680 ± 670 ns per 3000)
nanosleep: 1 086 ± 32 ns
pselect: 412 ± 13 ns
sched_yield: 180 ± 35 ns

Again, we see a conspicuously low value for ::std::this_thread::sleep_for, suggesting that OS X does not actually perform a sleep here. Maybe the most surprising result is how well both nanosleep and pselect perform compared to sched_yield.

Now the numbers for a 1 nanosecond sleep:

::std::this_thread::sleep_for: 13 809 ± 186 ns
nanosleep: 14 831 ± 234 ns
pselect: 416 ± 12 ns

For this test, the methods used leave nearly everything available on the other platforms far in the dust. In fact, only Linux's timerfd facilities are competitive, and even they are beaten by the OS X pselect by almost an order of magnitude. Additionally, unlike on Linux, great performance is available for all tested methods, including nanosleep, which is after all the obvious choice in C style code, and ::std::this_thread::sleep_for, which is the obvious choice for C++ style code.

Summing up the OS X results, it is obvious that this operating system has all others beat when it comes to short sleeps. While nanosleep performs somewhat worse than pselect, its purpose is more obvious and it can easily be used to continue sleeping in the presence of interrupts.

Conclusion

Interestingly, the results were mixed for Windows and Linux: Windows 10 seems to bring primitives to the table that perform very well when only yielding execution, but lack resolution when actually sleeping. Linux, on the other hand, provides the timerfd API, which allows extremely short sleeps when a sleep is actually requested. However, the winner of this article is clearly OS X, handily beating both alternatives in every single category.

The test program, all results and the script used to analyze them can be downloaded here.