Daniel Schemmel
There comes a time in the life of almost any C++ programmer when one of the various sleep functions rears its head. Most of the time the problem boils down to some kind of polling algorithm, for example waiting for a resource and wanting to let other processes work in the meantime [1].
While sleeping is not very accurate in general, predicting what happens with a sleep that takes a hundred milliseconds or more is usually fairly simple. This post will concern itself with the extremely low values, primarily zero and the lowest non-zero value the specific sleep function will accept.
Intuitively, a sleep of zero time means that the currently running thread of execution gives the scheduler the chance to schedule some other thread that may actually have better work to do, such as releasing the resource it is waiting for. This means that, for a system with low load, this sleep should usually take about the time of a context switch.
When choosing the smallest non-zero time, we can argue that the result should not be much different, but if both versions adhered to expectations, this article would be pretty darn useless…
The Windows and Linux experiments were conducted on a dual Intel Xeon X5680 system providing a whole bunch of cores. The OS X experiments were conducted on a 2.8 GHz Intel Core i7 "Macbook Pro (Retina, 13-inch, Late 2013)" providing 4 logical cores.
Everything was compiled for x64 and configured to represent a typical release build. The total system CPU load was usually in the range of 2-5%. All experiments were repeated at least 20 times in an interleaved fashion and 99.9% confidence intervals are given for each one. Where not otherwise noted, results are normalized to one execution of the sleep function.
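The way such a "mean ± interval" value can be derived from the repeated runs is sketched below. This is an illustration under assumptions, not the analysis script that was actually used: the names `Summary` and `summarize` are made up, and the quantile 3.2905 is the normal approximation for a two-sided 99.9% interval (a Student's t quantile would give a wider interval for only 20 repetitions).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and half-width of the confidence interval, both per single sleep call.
struct Summary {
    double mean;        // average time of one sleep call, in ns
    double half_width;  // the "±" part of the reported value, in ns
};

// `totals_ns` holds one total runtime per repetition of the test program.
// `iterations` is the loop count inside the test program (3000 in this post).
// `z` is the quantile; 3.2905 is an assumed normal approximation for 99.9%.
Summary summarize(const ::std::vector<double>& totals_ns,
                  double iterations = 3000.0, double z = 3.2905) {
    const ::std::size_t n = totals_ns.size();
    assert(n >= 2);  // a sample standard deviation needs at least two runs

    double sum = 0.0;
    for (double t : totals_ns) sum += t;
    const double mean = sum / n;

    double squares = 0.0;
    for (double t : totals_ns) squares += (t - mean) * (t - mean);
    const double stddev = ::std::sqrt(squares / (n - 1));  // sample std. dev.
    const double half = z * stddev / ::std::sqrt(static_cast<double>(n));

    // Normalize to one execution of the sleep function.
    return {mean / iterations, half / iterations};
}
```

Calling `summarize` with the totals of all repetitions of one variant yields the two numbers that are reported as "mean ± half_width ns".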
All operating systems were "lived in", without any intentional changes to system clock resolution or similar mechanisms. Hopefully this represents the typical use case better than a virgin system fresh out of the box. Similarly, the systems were not put into a benchmark mode where as many programs as possible are disabled. For example, they continuously played music and had a browser open alongside the editor in which I was writing this article.
The test program used to give the ground truth is:
#define SLEEP(x) static_cast<void>((x))
#include <chrono>
#include <iostream>
int main() {
    unsigned t = 0;
    auto start = ::std::chrono::steady_clock::now();
    for(unsigned i = 0; i < 3000; ++i) {
        SLEEP(1);
        t ^= i; // prevent overeager optimization
    }
    auto stop = ::std::chrono::steady_clock::now();
    auto elapsed = ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
    ::std::cout << elapsed.count() << "\n";
    ::std::cerr << t << "\n";
}
The modifications for the individual sleep functions simply added any required headers and replaced the definition of the SLEEP macro with a version that invokes the appropriate sleep function instead. For example, the version relying on the C++11 sleep facilities is:
#include <thread>
#define SLEEP(x) ::std::this_thread::sleep_for(::std::chrono::nanoseconds((x)))
#include <chrono>
#include <iostream>
int main() {
    unsigned t = 0;
    auto start = ::std::chrono::steady_clock::now();
    for(unsigned i = 0; i < 3000; ++i) {
        SLEEP(1);
        t ^= i; // prevent overeager optimization
    }
    auto stop = ::std::chrono::steady_clock::now();
    auto elapsed = ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
    ::std::cout << elapsed.count() << "\n";
    ::std::cerr << t << "\n";
}
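Instead of swapping a macro, the same measurement can also be expressed as a function template that receives the sleep call as a callable. The following is only a sketch of that alternative; `time_per_call`, `sleep_for_1ns` and `sleep_for_1ms` are hypothetical names and not part of the test program shown above.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Runs `sleep_call` `iterations` times and returns the average duration of
// one call in nanoseconds, measured with a steady (monotonic) clock.
template <typename F>
double time_per_call(F&& sleep_call, unsigned iterations = 3000) {
    assert(iterations > 0);
    const auto start = ::std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i) {
        sleep_call();
    }
    const auto stop = ::std::chrono::steady_clock::now();
    const auto elapsed =
        ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
    return static_cast<double>(elapsed.count()) / iterations;
}

// Example candidates, mirroring SLEEP(1) with two different argument types.
inline void sleep_for_1ns() {
    ::std::this_thread::sleep_for(::std::chrono::nanoseconds(1));
}
inline void sleep_for_1ms() {
    ::std::this_thread::sleep_for(::std::chrono::milliseconds(1));
}
```

A call such as `time_per_call(sleep_for_1ns)` then gives one per-iteration sample. Unlike the macro version, the callable may add a small call overhead if the compiler does not inline it.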
Be aware that the standard mandates that ::std::this_thread::sleep_for
All code for Windows 10 was compiled by Visual Studio 2015, with Visual C++ 19.00.23026.
For this OS, we will use two platform-specific sleep functions in addition to ::std::this_thread::sleep_for: Sleep and SleepEx (with its second argument set to FALSE). Both functions are described as behaving basically the same in this test: when given 0, they will yield execution without sleeping, and when given 1, they will take any time up to one system clock tick.
Since WINAPI functions only take arguments with millisecond resolution, ::std::this_thread::sleep_for will be performed in two variations: once with a nanosecond argument and once with a millisecond argument.
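Because Sleep and SleepEx take a millisecond count, any finer-grained duration has to be converted before it can be handed to them. One possible conversion is sketched below; it rounds up so that a small non-zero duration does not silently turn into 0, which these functions treat as a pure yield. The helper `to_sleep_millis` is hypothetical, and this is not a claim about what any standard library implementation does internally.

```cpp
#include <cassert>
#include <chrono>

// Converts a nanosecond duration to a millisecond count, rounding up.
// Hypothetical helper: on Windows the result could be passed to Sleep().
inline unsigned long to_sleep_millis(::std::chrono::nanoseconds d) {
    assert(d.count() >= 0);  // negative sleeps make no sense here
    const auto ms =
        ::std::chrono::duration_cast<::std::chrono::milliseconds>(d);
    // duration_cast truncates toward zero, so add one tick if anything was lost.
    const bool truncated = ::std::chrono::nanoseconds(ms) < d;
    return static_cast<unsigned long>(ms.count()) + (truncated ? 1UL : 0UL);
}
```

With this rounding, a request of 1 ns maps to 1 ms and only an exact zero maps to 0.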
The target system had a system clock resolution of:
ClockRes v2.0 - View the system clock resolution
Copyright (C) 2009 Mark Russinovich
SysInternals - www.sysinternals.com
Maximum timer interval: 15.625 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.001 ms
With a ground truth of less than one nanosecond per iteration (980 ± 61 nanoseconds per 3 000), we will first look at the cases where the sleep functions were explicitly asked to perform a zero duration sleep:
::std::this_thread::sleep_for with 0 nanoseconds: 130 ± 1 ns
::std::this_thread::sleep_for with 0 milliseconds: 132 ± 5 ns
Sleep: 64 ± 1 ns
SleepEx: 69 ± 11 ns
As expected, there is a certain cost for yielding execution, clocking in at less than 150 ns per sleep. It should also not come as a big surprise that the C++ standard library function has a higher overhead than the direct WINAPI calls.
Now the results for the minimal non-zero argument:
::std::this_thread::sleep_for with 1 nanosecond: 1 535 253 ± 19 313 ns
::std::this_thread::sleep_for with 1 millisecond: 2 000 969 ± 194 ns
Sleep: 2 000 949 ± 135 ns
SleepEx: 2 000 911 ± 134 ns
All functions targeting a single millisecond yield the same result, hitting 2 milliseconds instead of one.
I was surprised by the result of ::std::this_thread::sleep_for when given a 1 nanosecond argument, as it only takes ¾ of the time that either native solution requires for its smallest argument. It should be noted, however, that both its relative and its absolute error are larger [2].
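To make the comparison of errors concrete, the following sketch computes the absolute and the relative error from a requested and a measured sleep time. The names `SleepError`, `sleep_error` and `orders_of_magnitude` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

struct SleepError {
    double absolute_ns;  // measured time minus requested time
    double relative;     // absolute error divided by the requested time
};

inline SleepError sleep_error(double requested_ns, double measured_ns) {
    assert(requested_ns > 0.0);  // the relative error is undefined for zero
    const double absolute = measured_ns - requested_ns;
    return {absolute, absolute / requested_ns};
}

// Distance between two relative errors in orders of magnitude (base 10).
inline double orders_of_magnitude(const SleepError& a, const SleepError& b) {
    return ::std::log10(a.relative / b.relative);
}
```

With the measurements listed above, `sleep_error(1, 1535253)` has an absolute error of about 1.5 ms and `sleep_error(1000000, 2000969)` one of about 1.0 ms, while the relative errors are roughly 1.5 million versus 1, which is about six orders of magnitude apart.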
Concluding: Out of these alternatives, ::std::this_thread::sleep_for performs best in general, as its interface alleviates much of the pain associated with the older APIs. Still, Sleep/SleepEx offer better performance when only yielding execution.
The operating system used was Arch Linux (4.1.6-1-ARCH). All code was compiled using g++ version 5.2.0.
For this operating system, we will discuss three different native methods in addition to ::std::this_thread::sleep_for: nanosleep, pselect [3] and sched_yield.
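Each of these three native calls can be wrapped so that it fits the SLEEP macro of the test program. The following is a minimal sketch assuming POSIX headers; the wrapper names are made up for illustration, and the actual benchmark sources may look different.

```cpp
#include <cassert>
#include <sched.h>
#include <sys/select.h>
#include <time.h>

// Builds a timespec from a nanosecond count.
inline timespec make_timespec(long nanoseconds) {
    assert(nanoseconds >= 0);
    timespec ts{};
    ts.tv_sec = nanoseconds / 1000000000L;
    ts.tv_nsec = nanoseconds % 1000000000L;
    return ts;
}

// Each wrapper returns the result of the underlying call (0 on success).
inline int sleep_nanosleep(long nanoseconds) {
    const timespec ts = make_timespec(nanoseconds);
    return nanosleep(&ts, nullptr);
}

inline int sleep_pselect(long nanoseconds) {
    const timespec ts = make_timespec(nanoseconds);
    // No file descriptors are watched, so only the timeout matters.
    return pselect(0, nullptr, nullptr, nullptr, &ts, nullptr);
}

inline int yield_sched() {
    return sched_yield();  // takes no duration at all
}
```

A macro such as `#define SLEEP(x) sleep_nanosleep((x))` would then plug one of the wrappers into the test loop.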
This operating system exhibits a ground truth of less than one nanosecond per iteration (604 ± 44 nanoseconds per 3 000) [5].
For the first set of benchmarks, in which the effect with a zero argument is evaluated, the timerfd family of timers will not be present, as their API makes this usage impossible [6]:
::std::this_thread::sleep_for: 0 ± 1 ns per iteration (612 ± 35 ns per 3 000)
nanosleep: 498 577 ± 427 ns
pselect: 136 ± 7 ns
sched_yield: 164 ± 8 ns
Right off the bat: ::std::this_thread::sleep_for requires no statistically significant amount of time beyond the ground truth, and definitely not enough for a system call. It would seem that this case is handled completely in user space, thus not actually yielding execution at all.
Interestingly, pselect performs slightly better than sched_yield, which may be due to better optimized code, dumb luck, or because it does not actually yield execution. After all, it is not primarily intended to yield execution, but to wait upon an event [7].
Finally, nanosleep performs significantly worse than sched_yield, probably making it the wrong tool for yielding execution.
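A polling loop that yields between checks might look like the following sketch. The helper `yield_until` and its `max_yields` cap are hypothetical; the cap is only there so that the loop cannot spin forever if the condition never becomes true.

```cpp
#include <cassert>
#include <sched.h>

// Polls `ready` and yields the CPU between checks.
// Returns the number of yields, or -1 if `max_yields` was used up first.
template <typename Predicate>
long yield_until(Predicate ready, long max_yields) {
    assert(max_yields >= 0);
    long yields = 0;
    while (!ready()) {
        if (yields == max_yields) return -1;
        sched_yield();  // give other runnable threads a chance
        ++yields;
    }
    return yields;
}
```

In real code `ready` would typically check an atomic flag or try to acquire the resource that another thread is expected to release.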
Going on, here are the results for a 1 nanosecond sleep:
::std::this_thread::sleep_for: 498 628 ± 263 ns
nanosleep: 498 693 ± 353 ns
pselect: 498 796 ± 398 ns
timerfd recreating: 4 819 ± 182 ns
timerfd reusing: 3 273 ± 255 ns
timerfd interval: 2 783 ± 163 ns
It seems that ::std::this_thread::sleep_for, nanosleep and pselect are provided by the same underlying mechanism, which is outperformed by about two orders of magnitude by the timerfd API. It can also be noticed that nanosleep seems to treat a 0 ns sleep the same as a 1 ns sleep, unlike the Windows sleep functions, which explicitly treat a zero argument as a yield only.
There is no real surprise in the relative performance of the timerfd variants themselves: The most general use case is slowest (although still blazingly fast), with the reuse of the file descriptor saving a lot of work, and the switch to intervals making it faster yet, although it also becomes rather inflexible.
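The "recreating" and "reusing" variants could be implemented along the following lines. This is a sketch under assumptions: the clock (CLOCK_MONOTONIC), the helper names and the error handling are chosen for illustration and are not taken from the benchmark sources.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

// "Reusing" variant: arms an existing timerfd once and blocks until it fires.
// Returns the number of expirations reported by read(), or 0 on error.
inline ::std::uint64_t timerfd_sleep(int fd, long nanoseconds) {
    // A zero it_value would disarm the timer, so read() would block forever.
    assert(nanoseconds > 0);
    itimerspec spec{};  // it_interval stays zero: one-shot timer
    spec.it_value.tv_sec = nanoseconds / 1000000000L;
    spec.it_value.tv_nsec = nanoseconds % 1000000000L;
    if (timerfd_settime(fd, 0, &spec, nullptr) != 0) return 0;

    ::std::uint64_t expirations = 0;
    const ssize_t got = read(fd, &expirations, sizeof expirations);
    if (got != static_cast<ssize_t>(sizeof expirations)) return 0;
    return expirations;
}

// "Recreating" variant: a fresh file descriptor for every single sleep.
inline ::std::uint64_t timerfd_sleep_recreating(long nanoseconds) {
    const int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0) return 0;
    const ::std::uint64_t expirations = timerfd_sleep(fd, nanoseconds);
    close(fd);
    return expirations;
}
```

The reusing variant saves the timerfd_create and close calls per sleep, which matches the ordering of the two measurements.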
At this point it should be noted that the actual sleeping on the timerfd is done via read, which does not block if the timer has already expired by the time read is invoked. Still, for this benchmark, I was able to verify, using GNU Time 1.7, that about 3000 context switches do take place during the execution of the timerfd interval variant.
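For the interval variant, the timer is armed only once and every read waits for the next tick. A possible shape of this, again as a sketch with hypothetical helper names and an assumed CLOCK_MONOTONIC clock:

```cpp
#include <cassert>
#include <cstdint>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

// Creates a periodic timerfd; returns the descriptor or -1 on error.
inline int make_interval_timer(long period_ns) {
    assert(period_ns > 0 && period_ns < 1000000000L);
    const int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd < 0) return -1;
    itimerspec spec{};
    spec.it_value.tv_nsec = period_ns;     // first expiration
    spec.it_interval.tv_nsec = period_ns;  // every following expiration
    if (timerfd_settime(fd, 0, &spec, nullptr) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Waits for the next tick. Blocks only if no expiration is pending; otherwise
// read() returns at once with the number of expirations since the last read.
inline ::std::uint64_t wait_for_tick(int fd) {
    ::std::uint64_t expirations = 0;
    const ssize_t got = read(fd, &expirations, sizeof expirations);
    if (got != static_cast<ssize_t>(sizeof expirations)) return 0;
    return expirations;
}
```

A returned count above 1 means that ticks were missed between two reads, which is one way to notice that a read did not actually have to wait.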
Concluding the Linux analysis: To yield execution, it seems safest to use sched_yield, even though it performs slightly worse than the pselect alternative. To perform short sleeps, the use of timerfd timers is far superior to all other variants, as a timerfd with minimal time returns two orders of magnitude quicker than nanosleep with any time.
The exact OS X version used for this test was 10.10.5, as El Capitan was not yet available at the time of writing. Be reminded that this test was run on different hardware which must be taken into account when comparing it to the Linux and Windows tests.
The test suite was fairly similar to the Linux one, but the timerfd suite had to be removed as that particular facility is not available on OS X.
This operating system also exhibits a ground truth of less than one nanosecond per iteration (477 ± 27 ns per 3 000).
Beginning with the zero-duration sleeps:
::std::this_thread::sleep_for: 4 ± 1 ns (10 680 ± 670 ns per 3 000)
nanosleep: 1 086 ± 32 ns
pselect: 412 ± 13 ns
sched_yield: 180 ± 35 ns
Again, we see a conspicuously low value for ::std::this_thread::sleep_for, suggesting that OS X does not actually perform a sleep here. Maybe the most surprising result is how well both nanosleep and pselect perform compared to sched_yield.
Now the numbers for a 1 nanosecond sleep:
::std::this_thread::sleep_for: 13 809 ± 186 ns
nanosleep: 14 831 ± 234 ns
pselect: 416 ± 12 ns
For this test, all methods used leave those available on other platforms far in the dust. In fact, only Linux's timerfd facilities manage to come close, and they are still beaten by the OS X pselect by almost an order of magnitude. Additionally, unlike on Linux, great performance is available for all tested methods, including nanosleep, which is after all the obvious choice in C style code, and ::std::this_thread::sleep_for, which is the obvious choice for C++ style code.
Summing up the OS X results, it is obvious that this operating system has all others beat when it comes to short sleeps. While nanosleep performs somewhat worse than pselect, its purpose is more obvious and it can easily be used to continue sleeping in the presence of interrupts.
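Continuing a sleep after an interruption relies on the second argument of nanosleep, which receives the remaining time when the call is interrupted by a signal. A minimal sketch, with `sleep_fully` being a hypothetical helper name:

```cpp
#include <cassert>
#include <cerrno>
#include <time.h>

// Sleeps for the full requested time, resuming after signal interruptions.
// Returns 0 on success, -1 on any error other than EINTR (errno is kept).
inline int sleep_fully(long nanoseconds) {
    assert(nanoseconds >= 0);
    timespec request{};
    request.tv_sec = nanoseconds / 1000000000L;
    request.tv_nsec = nanoseconds % 1000000000L;
    timespec remaining{};
    while (nanosleep(&request, &remaining) == -1) {
        if (errno != EINTR) return -1;
        request = remaining;  // continue with what is left
    }
    return 0;
}
```

Restarting with the remaining time can let the total sleep drift slightly if many signals arrive, which is usually acceptable for the short sleeps discussed here.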
Interestingly, the results were mixed for Windows and Linux: Windows 10 seems to bring primitives to the table that perform very well when only yielding execution, but lack resolution when actually sleeping. Linux, on the other hand, provides the timerfd API, which allows extremely short sleeps when a sleep is actually requested. However, the winner of this article is clearly OS X, handily beating both alternatives in every single category.
The test program, all results and the script used to analyze them can be downloaded here.
[1] In most cases a blocking wait should be preferred. Would life not be great if we had the pleasure of always being easily able to do things the right way?
[2] The absolute error of the 1 millisecond sleeps is about 1.0 ms, while the 1 nanosecond sleep is off by about 1.5 ms. The relative error differs by roughly 6 orders of magnitude.
[3] If you are wondering why the heck I am analyzing pselect of all possible functions sporting a timeout, I stumbled over an answer on stackoverflow
[5] Interestingly this is only about ⅔ of the ground truth for Windows 10, possibly due to more aggressive optimization by g++ versus Visual C++.
[6] When setting the time to zero, it disables the timerfd completely, meaning that waiting on it will take forever.
[7] I had to run these specific benchmarks significantly more often than the rest to get the confidence intervals small enough to not overlap.