x86: Add --enable-rdtscp-in-benchtests
Commit Message
RDTSCP waits until all previous instructions have executed and all
previous loads are globally visible before reading the counter. RDTSC
doesn't wait until all previous instructions have been executed before
reading the counter. This patch adds --enable-rdtscp-in-benchtests to
use RDTSCP in benchtests.
NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP
instruction. All x86 processors since 2010 support RDTSCP instruction.
* INSTALL: Regenerated.
* configure: Likewise.
* NEWS: Mention --enable-rdtscp-in-benchtests.
* configure.ac: Add --enable-rdtscp-in-benchtests.
* benchtests/Makefile (CPPFLAGS-nonlib): Add -DENABLE_RDTSCP
for --enable-rdtscp-in-benchtests.
* manual/install.texi: Document --enable-rdtscp-in-benchtests.
* sysdeps/x86/hp-timing.h (HP_TIMING_NOW): Use RDTSCP if
ENABLE_RDTSCP is defined.
---
INSTALL | 10 ++++++++++
NEWS | 7 +++++++
benchtests/Makefile | 4 ++++
configure | 13 +++++++++++++
configure.ac | 7 +++++++
manual/install.texi | 9 +++++++++
sysdeps/x86/hp-timing.h | 15 ++++++++++++++-
7 files changed, 64 insertions(+), 1 deletion(-)
Comments
On Mon, 22 Oct 2018, H.J. Lu wrote:
> RDTSCP waits until all previous instructions have executed and all
> previous loads are globally visible before reading the counter. RDTSC
> doesn't wait until all previous instructions have been executed before
> reading the counter. This patch adds --enable-rdtscp-in-benchtests to
> use RDTSCP in benchtests.
>
> NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP
> instruction. All x86 processors since 2010 support RDTSCP instruction.
Without implying an objection to the patch, I'd like to point out that the
Linux kernel always uses "lfence; rdtsc" on Intel CPUs to obtain ordered
timestamps with lowest possible overhead. LFENCE is available on all x86-64
processors as part of SSE2.
On AMD CPUs the kernel also uses "lfence; rdtsc", except if it cannot setup
a specific MSR to make LFENCE serializing; in that case it falls back to
"mfence; rdtsc".
"lfence; rdtsc" sequence is also recommended by Intel SDM documentation.
Is there a specific reason that "rdtscp" is preferable for this patch?
Thanks.
Alexander
* H. J. Lu:
> RDTSCP waits until all previous instructions have executed and all
> previous loads are globally visible before reading the counter. RDTSC
> doesn't wait until all previous instructions have been executed before
> reading the counter. This patch adds --enable-rdtscp-in-benchtests to
> use RDTSCP in benchtests.
>
> NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP
> instruction. All x86 processors since 2010 support RDTSCP instruction.
Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
case the TSC is not synchronized across cores?
Thanks,
Florian
On 23/10/18 2:34 PM, Florian Weimer wrote:
> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
> case the TSC is not synchronized across cores?
There's an option USE_CLOCK_GETTIME to make benchtests do that, but
otherwise it uses the hp_timing bits by default.
Siddhesh
On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
> On 23/10/18 2:34 PM, Florian Weimer wrote:
>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
>> case the TSC is not synchronized across cores?
>
> There's an option USE_CLOCK_GETTIME to make benchtests do that, but
> otherwise it uses the hp_timing bits by default.
I want something better that rdtsc and very low overhead since some bench
tests only last a few cycles. Adding lfence may make timing data look like
noise.
On 23/10/18 4:28 PM, H.J. Lu wrote:
> On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
>> On 23/10/18 2:34 PM, Florian Weimer wrote:
>>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
>>> case the TSC is not synchronized across cores?
>>
>> There's an option USE_CLOCK_GETTIME to make benchtests do that, but
>> otherwise it uses the hp_timing bits by default.
>
> I want something better that rdtsc and very low overhead since some bench
> tests only last a few cycles. Adding lfence may make timing data look like
> noise.
Is the configure option necessary? You could just add a Makefile option
in benchtests similar to USE_CLOCK_GETTIME and then document it in
benchtests/README.
Siddhesh
On 23/10/18 11:58, H.J. Lu wrote:
> On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
>> On 23/10/18 2:34 PM, Florian Weimer wrote:
>>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
>>> case the TSC is not synchronized across cores?
>>
>> There's an option USE_CLOCK_GETTIME to make benchtests do that, but
>> otherwise it uses the hp_timing bits by default.
>
> I want something better that rdtsc and very low overhead since some bench
> tests only last a few cycles. Adding lfence may make timing data look like
> noise.
>
ideally bench test should be fixed so clock_gettime gives
stable enough results.
target specific timers are not always available and their
results are hard to interpret compared to a standard api
that returns wall clock time in sane units.
On 23/10/18 5:24 PM, Szabolcs Nagy wrote:
> target specific timers are not always available and their
> results are hard to interpret compared to a standard api
> that returns wall clock time in sane units.
Right, that is specifically why the USE_CLOCK_GETTIME option exists.
It's not the default though because comparison between architectures is
not the dominant use case.
Siddhesh
@@ -153,6 +153,16 @@ if 'CFLAGS' is specified it must enable optimization. For example:
library. This option hardcodes the newly built C library path in
dynamic tests so that they can be invoked directly.
+'--enable-rdtscp-in-benchtests'
+ This x86 only option enables RDTSCP instruction in benchtests.
+ When the GNU C Library is built with '--enable-rdtscp-in-tests',
+ benchtests will use RDTSCP instruction, instead of RDTSC
+ instruction, for high precision, low overhead timing, which
+ provides more precise timing data. The resulting library is
+ unchanged by this option. Note that benchtests in RDTSCP-enabled
+ the GNU C Library require CPUs capable of RDTSCP instruction. All
+ x86 processors since 2010 support RDTSCP instruction.
+
'--disable-timezone-tools'
By default, timezone related utilities ('zic', 'zdump', and
'tzselect') are installed with the GNU C Library. If you are
@@ -9,6 +9,13 @@ Version 2.29
Major new features:
+* The GNU C Library can now be built with --enable-rdtscp-in-benchtests
+ to use RDTSCP instruction in benchtests for high precision, low
+ overhead timing on x86 CPUs, which provides more precise timing data.
+ The resulting library is unchanged. Benchtests in RDTSCP-enabled glibc
+ require CPUs capable of RDTSCP instruction. All x86 processors since
+ 2010 support RDTSCP instruction.
+
* A new convenience target has been added for distribution maintainers
to build and install all locales as directories with files. The new
target is run by issuing the following command in your build tree:
@@ -131,6 +131,10 @@ CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION) -D_ISOMAC
# HP_TIMING if it is available.
ifdef USE_CLOCK_GETTIME
CPPFLAGS-nonlib += -DUSE_CLOCK_GETTIME
+else
+ifeq (${enable-rdtscp-in-benchtests},yes)
+CPPFLAGS-nonlib += -DENABLE_RDTSCP
+endif
endif
DETAILED_OPT :=
@@ -794,6 +794,7 @@ enable_pt_chown
enable_tunables
enable_mathvec
enable_cet
+enable_rdtscp_in_benchtests
with_cpu
'
ac_precious_vars='build_alias
@@ -1471,6 +1472,8 @@ Optional Features:
depends on architecture]
--enable-cet enable Intel Control-flow Enforcement Technology
(CET), x86 only
+ --enable-rdtscp-in-benchtests
+ enable RDTSCP in benchtests, x86 only [default=no]
Optional Packages:
--with-PACKAGE[=ARG] use PACKAGE [ARG=yes]
@@ -3785,6 +3788,16 @@ else
fi
+# Check whether --enable-rdtscp-in-benchtests was given.
+if test "${enable_rdtscp_in_benchtests+set}" = set; then :
+ enableval=$enable_rdtscp_in_benchtests; rdtscp_in_benchtests=$enableval
+else
+ rdtscp_in_benchtests=no
+fi
+
+config_vars="$config_vars
+enable-rdtscp-in-benchtests = $rdtscp_in_benchtests"
+
# We keep the original values in `$config_*' and never modify them, so we
# can write them unchanged into config.make. Everything else uses
# $machine, $vendor, and $os, and changes them whenever convenient.
@@ -478,6 +478,13 @@ AC_ARG_ENABLE([cet],
[enable_cet=$enableval],
[enable_cet=no])
+AC_ARG_ENABLE([rdtscp-in-benchtests],
+ AC_HELP_STRING([--enable-rdtscp-in-benchtests],
+ [enable RDTSCP in benchtests, x86 only @<:@default=no@:>@]),
+ [rdtscp_in_benchtests=$enableval],
+ [rdtscp_in_benchtests=no])
+LIBC_CONFIG_VAR([enable-rdtscp-in-benchtests], [$rdtscp_in_benchtests])
+
# We keep the original values in `$config_*' and never modify them, so we
# can write them unchanged into config.make. Everything else uses
# $machine, $vendor, and $os, and changes them whenever convenient.
@@ -182,6 +182,15 @@ By default, dynamic tests are linked to run with the installed C library.
This option hardcodes the newly built C library path in dynamic tests
so that they can be invoked directly.
+@item --enable-rdtscp-in-benchtests
+This x86 only option enables RDTSCP instruction in benchtests. When
+@theglibc{} is built with @option{--enable-rdtscp-in-tests}, benchtests
+will use RDTSCP instruction, instead of RDTSC instruction, for high
+precision, low overhead timing, which provides more precise timing data.
+The resulting library is unchanged by this option. Note that benchtests
+in RDTSCP-enabled @theglibc{} require CPUs capable of RDTSCP instruction.
+All x86 processors since 2010 support RDTSCP instruction.
+
@item --disable-timezone-tools
By default, timezone related utilities (@command{zic}, @command{zdump},
and @command{tzselect}) are installed with @theglibc{}. If you are building
@@ -40,7 +40,20 @@ typedef unsigned long long int hp_timing_t;
NB: Use __builtin_ia32_rdtsc directly since including <x86intrin.h>
makes building glibc very slow. */
-# define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ())
+# ifdef ENABLE_RDTSCP
+/* RDTSCP waits until all previous instructions have executed and all
+ previous loads are globally visible before reading the counter.
+ RDTSC doesn't wait until all previous instructions have been executed
+ before reading the counter. When --enable-rdtscp-in-tests is used,
+ use RDTSCP in benchtests. */
+# define HP_TIMING_NOW(Var) \
+ (__extension__ ({ \
+ unsigned int __aux; \
+ (Var) = __builtin_ia32_rdtscp (&__aux); \
+ }))
+# else
+# define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ())
+# endif
# include <hp-timing-common.h>
#else