x86: Add --enable-rdtscp-in-benchtests

Message ID 20181022223711.26910-1-hjl.tools@gmail.com
State New, archived
Headers

Commit Message

H.J. Lu Oct. 22, 2018, 10:37 p.m. UTC
  RDTSCP waits until all previous instructions have executed and all
previous loads are globally visible before reading the counter.  RDTSC
doesn't wait until all previous instructions have been executed before
reading the counter.  This patch adds --enable-rdtscp-in-benchtests to
use RDTSCP in benchtests.

NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP
instruction.  All x86 processors since 2010 support RDTSCP instruction.

	* INSTALL: Regenerated.
	* configure: Likewise.
	* NEWS: Mention --enable-rdtscp-in-benchtests.
	* configure.ac: Add --enable-rdtscp-in-benchtests.
	* benchtests/Makefile (CPPFLAGS-nonlib): Add -DENABLE_RDTSCP
	for --enable-rdtscp-in-benchtests.
	* manual/install.texi: Document --enable-rdtscp-in-benchtests.
	* sysdeps/x86/hp-timing.h (HP_TIMING_NOW): Use RDTSCP if
	ENABLE_RDTSCP is defined.
---
 INSTALL                 | 10 ++++++++++
 NEWS                    |  7 +++++++
 benchtests/Makefile     |  4 ++++
 configure               | 13 +++++++++++++
 configure.ac            |  7 +++++++
 manual/install.texi     |  9 +++++++++
 sysdeps/x86/hp-timing.h | 15 ++++++++++++++-
 7 files changed, 64 insertions(+), 1 deletion(-)
  

Comments

Alexander Monakov Oct. 23, 2018, 8:44 a.m. UTC | #1
On Mon, 22 Oct 2018, H.J. Lu wrote:

> RDTSCP waits until all previous instructions have executed and all
> previous loads are globally visible before reading the counter.  RDTSC
> doesn't wait until all previous instructions have been executed before
> reading the counter.  This patch adds --enable-rdtscp-in-benchtests to
> use RDTSCP in benchtests.
> 
> NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP
> instruction.  All x86 processors since 2010 support RDTSCP instruction.

Without implying an objection to the patch, I'd like to point out that the
Linux kernel always uses "lfence; rdtsc" on Intel CPUs to obtain ordered
timestamps with lowest possible overhead. LFENCE is available on all x86-64
processors as part of SSE2.

On AMD CPUs the kernel also uses "lfence; rdtsc", except if it cannot setup
a specific MSR to make LFENCE serializing; in that case it falls back to
"mfence; rdtsc".

"lfence; rdtsc" sequence is also recommended by Intel SDM documentation.

Is there a specific reason that "rdtscp" is preferable for this patch?

Thanks.
Alexander
  
Florian Weimer Oct. 23, 2018, 9:04 a.m. UTC | #2
* H. J. Lu:

> RDTSCP waits until all previous instructions have executed and all
> previous loads are globally visible before reading the counter.  RDTSC
> doesn't wait until all previous instructions have been executed before
> reading the counter.  This patch adds --enable-rdtscp-in-benchtests to
> use RDTSCP in benchtests.
>
> NOTE: Benchtests in RDTSCP-enabled glibc require CPUs capable of RDTSCP
> instruction.  All x86 processors since 2010 support RDTSCP instruction.

Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
case the TSC is not synchronized across cores?

Thanks,
Florian
  
Siddhesh Poyarekar Oct. 23, 2018, 9:18 a.m. UTC | #3
On 23/10/18 2:34 PM, Florian Weimer wrote:
> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
> case the TSC is not synchronized across cores?

There's an option USE_CLOCK_GETTIME to make benchtests do that, but 
otherwise it uses the hp_timing bits by default.

Siddhesh
  
H.J. Lu Oct. 23, 2018, 10:58 a.m. UTC | #4
On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
> On 23/10/18 2:34 PM, Florian Weimer wrote:
>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
>> case the TSC is not synchronized across cores?
>
> There's an option USE_CLOCK_GETTIME to make benchtests do that, but
> otherwise it uses the hp_timing bits by default.

I want something better that rdtsc and very low overhead since some bench
tests only last a few cycles.  Adding lfence may make timing data look like
noise.
  
Siddhesh Poyarekar Oct. 23, 2018, 11:53 a.m. UTC | #5
On 23/10/18 4:28 PM, H.J. Lu wrote:
> On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:
>> On 23/10/18 2:34 PM, Florian Weimer wrote:
>>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
>>> case the TSC is not synchronized across cores?
>>
>> There's an option USE_CLOCK_GETTIME to make benchtests do that, but
>> otherwise it uses the hp_timing bits by default.
> 
> I want something better that rdtsc and very low overhead since some bench
> tests only last a few cycles.  Adding lfence may make timing data look like
> noise.

Is the configure option necessary?  You could just add a Makefile option 
in benchtests similar to USE_CLOCK_GETTIME and then document it in 
benchtests/README.

Siddhesh
  
Szabolcs Nagy Oct. 23, 2018, 11:54 a.m. UTC | #6
On 23/10/18 11:58, H.J. Lu wrote:
> On 10/23/18, Siddhesh Poyarekar <siddhesh@gotplt.org> wrote:

>> On 23/10/18 2:34 PM, Florian Weimer wrote:

>>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in

>>> case the TSC is not synchronized across cores?

>>

>> There's an option USE_CLOCK_GETTIME to make benchtests do that, but

>> otherwise it uses the hp_timing bits by default.

> 

> I want something better that rdtsc and very low overhead since some bench

> tests only last a few cycles.  Adding lfence may make timing data look like

> noise.

> 


ideally bench test should be fixed so clock_gettime gives
stable enough results.

target specific timers are not always available and their
results are hard to interpret compared to a standard api
that returns wall clock time in sane units.
  
Siddhesh Poyarekar Oct. 24, 2018, 8:34 a.m. UTC | #7
On 23/10/18 5:24 PM, Szabolcs Nagy wrote:
> target specific timers are not always available and their
> results are hard to interpret compared to a standard api
> that returns wall clock time in sane units.

Right, that is specifically why the USE_CLOCK_GETTIME option exists. 
It's not the default though because comparison between architectures is 
not the dominant use case.

Siddhesh
  

Patch

diff --git a/INSTALL b/INSTALL
index f9c5cbb9a3..01e4f7af0c 100644
--- a/INSTALL
+++ b/INSTALL
@@ -153,6 +153,16 @@  if 'CFLAGS' is specified it must enable optimization.  For example:
      library.  This option hardcodes the newly built C library path in
      dynamic tests so that they can be invoked directly.
 
+'--enable-rdtscp-in-benchtests'
+     This x86 only option enables RDTSCP instruction in benchtests.
+     When the GNU C Library is built with '--enable-rdtscp-in-tests',
+     benchtests will use RDTSCP instruction, instead of RDTSC
+     instruction, for high precision, low overhead timing, which
+     provides more precise timing data.  The resulting library is
+     unchanged by this option.  Note that benchtests in RDTSCP-enabled
+     the GNU C Library require CPUs capable of RDTSCP instruction.  All
+     x86 processors since 2010 support RDTSCP instruction.
+
 '--disable-timezone-tools'
      By default, timezone related utilities ('zic', 'zdump', and
      'tzselect') are installed with the GNU C Library.  If you are
diff --git a/NEWS b/NEWS
index f054dc0433..d6007ec4f5 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,13 @@  Version 2.29
 
 Major new features:
 
+* The GNU C Library can now be built with --enable-rdtscp-in-benchtests
+  to use RDTSCP instruction in benchtests for high precision, low
+  overhead timing on x86 CPUs, which provides more precise timing data.
+  The resulting library is unchanged.  Benchtests in RDTSCP-enabled glibc
+  require CPUs capable of RDTSCP instruction.  All x86 processors since
+  2010 support RDTSCP instruction.
+
 * A new convenience target has been added for distribution maintainers
   to build and install all locales as directories with files.  The new
   target is run by issuing the following command in your build tree:
diff --git a/benchtests/Makefile b/benchtests/Makefile
index bcd6a9c26d..16326c9e41 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -131,6 +131,10 @@  CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION) -D_ISOMAC
 # HP_TIMING if it is available.
 ifdef USE_CLOCK_GETTIME
 CPPFLAGS-nonlib += -DUSE_CLOCK_GETTIME
+else
+ifeq (${enable-rdtscp-in-benchtests},yes)
+CPPFLAGS-nonlib += -DENABLE_RDTSCP
+endif
 endif
 
 DETAILED_OPT :=
diff --git a/configure b/configure
index f30c31afdc..a562849490 100755
--- a/configure
+++ b/configure
@@ -794,6 +794,7 @@  enable_pt_chown
 enable_tunables
 enable_mathvec
 enable_cet
+enable_rdtscp_in_benchtests
 with_cpu
 '
       ac_precious_vars='build_alias
@@ -1471,6 +1472,8 @@  Optional Features:
                           depends on architecture]
   --enable-cet            enable Intel Control-flow Enforcement Technology
                           (CET), x86 only
+  --enable-rdtscp-in-benchtests
+                          enable RDTSCP in benchtests, x86 only [default=no]
 
 Optional Packages:
   --with-PACKAGE[=ARG]    use PACKAGE [ARG=yes]
@@ -3785,6 +3788,16 @@  else
 fi
 
 
+# Check whether --enable-rdtscp-in-benchtests was given.
+if test "${enable_rdtscp_in_benchtests+set}" = set; then :
+  enableval=$enable_rdtscp_in_benchtests; rdtscp_in_benchtests=$enableval
+else
+  rdtscp_in_benchtests=no
+fi
+
+config_vars="$config_vars
+enable-rdtscp-in-benchtests = $rdtscp_in_benchtests"
+
 # We keep the original values in `$config_*' and never modify them, so we
 # can write them unchanged into config.make.  Everything else uses
 # $machine, $vendor, and $os, and changes them whenever convenient.
diff --git a/configure.ac b/configure.ac
index e983fd8faa..d3cfb2c728 100644
--- a/configure.ac
+++ b/configure.ac
@@ -478,6 +478,13 @@  AC_ARG_ENABLE([cet],
 	      [enable_cet=$enableval],
 	      [enable_cet=no])
 
+AC_ARG_ENABLE([rdtscp-in-benchtests],
+	      AC_HELP_STRING([--enable-rdtscp-in-benchtests],
+			     [enable RDTSCP in benchtests, x86 only @<:@default=no@:>@]),
+	      [rdtscp_in_benchtests=$enableval],
+	      [rdtscp_in_benchtests=no])
+LIBC_CONFIG_VAR([enable-rdtscp-in-benchtests], [$rdtscp_in_benchtests])
+
 # We keep the original values in `$config_*' and never modify them, so we
 # can write them unchanged into config.make.  Everything else uses
 # $machine, $vendor, and $os, and changes them whenever convenient.
diff --git a/manual/install.texi b/manual/install.texi
index 61178dadcd..fa3fb912f6 100644
--- a/manual/install.texi
+++ b/manual/install.texi
@@ -182,6 +182,15 @@  By default, dynamic tests are linked to run with the installed C library.
 This option hardcodes the newly built C library path in dynamic tests
 so that they can be invoked directly.
 
+@item --enable-rdtscp-in-benchtests
+This x86 only option enables RDTSCP instruction in benchtests.  When
+@theglibc{} is built with @option{--enable-rdtscp-in-tests}, benchtests
+will use RDTSCP instruction, instead of RDTSC instruction, for high
+precision, low overhead timing, which provides more precise timing data.
+The resulting library is unchanged by this option.  Note that benchtests
+in RDTSCP-enabled @theglibc{} require CPUs capable of RDTSCP instruction.
+All x86 processors since 2010 support RDTSCP instruction.
+
 @item --disable-timezone-tools
 By default, timezone related utilities (@command{zic}, @command{zdump},
 and @command{tzselect}) are installed with @theglibc{}.  If you are building
diff --git a/sysdeps/x86/hp-timing.h b/sysdeps/x86/hp-timing.h
index 77a1360748..fd840314fa 100644
--- a/sysdeps/x86/hp-timing.h
+++ b/sysdeps/x86/hp-timing.h
@@ -40,7 +40,20 @@  typedef unsigned long long int hp_timing_t;
 
    NB: Use __builtin_ia32_rdtsc directly since including <x86intrin.h>
    makes building glibc very slow.  */
-# define HP_TIMING_NOW(Var)	((Var) = __builtin_ia32_rdtsc ())
+# ifdef ENABLE_RDTSCP
+/* RDTSCP waits until all previous instructions have executed and all
+   previous loads are globally visible before reading the counter.
+   RDTSC doesn't wait until all previous instructions have been executed
+   before reading the counter.  When --enable-rdtscp-in-tests is used,
+   use RDTSCP in benchtests.  */
+#  define HP_TIMING_NOW(Var) \
+  (__extension__ ({				\
+    unsigned int __aux;				\
+    (Var) = __builtin_ia32_rdtscp (&__aux);	\
+  }))
+# else
+#  define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ())
+# endif
 
 # include <hp-timing-common.h>
 #else