From patchwork Tue Oct 23 21:29:00 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 29855
Received: (qmail 42778 invoked by alias); 23 Oct 2018 21:29:06 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Unsubscribe:
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 42752 invoked by uid 89); 23 Oct 2018 21:29:06 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-25.3 required=5.0 tests=AWL, BAYES_00,
 FREEMAIL_FROM, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
 RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=
X-HELO: mail-oi1-f194.google.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:from:date:message-id:subject:to:cc;
 bh=ZGevMrUrMRHqprsYEpPbkzJNmq/d6CJkOn5bTLLRqkY=;
 b=NA95H6Mps6Z8Lk7gMemA3CRBdxpKXFh56sHPIqz91JFiU83Oo3eR4wu9CsgWLxmBNB
 oz91TCNDv/o7uQl17gcOqD/K0cUaEP4VDuG+w2MImRO3RHmb25342fq8hvyVmAMtYY1W
 cS7nnaSd2KcUhoRzIjwLj2h8LN8hNDS1xkdqrNz1N+5OYYD0Twg+j82HYEUDtiZC5udv
 WvH9wR9Oc2ziVxbQuqlAo4Bf1q0LLNTsEXD38tBGA2KH/EZi6+/wOjevd5tSmRT1CMk7
 YGnrtqnJzx6K88DXqLM2cTSSMVXf8G5APnjyzGVkTlF7Q3xywOHJNewdfE/1GQW3l3Dk
 uJnw==
MIME-Version: 1.0
From: "H.J. Lu"
Date: Tue, 23 Oct 2018 14:29:00 -0700
Message-ID:
Subject: [PATCH] x86: Support RDTSCP for benchtests
To: Szabolcs Nagy
Cc: Siddhesh Poyarekar , nd , Florian Weimer ,
 "libc-alpha@sourceware.org"

On 10/23/18, Szabolcs Nagy wrote:
> On 23/10/18 11:58, H.J. Lu wrote:
>> On 10/23/18, Siddhesh Poyarekar wrote:
>>> On 23/10/18 2:34 PM, Florian Weimer wrote:
>>>> Shouldn't the benchtests use clock_gettime anyway, to avoid issues in
>>>> case the TSC is not synchronized across cores?
>>>
>>> There's an option USE_CLOCK_GETTIME to make benchtests do that, but
>>> otherwise it uses the hp_timing bits by default.
>>
>> I want something better than rdtsc and very low overhead since some bench
>> tests only last a few cycles.  Adding lfence may make timing data look
>> like noise.
>>
>
> ideally bench test should be fixed so clock_gettime gives
> stable enough results.
>
> target specific timers are not always available and their
> results are hard to interpret compared to a standard api
> that returns wall clock time in sane units.
>

Here is a simple patch to support RDTSCP for benchtests.  OK for trunk?

From a1cf1cb1c86cfc99b81ce6e1caf5807d2ec25c08 Mon Sep 17 00:00:00 2001
From: "H.J. Lu"
Date: Mon, 22 Oct 2018 01:13:38 -0700
Subject: [PATCH] x86: Support RDTSCP for benchtests

RDTSCP waits until all previous instructions have executed and all
previous loads are globally visible before reading the counter.  RDTSC
doesn't wait until all previous instructions have been executed before
reading the counter.  All x86 processors since 2010 support RDTSCP
instruction.  This patch adds RDTSCP support to benchtests.

	* benchtests/Makefile (CPPFLAGS-nonlib): Add -DUSE_RDTSCP if
	USE_RDTSCP is defined.
	* sysdeps/x86/hp-timing.h (HP_TIMING_NOW): Use RDTSCP if USE_RDTSCP
	is defined.
---
 benchtests/Makefile     |  6 ++++++
 benchtests/README       |  9 +++++++++
 sysdeps/x86/hp-timing.h | 14 +++++++++++++-
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/benchtests/Makefile b/benchtests/Makefile
index bcd6a9c26d..45aeb5febe 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -131,6 +131,12 @@ CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION) -D_ISOMAC
 # HP_TIMING if it is available.
 ifdef USE_CLOCK_GETTIME
 CPPFLAGS-nonlib += -DUSE_CLOCK_GETTIME
+else
+# On x86 processors, use RDTSCP, instead of RDTSC, to measure performance
+# of functions.  All x86 processors since 2010 support RDTSCP instruction.
+ifdef USE_RDTSCP
+CPPFLAGS-nonlib += -DUSE_RDTSCP
+endif
 endif
 
 DETAILED_OPT :=
diff --git a/benchtests/README b/benchtests/README
index 4ddff794d1..aaf0b659e2 100644
--- a/benchtests/README
+++ b/benchtests/README
@@ -34,6 +34,15 @@ the benchmark to use clock_gettime by invoking make as follows:
 Again, one must run `make bench-clean' before changing the measurement
 method.
 
+On x86 processors, RDTSCP instruction provides more precise timing data
+than RDTSC instruction.  All x86 processors since 2010 support RDTSCP
+instruction.  One can force the benchmark to use RDTSCP by invoking make
+as follows:
+
+  $ make USE_RDTSCP=1 bench
+
+One must run `make bench-clean' before changing the measurement method.
+
 Running benchmarks on another target:
 ====================================
 
diff --git a/sysdeps/x86/hp-timing.h b/sysdeps/x86/hp-timing.h
index 77a1360748..0aa6f5e3f8 100644
--- a/sysdeps/x86/hp-timing.h
+++ b/sysdeps/x86/hp-timing.h
@@ -40,7 +40,19 @@ typedef unsigned long long int hp_timing_t;
    NB: Use __builtin_ia32_rdtsc directly since including makes
    building glibc very slow.  */
-# define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ())
+# ifdef USE_RDTSCP
+/* RDTSCP waits until all previous instructions have executed and all
+   previous loads are globally visible before reading the counter.
+   RDTSC doesn't wait until all previous instructions have been executed
+   before reading the counter.  */
+#  define HP_TIMING_NOW(Var) \
+  (__extension__ ({                            \
+    unsigned int __aux;                        \
+    (Var) = __builtin_ia32_rdtscp (&__aux);    \
+  }))
+# else
+#  define HP_TIMING_NOW(Var) ((Var) = __builtin_ia32_rdtsc ())
+# endif
 
 # include
 #else
-- 
2.17.2
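
For reference, a minimal standalone sketch of the same idea outside the
glibc tree.  It is an illustration only, not part of the patch: the names
bench_ticks_t, bench_now, bench_time_calls and bench_nop are hypothetical,
and the clock_gettime fallback for non-x86 targets is an assumption added
so the sketch builds everywhere.  Only the two compiler builtins are taken
from the patch itself.

```c
#define _POSIX_C_SOURCE 199309L	/* For clock_gettime in strict ISO modes.  */

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for hp_timing_t; not the glibc type.  */
typedef uint64_t bench_ticks_t;

/* Read a timestamp.  On x86, use RDTSCP when USE_RDTSCP is defined and
   RDTSC otherwise, mirroring the macro in the patch.  On other targets
   (an assumption of this sketch) fall back to nanoseconds from
   clock_gettime.  */
static inline bench_ticks_t
bench_now (void)
{
#if (defined __x86_64__ || defined __i386__) && defined USE_RDTSCP
  unsigned int aux;	/* Receives the IA32_TSC_AUX value; unused here.  */
  return __builtin_ia32_rdtscp (&aux);
#elif defined __x86_64__ || defined __i386__
  return __builtin_ia32_rdtsc ();
#else
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return (bench_ticks_t) ts.tv_sec * 1000000000ull
	 + (bench_ticks_t) ts.tv_nsec;
#endif
}

/* Call FN ITERS times and return the elapsed ticks (TSC cycles on x86,
   nanoseconds with the fallback).  */
static bench_ticks_t
bench_time_calls (void (*fn) (void), unsigned int iters)
{
  assert (fn != NULL);
  bench_ticks_t start = bench_now ();
  for (unsigned int i = 0; i < iters; i++)
    fn ();
  bench_ticks_t end = bench_now ();
  return end - start;
}

/* Empty function, useful for estimating the overhead of the loop and
   of the timer reads themselves.  */
static void
bench_nop (void)
{
}
```

Building with -DUSE_RDTSCP selects the RDTSCP path, much as
`make USE_RDTSCP=1 bench' does for the benchtests.  Timing bench_nop first
gives a rough baseline to compare against when the function under test
only lasts a few cycles.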