From patchwork Tue Apr 25 15:27:14 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 20138
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
In-Reply-To: <20170418183712.GA22211@intel.com>
References: <20170418183712.GA22211@intel.com>
From: "H.J. Lu"
Date: Tue, 25 Apr 2017 08:27:14 -0700
Subject: Re: [PATCH] x86: Use AVX2 memcpy/memset on Skylake server [BZ #21396]
To: GNU C Library

On Tue, Apr 18, 2017 at 11:37 AM, H.J. Lu wrote:
> On Skylake server, AVX512 load/store instructions in memcpy/memset may
> lead to lower CPU turbo frequency in certain situations.  Use of AVX2
> in memcpy/memset has been observed to have improved overall performance
> in many workloads due to the higher frequency.
>
> Since AVX512ER is unique to Xeon Phi, this patch sets Prefer_No_AVX512
> if AVX512ER isn't available so that AVX2 versions of memcpy/memset are
> used on Skylake server.
>
> Any comments?
>
>
> H.J.
> ---
>         [BZ #21396]
>         * sysdeps/x86/cpu-features.c (init_cpu_features): Set
>         Prefer_No_AVX512 if AVX512ER isn't available.
>         * sysdeps/x86/cpu-features.h (bit_arch_Prefer_No_AVX512): New.
>         (index_arch_Prefer_No_AVX512): Likewise.
>         * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Don't use
>         AVX512 version if Prefer_No_AVX512 is set.
>         * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk):
>         Likewise.
>         * sysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Likewise.
>         * sysdeps/x86_64/multiarch/memmove_chk.S (__memmove_chk):
>         Likewise.
>         * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
>         * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk):
>         Likewise.
>         * sysdeps/x86_64/multiarch/memset.S (memset): Likewise.
>         * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk):
>         Likewise.

Since this issue has a significant impact on Skylake server, I'd like to
backport it to the 2.24 and 2.25 branches, together with the prerequisite
patch.

Any comments?

Thanks.

H.J.

From 844d4c176d03c0002c7faa9f094a4f56bc9f9733 Mon Sep 17 00:00:00 2001
From: "H.J. Lu"
Date: Tue, 18 Apr 2017 14:01:45 -0700
Subject: [PATCH 2/2] x86: Use AVX2 memcpy/memset on Skylake server [BZ #21396]

On Skylake server, AVX512 load/store instructions in memcpy/memset may
lead to lower CPU turbo frequency in certain situations.
Use of AVX2 in memcpy/memset has been observed to have improved overall
performance in many workloads due to the higher frequency.

Since AVX512ER is unique to Xeon Phi, this patch sets Prefer_No_AVX512
if AVX512ER isn't available so that AVX2 versions of memcpy/memset are
used on Skylake server.

        [BZ #21396]
        * sysdeps/x86/cpu-features.c (init_cpu_features): Set
        Prefer_No_AVX512 if AVX512ER isn't available.
        * sysdeps/x86/cpu-features.h (bit_arch_Prefer_No_AVX512): New.
        (index_arch_Prefer_No_AVX512): Likewise.
        * sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Don't use
        AVX512 version if Prefer_No_AVX512 is set.
        * sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk):
        Likewise.
        * sysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Likewise.
        * sysdeps/x86_64/multiarch/memmove_chk.S (__memmove_chk):
        Likewise.
        * sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
        * sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk):
        Likewise.
        * sysdeps/x86_64/multiarch/memset.S (memset): Likewise.
        * sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk):
        Likewise.

(cherry picked from commit 4cb334c4d6249686653137ec273d081371b3672d)
---
 sysdeps/x86/cpu-features.c             | 6 +++++-
 sysdeps/x86/cpu-features.h             | 3 +++
 sysdeps/x86_64/multiarch/memcpy.S      | 2 ++
 sysdeps/x86_64/multiarch/memcpy_chk.S  | 2 ++
 sysdeps/x86_64/multiarch/memmove.S     | 2 ++
 sysdeps/x86_64/multiarch/memmove_chk.S | 2 ++
 sysdeps/x86_64/multiarch/mempcpy.S     | 2 ++
 sysdeps/x86_64/multiarch/mempcpy_chk.S | 2 ++
 sysdeps/x86_64/multiarch/memset.S      | 2 ++
 sysdeps/x86_64/multiarch/memset_chk.S  | 2 ++
 10 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 41d0be2..9afd74c 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -225,10 +225,14 @@ init_cpu_features (struct cpu_features *cpu_features)
               |= bit_arch_AVX_Fast_Unaligned_Load;
 
           /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
-             if AVX512ER is available.  */
+             if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
+             frequency if AVX512ER isn't available.  */
           if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
             cpu_features->feature[index_arch_Prefer_No_VZEROUPPER]
               |= bit_arch_Prefer_No_VZEROUPPER;
+          else
+            cpu_features->feature[index_arch_Prefer_No_AVX512]
+              |= bit_arch_Prefer_No_AVX512;
 
           /* To avoid SSE transition penalty, use _dl_runtime_resolve_slow.
              If XGETBV suports ECX == 1, use _dl_runtime_resolve_opt.  */
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index 2ee8a0a..a409db6 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -39,6 +39,7 @@
 #define bit_arch_Prefer_ERMS                    (1 << 19)
 #define bit_arch_Use_dl_runtime_resolve_opt     (1 << 20)
 #define bit_arch_Use_dl_runtime_resolve_slow    (1 << 21)
+#define bit_arch_Prefer_No_AVX512               (1 << 22)
 
 /* CPUID Feature flags.  */
 
@@ -116,6 +117,7 @@
 # define index_arch_Prefer_ERMS                 FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Use_dl_runtime_resolve_opt  FEATURE_INDEX_1*FEATURE_SIZE
 # define index_arch_Use_dl_runtime_resolve_slow FEATURE_INDEX_1*FEATURE_SIZE
+# define index_arch_Prefer_No_AVX512            FEATURE_INDEX_1*FEATURE_SIZE
 
 
 # if defined (_LIBC) && !IS_IN (nonlib)
@@ -298,6 +300,7 @@ extern const struct cpu_features *__get_cpu_features (void)
 # define index_arch_Prefer_ERMS                 FEATURE_INDEX_1
 # define index_arch_Use_dl_runtime_resolve_opt  FEATURE_INDEX_1
 # define index_arch_Use_dl_runtime_resolve_slow FEATURE_INDEX_1
+# define index_arch_Prefer_No_AVX512            FEATURE_INDEX_1
 
 #endif  /* !__ASSEMBLER__ */
 
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 1f83ee3..af27703 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -32,6 +32,8 @@ ENTRY(__new_memcpy)
         lea     __memcpy_erms(%rip), %RAX_LP
         HAS_ARCH_FEATURE (Prefer_ERMS)
         jnz     2f
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     1f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      1f
         lea     __memcpy_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
index 5492342..8737fb9 100644
--- a/sysdeps/x86_64/multiarch/memcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
@@ -30,6 +30,8 @@
 ENTRY(__memcpy_chk)
         .type   __memcpy_chk, @gnu_indirect_function
         LOAD_RTLD_GLOBAL_RO_RDX
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     1f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      1f
         lea     __memcpy_chk_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memmove.S b/sysdeps/x86_64/multiarch/memmove.S
index 2021bfc..8c534e8 100644
--- a/sysdeps/x86_64/multiarch/memmove.S
+++ b/sysdeps/x86_64/multiarch/memmove.S
@@ -30,6 +30,8 @@ ENTRY(__libc_memmove)
         lea     __memmove_erms(%rip), %RAX_LP
         HAS_ARCH_FEATURE (Prefer_ERMS)
         jnz     2f
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     1f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      1f
         lea     __memmove_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memmove_chk.S b/sysdeps/x86_64/multiarch/memmove_chk.S
index 8a252ad..7870dd0 100644
--- a/sysdeps/x86_64/multiarch/memmove_chk.S
+++ b/sysdeps/x86_64/multiarch/memmove_chk.S
@@ -29,6 +29,8 @@
 ENTRY(__memmove_chk)
         .type   __memmove_chk, @gnu_indirect_function
         LOAD_RTLD_GLOBAL_RO_RDX
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     1f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      1f
         lea     __memmove_chk_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
index 79c840d..b8b2b28 100644
--- a/sysdeps/x86_64/multiarch/mempcpy.S
+++ b/sysdeps/x86_64/multiarch/mempcpy.S
@@ -32,6 +32,8 @@ ENTRY(__mempcpy)
         lea     __mempcpy_erms(%rip), %RAX_LP
         HAS_ARCH_FEATURE (Prefer_ERMS)
         jnz     2f
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     1f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      1f
         lea     __mempcpy_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
index 6927962..072b22c 100644
--- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
@@ -30,6 +30,8 @@
 ENTRY(__mempcpy_chk)
         .type   __mempcpy_chk, @gnu_indirect_function
         LOAD_RTLD_GLOBAL_RO_RDX
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     1f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      1f
         lea     __mempcpy_chk_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memset.S b/sysdeps/x86_64/multiarch/memset.S
index c958b2f..9d33118 100644
--- a/sysdeps/x86_64/multiarch/memset.S
+++ b/sysdeps/x86_64/multiarch/memset.S
@@ -41,6 +41,8 @@ ENTRY(memset)
         jnz     L(AVX512F)
         lea     __memset_avx2_unaligned(%rip), %RAX_LP
 L(AVX512F):
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     2f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      2f
         lea     __memset_avx512_no_vzeroupper(%rip), %RAX_LP
diff --git a/sysdeps/x86_64/multiarch/memset_chk.S b/sysdeps/x86_64/multiarch/memset_chk.S
index 79eaa37..7e08311 100644
--- a/sysdeps/x86_64/multiarch/memset_chk.S
+++ b/sysdeps/x86_64/multiarch/memset_chk.S
@@ -38,6 +38,8 @@ ENTRY(__memset_chk)
         jnz     L(AVX512F)
         lea     __memset_chk_avx2_unaligned(%rip), %RAX_LP
 L(AVX512F):
+        HAS_ARCH_FEATURE (Prefer_No_AVX512)
+        jnz     2f
         HAS_ARCH_FEATURE (AVX512F_Usable)
         jz      2f
         lea     __memset_chk_avx512_no_vzeroupper(%rip), %RAX_LP
-- 
2.9.3
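
For readers who want the selection logic of the patch in one place, here
is a rough C model of it.  It is only a sketch: struct model_cpu,
enum model_impl, model_init_preferences and model_select are hypothetical
names made up for illustration and are not glibc interfaces.  The real
code keeps these bits in cpu_features->feature[] and does the selection
in the assembly IFUNC resolvers shown above, which go on to make further
checks (ERMS, AVX unaligned, SSSE3 and so on) that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the bits the patch looks at.  These fields only
   stand in for glibc's internal cpu_features bits.  */
struct model_cpu
{
  bool avx512f_usable;        /* Models AVX512F_Usable.  */
  bool avx512er;              /* Models the AVX512ER CPUID bit.  */
  bool prefer_no_vzeroupper;  /* Models Prefer_No_VZEROUPPER.  */
  bool prefer_no_avx512;      /* Models Prefer_No_AVX512 (new).  */
};

enum model_impl
{
  MODEL_IMPL_AVX512,      /* One of the *_avx512_* variants.  */
  MODEL_IMPL_NON_AVX512   /* Whatever the later checks would pick.  */
};

/* Models the cpu-features.c hunk: AVX512ER is taken as the marker for
   Xeon Phi; every other CPU gets Prefer_No_AVX512.  */
void
model_init_preferences (struct model_cpu *cpu)
{
  assert (cpu != NULL);
  if (cpu->avx512er)
    cpu->prefer_no_vzeroupper = true;
  else
    cpu->prefer_no_avx512 = true;
}

/* Models the two checks in each resolver: the new Prefer_No_AVX512 test
   comes first and skips the AVX512 variant even when AVX512F is
   usable.  */
enum model_impl
model_select (const struct model_cpu *cpu)
{
  assert (cpu != NULL);
  if (cpu->prefer_no_avx512)
    return MODEL_IMPL_NON_AVX512;
  if (!cpu->avx512f_usable)
    return MODEL_IMPL_NON_AVX512;
  return MODEL_IMPL_AVX512;
}
```

In this model, a CPU with avx512f_usable set and avx512er clear (the
Skylake server case) ends up on the non-AVX512 path, while one with both
set (the Xeon Phi case) still gets the AVX512 variant.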