Message ID | 20140318100138.GC8415@domone.podge |
---|---|
State | Committed |
Headers |
Date: Tue, 18 Mar 2014 11:01:38 +0100 From: Ondřej Bílka <neleai@seznam.cz> To: Carlos O'Donell <carlos@redhat.com> Cc: libc-alpha@sourceware.org Subject: [PATCH 3/2] Use strspn/strcspn/strpbrk ifunc in internal calls. Message-ID: <20140318100138.GC8415@domone.podge> References: <20140227123238.GA26291@domone.podge> <20140227124206.GA26474@domone.podge> <5318A03D.3000705@redhat.com> <20140306163241.GA11843@domone.podge> <5318B58B.5040704@redhat.com> <20140306205212.GB11843@domone.podge> <53192422.2050101@redhat.com> In-Reply-To: <53192422.2050101@redhat.com> List-Id: <libc-alpha.sourceware.org> |
Commit Message
Ondrej Bilka
March 18, 2014, 10:01 a.m. UTC
To make strtok faster and improve performance in general, we need one
additional change.

A comment:

/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
   The speedup we get from using SSE4.2 instruction is likely eaten away
   by the indirect call in the PLT.  */

does not make sense at all, because nobody bothered to check it. The gap
between these implementations is quite big: when the haystack is empty, the
sse2 version is around 40 cycles slower because it needs to populate a lookup
table, and the difference only increases with size. That is much bigger than
the PLT slowdown, which is a few cycles.

Even the benchtest shows a gap. That gap could also be reversed by branch
misprediction, but my internal benchmark showed it as well.

                                   simple_strspn  stupid_strspn  __strspn_sse42  __strspn_sse2
Length 0, alignment 0, acc len 6:  18.6562        35.2344        17.0469         61.6719
Length 6, alignment 0, acc len 6:  59.5469        72.5781        16.4219         73.625

This patch also handles strpbrk, which is implemented by including the
x86_64/multiarch/strcspn.S file.

	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
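To illustrate where the fixed cost mentioned above comes from, here is a minimal C sketch of a table-driven strspn. It is not the glibc `__strspn_sse2` code; the name `table_strspn` and the 256-entry byte table are assumptions chosen for illustration. The point is only that the table has to be cleared and filled before the first input byte is examined, so the setup is paid even for an empty string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not glibc code: strspn built on a lookup table
   indexed by byte value.  */
size_t
table_strspn (const char *s, const char *accept)
{
  unsigned char table[256];
  size_t n = 0;

  assert (s != NULL && accept != NULL);

  /* Fixed setup cost: clear the table and mark every accepted byte.
     This runs before any byte of S is looked at.  */
  memset (table, 0, sizeof table);
  for (const unsigned char *a = (const unsigned char *) accept; *a != '\0'; a++)
    table[*a] = 1;

  /* table[0] is never set, so the scan also stops at the terminator.  */
  while (table[(unsigned char) s[n]])
    n++;

  return n;
}
```

Calling `table_strspn ("", "abcdef")` returns 0 after doing the whole table setup, which is the case the "Length 0" row measures for the table-based variant.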
Comments
ping

On Tue, Mar 18, 2014 at 11:01:38AM +0100, Ondřej Bílka wrote:
> To make a strtok faster and improve performance in general we need to do one
> additional change.
>
> A comment:
>
> /* It doesn't make sense to send libc-internal strcspn calls through a PLT.
>    The speedup we get from using SSE4.2 instruction is likely eaten away
>    by the indirect call in the PLT.  */
>
> Does not make sense at all because nobody bothered to check it. Gap
> between these implementations is quite big, when haystack is empty a
> sse2 is around 40 cycles slower because it needs to populate a lookup
> table and difference only increases with size. That is much bigger than
> plt slowdown which is few cycles.
>
> Even benchtest show a gap which also may be reverse by branch
> misprediction but my internal benchmark shown.
>
>                                    simple_strspn  stupid_strspn  __strspn_sse42  __strspn_sse2
> Length 0, alignment 0, acc len 6:  18.6562        35.2344        17.0469         61.6719
> Length 6, alignment 0, acc len 6:  59.5469        72.5781        16.4219         73.625
>
> This patch also handles strpbrk which is implemented by including a
> x86_64/multiarch/strcspn.S file.
>
> 	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
> 	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
>
> diff --git a/sysdeps/x86_64/multiarch/strcspn.S b/sysdeps/x86_64/multiarch/strcspn.S
> index 24f55e9..1b3e1aa 100644
> --- a/sysdeps/x86_64/multiarch/strcspn.S
> +++ b/sysdeps/x86_64/multiarch/strcspn.S
> @@ -65,14 +65,7 @@ END(STRCSPN)
>  # undef END
>  # define END(name) \
>  	cfi_endproc; .size STRCSPN_SSE2, .-STRCSPN_SSE2
> -# undef libc_hidden_builtin_def
> -/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
> -   The speedup we get from using SSE4.2 instruction is likely eaten away
> -   by the indirect call in the PLT.  */
> -# define libc_hidden_builtin_def(name) \
> -	.globl __GI_STRCSPN; __GI_STRCSPN = STRCSPN_SSE2
>  #endif
> -
>  #endif /* HAVE_SSE4_SUPPORT */
>
>  #ifdef USE_AS_STRPBRK
> diff --git a/sysdeps/x86_64/multiarch/strspn.S b/sysdeps/x86_64/multiarch/strspn.S
> index bf7308e..fde1e1e 100644
> --- a/sysdeps/x86_64/multiarch/strspn.S
> +++ b/sysdeps/x86_64/multiarch/strspn.S
> @@ -50,12 +50,6 @@ END(strspn)
>  # undef END
>  # define END(name) \
>  	cfi_endproc; .size __strspn_sse2, .-__strspn_sse2
> -# undef libc_hidden_builtin_def
> -/* It doesn't make sense to send libc-internal strspn calls through a PLT.
> -   The speedup we get from using SSE4.2 instruction is likely eaten away
> -   by the indirect call in the PLT.  */
> -# define libc_hidden_builtin_def(name) \
> -	.globl __GI_strspn; __GI_strspn = __strspn_sse2
>  #endif
>
>  #endif /* HAVE_SSE4_SUPPORT */
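For readers following the diff: the removed `libc_hidden_builtin_def` override bound the internal `__GI_strspn` name directly to `__strspn_sse2`, so libc-internal callers always got the sse2 variant. With the override gone, internal calls use the ifunc-selected implementation instead. The C sketch below is only an analogy for that difference, using an ordinary function pointer; it is not how glibc implements ifunc, and every name in it (`spn_fn`, `spn_baseline`, `spn_fast`, `select_spn`, `call_internal_old`, `call_internal_new`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*spn_fn) (const char *, const char *);

/* Stand-in for the baseline variant (plays the role of __strspn_sse2).  */
size_t
spn_baseline (const char *s, const char *accept)
{
  size_t n = 0;

  while (s[n] != '\0' && strchr (accept, s[n]) != NULL)
    n++;
  return n;
}

/* Stand-in for the optimized variant; here it simply defers to strspn.  */
size_t
spn_fast (const char *s, const char *accept)
{
  return strspn (s, accept);
}

/* Assumed selector: picks a variant once, the way a resolver would
   after checking CPU features.  */
spn_fn
select_spn (int cpu_has_fast_path)
{
  return cpu_has_fast_path ? spn_fast : spn_baseline;
}

/* Before the patch (by analogy): internal callers are bound to the
   baseline variant at build time, whatever the CPU supports.  */
size_t
call_internal_old (const char *s, const char *accept)
{
  return spn_baseline (s, accept);
}

/* After the patch (by analogy): internal callers pay one indirect call
   and get whichever variant was selected.  */
size_t
call_internal_new (const char *s, const char *accept, spn_fn resolved)
{
  assert (resolved != NULL);
  return resolved (s, accept);
}
```

Both paths return the same result; the trade-off discussed in the thread is the cost of the indirect call against the cost of always running the slower variant.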
ping

On Thu, Mar 27, 2014 at 10:18:06PM +0100, Ondřej Bílka wrote:
> ping
On Sat, Apr 05, 2014 at 04:48:41PM +0200, Ondřej Bílka wrote:
> ping
ping

On Sat, Apr 12, 2014 at 09:24:47PM +0200, Ondřej Bílka wrote:
> On Sat, Apr 05, 2014 at 04:48:41PM +0200, Ondřej Bílka wrote:
> > ping
ping

On Mon, May 12, 2014 at 02:00:11PM +0200, Ondřej Bílka wrote:
> ping
ping

On Sat, May 24, 2014 at 01:23:13AM +0200, Ondřej Bílka wrote:
> ping
ping

On Wed, Jun 04, 2014 at 02:47:54PM +0200, Ondřej Bílka wrote:
> ping
ping

On Tue, Jun 24, 2014 at 12:41:52PM +0200, Ondřej Bílka wrote:
> ping
ping

On Wed, Dec 10, 2014 at 03:39:31PM +0100, Ondřej Bílka wrote:
> ping
On 18 Mar 2014 11:01, Ondřej Bílka wrote:
> To make a strtok faster and improve performance in general we need to do one
> additional change.
>
> A comment:
>
> /* It doesn't make sense to send libc-internal strcspn calls through a PLT.
>    The speedup we get from using SSE4.2 instruction is likely eaten away
>    by the indirect call in the PLT.  */
>
> Does not make sense at all because nobody bothered to check it. Gap
> between these implementations is quite big, when haystack is empty a
> sse2 is around 40 cycles slower because it needs to populate a lookup
> table and difference only increases with size. That is much bigger than
> plt slowdown which is few cycles.
>
> Even benchtest show a gap which also may be reverse by branch
> misprediction but my internal benchmark shown.
>
>                                   simple_strspn  stupid_strspn  __strspn_sse42 __strspn_sse2
> Length 0, alignment 0, acc len 6: 18.6562        35.2344        17.0469        61.6719
> Length 6, alignment 0, acc len 6: 59.5469        72.5781        16.4219        73.625
>
> This patch also handles strpbrk which is implemented by including a
> x86_64/multiarch/strcspn.S file.
>
> 	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
> 	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.

since H.J. wrote the code, he probably should be the one approving this change
-mike
On Thu, Mar 5, 2015 at 6:03 PM, Mike Frysinger <vapier@gentoo.org> wrote:
> On 18 Mar 2014 11:01, Ondřej Bílka wrote:
>> To make a strtok faster and improve performance in general we need to do one
>> additional change.
>>
>> A comment:
>>
>> /* It doesn't make sense to send libc-internal strcspn calls through a PLT.
>>    The speedup we get from using SSE4.2 instruction is likely eaten away
>>    by the indirect call in the PLT.  */
>>
>> Does not make sense at all because nobody bothered to check it. Gap
>> between these implementations is quite big, when haystack is empty a
>> sse2 is around 40 cycles slower because it needs to populate a lookup
>> table and difference only increases with size. That is much bigger than
>> plt slowdown which is few cycles.
>>
>> Even benchtest show a gap which also may be reverse by branch
>> misprediction but my internal benchmark shown.
>>
>>                                   simple_strspn  stupid_strspn  __strspn_sse42 __strspn_sse2
>> Length 0, alignment 0, acc len 6: 18.6562        35.2344        17.0469        61.6719
>> Length 6, alignment 0, acc len 6: 59.5469        72.5781        16.4219        73.625
>>
>> This patch also handles strpbrk which is implemented by including a
>> x86_64/multiarch/strcspn.S file.
>>
>> 	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
>> 	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
>
> since H.J. wrote the code, he probably should be the one approving this change
> -mike

Looks good to me. Please commit. Sorry for the long delay.

Thanks.
diff --git a/sysdeps/x86_64/multiarch/strcspn.S b/sysdeps/x86_64/multiarch/strcspn.S
index 24f55e9..1b3e1aa 100644
--- a/sysdeps/x86_64/multiarch/strcspn.S
+++ b/sysdeps/x86_64/multiarch/strcspn.S
@@ -65,14 +65,7 @@ END(STRCSPN)
 # undef END
 # define END(name) \
 	cfi_endproc; .size STRCSPN_SSE2, .-STRCSPN_SSE2
-# undef libc_hidden_builtin_def
-/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
-   The speedup we get from using SSE4.2 instruction is likely eaten away
-   by the indirect call in the PLT.  */
-# define libc_hidden_builtin_def(name) \
-	.globl __GI_STRCSPN; __GI_STRCSPN = STRCSPN_SSE2
 #endif
-
 #endif /* HAVE_SSE4_SUPPORT */
 
 #ifdef USE_AS_STRPBRK
diff --git a/sysdeps/x86_64/multiarch/strspn.S b/sysdeps/x86_64/multiarch/strspn.S
index bf7308e..fde1e1e 100644
--- a/sysdeps/x86_64/multiarch/strspn.S
+++ b/sysdeps/x86_64/multiarch/strspn.S
@@ -50,12 +50,6 @@ END(strspn)
 # undef END
 # define END(name) \
 	cfi_endproc; .size __strspn_sse2, .-__strspn_sse2
-# undef libc_hidden_builtin_def
-/* It doesn't make sense to send libc-internal strspn calls through a PLT.
-   The speedup we get from using SSE4.2 instruction is likely eaten away
-   by the indirect call in the PLT.  */
-# define libc_hidden_builtin_def(name) \
-	.globl __GI_strspn; __GI_strspn = __strspn_sse2
 #endif
 
 #endif /* HAVE_SSE4_SUPPORT */
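
Note on the lookup-table cost mentioned in the patch description: the description attributes the SSE2 variant's deficit on an empty haystack to populating a lookup table before scanning. The following is only a minimal C sketch of that general table-based technique, not the glibc assembly; `table_strspn` is a hypothetical name used purely for illustration. It shows that the table is built from `accept` before the first byte of `s` is examined, so the setup cost is paid even when `s` is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration only; this is not a glibc symbol and does not
   reproduce the actual __strspn_sse2 implementation.  */
size_t
table_strspn (const char *s, const char *accept)
{
  unsigned char table[256];

  assert (s != NULL && accept != NULL);

  /* Fixed setup cost: clear the table and mark every accepted byte.
     This happens before s is read at all, so an empty s still pays it.  */
  memset (table, 0, sizeof table);
  for (const unsigned char *a = (const unsigned char *) accept; *a != '\0'; a++)
    table[*a] = 1;

  /* Scan: table[0] is never set, so the loop stops at the terminating NUL.  */
  const unsigned char *p = (const unsigned char *) s;
  while (table[*p])
    p++;

  return (size_t) (p - (const unsigned char *) s);
}
```

For any valid pair of strings this should return the same value as the standard `strspn`; the point of the sketch is only where the per-call setup work sits relative to the scan.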