[3/2] Use strspn/strcspn/strpbrk ifunc in internal calls.


Message ID

20140318100138.GC8415@domone.podge

State

Committed

Headers

DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id
	:list-unsubscribe:list-subscribe:list-archive:list-post
	:list-help:sender:date:from:to:cc:subject:message-id:references
	:mime-version:content-type:in-reply-to; q=dns; s=default; b=XtlW
	pgL6s5DysDMYKuST64xv6uVg5Wm8b8RAE7t9IFM6mfmrRWp3noaPyssGshs3M5bn
	AyLNcNHm9Bk3mujPld2rdHruslh9wQoY6PJLMSI7xtIjiMBfqdKgdmWM8+B2jwoW
	CFSkDf8VoUmzbM5VXfC5IPonpolgRIyTwXwhjdM=
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Date: Tue, 18 Mar 2014 11:01:38 +0100
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: Carlos O'Donell <carlos@redhat.com>
Cc: libc-alpha@sourceware.org
Subject: [PATCH 3/2] Use strspn/strcspn/strpbrk ifunc in internal calls.
Message-ID: <20140318100138.GC8415@domone.podge>
References: <20140227123238.GA26291@domone.podge>
	<20140227124206.GA26474@domone.podge>
	<5318A03D.3000705@redhat.com>
	<20140306163241.GA11843@domone.podge>
	<5318B58B.5040704@redhat.com>
	<20140306205212.GB11843@domone.podge>
	<53192422.2050101@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <53192422.2050101@redhat.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

Commit Message

Ondrej Bilka March 18, 2014, 10:01 a.m. UTC

To make strtok faster and to improve performance in general, we need one
additional change.

A comment:

/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
   The speedup we get from using SSE4.2 instruction is likely eaten away
   by the indirect call in the PLT.  */

Does not make sense at all, because nobody bothered to check it. The gap
between these implementations is quite big: when the haystack is empty, the
sse2 version is around 40 cycles slower because it needs to populate a lookup
table, and the difference only increases with size. That is much bigger than
the PLT slowdown, which is a few cycles.
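
To illustrate where a fixed setup cost of that kind comes from, here is a
minimal C sketch of a table-driven strspn. It is an assumption about the
general technique, not the actual sse2 code; the name table_strspn is
hypothetical. The table has to be filled before the first byte of the
haystack is examined, so the cost is paid even for an empty haystack.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name; a sketch of a table-driven strspn, not the glibc
   implementation.  */
static size_t
table_strspn (const char *s, const char *accept)
{
  unsigned char table[256];
  const unsigned char *p;
  size_t n = 0;

  /* Fixed setup cost: clear the table and mark every byte of ACCEPT.
     This runs even when S is the empty string.  */
  memset (table, 0, sizeof table);
  for (p = (const unsigned char *) accept; *p != '\0'; p++)
    table[*p] = 1;

  /* Scan: one table lookup per byte of S.  table[0] is always 0, so the
     terminating NUL stops the loop.  */
  for (p = (const unsigned char *) s; table[*p]; p++)
    n++;

  return n;
}
```

Calling table_strspn ("", "abcdef") still clears and fills the table before
returning 0, which is the kind of overhead the empty-haystack case measures.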

Even the benchtests show a gap. Branch misprediction could also reverse it,
but my internal benchmark showed a gap as well.

                                        simple_strspn  stupid_strspn  __strspn_sse42  __strspn_sse2
Length    0, alignment  0, acc len  6:        18.6562        35.2344         17.0469        61.6719
Length    6, alignment  0, acc len  6:        59.5469        72.5781         16.4219         73.625

This patch also handles strpbrk, which is implemented by including the
x86_64/multiarch/strcspn.S file.
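
The reason one source file can serve both functions is that strpbrk differs
from strcspn only in what it returns. A small C sketch of that relationship
follows. It shows the semantics only; the real file is assembly built with
USE_AS_STRPBRK defined, and the name sketch_strpbrk is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name, for illustration only.  */
static char *
sketch_strpbrk (const char *s, const char *accept)
{
  /* strcspn returns the length of the initial segment of S that
     contains no byte from ACCEPT.  */
  s += strcspn (s, accept);

  /* If the scan stopped at the terminator, no byte from ACCEPT occurs
     in S, and strpbrk returns a null pointer.  */
  return *s != '\0' ? (char *) s : NULL;
}
```

So any change to how strcspn is reached applies to strpbrk as well.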

	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
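
As a rough picture of what changes for internal callers, here is a C sketch
using a function pointer as a stand-in for the ifunc. Everything in it is an
assumption for illustration: the variant_* functions just call strspn,
cpu_has_sse42 and the other names are hypothetical, and a real ifunc is
resolved by the dynamic linker rather than by a null check at call time.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*strspn_fn) (const char *, const char *);

/* Stand-ins for the two variants; both simply call strspn here.  */
static size_t
variant_sse2 (const char *s, const char *a)
{
  return strspn (s, a);
}

static size_t
variant_sse42 (const char *s, const char *a)
{
  return strspn (s, a);
}

/* Hypothetical CPU check; reports 0 on compilers or targets without
   the x86 builtin.  */
static int
cpu_has_sse42 (void)
{
#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
  return __builtin_cpu_supports ("sse4.2") != 0;
#else
  return 0;
#endif
}

/* Plays the role of the ifunc resolver: pick a variant once.  */
static strspn_fn
resolve_strspn (void)
{
  return cpu_has_sse42 () ? variant_sse42 : variant_sse2;
}

static strspn_fn dispatched_strspn_ptr;

/* Before the patch: internal calls were bound directly to the sse2
   variant, whatever the CPU supports.  */
static size_t
internal_call_before (const char *s, const char *a)
{
  return variant_sse2 (s, a);
}

/* After the patch: internal calls take the same indirect route as
   external ones and reach whichever variant was selected.  */
static size_t
internal_call_after (const char *s, const char *a)
{
  if (dispatched_strspn_ptr == NULL)
    dispatched_strspn_ptr = resolve_strspn ();
  return dispatched_strspn_ptr (s, a);
}
```

Both paths return the same result; only the variant that runs, and the one
extra indirect jump, differ.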

Comments

Ondrej Bilka March 27, 2014, 9:18 p.m. UTC | #1

ping
On Tue, Mar 18, 2014 at 11:01:38AM +0100, Ondřej Bílka wrote:
> diff --git a/sysdeps/x86_64/multiarch/strcspn.S b/sysdeps/x86_64/multiarch/strcspn.S
> index 24f55e9..1b3e1aa 100644
> --- a/sysdeps/x86_64/multiarch/strcspn.S
> +++ b/sysdeps/x86_64/multiarch/strcspn.S
> @@ -65,14 +65,7 @@ END(STRCSPN)
>  # undef END
>  # define END(name) \
>  	cfi_endproc; .size STRCSPN_SSE2, .-STRCSPN_SSE2
> -# undef libc_hidden_builtin_def
> -/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
> -   The speedup we get from using SSE4.2 instruction is likely eaten away
> -   by the indirect call in the PLT.  */
> -# define libc_hidden_builtin_def(name) \
> -	.globl __GI_STRCSPN; __GI_STRCSPN = STRCSPN_SSE2
>  #endif
> -
>  #endif /* HAVE_SSE4_SUPPORT */
>  
>  #ifdef USE_AS_STRPBRK
> diff --git a/sysdeps/x86_64/multiarch/strspn.S b/sysdeps/x86_64/multiarch/strspn.S
> index bf7308e..fde1e1e 100644
> --- a/sysdeps/x86_64/multiarch/strspn.S
> +++ b/sysdeps/x86_64/multiarch/strspn.S
> @@ -50,12 +50,6 @@ END(strspn)
>  # undef END
>  # define END(name) \
>  	cfi_endproc; .size __strspn_sse2, .-__strspn_sse2
> -# undef libc_hidden_builtin_def
> -/* It doesn't make sense to send libc-internal strspn calls through a PLT.
> -   The speedup we get from using SSE4.2 instruction is likely eaten away
> -   by the indirect call in the PLT.  */
> -# define libc_hidden_builtin_def(name) \
> -	.globl __GI_strspn; __GI_strspn = __strspn_sse2
>  #endif
>  
>  #endif /* HAVE_SSE4_SUPPORT */

Ondrej Bilka April 5, 2014, 2:48 p.m. UTC | #2

ping

Ondrej Bilka April 12, 2014, 7:24 p.m. UTC | #3

On Sat, Apr 05, 2014 at 04:48:41PM +0200, Ondřej Bílka wrote:
> ping

Ondrej Bilka May 12, 2014, noon UTC | #4

ping

Ondrej Bilka May 23, 2014, 11:23 p.m. UTC | #5

ping

Ondrej Bilka June 4, 2014, 12:47 p.m. UTC | #6

ping

Ondrej Bilka June 24, 2014, 10:41 a.m. UTC | #7

ping

Ondrej Bilka Dec. 10, 2014, 2:39 p.m. UTC | #8

ping
On Tue, Jun 24, 2014 at 12:41:52PM +0200, Ondřej Bílka wrote:
> ping
> On Wed, Jun 04, 2014 at 02:47:54PM +0200, Ondřej Bílka wrote:
> > ping
> > On Sat, May 24, 2014 at 01:23:13AM +0200, Ondřej Bílka wrote:
> > > ping
> > > On Mon, May 12, 2014 at 02:00:11PM +0200, Ondřej Bílka wrote:
> > > > ping
> > > > On Sat, Apr 12, 2014 at 09:24:47PM +0200, Ondřej Bílka wrote:
> > > > > On Sat, Apr 05, 2014 at 04:48:41PM +0200, Ondřej Bílka wrote:
> > > > > > ping
> > > > > > On Thu, Mar 27, 2014 at 10:18:06PM +0100, Ondřej Bílka wrote:
> > > > > > > ping

Ondrej Bilka Feb. 11, 2015, 1:56 p.m. UTC | #9

ping
On Wed, Dec 10, 2014 at 03:39:31PM +0100, Ondřej Bílka wrote:
> ping
> On Tue, Jun 24, 2014 at 12:41:52PM +0200, Ondřej Bílka wrote:
> > ping
> > On Wed, Jun 04, 2014 at 02:47:54PM +0200, Ondřej Bílka wrote:
> > > ping
> > > On Sat, May 24, 2014 at 01:23:13AM +0200, Ondřej Bílka wrote:
> > > > ping
> > > > On Mon, May 12, 2014 at 02:00:11PM +0200, Ondřej Bílka wrote:
> > > > > ping
> > > > > On Sat, Apr 12, 2014 at 09:24:47PM +0200, Ondřej Bílka wrote:
> > > > > > On Sat, Apr 05, 2014 at 04:48:41PM +0200, Ondřej Bílka wrote:
> > > > > > > ping
> > > > > > > On Thu, Mar 27, 2014 at 10:18:06PM +0100, Ondřej Bílka wrote:
> > > > > > > > ping

Mike Frysinger March 6, 2015, 2:03 a.m. UTC | #10

On 18 Mar 2014 11:01, Ondřej Bílka wrote:
> To make strtok faster, and to improve performance in general, we need one
> additional change.
>
> The comment
>
> /* It doesn't make sense to send libc-internal strcspn calls through a PLT.
>    The speedup we get from using SSE4.2 instruction is likely eaten away
>    by the indirect call in the PLT.  */
>
> does not hold up, because nobody bothered to check it. The gap
> between these implementations is quite big: when the haystack is empty, the
> sse2 version is around 40 cycles slower because it needs to populate a lookup
> table, and the difference only increases with size. That is much bigger than
> the plt slowdown, which is a few cycles.
>
> Even the benchtest shows a gap, which may also be reversed by branch
> misprediction, but my internal benchmark shows:
>
>                                         simple_strspn stupid_strspn __strspn_sse42 __strspn_sse2
> Length    0, alignment  0, acc len  6:       18.6562       35.2344        17.0469       61.6719
> Length    6, alignment  0, acc len  6:       59.5469       72.5781        16.4219       73.625
>
> This patch also handles strpbrk, which is implemented by including the
> x86_64/multiarch/strcspn.S file.
> 
> 	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
> 	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.

since H.J. wrote the code, he probably should be the one approving this change
-mike

H.J. Lu March 6, 2015, 1:22 p.m. UTC | #11

On Thu, Mar 5, 2015 at 6:03 PM, Mike Frysinger <vapier@gentoo.org> wrote:
> since H.J. wrote the code, he probably should be the one approving this change
> -mike

Looks good to me.  Please commit.  Sorry for the long delay.

Thanks.
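
The cost argument in the patch description rests on the fixed setup work of a table-driven strspn. The following is a minimal C sketch of that general technique, not glibc's __strspn_sse2 code; the name table_strspn is hypothetical. It fills a 256-entry table from the accept set before it looks at the string at all, so the setup cost is paid even when the string is empty.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration of a table-driven strspn; not the glibc
   implementation.  Returns the length of the initial segment of S that
   consists only of bytes found in ACCEPT.  */
size_t
table_strspn (const char *s, const char *accept)
{
  unsigned char table[256];
  const unsigned char *p;
  size_t n = 0;

  /* Fixed setup cost: clear the table and mark every accepted byte.
     This runs before the first byte of S is examined.  */
  memset (table, 0, sizeof table);
  for (p = (const unsigned char *) accept; *p != '\0'; p++)
    table[*p] = 1;

  /* Scan S.  table[0] is always 0, so the terminating NUL stops the loop.  */
  for (p = (const unsigned char *) s; table[*p]; p++)
    n++;

  return n;
}
```

For an empty string the scan loop exits immediately, so the whole call consists of the table setup. That is the overhead the description weighs against the cost of an indirect call.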

Patch

diff --git a/sysdeps/x86_64/multiarch/strcspn.S b/sysdeps/x86_64/multiarch/strcspn.S
index 24f55e9..1b3e1aa 100644
--- a/sysdeps/x86_64/multiarch/strcspn.S
+++ b/sysdeps/x86_64/multiarch/strcspn.S
@@ -65,14 +65,7 @@  END(STRCSPN)
 # undef END
 # define END(name) \
 	cfi_endproc; .size STRCSPN_SSE2, .-STRCSPN_SSE2
-# undef libc_hidden_builtin_def
-/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
-   The speedup we get from using SSE4.2 instruction is likely eaten away
-   by the indirect call in the PLT.  */
-# define libc_hidden_builtin_def(name) \
-	.globl __GI_STRCSPN; __GI_STRCSPN = STRCSPN_SSE2
 #endif
-
 #endif /* HAVE_SSE4_SUPPORT */
 
 #ifdef USE_AS_STRPBRK
diff --git a/sysdeps/x86_64/multiarch/strspn.S b/sysdeps/x86_64/multiarch/strspn.S
index bf7308e..fde1e1e 100644
--- a/sysdeps/x86_64/multiarch/strspn.S
+++ b/sysdeps/x86_64/multiarch/strspn.S
@@ -50,12 +50,6 @@  END(strspn)
 # undef END
 # define END(name) \
 	cfi_endproc; .size __strspn_sse2, .-__strspn_sse2
-# undef libc_hidden_builtin_def
-/* It doesn't make sense to send libc-internal strspn calls through a PLT.
-   The speedup we get from using SSE4.2 instruction is likely eaten away
-   by the indirect call in the PLT.  */
-# define libc_hidden_builtin_def(name) \
-	.globl __GI_strspn; __GI_strspn = __strspn_sse2
 #endif
 
 #endif /* HAVE_SSE4_SUPPORT */
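
The removed macro bound the internal name __GI_strspn directly to __strspn_sse2. Per the subject line, internal calls use the ifunc-selected implementation once it is gone. The trade-off can be sketched in portable C with a function pointer standing in for the ifunc mechanism. Every name below is hypothetical, and cpu_has_fast_path is a placeholder for a real CPU feature check.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*span_fn) (const char *, const char *);

/* Stand-in for the baseline (SSE2) variant: a simple byte loop.  */
size_t
span_baseline (const char *s, const char *accept)
{
  size_t n = 0;
  while (s[n] != '\0' && strchr (accept, s[n]) != NULL)
    n++;
  return n;
}

/* Stand-in for the faster (SSE4.2) variant; here it just defers to
   the C library's strspn.  */
size_t
span_fast (const char *s, const char *accept)
{
  return strspn (s, accept);
}

/* Placeholder feature probe (assumption): a real resolver would
   inspect CPU feature bits instead of returning a constant.  */
int
cpu_has_fast_path (void)
{
  return 1;
}

/* Resolver: pick one implementation, as an ifunc resolver would.  */
span_fn
select_span (void)
{
  return cpu_has_fast_path () ? span_fast : span_baseline;
}

/* Before the patch: internal callers are bound to the baseline at
   build time.  The call is direct, but never the fast variant.  */
const span_fn internal_span_before = span_baseline;

/* After the patch: internal callers use whatever the resolver chose,
   at the price of one indirect call.  */
span_fn internal_span_after;

void
init_dispatch (void)
{
  internal_span_after = select_span ();
}
```

Both pointers return the same results. They differ only in which implementation runs and in whether the call is indirect.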