Improve performance of strcat

Message ID 000101cfb243$63a5b1b0$2af11510$@com
State Committed
Headers

Commit Message

Wilco Dijkstra Aug. 7, 2014, 1:27 p.m. UTC
  Hi,

This patch improves strcat performance by using strlen and strcpy. Strlen has a fast C
implementation, so this improves performance even on targets which don't have an optimized strlen
and strcpy - it is 25% faster in bench-strcat. On targets which don't provide an optimized strcat
but which do have an optimized strlen and strcpy, performance gain is > 2x.

OK for commit?

ChangeLog:
2014-08-07  Wilco Dijkstra <wdijkstr@arm.com>

	* string/strcat.c (strcat): Improve performance by using strlen/strcpy.
---
 string/strcat.c |   21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)
  

Comments

Adhemerval Zanella Netto Aug. 7, 2014, 1:31 p.m. UTC | #1
On 07-08-2014 10:27, Wilco Dijkstra wrote:
> Hi,
>
> This patch improves strcat performance by using strlen and strcpy. Strlen has a fast C
> implementation, so this improves performance even on targets which don't have an optimized strlen
> and strcpy - it is 25% faster in bench-strcat. On targets which don't provide an optimized strcat
> but which do have an optimized strlen and strcpy, performance gain is > 2x.
>
> OK for commit?
>
> ChangeLog:
> 2014-08-07  Wilco Dijkstra <wdijkstr@arm.com>
>
> 	* string/strcat.c (strcat): Improve performance by using strlen/strcpy.
> ---
>  string/strcat.c |   21 +--------------------
>  1 file changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/string/strcat.c b/string/strcat.c
>
> -  do
> -    {
> -      c = *s2++;
> -      *++s1 = c;
> -    }
> -  while (c != '\0');
> -
> +  strcpy (dest + strlen (dest), src);

Should it be __strcpy/__strlen ?
  
Joseph Myers Aug. 7, 2014, 1:57 p.m. UTC | #2
On Thu, 7 Aug 2014, Adhemerval Zanella wrote:

> > +  strcpy (dest + strlen (dest), src);
> 
> Should it be __strcpy/__strlen ?

As explained in recent discussion, there is no need for use of __* when 
calling functions in ISO C90 that haven't been removed in more recent 
standards (or more generally, when calling function A from function B if 
function A is in all the supported standards containing function B).  You 
do need *_hidden_* for PLT avoidance, but include/string.h already has 
libc_hidden_builtin_proto calls for strcpy and strlen (and if any 
definition of those functions is missing the corresponding 
libc_hidden_builtin_def, there will be an obvious error linking glibc).
  
Florian Weimer Sept. 10, 2014, 5:24 p.m. UTC | #3
On 08/07/2014 03:27 PM, Wilco Dijkstra wrote:

> This patch improves strcat performance by using strlen and strcpy. Strlen has a fast C
> implementation, so this improves performance even on targets which don't have an optimized strlen
> and strcpy - it is 25% faster in bench-strcat. On targets which don't provide an optimized strcat
> but which do have an optimized strlen and strcpy, performance gain is > 2x.
>
> OK for commit?

This is okay for master.  Thanks.
  
Carlos O'Donell Sept. 11, 2014, 7:42 p.m. UTC | #4
On 08/07/2014 09:27 AM, Wilco Dijkstra wrote:
> Hi,
> 
> This patch improves strcat performance by using strlen and strcpy. Strlen has a fast C
> implementation, so this improves performance even on targets which don't have an optimized strlen
> and strcpy - it is 25% faster in bench-strcat. On targets which don't provide an optimized strcat
> but which do have an optimized strlen and strcpy, performance gain is > 2x.

What benchmarks did you use to test this performance gain?

Did you use glibc's microbenchmark?

What numbers did you get?

c.
  
Ondrej Bilka Sept. 12, 2014, 5:19 a.m. UTC | #5
On Thu, Sep 11, 2014 at 03:42:31PM -0400, Carlos O'Donell wrote:
> On 08/07/2014 09:27 AM, Wilco Dijkstra wrote:
> > Hi,
> > 
> > This patch improves strcat performance by using strlen and strcpy. Strlen has a fast C
> > implementation, so this improves performance even on targets which don't have an optimized strlen
> > and strcpy - it is 25% faster in bench-strcat. On targets which don't provide an optimized strcat
> > but which do have an optimized strlen and strcpy, performance gain is > 2x.
> 
> What benchmarks did you use to test this performance gain?
> 
> Did you use glibc's microbenchmark?
> 
> What numbers did you get?
> 
Ah same patch that I send like year ago, your question is answered in
quoted email.

Also trying to benchmark C implementations is mostly meaningless, if you
want good performance you need write at least assembly implementation of memcpy,
strlen and strcpy. Optimizing one of these has bigger performance impact
than rest of functions combined and if you do not care about that you do
not have to care about this c implementation as well.

Also in case of strcat its easy to see that no matter how well you
optimize it you cannot get faster than strcpy (dest+strlen (dest), src);
minus constant overhead of like one function call. Reason is that you
could use strcat both as 

strcpy (char *x, char *y) 
{ 
  x[0] = 0;
  return strcat (x, y);
}

and

strlen (char *x)
{
  char y[] = "x";
  return strcat2 (x, y); // where we in assembly replace each write instruction jump that calculates end of string.
}

Now if you got input where strcat is faster than strcpy+strlen pair it
would give you faster strlen or strcpy by using formulas above.
  
Wilco Dijkstra Sept. 12, 2014, 11:14 a.m. UTC | #6
> Carlos O'Donell wrote:
> On 08/07/2014 09:27 AM, Wilco Dijkstra wrote:
> > Hi,
> >
> > This patch improves strcat performance by using strlen and strcpy. Strlen has a fast C
> > implementation, so this improves performance even on targets which don't have an optimized
> strlen
> > and strcpy - it is 25% faster in bench-strcat. On targets which don't provide an optimized
> strcat
> > but which do have an optimized strlen and strcpy, performance gain is > 2x.
> 
> What benchmarks did you use to test this performance gain?
> 
> Did you use glibc's microbenchmark?
> 
> What numbers did you get?

These results are for the GLIBC benchtests/bench-strcat.c - I increased the iterations
significantly and profiled the results with a high tickrate to verify the timings.

 65.74%       11343  bench-strcat  bench-strcat       [.] simple_strcat
 24.90%        4307  bench-strcat  libc-2.20.90.so    [.] strcpy
  5.20%         902  bench-strcat  libc-2.20.90.so    [.] strlen
  1.22%         214  bench-strcat  bench-strcat       [.] do_test
  1.08%         190  bench-strcat  libc-2.20.90.so    [.] strcat

So strcat+strlen+strcpy=31.18% vs simple_strcat 65.74%, ie. 2.1x speedup.

Wilco
  
Wilco Dijkstra Sept. 12, 2014, 11:51 a.m. UTC | #7
> Ondřej Bílka wrote: 
> Also trying to benchmark C implementations is mostly meaningless, if you
> want good performance you need write at least assembly implementation of memcpy,
> strlen and strcpy. Optimizing one of these has bigger performance impact
> than rest of functions combined and if you do not care about that you do
> not have to care about this c implementation as well.

The issue is that you're forced to write lots of highly optimized assembly 
code in order to get good string performance for any new architecture.

This issue exists across GLIBC, for example the math functions used to be
extremely inefficient due to a bad generic implementation. With a simple
architecture independent patch I achieved 99% of the performance of the  
best optimized implementation. Rather than forcing all targets to add
large amounts of highly optimized assembler, why not put a little more 
effort into GLIBC generic code to ensure it is efficient to start with?

So the goal here is to provide good string performance using just C routines
and ensure they benefit further from assembler implementations of strlen, 
memcpy, memset etc.

> Also in case of strcat its easy to see that no matter how well you
> optimize it you cannot get faster than strcpy (dest+strlen (dest), src);
> minus constant overhead of like one function call.

Exactly, there is no need for a specialized assembler variant for strcat.

Wilco
  

Patch

diff --git a/string/strcat.c b/string/strcat.c
index 2cbe8b3..983d115 100644
--- a/string/strcat.c
+++ b/string/strcat.c
@@ -23,26 +23,7 @@ 
 char *
 strcat (char *dest, const char *src)
 {
-  char *s1 = dest;
-  const char *s2 = src;
-  char c;
-
-  /* Find the end of the string.  */
-  do
-    c = *s1++;
-  while (c != '\0');
-
-  /* Make S1 point before the next character, so we can increment
-     it while memory is read (wins on pipelined cpus).  */
-  s1 -= 2;
-
-  do
-    {
-      c = *s2++;
-      *++s1 = c;
-    }
-  while (c != '\0');
-
+  strcpy (dest + strlen (dest), src);
   return dest;
 }
 libc_hidden_builtin_def (strcat)