Improve strncpy performance further

Message ID 001801d02b72$6ce0c3c0$46a24b40$@com
State Committed
Headers

Commit Message

Wilco Dijkstra Jan. 8, 2015, 6:39 p.m. UTC
  This patch improves strncpy performance by using strnlen/memcpy rather than a byte loop. Performance
on bench-strncpy is 1.9-2.1x faster on average. I tried several variations, and using a tailcall and
calling memset conditionally gave the best overall results.

I'm planning to post a similar patch for stpncpy as well.

ChangeLog:
2015-01-08  Wilco Dijkstra  wdijkstr@arm.com

	* string/strncpy.c (strncpy): Improve performance using strnlen/memcpy.

---
 string/strncpy.c | 64 ++++++--------------------------------------------------
 1 file changed, 6 insertions(+), 58 deletions(-)
  

Comments

Roland McGrath Jan. 8, 2015, 6:58 p.m. UTC | #1
Has to be __strnlen.
  
Wilco Dijkstra Jan. 9, 2015, 1:07 p.m. UTC | #2
> Roland McGrath:
> Has to be __strnlen.

Good point, so it becomes:

-  return s;
+  size_t size = __strnlen (s2, n);
+  if (size != n)
+    memset (s1 + size, '\0', n - size);
+  return memcpy (s1, s2, size);

Btw while on the subject on namespaces, is bcopy correctly defined?
After a lot of complex inclusion and defines, it ultimately does:

void
bcopy ()
{
...
}

Wilco
  
Roland McGrath Jan. 9, 2015, 7:16 p.m. UTC | #3
> Btw while on the subject on namespaces, is bcopy correctly defined?
> After a lot of complex inclusion and defines, it ultimately does:
> 
> void
> bcopy ()
> {
> ...
> }

I'm not sure what kind of correctness you have in mind.  bcopy is the only
name under which that function needs to be defined.  If it's in the same
file as a standard function (presumably memmove) then it needs to be
defined as weak.  No other code anywhere in libc should call bcopy (we'd
use memmove or memcpy instead), so it doesn't need to have an __ name.
  
Wilco Dijkstra Jan. 11, 2015, 6:09 p.m. UTC | #4
> Roland McGrath wrote:
> > Btw while on the subject on namespaces, is bcopy correctly defined?
> > After a lot of complex inclusion and defines, it ultimately does:
> >
> > void
> > bcopy ()
> > {
> > ...
> > }
> 
> I'm not sure what kind of correctness you have in mind.  bcopy is the only
> name under which that function needs to be defined.  If it's in the same
> file as a standard function (presumably memmove) then it needs to be
> defined as weak.  No other code anywhere in libc should call bcopy (we'd
> use memmove or memcpy instead), so it doesn't need to have an __ name.

Well, bcopy and bzero are both used in several non-test sources in GLIBC, 
but one is always defined as __bzero, and the other as bcopy.

It's unclear to me what the exact namespace rules are (or whether there is 
even a concise description of them somewhere), however it is obvious that 
this is not consistent. Note any rule that relies on whether a function is
called or not from within the same library is risky unless you have 
automated checks to catch that.

Wilco
  
Wilco Dijkstra Feb. 27, 2015, 3:08 p.m. UTC | #5
> > Roland McGrath:
> > Has to be __strnlen.
> 
> Good point, so it becomes:
> 
> -  return s;
> +  size_t size = __strnlen (s2, n);
> +  if (size != n)
> +    memset (s1 + size, '\0', n - size);
> +  return memcpy (s1, s2, size);

OK for trunk with __strnlen?

Wilco
  
Wilco Dijkstra April 15, 2015, 12:33 p.m. UTC | #6
ping

> Wilco Dijkstra wrote:
> > > Roland McGrath:
> > > Has to be __strnlen.
> >
> > Good point, so it becomes:
> >
> > -  return s;
> > +  size_t size = __strnlen (s2, n);
> > +  if (size != n)
> > +    memset (s1 + size, '\0', n - size);
> > +  return memcpy (s1, s2, size);
> 
> OK for trunk with __strnlen?
> 
> Wilco
  
Roland McGrath April 15, 2015, 6:27 p.m. UTC | #7
> > > Roland McGrath:
> > > Has to be __strnlen.
> > 
> > Good point, so it becomes:
> > 
> > -  return s;
> > +  size_t size = __strnlen (s2, n);
> > +  if (size != n)
> > +    memset (s1 + size, '\0', n - size);
> > +  return memcpy (s1, s2, size);
> 
> OK for trunk with __strnlen?

Style and whatnot is fine now.  I have no opinion one way or the other
about this being an appropriate performance change.
  
Ondrej Bilka May 13, 2015, 7:02 p.m. UTC | #8
On Wed, Apr 15, 2015 at 01:33:22PM +0100, Wilco Dijkstra wrote:
> ping
> 
> > Wilco Dijkstra wrote:
> > > > Roland McGrath:
> > > > Has to be __strnlen.
> > >
> > > Good point, so it becomes:
> > >
> > > -  return s;
> > > +  size_t size = __strnlen (s2, n);
> > > +  if (size != n)
> > > +    memset (s1 + size, '\0', n - size);
> > > +  return memcpy (s1, s2, size);
> > 
> > OK for trunk with __strnlen?
> > 
> > Wilco

Ok for me.
  
Ondrej Bilka June 16, 2015, 11:11 a.m. UTC | #9
On Wed, May 13, 2015 at 09:02:28PM +0200, Ondřej Bílka wrote:
> On Wed, Apr 15, 2015 at 01:33:22PM +0100, Wilco Dijkstra wrote:
> > ping
> > 
> > > Wilco Dijkstra wrote:
> > > > > Roland McGrath:
> > > > > Has to be __strnlen.
> > > >
> > > > Good point, so it becomes:
> > > >
> > > > -  return s;
> > > > +  size_t size = __strnlen (s2, n);
> > > > +  if (size != n)
> > > > +    memset (s1 + size, '\0', n - size);
> > > > +  return memcpy (s1, s2, size);
> > > 
> > > OK for trunk with __strnlen?
> > > 
> > > Wilco
> 
> Ok for me.

I would like this in. I plan to drop strncpy/stpcpy assembly on x64
and for it I need good generic implementation.

I replaced stupid_strncpy in benchtest by yours and ran it on x64.

As memcpy is ten times faster than current test for big size its definite improvement. For smaller sizes benchtest show only call overhead. However they miss icache pressure overhead as strncpy is dead cold function and should be optimized for size. Also randomizing alignment and size will completely change these numbers which are simply too inaccurate for small n.

                            	simple_strncpy	stupid_strncpy	__strncpy_ssse3	__strncpy_sse2_unaligned	__strncpy_sse2
Length   16, n   16, alignment  1/ 1:	49.7344	45.875	25.5938	14.4688	30.2344
Length   16, n   16, alignment  1/ 1:	48.8125	42.875	24.625	13.0625	27.2188
Length   16, n   16, alignment  1/ 2:	49.7188	42.5156	24.2188	12.7812	37.5938
Length   16, n   16, alignment  2/ 1:	48.3906	42.5625	24.6406	12.7812	26.6094
Length    2, n    4, alignment  7/ 2:	18.8906	42.4062	21.9219	23.2188	22.6562
Length    4, n    2, alignment  2/ 7:	14.25	43.1094	11.6719	15.8594	16.2188
Length    2, n    4, alignment  7/ 2:	18.6094	42.1406	21.6406	23.5312	22.5625
Length    4, n    2, alignment  2/ 7:	13.4219	43.1094	11.4375	14.75	15.1094
Length   16, n   16, alignment  2/ 2:	48.25	43.2969	24.8594	14.1094	28.6719
Length   16, n   16, alignment  2/ 2:	48.1562	42.5938	24.4062	12.7812	26.7812
Length   16, n   16, alignment  2/ 4:	48.4844	42.5625	23.4375	12.8594	34.6406
Length   16, n   16, alignment  4/ 2:	48.1094	42.5156	23.9531	12.7812	26.0156
Length    4, n    8, alignment  6/ 4:	28.5469	44.2969	27.5312	23.25	24.2188
Length    8, n    4, alignment  4/ 6:	18.9688	44.5312	15.0781	15.3906	16.4062
Length    4, n    8, alignment  6/ 4:	27.9844	44.9844	25.5625	23.25	22.75
Length    8, n    4, alignment  4/ 6:	18.5625	44.25	14.75	14.7969	15.4531
Length   16, n   16, alignment  3/ 3:	49.1719	42.5625	24.625	13.1875	27.9844
Length   16, n   16, alignment  3/ 3:	48.2969	42.5156	24.4531	12.9062	26.2969
Length   16, n   16, alignment  3/ 6:	48.4375	42.5625	23.1562	12.8594	34
Length   16, n   16, alignment  6/ 3:	48.1562	42.5625	23.8438	12.8594	25.9688
Length    8, n   16, alignment  5/ 6:	46.1406	43.0156	25.6406	22.9844	36.7188
Length   16, n    8, alignment  6/ 5:	29.3125	44.0156	19.5312	14.4375	20.1719
Length    8, n   16, alignment  5/ 6:	45.7812	42.6562	25.5938	23.5781	33.875
Length   16, n    8, alignment  6/ 5:	28.625	43.5625	18.1094	13.0469	18.75
Length   16, n   16, alignment  4/ 4:	49.6719	42.6406	24.9062	13.2344	27.2969
Length   16, n   16, alignment  4/ 4:	48.2969	42.5156	24.2188	12.8594	26.1562
Length   16, n   16, alignment  4/ 0:	48.2031	42.5156	23.4844	12.7812	25.9531
Length   16, n   16, alignment  0/ 4:	48.3906	42.6875	24.5781	12.8594	34.4219
Length   16, n   32, alignment  4/ 0:	79.6875	41.4062	44.4844	23.3906	33.5
Length   32, n   16, alignment  0/ 4:	48.3438	43.4844	23.2031	13.4219	34.3281
Length   16, n   32, alignment  4/ 0:	77.1094	41.125	41.9688	23.2656	32.9062
Length   32, n   16, alignment  0/ 4:	48.3906	42.7344	22.7969	12.8594	33.5469
Length   16, n   16, alignment  5/ 5:	48.125	42.5156	24.2188	12.8594	26.6094
Length   16, n   16, alignment  5/ 5:	48.125	42.5938	24.2188	12.9062	26.5625
Length   16, n   16, alignment  5/ 2:	48.1719	42.6406	23.625	12.7812	26.0469
Length   16, n   16, alignment  2/ 5:	48.3906	42.6406	23.2969	12.8594	36.5312
Length   32, n   64, alignment  3/ 2:	160.766	44.3906	55.4219	30.6094	53.7656
Length   64, n   32, alignment  2/ 3:	126.375	39.8906	52.1094	15.9375	69.5312
Length   32, n   64, alignment  3/ 2:	159.422	40.7656	53.3594	28.8594	52.2969
Length   64, n   32, alignment  2/ 3:	125.453	38.875	51.1562	14.9375	69.0781
Length   16, n   16, alignment  6/ 6:	48.2031	42.6406	24.9062	13.6875	27.7656
Length   16, n   16, alignment  6/ 6:	48.25	42.5156	24.0938	12.7812	26.2969
Length   16, n   16, alignment  6/ 4:	48.2031	42.5156	24.0312	12.875	26.2812
Length   16, n   16, alignment  4/ 6:	48.4844	42.6406	23.625	12.8594	34.7031
Length   64, n  128, alignment  2/ 4:	287.5	54.1875	65.2969	33	129.094
Length  128, n   64, alignment  4/ 2:	159.062	49.4062	50.5938	22.0625	87.4062
Length   64, n  128, alignment  2/ 4:	288.047	51.9688	63.875	31.8438	130.25
Length  128, n   64, alignment  4/ 2:	158.781	48.9844	48.0781	21.0938	84.4219
Length   16, n   16, alignment  7/ 7:	49.125	42.5156	24.3906	13.7969	27.25
Length   16, n   16, alignment  7/ 7:	48.2031	42.5156	24.1719	12.7812	26.1406
Length   16, n   16, alignment  7/ 6:	48.2969	42.5156	24.8125	12.7812	26.2969
Length   16, n   16, alignment  6/ 7:	48.4375	42.7344	24.1719	12.8281	38.8906
Length  128, n  256, alignment  1/ 6:	502.25	74.4375	83.9531	52.4375	260.25
Length  256, n  128, alignment  6/ 1:	265.078	54.1406	80.2344	32.625	165.906
Length  128, n  256, alignment  1/ 6:	501.938	72.875	79.9688	51.7969	264.656
Length  256, n  128, alignment  6/ 1:	267.922	54.125	70.5781	32.5938	162.734
Length    8, n   16, alignment  0/ 0:	44.625	41.2656	26.7031	24.3906	27.2031
Length   32, n   16, alignment  0/ 0:	48.8906	42.5625	24.5469	13.5938	27.3438
Length    8, n   16, alignment  7/ 2:	43.3438	45.2656	25.1875	23.0312	25.7344
Length   32, n   16, alignment  7/ 2:	48.1719	42.5156	23.4375	12.7812	26.0625
Length   16, n   32, alignment  0/ 0:	78.3125	40.9062	42.2344	23.9375	36.2969
Length   64, n   32, alignment  0/ 0:	101.609	40.0312	31.4219	15.125	47.9688
Length   16, n   32, alignment  6/ 4:	76.75	41.1875	46.7812	23.1094	32.9531
Length   64, n   32, alignment  6/ 4:	101.562	38.7812	43.5625	14.3906	45.5938
Length   32, n   64, alignment  0/ 0:	159.156	38.5625	46.2344	31.25	57.125
Length  128, n   64, alignment  0/ 0:	158.5	46.6406	38.9688	19.8125	120.125
Length   32, n   64, alignment  5/ 6:	169.625	42.1875	66.7656	28.5938	77.5312
Length  128, n   64, alignment  5/ 6:	192.688	50.2656	55.9688	22.4688	158.688
Length   64, n  128, alignment  0/ 0:	257.266	49.8125	49.5938	31.8438	93.8906
Length  256, n  128, alignment  0/ 0:	266.453	51.7969	45.6875	25.5938	186.031
Length   64, n  128, alignment  4/ 0:	256.656	49.3594	61.0312	29.8281	90.8125
Length  256, n  128, alignment  4/ 0:	267.875	51.8906	68.2031	31.7188	183.781
Length  128, n  256, alignment  0/ 0:	417.922	60.625	63.8281	41.8594	198.484
Length  512, n  256, alignment  0/ 0:	480.891	70.2656	60.0156	38.875	343.984
Length  128, n  256, alignment  3/ 2:	417.688	71.5938	84.9688	42.375	197.516
Length  512, n  256, alignment  3/ 2:	480.422	74.2656	87.7344	43.5156	347.203
Length  256, n  512, alignment  0/ 0:	741.547	81.0625	76.1562	57.0312	363.656
Length 1024, n  512, alignment  0/ 0:	907.672	94.1562	80.1094	58.5469	661.672
Length  256, n  512, alignment  2/ 4:	899.453	166.906	102.953	58.3125	506.016
Length 1024, n  512, alignment  2/ 4:	1242.33	110.156	228.859	63.9688	943.797
Length  512, n 1024, alignment  0/ 0:	1380.11	116.5	103.219	94.2031	687.219
Length 2048, n 1024, alignment  0/ 0:	1762.77	147.984	120.953	101.422	1294.89
Length  512, n 1024, alignment  1/ 6:	1762.95	153.953	141.922	90.4375	945.688
Length 2048, n 1024, alignment  1/ 6:	2537.5	194.906	179.781	105.922	1797.06
  
Wilco Dijkstra July 6, 2015, 11:28 a.m. UTC | #10
> Wilco Dijkstra wrote:
> ping
> 
> > Wilco Dijkstra wrote:
> > > > Roland McGrath:
> > > > Has to be __strnlen.
> > >
> > > Good point, so it becomes:
> > >
> > > -  return s;
> > > +  size_t size = __strnlen (s2, n);
> > > +  if (size != n)
> > > +    memset (s1 + size, '\0', n - size);
> > > +  return memcpy (s1, s2, size);
> >
> > OK for trunk with __strnlen?
> >
> > Wilco
  

Patch

diff --git a/string/strncpy.c b/string/strncpy.c
index 0915e03..9c516d6 100644
--- a/string/strncpy.c
+++ b/string/strncpy.c
@@ -15,73 +15,21 @@ 
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
 
+#include <stddef.h>
 #include <string.h>
-#include <memcopy.h>
 
 #undef strncpy
 
 #ifndef STRNCPY
-#define STRNCPY strncpy
+# define STRNCPY strncpy
 #endif
 
 char *
 STRNCPY (char *s1, const char *s2, size_t n)
 {
-  char c;
-  char *s = s1;
-
-  --s1;
-
-  if (n >= 4)
-    {
-      size_t n4 = n >> 2;
-
-      for (;;)
-	{
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    break;
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    break;
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    break;
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    break;
-	  if (--n4 == 0)
-	    goto last_chars;
-	}
-      n = n - (s1 - s) - 1;
-      if (n == 0)
-	return s;
-      goto zero_fill;
-    }
-
- last_chars:
-  n &= 3;
-  if (n == 0)
-    return s;
-
-  do
-    {
-      c = *s2++;
-      *++s1 = c;
-      if (--n == 0)
-	return s;
-    }
-  while (c != '\0');
-
- zero_fill:
-  do
-    *++s1 = '\0';
-  while (--n > 0);
-
-  return s;
+  size_t size = strnlen (s2, n);
+  if (size != n)
+    memset (s1 + size, '\0', n - size);
+  return memcpy (s1, s2, size);
 }
 libc_hidden_builtin_def (strncpy)