Simplify strncat.

Message ID 20141216202438.GA5612@domone
State Committed
Headers

Commit Message

Ondrej Bilka Dec. 16, 2014, 8:24 p.m. UTC
  Hi, as in another thread found warning in default strncat implementation
I looked to code and best way looks to replace loop by
memcpy (dest, src, strnlen (src, n));

Its because code look misoptimized, a loop does byte-by-byte copy
instead copying 4 bytes at once like

for (i = 0; i < n; i += 4)
{
  uint32 srcm = *((uint32_t *)(src + i));
  if (has_zero (srcm)) /* use trick from strlen */
    break;
  *((uint32_t *)(dst + i)) = *srcm;
}

where I for simplicity ommited aligning dest and simulating unaligned
read when architecture does not allow it.

As even generic memcpy and strnlen do these trick code will be lot
faster unless you do most of calls to append 1-4 bytes.

As strncat should not be used for performance sensitive task I decided
to keep that simple. Also on my machine its almost never used. Only
application that I could find is google-chrome which called strnlen 44
times total with following lenghts.

dest src n
56 198 198
254 75 75
329 94 94
423 24 24
56 198 198
254 75 75
329 94 94
423 24 24
56 198 198
254 75 75
329 94 94
423 24 24

Tested on x64 with modified test, OK to commit?

	* string/strncat.c (STRNCAT): Simplify implementation.
  

Comments

Paul Eggert Dec. 16, 2014, 8:50 p.m. UTC | #1
Thanks, this is much better than worrying about how to pacify GCC. The 
code could be made a bit shorter and clearer with mempcpy, and there's 
no longer any need to distinguish between s and s1, so I suggest the 
following minor rewrite, which shrinks the code size by another 26 bytes 
(16%) on my x86-64 platform.

char *
STRNCAT (char *s1, const char *s2, size_t n)
{
   char *s1_end = mempcpy (s1 + strlen (s1), s2, __strnlen (s2, n));
   *s1_end = '\0';
   return s1;
}
  
David Miller Dec. 16, 2014, 8:52 p.m. UTC | #2
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Tue, 16 Dec 2014 12:50:38 -0800

> Thanks, this is much better than worrying about how to pacify GCC. The
> code could be made a bit shorter and clearer with mempcpy, and there's
> no longer any need to distinguish between s and s1, so I suggest the
> following minor rewrite, which shrinks the code size by another 26
> bytes (16%) on my x86-64 platform.

The fact that we are talking about complete rewrites for what should
be a one line warning fix shows that we enabled -Werror prematurely.

This should have been a no-brainer, yet it is going to take a week to
fix something that is breaking the build for people.
  
Joseph Myers Dec. 17, 2014, 12:24 a.m. UTC | #3
On Tue, 16 Dec 2014, David Miller wrote:

> The fact that we are talking about complete rewrites for what should
> be a one line warning fix shows that we enabled -Werror prematurely.

No, it simply shows people getting side-tracked and confusing the 
questions of how to fix the build for an existing warning and how to clean 
up the issue it shows up.  The variety of different configurations in 
which people build glibc fundamentally requires the process of fixing 
warnings to be done in parallel by people seeing those warnings (even 
restricted to x86_64 with GCC 4.9, some people are reporting different 
warnings I don't see in such a configuration, so testing all combinations 
of architectures and compiler versions would not cover the warning space).  
The process of fixing warnings started while the -Werror policy was under 
discussion (there was plenty of opportunity for people to fix warnings 
they saw in advance of -Werror being enabled - if they didn't take that 
opportunity, announcing "-Werror will be enabled at X time in the future" 
would hardly have resulted in much more fixing).

My position remains as in 
<https://sourceware.org/ml/libc-alpha/2014-11/msg00693.html>:

  I think we should be perfectly willing to file a bug in Bugzilla and then 
  add a suppression referencing the bug, if there's something tricky about 
  determining if there's a real glibc bug there or a better way to address 
  the warning - that means the build stays working on more platforms (in the 
  presence of -Werror).  I see no reason why we should assume a possible 
  problem in existing code shown up by a warning (from new GCC or on another 
  architecture) is more serious than other bugs.  -Werror is first about 
  stopping new issues getting in accidentally, rather than forcing certain 
  existing issues (those that cause warnings) to be higher priority than 
  other existing issues.

(If the warning that it's hard to work out the correct fix for arises from 
a recent change to glibc, we should be rather more willing to revert that 
change until the fix has been worked out, although that's not necessarily 
the right thing to do in all cases.)
  
Ondrej Bilka Dec. 17, 2014, 4:32 p.m. UTC | #4
On Tue, Dec 16, 2014 at 12:36:38PM -0800, Roland McGrath wrote:
> Use '\0', not '\000'.
> 
> IIRC we have a general policy about having a benchtests case and citing
> numbers on that.

That is not policy as benchtest results on string function are still
meaningless. Here I could simply improve benchtest score by inlining
strnlen and memcpy implementations, for example with

#define memcpy memcpy2
#include <string/memcpy.c>

As it avoids call overhead. However it would be big mistake to do that
"optimization" as strncat is cold function while memcpy is likely in
memory it degrade performance due to additonal icache misses.
  
Carlos O'Donell Dec. 17, 2014, 4:45 p.m. UTC | #5
On 12/16/2014 07:24 PM, Joseph Myers wrote:
> On Tue, 16 Dec 2014, David Miller wrote:
> 
>> The fact that we are talking about complete rewrites for what should
>> be a one line warning fix shows that we enabled -Werror prematurely.
> 
> No, it simply shows people getting side-tracked and confusing the 
> questions of how to fix the build for an existing warning and how to clean 
> up the issue it shows up.

I agree with Joseph.

Cheers,
Carlos.
  
Adhemerval Zanella Netto Dec. 17, 2014, 6:31 p.m. UTC | #6
On 17-12-2014 14:32, Ondřej Bílka wrote:
> On Tue, Dec 16, 2014 at 12:36:38PM -0800, Roland McGrath wrote:
>> Use '\0', not '\000'.
>>
>> IIRC we have a general policy about having a benchtests case and citing
>> numbers on that.
> That is not policy as benchtest results on string function are still
> meaningless. Here I could simply improve benchtest score by inlining
> strnlen and memcpy implementations, for example with
>
> #define memcpy memcpy2
> #include <string/memcpy.c>
>
> As it avoids call overhead. However it would be big mistake to do that
> "optimization" as strncat is cold function while memcpy is likely in
> memory it degrade performance due to additonal icache misses.
>
This is a fair criticize, but benchtests from string functions are far for
'meanigless'.  On powerpc side, for instance, I noted unaligned cases 
that current code for strcpy was outperforming.  It leads to check more
result using different workloads and profiler and code a better strategy.
Which resulted in a better implementation in the end (patch just posted).

Better would be to characterize a better string benchtest and provide
patches.
  
Ondrej Bilka Dec. 18, 2014, 7:59 p.m. UTC | #7
On Wed, Dec 17, 2014 at 04:31:25PM -0200, Adhemerval Zanella wrote:
> On 17-12-2014 14:32, Ondřej Bílka wrote:
> > On Tue, Dec 16, 2014 at 12:36:38PM -0800, Roland McGrath wrote:
> >> Use '\0', not '\000'.
> >>
> >> IIRC we have a general policy about having a benchtests case and citing
> >> numbers on that.
> > That is not policy as benchtest results on string function are still
> > meaningless. Here I could simply improve benchtest score by inlining
> > strnlen and memcpy implementations, for example with
> >
> > #define memcpy memcpy2
> > #include <string/memcpy.c>
> >
> > As it avoids call overhead. However it would be big mistake to do that
> > "optimization" as strncat is cold function while memcpy is likely in
> > memory it degrade performance due to additonal icache misses.
> >
> This is a fair criticize, but benchtests from string functions are far for
> 'meanigless'.  On powerpc side, for instance, I noted unaligned cases 
> that current code for strcpy was outperforming.  It leads to check more
> result using different workloads and profiler and code a better strategy.
> Which resulted in a better implementation in the end (patch just posted).
> 
Which is quite dangerous, if benchmark is unreliable you cannot be sure
if that improvement is real or will turn into regression. That could now
easily be case as function in benchtest predicts all branches perfectly
which makes path fast in benchtest does not happen in reality. I will post 
several changes and we will see truth (I did not read patch so I cannot say 
what will happen.)
  
Adhemerval Zanella Netto Dec. 18, 2014, 11:39 p.m. UTC | #8
On 18-12-2014 17:59, Ondřej Bílka wrote:
> On Wed, Dec 17, 2014 at 04:31:25PM -0200, Adhemerval Zanella wrote:
>> On 17-12-2014 14:32, Ondřej Bílka wrote:
>>> On Tue, Dec 16, 2014 at 12:36:38PM -0800, Roland McGrath wrote:
>>>> Use '\0', not '\000'.
>>>>
>>>> IIRC we have a general policy about having a benchtests case and citing
>>>> numbers on that.
>>> That is not policy as benchtest results on string function are still
>>> meaningless. Here I could simply improve benchtest score by inlining
>>> strnlen and memcpy implementations, for example with
>>>
>>> #define memcpy memcpy2
>>> #include <string/memcpy.c>
>>>
>>> As it avoids call overhead. However it would be big mistake to do that
>>> "optimization" as strncat is cold function while memcpy is likely in
>>> memory it degrade performance due to additonal icache misses.
>>>
>> This is a fair criticize, but benchtests from string functions are far for
>> 'meanigless'.  On powerpc side, for instance, I noted unaligned cases 
>> that current code for strcpy was outperforming.  It leads to check more
>> result using different workloads and profiler and code a better strategy.
>> Which resulted in a better implementation in the end (patch just posted).
>>
> Which is quite dangerous, if benchmark is unreliable you cannot be sure
> if that improvement is real or will turn into regression. That could now
> easily be case as function in benchtest predicts all branches perfectly
> which makes path fast in benchtest does not happen in reality. I will post 
> several changes and we will see truth (I did not read patch so I cannot say 
> what will happen.)
>
And that is *exactly* why we should push for more patches and a better coverage 
in benchmarks.  I do see your criticizes as positive, what I do not see is just
bashing someone that use the benchmarks it without providing anything better
instead.

And do not get me wrong here, but I checked sometime ago your personal page
and your benchmark work and I lost interested because it was *way* confusing 
and without much organization.  And that is exactly what I am trying to avoid 
here: comments in separate threads with meaningful comments, but lost without 
proper organization. 

For instance, I noticed you used x86_64 specific instructions to flush the
cache and other things to try mimic a real world usage. I think is a good idea
try to came up with a benchmark that may use some arch-specific, but we will
need some discussion on how to characterize it, how to code for multiple archs,
etc.
  

Patch

diff --git a/string/strncat.c b/string/strncat.c
index 6d29114..e130a1f 100644
--- a/string/strncat.c
+++ b/string/strncat.c
@@ -29,52 +29,15 @@ 
 char *
 STRNCAT (char *s1, const char *s2, size_t n)
 {
-  char c;
   char *s = s1;
 
   /* Find the end of S1.  */
   s1 += strlen (s1);
 
-  /* Make S1 point before next character, so we can increment
-     it while memory is read (wins on pipelined cpus).  */
-  s1 -= 1;
+  size_t ss = __strnlen (s2, n);
 
-  if (n >= 4)
-    {
-      size_t n4 = n >> 2;
-      do
-	{
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    return s;
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    return s;
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    return s;
-	  c = *s2++;
-	  *++s1 = c;
-	  if (c == '\0')
-	    return s;
-	} while (--n4 > 0);
-      n &= 3;
-    }
-
-  while (n > 0)
-    {
-      c = *s2++;
-      *++s1 = c;
-      if (c == '\0')
-	return s;
-      n--;
-    }
-
-  if (c != '\0')
-    *++s1 = '\0';
+  memcpy (s1, s2, ss);
+  s1[ss] = '\000';
 
   return s;
 }