Inline mempcpy

Message ID 20150513192819.GA1170@domone
State New, archived
Headers

Commit Message

Ondrej Bilka May 13, 2015, 7:28 p.m. UTC
  Hi,
As pointed out that following patch should be generic
http://patchwork.sourceware.org/patch/6459/
here is sample patch that does it. Header for mempcpy now becomes
following:

#ifdef __USE_GNU
# if !defined _HAVE_STRING_ARCH_mempcpy || defined _FORCE_INLINES

#define __mempcpy(dest, src, n) __mempcpy_inline(dest, src, n)

__STRING_INLINE void *
__mempcpy_inline (void *__restrict __dest,
                  const void *__restrict __src, size_t __n)
{
  return memcpy (__dest, __src, __n) + __n;
}

# endif
#endif

Patch itself is messy as it also removes obsolete inlining for gcc-3.4
and older. Ok to clean that up or should I send separate patch to remove
all obsolete inlines from string2.h. These would also cause regression
as implementations improved a lot and inlines there use only 32bit
access without using 64bit capabilities.

Comments?

	* string/bits/string2.h: Use inline mempcpy.
  

Comments

Joseph Myers May 13, 2015, 10:21 p.m. UTC | #1
On Wed, 13 May 2015, Ondřej Bílka wrote:

> Hi,
> As pointed out that following patch should be generic
> http://patchwork.sourceware.org/patch/6459/
> here is sample patch that does it. Header for mempcpy now becomes
> following:
> 
> #ifdef __USE_GNU
> # if !defined _HAVE_STRING_ARCH_mempcpy || defined _FORCE_INLINES

Doesn't this change the semantics of _HAVE_STRING_ARCH_mempcpy?  That is, 
after this patch, architectures with efficient .S implementations of 
mempcpy (as opposed to ones that just use the default .c implementation) 
should define that macro rather than just those with inline 
implementations.  So the patch needs to update architectures as well.

> Patch itself is messy as it also removes obsolete inlining for gcc-3.4
> and older. Ok to clean that up or should I send separate patch to remove
> all obsolete inlines from string2.h. These would also cause regression
> as implementations improved a lot and inlines there use only 32bit
> access without using 64bit capabilities.

Did the discussion involving 
<https://sourceware.org/ml/libc-alpha/2013-01/msg00157.html> and 
<https://sourceware.org/ml/libc-alpha/2013-01/msg00270.html> reach any 
wiki-documented consensus regarding what compiler versions it's worth 
having any optimizations for in the headers?
  
Ondrej Bilka May 14, 2015, 7:21 a.m. UTC | #2
On Wed, May 13, 2015 at 10:21:25PM +0000, Joseph Myers wrote:
> On Wed, 13 May 2015, Ondřej Bílka wrote:
> 
> > Hi,
> > As pointed out that following patch should be generic
> > http://patchwork.sourceware.org/patch/6459/
> > here is sample patch that does it. Header for mempcpy now becomes
> > following:
> > 
> > #ifdef __USE_GNU
> > # if !defined _HAVE_STRING_ARCH_mempcpy || defined _FORCE_INLINES
> 
> Doesn't this change the semantics of _HAVE_STRING_ARCH_mempcpy?  That is, 
> after this patch, architectures with efficient .S implementations of 
> mempcpy (as opposed to ones that just use the default .c implementation) 
> should define that macro rather than just those with inline 
> implementations.  So the patch needs to update architectures as well.
>
Read the friendly thread. In short you should always expand mempcpy to
memcpy and keep mempcpy implementation just for legacy programs.

Possible benefits are small, few cycles to save register spill at most.
It can't be better as you can implement memcpy with mempcpy.

But there is huge cost involved. Effective mempcpy is often larger than
one kilobyte. Duplicating that kilobyte for mempcpy would increase
instruction cache pressure a lot. Easily it could cause one cache miss
per call resulting additional 30 cycle penalty negating any benefit you
try to make.
  
Ondrej Bilka May 14, 2015, 9:01 a.m. UTC | #3
On Wed, May 13, 2015 at 10:21:25PM +0000, Joseph Myers wrote:
> > Patch itself is messy as it also removes obsolete inlining for gcc-3.4
> > and older. Ok to clean that up or should I send separate patch to remove
> > all obsolete inlines from string2.h. These would also cause regression
> > as implementations improved a lot and inlines there use only 32bit
> > access without using 64bit capabilities.
> 
> Did the discussion involving 
> <https://sourceware.org/ml/libc-alpha/2013-01/msg00157.html> and 
> <https://sourceware.org/ml/libc-alpha/2013-01/msg00270.html> reach any 
> wiki-documented consensus regarding what compiler versions it's worth 
> having any optimizations for in the headers?
> 
These aren't too related to this problem, there is barely any discussion
about performance, its mostly about support.

I would use rule that if user doesn't care about performance by using
obsolete gcc that generates slower code why should we care?

Also as I said before removing many of these would improve performance
as they are obsolete and libcall is faster.
  
Joseph Myers May 14, 2015, 4:40 p.m. UTC | #4
On Thu, 14 May 2015, Ondřej Bílka wrote:

> On Wed, May 13, 2015 at 10:21:25PM +0000, Joseph Myers wrote:
> > On Wed, 13 May 2015, Ondřej Bílka wrote:
> > 
> > > Hi,
> > > As pointed out that following patch should be generic
> > > http://patchwork.sourceware.org/patch/6459/
> > > here is sample patch that does it. Header for mempcpy now becomes
> > > following:
> > > 
> > > #ifdef __USE_GNU
> > > # if !defined _HAVE_STRING_ARCH_mempcpy || defined _FORCE_INLINES
> > 
> > Doesn't this change the semantics of _HAVE_STRING_ARCH_mempcpy?  That is, 
> > after this patch, architectures with efficient .S implementations of 
> > mempcpy (as opposed to ones that just use the default .c implementation) 
> > should define that macro rather than just those with inline 
> > implementations.  So the patch needs to update architectures as well.
> >
> Read the friendly thread. In short you should always expand mempcpy to
> memcpy and keep mempcpy implementation just for legacy programs.
> 
> Possible benefits are small, few cycles to save register spill at most.
> It can't be better as you can implement memcpy with mempcpy.
> 
> But there is huge cost involved. Effective mempcpy is often larger than
> one kilobyte. Duplicating that kilobyte for mempcpy would increase
> instruction cache pressure a lot. Easily it could cause one cache miss
> per call resulting additional 30 cycle penalty negating any benefit you
> try to make.

That cost sounds like it very much depends on the architecture's cache 
size on typical processors (presumably likely to go up over time) and the 
size of the function implementations.  On SPARC, for example, mempcpy is 
simply an alternative entry point in the same file as memcpy that sets up 
a different return value, so I see no apparent reason to avoid the 
out-of-line mempcpy implementation there.  Generally, you should get 
consensus for the change from architecture people on each architecture 
with its own mempcpy version (x86_64, x86, powerpc, sparc).

Regarding the patch in particular, you appear to lose the

> -/* In glibc we use this function frequently but for namespace reasons
> -   we have to use the name `__mempcpy'.  */
> -#   define mempcpy(dest, src, n) __mempcpy (dest, src, n)

so it's not clear the patch would even be effective for programs that call 
mempcpy rather than __mempcpy.  And in

> #define __mempcpy(dest, src, n) __mempcpy_inline(dest, src, n)

you're missing the space before the '(' in the function call.  The patch 
removes the definition of __mempcpy_small from the headers, but 
__mempcpy_small is part of the libc ABI for existing binaries that were 
built with old GCC; I'd have expected this patch to cause string-inlines.c 
no longer to export a definition of __mempcpy_small, resulting in ABI 
tests failing (you didn't say how the patch was tested).  I'd expect you 
to need to arrange for the definition to go somewhere else so that it 
still gets exported from libc (taking care of x86 having its own versions 
of string-inlines.c).  And as __mempcpy_inline is __STRING_INLINE, the 
compiler could choose not to inline it, meaning that it needs to go in the 
Versions file and ABI baselines unless you can arrange for it to have 
__mempcpy as its asm name for any non-inlined calls (which would certainly 
be preferable, to avoid introducing a new ABI).

All those ABI complications around removing existing inlines from headers 
definitely show that any such removals should be separate from adding new 
inlines or other optimizations.
  
Joseph Myers May 14, 2015, 4:47 p.m. UTC | #5
On Thu, 14 May 2015, Ondřej Bílka wrote:

> > Did the discussion involving 
> > <https://sourceware.org/ml/libc-alpha/2013-01/msg00157.html> and 
> > <https://sourceware.org/ml/libc-alpha/2013-01/msg00270.html> reach any 
> > wiki-documented consensus regarding what compiler versions it's worth 
> > having any optimizations for in the headers?
> > 
> These aren't too related to this problem, there is barely any discussion
> about performance, its mostly about support.

It seems directly related.  
<https://sourceware.org/ml/libc-alpha/2013-01/msg00270.html> has two 
paragraphs about optimizations in headers and not regressing such support 
"just in the name of generic simplification".

> I would use rule that if user doesn't care about performance by using
> obsolete gcc that generates slower code why should we care?

Well, maybe we should say that the optimizations for GCC < 4.1 (say) 
aren't significant enough to be worth the complexity of keeping (in 
headers not shared with gnulib) - though you still need to export all the 
old ABIs exported by string-inlines.c (they could be conditioned 
SHLIB_COMPAT so new architectures don't get them, however).  But such a 
change needs consensus.

> Also as I said before removing many of these would improve performance
> as they are obsolete and libcall is faster.

These are generally about optimizing cases of e.g. constant arguments for 
old compilers that didn't have such optimizations themselves.
  
Paul Eggert May 14, 2015, 4:51 p.m. UTC | #6
On 05/14/2015 09:47 AM, Joseph Myers wrote:
> Well, maybe we should say that the optimizations for GCC < 4.1 (say)
> aren't significant enough to be worth the complexity of keeping (in
> headers not shared with gnulib)

We don't care about optimizing older GCC for Gnulib either, so please 
don't bother to keep that complexity just for Gnulib's sake.
  
Adhemerval Zanella May 14, 2015, 7:28 p.m. UTC | #7
On 14-05-2015 13:40, Joseph Myers wrote:
> On Thu, 14 May 2015, Ondřej Bílka wrote:
> 
>> On Wed, May 13, 2015 at 10:21:25PM +0000, Joseph Myers wrote:
>>> On Wed, 13 May 2015, Ondřej Bílka wrote:
>>>
>>>> Hi,
>>>> As pointed out that following patch should be generic
>>>> http://patchwork.sourceware.org/patch/6459/
>>>> here is sample patch that does it. Header for mempcpy now becomes
>>>> following:
>>>>
>>>> #ifdef __USE_GNU
>>>> # if !defined _HAVE_STRING_ARCH_mempcpy || defined _FORCE_INLINES
>>>
>>> Doesn't this change the semantics of _HAVE_STRING_ARCH_mempcpy?  That is, 
>>> after this patch, architectures with efficient .S implementations of 
>>> mempcpy (as opposed to ones that just use the default .c implementation) 
>>> should define that macro rather than just those with inline 
>>> implementations.  So the patch needs to update architectures as well.
>>>
>> Read the friendly thread. In short you should always expand mempcpy to
>> memcpy and keep mempcpy implementation just for legacy programs.
>>
>> Possible benefits are small, few cycles to save register spill at most.
>> It can't be better as you can implement memcpy with mempcpy.
>>
>> But there is huge cost involved. Effective mempcpy is often larger than
>> one kilobyte. Duplicating that kilobyte for mempcpy would increase
>> instruction cache pressure a lot. Easily it could cause one cache miss
>> per call resulting additional 30 cycle penalty negating any benefit you
>> try to make.
> 
> That cost sounds like it very much depends on the architecture's cache 
> size on typical processors (presumably likely to go up over time) and the 
> size of the function implementations.  On SPARC, for example, mempcpy is 
> simply an alternative entry point in the same file as memcpy that sets up 
> a different return value, so I see no apparent reason to avoid the 
> out-of-line mempcpy implementation there.  Generally, you should get 
> consensus for the change from architecture people on each architecture 
> with its own mempcpy version (x86_64, x86, powerpc, sparc).

Back when I added mempcpy for POWER7, I just used the memcpy as base and
optimize it to avoid the stack call allocation.  However, I tend to agree
that for this particular function icache pressure is more dominant that
a function call. So I think for powerpc the inline is better.

> 
> Regarding the patch in particular, you appear to lose the
> 
>> -/* In glibc we use this function frequently but for namespace reasons
>> -   we have to use the name `__mempcpy'.  */
>> -#   define mempcpy(dest, src, n) __mempcpy (dest, src, n)
> 
> so it's not clear the patch would even be effective for programs that call 
> mempcpy rather than __mempcpy.  And in
> 
>> #define __mempcpy(dest, src, n) __mempcpy_inline(dest, src, n)
> 
> you're missing the space before the '(' in the function call.  The patch 
> removes the definition of __mempcpy_small from the headers, but 
> __mempcpy_small is part of the libc ABI for existing binaries that were 
> built with old GCC; I'd have expected this patch to cause string-inlines.c 
> no longer to export a definition of __mempcpy_small, resulting in ABI 
> tests failing (you didn't say how the patch was tested).  I'd expect you 
> to need to arrange for the definition to go somewhere else so that it 
> still gets exported from libc (taking care of x86 having its own versions 
> of string-inlines.c).  And as __mempcpy_inline is __STRING_INLINE, the 
> compiler could choose not to inline it, meaning that it needs to go in the 
> Versions file and ABI baselines unless you can arrange for it to have 
> __mempcpy as its asm name for any non-inlined calls (which would certainly 
> be preferable, to avoid introducing a new ABI).
> 
> All those ABI complications around removing existing inlines from headers 
> definitely show that any such removals should be separate from adding new 
> inlines or other optimizations.
>
  

Patch

diff --git a/string/bits/string2.h b/string/bits/string2.h
index 7645176..76c9164 100644
--- a/string/bits/string2.h
+++ b/string/bits/string2.h
@@ -201,189 +201,16 @@  __STRING2_COPY_TYPE (8);
    last copied.  */
 #ifdef __USE_GNU
 # if !defined _HAVE_STRING_ARCH_mempcpy || defined _FORCE_INLINES
-#  ifndef _HAVE_STRING_ARCH_mempcpy
-#   if __GNUC_PREREQ (3, 4)
-#    define __mempcpy(dest, src, n) __builtin_mempcpy (dest, src, n)
-#   elif __GNUC_PREREQ (3, 0)
-#    define __mempcpy(dest, src, n) \
-  (__extension__ (__builtin_constant_p (src) && __builtin_constant_p (n)      \
-		  && __string2_1bptr_p (src) && n <= 8			      \
-		  ? __builtin_memcpy (dest, src, n) + (n)		      \
-		  : __mempcpy (dest, src, n)))
-#   else
-#    define __mempcpy(dest, src, n) \
-  (__extension__ (__builtin_constant_p (src) && __builtin_constant_p (n)      \
-		  && __string2_1bptr_p (src) && n <= 8			      \
-		  ? __mempcpy_small (dest, __mempcpy_args (src), n)	      \
-		  : __mempcpy (dest, src, n)))
-#   endif
-/* In glibc we use this function frequently but for namespace reasons
-   we have to use the name `__mempcpy'.  */
-#   define mempcpy(dest, src, n) __mempcpy (dest, src, n)
-#  endif
 
-#  if !__GNUC_PREREQ (3, 0) || defined _FORCE_INLINES
-#   if _STRING_ARCH_unaligned
-#    ifndef _FORCE_INLINES
-#     define __mempcpy_args(src) \
-     ((const char *) (src))[0], ((const char *) (src))[2],		      \
-     ((const char *) (src))[4], ((const char *) (src))[6],		      \
-     __extension__ __STRING2_SMALL_GET16 (src, 0),			      \
-     __extension__ __STRING2_SMALL_GET16 (src, 4),			      \
-     __extension__ __STRING2_SMALL_GET32 (src, 0),			      \
-     __extension__ __STRING2_SMALL_GET32 (src, 4)
-#    endif
-__STRING_INLINE void *__mempcpy_small (void *, char, char, char, char,
-				       __uint16_t, __uint16_t, __uint32_t,
-				       __uint32_t, size_t);
-__STRING_INLINE void *
-__mempcpy_small (void *__dest1,
-		 char __src0_1, char __src2_1, char __src4_1, char __src6_1,
-		 __uint16_t __src0_2, __uint16_t __src4_2,
-		 __uint32_t __src0_4, __uint32_t __src4_4,
-		 size_t __srclen)
-{
-  union {
-    __uint32_t __ui;
-    __uint16_t __usi;
-    unsigned char __uc;
-    unsigned char __c;
-  } *__u = __dest1;
-  switch ((unsigned int) __srclen)
-    {
-    case 1:
-      __u->__c = __src0_1;
-      __u = __extension__ ((void *) __u + 1);
-      break;
-    case 2:
-      __u->__usi = __src0_2;
-      __u = __extension__ ((void *) __u + 2);
-      break;
-    case 3:
-      __u->__usi = __src0_2;
-      __u = __extension__ ((void *) __u + 2);
-      __u->__c = __src2_1;
-      __u = __extension__ ((void *) __u + 1);
-      break;
-    case 4:
-      __u->__ui = __src0_4;
-      __u = __extension__ ((void *) __u + 4);
-      break;
-    case 5:
-      __u->__ui = __src0_4;
-      __u = __extension__ ((void *) __u + 4);
-      __u->__c = __src4_1;
-      __u = __extension__ ((void *) __u + 1);
-      break;
-    case 6:
-      __u->__ui = __src0_4;
-      __u = __extension__ ((void *) __u + 4);
-      __u->__usi = __src4_2;
-      __u = __extension__ ((void *) __u + 2);
-      break;
-    case 7:
-      __u->__ui = __src0_4;
-      __u = __extension__ ((void *) __u + 4);
-      __u->__usi = __src4_2;
-      __u = __extension__ ((void *) __u + 2);
-      __u->__c = __src6_1;
-      __u = __extension__ ((void *) __u + 1);
-      break;
-    case 8:
-      __u->__ui = __src0_4;
-      __u = __extension__ ((void *) __u + 4);
-      __u->__ui = __src4_4;
-      __u = __extension__ ((void *) __u + 4);
-      break;
-    }
-  return (void *) __u;
-}
-#   else
-#    ifndef _FORCE_INLINES
-#     define __mempcpy_args(src) \
-     ((const char *) (src))[0],						      \
-     __extension__ ((__STRING2_COPY_ARR2)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1] } }),	      \
-     __extension__ ((__STRING2_COPY_ARR3)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1],		      \
-	  ((const char *) (src))[2] } }),				      \
-     __extension__ ((__STRING2_COPY_ARR4)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1],		      \
-	  ((const char *) (src))[2], ((const char *) (src))[3] } }),	      \
-     __extension__ ((__STRING2_COPY_ARR5)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1],		      \
-	  ((const char *) (src))[2], ((const char *) (src))[3],		      \
-	  ((const char *) (src))[4] } }),				      \
-     __extension__ ((__STRING2_COPY_ARR6)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1],		      \
-	  ((const char *) (src))[2], ((const char *) (src))[3],		      \
-	  ((const char *) (src))[4], ((const char *) (src))[5] } }),	      \
-     __extension__ ((__STRING2_COPY_ARR7)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1],		      \
-	  ((const char *) (src))[2], ((const char *) (src))[3],		      \
-	  ((const char *) (src))[4], ((const char *) (src))[5],		      \
-	  ((const char *) (src))[6] } }),				      \
-     __extension__ ((__STRING2_COPY_ARR8)				      \
-      { { ((const char *) (src))[0], ((const char *) (src))[1],		      \
-	  ((const char *) (src))[2], ((const char *) (src))[3],		      \
-	  ((const char *) (src))[4], ((const char *) (src))[5],		      \
-	  ((const char *) (src))[6], ((const char *) (src))[7] } })
-#    endif
-__STRING_INLINE void *__mempcpy_small (void *, char, __STRING2_COPY_ARR2,
-				       __STRING2_COPY_ARR3,
-				       __STRING2_COPY_ARR4,
-				       __STRING2_COPY_ARR5,
-				       __STRING2_COPY_ARR6,
-				       __STRING2_COPY_ARR7,
-				       __STRING2_COPY_ARR8, size_t);
+#define __mempcpy(dest, src, n) __mempcpy_inline(dest, src, n)
+
 __STRING_INLINE void *
-__mempcpy_small (void *__dest, char __src1,
-		 __STRING2_COPY_ARR2 __src2, __STRING2_COPY_ARR3 __src3,
-		 __STRING2_COPY_ARR4 __src4, __STRING2_COPY_ARR5 __src5,
-		 __STRING2_COPY_ARR6 __src6, __STRING2_COPY_ARR7 __src7,
-		 __STRING2_COPY_ARR8 __src8, size_t __srclen)
+__mempcpy_inline (void *__restrict __dest,
+		  const void *__restrict __src, size_t __n)
 {
-  union {
-    char __c;
-    __STRING2_COPY_ARR2 __sca2;
-    __STRING2_COPY_ARR3 __sca3;
-    __STRING2_COPY_ARR4 __sca4;
-    __STRING2_COPY_ARR5 __sca5;
-    __STRING2_COPY_ARR6 __sca6;
-    __STRING2_COPY_ARR7 __sca7;
-    __STRING2_COPY_ARR8 __sca8;
-  } *__u = __dest;
-  switch ((unsigned int) __srclen)
-    {
-    case 1:
-      __u->__c = __src1;
-      break;
-    case 2:
-      __extension__ __u->__sca2 = __src2;
-      break;
-    case 3:
-      __extension__ __u->__sca3 = __src3;
-      break;
-    case 4:
-      __extension__ __u->__sca4 = __src4;
-      break;
-    case 5:
-      __extension__ __u->__sca5 = __src5;
-      break;
-    case 6:
-      __extension__ __u->__sca6 = __src6;
-      break;
-    case 7:
-      __extension__ __u->__sca7 = __src7;
-      break;
-    case 8:
-      __extension__ __u->__sca8 = __src8;
-      break;
-    }
-  return __extension__ ((void *) __u + __srclen);
+  return memcpy (__dest, __src, __n) + __n;
 }
-#   endif
-#  endif
+
 # endif
 #endif