Message ID: 20150113191449.AD91B2C39DC@topped-with-meat.com
State: Committed
Commit Message
Roland McGrath
Jan. 13, 2015, 7:14 p.m. UTC
> Well, bcopy and bzero are both used in several non-test sources in GLIBC,
> but one is always defined as __bzero, and the other as bcopy.

There are no uses of bcopy outside of tests, and portability macros in code
(possibly, or once) shared with gnulib.  There is a bcopy macro in
locale/programs/simple-hash.c, but it's not used anywhere.  I've removed
that in the patch below.

There are three uses of bzero outside of tests.  I've changed those to
memset in the patch below.  In my build (x86_64-linux-gnu, GCC 4.8.3) this
did not change the compiled code at all--so they were already being reduced
to memset or inlined directly by the compiler.

There are also two uses of bcmp outside of tests (both in old resolver
code), which is a similar case (but even easier and less useful, since it
is nothing but exactly an alias for memcmp).  I've changed those to memcmp
in the patch below.

There is a handful of uses of __bzero, all in sunrpc/ (none even in
tests).  We're avoiding touching sunrpc/ when it's not thoroughly
necessary, so we'll leave those alone.

> It's unclear to me what the exact namespace rules are (or whether there is
> even a concise description of them somewhere), however it is obvious that
> this is not consistent.  Note any rule that relies on whether a function is
> called or not from within the same library is risky unless you have
> automated checks to catch that.

The rules per se can be described concisely, but they incorporate by
reference the set of symbols in each standard and the dependency graph of
standards that are specified as supersets of others.  However, the
linknamespace tests should already be prepared to catch any concrete
violations.

Here's what I've just committed.  As I mentioned above, it caused no
changes to the compiled code on x86_64-linux-gnu with gcc-4.8.3, except for
the expected s/bcmp/memcmp/ changes that are trivially provably harmless.
Thanks,
Roland

2015-01-13  Roland McGrath  <roland@hack.frob.com>

	* login/logout.c (logout): Use memset rather than bzero.
	* nis/nss_compat/compat-pwd.c (getpwent_next_file): Likewise.
	* nis/nss_compat/compat-spwd.c (getspent_next_file): Likewise.
	* resolv/gethnamaddr.c (gethostbyaddr): Use memcmp rather than bcmp.
	(_gethtbyaddr): Likewise.
	* locale/programs/simple-hash.c (bcopy): Macro removed.
Comments
Roland McGrath wrote:
> There are no uses of bcopy outside of tests, and portability macros in code
> (possibly, or once) shared with gnulib.  There is a bcopy macro in
> locale/programs/simple-hash.c, but it's not used anywhere.  I've removed
> that in the patch below.
>
> There are three uses of bzero outside of tests.  I've changed those to
> memset in the patch below.  In my build (x86_64-linux-gnu, GCC 4.8.3) this
> did not change the compiled code at all--so they were already being reduced
> to memset or inlined directly by the compiler.

GCC often expands bzero into memset, but not all versions do this, and
neither do all options, so you cannot rely on this.  So before your patch
login/logout.c might actually call bzero...

We need something like this in string.h so we always optimize all calls to
standard optimized functions, irrespective of the compiler and options used:

#ifdef __USE_MISC
# define bzero(s, n) memset (s, '\0', n)
# define __bzero(s, n) memset (s, '\0', n)
# define bcopy(src, dest, n) memmove (dest, src, n)
# define bcmp(s1, s2, n) memcmp (s1, s2, n)
#endif

Does that seem reasonable?

Now the only remaining one to deal with is mempcpy - I'd like something
like this in string/strings2.h:

#define mempcpy(dest, src, n) __mempcpy_inline (dest, src, n)

__STRING_INLINE void *
__mempcpy_inline (void *dest, const void *src, size_t n)
{
  return (char *) memcpy (dest, src, n) + n;
}

If there are targets which want to keep using their optimized version, then
the above could be guarded by a define (eg. _HAVE_OPTIMIZED_MEMPCPY) which
would be defined by any target which has an optimized mempcpy.  Or is there
an easier way to test whether a target has added an optimized version of a
generic function?

> Here's what I've just committed.  As I mentioned above, it caused no
> changes to the compiled code on x86_64-linux-gnu with gcc-4.8.3, except for
> the expected s/bcmp/memcmp/ changes that are trivially provably harmless.

Thanks, that looks good.

Wilco
On Wed, 14 Jan 2015, Wilco Dijkstra wrote:

> Roland McGrath wrote:
> > There are no uses of bcopy outside of tests, and portability macros in code
> > (possibly, or once) shared with gnulib.  There is a bcopy macro in
> > locale/programs/simple-hash.c, but it's not used anywhere.  I've removed
> > that in the patch below.
> >
> > There are three uses of bzero outside of tests.  I've changed those to
> > memset in the patch below.  In my build (x86_64-linux-gnu, GCC 4.8.3) this
> > did not change the compiled code at all--so they were already being reduced
> > to memset or inlined directly by the compiler.
>
> GCC often expands bzero into memset, but not all versions do this, and
> neither do all options, so you cannot rely on this.  So before your patch
> login/logout.c might actually call bzero...

glibc only supports building with a limited set of compilers, and the
semantic options used for the build must be those determined by glibc's
build system (it's not valid to include -fno-builtin in CC or CFLAGS when
configuring glibc, for example).

> We need something like this in string.h so we always optimize all calls to
> standard optimized functions, irrespective of the compiler and options used:

We hardly *need* such an optimization, when it's only for obscure cases of
old code and unusual compiler options.  (It may still be a good idea, if
done correctly.)

> #ifdef __USE_MISC
> # define bzero(s, n) memset (s, '\0', n)
> # define __bzero(s, n) memset (s, '\0', n)
> # define bcopy(src, dest, n) memmove (dest, src, n)

These all need casts to void so as to return the correct types.
> GCC often expands bzero into memset, but not all versions do this, and
> neither do all options, so you cannot rely on this.  So before your patch
> login/logout.c might actually call bzero...

Sure.

> We need something like this in string.h so we always optimize all calls to
> standard optimized functions, irrespective of the compiler and options used:

We would need that if we wanted to do that.  But these entrypoints are all
old and deprecated.  They are only for the benefit of old code.  Any code
so old that it hasn't been touched since there were actually systems to
build it on that don't have the C89 standard functions surely has worse
performance issues than this.  Making the deprecated functions optimal
only encourages people to keep using them.

> Now the only remaining one to deal with is mempcpy - I'd like something
> like this in string/strings2.h:

Why?  It's trivial enough for each memcpy implementation to implement
mempcpy too, and for many implementations rolling it in might save an
instruction or two over the generic addition.  It doesn't seem worth the
complexity to bother with anything in the header files.
Roland McGrath wrote:
> Wilco Dijkstra wrote:
> > We need something like this in string.h so we always optimize all calls to
> > standard optimized functions, irrespective of the compiler and options used:
>
> We would need that if we wanted to do that.  But these entrypoints are all
> old and deprecated.  They are only for the benefit of old code.  Any code
> so old that it hasn't been touched since there were actually systems to
> build it on that don't have the C89 standard functions surely has worse
> performance issues than this.  Making the deprecated functions optimal
> only encourages people to keep using them.

Agreed, however they appear to be used in a lot of code, including
benchmarks.  For example a quick grep shows there are a large number of
occurrences of bzero and bcopy in SPEC2006.

> > Now the only remaining one to deal with is mempcpy - I'd like something
> > like this in string/strings2.h:
>
> Why?  It's trivial enough for each memcpy implementation to implement
> mempcpy too, and for many implementations rolling it in might save an
> instruction or two over the generic addition.  It doesn't seem worth the
> complexity to bother with anything in the header files.

OK, so the goal of many of the changes I've been making is as follows.  By
default GLIBC should provide the most efficient generic implementations so
that a new target is not forced to write a large number of optimized
assembler functions in order to get reasonable performance.  Additionally,
given that all targets add optimized versions of a few key functions (such
as memcpy, memset, strlen), use those whenever feasible rather than less
widely used variants.

Back to mempcpy: not only is inlining mempcpy simple and a good idea, it
is also the most efficient implementation.  If you create a separate
optimized implementation of mempcpy, it requires 1-2 extra instructions
and increases pressure on caches and branch predictors.

Another approach would be to set the return value at the start of memcpy
so that mempcpy can jump past it.  This means 1 extra instruction in every
memcpy invocation plus an extra branch for mempcpy.  Neither option is
clearly better than just inlining.  This also ignores the additional
effort to write/test mempcpy, which could be spent on more important
things; it appears most targets have not bothered with mempcpy as a
result.

So to me adding the inline version is a no-brainer and should have been
done a long time ago.

Wilco
On Thu, Jan 15, 2015 at 03:48:47PM -0000, Wilco Dijkstra wrote:
> Roland McGrath wrote:
> > Wilco Dijkstra wrote:
> > > We need something like this in string.h so we always optimize all calls to
> > > standard optimized functions, irrespective of the compiler and options used:
> >
> > We would need that if we wanted to do that.  But these entrypoints are all
> > old and deprecated.  They are only for the benefit of old code.  Any code
> > so old that it hasn't been touched since there were actually systems to
> > build it on that don't have the C89 standard functions surely has worse
> > performance issues than this.  Making the deprecated functions optimal
> > only encourages people to keep using them.
>
> Agreed, however they appear to be used in a lot of code, including benchmarks.
> For example a quick grep shows there are a large number of occurrences of
> bzero and bcopy in SPEC2006.

Also gcc could optimize memset to __bzero; I will probably write a patch
for x64 to save a few cycles.  There is a complication that gcc could use
the memset return value, so we need to check whether it's dead or create a
new symbol.

> > > Now the only remaining one to deal with is mempcpy - I'd like something
> > > like this in string/strings2.h:
> >
> > Why?  It's trivial enough for each memcpy implementation to implement
> > mempcpy too, and for many implementations rolling it in might save an
> > instruction or two over the generic addition.  It doesn't seem worth the
> > complexity to bother with anything in the header files.
>
> Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
> also the most efficient implementation.  If you create a separate optimized
> implementation of mempcpy, it requires 1-2 extra instructions and increases
> pressure on caches and branch predictors.  Another approach would be to set

That was previously mentioned in the parent thread.  With a separate
mempcpy you will likely pay an additional 100-cycle penalty, as mempcpy is
not called often.

> the return value at the start of memcpy so that mempcpy can jump past it.
> This means 1 extra instruction in every memcpy invocation plus an extra
> branch for mempcpy.

That is false.  You need to copy the starting memcpy fragment up to the
point where you set the return value and then jump, which gives no
overhead for memcpy.  That could be problematic on some architectures, as
you need to do it without spilling an extra register.

> Neither option is clearly better than just inlining.
> This ignores the additional effort to write/test mempcpy which could be
> spent on more important things.  It appears most targets have not bothered
> with mempcpy as a result.
Ondřej Bílka wrote:
> On Thu, Jan 15, 2015 at 03:48:47PM -0000, Wilco Dijkstra wrote:
> > Roland McGrath wrote:
> > > Wilco Dijkstra wrote:
> > > > We need something like this in string.h so we always optimize all calls to
> > > > standard optimized functions, irrespective of the compiler and options used:
> > >
> > > We would need that if we wanted to do that.  But these entrypoints are all
> > > old and deprecated.  They are only for the benefit of old code.  Any code
> > > so old that it hasn't been touched since there were actually systems to
> > > build it on that don't have the C89 standard functions surely has worse
> > > performance issues than this.  Making the deprecated functions optimal
> > > only encourages people to keep using them.
> >
> > Agreed, however they appear to be used in a lot of code, including benchmarks.
> > For example a quick grep shows there are a large number of occurrences of
> > bzero and bcopy in SPEC2006.
>
> Also gcc could optimize memset to __bzero; I will probably write a patch
> for x64 to save a few cycles.  There is a complication that gcc could use
> the memset return value, so we need to check whether it's dead or create a
> new symbol.

That is certainly a good idea - I added _memclr to armcc a long time ago,
as 99% of uses of memset set it to zero and don't use the return value
(and the cost of saving/restoring the return value inside memcpy/memset is
higher than just recomputing it on most targets).  However this means we
first need to make sure all targets have a decent __bzero implementation,
as otherwise you penalize everybody with an extra veneer to memset...

> > > > Now the only remaining one to deal with is mempcpy - I'd like something
> > > > like this in string/strings2.h:
> > >
> > > Why?  It's trivial enough for each memcpy implementation to implement
> > > mempcpy too, and for many implementations rolling it in might save an
> > > instruction or two over the generic addition.  It doesn't seem worth the
> > > complexity to bother with anything in the header files.
> >
> > Back to mempcpy, not only is inlining mempcpy simple and a good idea, it is
> > also the most efficient implementation.  If you create a separate optimized
> > implementation of mempcpy, it requires 1-2 extra instructions and increases
> > pressure on caches and branch predictors.  Another approach would be to set
>
> That was previously mentioned in the parent thread.  With a separate
> mempcpy you will likely pay an additional 100-cycle penalty, as mempcpy is
> not called often.
>
> > the return value at the start of memcpy so that mempcpy can jump past it.
> > This means 1 extra instruction in every memcpy invocation plus an extra
> > branch for mempcpy.
>
> That is false.  You need to copy the starting memcpy fragment up to the
> point where you set the return value and then jump, which gives no
> overhead for memcpy.

That's not how memcpy implementations work.  You never set the return
value explicitly; you either don't change the destination register (which
on most ABIs also is the return value) or save/restore it on targets with
few registers.  Additionally, for small/medium copies you use the
destination (and return value) unchanged, so to support a different return
value you need an extra instruction to make a copy of the destination ...

> That could be problematic on some architectures, as you need to do it
> without spilling an extra register.

... which could also mean an extra spill/restore in the small/medium copy
cases.  So I don't think merging mempcpy and memcpy is a good idea on any
target.

Wilco
On Wed, 4 Feb 2015, Wilco Dijkstra wrote: > That is certainly a good idea - I added _memclr to armcc a long time ago > as 99% of uses of memset set it to zero and don't use the return value > (and the cost of save/restore the return value inside memcpy/memset is > higher than just recomputing it on most targets). On ARM there's __aeabi_memclr (and __aeabi_memclr4 and __aeabi_memclr8), but the __aeabi_mem* functions are mostly currently implemented as wrappers (__aeabi_memcpy* as aliases for __memcpy_arm in the armv7/multiarch case, to avoid clobbering NEON registers) and there's no GCC support for generating calls to those functions (which would only be useful with glibc if they were actually more efficient rather than wrappers).
Joseph Myers wrote:
> On Wed, 4 Feb 2015, Wilco Dijkstra wrote:
>
> > That is certainly a good idea - I added _memclr to armcc a long time ago
> > as 99% of uses of memset set it to zero and don't use the return value
> > (and the cost of save/restore the return value inside memcpy/memset is
> > higher than just recomputing it on most targets).
>
> On ARM there's __aeabi_memclr (and __aeabi_memclr4 and __aeabi_memclr8),
> but the __aeabi_mem* functions are mostly currently implemented as
> wrappers (__aeabi_memcpy* as aliases for __memcpy_arm in the
> armv7/multiarch case, to avoid clobbering NEON registers) and there's no
> GCC support for generating calls to those functions (which would only be
> useful with glibc if they were actually more efficient rather than
> wrappers).

Indeed it's unfortunate that these as well as the generic __bzero are
inefficient.  Would it be possible for GCC to detect a modern GLIBC from
the headers and only emit calls to __bzero (or say a new __memclr symbol)
if a target provides an optimized version?  Not sure whether there is
already a way to do this, or whether such trickery is discouraged...

Wilco
On Wed, 4 Feb 2015, Wilco Dijkstra wrote:

> Indeed it's unfortunate that these as well as the generic __bzero are
> inefficient.  Would it be possible for GCC to detect a modern GLIBC from
> the headers and only emit calls to __bzero (or say a new __memclr symbol)
> if a target provides an optimized version?  Not sure whether there is
> already a way to do this, or whether such trickery is discouraged...

There's existing support for checking the glibc version at GCC configure
time and acting in GCC based on that (of course, that's just the minimum
glibc version that GCC knows will be used at runtime - it doesn't take
account of glibc upgrades).  That's sufficient if using new glibc features
would require a GCC upgrade anyway.

If an older GCC could meaningfully use a newer glibc feature when glibc is
upgraded - if, for example, GCC knows about using efficient __bzero but
the set of relevant architectures varies with glibc version, so you want
glibc headers to tell GCC whether __bzero is efficient for that glibc
version and architecture - then you get into stdc-predef.h (or an
architecture-specific bits/ file included therefrom) doing "#pragma GCC
library_feature efficient___bzero" or similar - a new pragma to inform GCC
of library features it can presume on the target.  (There might be other
possible approaches to implementing this.)
On Wed, Feb 04, 2015 at 04:30:43PM -0000, Wilco Dijkstra wrote:
> > > the return value at the start of memcpy so that mempcpy can jump past it.
> > > This means 1 extra instruction in every memcpy invocation plus an extra
> > > branch for mempcpy.
> >
> > That is false.  You need to copy the starting memcpy fragment up to the
> > point where you set the return value and then jump, which gives no
> > overhead for memcpy.
>
> That's not how memcpy implementations work.  You never set the return value
> explicitly; you either don't change the destination register (which on most
> ABIs also is the return value) or save/restore it on targets with few
> registers.  Additionally, for small/medium copies you use the destination
> (and return value) unchanged, so to support a different return value you
> need an extra instruction to make a copy of the destination ...

No, my description is quite explicit.  You take the memcpy implementation
and look for the first instruction such that there is no read/write of the
return register/memory after reaching it.

Now for mempcpy you take memcpy as a template and clone it up to the
instruction described above.  At that position you change the return value
and jump to the corresponding instruction in memcpy.

It is obvious this does not add an extra instruction to memcpy, as memcpy
is not changed.
Ondřej Bílka wrote:
> On Wed, Feb 04, 2015 at 04:30:43PM -0000, Wilco Dijkstra wrote:
> > > > the return value at the start of memcpy so that mempcpy can jump past it.
> > > > This means 1 extra instruction in every memcpy invocation plus an extra
> > > > branch for mempcpy.
> > >
> > > That is false.  You need to copy the starting memcpy fragment up to the
> > > point where you set the return value and then jump, which gives no
> > > overhead for memcpy.
> >
> > That's not how memcpy implementations work.  You never set the return value
> > explicitly; you either don't change the destination register (which on most
> > ABIs also is the return value) or save/restore it on targets with few
> > registers.  Additionally, for small/medium copies you use the destination
> > (and return value) unchanged, so to support a different return value you
> > need an extra instruction to make a copy of the destination ...
>
> No, my description is quite explicit.  You take the memcpy implementation
> and look for the first instruction such that there is no read/write of the
> return register/memory after reaching it.
>
> Now for mempcpy you take memcpy as a template and clone it up to the
> instruction described above.  At that position you change the return value
> and jump to the corresponding instruction in memcpy.
>
> It is obvious this does not add an extra instruction to memcpy, as memcpy
> is not changed.

Yes, but that means if memcpy never makes a copy of the destination
register and uses it in *all* subcases (like my memcpy does) then you end
up having to duplicate almost all of the memcpy code.  While duplicating
most of memcpy is not as bad as duplicating all of it, the common cases
will not be cached/predicted.

So the conclusion is that you either make memcpy less efficient in order
to share all of it (by using an extra register and making a copy of the
destination register early on) or duplicate most of memcpy.  Neither is a
good idea.

Wilco
--- a/locale/programs/simple-hash.c
+++ b/locale/programs/simple-hash.c
@@ -42,10 +42,6 @@
 # define BITSPERBYTE 8
 #endif
 
-#ifndef bcopy
-# define bcopy(s, d, n) memcpy ((d), (s), (n))
-#endif
-
 #define hashval_t uint32_t
 #include "hashval.h"
 
--- a/login/logout.c
+++ b/login/logout.c
@@ -45,9 +45,9 @@ logout (const char *line)
   if (getutline_r (&tmp, &utbuf, &ut) >= 0)
     {
       /* Clear information about who & from where.  */
-      bzero (ut->ut_name, sizeof ut->ut_name);
+      memset (ut->ut_name, '\0', sizeof ut->ut_name);
 #if _HAVE_UT_HOST - 0
-      bzero (ut->ut_host, sizeof ut->ut_host);
+      memset (ut->ut_host, '\0', sizeof ut->ut_host);
 #endif
 #if _HAVE_UT_TV - 0
       struct timeval tv;
--- a/nis/nss_compat/compat-pwd.c
+++ b/nis/nss_compat/compat-pwd.c
@@ -578,7 +578,7 @@ getpwent_next_file (struct passwd *result, ent_t *ent,
 	  char *user, *host, *domain;
 	  struct __netgrent netgrdata;
 
-	  bzero (&netgrdata, sizeof (struct __netgrent));
+	  memset (&netgrdata, 0, sizeof (struct __netgrent));
 	  __internal_setnetgrent (&result->pw_name[2], &netgrdata);
 	  while (__internal_getnetgrent_r (&host, &user, &domain, &netgrdata,
 					   buf2, sizeof (buf2), errnop))
--- a/nis/nss_compat/compat-spwd.c
+++ b/nis/nss_compat/compat-spwd.c
@@ -532,7 +532,7 @@ getspent_next_file (struct spwd *result, ent_t *ent,
 	  char *user, *host, *domain;
 	  struct __netgrent netgrdata;
 
-	  bzero (&netgrdata, sizeof (struct __netgrent));
+	  memset (&netgrdata, 0, sizeof (struct __netgrent));
 	  __internal_setnetgrent (&result->sp_namp[2], &netgrdata);
 	  while (__internal_getnetgrent_r (&host, &user, &domain,
 					   &netgrdata, buf2, sizeof (buf2),
--- a/resolv/gethnamaddr.c
+++ b/resolv/gethnamaddr.c
@@ -672,8 +672,8 @@ gethostbyaddr(addr, len, af)
 		return (NULL);
 	}
 	if (af == AF_INET6 && len == IN6ADDRSZ &&
-	    (!bcmp(uaddr, mapped, sizeof mapped) ||
-	     !bcmp(uaddr, tunnelled, sizeof tunnelled))) {
+	    (!memcmp(uaddr, mapped, sizeof mapped) ||
+	     !memcmp(uaddr, tunnelled, sizeof tunnelled))) {
 		/* Unmap. */
 		addr += sizeof mapped;
 		uaddr += sizeof mapped;
@@ -922,7 +922,7 @@ _gethtbyaddr(addr, len, af)
 
 	_sethtent(0);
 	while ((p = _gethtent()))
-		if (p->h_addrtype == af && !bcmp(p->h_addr, addr, len))
+		if (p->h_addrtype == af && !memcmp(p->h_addr, addr, len))
 			break;
 	_endhtent();
 	return (p);