[1/*,v3] Generic string function optimization: Add skeleton

Message ID	20150528142956.GA25176@domone
State	New, archived
Headers	Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org Date: Thu, 28 May 2015 16:29:56 +0200 From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz> To: libc-alpha@sourceware.org Subject: Re: [PATCH 1/* v3] Generic string function optimization: Add skeleton Message-ID: <20150528142956.GA25176@domone> References: <20150527060121.GA19105@domone> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20150527060121.GA19105@domone> User-Agent: Mutt/1.5.20 (2009-06-14)

Message ID

20150528142956.GA25176@domone

State

New, archived

Headers

Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
Date: Thu, 28 May 2015 16:29:56 +0200
From: =?utf-8?B?T25kxZllaiBCw61sa2E=?= <neleai@seznam.cz>
To: libc-alpha@sourceware.org
Subject: Re: [PATCH 1/* v3] Generic string function optimization: Add
	skeleton
Message-ID: <20150528142956.GA25176@domone>
References: <20150527060121.GA19105@domone>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20150527060121.GA19105@domone>
User-Agent: Mutt/1.5.20 (2009-06-14)

Commit Message

Ondrej Bilka May 28, 2015, 2:29 p.m. UTC

  On Wed, May 27, 2015 at 08:01:21AM +0200, Ondřej Bílka wrote:
> Hi,
> 
> As i mentioned improving generic string functions this is my second
> attempt. As lot of functions will be optimized in same way this uses
> generic code flow. Functions will vary just by expression to check and
> how interpret return value.
> 
> Performance gains of using this are from loop unrolling and better
> header that doesn't have to check start byte-by-byte.
> 
> Unrolling migth be excessive but its better to tune it in skeleton than
> try manually and risk introducing errors.
> 
> This implementation would probably be faster than assembly on other
> architectures as this will beat it unless you used hardware specific
> instruction or gcc messes up compilation and generates suboptimal code.
> 
> Comments?
> 

Here is a new version of skeleton. I added a big endian support. This
reminded me that when I first wrote it I wanted to use opperations that
dont cause carry, then forgotten about it. As thats needed only for
first aligned load or always on big endian you need to supply expression
twice. one version shouldn't cause carry propagation.

 	* string/common.h: New file.
 	* string/skeleton.h: Likewise.

Comments

Richard Henderson May 28, 2015, 5:48 p.m. UTC | #1

On 05/28/2015 07:29 AM, Ondřej Bílka wrote:
> Here is a new version of skeleton. I added a big endian support. This
> reminded me that when I first wrote it I wanted to use opperations that
> dont cause carry, then forgotten about it. As thats needed only for
> first aligned load or always on big endian you need to supply expression
> twice. one version shouldn't cause carry propagation.
>
>   	* string/common.h: New file.
>   	* string/skeleton.h: Likewise.

I like the idea of this common header for implementing these algorithms. 
Though I'd like to see it not placed in string/, but sysdeps/generic/, so that 
one can provide specialized versions for different targets.

> +static const unsigned long int ones = (~0UL / 255); /* 0x0101...*/
> +static const unsigned long int add = 127 * (~0UL / 255);
> +static const unsigned long int high_bits = 128 * (~0UL / 255);

We're still using C, not C++.  These are objects requiring static allocation, 
not abstract constants.  Please just use #defines.

> +static __always_inline
> +unsigned long int
> +contains_zero (unsigned long int s)
> +{
> +  return (s - ones) & ~s & high_bits;
> +}

On Alpha or PPC, the target-specific header could use cmpbge or cmpb insns 
respectively.

On armv6t2/armv7, this can be done with the vector saturating addition, uqadd8, 
by adding 0xfe and then inverting.  Of course, this works on other targets for 
which vector insns exist, but on arm uqadd8 works on normal integer registers.

> +#define CROSS_PAGE(x, n) (((uintptr_t) x) % 4096 > 4096 - n)

Certainly different targets would like to override the minimal page size.

> +# ifdef FAST_FFS
> +  return (ffsl (u) - 1) / 8;
> +# else

Why are you stuck on ffs instead of ctz?  The later avoids all the -1 adjustments.

> +    }
> +  else
> +    {
> +#endif
...
> +#if _STRING_ARCH_unaligned
> +    }
> +#endif

Better placement of the first #endif means you don't need the second.

r~

Ondrej Bilka May 28, 2015, 6:04 p.m. UTC | #2

On Thu, May 28, 2015 at 10:48:29AM -0700, Richard Henderson wrote:
> On 05/28/2015 07:29 AM, Ondřej Bílka wrote:
> >Here is a new version of skeleton. I added a big endian support. This
> >reminded me that when I first wrote it I wanted to use opperations that
> >dont cause carry, then forgotten about it. As thats needed only for
> >first aligned load or always on big endian you need to supply expression
> >twice. one version shouldn't cause carry propagation.
> >
> >  	* string/common.h: New file.
> >  	* string/skeleton.h: Likewise.
> 
> I like the idea of this common header for implementing these
> algorithms. Though I'd like to see it not placed in string/, but
> sysdeps/generic/, so that one can provide specialized versions for
> different targets.
>
Will do.

> >+static __always_inline
> >+unsigned long int
> >+contains_zero (unsigned long int s)
> >+{
> >+  return (s - ones) & ~s & high_bits;
> >+}
> 
> On Alpha or PPC, the target-specific header could use cmpbge or cmpb
> insns respectively.
> 
> On armv6t2/armv7, this can be done with the vector saturating
> addition, uqadd8, by adding 0xfe and then inverting.  Of course,
> this works on other targets for which vector insns exist, but on arm
> uqadd8 works on normal integer registers.
> 
> >+# ifdef FAST_FFS
> >+  return (ffsl (u) - 1) / 8;
> >+# else
> 
> Why are you stuck on ffs instead of ctz?  The later avoids all the -1 adjustments.
>
Both comments are correct. We should do it generically and surround
these functions with ifdef to supply arch-specific versions.

Joseph Myers May 28, 2015, 8:54 p.m. UTC | #3

On Thu, 28 May 2015, Ondřej Bílka wrote:

> Both comments are correct. We should do it generically and surround
> these functions with ifdef to supply arch-specific versions. 

#if, not #ifdef, please, for all architecture choices in these functions.  
The sysdeps/generic version of the header architectures can use to change 
the default choices should have detailed comments on the default 
definitions of all the relevant macros to explain their semantics.

Ondrej Bilka May 28, 2015, 10:13 p.m. UTC | #4

On Thu, May 28, 2015 at 08:54:31PM +0000, Joseph Myers wrote:
> On Thu, 28 May 2015, Ondřej Bílka wrote:
> 
> > Both comments are correct. We should do it generically and surround
> > these functions with ifdef to supply arch-specific versions. 
> 
> #if, not #ifdef, please, for all architecture choices in these functions.  
> The sysdeps/generic version of the header architectures can use to change 
> the default choices should have detailed comments on the default 
> definitions of all the relevant macros to explain their semantics.
> 
For what purpose? Its pointless except that you would need to have
additional header say precommon.h, then undef and redefine macro when
you want change.

And it doesn't help to catch any errors. If you misspell define then you
will get error with duplicate definition.

Joseph Myers May 29, 2015, 10:42 a.m. UTC | #5

On Fri, 29 May 2015, Ondřej Bílka wrote:

> On Thu, May 28, 2015 at 08:54:31PM +0000, Joseph Myers wrote:
> > On Thu, 28 May 2015, Ondřej Bílka wrote:
> > 
> > > Both comments are correct. We should do it generically and surround
> > > these functions with ifdef to supply arch-specific versions. 
> > 
> > #if, not #ifdef, please, for all architecture choices in these functions.  
> > The sysdeps/generic version of the header architectures can use to change 
> > the default choices should have detailed comments on the default 
> > definitions of all the relevant macros to explain their semantics.
> > 
> For what purpose? Its pointless except that you would need to have
> additional header say precommon.h, then undef and redefine macro when
> you want change.

The general principle is that macro uses should be typo-proof.  That means 
avoiding #ifdef, #ifndef and #undef where possible.

The principle of documenting macro semantics applies even if there's a 
good reason typo-proof conventions are problematic in a particular case.  
There should *never* be any sort of architecture hook without clear 
documentation of the semantics of the hook (written from the perspective 
of an architecture maintainer wanting to know how to set the hook for 
their architecture, not from the perspective of the person writing the 
code using the hook).  I always advise writing documentation early in the 
process of developing a patch - the documentation is as important as the 
rest of the code.

> And it doesn't help to catch any errors. If you misspell define then you
> will get error with duplicate definition.

If the code does

#ifdef MACRO

or

#ifndef MACRO

then if an architecture does

#define MCARO

(with or without #undef) that will silently be ignored.

If, instead, the code does

#if MACRO

and there's a sysdeps/generic header that defines MACRO one way, if an 
architecture overrides that header with one that misspells the name, a 
-Wundef warning will be immediately visible (though we still need to fix 
the -Wundef warnings in the testsuite and remove the -Wno-error=undef).

Ondrej Bilka May 29, 2015, 10:56 a.m. UTC | #6

On Fri, May 29, 2015 at 10:42:17AM +0000, Joseph Myers wrote:
> On Fri, 29 May 2015, Ondřej Bílka wrote:
> 
> > On Thu, May 28, 2015 at 08:54:31PM +0000, Joseph Myers wrote:
> > > On Thu, 28 May 2015, Ondřej Bílka wrote:
> > > 
> > > > Both comments are correct. We should do it generically and surround
> > > > these functions with ifdef to supply arch-specific versions. 
> > > 
> > > #if, not #ifdef, please, for all architecture choices in these functions.  
> > > The sysdeps/generic version of the header architectures can use to change 
> > > the default choices should have detailed comments on the default 
> > > definitions of all the relevant macros to explain their semantics.
> > > 
> > For what purpose? Its pointless except that you would need to have
> > additional header say precommon.h, then undef and redefine macro when
> > you want change.
> 
> The general principle is that macro uses should be typo-proof.  That means 
> avoiding #ifdef, #ifndef and #undef where possible.
> 
> The principle of documenting macro semantics applies even if there's a 
> good reason typo-proof conventions are problematic in a particular case.  
> There should *never* be any sort of architecture hook without clear 
> documentation of the semantics of the hook (written from the perspective 
> of an architecture maintainer wanting to know how to set the hook for 
> their architecture, not from the perspective of the person writing the 
> code using the hook).  I always advise writing documentation early in the 
> process of developing a patch - the documentation is as important as the 
> rest of the code.
> 
> > And it doesn't help to catch any errors. If you misspell define then you
> > will get error with duplicate definition.
> 
> If the code does
> 
> #ifdef MACRO
> 
> or
> 
> #ifndef MACRO
> 
> then if an architecture does
> 
> #define MCARO
> 
> (with or without #undef) that will silently be ignored.
> 
> If, instead, the code does
> 
> #if MACRO
> 
> and there's a sysdeps/generic header that defines MACRO one way, if an 
> architecture overrides that header with one that misspells the name, a 
> -Wundef warning will be immediately visible (though we still need to fix 
> the -Wundef warnings in the testsuite and remove the -Wno-error=undef).
> 
Joseph, read previous mail before writing. Your suggestion is pointless.

For a code

#ifndef CUSTOM_FOO
int foo()
{
}
#endif

If architecture does
#define CUTSOM_FOO
int foo()
{
}

Then it gets error with redefinition of foo.

Joseph Myers May 29, 2015, 11:38 a.m. UTC | #7

On Fri, 29 May 2015, Ondřej Bílka wrote:

> > If, instead, the code does
> > 
> > #if MACRO
> > 
> > and there's a sysdeps/generic header that defines MACRO one way, if an 
> > architecture overrides that header with one that misspells the name, a 
> > -Wundef warning will be immediately visible (though we still need to fix 
> > the -Wundef warnings in the testsuite and remove the -Wno-error=undef).
> > 
> Joseph, read previous mail before writing. Your suggestion is pointless.

Which previous mail?  As far as I can tell, your last patch posting adding 
common.h is <https://sourceware.org/ml/libc-alpha/2015-05/msg00751.html>, 
which has a range of within-function #ifdefs (some completely untested, 
e.g. "#  ifdef NEED BITWISE", and all completely undocumented).  Anyway, 
when we have conventions in glibc (such as preferring #if to #ifdef) you 
should follow them unless there is a strong justification for doing 
otherwise (presented in every patch submission).  Likewise, even early 
patches should follow normal glibc formatting conventions.

Throwing out a series of untested, undocumented patches is not a helpful 
way of proposing changes to glibc.  I strongly recommend, in any 
complicated case such as this, that:

(a) each patch submission is self-contained - has the full self-contained 
write-up of the patch itself with the rationale for the patch and all the 
choices made, that would go in the commit message, followed by the 
description of changes from the previous version, rather than requiring a 
trail of previous messages to be followed to get the full rationale; and

(b) the documentation is more important than the code (write first for 
humans to read, only then for computers to execute); documenting the 
interfaces (such as FAST_CLZ and NEED_BITWISE) should be a very early step 
before any patches are sent to the list, not an afterthought, and the same 
applies to each internal function and macro in the code, even those that 
are not interfaces for architectures to reimplement.

> For a code
> 
> #ifndef CUSTOM_FOO
> int foo()
> {
> }
> #endif
> 
> If architecture does
> #define CUTSOM_FOO
> int foo()
> {
> }
> 
> Then it gets error with redefinition of foo.

Or you could avoid making readers think about whether the #ifndef is OK in 
a particular case by simply following normal glibc practices and have a 
sysdeps/generic/string-foo.h header that has a default definition with a 
careful comment explaining the semantics, and then architectures can have 
their own version to replace it as needed; no macros needed at all.

Ondrej Bilka June 16, 2015, 12:36 p.m. UTC | #8

On Fri, May 29, 2015 at 11:38:51AM +0000, Joseph Myers wrote:
> On Fri, 29 May 2015, Ondřej Bílka wrote:
> 
> > > If, instead, the code does
> > > 
> > > #if MACRO
> > > 
> > > and there's a sysdeps/generic header that defines MACRO one way, if an 
> > > architecture overrides that header with one that misspells the name, a 
> > > -Wundef warning will be immediately visible (though we still need to fix 
> > > the -Wundef warnings in the testsuite and remove the -Wno-error=undef).
> > > 
> > Joseph, read previous mail before writing. Your suggestion is pointless.
> 
> Which previous mail?  As far as I can tell, your last patch posting adding 
> common.h is <https://sourceware.org/ml/libc-alpha/2015-05/msg00751.html>,

This one. I already explained that for using if you would need
additional header to be able to undefine macros.

https://sourceware.org/ml/libc-alpha/2015-05/msg00812.html


> 
> (a) each patch submission is self-contained - has the full self-contained 
> write-up of the patch itself with the rationale for the patch and all the 
> choices made, that would go in the commit message, followed by the 
> description of changes from the previous version, rather than requiring a 
> trail of previous messages to be followed to get the full rationale; and
>
Almost nobody does that for good reason, see that most v2 on lists are
shorter than previous. You should write mostly what changed. Sure, I
could copy-paste three pages from original mail with rationale but most
readers would skip it as its duplicate and would skip changes made into
it.

I could make recapitulation once per while but for incremental
improvements its best to keep just increments.
 
> (b) the documentation is more important than the code (write first for 
> humans to read, only then for computers to execute); documenting the 
> interfaces (such as FAST_CLZ and NEED_BITWISE) should be a very early step 
> before any patches are sent to the list, not an afterthought, and the same 
> applies to each internal function and macro in the code, even those that 
> are not interfaces for architectures to reimplement.
> 
Thats not completely true as purpose of this is get every bit of
performance before you need to go into assembly. It depends how
technical my interface will become, with strcmp I found that I need
handle another primitive. As documentation its better to have good
overview than overly verbose one. Some macros there are just there for
optimizer to try both branches and select better one.


> > For a code
> > 
> > #ifndef CUSTOM_FOO
> > int foo()
> > {
> > }
> > #endif
> > 
> > If architecture does
> > #define CUTSOM_FOO
> > int foo()
> > {
> > }
> > 
> > Then it gets error with redefinition of foo.
> 
> Or you could avoid making readers think about whether the #ifndef is OK in 
> a particular case by simply following normal glibc practices and have a 
> sysdeps/generic/string-foo.h header that has a default definition with a 
> careful comment explaining the semantics, and then architectures can have 
> their own version to replace it as needed; no macros needed at all.
> 
First its too typo-prone. Architecture maintainer would create
string_foo.h file and never notice that it was silently ignored.


Then why didn't you said that directly? That saves time instead
suggesting if when you mean separate file. Thats correct as there are
several variants and you need to run benchmark to select correct one.

diff mbox

Patch

diff --git a/string/common.h b/string/common.h
new file mode 100644
index 0000000..481c99f
--- /dev/null
+++ b/string/common.h
@@ -0,0 +1,80 @@ 
+#include <stdint.h>
+
+static const unsigned long int ones = (~0UL / 255); /* 0x0101...*/
+static const unsigned long int add = 127 * (~0UL / 255);
+static const unsigned long int high_bits = 128 * (~0UL / 255);
+
+/* Use vector arithmetic tricks. Idea is to take expression works on
+   unsigned byte and evaluates 0 for nozero byte and nonzero on zero byte.
+   Our expression is ((s - 1) & (~s)) & 128  
+   Now we evaluate this expression on each byte in parallel and on first 
+   nonzero byte our expression will have nonzero value. 
+
+   We need to provide version of expression that doesn't cause carry 
+   propagation and opperations could be done in parallel. However its
+   not needed on little endian architectures as we end on first byte 
+   that succeeds and we don't care that next ones could be corrupted.
+  */
+
+#if __BYTE_ORDER == __LITTLE_ENDIAN
+static __always_inline
+unsigned long int 
+contains_zero (unsigned long int s)
+{
+  return (s - ones) & ~s & high_bits;
+}
+#else
+#define contains_zero contains_zero_nocarry
+#endif
+
+static __always_inline
+unsigned long int 
+contains_zero_nocarry (unsigned long int s)
+{
+  return (((s | high_bits) - ones) ^ high_bits) & ~s & high_bits;
+}
+
+#define LSIZE sizeof (unsigned long int)
+#define CROSS_PAGE(x, n) (((uintptr_t) x) % 4096 > 4096 - n)
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+#define SHIFT_BYTES(x, n) ((x) << (8 * (n)))
+#else
+#define SHIFT_BYTES(x, n) ((x) >> (8 * (n)))
+#endif
+
+static __always_inline
+size_t
+first_nonzero_byte (unsigned long int u)
+{
+#if __BYTE_ORDER == __BIG_ENDIAN
+# ifdef FAST_CLZ
+  return clz (u) / 8;
+# else
+#  ifdef NEED BITWISE
+  u = u | (u >> 1);
+  u = u | (u >> 2);
+  u = u | (u >> 4);
+#  else
+  u = u >> 7;
+#  endif
+  u = u | (u >> 8);
+  u = u | (u >> 16);
+  u = u | (u >> 32);
+#  ifdef NEED_BITWISE
+  u = u & ones;
+#  endif
+  u = u * ones;
+  return 8 - (u >> (8 * LSIZE - 8));
+# endif
+#else
+# ifdef FAST_FFS
+  return (ffsl (u) - 1) / 8;
+# else
+  u = u ^ (u - 1);
+  u = u & ones;
+  u = u * ones;
+  return (u >> (8 * LSIZE - 8)) - 1;
+# endif
+#endif
+}
diff --git a/string/skeleton.h b/string/skeleton.h
new file mode 100644
index 0000000..76bd08f
--- /dev/null
+++ b/string/skeleton.h
@@ -0,0 +1,150 @@ 
+/* Skeleton of generic string functions.
+   Copyright (C) 1991-2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+#include <libc-internal.h>
+#include <stdint.h>
+
+#ifndef BOUND
+# define BOUND(x) 0
+#endif
+
+/* On high endian an positive could cause false positive in previous byte.  */
+
+#if __BYTE_ORDER == __BIG_ENDIAN
+#undef EXPRESSION
+#define EXPRESSION(x,y) EXPRESSION_NOCARRY(x,y)
+#endif
+
+static __always_inline
+int
+found_in_long_bytes(char *s, unsigned long int cmask, char **result)
+{
+  const unsigned long int *lptr = (const unsigned long int *) s;
+  unsigned long int mask = EXPRESSION(*lptr, cmask);
+  if (mask)
+    {
+      *result = s + first_nonzero_byte (mask);
+      return 1;
+    }
+  else
+    return 0;
+}
+
+static __always_inline
+char *
+string_skeleton (const char *s_in, int c_in, char *end)
+{
+  unsigned long int mask;
+  const unsigned long int *lptr;
+  char *s = (char *) s_in;
+  unsigned char c = (unsigned char) c_in;
+  char *r;
+  unsigned long int __attribute__ ((unused)) cmask = c * ones;
+
+#if _STRING_ARCH_unaligned
+  /* We fetch 32 bytes while not crossing page boundary. 
+     Most strings in practice are of that size and we avoid a loop.
+     This looks as best in practice, alternative below uses aligned load 
+     but is slower when string starts just few 
+     bytes before 32 byte boundary. A tradeoff is that we rarely could 
+     fetch extra cache line without needing it but this optimization 
+     does pay for that. */
+  if (!CROSS_PAGE(s, 32))
+    {
+      if (found_in_long_bytes (s + 0 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 1 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 2 * LSIZE, cmask, &r))
+        return r;
+      if (found_in_long_bytes (s + 3 * LSIZE, cmask, &r))
+        return r;
+      if (sizeof (unsigned long int) == 4)
+        {
+          if (found_in_long_bytes (s + 5 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s + 6 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s + 7 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s + 8 * LSIZE, cmask, &r))
+            return r;
+        }
+
+      if (BOUND (s + 32))
+        return NULL;
+    }
+  else
+    {
+#endif
+  /* We need use aligned loads. For first load we read some bytes before 
+     start that we discard by shifting them down. */
+ 
+      char *s_aligned = PTR_ALIGN_DOWN (s, LSIZE);
+      lptr = (const unsigned long int *) s_aligned;
+      /* We need be careful here as bytes before start can corrupt it.  */
+      mask = SHIFT_BYTES ((EXPRESSION_NOCARRY (*lptr, cmask)), s - s_aligned);
+
+      if (mask)
+        return s + first_nonzero_byte (mask);
+
+      if (BOUND (s_aligned + 1 * LSIZE))
+        return NULL;
+      if (found_in_long_bytes (s_aligned + 1 * LSIZE, cmask, &r))
+        return r;
+      if (BOUND (s_aligned + 2 * LSIZE))
+        return NULL;
+      if (found_in_long_bytes (s_aligned + 2 * LSIZE, cmask, &r))
+        return r;
+      if (BOUND (s_aligned + 3 * LSIZE))
+        return NULL;
+      if (found_in_long_bytes (s_aligned + 3 * LSIZE, cmask, &r))
+        return r;
+#if _STRING_ARCH_unaligned
+    }
+#endif
+   /* Now we read enough bytes to start a loop.  */
+
+  char *s_loop = PTR_ALIGN_DOWN (s, 4 * LSIZE);
+  while (!BOUND (s_loop + 4 * LSIZE))
+    {
+      s_loop += 4 * LSIZE;
+      lptr = (const unsigned long int *) (s_loop + 0 * LSIZE);
+      mask = EXPRESSION (*lptr, cmask);
+      lptr = (const unsigned long int *) (s_loop + 1 * LSIZE);
+      mask |= EXPRESSION (*lptr, cmask);
+      lptr = (const unsigned long int *) (s_loop + 2 * LSIZE);
+      mask |= EXPRESSION (*lptr, cmask);
+      lptr = (const unsigned long int *) (s_loop + 3 * LSIZE);
+      mask |= EXPRESSION (*lptr, cmask);
+
+      if (mask)
+        {
+          if (found_in_long_bytes (s_loop + 0 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s_loop + 1 * LSIZE, cmask, &r))
+            return r;
+          if (found_in_long_bytes (s_loop + 2 * LSIZE, cmask, &r))
+            return r;
+
+          return s_loop + 3 * LSIZE + first_nonzero_byte (mask);
+        }
+    }
+ return NULL;
+}