Make may_trap_p_1 return false for constant pool references [PR116145]

Message ID mpt7cd1ms4l.fsf@arm.com
State Committed
Commit ba730fd10934e4ca004251aa3748bf9da4d35e62
Headers
Series Make may_trap_p_1 return false for constant pool references [PR116145] |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed

Commit Message

Richard Sandiford July 31, 2024, 4:40 p.m. UTC
  The testcase contains the constant:

  arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));

which was initially hoisted by hand, but which gimple optimisers later
propagated to each use (as expected).  The constant was then expanded
as a load-and-duplicate from the constant pool.  Normally that load
should then be hoisted back out of the loop, but may_trap_or_fault_p
stopped that from happening in this case.

The code responsible was:

      if (/* MEM_NOTRAP_P only relates to the actual position of the memory
	     reference; moving it out of context such as when moving code
	     when optimizing, might cause its address to become invalid.  */
	  code_changed
	  || !MEM_NOTRAP_P (x))
	{
	  poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
	  return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size,
					GET_MODE (x), code_changed);
	}

where code_changed is true.  (Arguably it doesn't need to be true in
this case, if we inserted invariants on the preheader edge, but it
would still need to be true for conditionally executed loads.)

Normally this wouldn't be a problem, since rtx_addr_can_trap_p_1
would recognise that the address refers to the constant pool.
However, the SVE load-and-replicate instructions have a limited
offset range, so it isn't possible for them to have a LO_SUM address.
All we have is a plain pseudo base register.

MEM_READONLY_P is defined as:

/* 1 if RTX is a mem that is statically allocated in read-only memory.  */
  (RTL_FLAG_CHECK1 ("MEM_READONLY_P", (RTX), MEM)->unchanging)

and so I think it should be safe to move memory references if both
MEM_READONLY_P and MEM_NOTRAP_P are true.

The testcase isn't a minimal reproducer, but I think it's good
to have a realistic full routine in the testsuite.

Bootstrapped & regression-tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
	PR rtl-optimization/116145
	* rtlanal.cc (may_trap_p_1): Trust MEM_NOTRAP_P even for code
	movement if MEM_READONLY_P is also true.

gcc/testsuite/
	PR rtl-optimization/116145
	* gcc.target/aarch64/sve/acle/general/pr116145.c: New test.
---
 gcc/rtlanal.cc                                | 14 ++++--
 .../aarch64/sve/acle/general/pr116145.c       | 46 +++++++++++++++++++
 2 files changed, 56 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
  

Comments

Richard Biener Aug. 1, 2024, 12:49 p.m. UTC | #1
On Wed, Jul 31, 2024 at 6:41 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> The testcase contains the constant:
>
>   arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
>
> which was initially hoisted by hand, but which gimple optimisers later
> propagated to each use (as expected).  The constant was then expanded
> as a load-and-duplicate from the constant pool.  Normally that load
> should then be hoisted back out of the loop, but may_trap_or_fault_p
> stopped that from happening in this case.
>
> The code responsible was:
>
>       if (/* MEM_NOTRAP_P only relates to the actual position of the memory
>              reference; moving it out of context such as when moving code
>              when optimizing, might cause its address to become invalid.  */
>           code_changed
>           || !MEM_NOTRAP_P (x))
>         {
>           poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
>           return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size,
>                                         GET_MODE (x), code_changed);
>         }
>
> where code_changed is true.  (Arguably it doesn't need to be true in
> this case, if we inserted invariants on the preheader edge, but it
> would still need to be true for conditionally executed loads.)
>
> Normally this wouldn't be a problem, since rtx_addr_can_trap_p_1
> would recognise that the address refers to the constant pool.
> However, the SVE load-and-replicate instructions have a limited
> offset range, so it isn't possible for them to have a LO_SUM address.
> All we have is a plain pseudo base register.
>
> MEM_READONLY_P is defined as:
>
> /* 1 if RTX is a mem that is statically allocated in read-only memory.  */
>   (RTL_FLAG_CHECK1 ("MEM_READONLY_P", (RTX), MEM)->unchanging)
>
> and so I think it should be safe to move memory references if both
> MEM_READONLY_P and MEM_NOTRAP_P are true.
>
> The testcase isn't a minimal reproducer, but I think it's good
> to have a realistic full routine in the testsuite.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu.  OK to install?

OK.

Thanks,
Richard.

> Richard
>
>
> gcc/
>         PR rtl-optimization/116145
>         * rtlanal.cc (may_trap_p_1): Trust MEM_NOTRAP_P even for code
>         movement if MEM_READONLY_P is also true.
>
> gcc/testsuite/
>         PR rtl-optimization/116145
>         * gcc.target/aarch64/sve/acle/general/pr116145.c: New test.
> ---
>  gcc/rtlanal.cc                                | 14 ++++--
>  .../aarch64/sve/acle/general/pr116145.c       | 46 +++++++++++++++++++
>  2 files changed, 56 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
>
> diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
> index 4158a531bdd..893a6afbbc5 100644
> --- a/gcc/rtlanal.cc
> +++ b/gcc/rtlanal.cc
> @@ -3152,10 +3152,16 @@ may_trap_p_1 (const_rtx x, unsigned flags)
>           && MEM_VOLATILE_P (x)
>           && XEXP (x, 0) == stack_pointer_rtx)
>         return true;
> -      if (/* MEM_NOTRAP_P only relates to the actual position of the memory
> -            reference; moving it out of context such as when moving code
> -            when optimizing, might cause its address to become invalid.  */
> -         code_changed
> +      if (/* MEM_READONLY_P means that the memory is both statically
> +            allocated and readonly, so MEM_NOTRAP_P should remain true
> +            even if the memory reference is moved.  This is certainly
> +            true for the important case of force_const_mem.
> +
> +            Otherwise, MEM_NOTRAP_P only relates to the actual position
> +            of the memory reference; moving it out of context such as
> +            when moving code when optimizing, might cause its address
> +            to become invalid.  */
> +         (code_changed && !MEM_READONLY_P (x))
>           || !MEM_NOTRAP_P (x))
>         {
>           poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
> new file mode 100644
> index 00000000000..a3d93d3e1c8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
> @@ -0,0 +1,46 @@
> +// { dg-options "-O2" }
> +
> +#include <stdlib.h>
> +#include <arm_sve.h>
> +
> +#pragma GCC target "+sve2"
> +
> +typedef unsigned char uchar;
> +
> +const uchar *
> +search_line_fast (const uchar *s, const uchar *end)
> +{
> +  size_t VL = svcntb();
> +  svuint8_t arr1, arr2;
> +  svbool_t pc, pg = svptrue_b8();
> +
> +  // This should not be loaded inside the loop every time.
> +  arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
> +
> +  for (; s+VL <= end; s += VL) {
> +    arr1 = svld1_u8(pg, s);
> +    pc = svmatch_u8(pg, arr1, arr2);
> +
> +    if (svptest_any(pg, pc)) {
> +      pc = svbrkb_z(pg, pc);
> +      return s+svcntp_b8(pg, pc);
> +    }
> +  }
> +
> +  // Handle remainder.
> +  if (s < end) {
> +    pg = svwhilelt_b8((size_t)s, (size_t)end);
> +
> +    arr1 = svld1_u8(pg, s);
> +    pc = svmatch_u8(pg, arr1, arr2);
> +
> +    if (svptest_any(pg, pc)) {
> +      pc = svbrkb_z(pg, pc);
> +      return s+svcntp_b8(pg, pc);
> +    }
> +  }
> +
> +  return end;
> +}
> +
> +// { dg-final { scan-assembler {:\n\tld1b\t[^\n]*\n\tmatch\t[^\n]*\n\tb\.} } }
> --
> 2.25.1
>
  
Andrew Pinski Aug. 2, 2024, 8:23 p.m. UTC | #2
On Wed, Jul 31, 2024 at 9:41 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> The testcase contains the constant:
>
>   arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
>
> which was initially hoisted by hand, but which gimple optimisers later
> propagated to each use (as expected).  The constant was then expanded
> as a load-and-duplicate from the constant pool.  Normally that load
> should then be hoisted back out of the loop, but may_trap_or_fault_p
> stopped that from happening in this case.
>
> The code responsible was:
>
>       if (/* MEM_NOTRAP_P only relates to the actual position of the memory
>              reference; moving it out of context such as when moving code
>              when optimizing, might cause its address to become invalid.  */
>           code_changed
>           || !MEM_NOTRAP_P (x))
>         {
>           poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
>           return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size,
>                                         GET_MODE (x), code_changed);
>         }
>
> where code_changed is true.  (Arguably it doesn't need to be true in
> this case, if we inserted invariants on the preheader edge, but it
> would still need to be true for conditionally executed loads.)
>
> Normally this wouldn't be a problem, since rtx_addr_can_trap_p_1
> would recognise that the address refers to the constant pool.
> However, the SVE load-and-replicate instructions have a limited
> offset range, so it isn't possible for them to have a LO_SUM address.
> All we have is a plain pseudo base register.
>
> MEM_READONLY_P is defined as:
>
> /* 1 if RTX is a mem that is statically allocated in read-only memory.  */
>   (RTL_FLAG_CHECK1 ("MEM_READONLY_P", (RTX), MEM)->unchanging)
>
> and so I think it should be safe to move memory references if both
> MEM_READONLY_P and MEM_NOTRAP_P are true.
>
> The testcase isn't a minimal reproducer, but I think it's good
> to have a realistic full routine in the testsuite.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu.  OK to install?


This is breaking the build on a few targets (x86_64 and powerpc64le so
far reported, see PR 116200).

From what I can tell is that it is treating `(plus:DI (ashift:DI
(reg:DI 0 ax [690]) (const_int 3 [0x3]))  (label_ref:DI 1620))` as not
trapping and allowing it to be moved before the check of ax being in
the range [0..2] and we have eax being (unsigned long)(unsigned int)-9
in value. So we get a bogus address which will trap. I put my findings
in PR 116200 too.

Thanks,
Andrew Pinski


>
> Richard
>
>
> gcc/
>         PR rtl-optimization/116145
>         * rtlanal.cc (may_trap_p_1): Trust MEM_NOTRAP_P even for code
>         movement if MEM_READONLY_P is also true.
>
> gcc/testsuite/
>         PR rtl-optimization/116145
>         * gcc.target/aarch64/sve/acle/general/pr116145.c: New test.
> ---
>  gcc/rtlanal.cc                                | 14 ++++--
>  .../aarch64/sve/acle/general/pr116145.c       | 46 +++++++++++++++++++
>  2 files changed, 56 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
>
> diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
> index 4158a531bdd..893a6afbbc5 100644
> --- a/gcc/rtlanal.cc
> +++ b/gcc/rtlanal.cc
> @@ -3152,10 +3152,16 @@ may_trap_p_1 (const_rtx x, unsigned flags)
>           && MEM_VOLATILE_P (x)
>           && XEXP (x, 0) == stack_pointer_rtx)
>         return true;
> -      if (/* MEM_NOTRAP_P only relates to the actual position of the memory
> -            reference; moving it out of context such as when moving code
> -            when optimizing, might cause its address to become invalid.  */
> -         code_changed
> +      if (/* MEM_READONLY_P means that the memory is both statically
> +            allocated and readonly, so MEM_NOTRAP_P should remain true
> +            even if the memory reference is moved.  This is certainly
> +            true for the important case of force_const_mem.
> +
> +            Otherwise, MEM_NOTRAP_P only relates to the actual position
> +            of the memory reference; moving it out of context such as
> +            when moving code when optimizing, might cause its address
> +            to become invalid.  */
> +         (code_changed && !MEM_READONLY_P (x))
>           || !MEM_NOTRAP_P (x))
>         {
>           poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
> new file mode 100644
> index 00000000000..a3d93d3e1c8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
> @@ -0,0 +1,46 @@
> +// { dg-options "-O2" }
> +
> +#include <stdlib.h>
> +#include <arm_sve.h>
> +
> +#pragma GCC target "+sve2"
> +
> +typedef unsigned char uchar;
> +
> +const uchar *
> +search_line_fast (const uchar *s, const uchar *end)
> +{
> +  size_t VL = svcntb();
> +  svuint8_t arr1, arr2;
> +  svbool_t pc, pg = svptrue_b8();
> +
> +  // This should not be loaded inside the loop every time.
> +  arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
> +
> +  for (; s+VL <= end; s += VL) {
> +    arr1 = svld1_u8(pg, s);
> +    pc = svmatch_u8(pg, arr1, arr2);
> +
> +    if (svptest_any(pg, pc)) {
> +      pc = svbrkb_z(pg, pc);
> +      return s+svcntp_b8(pg, pc);
> +    }
> +  }
> +
> +  // Handle remainder.
> +  if (s < end) {
> +    pg = svwhilelt_b8((size_t)s, (size_t)end);
> +
> +    arr1 = svld1_u8(pg, s);
> +    pc = svmatch_u8(pg, arr1, arr2);
> +
> +    if (svptest_any(pg, pc)) {
> +      pc = svbrkb_z(pg, pc);
> +      return s+svcntp_b8(pg, pc);
> +    }
> +  }
> +
> +  return end;
> +}
> +
> +// { dg-final { scan-assembler {:\n\tld1b\t[^\n]*\n\tmatch\t[^\n]*\n\tb\.} } }
> --
> 2.25.1
>
  
Jeff Law Aug. 4, 2024, 3:45 p.m. UTC | #3
On 8/2/24 2:23 PM, Andrew Pinski wrote:
> On Wed, Jul 31, 2024 at 9:41 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> The testcase contains the constant:
>>
>>    arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
>>
>> which was initially hoisted by hand, but which gimple optimisers later
>> propagated to each use (as expected).  The constant was then expanded
>> as a load-and-duplicate from the constant pool.  Normally that load
>> should then be hoisted back out of the loop, but may_trap_or_fault_p
>> stopped that from happening in this case.
>>
>> The code responsible was:
>>
>>        if (/* MEM_NOTRAP_P only relates to the actual position of the memory
>>               reference; moving it out of context such as when moving code
>>               when optimizing, might cause its address to become invalid.  */
>>            code_changed
>>            || !MEM_NOTRAP_P (x))
>>          {
>>            poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
>>            return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size,
>>                                          GET_MODE (x), code_changed);
>>          }
>>
>> where code_changed is true.  (Arguably it doesn't need to be true in
>> this case, if we inserted invariants on the preheader edge, but it
>> would still need to be true for conditionally executed loads.)
>>
>> Normally this wouldn't be a problem, since rtx_addr_can_trap_p_1
>> would recognise that the address refers to the constant pool.
>> However, the SVE load-and-replicate instructions have a limited
>> offset range, so it isn't possible for them to have a LO_SUM address.
>> All we have is a plain pseudo base register.
>>
>> MEM_READONLY_P is defined as:
>>
>> /* 1 if RTX is a mem that is statically allocated in read-only memory.  */
>>    (RTL_FLAG_CHECK1 ("MEM_READONLY_P", (RTX), MEM)->unchanging)
>>
>> and so I think it should be safe to move memory references if both
>> MEM_READONLY_P and MEM_NOTRAP_P are true.
>>
>> The testcase isn't a minimal reproducer, but I think it's good
>> to have a realistic full routine in the testsuite.
>>
>> Bootstrapped & regression-tested on aarch64-linux-gnu.  OK to install?
> 
> 
> This is breaking the build on a few targets (x86_64 and powerpc64le so
> far reported, see PR 116200).
> 
>  From what I can tell is that it is treating `(plus:DI (ashift:DI
> (reg:DI 0 ax [690]) (const_int 3 [0x3]))  (label_ref:DI 1620))` as not
> trapping and allowing it to be moved before the check of ax being in
> the range [0..2] and we have eax being (unsigned long)(unsigned int)-9
> in value. So we get a bogus address which will trap. I put my findings
> in PR 116200 too.
I think it's the root cause of the x86_64 libgomp failures on the trunk 
as well.  I haven't done any debugging beyond that.

jeff
  
Richard Sandiford Aug. 5, 2024, 9:05 a.m. UTC | #4
Jeff Law <jeffreyalaw@gmail.com> writes:
> On 8/2/24 2:23 PM, Andrew Pinski wrote:
>> On Wed, Jul 31, 2024 at 9:41 AM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>>>
>>> The testcase contains the constant:
>>>
>>>    arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
>>>
>>> which was initially hoisted by hand, but which gimple optimisers later
>>> propagated to each use (as expected).  The constant was then expanded
>>> as a load-and-duplicate from the constant pool.  Normally that load
>>> should then be hoisted back out of the loop, but may_trap_or_fault_p
>>> stopped that from happening in this case.
>>>
>>> The code responsible was:
>>>
>>>        if (/* MEM_NOTRAP_P only relates to the actual position of the memory
>>>               reference; moving it out of context such as when moving code
>>>               when optimizing, might cause its address to become invalid.  */
>>>            code_changed
>>>            || !MEM_NOTRAP_P (x))
>>>          {
>>>            poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
>>>            return rtx_addr_can_trap_p_1 (XEXP (x, 0), 0, size,
>>>                                          GET_MODE (x), code_changed);
>>>          }
>>>
>>> where code_changed is true.  (Arguably it doesn't need to be true in
>>> this case, if we inserted invariants on the preheader edge, but it
>>> would still need to be true for conditionally executed loads.)
>>>
>>> Normally this wouldn't be a problem, since rtx_addr_can_trap_p_1
>>> would recognise that the address refers to the constant pool.
>>> However, the SVE load-and-replicate instructions have a limited
>>> offset range, so it isn't possible for them to have a LO_SUM address.
>>> All we have is a plain pseudo base register.
>>>
>>> MEM_READONLY_P is defined as:
>>>
>>> /* 1 if RTX is a mem that is statically allocated in read-only memory.  */
>>>    (RTL_FLAG_CHECK1 ("MEM_READONLY_P", (RTX), MEM)->unchanging)
>>>
>>> and so I think it should be safe to move memory references if both
>>> MEM_READONLY_P and MEM_NOTRAP_P are true.
>>>
>>> The testcase isn't a minimal reproducer, but I think it's good
>>> to have a realistic full routine in the testsuite.
>>>
>>> Bootstrapped & regression-tested on aarch64-linux-gnu.  OK to install?
>> 
>> 
>> This is breaking the build on a few targets (x86_64 and powerpc64le so
>> far reported, see PR 116200).
>> 
>>  From what I can tell is that it is treating `(plus:DI (ashift:DI
>> (reg:DI 0 ax [690]) (const_int 3 [0x3]))  (label_ref:DI 1620))` as not
>> trapping and allowing it to be moved before the check of ax being in
>> the range [0..2] and we have eax being (unsigned long)(unsigned int)-9
>> in value. So we get a bogus address which will trap. I put my findings
>> in PR 116200 too.
> I think it's the root cause of the x86_64 libgomp failures on the trunk 
> as well.  I haven't done any debugging beyond that.

Sorry for the breakage.  I've reverted the patch.

Richard
  

Patch

diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index 4158a531bdd..893a6afbbc5 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -3152,10 +3152,16 @@  may_trap_p_1 (const_rtx x, unsigned flags)
 	  && MEM_VOLATILE_P (x)
 	  && XEXP (x, 0) == stack_pointer_rtx)
 	return true;
-      if (/* MEM_NOTRAP_P only relates to the actual position of the memory
-	     reference; moving it out of context such as when moving code
-	     when optimizing, might cause its address to become invalid.  */
-	  code_changed
+      if (/* MEM_READONLY_P means that the memory is both statically
+	     allocated and readonly, so MEM_NOTRAP_P should remain true
+	     even if the memory reference is moved.  This is certainly
+	     true for the important case of force_const_mem.
+
+	     Otherwise, MEM_NOTRAP_P only relates to the actual position
+	     of the memory reference; moving it out of context such as
+	     when moving code when optimizing, might cause its address
+	     to become invalid.  */
+	  (code_changed && !MEM_READONLY_P (x))
 	  || !MEM_NOTRAP_P (x))
 	{
 	  poly_int64 size = MEM_SIZE_KNOWN_P (x) ? MEM_SIZE (x) : -1;
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
new file mode 100644
index 00000000000..a3d93d3e1c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/pr116145.c
@@ -0,0 +1,46 @@ 
+// { dg-options "-O2" }
+
+#include <stdlib.h>
+#include <arm_sve.h>
+
+#pragma GCC target "+sve2"
+
+typedef unsigned char uchar;
+
+const uchar *
+search_line_fast (const uchar *s, const uchar *end)
+{
+  size_t VL = svcntb();
+  svuint8_t arr1, arr2;
+  svbool_t pc, pg = svptrue_b8();
+
+  // This should not be loaded inside the loop every time.
+  arr2 = svreinterpret_u8(svdup_u32(0x0a0d5c3f));
+
+  for (; s+VL <= end; s += VL) {
+    arr1 = svld1_u8(pg, s);
+    pc = svmatch_u8(pg, arr1, arr2);
+
+    if (svptest_any(pg, pc)) {
+      pc = svbrkb_z(pg, pc);
+      return s+svcntp_b8(pg, pc);
+    }
+  }
+
+  // Handle remainder.
+  if (s < end) {
+    pg = svwhilelt_b8((size_t)s, (size_t)end);
+
+    arr1 = svld1_u8(pg, s);
+    pc = svmatch_u8(pg, arr1, arr2);
+
+    if (svptest_any(pg, pc)) {
+      pc = svbrkb_z(pg, pc);
+      return s+svcntp_b8(pg, pc);
+    }
+  }
+
+  return end;
+}
+
+// { dg-final { scan-assembler {:\n\tld1b\t[^\n]*\n\tmatch\t[^\n]*\n\tb\.} } }