Patchwork Improve string benchtest timing

Submitter Wilco Dijkstra
Date May 8, 2019, 3:39 p.m.
Message ID <VI1PR0801MB21278698F84319E7D2B0A1D983320@VI1PR0801MB2127.eurprd08.prod.outlook.com>
Permalink /patch/32606/
State New

Comments

Wilco Dijkstra - May 8, 2019, 3:39 p.m.
Improve string benchtest timing.  Many tests run for 0.01s which is way too
short to give accurate results.  Other tests take over 40 seconds which is
way too long.  Significantly increase the iterations of the short running
tests.  Reduce number of alignment variations in the long running memcpy walk
tests so they take less than 5 seconds.

As a result most tests take at least 0.1s and all finish within 5 seconds.

OK for commit?

2019-05-08  Wilco Dijkstra  <wdijkstr@arm.com>

        * benchtests/bench-memcpy-random.c (do_one_test): Use medium iterations.
        * benchtests/bench-memcpy-walk.c (test_main): Reduce alignment tests.
        * benchtests/bench-memmem.c (do_one_test): Use small iterations.
        * benchtests/bench-memmove-walk.c (test_main): Reduce alignment tests.
        * benchtests/bench-memset-walk.c (test_main): Reduce alignment tests.
        * benchtests/bench-strcasestr.c (do_one_test): Use small iterations.
        * benchtests/bench-string.h (INNER_LOOP_ITERS): Increase iterations.
        (INNER_LOOP_ITERS_MEDIUM): New define.
        (INNER_LOOP_ITERS_SMALL): New define.
        * benchtests/bench-strpbrk.c (do_one_test): Use medium iterations.
        * benchtests/bench-strsep.c (do_one_test): Use small iterations.
        * benchtests/bench-strspn.c (do_one_test): Use medium iterations.
        * benchtests/bench-strstr.c (do_one_test): Use small iterations.
        * benchtests/bench-strtok.c (do_one_test): Use small iterations.

--
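
For reference, the iteration counts the patch changes can be restated as plain constants. This is a sketch derived from the values in the diff (bench-memcpy-random previously multiplied INNER_LOOP_ITERS by 20), not code from glibc itself:

```c
/* Iteration counts before and after the patch, restated from the diff.
   The old default INNER_LOOP_ITERS was 64; the patch raises it to 8192
   and introduces MEDIUM and SMALL tiers for longer-running tests.  */
enum
{
  OLD_INNER_LOOP_ITERS = 64,
  OLD_MEMCPY_RANDOM_ITERS = OLD_INNER_LOOP_ITERS * 20, /* 1280 */
  NEW_INNER_LOOP_ITERS = 8192,
  NEW_INNER_LOOP_ITERS_MEDIUM = 1024,
  NEW_INNER_LOOP_ITERS_SMALL = 32,
};
```

Note that the default tier grows by a factor of 128, while the SMALL tier sits below the old default; tests moved to SMALL are the ones whose per-call work is large.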
Adhemerval Zanella Netto - May 20, 2019, 6:44 p.m.
On 08/05/2019 12:39, Wilco Dijkstra wrote:
> Improve string benchtest timing.  Many tests run for 0.01s which is way too
> short to give accurate results.  Other tests take over 40 seconds which is
> way too long.  Significantly increase the iterations of the short running
> tests.  Reduce number of alignment variations in the long running memcpy walk
> tests so they take less than 5 seconds.
> 
> As a result most tests take at least 0.1s and all finish within 5 seconds.
> 
> OK for commit?
> 
> 2019-05-08  Wilco Dijkstra  <wdijkstr@arm.com>
> 
>         * benchtests/bench-memcpy-random.c (do_one_test): Use medium iterations.
>         * benchtests/bench-memcpy-walk.c (test_main): Reduce alignment tests.
>         * benchtests/bench-memmem.c (do_one_test): Use small iterations.
>         * benchtests/bench-memmove-walk.c (test_main): Reduce alignment tests.
>         * benchtests/bench-memset-walk.c (test_main): Reduce alignment tests.
>         * benchtests/bench-strcasestr.c (do_one_test): Use small iterations.
>         * benchtests/bench-string.h (INNER_LOOP_ITERS): Increase iterations.
>         (INNER_LOOP_ITERS_MEDIUM): New define.
>         (INNER_LOOP_ITERS_SMALL): New define.
>         * benchtests/bench-strpbrk.c (do_one_test): Use medium iterations.
>         * benchtests/bench-strsep.c (do_one_test): Use small iterations.
>         * benchtests/bench-strspn.c (do_one_test): Use medium iterations.
>         * benchtests/bench-strstr.c (do_one_test): Use small iterations.
>         * benchtests/bench-strtok.c (do_one_test): Use small iterations.
> 
> --
> 
> diff --git a/benchtests/bench-memcpy-random.c b/benchtests/bench-memcpy-random.c
> index 873133d9a6988af8a251e1035ab8bded277d5775..f2c2e9dc3db29ca269514874a4976eec13d9d685 100644
> --- a/benchtests/bench-memcpy-random.c
> +++ b/benchtests/bench-memcpy-random.c
> @@ -101,7 +101,7 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, char *dst, char *src,
>  	     copy_t *copy, size_t n)
>  {
>    timing_t start, stop, cur;
> -  size_t iters = INNER_LOOP_ITERS * 20;
> +  size_t iters = INNER_LOOP_ITERS_MEDIUM;
>  
>    TIMING_NOW (start);
>    for (int i = 0; i < iters; ++i)

Ok.

> diff --git a/benchtests/bench-memcpy-walk.c b/benchtests/bench-memcpy-walk.c
> index 16b455f7796b9b3b9288fb6e5245bed006967970..397a1e98df4f7bd069865262470c40a58665bcd6 100644
> --- a/benchtests/bench-memcpy-walk.c
> +++ b/benchtests/bench-memcpy-walk.c
> @@ -103,12 +103,8 @@ test_main (void)
>    json_array_begin (&json_ctx, "results");
>    for (size_t i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
>      {
> -      /* Test length alignments from 0-16 bytes.  */
> -      for (int j = 0; j < 8; j++)
> -	{
> -	  do_test (&json_ctx, i + j);
> -	  do_test (&json_ctx, i + 16 - j);
> -	}
> +      do_test (&json_ctx, i);
> +      do_test (&json_ctx, i + 1);
>      }
>  
>    json_array_end (&json_ctx);

Reading the initial benchmark comment, it seems the original intention is
indeed to measure some sizes different from multiples of START_SIZE
(my guess is to measure some tail handling in memcpy).  So either adjust
the comment to reflect what the benchmark now measures, or maybe add
some sizes that are not multiples of START_SIZE.

Siddhesh, do you recall why you added the 'misaligning' sizes?
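
To make the reduction concrete, here is a small stand-alone sketch (the helper names are hypothetical) mirroring the loop the patch removes: for each power-of-two size i, the old code benchmarked sixteen lengths, the new code only two.

```c
#include <stddef.h>

/* Lengths the old bench-memcpy-walk inner loop generated for one
   power-of-two size I: j = 0..7 yields I+j and I+16-j, i.e. I..I+7
   and I+16 down to I+9 (length I+8 was never tested).  */
static size_t
old_lengths (size_t i, size_t out[16])
{
  size_t n = 0;
  for (int j = 0; j < 8; j++)
    {
      out[n++] = i + j;
      out[n++] = i + 16 - j;
    }
  return n;
}

/* The patched loop keeps only I and I+1.  */
static size_t
new_lengths (size_t i, size_t out[2])
{
  out[0] = i;
  out[1] = i + 1;
  return 2;
}
```

So each power-of-two step drops from 16 do_test calls to 2, an 8x reduction in runtime for these walk benchmarks.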

> diff --git a/benchtests/bench-memmem.c b/benchtests/bench-memmem.c
> index b6b97f3d1f5f94a71c177cc516cba81c869d3865..42778b07658efbe9d8027e04dc4682aba5ab8797 100644
> --- a/benchtests/bench-memmem.c
> +++ b/benchtests/bench-memmem.c
> @@ -105,7 +105,7 @@ static void
>  do_one_test (impl_t *impl, const void *haystack, size_t haystack_len,
>  	     const void *needle, size_t needle_len, const void *expected)
>  {
> -  size_t i, iters = INNER_LOOP_ITERS;
> +  size_t i, iters = INNER_LOOP_ITERS_SMALL;
>    timing_t start, stop, cur;
>  
>    TIMING_NOW (start);

Ok.

> diff --git a/benchtests/bench-memmove-walk.c b/benchtests/bench-memmove-walk.c
> index 7b4c31e63f6f3d39beb8ec8da8bfffc82f24d1ac..2b4a7e4f2966b930a8488dd02647639ffbfddc73 100644
> --- a/benchtests/bench-memmove-walk.c
> +++ b/benchtests/bench-memmove-walk.c
> @@ -107,23 +107,15 @@ test_main (void)
>    /* Non-overlapping buffers.  */
>    for (size_t i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
>      {
> -      /* Test length alignments from 0-16 bytes.  */
> -      for (int j = 0; j < 8; j++)
> -	{
> -	  do_test (&json_ctx, i + j, false);
> -	  do_test (&json_ctx, i + 16 - j, false);
> -	}
> +      do_test (&json_ctx, i, false);
> +      do_test (&json_ctx, i + 1, false);
>      }
>  
>    /* Overlapping buffers.  */
>    for (size_t i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
>      {
> -      /* Test length alignments from 0-16 bytes.  */
> -      for (int j = 0; j < 8; j++)
> -	{
> -	  do_test (&json_ctx, i + j, true);
> -	  do_test (&json_ctx, i + 16 - j, true);
> -	}
> +      do_test (&json_ctx, i, true);
> +      do_test (&json_ctx, i + 1, true);
>      }
>  
>    json_array_end (&json_ctx);

Same comment from bench-memcpy-walk.c.

> diff --git a/benchtests/bench-memset-walk.c b/benchtests/bench-memset-walk.c
> index 5f43671216e2294856df35a865061740de7d7d67..930aae34d10603c80a7c40d123fb0376cde56436 100644
> --- a/benchtests/bench-memset-walk.c
> +++ b/benchtests/bench-memset-walk.c
> @@ -112,13 +112,16 @@ test_main (void)
>  
>    json_array_begin (&json_ctx, "results");
>    for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
> -      /* Test length alignments from 0-16 bytes.  */
> -      for (int j = 0; j < i && j < 16; j++)
> -	do_test (&json_ctx, 65, i + j);
> +    {
> +      do_test (&json_ctx, 65, i);
> +      do_test (&json_ctx, 65, i + 1);
> +    }
>  
>    for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
> -      for (int j = 0; j < i && j < 16; j++)
> -	do_test (&json_ctx, 0, i + j);
> +    {
> +      do_test (&json_ctx, 0, i);
> +      do_test (&json_ctx, 0, i + 1);
> +    }
>  
>    json_array_end (&json_ctx);
>    json_attr_object_end (&json_ctx);

I don't have a strong opinion here, but maybe use the size of uintptr_t
to add some 'misaligning' to the size?
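
One possible reading of that suggestion (the helper name below is hypothetical, not part of the patch): perturb each length by the word size, so that tail handling is still exercised while keeping only two calls per power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the reduced loop body: alongside length I,
   test a length offset by the word size, I + sizeof (uintptr_t) - 1,
   so the copy tail is not a whole number of words.  */
static size_t
word_misaligned_length (size_t i)
{
  return i + sizeof (uintptr_t) - 1;
}
```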

> diff --git a/benchtests/bench-strcasestr.c b/benchtests/bench-strcasestr.c
> index 1458070d47cbfe5e27d40288f8ac7f83df735d08..d701ff4f364b35bd5223831fc8df7b468e183ceb 100644
> --- a/benchtests/bench-strcasestr.c
> +++ b/benchtests/bench-strcasestr.c
> @@ -36,7 +36,7 @@ IMPL (strcasestr, 1)
>  static void
>  do_one_test (impl_t *impl, const char *s1, const char *s2, char *exp_result)
>  {
> -  size_t i, iters = INNER_LOOP_ITERS;
> +  size_t i, iters = INNER_LOOP_ITERS_SMALL;
>    timing_t start, stop, cur;
>  
>    TIMING_NOW (start);

Ok.

> diff --git a/benchtests/bench-string.h b/benchtests/bench-string.h
> index 9d3332a624c54b22884b2f90444c841a67f4baeb..e7fb015bc639d97e702e2c0e3c9aee86c9537e2c 100644
> --- a/benchtests/bench-string.h
> +++ b/benchtests/bench-string.h
> @@ -124,7 +124,9 @@ extern impl_t __start_impls[], __stop_impls[];
>  # define OPT_RANDOM 10001
>  # define OPT_SEED 10002
>  
> -# define INNER_LOOP_ITERS 64
> +# define INNER_LOOP_ITERS 8192
> +# define INNER_LOOP_ITERS_MEDIUM 1024
> +# define INNER_LOOP_ITERS_SMALL 32
>  
>  int ret, do_srandom;
>  unsigned int seed;

Ok.

> diff --git a/benchtests/bench-strpbrk.c b/benchtests/bench-strpbrk.c
> index 14e1bcb873fa86caa1f50a9013f52cd422e74185..e484ebd0c08df802e67578e613535f65ad233bdc 100644
> --- a/benchtests/bench-strpbrk.c
> +++ b/benchtests/bench-strpbrk.c
> @@ -66,7 +66,7 @@ static void
>  do_one_test (impl_t *impl, const CHAR *s, const CHAR *rej, RES_TYPE exp_res)
>  {
>    RES_TYPE res = CALL (impl, s, rej);
> -  size_t i, iters = INNER_LOOP_ITERS;
> +  size_t i, iters = INNER_LOOP_ITERS_MEDIUM;
>    timing_t start, stop, cur;
>  
>    if (res != exp_res)

Ok.

> diff --git a/benchtests/bench-strsep.c b/benchtests/bench-strsep.c
> index fea7fbdfa6f2dcc8601354f56558dc3d83f07eef..96d3be52221f40ed1afeda1896921a04215a6b85 100644
> --- a/benchtests/bench-strsep.c
> +++ b/benchtests/bench-strsep.c
> @@ -103,7 +103,7 @@ IMPL (oldstrsep, 2)
>  static void
>  do_one_test (impl_t * impl, const char *s1, const char *s2)
>  {
> -  size_t i, iters = INNER_LOOP_ITERS;
> +  size_t i, iters = INNER_LOOP_ITERS_SMALL;
>    timing_t start, stop, cur;
>  
>    TIMING_NOW (start);

Ok.

> diff --git a/benchtests/bench-strspn.c b/benchtests/bench-strspn.c
> index ac874fe234af66ad98a6188099f32c10484452db..ed97c5fc6205825531cb7e0b6f349223be0c3f66 100644
> --- a/benchtests/bench-strspn.c
> +++ b/benchtests/bench-strspn.c
> @@ -60,7 +60,7 @@ SIMPLE_STRSPN (const CHAR *s, const CHAR *acc)
>  static void
>  do_one_test (impl_t *impl, const CHAR *s, const CHAR *acc, size_t exp_res)
>  {
> -  size_t res = CALL (impl, s, acc), i, iters = INNER_LOOP_ITERS;
> +  size_t res = CALL (impl, s, acc), i, iters = INNER_LOOP_ITERS_MEDIUM;
>    timing_t start, stop, cur;
>  
>    if (res != exp_res)

Ok.

> diff --git a/benchtests/bench-strstr.c b/benchtests/bench-strstr.c
> index 6e45cb7c901fef4319ad62c8ba8ebb9109f12c28..1769dc2a102b3978ea51130e2dc7bb91c8b8947c 100644
> --- a/benchtests/bench-strstr.c
> +++ b/benchtests/bench-strstr.c
> @@ -131,7 +131,7 @@ IMPL (basic_strstr, 0)
>  static void
>  do_one_test (impl_t *impl, const char *s1, const char *s2, char *exp_result)
>  {
> -  size_t i, iters = INNER_LOOP_ITERS;
> +  size_t i, iters = INNER_LOOP_ITERS_SMALL;
>    timing_t start, stop, cur;
>    char *res;
>  

Ok.

> diff --git a/benchtests/bench-strtok.c b/benchtests/bench-strtok.c
> index 15ae76e2373792781930a3d73418ef81fecfe645..f5ab587dd94ab5035632fbf78f4058cc62d72acb 100644
> --- a/benchtests/bench-strtok.c
> +++ b/benchtests/bench-strtok.c
> @@ -60,7 +60,7 @@ IMPL (strtok, 1)
>  static void
>  do_one_test (impl_t * impl, const char *s1, const char *s2)
>  {
> -  size_t i, iters = INNER_LOOP_ITERS;
> +  size_t i, iters = INNER_LOOP_ITERS_SMALL;
>    timing_t start, stop, cur;
>    TIMING_NOW (start);
>    for (i = 0; i < iters; ++i)
> 

Ok.
Siddhesh Poyarekar - May 21, 2019, 7:01 a.m.
On 21/05/19 12:14 AM, Adhemerval Zanella wrote:
> Reading the initial benchmark comment, it seems the original intention is
> indeed to measure some sizes different from multiples of START_SIZE
> (my guess is to measure some tail handling in memcpy).  So either adjust
> the comment to reflect what the benchmark now measures, or maybe add
> some sizes that are not multiples of START_SIZE.
> 
> Siddhesh, do you recall why you added the 'misaligning' sizes?

The misalignment was a simulation of a proprietary workload that was
used to tune memcpy when I wrote it.  This proposed change skews the
numbers in favour of aligned inputs, which is not really a bad thing.
I'm OK with this bit.

Siddhesh
Wilco Dijkstra - May 21, 2019, 1:58 p.m.
Hi Siddhesh,

>> Reading the initial benchmark comment, it seems the original intention is
>> indeed to measure some sizes different from multiples of START_SIZE
>> (my guess is to measure some tail handling in memcpy).  So either adjust
>> the comment to reflect what the benchmark now measures, or maybe add
>> some sizes that are not multiples of START_SIZE.
>> 
>> Siddhesh, do you recall why you added the 'misaligning' sizes?
>
> The misalignment was a simulation of a proprietary workload that was
> used to tune memcpy when I wrote it.  This proposed change skews the
> numbers in favour of aligned inputs, which is not really a bad thing.
> I'm OK with this bit.

Well the test doesn't actually test misaligned copies - both source and
destination are always mutually aligned, so any memcpy implementation which
aligns either source or destination will only do aligned copies.

In any case I'm not sure what the test is supposed to measure - the scores are
identical across all memcpy implementations. The time taken for double the
copy size is exactly twice as much.

Wilco
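
Mutual alignment, as used above, means the two pointers differ by a multiple of the alignment of interest; a small sketch (a hypothetical helper, not glibc code) makes the distinction checkable:

```c
#include <stdint.h>

/* Two pointers are mutually aligned for ALIGN (a power of two) when
   their addresses differ by a multiple of ALIGN: an implementation
   that aligns one of them then finds the other aligned as well,
   regardless of either pointer's absolute alignment.  */
static int
mutually_aligned (const void *a, const void *b, uintptr_t align)
{
  return (((uintptr_t) a - (uintptr_t) b) & (align - 1)) == 0;
}
```

This is why perturbing only the copy length, with source and destination offset identically, never produces mutually misaligned inputs.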
Siddhesh Poyarekar - May 21, 2019, 2:11 p.m.
On 21/05/19 7:28 PM, Wilco Dijkstra wrote:
> Well the test doesn't actually test misaligned copies - both source and
> destination are always mutually aligned, so any memcpy implementation which
> aligns either source or destination will only do aligned copies.

They were not intended to be mutually misaligned; that was not the
intent of the benchmark, since the target application it modeled did not
have such inputs.

> In any case I'm not sure what the test is supposed to measure - the scores are
> identical across all memcpy implementations. The time taken for double the
> copy size is exactly twice as much.

Right, you'll probably only see differences in case of mutually
misaligned inputs.

Siddhesh
Patrick McGehearty - May 21, 2019, 3:11 p.m.
General comment on performance test design:

The idea that it is worthwhile to measure memcpy with
misaligned src/dest that are mutually aligned shows
how exposure to particular implementations can
create blind spots. I'm speaking
of my own blind spot here, as I've been
involved in tuning platform-specific memcpy
on Sparc/Solaris for over a decade and it never
occurred to me to test that particular case.
That's because the first memcpy code I worked from
[optimized for the Sparc Cheetah processor, first
available almost twenty years ago] had an early
test to determine if the src/dest were mutually
aligned and if so then move a few bytes until
they were long word aligned. Thus, for medium length
or longer copies, the performance was essentially
the same for aligned copies with aligned starting points
and aligned copies with misaligned starting points.

Full coverage performance testing of library code can be
challenging, especially when the "typical" usage pattern
is not predictable in advance. Even something as well
defined as memcpy can have very different usage patterns.
Examples:
Machine characteristic differences:
target machine supports/does not support misaligned access at full speed
target machine supports/does not support misaligned access at a cost
     similar to or less than a mispredicted branch
target machine imposes major overhead for misaligned access
     equivalent to a segfault, handled by the kernel
Even within a single architectural family, these details can
change with implementations. Across architectures like x86, sparc,
power, arm, etc. it can be very challenging to write 'general'
fast code.

Data characteristic differences also can be challenging:
    Data is overwhelmingly aligned but often short
       [Need to minimize testing before moving data]
    Data is often misaligned, sometimes aligned, and large
       [Best to special case various alignments with large copies
        handled differently]
    Data frequently overlaps
       [Need to watch out for HW limitations on load/store timing]
and so on...

In writing tests, if we are buried in irrelevant data, we are more
likely to miss important changes while tuning. But if we trim the
tests too much, we may miss important limitations of the implementation.
Another consideration is how the data is presented. If the
data to be compared is laid out irregularly, it is more
difficult to distinguish trends from random variation.
I find the current standard glibc mem* performance tests difficult to
quickly skim/compare for possible performance regressions.

I've just touched on a few of the issues involved for memcpy perf work.
malloc is even more complex which is why we are still discussing
how to performance test glibc malloc.

- Patrick McGehearty (patrick.mcgehearty@oracle.com)


On 5/21/2019 9:11 AM, Siddhesh Poyarekar wrote:
> On 21/05/19 7:28 PM, Wilco Dijkstra wrote:
>> Well the test doesn't actually test misaligned copies - both source and
>> destination are always mutually aligned, so any memcpy implementation which
>> aligns either source or destination will only do aligned copies.
> They were not intended to be mutually misaligned; that was not the
> intent of the benchmark, since the target application it modeled did not
> have such inputs.
>
>> In any case I'm not sure what the test is supposed to measure - the scores are
>> identical across all memcpy implementations. The time taken for double the
>> copy size is exactly twice as much.
> Right, you'll probably only see differences in case of mutually
> misaligned inputs.
>
> Siddhesh
Wilco Dijkstra - May 22, 2019, 11:11 a.m.
Hi Siddhesh,
  
>On 21/05/19 7:28 PM, Wilco Dijkstra wrote:
>> Well the test doesn't actually test misaligned copies - both source and
>> destination are always mutually aligned, so any memcpy implementation which
>> aligns either source or destination will only do aligned copies.
>
> They were not intended to be mutually misaligned; that was not the
> intent of the benchmark, since the target application it modeled did not
> have such inputs.
>
>> In any case I'm not sure what the test is supposed to measure - the scores are
>> identical across all memcpy implementations. The time taken for double the
>> copy size is exactly twice as much.
>
> Right, you'll probably only see differences in case of mutually
> misaligned inputs.

Well if I force the copies to be mutually unaligned, there is only about 1% difference
for a few of the memcpy implementations compared to them being always aligned.
The others show identical performance whether aligned or not. This is not too
surprising since the test is basically waiting for DRAM most of the time.

So if we wanted to measure something useful we'd need to do it differently. Maybe
the goal was to measure DRAM bandwidth? If so we could modify it to compare
copy bandwidth for just a few different sizes (corresponding with typical L1/L2/L3 sizes).

Wilco
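
A sketch of what the suggested redesign might look like; the working-set sizes below are illustrative assumptions only (actual L1/L2/L3 sizes vary by machine and would need to be chosen per target):

```c
#include <stddef.h>

/* Hypothetical working-set sizes for a bandwidth-style copy benchmark,
   chosen to sit at different levels of the memory hierarchy.  These
   values are assumptions for illustration, not measured cache sizes.  */
static const size_t bandwidth_sizes[] =
{
  16 * 1024,         /* well inside a typical 32 KiB L1 */
  512 * 1024,        /* L2-resident on many cores */
  8 * 1024 * 1024,   /* roughly L3-sized */
  64 * 1024 * 1024,  /* large enough to force DRAM traffic */
};
```

Reporting bytes copied per second at each of these sizes would separate cache-level throughput from the DRAM-bound behavior the current walk test mostly measures.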
Siddhesh Poyarekar - May 22, 2019, 11:27 a.m.
On 22/05/19 4:41 PM, Wilco Dijkstra wrote:
> Well if I force the copies to be mutually unaligned, there is only about 1% difference
> for a few of the memcpy implementations compared to them being always aligned.
> The others show identical performance whether aligned or not. This is not too
> surprising since the test is basically waiting for DRAM most of the time.

That's a good point.  Is that the case for thunderx as well?  IIRC they
perform particularly badly with misaligned code but I don't know if
they're bad enough to be significant in the face of DRAM waits.

> So if we wanted to measure something useful we'd need to do it differently. Maybe
> the goal was to measure DRAM bandwidth? If so we could modify it to compare
> copy bandwidth for just a few different sizes (corresponding with typical L1/L2/L3 sizes).

That sounds like a good idea.

Siddhesh
Anton Youdkevitch - May 22, 2019, 12:38 p.m.
Siddhesh,

On 22.5.2019 14:27 , Siddhesh Poyarekar wrote:
> On 22/05/19 4:41 PM, Wilco Dijkstra wrote:
>> Well if I force the copies to be mutually unaligned, there is only about 1% difference
>> for a few of the memcpy implementations compared to them being always aligned.
>> The others show identical performance whether aligned or not. This is not too
>> surprising since the test is basically waiting for DRAM most of the time.
>
> That's a good point.  Is that the case for thunderx as well?  IIRC they
> perform particularly badly with misaligned code but I don't know if
> they're bad enough to be significant in the face of DRAM waits.
They were. The difference was up to 50% in some cases. But this is the
data for bench-memcpy. Reports for bench-memcpy-walk and
bench-memcpy-random do not have alignment info in them.

So, unaligned accesses can be much slower than aligned ones. Or, at least,
this is how the benchmarks measure them. We already know that the results
are not very stable, though.

P.S. To be precise I'm speaking about TX2 here.
Siddhesh Poyarekar - May 22, 2019, 1:24 p.m.
On 22/05/19 6:08 PM, Anton Youdkevitch wrote:
> They were. The difference was up to 50% in some cases. But this is the
> data for bench-memcpy. Reports for bench-memcpy-walk and
> bench-memcpy-random do not have alignment info in them.
> 
> So, unaligned accesses can be much slower than aligned ones. Or, at least,
> this is how the benchmarks measure them. We already know that the results
> are not very stable, though.

bench-memcpy-walk mixes in the misaligned sizes, but it shouldn't have a
big impact because 1) the source and destination are not mutually
misaligned and 2) the alignment code is usually a tiny portion of the
execution time and hence shouldn't show up in memcpy-walk.

Siddhesh
Anton Youdkevitch - May 22, 2019, 1:59 p.m.
On 22.5.2019 16:24 , Siddhesh Poyarekar wrote:
> On 22/05/19 6:08 PM, Anton Youdkevitch wrote:
>> They were. The difference was up to 50% in some cases. But this is the
>> data for bench-memcpy. Reports for bench-memcpy-walk and
>> bench-memcpy-random do not have alignment info in them.
>>
>> So, unaligned accesses can be much slower than aligned ones. Or, at least,
>> this is how the benchmarks measure them. We already know that the results
>> are not very stable, though.
>
> bench-memcpy-walk mixes in the misaligned sizes, but it shouldn't have a
> big impact because 1) the source and destination are not mutually
> misaligned and 2) the alignment code is usually a tiny portion of the
> execution time and hence shouldn't show up in memcpy-walk.
OK, sorry, I probably misunderstood the purpose of memcpy-walk.
If it has a specific access pattern intentionally (for whatever reason)
then my objection does not apply.

Patch

diff --git a/benchtests/bench-memcpy-random.c b/benchtests/bench-memcpy-random.c
index 873133d9a6988af8a251e1035ab8bded277d5775..f2c2e9dc3db29ca269514874a4976eec13d9d685 100644
--- a/benchtests/bench-memcpy-random.c
+++ b/benchtests/bench-memcpy-random.c
@@ -101,7 +101,7 @@  do_one_test (json_ctx_t *json_ctx, impl_t *impl, char *dst, char *src,
 	     copy_t *copy, size_t n)
 {
   timing_t start, stop, cur;
-  size_t iters = INNER_LOOP_ITERS * 20;
+  size_t iters = INNER_LOOP_ITERS_MEDIUM;
 
   TIMING_NOW (start);
   for (int i = 0; i < iters; ++i)
diff --git a/benchtests/bench-memcpy-walk.c b/benchtests/bench-memcpy-walk.c
index 16b455f7796b9b3b9288fb6e5245bed006967970..397a1e98df4f7bd069865262470c40a58665bcd6 100644
--- a/benchtests/bench-memcpy-walk.c
+++ b/benchtests/bench-memcpy-walk.c
@@ -103,12 +103,8 @@  test_main (void)
   json_array_begin (&json_ctx, "results");
   for (size_t i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
     {
-      /* Test length alignments from 0-16 bytes.  */
-      for (int j = 0; j < 8; j++)
-	{
-	  do_test (&json_ctx, i + j);
-	  do_test (&json_ctx, i + 16 - j);
-	}
+      do_test (&json_ctx, i);
+      do_test (&json_ctx, i + 1);
     }
 
   json_array_end (&json_ctx);
diff --git a/benchtests/bench-memmem.c b/benchtests/bench-memmem.c
index b6b97f3d1f5f94a71c177cc516cba81c869d3865..42778b07658efbe9d8027e04dc4682aba5ab8797 100644
--- a/benchtests/bench-memmem.c
+++ b/benchtests/bench-memmem.c
@@ -105,7 +105,7 @@  static void
 do_one_test (impl_t *impl, const void *haystack, size_t haystack_len,
 	     const void *needle, size_t needle_len, const void *expected)
 {
-  size_t i, iters = INNER_LOOP_ITERS;
+  size_t i, iters = INNER_LOOP_ITERS_SMALL;
   timing_t start, stop, cur;
 
   TIMING_NOW (start);
diff --git a/benchtests/bench-memmove-walk.c b/benchtests/bench-memmove-walk.c
index 7b4c31e63f6f3d39beb8ec8da8bfffc82f24d1ac..2b4a7e4f2966b930a8488dd02647639ffbfddc73 100644
--- a/benchtests/bench-memmove-walk.c
+++ b/benchtests/bench-memmove-walk.c
@@ -107,23 +107,15 @@  test_main (void)
   /* Non-overlapping buffers.  */
   for (size_t i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
     {
-      /* Test length alignments from 0-16 bytes.  */
-      for (int j = 0; j < 8; j++)
-	{
-	  do_test (&json_ctx, i + j, false);
-	  do_test (&json_ctx, i + 16 - j, false);
-	}
+      do_test (&json_ctx, i, false);
+      do_test (&json_ctx, i + 1, false);
     }
 
   /* Overlapping buffers.  */
   for (size_t i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
     {
-      /* Test length alignments from 0-16 bytes.  */
-      for (int j = 0; j < 8; j++)
-	{
-	  do_test (&json_ctx, i + j, true);
-	  do_test (&json_ctx, i + 16 - j, true);
-	}
+      do_test (&json_ctx, i, true);
+      do_test (&json_ctx, i + 1, true);
     }
 
   json_array_end (&json_ctx);
diff --git a/benchtests/bench-memset-walk.c b/benchtests/bench-memset-walk.c
index 5f43671216e2294856df35a865061740de7d7d67..930aae34d10603c80a7c40d123fb0376cde56436 100644
--- a/benchtests/bench-memset-walk.c
+++ b/benchtests/bench-memset-walk.c
@@ -112,13 +112,16 @@  test_main (void)
 
   json_array_begin (&json_ctx, "results");
   for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
-      /* Test length alignments from 0-16 bytes.  */
-      for (int j = 0; j < i && j < 16; j++)
-	do_test (&json_ctx, 65, i + j);
+    {
+      do_test (&json_ctx, 65, i);
+      do_test (&json_ctx, 65, i + 1);
+    }
 
   for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
-      for (int j = 0; j < i && j < 16; j++)
-	do_test (&json_ctx, 0, i + j);
+    {
+      do_test (&json_ctx, 0, i);
+      do_test (&json_ctx, 0, i + 1);
+    }
 
   json_array_end (&json_ctx);
   json_attr_object_end (&json_ctx);
diff --git a/benchtests/bench-strcasestr.c b/benchtests/bench-strcasestr.c
index 1458070d47cbfe5e27d40288f8ac7f83df735d08..d701ff4f364b35bd5223831fc8df7b468e183ceb 100644
--- a/benchtests/bench-strcasestr.c
+++ b/benchtests/bench-strcasestr.c
@@ -36,7 +36,7 @@  IMPL (strcasestr, 1)
 static void
 do_one_test (impl_t *impl, const char *s1, const char *s2, char *exp_result)
 {
-  size_t i, iters = INNER_LOOP_ITERS;
+  size_t i, iters = INNER_LOOP_ITERS_SMALL;
   timing_t start, stop, cur;
 
   TIMING_NOW (start);
diff --git a/benchtests/bench-string.h b/benchtests/bench-string.h
index 9d3332a624c54b22884b2f90444c841a67f4baeb..e7fb015bc639d97e702e2c0e3c9aee86c9537e2c 100644
--- a/benchtests/bench-string.h
+++ b/benchtests/bench-string.h
@@ -124,7 +124,9 @@  extern impl_t __start_impls[], __stop_impls[];
 # define OPT_RANDOM 10001
 # define OPT_SEED 10002
 
-# define INNER_LOOP_ITERS 64
+# define INNER_LOOP_ITERS 8192
+# define INNER_LOOP_ITERS_MEDIUM 1024
+# define INNER_LOOP_ITERS_SMALL 32
 
 int ret, do_srandom;
 unsigned int seed;
diff --git a/benchtests/bench-strpbrk.c b/benchtests/bench-strpbrk.c
index 14e1bcb873fa86caa1f50a9013f52cd422e74185..e484ebd0c08df802e67578e613535f65ad233bdc 100644
--- a/benchtests/bench-strpbrk.c
+++ b/benchtests/bench-strpbrk.c
@@ -66,7 +66,7 @@  static void
 do_one_test (impl_t *impl, const CHAR *s, const CHAR *rej, RES_TYPE exp_res)
 {
   RES_TYPE res = CALL (impl, s, rej);
-  size_t i, iters = INNER_LOOP_ITERS;
+  size_t i, iters = INNER_LOOP_ITERS_MEDIUM;
   timing_t start, stop, cur;
 
   if (res != exp_res)
diff --git a/benchtests/bench-strsep.c b/benchtests/bench-strsep.c
index fea7fbdfa6f2dcc8601354f56558dc3d83f07eef..96d3be52221f40ed1afeda1896921a04215a6b85 100644
--- a/benchtests/bench-strsep.c
+++ b/benchtests/bench-strsep.c
@@ -103,7 +103,7 @@  IMPL (oldstrsep, 2)
 static void
 do_one_test (impl_t * impl, const char *s1, const char *s2)
 {
-  size_t i, iters = INNER_LOOP_ITERS;
+  size_t i, iters = INNER_LOOP_ITERS_SMALL;
   timing_t start, stop, cur;
 
   TIMING_NOW (start);
diff --git a/benchtests/bench-strspn.c b/benchtests/bench-strspn.c
index ac874fe234af66ad98a6188099f32c10484452db..ed97c5fc6205825531cb7e0b6f349223be0c3f66 100644
--- a/benchtests/bench-strspn.c
+++ b/benchtests/bench-strspn.c
@@ -60,7 +60,7 @@  SIMPLE_STRSPN (const CHAR *s, const CHAR *acc)
 static void
 do_one_test (impl_t *impl, const CHAR *s, const CHAR *acc, size_t exp_res)
 {
-  size_t res = CALL (impl, s, acc), i, iters = INNER_LOOP_ITERS;
+  size_t res = CALL (impl, s, acc), i, iters = INNER_LOOP_ITERS_MEDIUM;
   timing_t start, stop, cur;
 
   if (res != exp_res)
diff --git a/benchtests/bench-strstr.c b/benchtests/bench-strstr.c
index 6e45cb7c901fef4319ad62c8ba8ebb9109f12c28..1769dc2a102b3978ea51130e2dc7bb91c8b8947c 100644
--- a/benchtests/bench-strstr.c
+++ b/benchtests/bench-strstr.c
@@ -131,7 +131,7 @@  IMPL (basic_strstr, 0)
 static void
 do_one_test (impl_t *impl, const char *s1, const char *s2, char *exp_result)
 {
-  size_t i, iters = INNER_LOOP_ITERS;
+  size_t i, iters = INNER_LOOP_ITERS_SMALL;
   timing_t start, stop, cur;
   char *res;
 
diff --git a/benchtests/bench-strtok.c b/benchtests/bench-strtok.c
index 15ae76e2373792781930a3d73418ef81fecfe645..f5ab587dd94ab5035632fbf78f4058cc62d72acb 100644
--- a/benchtests/bench-strtok.c
+++ b/benchtests/bench-strtok.c
@@ -60,7 +60,7 @@  IMPL (strtok, 1)
 static void
 do_one_test (impl_t * impl, const char *s1, const char *s2)
 {
-  size_t i, iters = INNER_LOOP_ITERS;
+  size_t i, iters = INNER_LOOP_ITERS_SMALL;
   timing_t start, stop, cur;
   TIMING_NOW (start);
   for (i = 0; i < iters; ++i)