[v3,2/5] benchtests: Add memset zero fill benchtest

Message ID 20210805075053.433538-1-naohirot@fujitsu.com
State Superseded
Series benchtests: Add memset zero fill benchmark test

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent

Commit Message

Naohiro Tamura Aug. 5, 2021, 7:50 a.m. UTC
Memset takes 0 as its second parameter in most cases. However,
zero-fill performance cannot be measured precisely with
bench-memset.c, bench-memset-large.c or bench-memset-walk.c.
Some x86_64 micro-architectures optimize the zero-over-zero case, and
AArch64 micro-architectures optimize zero fill through the DC ZVA
instruction.
This patch adds bench-memset-zerofill.c, which makes it possible to
analyze zero-fill performance by comparing four patterns, zero-over-zero,
zero-over-one, one-over-zero and one-over-one, at sizes from 256 B to
64 MB (RAM), passing through the L1, L2 and L3 caches.

The following commands are examples of analyzing the JSON output,
bench-memset-zerofill.out, with 'jq' and 'plot_strings.py'.

1) compare zero-over-zero performance

$ cat bench-memset-zerofill.out | \
  jq -r '
    .functions.memset."bench-variant"="zerofill-0o0" |
    del(.functions.memset.results[] | select(.char1 != 0 or .char2 != 0))
  ' | \
  plot_strings.py -l -p thru -v -

2) compare zero-fill performance

$ cat bench-memset-zerofill.out | \
  jq -r '
    .functions.memset."bench-variant"="zerofill-zero" |
    del(.functions.memset.results[] | select(.char2 != 0))
  ' | \
  plot_strings.py -l -p thru -v -

3) compare nonzero-fill performance

$ cat bench-memset-zerofill.out | \
  jq -r '
    .functions.memset."bench-variant"="zerofill-nonzero" |
    del(.functions.memset.results[] | select(.char2 == 0))
  ' | \
  plot_strings.py -l -p thru -v -
---
 benchtests/Makefile                |   2 +-
 benchtests/bench-memset-zerofill.c | 134 +++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-memset-zerofill.c
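The four fill patterns compared by the benchmark can be summarized in a small table. The C sketch below is illustrative only: the struct and its names are not part of the patch; the char1/char2 mapping follows the explanation given later in this thread.

```c
/* The four fill patterns measured by bench-memset-zerofill.  "char1" is
   the value the buffer holds before the measured call, and "char2" is
   the value the measured memset writes.  Illustrative sketch only; these
   names match the commit message, not symbols in the patch.  */
struct fill_pattern
{
  int char1;                     /* previous buffer contents */
  int char2;                     /* value written by the timed memset */
  const char *name;
};

static const struct fill_pattern fill_patterns[] =
{
  { 0, 0, "zero-over-zero" },    /* may hit zero-over-zero shortcuts */
  { 0, 1, "zero-over-one"  },    /* nonzero fill of a zeroed buffer */
  { 1, 0, "one-over-zero"  },    /* zero fill of a dirty buffer (DC ZVA) */
  { 1, 1, "one-over-one"   },    /* nonzero fill baseline */
};
```

Comparing the one-over-zero row against one-over-one isolates the zero-fill path the commit message describes.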
  

Comments

develop--- via Libc-alpha Sept. 8, 2021, 2:03 a.m. UTC | #1
Hi Lucas, Wilco, Noah and all,
Are there any comments?
https://sourceware.org/pipermail/libc-alpha/2021-August/129839.html
Thanks.
Naohiro

> -----Original Message-----
> From: Naohiro Tamura <naohirot@fujitsu.com>
> Sent: Thursday, August 5, 2021 4:51 PM
> To: Lucas A. M. Magalhaes <lamm@linux.ibm.com>; Wilco Dijkstra <Wilco.Dijkstra@arm.com>; Noah Goldstein
> <goldstein.w.n@gmail.com>; libc-alpha@sourceware.org
> Cc: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>
> Subject: [PATCH v3 2/5] benchtests: Add memset zero fill benchtest
> 
> Memset takes 0 as the second parameter in most cases.
> However, we cannot measure the zero fill performance by
> bench-memset.c, bench-memset-large.c and bench-memset-walk.c
> precisely.
> X86_64 micro-architecture has some zero-over-zero optimization, and
> AArch64 micro-architecture also has some optimization for DC ZVA
> instruction.
> This patch provides bench-memset-zerofill.c which is suitable to
> analyze the zero fill performance by comparing among 4 patterns,
> zero-over-zero, zero-over-one, one-over-zero and one-over-one, from
> 256B to 64MB(RAM) through L1, L2 and L3 caches.
> 
> The following commands are examples to analyze a JSON output,
> bench-memset-zerofill.out, by 'jq' and 'plot_strings.py'.
> 
> 1) compare zero-over-zero performance
> 
> $ cat bench-memset-zerofill.out | \
>   jq -r '
>     .functions.memset."bench-variant"="zerofill-0o0" |
>     del(.functions.memset.results[] | select(.char1 != 0 or .char2 != 0))
>   ' | \
>   plot_strings.py -l -p thru -v -
> 
> 2) compare zero-fill performance
> 
> $ cat bench-memset-zerofill.out | \
>   jq -r '
>     .functions.memset."bench-variant"="zerofill-zero" |
>     del(.functions.memset.results[] | select(.char2 != 0))
>   ' | \
>   plot_strings.py -l -p thru -v -
> 
> 3) compare nonzero-fill performance
> 
> $ cat bench-memset-zerofill.out | \
>   jq -r '
>     .functions.memset."bench-variant"="zerofill-nonzero" |
>     del(.functions.memset.results[] | select(.char2 == 0))
>   ' | \
>   plot_strings.py -l -p thru -v -
> ---
>  benchtests/Makefile                |   2 +-
>  benchtests/bench-memset-zerofill.c | 134 +++++++++++++++++++++++++++++
>  2 files changed, 135 insertions(+), 1 deletion(-)
>  create mode 100644 benchtests/bench-memset-zerofill.c
> 
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 1530939a8ce8..21b95c736190 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -53,7 +53,7 @@ string-benchset := memccpy memchr memcmp memcpy memmem memmove \
>  		   strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
>  		   strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok \
>  		   strcoll memcpy-large memcpy-random memmove-large memset-large \
> -		   memcpy-walk memset-walk memmove-walk
> +		   memcpy-walk memset-walk memmove-walk memset-zerofill
> 
>  # Build and run locale-dependent benchmarks only if we're building natively.
>  ifeq (no,$(cross-compiling))
> diff --git a/benchtests/bench-memset-zerofill.c b/benchtests/bench-memset-zerofill.c
> new file mode 100644
> index 000000000000..7aa7fe048574
> --- /dev/null
> +++ b/benchtests/bench-memset-zerofill.c
> @@ -0,0 +1,134 @@
> +/* Measure memset functions with zero fill data.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#define TEST_MAIN
> +#define TEST_NAME "memset"
> +#define START_SIZE 256
> +#define MIN_PAGE_SIZE (getpagesize () + 64 * 1024 * 1024)
> +#define TIMEOUT (20 * 60)
> +#include "bench-string.h"
> +
> +#include "json-lib.h"
> +
> +void *generic_memset (void *, int, size_t);
> +typedef void *(*proto_t) (void *, int, size_t);
> +
> +IMPL (MEMSET, 1)
> +IMPL (generic_memset, 0)
> +
> +static void
> +__attribute__((noinline, noclone))
> +do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
> +	     int c1 __attribute ((unused)), int c2 __attribute ((unused)),
> +	     size_t n)
> +{
> +  size_t i, iters = 32;
> +  timing_t start, stop, cur, latency = 0;
> +
> +  CALL (impl, s, c2, n); // warm up
> +
> +  for (i = 0; i < iters; i++)
> +    {
> +      memset (s, c1, n); // alternation
> +
> +      TIMING_NOW (start);
> +
> +      CALL (impl, s, c2, n);
> +
> +      TIMING_NOW (stop);
> +      TIMING_DIFF (cur, start, stop);
> +      TIMING_ACCUM (latency, cur);
> +    }
> +
> +  json_element_double (json_ctx, (double) latency / (double) iters);
> +}
> +
> +static void
> +do_test (json_ctx_t *json_ctx, size_t align, int c1, int c2, size_t len)
> +{
> +  align &= getpagesize () - 1;
> +  if ((align + len) * sizeof (CHAR) > page_size)
> +    return;
> +
> +  json_element_object_begin (json_ctx);
> +  json_attr_uint (json_ctx, "length", len);
> +  json_attr_uint (json_ctx, "alignment", align);
> +  json_attr_int (json_ctx, "char1", c1);
> +  json_attr_int (json_ctx, "char2", c2);
> +  json_array_begin (json_ctx, "timings");
> +
> +  FOR_EACH_IMPL (impl, 0)
> +    {
> +      do_one_test (json_ctx, impl, (CHAR *) (buf1) + align, c1, c2, len);
> +      alloc_bufs ();
> +    }
> +
> +  json_array_end (json_ctx);
> +  json_element_object_end (json_ctx);
> +}
> +
> +int
> +test_main (void)
> +{
> +  json_ctx_t json_ctx;
> +  size_t i;
> +  int c1, c2;
> +
> +  test_init ();
> +
> +  json_init (&json_ctx, 0, stdout);
> +
> +  json_document_begin (&json_ctx);
> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> +
> +  json_attr_object_begin (&json_ctx, "functions");
> +  json_attr_object_begin (&json_ctx, TEST_NAME);
> +  json_attr_string (&json_ctx, "bench-variant", "zerofill");
> +
> +  json_array_begin (&json_ctx, "ifuncs");
> +  FOR_EACH_IMPL (impl, 0)
> +    json_element_string (&json_ctx, impl->name);
> +  json_array_end (&json_ctx);
> +
> +  json_array_begin (&json_ctx, "results");
> +
> +  for (c1 = 0; c1 < 2; c1++)
> +    for (c2 = 0; c2 < 2; c2++)
> +      for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
> +	{
> +	  do_test (&json_ctx, 0, c1, c2, i);
> +	  do_test (&json_ctx, 3, c1, c2, i);
> +	}
> +
> +  json_array_end (&json_ctx);
> +  json_attr_object_end (&json_ctx);
> +  json_attr_object_end (&json_ctx);
> +  json_document_end (&json_ctx);
> +
> +  return ret;
> +}
> +
> +#include <support/test-driver.c>
> +
> +#define libc_hidden_builtin_def(X)
> +#define libc_hidden_def(X)
> +#define libc_hidden_weak(X)
> +#define weak_alias(X,Y)
> +#undef MEMSET
> +#define MEMSET generic_memset
> +#include <string/memset.c>
> --
> 2.17.1
  
Lucas A. M. Magalhaes Sept. 10, 2021, 8:40 p.m. UTC | #2
Hi Naohiro,

Thanks for working on this. Please correct me if I'm wrong, but I think you
sent an old version by mistake. This patch lacks the bench-variant
implementations mentioned in the commit message.

---
Lucas A. M. Magalhães

Quoting Naohiro Tamura (2021-08-05 04:50:53)
> [...]
  
develop--- via Libc-alpha Sept. 13, 2021, 12:53 a.m. UTC | #3
Hi Lucas,

> From: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
> Sent: Saturday, September 11, 2021 5:40 AM
> 
> Thanks for working on this. Please, correct me if I'm wrong but I guess you sent
> an old version by mistake. This patch is lacking the bench-variant
> implementations mentioned on the commit message.

Thank you for the comment!
I double-checked the source code and confirmed that it is the version I intended.
The four patterns are combinations of the JSON attributes "char1" and "char2",
each of which varies between 0 and 1:

zero-over-zero: char1=0, char2=0
zero-over-one: char1=0, char2=1
one-over-zero: char1=1, char2=0
one-over-one: char1=1, char2=1
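
In each measured iteration, the buffer is first primed with char1 and the
timed call then writes char2. A minimal standalone sketch of that pairing
(plain C, outside the benchtest harness; the helper name is made up):

```c
#include <string.h>

/* Prime BUF with C1, then perform the fill with C2.  This mirrors the
   memset () "alternation" followed by the timed CALL () in do_one_test;
   prime_and_fill itself is only an illustration, not benchmark code.  */
static void
prime_and_fill (unsigned char *buf, size_t n, int c1, int c2)
{
  memset (buf, c1, n);   /* set the previous contents (char1) */
  memset (buf, c2, n);   /* the call the benchmark times (char2) */
}
```

Running this with (c1, c2) drawn from {0, 1} produces exactly the four
patterns listed above.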

I also made an inline comment below.

BTW, could you review the patch "benchtests: Remove redundant assert.h" [1],
which applies your comment [2] to the other bench tests, when you have time?

[1] https://sourceware.org/pipermail/libc-alpha/2021-August/129840.html
[2] https://sourceware.org/pipermail/libc-alpha/2021-July/128989.html

> 
> Quoting Naohiro Tamura (2021-08-05 04:50:53)
> > [...]
> > +  for (c1 = 0; c1 < 2; c1++)
> > +    for (c2 = 0; c2 < 2; c2++)
> > +      for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
> > +       {

Creating 4 patterns here.

Thanks.
Naohiro

  
Lucas A. M. Magalhaes Sept. 13, 2021, 2:05 p.m. UTC | #4
Quoting naohirot@fujitsu.com (2021-09-12 21:53:22)
> Hi Lucas,
> 
> > From: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
> > Sent: Saturday, September 11, 2021 5:40 AM
> > 
> > Thanks for working on this. Please, correct me if I'm wrong but I guess you sent
> > an old version by mistake. This patch is lacking the bench-variant
> > implementations mentioned on the commit message.
> 
> Thank you for the comment!
> I double checked the source code and confirmed it is the one I intended.
> 4 patterns are combination of json attribute "char1" and "char2".
> "char1" and "char2" varies 0 and 1 respectively.
> 
> zero-over-zero: char1=0, char2=0
> zero-over-one: char1=0, char2=1
> one-over-zero: char1=1, char2=0
> one-over-one: char1=1, char2=1
> 
> I made a comment inline too.
> 

Thanks for clarifying; now I get it. Could you please add a comment to the
code explaining these patterns and the reason behind them?

With that said, this patch LGTM.

> BTW, could you review the patch "benchtests: Remove redundant assert.h" [1]
> that is reflected your comment [2] to other bench tests if you had time?
> 
> [1] https://sourceware.org/pipermail/libc-alpha/2021-August/129840.html
> [2] https://sourceware.org/pipermail/libc-alpha/2021-July/128989.html
> 
> > 
> > Quoting Naohiro Tamura (2021-08-05 04:50:53)
> > > [...]
> > > +static void
> > > +__attribute__((noinline, noclone))
> > > +do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
> > > +            int c1 __attribute ((unused)), int c2 __attribute ((unused)),
> > > +            size_t n)
> > > +{
> > > +  size_t i, iters = 32;
> > > +  timing_t start, stop, cur, latency = 0;
> > > +
> > > +  CALL (impl, s, c2, n); // warm up
> > > +
> > > +  for (i = 0; i < iters; i++)
> > > +    {
> > > +      memset (s, c1, n); // alternation
> > > +
> > > +      TIMING_NOW (start);
> > > +
> > > +      CALL (impl, s, c2, n);
> > > +
> > > +      TIMING_NOW (stop);
> > > +      TIMING_DIFF (cur, start, stop);
> > > +      TIMING_ACCUM (latency, cur);
> > > +    }
> > > +
> > > +  json_element_double (json_ctx, (double) latency / (double) iters);
> > > +}
> > > +
Ok.

> > > +static void
> > > +do_test (json_ctx_t *json_ctx, size_t align, int c1, int c2, size_t len)
> > > +{
> > > +  align &= getpagesize () - 1;
> > > +  if ((align + len) * sizeof (CHAR) > page_size)
> > > +    return;
> > > +
> > > +  json_element_object_begin (json_ctx);
> > > +  json_attr_uint (json_ctx, "length", len);
> > > +  json_attr_uint (json_ctx, "alignment", align);
> > > +  json_attr_int (json_ctx, "char1", c1);
> > > +  json_attr_int (json_ctx, "char2", c2);
> > > +  json_array_begin (json_ctx, "timings");
> > > +
> > > +  FOR_EACH_IMPL (impl, 0)
> > > +    {
> > > +      do_one_test (json_ctx, impl, (CHAR *) (buf1) + align, c1, c2, len);
> > > +      alloc_bufs ();
> > > +    }
> > > +
> > > +  json_array_end (json_ctx);
> > > +  json_element_object_end (json_ctx);
> > > +}
Ok.

> > > +
> > > +int
> > > +test_main (void)
> > > +{
> > > +  json_ctx_t json_ctx;
> > > +  size_t i;
> > > +  int c1, c2;
> > > +
> > > +  test_init ();
> > > +
> > > +  json_init (&json_ctx, 0, stdout);
> > > +
> > > +  json_document_begin (&json_ctx);
> > > +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> > > +
> > > +  json_attr_object_begin (&json_ctx, "functions");
> > > +  json_attr_object_begin (&json_ctx, TEST_NAME);
> > > +  json_attr_string (&json_ctx, "bench-variant", "zerofill");
> > > +
> > > +  json_array_begin (&json_ctx, "ifuncs");
> > > +  FOR_EACH_IMPL (impl, 0)
> > > +    json_element_string (&json_ctx, impl->name);
> > > +  json_array_end (&json_ctx);
> > > +
> > > +  json_array_begin (&json_ctx, "results");
> > > +
> > > +  for (c1 = 0; c1 < 2; c1++)
> > > +    for (c2 = 0; c2 < 2; c2++)
> > > +      for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
> > > +       {
> > > +         do_test (&json_ctx, 0, c1, c2, i);
> > > +         do_test (&json_ctx, 3, c1, c2, i);
> > > +       }
> > > +
> > > +  json_array_end (&json_ctx);
> > > +  json_attr_object_end (&json_ctx);
> > > +  json_attr_object_end (&json_ctx);
> > > +  json_document_end (&json_ctx);
> > > +
> > > +  return ret;
> > > +}
Ok.

> > > +
> > > +#include <support/test-driver.c>
> > > +
> > > +#define libc_hidden_builtin_def(X)
> > > +#define libc_hidden_def(X)
> > > +#define libc_hidden_weak(X)
> > > +#define weak_alias(X,Y)
> > > +#undef MEMSET
> > > +#define MEMSET generic_memset
> > > +#include <string/memset.c>
> > > --
> > > 2.17.1
> > >
  
develop--- via Libc-alpha Sept. 14, 2021, 12:44 a.m. UTC | #5
Hi Lucas,

> From: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
> Sent: Monday, September 13, 2021 11:05 PM
>
> Thanks for clarifying, now I got it. Please can you add a comment on the
> code explaining these patterns and the reason behind them?
> 
> With that said this patch LGTM.

Thank you for the review!
I just submitted the V4 patch with the comment added.
Please find it [1] and merge it if it's OK.

Changes from V3:

> Reviewed-by: Lucas A. M. Magalhaes <lamm@linux.ibm.com>
> Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> +  // To analyze zero fill performance by comparing among the following 4
> +  // patterns from 256B to 64MB(RAM) through L1, L2 and L3 caches.
> +  // - zero-over-zero: c1=0, c2=0
> +  // - zero-over-one:  c1=0, c2=1
> +  // - one-over-zero:  c1=1, c2=0
> +  // - one-over-one:   c1=1, c2=1

[1] https://sourceware.org/pipermail/libc-alpha/2021-September/130946.html

Thanks.
Naohiro
  
Wilco Dijkstra Sept. 14, 2021, 2:02 p.m. UTC | #6
Hi Naohiro,

I had a quick go at running the new benchmark. The main problem is that it doesn't
give repeatable results - there are huge variations from run to run of about 50% for
the smaller sizes. This is a fundamental problem due to the timing loop, and the only
way to reduce it is to increase the time taken by memset, ie. start at a much larger
size (say at 16KB).

It also takes a long time to run - generally it's best to ensure a benchmark takes less
than 10 seconds on a typical modern system (remember there will be many that are
slower!). It should be feasible to reduce the iteration count for large sizes, but you
could go up to 16MB rather than 64MB.

Cheers,
Wilco
  
develop--- via Libc-alpha Sept. 15, 2021, 8:24 a.m. UTC | #7
Hi Wilco,

Thank you for the comment.
I understood your concerns about the start size and the end size.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Sent: Tuesday, September 14, 2021 11:03 PM
>
> I had a quick go at running the new benchmark. The main problem is that it doesn't
> give repeatable results - there are huge variations from run to run of about 50% for
> the smaller sizes. This is a fundamental problem due to the timing loop, and the only
> way to reduce it is to increase the time taken by memset, ie. start at a much larger
> size (say at 16KB).

In terms of the start size, 256B is chosen because __memset_generic
(sysdeps/aarch64/memset.S) calls DC ZVA for zero fill from 256B, code
which you committed [1].
And I reported an interesting insight in the mail [2] that DC ZVA is
slower than store instructions from 256B to 16KB on A64FX [3].
So it seems valuable to measure the range from 256B to 16KB to see
each CPU's behavior.
What do you think?

[1] https://sourceware.org/git/?p=glibc.git&h=a8c5a2a9521e105da6e96eaf4029b8e4d595e4f5
[2] https://sourceware.org/pipermail/libc-alpha/2021-August/129805.html
[3] https://drive.google.com/file/d/1fonjDDlF4LPLfZY9-z22DGn-yaSpGN4g/view

> It also takes a long time to run - generally it's best to ensure a benchmark takes less
> than 10 seconds on a typical modern system (remember there will be many that are
> slower!). It should be feasible to reduce the iteration count for large sizes, but you
> could go up to 16MB rather than 64MB.

OK, I'll change the end size to 16MB.

Thanks.
Naohiro
  
develop--- via Libc-alpha Sept. 21, 2021, 1:27 a.m. UTC | #8
Hi Wilco,

Let me ping you regarding the start size.

> -----Original Message-----
> From: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>
> Sent: Wednesday, September 15, 2021 5:25 PM
> To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>; 'Lucas A. M. Magalhaes' <lamm@linux.ibm.com>; Noah Goldstein
> <goldstein.w.n@gmail.com>; libc-alpha@sourceware.org
> Subject: RE: [PATCH v3 2/5] benchtests: Add memset zero fill benchtest
> 
> Hi Wilco,
> 
> Thank you for the comment.
> I understood your concerns about the start size and the end size.
> 
> > From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Sent: Tuesday, September 14, 2021 11:03 PM
> >
> > I had a quick go at running the new benchmark. The main problem is that it doesn't
> > give repeatable results - there are huge variations from run to run of about 50% for
> > the smaller sizes. This is a fundamental problem due to the timing loop, and the only
> > way to reduce it is to increase the time taken by memset, ie. start at a much larger
> > size (say at 16KB).
> 
> In terms of the start size, 256B is chosen because __memset_generic
> (sysdeps/aarch64/memset.S) calls DC ZVA for zero fill from 256B, which
> code you committed [1].
> And I reported an interesting insight in the mail [2] that DC ZVA is
> slower than store instruction from 256B to 16KB on A64FX [3].
> So it seems valuable to measure the range from 256B to 16KB to see
> each CPU's behavior.
> What do you think?
> 
> [1] https://sourceware.org/git/?p=glibc.git&h=a8c5a2a9521e105da6e96eaf4029b8e4d595e4f5
> [2] https://sourceware.org/pipermail/libc-alpha/2021-August/129805.html
> [3] https://drive.google.com/file/d/1fonjDDlF4LPLfZY9-z22DGn-yaSpGN4g/view
> 
> > It also takes a long time to run - generally it's best to ensure a benchmark takes less
> > than 10 seconds on a typical modern system (remember there will be many that are
> > slower!). It should be feasible to reduce the iteration count for large sizes, but you
> > could go up to 16MB rather than 64MB.
> 
> OK, I'll change the end size to 16MB.
> 
> Thanks.
> Naohiro
  
Wilco Dijkstra Sept. 21, 2021, 11:09 a.m. UTC | #9
Hi Naohiro,

> In terms of the start size, 256B is chosen because __memset_generic
> (sysdeps/aarch64/memset.S) calls DC ZVA for zero fill from 256B, which
> code you committed [1].
> And I reported an interesting insight in the mail [2] that DC ZVA is
> slower than store instruction from 256B to 16KB on A64FX [3].
> So it seems valuable to measure the range from 256B to 16KB to see
> each CPU's behavior.
> What do you think?

As I've mentioned, this will never work using the current benchmark loop.
At size 256 your loop has only 1 timer tick... The only way to get any data
out is to increase the time taken per call. At 16K there are about 20 ticks so
it is still very inaccurate. By repeating the test thousands of times you can
get some signal out (e.g. 20% of runs read 20 ticks and 80% read 21,
giving ~20.8 ticks on average), but that's impossible for smaller sizes.

So if you want to measure small sizes, you need to use a more accurate timing
loop.

Cheers,
Wilco
  
develop--- via Libc-alpha Sept. 22, 2021, 1:07 a.m. UTC | #10
Hi Wilco,

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Sent: Tuesday, September 21, 2021 8:09 PM
>
> > What do you think?
> 
> As I've mentioned, this will never work using the current benchmark loop.
> At size 256 your loop has only 1 timer tick... The only way to get any data
> out is to increase the time taken per call. At 16K there are about 20 ticks so
> it is still very inaccurate. By repeating the test thousands of times you can
> get some signal out (e.g. 20% of runs read 20 ticks and 80% read 21,
> giving ~20.8 ticks on average), but that's impossible for smaller sizes.
> but that's impossible for smaller sizes.
> 
> So if you want to measure small sizes, you need to use a more accurate timing
> loop.

Thank you for the comment.
OK, I understood. So I updated the start size to 16KB as well, to get
this committed first.
Please find V5 [1] and merge it if it's OK.
Changes from V4:
- Start size to 16KB from 256B
- End size to 16MB from 64MB

[1] https://sourceware.org/pipermail/libc-alpha/2021-September/131245.html

Thanks.
Naohiro
  
develop--- via Libc-alpha Sept. 28, 2021, 1:40 a.m. UTC | #11
Hi Wilco,

Let me ping you again to ask whether V5 [1] is OK or not.
[1] https://sourceware.org/pipermail/libc-alpha/2021-September/131245.html
  
Lucas A. M. Magalhaes Oct. 18, 2021, 12:57 p.m. UTC | #12
> > > What do you think?
> >
> > As I've mentioned, this will never work using the current benchmark loop.
> > At size 256 your loop has only 1 timer tick... The only way to get any data
> > out is to increase the time taken per call. At 16K there are about 20 ticks so
> > it is still very inaccurate. By repeating the test thousands of times you can
> > get some signal out (e.g. 20% of runs read 20 ticks and 80% read 21,
> > giving ~20.8 ticks on average), but that's impossible for smaller sizes.
> > but that's impossible for smaller sizes.
> >
> > So if you want to measure small sizes, you need to use a more accurate timing
> > loop.
> 
> Thank you for the comment.
> OK, I understood. So I updated the start size to 16KB too to commit first.
> Please find V5 [1] and merge it if it's OK.
> Changes from V4:
> - Start size to 16KB from 256B
> - End size to 16MB from 64MB
 
> [1] https://sourceware.org/pipermail/libc-alpha/2021-September/131245.html
 
Hi Tamura,

I agree with you that it is important to measure calls with smaller
lengths.  IMHO the issue here is not whether the benchmark should
measure these lengths, but how it can measure them.

+static void
+__attribute__((noinline, noclone))
+do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
+	     int c1 __attribute ((unused)), int c2 __attribute ((unused)),
+	     size_t n)
+{
+  size_t i, iters = 32;
+  timing_t start, stop, cur, latency = 0;
+
+  CALL (impl, s, c2, n); // warm up
+
+  for (i = 0; i < iters; i++)
+    {
+      memset (s, c1, n); // alternation
+
+      TIMING_NOW (start);
+
+      CALL (impl, s, c2, n);
+
+      TIMING_NOW (stop);
+      TIMING_DIFF (cur, start, stop);
+      TIMING_ACCUM (latency, cur);
+    }
+
+  json_element_double (json_ctx, (double) latency / (double) iters);
+}

By doing this you are measuring just the call itself and accumulating
the results. This is indeed not measurable for really small lengths.
You could try moving the memset and the timing out of the loop and
measuring the time spent over multiple runs. To avoid the alternation
memset you could memset a bigger buffer and move the s pointer on each
iteration. I guess this will reduce the variations Wilco mentioned.
Maybe we need to keep this loop for bigger lengths, as the
implementation I suggested would need too big a buffer.

Another point here is that GNU Code Style asks for /**/ comments
instead of //. As seen in
http://www.gnu.org/prep/standards/standards.html#Comments

Finally, Sorry that I took so long to reply here.
Thanks for working on this.
---
Lucas A. M. Magalhães
  
Wilco Dijkstra Oct. 20, 2021, 1:44 p.m. UTC | #13
Hi Lucas,

> By doing this you are measuring just the call itself and accumulating
> the results. This is indeed not measurable for really small lengths.
> You could try moving the memset and the timing out of the loop and
> measuring the time spent over multiple runs. To avoid the alternation
> memset you could memset a bigger buffer and move the s pointer on each
> iteration. I guess this will reduce the variations Wilco mentioned.

That would basically end up the same as bench-memset-walk.c given that
you need a huge buffer to get reasonable accuracy (bench-memset does
8192 iterations by default, and that is still inaccurate for small sizes).
In that case it would be easier to improve bench-memset-walk.c rather than
adding yet another benchmark that is too inaccurate to be useful.

Alternatively we could use the timing loop I suggested which allows any
pattern of zero/non-zero to be tested accurately:

      TIMING_NOW (start);
      for (j = 0; j < iters; j++)
        CALL (impl, s, memset_value[j & MASK], n);
      TIMING_NOW (stop);

Cheers,
Wilco
  
Lucas A. M. Magalhaes Oct. 20, 2021, 3:35 p.m. UTC | #14
Hi Wilco,
> > By doing this you are measuring just the call itself and accumulating
> > the results. This is indeed not measurable for really small lengths.
> > You could try moving the memset and the timing out of the loop and
> > measuring the time spent over multiple runs. To avoid the alternation
> > memset you could memset a bigger buffer and move the s pointer on each
> > iteration. I guess this will reduce the variations Wilco mentioned.
> 
> That would basically end up the same as bench-memset-walk.c given that
> you need a huge buffer to get reasonable accuracy (bench-memset does
> 8192 iterations by default, and that is still inaccurate for small sizes).
> In that case it would be easier to improve bench-memset-walk.c rather than
> adding yet another benchmark that is too inaccurate to be useful.
Yeah, I agree with you.
> 
> Alternatively we could use the timing loop I suggested which allows any
> pattern of zero/non-zero to be tested accurately:
> 
>       TIMING_NOW (start);
>       for (j = 0; j < iters; j++)
>         CALL (impl, s, memset_value[j & MASK], n);
>       TIMING_NOW (stop);
> 
Sorry, but I suppose I didn't understand your suggestion completely.  The
memset_value array will hold patterns like [0,0], [0,1] or [1,1],
right?  If so, this will not work to measure the zero-to-one pattern, for
example, as it will be mixing zero-to-one with one-to-zero calls.  In
order to measure just a specific pattern, the buffer must be loaded
before the timing loop.

---
Lucas A. M. Magalhães
  
Wilco Dijkstra Oct. 20, 2021, 5:47 p.m. UTC | #15
Hi Lucas,

> Sorry, but I suppose I didn't understand your suggestion completely.  The
> memset_value array will hold patterns like [0,0], [0,1] or [1,1],
> right?  If so, this will not work to measure the zero-to-one pattern, for
> example, as it will be mixing zero-to-one with one-to-zero calls.  In
> order to measure just a specific pattern, the buffer must be loaded
> before the timing loop.

The original idea was to add more tests for memset of zero and check
whether writing zero is optimized and/or writing zero over zero. There is
an equal number of 0->1 and 1->0 transitions in a pattern, so you can't
easily differentiate between them, but you can tell whether they are the
same or faster than 1->1 transitions.

For 0->0 you can run different patterns with a varying number of transitions
but the same number of zeroes and ones: eg. 0000000011111111 (7 times 0->0)
vs 0011001100110011 (4 times 0->0) vs 0101010101010101 (no 0->0).

Cheers,
Wilco
  
Lucas A. M. Magalhaes Oct. 22, 2021, 1:08 p.m. UTC | #16
Hi Wilco, Thanks for clarifying.

> > Sorry, but I suppose I didn't understand your suggestion completely.  The
> > memset_value array will hold patterns like [0,0], [0,1] or [1,1],
> > right?  If so, this will not work to measure the zero-to-one pattern, for
> > example, as it will be mixing zero-to-one with one-to-zero calls.  In
> > order to measure just a specific pattern, the buffer must be loaded
> > before the timing loop.
> 
> The original idea was to add more tests for memset of zero and check
> whether writing zero is optimized and/or writing zero over zero. There is
> an equal number of 0->1 and 1->0 transitions in a pattern, so you can't
> easily differentiate between them, but you can tell whether they are the
> same or faster than 1->1 transitions.
> 
> For 0->0 you can run different patterns with a varying number of transitions
> but the same number of zeroes and ones: eg. 0000000011111111 (7 times 0->0)
> vs 0011001100110011 (4 times 0->0) vs 0101010101010101 (no 0->0).

That's an interesting strategy, indeed. I guess that's a little more
complex than most of the other benchmarks. I agree that this could solve
the issues with variations for small lengths.

Thanks.
---
Lucas A. M. Magalhães
  

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 1530939a8ce8..21b95c736190 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -53,7 +53,7 @@  string-benchset := memccpy memchr memcmp memcpy memmem memmove \
 		   strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
 		   strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok \
 		   strcoll memcpy-large memcpy-random memmove-large memset-large \
-		   memcpy-walk memset-walk memmove-walk
+		   memcpy-walk memset-walk memmove-walk memset-zerofill
 
 # Build and run locale-dependent benchmarks only if we're building natively.
 ifeq (no,$(cross-compiling))
diff --git a/benchtests/bench-memset-zerofill.c b/benchtests/bench-memset-zerofill.c
new file mode 100644
index 000000000000..7aa7fe048574
--- /dev/null
+++ b/benchtests/bench-memset-zerofill.c
@@ -0,0 +1,134 @@ 
+/* Measure memset functions with zero fill data.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define TEST_MAIN
+#define TEST_NAME "memset"
+#define START_SIZE 256
+#define MIN_PAGE_SIZE (getpagesize () + 64 * 1024 * 1024)
+#define TIMEOUT (20 * 60)
+#include "bench-string.h"
+
+#include "json-lib.h"
+
+void *generic_memset (void *, int, size_t);
+typedef void *(*proto_t) (void *, int, size_t);
+
+IMPL (MEMSET, 1)
+IMPL (generic_memset, 0)
+
+static void
+__attribute__((noinline, noclone))
+do_one_test (json_ctx_t *json_ctx, impl_t *impl, CHAR *s,
+	     int c1 __attribute ((unused)), int c2 __attribute ((unused)),
+	     size_t n)
+{
+  size_t i, iters = 32;
+  timing_t start, stop, cur, latency = 0;
+
+  CALL (impl, s, c2, n); // warm up
+
+  for (i = 0; i < iters; i++)
+    {
+      memset (s, c1, n); // alternation
+
+      TIMING_NOW (start);
+
+      CALL (impl, s, c2, n);
+
+      TIMING_NOW (stop);
+      TIMING_DIFF (cur, start, stop);
+      TIMING_ACCUM (latency, cur);
+    }
+
+  json_element_double (json_ctx, (double) latency / (double) iters);
+}
+
+static void
+do_test (json_ctx_t *json_ctx, size_t align, int c1, int c2, size_t len)
+{
+  align &= getpagesize () - 1;
+  if ((align + len) * sizeof (CHAR) > page_size)
+    return;
+
+  json_element_object_begin (json_ctx);
+  json_attr_uint (json_ctx, "length", len);
+  json_attr_uint (json_ctx, "alignment", align);
+  json_attr_int (json_ctx, "char1", c1);
+  json_attr_int (json_ctx, "char2", c2);
+  json_array_begin (json_ctx, "timings");
+
+  FOR_EACH_IMPL (impl, 0)
+    {
+      do_one_test (json_ctx, impl, (CHAR *) (buf1) + align, c1, c2, len);
+      alloc_bufs ();
+    }
+
+  json_array_end (json_ctx);
+  json_element_object_end (json_ctx);
+}
+
+int
+test_main (void)
+{
+  json_ctx_t json_ctx;
+  size_t i;
+  int c1, c2;
+
+  test_init ();
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+  json_attr_object_begin (&json_ctx, TEST_NAME);
+  json_attr_string (&json_ctx, "bench-variant", "zerofill");
+
+  json_array_begin (&json_ctx, "ifuncs");
+  FOR_EACH_IMPL (impl, 0)
+    json_element_string (&json_ctx, impl->name);
+  json_array_end (&json_ctx);
+
+  json_array_begin (&json_ctx, "results");
+
+  for (c1 = 0; c1 < 2; c1++)
+    for (c2 = 0; c2 < 2; c2++)
+      for (i = START_SIZE; i <= MIN_PAGE_SIZE; i <<= 1)
+	{
+	  do_test (&json_ctx, 0, c1, c2, i);
+	  do_test (&json_ctx, 3, c1, c2, i);
+	}
+
+  json_array_end (&json_ctx);
+  json_attr_object_end (&json_ctx);
+  json_attr_object_end (&json_ctx);
+  json_document_end (&json_ctx);
+
+  return ret;
+}
+
+#include <support/test-driver.c>
+
+#define libc_hidden_builtin_def(X)
+#define libc_hidden_def(X)
+#define libc_hidden_weak(X)
+#define weak_alias(X,Y)
+#undef MEMSET
+#define MEMSET generic_memset
+#include <string/memset.c>