[v3,1/7] benchtests: Add bench-qsort

Message ID 20210903171144.952737-2-adhemerval.zanella@linaro.org
State Superseded
Series Use introsort for qsort |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent

Commit Message

Adhemerval Zanella Sept. 3, 2021, 5:11 p.m. UTC
  This patch adds a qsort benchmark.  I tried to model the benchmark
taking into consideration the possible input variation in element size,
number of elements, and initial ordering, for both 1. real-world cases
and 2. possible scenarios based on hardware characteristics.

For 1. I tracked qsort usage (using a simple preload library to dump its
usage and a script to post-process it) on both a GCC bootstrap and
Firefox.  A GCC 8 bootstrap build shows 51786641 calls to qsort with the
following characteristics:

Key: number of elements:
key=2 : 39.74
key=3 : 19.23
key=4 : 9.77
key=1 : 8.44
key=0 : 6.60
key=5 : 4.40
key=7 : 2.37
key=6 : 2.25
key=9 : 1.48
key=8 : 0.97

Key: element size in bytes:
key=8 : 91.74
key=32 : 3.62
key=4 : 2.42
key=40 : 1.20
key=16 : 0.67
key=24 : 0.30
key=48 : 0.05
key=56 : 0.00
key=1 : 0.00

Key: total size (number of elements * element size):
key=16 : 35.98
key=24 : 18.67
key=32 : 9.79
key=8 : 8.28
key=0 : 6.60
key=40 : 4.21
key=64 : 3.15
key=48 : 2.24
key=56 : 2.15
key=80 : 1.45

So for GCC:

  - 80% of the qsort calls are done with 10 elements or fewer.
  - All calls use an element size of at most 56 bytes.
  - 90% of the calls sort arrays with a total size of 80 bytes or
    less.

(Recent GCC versions now use an internal qsort implementation instead
of the libc one.)
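The preload library used for this tracking is not part of the patch; a
minimal sketch of how such an interposer might look (the file and helper
names here are hypothetical) is:

```c
/* Hypothetical sketch of a qsort-tracking interposer, built as a shared
   object and loaded with LD_PRELOAD.  It forwards to the real qsort via
   dlsym and logs nmemb and size for later post-processing.
   Compile with: gcc -shared -fPIC qsort-trace.c -o qsort-trace.so -ldl  */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*qsort_fn) (void *, size_t, size_t,
                          int (*) (const void *, const void *));

void
qsort (void *base, size_t nmemb, size_t size,
       int (*compar) (const void *, const void *))
{
  static qsort_fn real_qsort;
  if (real_qsort == NULL)
    real_qsort = (qsort_fn) dlsym (RTLD_NEXT, "qsort");

  /* One line per call; a script can aggregate these into the
     histograms shown above.  */
  fprintf (stderr, "qsort nmemb=%zu size=%zu\n", nmemb, size);

  real_qsort (base, nmemb, size, compar);
}

/* Tiny comparator, only for demonstrating the wrapper.  */
int
int_cmp (const void *a, const void *b)
{
  int ia = *(const int *) a, ib = *(const int *) b;
  return (ia > ib) - (ia < ib);
}
```

Running a workload as e.g. `LD_PRELOAD=./qsort-trace.so gcc ...` would
then produce one log line per call for the post-processing script.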

The Firefox data covers 2 hours of usage, with the first 10 minutes
spent actively opening and closing different types of sites.  It
resulted in 21042 calls with the following characteristics:

Key: number of elements:
key=7 : 24.40
key=1 : 10.44
key=3 : 6.33
key=4 : 5.81
key=2 : 5.46
key=6 : 4.80
key=17 : 4.54
key=0 : 3.07
key=5 : 3.05
key=9 : 2.51
key=12 : 2.06

Key: element size in bytes:
key=8 : 94.49
key=28 : 4.40
key=2 : 0.70
key=16 : 0.19
key=36 : 0.07
key=12 : 0.07
key=40 : 0.07
key=24 : 0.03

Key: total size (number of elements * element size):
key=56 : 24.20
key=8 : 10.27
key=24 : 6.36
key=32 : 5.86
key=16 : 5.46
key=48 : 4.80
key=476 : 3.75
key=0 : 3.07
key=40 : 3.05
key=72 : 2.50

So for Firefox:

  - 72% of the qsort calls are done with 18 elements or fewer.
  - All calls use an element size of at most 40 bytes.
  - 70% of the calls sort arrays with a total size of 476 bytes or
    less.

For 2. I assumed a machine with 3 levels of cache, sized 32 KB (L1),
256 KB (L2), and 4 MB (L3).

It resulted in a benchmark with the following traits:

  * It checks five types of input arrays: sorted, mostly sorted, random,
    repeated, and bitonic:

    - 'sorted': the array is already sorted.
    - 'mostly sorted': the array has a certain number of random
       positions with random values (the current ratio used is 20%).
    - 'random': the array contains random elements.
    - 'repeated': the array has random elements, with a certain
       number (the current ratio is 20%) of one repeated element
       distributed randomly.
    - 'bitonic': elements are strictly increasing up to the middle of
       the array and strictly decreasing afterwards.  This triggers the
       pattern-defeating input for the median-of-3 pivot selection used
       in the quicksort implementation.

  * Three element sizes are checked: uint32_t, uint64_t, and an
    element of 32 bytes (still compared with the uint32_t comparison
    function).  These element sizes were chosen 1. to avoid including
    the comparison function itself and/or the memory copy cost in the
    sort benchmark, and 2. because size_t-sized keys are by far the
    most common for both GCC and Firefox.

  * Four different element counts: 64 (which covers most of the calls
    seen for both GCC and Firefox), 4096/8192 (which fills a 32 KB L1
    with 64-bit or 32-bit elements), 32768/65536 (a 256 KB L2), and
    524288/1048576 (a 4 MB L3).  The counts are configurable with the
    --nelem option.
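The input patterns and the 20% ratio described above can be sketched
with fill helpers along these lines (a hypothetical outline; the patch's
actual generators live in stdlib/tst-qsort-common.c):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketches of the five input patterns; RATIO_PCT matches
   the 20% ratio mentioned above.  */
enum { RATIO_PCT = 20 };

void
fill_sorted (uint32_t *a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = i;
}

void
fill_random (uint32_t *a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = random ();
}

void
fill_mostly_sorted (uint32_t *a, size_t n)
{
  fill_sorted (a, n);
  /* Disturb roughly 20% of the positions with random values.  */
  for (size_t i = 0; i < n * RATIO_PCT / 100; i++)
    a[random () % n] = random ();
}

void
fill_repeated (uint32_t *a, size_t n)
{
  fill_random (a, n);
  /* Overwrite roughly 20% of the positions with one repeated value.  */
  for (size_t i = 0; i < n * RATIO_PCT / 100; i++)
    a[random () % n] = 42;
}

void
fill_bitonic (uint32_t *a, size_t n)
{
  /* Strictly increasing up to the middle, then strictly decreasing:
     the classic adversarial input for median-of-3 pivot selection.  */
  size_t i, v;
  for (i = 0, v = 0; i < n / 2; i++)
    a[i] = v++;
  for (; i < n; i++)
    a[i] = v--;
}
```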

Checked on x86_64-linux-gnu
---
 benchtests/Makefile      |   2 +-
 benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 benchtests/bench-qsort.c
  

Comments

Alexander Monakov Sept. 4, 2021, 9:09 a.m. UTC | #1
On Fri, 3 Sep 2021, Adhemerval Zanella via Libc-alpha wrote:

> So for GCC:
> 
>   - 80% of total qsort usage are done with 10 elements of less.

Please keep in mind that calls with larger number of elements run
slower. If you look at cycle count rather than call count, you'll find
that it is more evenly split: about 50% of total time in qsort is with
10 elements or fewer, the other 50% are with 11 elements or more.

Alexander
  
Adhemerval Zanella Sept. 6, 2021, 6:30 p.m. UTC | #2
On 04/09/2021 06:09, Alexander Monakov wrote:
> On Fri, 3 Sep 2021, Adhemerval Zanella via Libc-alpha wrote:
> 
>> So for GCC:
>>
>>   - 80% of total qsort usage are done with 10 elements of less.
> 
> Please keep in mind that calls with larger number of elements run
> slower. If you look at cycle count rather than call count, you'll find
> that it is more evenly split: about 50% of total time in qsort is with
> 10 elements or fewer, the other 50% are with 11 elements or more.

Thanks for the info; indeed I only took the call count into
consideration, not the cycles spent.
  
Noah Goldstein Oct. 13, 2021, 3:19 a.m. UTC | #3
On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> This patch adds a qsort benchmark.  I tried to model the benchmark
> taking in consideration the possible input variation in both internal
> element size, element numbers, and internal state for 1. real word cases
> and 2. possible scenarios based on hardware characteristics.
>
> For 1. I tracked qsort usage (using a simple preload library to dump its
> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
> GCC 8 bootstrap build shows 51786641 call to qsort with the following
> characterics:
>
> Key: number of elements:
> key=2 : 39.74
> key=3 : 19.23
> key=4 : 9.77
> key=1 : 8.44
> key=0 : 6.60
> key=5 : 4.40
> key=7 : 2.37
> key=6 : 2.25
> key=9 : 1.48
> key=8 : 0.97
>
> Key: element size in bytes:
> key=8 : 91.74
> key=32 : 3.62
> key=4 : 2.42
> key=40 : 1.20
> key=16 : 0.67
> key=24 : 0.30
> key=48 : 0.05
> key=56 : 0.00
> key=1 : 0.00
>
> Key: total size (number of elements * element size):
> key=16 : 35.98
> key=24 : 18.67
> key=32 : 9.79
> key=8 : 8.28
> key=0 : 6.60
> key=40 : 4.21
> key=64 : 3.15
> key=48 : 2.24
> key=56 : 2.15
> key=80 : 1.45
>
> So for GCC:
>
>   - 80% of total qsort usage are done with 10 elements of less.
>   - All usages are done element size of maximum of 56 bytes.
>   - 90% of calls are done with array of maximum size of 80 bytes or
>     less.
>
> (GCC on recent version now uses a internal qsort implementation instead
> of the libc one).
>
> The Firefox usage was done with 2 hours of usage, with first 10 minutes
> activelly openning and closing different types of sites.  It resulted in
> 21042 calls with following characteristics:
>
> Key: number of elements:
> key=7 : 24.40
> key=1 : 10.44
> key=3 : 6.33
> key=4 : 5.81
> key=2 : 5.46
> key=6 : 4.80
> key=17 : 4.54
> key=0 : 3.07
> key=5 : 3.05
> key=9 : 2.51
> key=12 : 2.06
>
> Key: element size in bytes:
> key=8 : 94.49
> key=28 : 4.40
> key=2 : 0.70
> key=16 : 0.19
> key=36 : 0.07
> key=12 : 0.07
> key=40 : 0.07
> key=24 : 0.03
>
> Key: total size (number of elements * element size):
> key=56 : 24.20
> key=8 : 10.27
> key=24 : 6.36
> key=32 : 5.86
> key=16 : 5.46
> key=48 : 4.80
> key=476 : 3.75
> key=0 : 3.07
> key=40 : 3.05
> key=72 : 2.50
>
> So for Firefox:
>
>   - 72% of total qsort usage are done with 18 elements of less.
>   - All usages are done element size of maximum of 40 bytes.
>   - 70% of calls are done with array of maximum size of 476 bytes or
>     less.
>
> For 2. I used the idea of a machine with 3 levels of cache with sizes
> L1 32kb, L2 256kb, and L3 4096Kb.
>
> It resulted in a benchmark with following traits:
>
>   * It checks four types of input arrays: sorted, mostly sorted, random,
>     repeated, and bitonic:
>
>     - 'sorted': the array is already sorted.
>     - 'mostly sorted': the array will have a certain number of random
>        position with random values (current ratio used is 20%).
>     - 'random' the array will contain random elements.
>     - 'repeated' the array will have random elements with a certain
>        number (current ratio is 20%) of a repeated element distributed
>        randomly.
>     - 'bitonic': elements will be strictly increasing to up middle of the
>        array then strictly decreasing.  This is to trigger the
>        pattern-defeting input for the median of 3 pivot selection used
>        in quicksort implementation.
>
>   * Three elements sizes are checked: uint32_t, uint64_t, and an
>     element with 32 bytes (but using the uint32_t comparison checks).
>     These element sizes are used to 1. to avoid include the comparison
>     function itself and/or memory copy in sort benchmark itself, and 2.
>     because key of size_t are the most used for both GCC and Firefox.
>
>   * Five different element numbers: 64 (which cover mostly of used
>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
>     --nelem option.
>
> Checked on x86_64-linux-gnu
> ---
>  benchtests/Makefile      |   2 +-
>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 196 insertions(+), 1 deletion(-)
>  create mode 100644 benchtests/bench-qsort.c
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 1530939a8c..781cbca8af 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>  include ../gen-locales.mk
>  endif
>
> -stdlib-benchset := strtod
> +stdlib-benchset := strtod qsort
>
>  stdio-common-benchset := sprintf
>
> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
> new file mode 100644
> index 0000000000..905aa6d7b6
> --- /dev/null
> +++ b/benchtests/bench-qsort.c
> @@ -0,0 +1,195 @@
> +/* Measure qsort function.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <array_length.h>
> +#include <bench-timing.h>
> +#include <errno.h>
> +#include <getopt.h>
> +#include <json-lib.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +
> +#include <stdlib/tst-qsort-common.c>
> +
> +/* Number of elements of determined type to be measured.  */
> +static const size_t default_nelems[] =
> +{
> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
> +                              on GCC build.  */
> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
> +};
> +
> +#define OPT_NELEM 10000
> +#define OPT_SEED  10001
> +#define CMDLINE_OPTIONS \
> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
> +  { "seed", required_argument, NULL, OPT_SEED },
> +
> +static const size_t max_nelem = 16;
> +static size_t *elems = NULL;
> +static size_t nelem = 0;
> +static uint64_t seed = 0;
> +static bool seed_set = false;
> +
> +static void
> +cmdline_process_function (int c)
> +{
> +  switch (c)
> +    {
> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
> +        The elements sizes as passed with a ':' as the delimiter, for
> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
> +      case OPT_NELEM:
> +        {
> +         elems = xmalloc (max_nelem * sizeof (size_t));
> +         nelem = 0;
> +
> +         char *saveptr;
> +         char *token;
> +         token = strtok_r (optarg, ":", &saveptr);
> +         if (token == NULL)
> +           {
> +             printf ("error: invalid --nelem value\n");
> +             exit (EXIT_FAILURE);
> +           }
> +         do
> +           {
> +             if (nelem == max_nelem)
> +               {
> +                 printf ("error: invalid --nelem value (max elem)\n");
> +                 exit (EXIT_FAILURE);
> +               }
> +             elems[nelem++] = atol (token);
> +             token = strtok_r (saveptr, ":", &saveptr);
> +           } while (token != NULL);
> +        }
> +      break;
> +
> +      /* handle the --seed option to use a different seed than a random one.
> +        The SEED used should be a uint64_t number.  */
> +      case OPT_SEED:
> +       {
> +         uint64_t value = strtoull (optarg, NULL, 0);
> +         if (errno == ERANGE || value > UINT64_MAX)
> +           {
> +             printf ("error: seed should be a value in range of "
> +                     "[0, UINT64_MAX]\n");
> +             exit (EXIT_FAILURE);
> +           }
> +         seed = value;
> +         seed_set = true;
> +       }
> +    }
> +}
> +
> +#define CMDLINE_PROCESS cmdline_process_function
> +
> +static int
> +do_test (void)
> +{
> +  random_init ();
> +
> +
> +  json_ctx_t json_ctx;
> +
> +  json_init (&json_ctx, 0, stdout);
> +
> +  json_document_begin (&json_ctx);
> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> +
> +  json_attr_object_begin (&json_ctx, "functions");
> +  json_attr_object_begin (&json_ctx, "qsort");
> +  json_attr_uint (&json_ctx, "seed", seed);
> +
> +  json_array_begin (&json_ctx, "results");
> +
> +  const size_t *welem = elems == NULL ? default_nelems : elems;
> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
> +
> +  struct test_t
> +  {
> +    size_t type_size;
> +    cmpfunc_t cmpfunc;
> +  };
> +  static const struct test_t tests[] =
> +  {
> +    { sizeof (uint32_t), uint32_t_cmp },
> +    { sizeof (uint64_t), uint64_t_cmp },
> +    /* Test swap with large elements.  */
> +    { 32,                uint32_t_cmp },
> +  };
> +
> +  const size_t inner_loop_iters = 16;
> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
> +    {
> +      for (const struct array_t *arraytype = arraytypes;
> +          arraytype < array_end (arraytypes);
> +          ++arraytype)
> +       {
> +         for (int k = 0; k < wnelem; k++)
> +           {
> +             json_element_object_begin (&json_ctx);
> +
> +             size_t arraysize = welem[k] * test->type_size;
> +
> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
> +             json_attr_string (&json_ctx, "property", arraytype->name);
> +
> +             void *base = create_array (welem[k], test->type_size,
> +                                        arraytype->type);
> +             void *work = xmalloc (arraysize);
> +
> +             timing_t total = 0;
> +
> +             for (int n = 0; n < inner_loop_iters; n++)
> +               {
> +                 memcpy (work, base, arraysize);
> +
> +                 timing_t start, end, diff;
> +                 TIMING_NOW (start);
> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
> +                 TIMING_NOW (end);
> +
> +                 TIMING_DIFF (diff, start, end);
> +                 TIMING_ACCUM (total, diff);
> +               }
> +
To avoid overhead from timing, maybe you could allocate / initialize
multiple buffers to test a given sort on.  Then, instead of having to
re-memcpy and time each iteration, you just loop through the N
initialized buffers with one timing call.  For the smaller sorts this
may help decrease the timing overhead.

> +            json_attr_uint (&json_ctx, "timings",
> +                            (double) total / (double) inner_loop_iters);
> +            json_element_object_end (&json_ctx);
> +
> +            free (base);
> +            free (work);
> +          }
> +        }
> +    }
> +
> +  json_array_end (&json_ctx);
> +
> +  json_attr_object_end (&json_ctx);
> +  json_attr_object_end (&json_ctx);
> +  json_document_end (&json_ctx);
> +
> +  return 0;
> +}
> +
> +#define TIMEOUT 600
> +#include <support/test-driver.c>
> --
> 2.30.2
>
  
Adhemerval Zanella Oct. 15, 2021, 12:52 p.m. UTC | #4
On 13/10/2021 00:19, Noah Goldstein wrote:
> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> This patch adds a qsort benchmark.  I tried to model the benchmark
>> taking in consideration the possible input variation in both internal
>> element size, element numbers, and internal state for 1. real word cases
>> and 2. possible scenarios based on hardware characteristics.
>>
>> For 1. I tracked qsort usage (using a simple preload library to dump its
>> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
>> GCC 8 bootstrap build shows 51786641 call to qsort with the following
>> characterics:
>>
>> Key: number of elements:
>> key=2 : 39.74
>> key=3 : 19.23
>> key=4 : 9.77
>> key=1 : 8.44
>> key=0 : 6.60
>> key=5 : 4.40
>> key=7 : 2.37
>> key=6 : 2.25
>> key=9 : 1.48
>> key=8 : 0.97
>>
>> Key: element size in bytes:
>> key=8 : 91.74
>> key=32 : 3.62
>> key=4 : 2.42
>> key=40 : 1.20
>> key=16 : 0.67
>> key=24 : 0.30
>> key=48 : 0.05
>> key=56 : 0.00
>> key=1 : 0.00
>>
>> Key: total size (number of elements * element size):
>> key=16 : 35.98
>> key=24 : 18.67
>> key=32 : 9.79
>> key=8 : 8.28
>> key=0 : 6.60
>> key=40 : 4.21
>> key=64 : 3.15
>> key=48 : 2.24
>> key=56 : 2.15
>> key=80 : 1.45
>>
>> So for GCC:
>>
>>   - 80% of total qsort usage are done with 10 elements of less.
>>   - All usages are done element size of maximum of 56 bytes.
>>   - 90% of calls are done with array of maximum size of 80 bytes or
>>     less.
>>
>> (GCC on recent version now uses a internal qsort implementation instead
>> of the libc one).
>>
>> The Firefox usage was done with 2 hours of usage, with first 10 minutes
>> activelly openning and closing different types of sites.  It resulted in
>> 21042 calls with following characteristics:
>>
>> Key: number of elements:
>> key=7 : 24.40
>> key=1 : 10.44
>> key=3 : 6.33
>> key=4 : 5.81
>> key=2 : 5.46
>> key=6 : 4.80
>> key=17 : 4.54
>> key=0 : 3.07
>> key=5 : 3.05
>> key=9 : 2.51
>> key=12 : 2.06
>>
>> Key: element size in bytes:
>> key=8 : 94.49
>> key=28 : 4.40
>> key=2 : 0.70
>> key=16 : 0.19
>> key=36 : 0.07
>> key=12 : 0.07
>> key=40 : 0.07
>> key=24 : 0.03
>>
>> Key: total size (number of elements * element size):
>> key=56 : 24.20
>> key=8 : 10.27
>> key=24 : 6.36
>> key=32 : 5.86
>> key=16 : 5.46
>> key=48 : 4.80
>> key=476 : 3.75
>> key=0 : 3.07
>> key=40 : 3.05
>> key=72 : 2.50
>>
>> So for Firefox:
>>
>>   - 72% of total qsort usage are done with 18 elements of less.
>>   - All usages are done element size of maximum of 40 bytes.
>>   - 70% of calls are done with array of maximum size of 476 bytes or
>>     less.
>>
>> For 2. I used the idea of a machine with 3 levels of cache with sizes
>> L1 32kb, L2 256kb, and L3 4096Kb.
>>
>> It resulted in a benchmark with following traits:
>>
>>   * It checks four types of input arrays: sorted, mostly sorted, random,
>>     repeated, and bitonic:
>>
>>     - 'sorted': the array is already sorted.
>>     - 'mostly sorted': the array will have a certain number of random
>>        position with random values (current ratio used is 20%).
>>     - 'random' the array will contain random elements.
>>     - 'repeated' the array will have random elements with a certain
>>        number (current ratio is 20%) of a repeated element distributed
>>        randomly.
>>     - 'bitonic': elements will be strictly increasing to up middle of the
>>        array then strictly decreasing.  This is to trigger the
>>        pattern-defeting input for the median of 3 pivot selection used
>>        in quicksort implementation.
>>
>>   * Three elements sizes are checked: uint32_t, uint64_t, and an
>>     element with 32 bytes (but using the uint32_t comparison checks).
>>     These element sizes are used to 1. to avoid include the comparison
>>     function itself and/or memory copy in sort benchmark itself, and 2.
>>     because key of size_t are the most used for both GCC and Firefox.
>>
>>   * Five different element numbers: 64 (which cover mostly of used
>>     elementsizes for both GCC and Firefox), 4096/8192 (which cover 32 KB
>>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
>>     24288/1048576 (L3 with  4 MB).  The sizes are configurable by
>>     --nelem option.
>>
>> Checked on x86_64-linux-gnu
>> ---
>>  benchtests/Makefile      |   2 +-
>>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 196 insertions(+), 1 deletion(-)
>>  create mode 100644 benchtests/bench-qsort.c
>>
>> diff --git a/benchtests/Makefile b/benchtests/Makefile
>> index 1530939a8c..781cbca8af 100644
>> --- a/benchtests/Makefile
>> +++ b/benchtests/Makefile
>> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
>>  include ../gen-locales.mk
>>  endif
>>
>> -stdlib-benchset := strtod
>> +stdlib-benchset := strtod qsort
>>
>>  stdio-common-benchset := sprintf
>>
>> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
>> new file mode 100644
>> index 0000000000..905aa6d7b6
>> --- /dev/null
>> +++ b/benchtests/bench-qsort.c
>> @@ -0,0 +1,195 @@
>> +/* Measure qsort function.
>> +   Copyright (C) 2018 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include <array_length.h>
>> +#include <bench-timing.h>
>> +#include <errno.h>
>> +#include <getopt.h>
>> +#include <json-lib.h>
>> +#include <stdio.h>
>> +#include <unistd.h>
>> +
>> +#include <stdlib/tst-qsort-common.c>
>> +
>> +/* Number of elements of determined type to be measured.  */
>> +static const size_t default_nelems[] =
>> +{
>> +  256/sizeof(size_t),       /* 64/128 which covers mostly used element number
>> +                              on GCC build.  */
>> +  32768/sizeof(size_t),            /* 4096/8192 to fit on a L1 with 32 KB.  */
>> +  262144/sizeof(size_t),    /* 32768/65536 to fit on a L2 with 256 KB.  */
>> +  4194304/sizeof(size_t),   /* 524288/1048576 to fix on a L3 with 4 MB.  */
>> +};
>> +
>> +#define OPT_NELEM 10000
>> +#define OPT_SEED  10001
>> +#define CMDLINE_OPTIONS \
>> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
>> +  { "seed", required_argument, NULL, OPT_SEED },
>> +
>> +static const size_t max_nelem = 16;
>> +static size_t *elems = NULL;
>> +static size_t nelem = 0;
>> +static uint64_t seed = 0;
>> +static bool seed_set = false;
>> +
>> +static void
>> +cmdline_process_function (int c)
>> +{
>> +  switch (c)
>> +    {
>> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
>> +        The elements sizes as passed with a ':' as the delimiter, for
>> +        instance --nelem 32:128:1024 will ran 32, 128, and 1024 elements.  */
>> +      case OPT_NELEM:
>> +        {
>> +         elems = xmalloc (max_nelem * sizeof (size_t));
>> +         nelem = 0;
>> +
>> +         char *saveptr;
>> +         char *token;
>> +         token = strtok_r (optarg, ":", &saveptr);
>> +         if (token == NULL)
>> +           {
>> +             printf ("error: invalid --nelem value\n");
>> +             exit (EXIT_FAILURE);
>> +           }
>> +         do
>> +           {
>> +             if (nelem == max_nelem)
>> +               {
>> +                 printf ("error: invalid --nelem value (max elem)\n");
>> +                 exit (EXIT_FAILURE);
>> +               }
>> +             elems[nelem++] = atol (token);
>> +             token = strtok_r (saveptr, ":", &saveptr);
>> +           } while (token != NULL);
>> +        }
>> +      break;
>> +
>> +      /* handle the --seed option to use a different seed than a random one.
>> +        The SEED used should be a uint64_t number.  */
>> +      case OPT_SEED:
>> +       {
>> +         uint64_t value = strtoull (optarg, NULL, 0);
>> +         if (errno == ERANGE || value > UINT64_MAX)
>> +           {
>> +             printf ("error: seed should be a value in range of "
>> +                     "[0, UINT64_MAX]\n");
>> +             exit (EXIT_FAILURE);
>> +           }
>> +         seed = value;
>> +         seed_set = true;
>> +       }
>> +    }
>> +}
>> +
>> +#define CMDLINE_PROCESS cmdline_process_function
>> +
>> +static int
>> +do_test (void)
>> +{
>> +  random_init ();
>> +
>> +
>> +  json_ctx_t json_ctx;
>> +
>> +  json_init (&json_ctx, 0, stdout);
>> +
>> +  json_document_begin (&json_ctx);
>> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
>> +
>> +  json_attr_object_begin (&json_ctx, "functions");
>> +  json_attr_object_begin (&json_ctx, "qsort");
>> +  json_attr_uint (&json_ctx, "seed", seed);
>> +
>> +  json_array_begin (&json_ctx, "results");
>> +
>> +  const size_t *welem = elems == NULL ? default_nelems : elems;
>> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
>> +
>> +  struct test_t
>> +  {
>> +    size_t type_size;
>> +    cmpfunc_t cmpfunc;
>> +  };
>> +  static const struct test_t tests[] =
>> +  {
>> +    { sizeof (uint32_t), uint32_t_cmp },
>> +    { sizeof (uint64_t), uint64_t_cmp },
>> +    /* Test swap with large elements.  */
>> +    { 32,                uint32_t_cmp },
>> +  };
>> +
>> +  const size_t inner_loop_iters = 16;
>> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
>> +    {
>> +      for (const struct array_t *arraytype = arraytypes;
>> +          arraytype < array_end (arraytypes);
>> +          ++arraytype)
>> +       {
>> +         for (int k = 0; k < wnelem; k++)
>> +           {
>> +             json_element_object_begin (&json_ctx);
>> +
>> +             size_t arraysize = welem[k] * test->type_size;
>> +
>> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
>> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
>> +             json_attr_string (&json_ctx, "property", arraytype->name);
>> +
>> +             void *base = create_array (welem[k], test->type_size,
>> +                                        arraytype->type);
>> +             void *work = xmalloc (arraysize);
>> +
>> +             timing_t total = 0;
>> +
>> +             for (int n = 0; n < inner_loop_iters; n++)
>> +               {
>> +                 memcpy (work, base, arraysize);
>> +
>> +                 timing_t start, end, diff;
>> +                 TIMING_NOW (start);
>> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
>> +                 TIMING_NOW (end);
>> +
>> +                 TIMING_DIFF (diff, start, end);
>> +                 TIMING_ACCUM (total, diff);
>> +               }
>> +
> To avoid overhead from timing. Maybe you could allocate / initialize
> multiple buffers to test a given sort on. Then instead of having to
> re-memcpy / time you just loop through the N initialized buffers
> with one timing call. For the smaller sorts this may help decrease
> the timing overhead.

I think this makes sense, although it requires more memory.  I changed
it to:

---
  for (const struct test_t *test = tests; test < array_end (tests); ++test)
    {
      for (size_t at = 0; at < array_length (arraytypes); at++)
        {
          for (int ne = 0; ne < wnelem; ne++)
            {
              void *inputs[INNER_LOOP_ITERS];
              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                inputs[i] = create_array (welem[ne], test->type_size,
                                          arraytypes[at].type);

              timing_t start, end, diff;
              TIMING_NOW (start);
              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc);
              TIMING_NOW (end);

              TIMING_DIFF (diff, start, end);

              json_element_object_begin (&json_ctx);
              json_attr_uint (&json_ctx, "nmemb", welem[ne]);
              json_attr_uint (&json_ctx, "type_size", test->type_size);
              json_attr_string (&json_ctx, "property", arraytypes[at].name);
              uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS;
              json_attr_uint (&json_ctx, "timings", timing);
              json_element_object_end (&json_ctx);

              for (int i = 0; i < INNER_LOOP_ITERS; i++)
                free (inputs[i]);
            }
        }
    }
---
  
Noah Goldstein Oct. 15, 2021, 4:39 p.m. UTC | #5
On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 13/10/2021 00:19, Noah Goldstein wrote:
> > On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> This patch adds a qsort benchmark.  I tried to model the benchmark
> >> taking in consideration the possible input variation in both internal
> >> element size, element numbers, and internal state for 1. real word cases
> >> and 2. possible scenarios based on hardware characteristics.
> >>
> >> For 1. I tracked qsort usage (using a simple preload library to dump its
> >> usage and a script to pos-process it) on both GCC bootstrap and Firefox.
> >> GCC 8 bootstrap build shows 51786641 call to qsort with the following
> >> characterics:
> >>
> >> Key: number of elements:
> >> key=2 : 39.74
> >> key=3 : 19.23
> >> key=4 : 9.77
> >> key=1 : 8.44
> >> key=0 : 6.60
> >> key=5 : 4.40
> >> key=7 : 2.37
> >> key=6 : 2.25
> >> key=9 : 1.48
> >> key=8 : 0.97
> >>
> >> Key: element size in bytes:
> >> key=8 : 91.74
> >> key=32 : 3.62
> >> key=4 : 2.42
> >> key=40 : 1.20
> >> key=16 : 0.67
> >> key=24 : 0.30
> >> key=48 : 0.05
> >> key=56 : 0.00
> >> key=1 : 0.00
> >>
> >> Key: total size (number of elements * element size):
> >> key=16 : 35.98
> >> key=24 : 18.67
> >> key=32 : 9.79
> >> key=8 : 8.28
> >> key=0 : 6.60
> >> key=40 : 4.21
> >> key=64 : 3.15
> >> key=48 : 2.24
> >> key=56 : 2.15
> >> key=80 : 1.45
> >>
> >> So for GCC:
> >>
> >>   - 80% of total qsort usage is done with 10 elements or less.
> >>   - All usages have an element size of at most 56 bytes.
> >>   - 90% of calls are done with a total array size of 80 bytes or
> >>     less.
> >>
> >> (Recent GCC versions now use an internal qsort implementation instead
> >> of the libc one.)
> >>
> >> The Firefox usage was collected over 2 hours of usage, with the first
> >> 10 minutes actively opening and closing different types of sites.  It
> >> resulted in 21042 calls with the following characteristics:
> >>
> >> Key: number of elements:
> >> key=7 : 24.40
> >> key=1 : 10.44
> >> key=3 : 6.33
> >> key=4 : 5.81
> >> key=2 : 5.46
> >> key=6 : 4.80
> >> key=17 : 4.54
> >> key=0 : 3.07
> >> key=5 : 3.05
> >> key=9 : 2.51
> >> key=12 : 2.06
> >>
> >> Key: element size in bytes:
> >> key=8 : 94.49
> >> key=28 : 4.40
> >> key=2 : 0.70
> >> key=16 : 0.19
> >> key=36 : 0.07
> >> key=12 : 0.07
> >> key=40 : 0.07
> >> key=24 : 0.03
> >>
> >> Key: total size (number of elements * element size):
> >> key=56 : 24.20
> >> key=8 : 10.27
> >> key=24 : 6.36
> >> key=32 : 5.86
> >> key=16 : 5.46
> >> key=48 : 4.80
> >> key=476 : 3.75
> >> key=0 : 3.07
> >> key=40 : 3.05
> >> key=72 : 2.50
> >>
> >> So for Firefox:
> >>
> >>   - 72% of total qsort usage is done with 18 elements or less.
> >>   - All usages have an element size of at most 40 bytes.
> >>   - 70% of calls are done with a total array size of 476 bytes or
> >>     less.
> >>
> >> For 2. I used the idea of a machine with 3 levels of cache with sizes
> >> L1 32 KB, L2 256 KB, and L3 4096 KB.
> >>
> >> It resulted in a benchmark with the following traits:
> >>
> >>   * It checks five types of input arrays: sorted, mostly sorted, random,
> >>     repeated, and bitonic:
> >>
> >>     - 'sorted': the array is already sorted.
> >>     - 'mostly sorted': the array has a certain number of random
> >>        positions with random values (the current ratio used is 20%).
> >>     - 'random': the array contains random elements.
> >>     - 'repeated': the array has random elements with a certain
> >>        number (current ratio is 20%) of a repeated element distributed
> >>        randomly.
> >>     - 'bitonic': elements are strictly increasing up to the middle of
> >>        the array, then strictly decreasing.  This triggers the
> >>        pattern-defeating input for the median-of-3 pivot selection used
> >>        in the quicksort implementation.
> >>
> >>   * Three element sizes are checked: uint32_t, uint64_t, and an
> >>     element of 32 bytes (but using the uint32_t comparison checks).
> >>     These element sizes are used 1. to avoid including the comparison
> >>     function itself and/or memory copies in the sort benchmark itself,
> >>     and 2. because size_t keys are the most used for both GCC and
> >>     Firefox.
> >>
> >>   * Five different element counts: 64 (which covers most of the element
> >>     counts used by both GCC and Firefox), 4096/8192 (which covers 32 KB
> >>     of L1 for 32 and 64 bits), 32768/65536 (L2 with 256 KB), and
> >>     524288/1048576 (L3 with 4 MB).  The sizes are configurable with the
> >>     --nelem option.
> >>
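The five input classes described above can be sketched as simple generators. This is an illustrative sketch, not the patch's actual `create_array`; the function names are hypothetical and the 20% ratios mirror the description:

```c
/* Illustrative generators for the five input classes (sorted, mostly
   sorted, random, repeated, bitonic).  Names and 20% ratios follow the
   commit message; the patch's actual create_array differs.  */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static void
fill_sorted (uint32_t *a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = (uint32_t) i;
}

static void
fill_mostly_sorted (uint32_t *a, size_t n)
{
  /* Start sorted, then scribble random values into ~20% of positions.  */
  fill_sorted (a, n);
  for (size_t i = 0; i < n / 5; i++)
    a[(size_t) rand () % n] = (uint32_t) rand ();
}

static void
fill_random (uint32_t *a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = (uint32_t) rand ();
}

static void
fill_repeated (uint32_t *a, size_t n)
{
  /* Random input with ~20% of positions overwritten by one value.  */
  fill_random (a, n);
  for (size_t i = 0; i < n / 5; i++)
    a[(size_t) rand () % n] = 42;
}

static void
fill_bitonic (uint32_t *a, size_t n)
{
  /* Strictly increasing up to the middle, then strictly decreasing:
     the classic bad case for median-of-3 pivot selection.  */
  for (size_t i = 0; i < n / 2; i++)
    a[i] = (uint32_t) i;
  for (size_t i = n / 2; i < n; i++)
    a[i] = (uint32_t) (n - i);
}
```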
> >> Checked on x86_64-linux-gnu
> >> ---
> >>  benchtests/Makefile      |   2 +-
> >>  benchtests/bench-qsort.c | 195 +++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 196 insertions(+), 1 deletion(-)
> >>  create mode 100644 benchtests/bench-qsort.c
> >>
> >> diff --git a/benchtests/Makefile b/benchtests/Makefile
> >> index 1530939a8c..781cbca8af 100644
> >> --- a/benchtests/Makefile
> >> +++ b/benchtests/Makefile
> >> @@ -76,7 +76,7 @@ LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
> >>  include ../gen-locales.mk
> >>  endif
> >>
> >> -stdlib-benchset := strtod
> >> +stdlib-benchset := strtod qsort
> >>
> >>  stdio-common-benchset := sprintf
> >>
> >> diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
> >> new file mode 100644
> >> index 0000000000..905aa6d7b6
> >> --- /dev/null
> >> +++ b/benchtests/bench-qsort.c
> >> @@ -0,0 +1,195 @@
> >> +/* Measure qsort function.
> >> +   Copyright (C) 2018 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <http://www.gnu.org/licenses/>.  */
> >> +
> >> +#include <array_length.h>
> >> +#include <bench-timing.h>
> >> +#include <errno.h>
> >> +#include <getopt.h>
> >> +#include <json-lib.h>
> >> +#include <stdio.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <stdlib/tst-qsort-common.c>
> >> +
> >> +/* Number of elements of determined type to be measured.  */
> >> +static const size_t default_nelems[] =
> >> +{
> >> +  256/sizeof(size_t),       /* 64/128, which covers the most used element
> >> +                              counts on a GCC build.  */
> >> +  32768/sizeof(size_t),            /* 4096/8192 to fit in a L1 with 32 KB.  */
> >> +  262144/sizeof(size_t),    /* 32768/65536 to fit in a L2 with 256 KB.  */
> >> +  4194304/sizeof(size_t),   /* 524288/1048576 to fit in a L3 with 4 MB.  */
> >> +};
> >> +
> >> +#define OPT_NELEM 10000
> >> +#define OPT_SEED  10001
> >> +#define CMDLINE_OPTIONS \
> >> +  { "nelem", required_argument, NULL, OPT_NELEM }, \
> >> +  { "seed", required_argument, NULL, OPT_SEED },
> >> +
> >> +static const size_t max_nelem = 16;
> >> +static size_t *elems = NULL;
> >> +static size_t nelem = 0;
> >> +static uint64_t seed = 0;
> >> +static bool seed_set = false;
> >> +
> >> +static void
> >> +cmdline_process_function (int c)
> >> +{
> >> +  switch (c)
> >> +    {
> >> +      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
> >> +        The element sizes are passed with ':' as the delimiter; for
> >> +        instance --nelem 32:128:1024 will run 32, 128, and 1024 elements.  */
> >> +      case OPT_NELEM:
> >> +        {
> >> +         elems = xmalloc (max_nelem * sizeof (size_t));
> >> +         nelem = 0;
> >> +
> >> +         char *saveptr;
> >> +         char *token;
> >> +         token = strtok_r (optarg, ":", &saveptr);
> >> +         if (token == NULL)
> >> +           {
> >> +             printf ("error: invalid --nelem value\n");
> >> +             exit (EXIT_FAILURE);
> >> +           }
> >> +         do
> >> +           {
> >> +             if (nelem == max_nelem)
> >> +               {
> >> +                 printf ("error: invalid --nelem value (max elem)\n");
> >> +                 exit (EXIT_FAILURE);
> >> +               }
> >> +             elems[nelem++] = atol (token);
> >> +             token = strtok_r (saveptr, ":", &saveptr);
> >> +           } while (token != NULL);
> >> +        }
> >> +      break;
> >> +
> >> +      /* Handle the --seed option to use a different seed than a random one.
> >> +        The SEED used should be a uint64_t number.  */
> >> +      case OPT_SEED:
> >> +       {
> >> +         uint64_t value = strtoull (optarg, NULL, 0);
> >> +         if (errno == ERANGE)
> >> +           {
> >> +             printf ("error: seed should be a value in range of "
> >> +                     "[0, UINT64_MAX]\n");
> >> +             exit (EXIT_FAILURE);
> >> +           }
> >> +         seed = value;
> >> +         seed_set = true;
> >> +       }
> >> +    }
> >> +}
> >> +
> >> +#define CMDLINE_PROCESS cmdline_process_function
> >> +
> >> +static int
> >> +do_test (void)
> >> +{
> >> +  random_init ();
> >> +
> >> +
> >> +  json_ctx_t json_ctx;
> >> +
> >> +  json_init (&json_ctx, 0, stdout);
> >> +
> >> +  json_document_begin (&json_ctx);
> >> +  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
> >> +
> >> +  json_attr_object_begin (&json_ctx, "functions");
> >> +  json_attr_object_begin (&json_ctx, "qsort");
> >> +  json_attr_uint (&json_ctx, "seed", seed);
> >> +
> >> +  json_array_begin (&json_ctx, "results");
> >> +
> >> +  const size_t *welem = elems == NULL ? default_nelems : elems;
> >> +  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
> >> +
> >> +  struct test_t
> >> +  {
> >> +    size_t type_size;
> >> +    cmpfunc_t cmpfunc;
> >> +  };
> >> +  static const struct test_t tests[] =
> >> +  {
> >> +    { sizeof (uint32_t), uint32_t_cmp },
> >> +    { sizeof (uint64_t), uint64_t_cmp },
> >> +    /* Test swap with large elements.  */
> >> +    { 32,                uint32_t_cmp },
> >> +  };
> >> +
> >> +  const size_t inner_loop_iters = 16;
> >> +  for (const struct test_t *test = tests; test < array_end (tests); ++test)
> >> +    {
> >> +      for (const struct array_t *arraytype = arraytypes;
> >> +          arraytype < array_end (arraytypes);
> >> +          ++arraytype)
> >> +       {
> >> +         for (int k = 0; k < wnelem; k++)
> >> +           {
> >> +             json_element_object_begin (&json_ctx);
> >> +
> >> +             size_t arraysize = welem[k] * test->type_size;
> >> +
> >> +             json_attr_uint (&json_ctx, "nmemb", welem[k]);
> >> +             json_attr_uint (&json_ctx, "type_size", test->type_size);
> >> +             json_attr_string (&json_ctx, "property", arraytype->name);
> >> +
> >> +             void *base = create_array (welem[k], test->type_size,
> >> +                                        arraytype->type);
> >> +             void *work = xmalloc (arraysize);
> >> +
> >> +             timing_t total = 0;
> >> +
> >> +             for (int n = 0; n < inner_loop_iters; n++)
> >> +               {
> >> +                 memcpy (work, base, arraysize);
> >> +
> >> +                 timing_t start, end, diff;
> >> +                 TIMING_NOW (start);
> >> +                 qsort (work, welem[k], test->type_size, test->cmpfunc);
> >> +                 TIMING_NOW (end);
> >> +
> >> +                 TIMING_DIFF (diff, start, end);
> >> +                 TIMING_ACCUM (total, diff);
> >> +               }
> >> +
> > To avoid overhead from timing, maybe you could allocate / initialize
> > multiple buffers to test a given sort on. Then instead of having to
> > re-memcpy / time you just loop through the N initialized buffers
> > with one timing call. For the smaller sorts this may help decrease
> > the timing overhead.
>
> I think this makes sense, although it would require more memory. I changed
> it to:
>
> ---
>   for (const struct test_t *test = tests; test < array_end (tests); ++test)
>     {
>       for (size_t at = 0; at < array_length (arraytypes); at++)
>         {
>           for (int ne = 0; ne < wnelem; ne++)
>             {
>               void *inputs[INNER_LOOP_ITERS];
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 inputs[i] = create_array (welem[ne], test->type_size,
>                                           arraytypes[at].type);

I think it may pay to use a 1-dimensional array, i.e.

size_t qsort_bytes = welem[ne] * test->type_size;
void * inputs = malloc(qsort_bytes * NTESTS);
// initialize [inputs, inputs + qsort_bytes]
for (int i = 1; i < NTESTS; ++i)
    memcpy(inputs + qsort_bytes * i, inputs, qsort_bytes)

then increment inputs by 'qsort_bytes' in the test loop.

NTESTS may be fine being INNER_LOOP_ITERS but
if we are concerned about using too much memory
we could add another loop through INNER_LOOP_ITERS / NTESTS
and accumulate the time diffs.
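The flat single-allocation scheme suggested above can be sketched as follows. The helper names (`prepare_inputs`, `sort_all`, `NTESTS` as a stand-in for `INNER_LOOP_ITERS`) are illustrative, not from the patch; the initializer uses descending values as a stand-in for the benchmark's input generators:

```c
/* Sketch of the flat-buffer scheme: one allocation holds NTESTS copies
   of the same unsorted input, so the timed loop only advances a pointer
   and calls qsort.  Hypothetical names; not the final patch.  */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int
int_cmp (const void *a, const void *b)
{
  int ia = *(const int *) a;
  int ib = *(const int *) b;
  return (ia > ib) - (ia < ib);
}

/* Allocate NTESTS contiguous copies of one NELEM-element input.  */
static char *
prepare_inputs (size_t ntests, size_t nelem, size_t type_size)
{
  size_t qsort_bytes = nelem * type_size;
  char *inputs = malloc (qsort_bytes * ntests);
  if (inputs == NULL)
    return NULL;
  /* Initialize the first region (descending values as a stand-in for
     the benchmark's input generators).  */
  for (size_t i = 0; i < nelem; i++)
    ((int *) inputs)[i] = (int) (nelem - i);
  /* Replicate it into the remaining regions.  */
  for (size_t i = 1; i < ntests; i++)
    memcpy (inputs + qsort_bytes * i, inputs, qsort_bytes);
  return inputs;
}

/* The timed loop then just advances by qsort_bytes per iteration; the
   timing calls would bracket this whole function.  */
static void
sort_all (char *inputs, size_t ntests, size_t nelem, size_t type_size)
{
  for (size_t i = 0; i < ntests; i++)
    qsort (inputs + nelem * type_size * i, nelem, type_size, int_cmp);
}
```

The design point is that all memcpy work happens before the timer starts, so per-iteration timing overhead disappears at the cost of NTESTS times the memory.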



>
>               timing_t start, end, diff;
>               TIMING_NOW (start);
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 qsort (inputs[i], welem[ne], test->type_size, test->cmpfunc);
>               TIMING_NOW (end);
>
>               TIMING_DIFF (diff, start, end);
>
>               json_element_object_begin (&json_ctx);
>               json_attr_uint (&json_ctx, "nmemb", welem[ne]);
>               json_attr_uint (&json_ctx, "type_size", test->type_size);
>               json_attr_string (&json_ctx, "property", arraytypes[at].name);
>               uint64_t timing = (double) diff / (double) INNER_LOOP_ITERS;
>               json_attr_uint (&json_ctx, "timings", timing);
>               json_element_object_end (&json_ctx);
>
>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>                 free (inputs[i]);
>             }
>         }
>     }
> ---
  
Adhemerval Zanella Oct. 15, 2021, 5:19 p.m. UTC | #6
On 15/10/2021 13:39, Noah Goldstein wrote:
> On Fri, Oct 15, 2021 at 7:52 AM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 13/10/2021 00:19, Noah Goldstein wrote:
>>> On Fri, Sep 3, 2021 at 1:13 PM Adhemerval Zanella via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> [...]
>>> To avoid overhead from timing. Maybe you could allocate / initialize
>>> multiple buffers to test a given sort on. Then instead of having to
>>> re-memcpy / time you just loop through the N initialized buffers
>>> with one timing call. For the smaller sorts this may help decrease
>>> the timing overhead.
>>
>> I think this makes sense, although it would require more memory. I changed
>> it to:
>>
>> ---
>>   for (const struct test_t *test = tests; test < array_end (tests); ++test)
>>     {
>>       for (size_t at = 0; at < array_length (arraytypes); at++)
>>         {
>>           for (int ne = 0; ne < wnelem; ne++)
>>             {
>>               void *inputs[INNER_LOOP_ITERS];
>>               for (int i = 0; i < INNER_LOOP_ITERS; i++)
>>                 inputs[i] = create_array (welem[ne], test->type_size,
>>                                           arraytypes[at].type);
> 
> I think it may pay to use a 1-dimensional array. i.e
> 
> size_t qsort_bytes = welem[ne] * test->type_size;
> void * inputs = malloc(qsort_bytes * NTESTS);
> // initialize [inputs, inputs + qsort_bytes]
> for (int i = 1; i < NTESTS; ++i)
>     memcpy(inputs + qsort_bytes * i, inputs, qsort_bytes)
> 
> then increment inputs by 'qsort_bytes' in the test loop.
> 
> NTESTS may be fine being INNER_LOOP_ITERS but
> if we are concerned about using too much memory
> we could add another loop through INNER_LOOP_ITERS / NTESTS
> and accumulate the time diffs.

Right, I added a fill_array() (which is similar to create_array,
but does not allocate memory) and changed the code to:

  size_t qsort_size = welem[ne] * test->type_size;
  char *inputs = xmalloc (qsort_size * INNER_LOOP_ITERS);
  for (int i = 0; i < INNER_LOOP_ITERS; i++)
    fill_array (inputs + qsort_size * i, welem[ne],
                test->type_size, arraytypes[at].type);
  

Patch

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 1530939a8c..781cbca8af 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -76,7 +76,7 @@  LOCALES := en_US.UTF-8 tr_TR.UTF-8 cs_CZ.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8 \
 include ../gen-locales.mk
 endif
 
-stdlib-benchset := strtod
+stdlib-benchset := strtod qsort
 
 stdio-common-benchset := sprintf
 
diff --git a/benchtests/bench-qsort.c b/benchtests/bench-qsort.c
new file mode 100644
index 0000000000..905aa6d7b6
--- /dev/null
+++ b/benchtests/bench-qsort.c
@@ -0,0 +1,195 @@ 
+/* Measure qsort function.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <array_length.h>
+#include <bench-timing.h>
+#include <errno.h>
+#include <getopt.h>
+#include <json-lib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <stdlib/tst-qsort-common.c>
+
+/* Number of elements of determined type to be measured.  */
+static const size_t default_nelems[] =
+{
+  256/sizeof(size_t),       /* 64/128, which covers the most used element
+			       counts on a GCC build.  */
+  32768/sizeof(size_t),	    /* 4096/8192 to fit in a L1 with 32 KB.  */
+  262144/sizeof(size_t),    /* 32768/65536 to fit in a L2 with 256 KB.  */
+  4194304/sizeof(size_t),   /* 524288/1048576 to fit in a L3 with 4 MB.  */
+};
+
+#define OPT_NELEM 10000
+#define OPT_SEED  10001
+#define CMDLINE_OPTIONS \
+  { "nelem", required_argument, NULL, OPT_NELEM }, \
+  { "seed", required_argument, NULL, OPT_SEED },
+
+static const size_t max_nelem = 16;
+static size_t *elems = NULL;
+static size_t nelem = 0;
+static uint64_t seed = 0;
+static bool seed_set = false;
+
+static void
+cmdline_process_function (int c)
+{
+  switch (c)
+    {
+      /* Handle the --nelem option to run different sizes than DEFAULT_ELEM.
+	 The element sizes are passed with ':' as the delimiter; for
+	 instance --nelem 32:128:1024 will run 32, 128, and 1024 elements.  */
+      case OPT_NELEM:
+        {
+	  elems = xmalloc (max_nelem * sizeof (size_t));
+	  nelem = 0;
+
+	  char *saveptr;
+	  char *token;
+	  token = strtok_r (optarg, ":", &saveptr);
+	  if (token == NULL)
+	    {
+	      printf ("error: invalid --nelem value\n");
+	      exit (EXIT_FAILURE);
+	    }
+	  do
+	    {
+	      if (nelem == max_nelem)
+		{
+		  printf ("error: invalid --nelem value (max elem)\n");
+		  exit (EXIT_FAILURE);
+		}
+	      elems[nelem++] = atol (token);
+	      token = strtok_r (saveptr, ":", &saveptr);
+	    } while (token != NULL);
+        }
+      break;
+
+      /* Handle the --seed option to use a different seed than a random one.
+	 The SEED used should be a uint64_t number.  */
+      case OPT_SEED:
+	{
+	  uint64_t value = strtoull (optarg, NULL, 0);
+	  if (errno == ERANGE)
+	    {
+	      printf ("error: seed should be a value in range of "
+		      "[0, UINT64_MAX]\n");
+	      exit (EXIT_FAILURE);
+	    }
+	  seed = value;
+	  seed_set = true;
+	}
+    }
+}
+
+#define CMDLINE_PROCESS cmdline_process_function
+
+static int
+do_test (void)
+{
+  random_init ();
+
+
+  json_ctx_t json_ctx;
+
+  json_init (&json_ctx, 0, stdout);
+
+  json_document_begin (&json_ctx);
+  json_attr_string (&json_ctx, "timing_type", TIMING_TYPE);
+
+  json_attr_object_begin (&json_ctx, "functions");
+  json_attr_object_begin (&json_ctx, "qsort");
+  json_attr_uint (&json_ctx, "seed", seed);
+
+  json_array_begin (&json_ctx, "results");
+
+  const size_t *welem = elems == NULL ? default_nelems : elems;
+  const size_t wnelem = elems == NULL ? array_length (default_nelems) : nelem;
+
+  struct test_t
+  {
+    size_t type_size;
+    cmpfunc_t cmpfunc;
+  };
+  static const struct test_t tests[] =
+  {
+    { sizeof (uint32_t), uint32_t_cmp },
+    { sizeof (uint64_t), uint64_t_cmp },
+    /* Test swap with large elements.  */
+    { 32,                uint32_t_cmp },
+  };
+
+  const size_t inner_loop_iters = 16;
+  for (const struct test_t *test = tests; test < array_end (tests); ++test)
+    {
+      for (const struct array_t *arraytype = arraytypes;
+	   arraytype < array_end (arraytypes);
+	   ++arraytype)
+	{
+	  for (int k = 0; k < wnelem; k++)
+	    {
+	      json_element_object_begin (&json_ctx);
+
+	      size_t arraysize = welem[k] * test->type_size;
+
+	      json_attr_uint (&json_ctx, "nmemb", welem[k]);
+	      json_attr_uint (&json_ctx, "type_size", test->type_size);
+	      json_attr_string (&json_ctx, "property", arraytype->name);
+
+	      void *base = create_array (welem[k], test->type_size,
+					 arraytype->type);
+	      void *work = xmalloc (arraysize);
+
+	      timing_t total = 0;
+
+	      for (int n = 0; n < inner_loop_iters; n++)
+	        {
+		  memcpy (work, base, arraysize);
+
+	          timing_t start, end, diff;
+	          TIMING_NOW (start);
+	          qsort (work, welem[k], test->type_size, test->cmpfunc);
+	          TIMING_NOW (end);
+
+	          TIMING_DIFF (diff, start, end);
+	          TIMING_ACCUM (total, diff);
+	        }
+
+	     json_attr_uint (&json_ctx, "timings",
+			     (double) total / (double) inner_loop_iters);
+	     json_element_object_end (&json_ctx);
+
+	     free (base);
+	     free (work);
+	   }
+        }
+    }
+
+  json_array_end (&json_ctx);
+
+  json_attr_object_end (&json_ctx);
+  json_attr_object_end (&json_ctx);
+  json_document_end (&json_ctx);
+
+  return 0;
+}
+
+#define TIMEOUT 600
+#include <support/test-driver.c>