[v5,1/6] stdlib: Optimization qsort{_r} swap implementation

Message ID: 20230713132540.2854320-2-adhemerval.zanella@linaro.org
State: Superseded
Series: Use introsort for qsort

Checks

Context                                       Check    Description
redhat-pt-bot/TryBot-apply_patch              success  Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm  success  Testing passed

Commit Message

Adhemerval Zanella Netto July 13, 2023, 1:25 p.m. UTC
It optimizes by taking into consideration that the most common elements
are either 32 or 64 bits in size and that inputs are aligned to the word
boundary.  This is similar to the optimization done in lib/sort.c from
Linux.

This patch adds an optimized swap operation to qsort, based on the
previous msort one.  Instead of a byte-wise operation, three variants
are provided:

  1. Using uint32_t loads and stores.
  2. Using uint64_t loads and stores.
  3. Generic one with a temporary buffer and memcpy/mempcpy.

Options 1. and 2. are selected only if the base pointer and the element
size are suitably aligned for the required type.
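
For instance (an illustrative sketch, not part of the patch), sorting an
array of uint64_t takes the SWAP_WORDS_64 path, since both the base
pointer and the element size satisfy the 8-byte check:

  #include <stdint.h>
  #include <stdlib.h>

  static int
  cmp_u64 (const void *a, const void *b)
  {
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;
    return (x > y) - (x < y);
  }

  int
  main (void)
  {
    uint64_t v[] = { 3, 1, 2 };
    /* The array is naturally 8-byte aligned and the element size is 8,
       so the SWAP_WORDS_64 variant is selected internally.  */
    qsort (v, 3, sizeof v[0], cmp_u64);
    return 0;
  }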

Checked on x86_64-linux-gnu.
---
 stdlib/qsort.c | 109 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 91 insertions(+), 18 deletions(-)
  

Comments

Paul Eggert July 13, 2023, 10:17 p.m. UTC | #1
On 2023-07-13 06:25, Adhemerval Zanella via Libc-alpha wrote:

> +is_aligned (const void *base, size_t size, unsigned char align)

ALIGN should be of type size_t, not unsigned char. Although this 
shouldn't change the machine code, it's better documentation and more 
robust to future changes.


> +  if (is_aligned (pbase, size, 8))

Change "8" to "alignof (uint64_t)".

> +    swap_type = SWAP_WORDS_64;
> +  else if (is_aligned (pbase, size, 4))

Likewise, change "4" to "alignof (uint32_t)".

Or is the problem that alignof didn't work for you? If so, perhaps use 
__alignof__. Still, alignof is used elsewhere in glibc and is in C23, so 
it's better to use it here if it works.
  
Paul Eggert July 14, 2023, 1:01 a.m. UTC | #2
On 2023-07-13 15:17, Paul Eggert wrote:

>> +  if (is_aligned (pbase, size, 8))
> 
> Change "8" to "alignof (uint64_t)".


Oops, scratch that. The size must be a multiple of sizeof (uint64_t) and 
the alignment must be a multiple of alignof (uint64_t). Since sizeof 
(uint64_t) must be a multiple of alignof (uint64_t) what you wrote was 
safe (but confusing), whereas what I wrote was not.
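
To make the failure mode concrete (an editor's sketch; the i386 figures
are an assumption about that ABI):

  #include <stdalign.h>
  #include <stdint.h>

  /* sizeof (uint64_t) is always a multiple of alignof (uint64_t), which
     is why checking against 8 (the sizeof) was safe.  The converse does
     not hold: on an ABI such as the i386 psABI, alignof (uint64_t) is 4,
     so a 4-byte element would pass an alignof-based check and then
     underflow n -= 8 inside swap_words_64.  */
  _Static_assert (sizeof (uint64_t) % alignof (uint64_t) == 0,
                  "sizeof is a multiple of alignof");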

To help avoid confusion by future readers, I suggest rewriting the 
comments and parameters for is_aligned to be something like the 
following. This shouldn't change the machine code.

   /* If this function returns true, elements can be safely copied using
      word loads and stores.  Otherwise, it might not be safe.
      BASE (as an integer) must be a multiple of the word alignment.
      TOTAL_SIZE must be a multiple of WORDSIZE.
      Since WORDSIZE must be a multiple of the word alignment,
      and WORDSIZE is a power of two on all supported platforms,
      this function for speed merely checks that BASE and SIZE
      are both multiples of the word size.  */
   __attribute_const__ __always_inline static bool
   is_aligned (const void *base, size_t size, size_t wordsize)
   {
     return (((uintptr_t) base | size) & (wordsize - 1)) == 0;
   }
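
The single OR-and-mask covers both operands at once: a bit is set in
(BASE | SIZE) exactly when it is set in either value.  A small worked
check (editor's illustration):

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    /* Both 0x1008 and 24 are multiples of 8, so the OR keeps the low
       three bits clear.  */
    assert ((((uintptr_t) 0x1008 | 24) & 7) == 0);
    /* A size of 20 sets bit 2, so the combined test rejects the pair.  */
    assert ((((uintptr_t) 0x1008 | 20) & 7) == 4);
    return 0;
  }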
  
Adhemerval Zanella Netto July 14, 2023, 12:06 p.m. UTC | #3
On 13/07/23 22:01, Paul Eggert wrote:
> On 2023-07-13 15:17, Paul Eggert wrote:
> 
>>> +  if (is_aligned (pbase, size, 8))
>>
>> Change "8" to "alignof (uint64_t)".
> 
> 
> Oops, scratch that. The size must be a multiple of sizeof (uint64_t) and the alignment must be a multiple of alignof (uint64_t). Since sizeof (uint64_t) must be a multiple of alignof (uint64_t) what you wrote was safe (but confusing), whereas what I wrote was not.
> 
> To help avoid confusion by future readers, I suggest rewriting the comments and parameters for is_aligned to be something like the following. This shouldn't change the machine code.
> 
>   /* If this function returns true, elements can be safely copied using
>      word loads and stores.  Otherwise, it might not be safe.
>      BASE (as an integer) must be a multiple of the word alignment.
>      TOTAL_SIZE must be a multiple of WORDSIZE.

I think you meant SIZE here instead of TOTAL_SIZE.

>      Since WORDSIZE must be a multiple of the word alignment,
>      and WORDSIZE is a power of two on all supported platforms,
>      this function for speed merely checks that BASE and SIZE
>      are both multiples of the word size.  */
>   __attribute_const__ __always_inline static bool
>   is_aligned (const void *base, size_t size, size_t wordsize)
>   {
>     return (((uintptr_t) base | size) & (wordsize - 1)) == 0;
>   }
> 

Alright, I changed this in the new version.
  
Alexander Monakov July 17, 2023, 4:57 p.m. UTC | #4
On Thu, 13 Jul 2023, Adhemerval Zanella via Libc-alpha wrote:

> +/* Returns true if elements can be copied using word loads and stores.
> +   The SIZE and BASE must be a multiple of the ALIGN.  */
> +__attribute_const__ __always_inline static bool

Kernel folks sure love their double underscores, but plain
'static inline bool' should work just as well here.

I'd recommend properly crediting Linux lib/sort.c, since this commit
obviously originated as a direct copy of the code therein, not just
being inspired by what's there.

> +is_aligned (const void *base, size_t size, unsigned char align)
> +{
> +  return (((uintptr_t) base | size) & (align - 1)) == 0;
> +}
> +
> +static void
> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
> +{
> +  do
> +   {
> +     n -= 8;
> +     uint64_t t = *(uint64_t *)(a + n);
> +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
> +     *(uint64_t *)(b + n) = t;
> +   } while (n);
> +}

The undefined behavior resulting from an attempt to access an object of
a different type via a 'uint64_t *' is known to cause miscompilation in
practice:
https://www.spec.org/cpu2017/Docs/benchmarks/605.mcf_s.html#portability

I see that the new string routines do it properly via 

  typedef unsigned long int __attribute__ ((__may_alias__)) op_t;

Would be nice to avoid new instances of such UB here.

Alexander
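
A minimal sketch of how swap_words_64 might adopt that idiom (the
typedef name here is an assumption, not code from the thread):

  #include <stddef.h>
  #include <stdint.h>

  /* The may_alias attribute tells the compiler that loads and stores
     through this type may touch objects of any type, sidestepping the
     strict-aliasing undefined behavior.  */
  typedef uint64_t __attribute__ ((__may_alias__)) u64_alias_t;

  static void
  swap_words_64 (void * restrict a, void * restrict b, size_t n)
  {
    do
      {
        n -= 8;
        u64_alias_t t = *(u64_alias_t *) (a + n);
        *(u64_alias_t *) (a + n) = *(u64_alias_t *) (b + n);
        *(u64_alias_t *) (b + n) = t;
      }
    while (n);
  }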
  
Paul Eggert July 17, 2023, 5:29 p.m. UTC | #5
On 2023-07-17 09:57, Alexander Monakov wrote:
> I'd recommend to properly credit Linux lib/sort.c since obviously this
> commit originated as a direct copy of the code therein, not just being
> inspired by what's there.

What are the licensing issues here? The Linux kernel is GPLv2 only, 
whereas glibc/stdlib/qsort.c is LGPLv2.1+, so we can't simply take 
nontrivial source code from the Linux kernel.
  
Adhemerval Zanella Netto July 17, 2023, 6:07 p.m. UTC | #6
On 17/07/23 14:29, Paul Eggert wrote:
> On 2023-07-17 09:57, Alexander Monakov wrote:
>> I'd recommend to properly credit Linux lib/sort.c since obviously this
>> commit originated as a direct copy of the code therein, not just being
>> inspired by what's there.
> 
> What are the licensing issues here? The Linux kernel is GPLv2 only, whereas glibc/stdlib/qsort.c is LGPLv2.1+, so we can't simply take nontrivial source code from the Linux kernel.

I am not sure, in fact; my take was that the code is simple enough not
to raise this concern.  But if it were the case, I can reimplement it
differently to avoid this issue.
  
Paul Eggert July 17, 2023, 6:58 p.m. UTC | #7
On 2023-07-17 11:07, Adhemerval Zanella Netto wrote:
> I am not sure, in fact; my take was that the code is simple enough not
> to raise this concern.

The usual rule of thumb is that ten lines are trivial, and that more 
than that might be a concern. When it's a concern, you can't simply read 
the original code and rewrite it in a different way; you need to 
actually rewrite it from scratch, preferably without looking at the 
original code. It's a pain, admittedly.
  
Adhemerval Zanella Netto July 17, 2023, 7:06 p.m. UTC | #8
On 17/07/23 15:58, Paul Eggert wrote:
> On 2023-07-17 11:07, Adhemerval Zanella Netto wrote:
>> I am not sure, in fact; my take was that the code is simple enough not
>> to raise this concern.
> 
> The usual rule of thumb is that ten lines are trivial, and that more than that might be a concern. When it's a concern, you can't simply read the original code and rewrite it in a different way; you need to actually rewrite it from scratch, preferably without looking at the original code. It's a pain, admittedly.

Alright, I don't think it should be much trouble to reimplement a simple heapsort.
  
Adhemerval Zanella Netto July 18, 2023, 2:01 p.m. UTC | #9
On 17/07/23 13:57, Alexander Monakov wrote:
> 
> On Thu, 13 Jul 2023, Adhemerval Zanella via Libc-alpha wrote:
> 
>> +/* Returns true if elements can be copied using word loads and stores.
>> +   The SIZE and BASE must be a multiple of the ALIGN.  */
>> +__attribute_const__ __always_inline static bool
> 
> Kernel folks sure love their double underscores, but plain
> 'static inline bool' should work just as well here.
> 
> I'd recommend to properly credit Linux lib/sort.c since obviously this
> commit originated as a direct copy of the code therein, not just being
> inspired by what's there.

In fact our msort_with_tmp uses the exact same idea, but it is somewhat
convoluted because it adds a specific optimization for 'unsigned long'
and for large values it uses an in-place mempcpy.

> 
>> +is_aligned (const void *base, size_t size, unsigned char align)
>> +{
>> +  return (((uintptr_t) base | size) & (align - 1)) == 0;
>> +}
>> +
>> +static void
>> +swap_words_64 (void * restrict a, void * restrict b, size_t n)
>> +{
>> +  do
>> +   {
>> +     n -= 8;
>> +     uint64_t t = *(uint64_t *)(a + n);
>> +     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
>> +     *(uint64_t *)(b + n) = t;
>> +   } while (n);
>> +}
> 
> The undefined behavior resulting from attempt to access via an 'uint64_t *'
> something of a different type is known to cause miscompilation in practice:
> https://www.spec.org/cpu2017/Docs/benchmarks/605.mcf_s.html#portability
> 
> I see that the new string routines do it properly via 
> 
>   typedef unsigned long int __attribute__ ((__may_alias__)) op_t;
> 
> Would be nice to avoid new instances of such UB here.

Ack.
  
Adhemerval Zanella Netto July 18, 2023, 2:13 p.m. UTC | #10
On 18/07/23 11:01, Adhemerval Zanella Netto wrote:
> 
> 
> On 17/07/23 13:57, Alexander Monakov wrote:
>>
>> On Thu, 13 Jul 2023, Adhemerval Zanella via Libc-alpha wrote:
>>
>>> +/* Returns true if elements can be copied using word loads and stores.
>>> +   The SIZE and BASE must be a multiple of the ALIGN.  */
>>> +__attribute_const__ __always_inline static bool
>>
>> Kernel folks sure love their double underscores, but plain
>> 'static inline bool' should work just as well here.
>>
>> I'd recommend to properly credit Linux lib/sort.c since obviously this
>> commit originated as a direct copy of the code therein, not just being
>> inspired by what's there.
> 
> In fact our msort_with_tmp uses the exact same idea, but it is somewhat
> convoluted because it adds a specific optimization for 'unsigned long'
> and for large values it uses an in-place mempcpy.

And just to be clear about the license: the optimization is really a
common technique already used in our mergesort implementation, and the
Linux implementation is slightly different because it takes into
consideration whether the architecture supports fast unaligned accesses
(which glibc does not consider anymore since a9b3b770f596c); for 64-bit
targets it also handles 64-bit and 32-bit words differently (which I
don't think we should bother with, the compiler will know better how to
optimize it); and it does not have the large buffer optimization.  So I
don't think this falls under the idea that this code was copy/pasted.

For the heapsort, I really did use the Linux implementation, which I
have since replaced with a textbook one.
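
For reference, a textbook heapsort of the kind described might look like
the following (an editor's sketch under stated assumptions; the names
and structure are not the actual code from this series):

  #include <stddef.h>

  static void
  byte_swap (unsigned char *a, unsigned char *b, size_t size)
  {
    for (size_t i = 0; i < size; i++)
      {
        unsigned char t = a[i];
        a[i] = b[i];
        b[i] = t;
      }
  }

  /* Sift element K down until the max-heap property holds for the
     elements with indices 0..N (N is the last valid index).  */
  static void
  siftdown (unsigned char *base, size_t size, size_t k, size_t n,
            int (*cmp) (const void *, const void *))
  {
    while (2 * k + 1 <= n)
      {
        size_t j = 2 * k + 1;
        if (j < n && cmp (base + j * size, base + (j + 1) * size) < 0)
          j++;
        if (cmp (base + k * size, base + j * size) >= 0)
          break;
        byte_swap (base + k * size, base + j * size, size);
        k = j;
      }
  }

  static void
  heapsort_bytes (void *vbase, size_t n, size_t size,
                  int (*cmp) (const void *, const void *))
  {
    unsigned char *base = vbase;
    if (n < 2)
      return;
    /* Build the max-heap, then repeatedly move the maximum to the end
       and restore the heap over the shrunken prefix.  */
    for (size_t k = n / 2; k-- > 0; )
      siftdown (base, size, k, n - 1, cmp);
    for (size_t i = n - 1; i > 0; i--)
      {
        byte_swap (base, base + i * size, size);
        siftdown (base, size, 0, i - 1, cmp);
      }
  }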
  

Patch

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 728a0ed370..bf3326e4b9 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -23,20 +23,85 @@ 
 #include <limits.h>
 #include <stdlib.h>
 #include <string.h>
+#include <stdbool.h>
 
-/* Byte-wise swap two items of size SIZE. */
-#define SWAP(a, b, size)						      \
-  do									      \
-    {									      \
-      size_t __size = (size);						      \
-      char *__a = (a), *__b = (b);					      \
-      do								      \
-	{								      \
-	  char __tmp = *__a;						      \
-	  *__a++ = *__b;						      \
-	  *__b++ = __tmp;						      \
-	} while (--__size > 0);						      \
-    } while (0)
+/* Swap SIZE bytes between addresses A and B.  These helpers are provided
+   alongside the generic one as an optimization.  */
+
+enum swap_type_t
+  {
+    SWAP_WORDS_64,
+    SWAP_WORDS_32,
+    SWAP_BYTES
+  };
+
+/* Returns true if elements can be copied using word loads and stores.
+   The SIZE and BASE must be a multiple of the ALIGN.  */
+__attribute_const__ __always_inline static bool
+is_aligned (const void *base, size_t size, unsigned char align)
+{
+  return (((uintptr_t) base | size) & (align - 1)) == 0;
+}
+
+static void
+swap_words_64 (void * restrict a, void * restrict b, size_t n)
+{
+  do
+   {
+     n -= 8;
+     uint64_t t = *(uint64_t *)(a + n);
+     *(uint64_t *)(a + n) = *(uint64_t *)(b + n);
+     *(uint64_t *)(b + n) = t;
+   } while (n);
+}
+
+static void
+swap_words_32 (void * restrict a, void * restrict b, size_t n)
+{
+  do
+   {
+     n -= 4;
+     uint32_t t = *(uint32_t *)(a + n);
+     *(uint32_t *)(a + n) = *(uint32_t *)(b + n);
+     *(uint32_t *)(b + n) = t;
+   } while (n);
+}
+
+static void
+swap_bytes (void * restrict a, void * restrict b, size_t n)
+{
+  /* Use multiple small memcpys with constant size to enable inlining
+     on most targets.  */
+  enum { SWAP_GENERIC_SIZE = 32 };
+  unsigned char tmp[SWAP_GENERIC_SIZE];
+  while (n > SWAP_GENERIC_SIZE)
+    {
+      memcpy (tmp, a, SWAP_GENERIC_SIZE);
+      a = __mempcpy (a, b, SWAP_GENERIC_SIZE);
+      b = __mempcpy (b, tmp, SWAP_GENERIC_SIZE);
+      n -= SWAP_GENERIC_SIZE;
+    }
+  while (n > 0)
+    {
+      unsigned char t = ((unsigned char *)a)[--n];
+      ((unsigned char *)a)[n] = ((unsigned char *)b)[n];
+      ((unsigned char *)b)[n] = t;
+    }
+}
+
+/* Replace the indirect call with a series of if statements.  It should help
+   the branch predictor.  */
+static void
+do_swap (void * restrict a, void * restrict b, size_t size,
+	 enum swap_type_t swap_func)
+{
+  if (swap_func == SWAP_WORDS_64)
+    swap_words_64 (a, b, size);
+  else if (swap_func == SWAP_WORDS_32)
+    swap_words_32 (a, b, size);
+  else
+    swap_bytes (a, b, size);
+}
 
 /* Discontinue quicksort algorithm when partition gets below this size.
    This particular magic number was chosen to work best on a Sun 4/260. */
@@ -96,6 +161,14 @@  _quicksort (void *const pbase, size_t total_elems, size_t size,
     /* Avoid lossage with unsigned arithmetic below.  */
     return;
 
+  enum swap_type_t swap_type;
+  if (is_aligned (pbase, size, 8))
+    swap_type = SWAP_WORDS_64;
+  else if (is_aligned (pbase, size, 4))
+    swap_type = SWAP_WORDS_32;
+  else
+    swap_type = SWAP_BYTES;
+
   if (total_elems > MAX_THRESH)
     {
       char *lo = base_ptr;
@@ -119,13 +192,13 @@  _quicksort (void *const pbase, size_t total_elems, size_t size,
 	  char *mid = lo + size * ((hi - lo) / size >> 1);
 
 	  if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-	    SWAP (mid, lo, size);
+	    do_swap (mid, lo, size, swap_type);
 	  if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
-	    SWAP (mid, hi, size);
+	    do_swap (mid, hi, size, swap_type);
 	  else
 	    goto jump_over;
 	  if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-	    SWAP (mid, lo, size);
+	    do_swap (mid, lo, size, swap_type);
 	jump_over:;
 
 	  left_ptr  = lo + size;
@@ -144,7 +217,7 @@  _quicksort (void *const pbase, size_t total_elems, size_t size,
 
 	      if (left_ptr < right_ptr)
 		{
-		  SWAP (left_ptr, right_ptr, size);
+		  do_swap (left_ptr, right_ptr, size, swap_type);
 		  if (mid == left_ptr)
 		    mid = right_ptr;
 		  else if (mid == right_ptr)
@@ -216,7 +289,7 @@  _quicksort (void *const pbase, size_t total_elems, size_t size,
         tmp_ptr = run_ptr;
 
     if (tmp_ptr != base_ptr)
-      SWAP (tmp_ptr, base_ptr, size);
+      do_swap (tmp_ptr, base_ptr, size, swap_type);
 
     /* Insertion sort, running from left-hand-side up to right-hand-side.  */
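
As a closing usage note (an editor's sketch, not part of the patch): an
element size that is not a multiple of 4 fails both is_aligned checks,
so such a qsort call exercises the SWAP_BYTES fallback.

  #include <stdlib.h>
  #include <string.h>

  struct rec { char key[3]; };   /* size 3: not a multiple of 4 or 8 */

  static int
  cmp_rec (const void *a, const void *b)
  {
    return memcmp (a, b, sizeof (struct rec));
  }

  int
  main (void)
  {
    struct rec v[] = { { "bb" }, { "aa" }, { "cc" } };
    /* size == 3, so swap_type resolves to SWAP_BYTES.  */
    qsort (v, 3, sizeof v[0], cmp_rec);
    return 0;
  }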