stdlib: Optimize number of calls to comparison function

Message ID 20231202214839.2634493-1-visitorckw@gmail.com
State Superseded
Headers
Series stdlib: Optimize number of calls to comparison function |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed

Commit Message

Kuan-Wei Chiu Dec. 2, 2023, 9:48 p.m. UTC
  The current heapsort implementation in the siftdown function follows
the standard textbook version, necessitating two comparisons at each
level. Transitioning to the Bottom-up heapsort version allows us to
decrease the required comparisons to just one per level. On average,
this modification significantly reduces the comparison count from
2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37*n + o(n).

Refs:
  BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average,
  QUICKSORT (if n is not very small)
  Ingo Wegener
  Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993
  https://doi.org/10.1016/0304-3975(93)90364-Y

Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
 stdlib/qsort.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)
  

Comments

Florian Weimer Dec. 4, 2023, 8:20 a.m. UTC | #1
* Kuan-Wei Chiu:

> The current heapsort implementation in the siftdown function follows
> the standard textbook version, necessitating two comparisons at each
> level. Transitioning to the Bottom-up heapsort version allows us to
> decrease the required comparisons to just one per level. On average,
> this modification significantly reduces the comparison count from
> 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37*n + o(n).

I think the factor in stdlib/tst-qsort5.c needs to be adjusted:

  /* This is an arbitrary factor which is true for the current
     implementation across a wide range of sizes.  */
  TEST_VERIFY (factor <= 4.5);

On the other hand, the heapsort code is not really expected to run at
all in the current implementation because median-of-three quicksort
usually avoids degenerating behavior.  If this heapsort variant uses a
number of comparisons that is competitive to quicksort, maybe we should
use it instead?  And use insertion sort for short arrays only.

Thanks,
Florian
  
Kuan-Wei Chiu Dec. 4, 2023, 6:31 p.m. UTC | #2
On Mon, Dec 04, 2023 at 09:20:50AM +0100, Florian Weimer wrote:
> I think the factor in stdlib/tst-qsort5.c needs to be adjusted:
> 
>   /* This is an arbitrary factor which is true for the current
>      implementation across a wide range of sizes.  */
>   TEST_VERIFY (factor <= 4.5);
>

It seems that the factor can be adjusted to around 3.5. I can send
another patch for this adjustment or resend it as a patch series.

> On the other hand, the heapsort code is not really expected to run at
> all in the current implementation because median-of-three quicksort
> usually avoids degenerating behavior.  If this heapsort variant uses a
> number of comparisons that is competitive to quicksort, maybe we should
> use it instead?  And use insertion sort for short arrays only.
> 

This patch currently reduces the comparison count only in scenarios
where deep recursion results in a fallback to heapsort. According to
the referenced paper, beyond n >= 16000, the comparison count of
bottom-up heapsort is lower than that of median-of-three quicksort.
However, my concern is that quicksort typically exhibits better
locality of reference, making it more cache-friendly, while heapsort
lacks this advantage.


Best regards,
Kuan-Wei Chiu
  
Florian Weimer Dec. 5, 2023, 10:44 a.m. UTC | #3
* Kuan-Wei Chiu:

> On Mon, Dec 04, 2023 at 09:20:50AM +0100, Florian Weimer wrote:
>> I think the factor in stdlib/tst-qsort5.c needs to be adjusted:
>> 
>>   /* This is an arbitrary factor which is true for the current
>>      implementation across a wide range of sizes.  */
>>   TEST_VERIFY (factor <= 4.5);
>>
>
> It seems that the factor can be adjusted to around 3.5. I can send
> another patch for this adjustment or resend it as a patch series.

Based on the paper, I would have expected a greater reduction.

>> On the other hand, the heapsort code is not really expected to run at
>> all in the current implementation because median-of-three quicksort
>> usually avoids degenerating behavior.  If this heapsort variant uses a
>> number of comparisons that is competitive to quicksort, maybe we should
>> use it instead?  And use insertion sort for short arrays only.
>> 
>
> This patch currently reduces the comparison count only in scenarios
> where deep recursion results in a fallback to heapsort. According to
> the referenced paper, beyond n >= 16000, the comparison count of
> bottom-up heapsort is lower than that of median-of-three quicksort.
> However, my concern is that quicksort typically exhibits better
> locality of reference, making it more cache-friendly, while heapsort
> lacks this advantage.

Yes, the performance numbers I posted suggest that something like that
is at play here.  Even though heapsort with your patch performs fewer
comparisons to most other implementations we have floating around
(except the historic merge sort), it is still slower than most.

Thanks,
Florian
  
Kuan-Wei Chiu Dec. 5, 2023, 8 p.m. UTC | #4
On Tue, Dec 05, 2023 at 11:44:03AM +0100, Florian Weimer wrote:
> * Kuan-Wei Chiu:
> 
> > On Mon, Dec 04, 2023 at 09:20:50AM +0100, Florian Weimer wrote:
> >> I think the factor in stdlib/tst-qsort5.c needs to be adjusted:
> >> 
> >>   /* This is an arbitrary factor which is true for the current
> >>      implementation across a wide range of sizes.  */
> >>   TEST_VERIFY (factor <= 4.5);
> >>
> >
> > It seems that the factor can be adjusted to around 3.5. I can send
> > another patch for this adjustment or resend it as a patch series.
> 
> Based on the paper, I would have expected a greater reduction.
> 

The adjustment factor of 3.5 is derived from data obtained using the
tst-qsort5 test. Upon examining the number of elements sorted by
heapsort during the test and the corresponding comparison counts, the
comparison count is approximately nlog2(n) + 0.83*n. This is higher
than the analysis in the paper, which suggests nlog2(n) + d(n)*n, where
d(n) belongs to the range [0.34, 0.39]. I believe this discrepancy is
due to the paper's focus on the average comparison count in random
scenarios, whereas here, we are intentionally creating adversarial
permutations, making it not purely random.

Additionally, in the performance numbers you posted, d(n) is
approximately equal to 0.34, aligning with the analysis in the paper.

Here is the test data:
heapsort: number of comparisons, number of elements

Before applying the patch (commit 8e755f5bc8):

info: length 100: 1832 comparisons ~ 2.438387 * n * log (n)         heapsort: 662, 76
info: length 1000: 34020 comparisons ~ 3.260275 * n * log (n)       heapsort: 15324, 964
info: length 10000: 494661 comparisons ~ 3.772689 * n * log (n)     heapsort: 225309, 9948
info: length 20000: 1071780 comparisons ~ 3.859536 * n * log (n)    heapsort: 492534, 19944
info: length 30000: 1644872 comparisons ~ 3.830672 * n * log (n)    heapsort: 775626, 29944
info: length 40000: 2305025 comparisons ~ 3.933328 * n * log (n)    heapsort: 1065893, 39940
info: length 50000: 2912911 comparisons ~ 3.913976 * n * log (n)    heapsort: 1363779, 49940
info: length 60000: 3529897 comparisons ~ 3.902134 * n * log (n)    heapsort: 1670765, 59940
info: length 70000: 4285743 comparisons ~ 4.009278 * n * log (n)    heapsort: 1976733, 69936
info: length 80000: 4929859 comparisons ~ 3.998699 * n * log (n)    heapsort: 2290849, 79936
info: length 90000: 5575320 comparisons ~ 3.987933 * n * log (n)    heapsort: 2606310, 89936
info: length 100000: 6223572 comparisons ~ 3.978286 * n * log (n)   heapsort: 2924562, 99936
info: length 110000: 6881296 comparisons ~ 3.973322 * n * log (n)   heapsort: 3252286, 109936
info: length 120000: 7537050 comparisons ~ 3.966365 * n * log (n)   heapsort: 3578040, 119936
info: length 130000: 8191258 comparisons ~ 3.958248 * n * log (n)   heapsort: 3902248, 129936
info: length 140000: 9138135 comparisons ~ 4.072406 * n * log (n)   heapsort: 4239255, 139932
info: length 150000: 9822309 comparisons ~ 4.067167 * n * log (n)   heapsort: 4573429, 149932

After applying the patch:

info: length 100: 1690 comparisons ~ 2.273802 * n * log (n)         heapsort: 520, 76
info: length 1000: 28993 comparisons ~ 2.821755 * n * log (n)       heapsort: 10297, 964
info: length 10000: 409589 comparisons ~ 3.169480 * n * log (n)     heapsort: 140237, 9948
info: length 20000: 879212 comparisons ~ 3.211902 * n * log (n)     heapsort: 299966, 19944
info: length 30000: 1336536 comparisons ~ 3.158407 * n * log (n)    heapsort: 467290, 29944
info: length 40000: 1882150 comparisons ~ 3.256782 * n * log (n)    heapsort: 643018, 39940
info: length 50000: 2369603 comparisons ~ 3.228731 * n * log (n)    heapsort: 820471, 49940
info: length 60000: 2859511 comparisons ~ 3.205839 * n * log (n)    heapsort: 1000379, 59940
info: length 70000: 3492146 comparisons ~ 3.311278 * n * log (n)    heapsort: 1183136, 69936
info: length 80000: 4006690 comparisons ~ 3.294222 * n * log (n)    heapsort: 1367680, 79936
info: length 90000: 4523504 comparisons ~ 3.279727 * n * log (n)    heapsort: 1554494, 89936
info: length 100000: 5041697 comparisons ~ 3.266775 * n * log (n)   heapsort: 1742687, 99936
info: length 110000: 5562597 comparisons ~ 3.255888 * n * log (n)   heapsort: 1933587, 109936
info: length 120000: 6083545 comparisons ~ 3.245368 * n * log (n)   heapsort: 2124535, 119936
info: length 130000: 6604920 comparisons ~ 3.235434 * n * log (n)   heapsort: 2315910, 129936
info: length 140000: 7407274 comparisons ~ 3.344872 * n * log (n)   heapsort: 2508394, 139932
info: length 150000: 7951873 comparisons ~ 3.336444 * n * log (n)   heapsort: 2702993, 149932


Best regards,
Kuan-Wei Chiu
>
  
Zack Weinberg Dec. 5, 2023, 8:35 p.m. UTC | #5
On Mon, Dec 4, 2023, at 1:31 PM, Kuan-Wei Chiu wrote:
> On Mon, Dec 04, 2023 at 09:20:50AM +0100, Florian Weimer wrote:
>> I think the factor in stdlib/tst-qsort5.c needs to be adjusted:
>> 
>>   /* This is an arbitrary factor which is true for the current
>>      implementation across a wide range of sizes.  */
>>   TEST_VERIFY (factor <= 4.5);
>
> It seems that the factor can be adjusted to around 3.5. I can send
> another patch for this adjustment or resend it as a patch series.

Before you go any further with this patch series I think we need to
decide whether our backward compatibility constraints mean our qsort
has to be a stable sort.  If so, we should make it *always* be a
stable sort and write that into the documentation, and that would
mean junking the entire heapsort implementation.

I do not have a strong opinion either way, but I do think it should
either *always* or *never* be stable, i.e. not the previous state of
"whether you get a stable sort depends on the size of the input and
various other undocumented factors."  Also, I think that the
behavior should not depend on which version of glibc a program
was compiled against.

zw
  
Florian Weimer Dec. 6, 2023, 10:21 a.m. UTC | #6
* Zack Weinberg:

> On Mon, Dec 4, 2023, at 1:31 PM, Kuan-Wei Chiu wrote:
>> On Mon, Dec 04, 2023 at 09:20:50AM +0100, Florian Weimer wrote:
>>> I think the factor in stdlib/tst-qsort5.c needs to be adjusted:
>>> 
>>>   /* This is an arbitrary factor which is true for the current
>>>      implementation across a wide range of sizes.  */
>>>   TEST_VERIFY (factor <= 4.5);
>>
>> It seems that the factor can be adjusted to around 3.5. I can send
>> another patch for this adjustment or resend it as a patch series.
>
> Before you go any further with this patch series I think we need to
> decide whether our backward compatibility constraints mean our qsort
> has to be a stable sort.  If so, we should make it *always* be a
> stable sort and write that into the documentation, and that would
> mean junking the entire heapsort implementation.

That makes sense, although there might not exist an in-place sorting
algorithm that takes constant extra space regardless of element count
and element size and has reasonable performance.  Maybe we could say,
“stable if element size is less than 1024 bytes” or something like that.

What I'm concerned about is that with the current implementation, we
take a performance hit *and* have compatibility problems.  The
compatibility problems would be easier to justify if we actually made
things faster.  Not calling malloc internally is unlikely to be
compelling to programmers.

Thanks,
Florian
  
Adhemerval Zanella Netto Dec. 6, 2023, 12:51 p.m. UTC | #7
On 06/12/23 07:21, Florian Weimer wrote:
> * Zack Weinberg:
> 
>> On Mon, Dec 4, 2023, at 1:31 PM, Kuan-Wei Chiu wrote:
>>> On Mon, Dec 04, 2023 at 09:20:50AM +0100, Florian Weimer wrote:
>>>> I think the factor in stdlib/tst-qsort5.c needs to be adjusted:
>>>>
>>>>   /* This is an arbitrary factor which is true for the current
>>>>      implementation across a wide range of sizes.  */
>>>>   TEST_VERIFY (factor <= 4.5);
>>>
>>> It seems that the factor can be adjusted to around 3.5. I can send
>>> another patch for this adjustment or resend it as a patch series.
>>
>> Before you go any further with this patch series I think we need to
>> decide whether our backward compatibility constraints mean our qsort
>> has to be a stable sort.  If so, we should make it *always* be a
>> stable sort and write that into the documentation, and that would
>> mean junking the entire heapsort implementation.
> 
> That makes sense, although there might not exist an in-place sorting
> algorithm that takes constant extra space regardless of element count
> and element size and has reasonable performance.  Maybe we could say,
> “stable if element size is less than 1024 bytes” or something like that.
> 
> What I'm concerned about is that with the current implementation, we
> take a performance hit *and* have compatibility problems.  The
> compatibility problems would be easier to justify if we actually made
> things faster.  Not calling malloc internally is unlikely to be
> compelling to programmers.

My initial intention was to remove internal multiple sorting strategies
that has different semantics and that might eventually trigger corner
cases issues (the stable default mergesort versus the fallback quicksort),
and fix the quicksort fallback that had the long standing issue of O(n^2) 
on adversarial inputs.

However it seems that glibc qsort now become an instance on Hyrum's law,
where it does not really matter that neither POSIX nor C standard promised 
sorting stability.

So I presume it would be better to keep with mergesort for all input sizes,
along with the limited stack buffer (so there is no failure point up a 
certain input number/size); and fallback to heapsort on memory allocation
failure.  I will prepare a patch to restore it.

Other system does provide mergesort as a different symbol [1], which also
returns a failure is case memory can not be allocated.  But I don't think
it would be an improvement in this situation (no one will really adapt
their code to use mergesort if a stable allocation is required).  I also
though about adding a tunable to enable mergesort on qsort, but it would
ending up only add more complexity and required more testing.

[1] https://man.freebsd.org/cgi/man.cgi?query=mergesort&sektion=3
  
Zack Weinberg Dec. 17, 2023, 3:42 p.m. UTC | #8
On Wed, Dec 6, 2023, at 7:51 AM, Adhemerval Zanella Netto wrote:
> On 06/12/23 07:21, Florian Weimer wrote:
>> * Zack Weinberg:
>>>
>>> Before you go any further with this patch series I think we need to
>>> decide whether our backward compatibility constraints mean our qsort
>>> has to be a stable sort.  If so, we should make it *always* be a
>>> stable sort and write that into the documentation, and that would
>>> mean junking the entire heapsort implementation.
>>
>> That makes sense, although there might not exist an in-place sorting
>> algorithm that takes constant extra space regardless of element
>> count and element size and has reasonable performance.  Maybe we
>> could say, “stable if element size is less than 1024 bytes” or
>> something like that.

It occurred to me late last night that *any* in-place sorting algorithm
that operates on an array, can be made stable with no additional storage
cost, by using each element's address as a tiebreaker for the comparison
function.  It *doesn't* need to be the element's original address -- all
that matters is that for every pair of elements that compare equal, the
sorting algorithm preserves the < relation for their addresses across
all swaps, which should happen naturally if address-< is used as a
tiebreaker for comparisons.

(That said, I still think we should give that "block exchange mergesort"
algorithm I dug up the other week a try.  I might have time early in
January to put a patch set together.)

zw
  
Florian Weimer Dec. 17, 2023, 3:55 p.m. UTC | #9
* Zack Weinberg:

> It occurred to me late last night that *any* in-place sorting algorithm
> that operates on an array, can be made stable with no additional storage
> cost, by using each element's address as a tiebreaker for the comparison
> function.  It *doesn't* need to be the element's original address -- all
> that matters is that for every pair of elements that compare equal, the
> sorting algorithm preserves the < relation for their addresses across
> all swaps, which should happen naturally if address-< is used as a
> tiebreaker for comparisons.

It's not as simple as that because we might have A < B, so we can move A
to a lower address than B, but at the same time time, we must make sure
that we do not move it past an element C with A == C, otherwise the
result of the tie-breaking address comparison changes.  I haven't really
thought about it, but I suspect it's possible to do for some of the
quadratic algorithms, but quicksort and heapsort—not so much.

Thanks,
Florian
  
Kuan-Wei Chiu Dec. 17, 2023, 4:47 p.m. UTC | #10
On Sun, Dec 17, 2023 at 10:42:17AM -0500, Zack Weinberg wrote:
> It occurred to me late last night that *any* in-place sorting algorithm
> that operates on an array, can be made stable with no additional storage
> cost, by using each element's address as a tiebreaker for the comparison
> function.  It *doesn't* need to be the element's original address -- all
> that matters is that for every pair of elements that compare equal, the
> sorting algorithm preserves the < relation for their addresses across
> all swaps, which should happen naturally if address-< is used as a
> tiebreaker for comparisons.
> 
> (That said, I still think we should give that "block exchange mergesort"
> algorithm I dug up the other week a try.  I might have time early in
> January to put a patch set together.)

Additionally, if we're open to exploring non-in-place options, I think
Timsort might be worth considering. It's widely adopted, including in
Python and Android, serving as a stable hybrid sorting algorithm.
Timsort is an adaptive sort algorithm that operates in O(n) time when
the input is already sorted and maintains a worst-case time complexity
of O(n log n), similar to mergesort. I'd be happy to prepare a patch
for Timsort next month if needed.

Link: https://github.com/python/cpython/blob/main/Objects/listsort.txt

Best regards,
Kuan-Wei Chiu
  
Florian Weimer Dec. 17, 2023, 6:02 p.m. UTC | #11
* Kuan-Wei Chiu:

> On Sun, Dec 17, 2023 at 10:42:17AM -0500, Zack Weinberg wrote:
>> It occurred to me late last night that *any* in-place sorting algorithm
>> that operates on an array, can be made stable with no additional storage
>> cost, by using each element's address as a tiebreaker for the comparison
>> function.  It *doesn't* need to be the element's original address -- all
>> that matters is that for every pair of elements that compare equal, the
>> sorting algorithm preserves the < relation for their addresses across
>> all swaps, which should happen naturally if address-< is used as a
>> tiebreaker for comparisons.
>> 
>> (That said, I still think we should give that "block exchange mergesort"
>> algorithm I dug up the other week a try.  I might have time early in
>> January to put a patch set together.)
>
> Additionally, if we're open to exploring non-in-place options, I think
> Timsort might be worth considering. It's widely adopted, including in
> Python and Android, serving as a stable hybrid sorting algorithm.
> Timsort is an adaptive sort algorithm that operates in O(n) time when
> the input is already sorted and maintains a worst-case time complexity
> of O(n log n), similar to mergesort. I'd be happy to prepare a patch
> for Timsort next month if needed.

I think we really want to get rid of that malloc in qsort.  This means
that we need something in-place-ish: we can throw a fair amount of stack
space at the problem (probably quite a bit more of the circa 1024 bytes
we use today), but not something that scales linearly with the size of
the input array.  We cannot not even even assume space for just one
temporary object on the stack, I think.

Thanks,
Florian
  
Kuan-Wei Chiu Feb. 16, 2024, 7:08 a.m. UTC | #12
On Sun, Dec 03, 2023 at 05:48:39AM +0800, Kuan-Wei Chiu wrote:
> The current heapsort implementation in the siftdown function follows
> the standard textbook version, necessitating two comparisons at each
> level. Transitioning to the Bottom-up heapsort version allows us to
> decrease the required comparisons to just one per level. On average,
> this modification significantly reduces the comparison count from
> 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37*n + o(n).
> 
> Refs:
>   BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average,
>   QUICKSORT (if n is not very small)
>   Ingo Wegener
>   Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993
>   https://doi.org/10.1016/0304-3975(93)90364-Y
> 
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> ---
>  stdlib/qsort.c | 36 ++++++++++++++++++------------------
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> index be01fb5598..f5babef150 100644
> --- a/stdlib/qsort.c
> +++ b/stdlib/qsort.c
> @@ -132,26 +132,26 @@ static inline void
>  siftdown (void *base, size_t size, size_t k, size_t n,
>  	  enum swap_type_t swap_type, __compar_d_fn_t cmp, void *arg)
>  {
> -  /* There can only be a heap condition violation if there are
> -     children.  */
> -  while (2 * k + 1 <= n)
> -    {
> -      /* Left child.  */
> -      size_t j = 2 * k + 1;
> -      /* If the right child is larger, use it.  */
> -      if (j < n && cmp (base + (j * size), base + ((j + 1) * size), arg) < 0)
> -	j++;
> -
> -      /* If k is already >= to its children, we are done.  */
> -      if (j == k || cmp (base + (k * size), base + (j * size), arg) >= 0)
> -	break;
> +  size_t i, j;
> +
> +  /* Find the sift-down path all the way to the leaves. */
> +  for (i = k; j = 2 * i + 1, j + 1 <= n;)
> +    i = cmp (base + j * size, base + (j + 1) * size, arg) >= 0 ? j : (j + 1);
>  
> -      /* Heal the violation.  */
> -      do_swap (base + (size * j), base + (k * size), size, swap_type);
> +  /* Special case for the last leaf with no sibling. */
> +  if (j == n)
> +    i = j;
>  
> -      /* Swapping with j may have introduced a violation at j.  Fix
> -	 it in the next loop iteration.  */
> -      k = j;
> +  /* Backtrack to the correct location. */
> +  while (i != k && cmp (base + k * size, base + i * size, arg) > 0)
> +    i = (i - 1) / 2;
> +
> +  /* Shift the element into its correct place. */
> +  j = i;
> +  while (i != k)
> +    {
> +      i = (i - 1) / 2;
> +      do_swap (base + i * size, base + j * size, size, swap_type);
>      }
>  }
>  
> -- 
> 2.25.1
>

Since we still retain heapsort as a fallback for mergesort, should we
reconsider applying this patch?

Thanks,
Kuan-Wei
  
Adhemerval Zanella Netto March 27, 2024, 7:45 p.m. UTC | #13
On 16/02/24 04:08, Kuan-Wei Chiu wrote:
> On Sun, Dec 03, 2023 at 05:48:39AM +0800, Kuan-Wei Chiu wrote:
>> The current heapsort implementation in the siftdown function follows
>> the standard textbook version, necessitating two comparisons at each
>> level. Transitioning to the Bottom-up heapsort version allows us to
>> decrease the required comparisons to just one per level. On average,
>> this modification significantly reduces the comparison count from
>> 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37*n + o(n).
>>
>> Refs:
>>   BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average,
>>   QUICKSORT (if n is not very small)
>>   Ingo Wegener
>>   Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993
>>   https://doi.org/10.1016/0304-3975(93)90364-Y
>>
>> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
>> ---
>>  stdlib/qsort.c | 36 ++++++++++++++++++------------------
>>  1 file changed, 18 insertions(+), 18 deletions(-)
>>
>> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>> index be01fb5598..f5babef150 100644
>> --- a/stdlib/qsort.c
>> +++ b/stdlib/qsort.c
>> @@ -132,26 +132,26 @@ static inline void
>>  siftdown (void *base, size_t size, size_t k, size_t n,
>>  	  enum swap_type_t swap_type, __compar_d_fn_t cmp, void *arg)
>>  {
>> -  /* There can only be a heap condition violation if there are
>> -     children.  */
>> -  while (2 * k + 1 <= n)
>> -    {
>> -      /* Left child.  */
>> -      size_t j = 2 * k + 1;
>> -      /* If the right child is larger, use it.  */
>> -      if (j < n && cmp (base + (j * size), base + ((j + 1) * size), arg) < 0)
>> -	j++;
>> -
>> -      /* If k is already >= to its children, we are done.  */
>> -      if (j == k || cmp (base + (k * size), base + (j * size), arg) >= 0)
>> -	break;
>> +  size_t i, j;
>> +
>> +  /* Find the sift-down path all the way to the leaves. */
>> +  for (i = k; j = 2 * i + 1, j + 1 <= n;)
>> +    i = cmp (base + j * size, base + (j + 1) * size, arg) >= 0 ? j : (j + 1);
>>  
>> -      /* Heal the violation.  */
>> -      do_swap (base + (size * j), base + (k * size), size, swap_type);
>> +  /* Special case for the last leaf with no sibling. */
>> +  if (j == n)
>> +    i = j;
>>  
>> -      /* Swapping with j may have introduced a violation at j.  Fix
>> -	 it in the next loop iteration.  */
>> -      k = j;
>> +  /* Backtrack to the correct location. */
>> +  while (i != k && cmp (base + k * size, base + i * size, arg) > 0)
>> +    i = (i - 1) / 2;
>> +
>> +  /* Shift the element into its correct place. */
>> +  j = i;
>> +  while (i != k)
>> +    {
>> +      i = (i - 1) / 2;
>> +      do_swap (base + i * size, base + j * size, size, swap_type);
>>      }
>>  }
>>  
>> -- 
>> 2.25.1
>>
> 
> Since we still retain heapsort as a fallback for mergesort, should we
> reconsider applying this patch?
> 
> Thanks,
> Kuan-Wei

Yes, I think it is work and it looks great to me.
  
Kuan-Wei Chiu March 27, 2024, 7:59 p.m. UTC | #14
On Wed, Mar 27, 2024 at 04:45:35PM -0300, Adhemerval Zanella Netto wrote:
> 
> 
> On 16/02/24 04:08, Kuan-Wei Chiu wrote:
> > On Sun, Dec 03, 2023 at 05:48:39AM +0800, Kuan-Wei Chiu wrote:
> >> The current heapsort implementation in the siftdown function follows
> >> the standard textbook version, necessitating two comparisons at each
> >> level. Transitioning to the Bottom-up heapsort version allows us to
> >> decrease the required comparisons to just one per level. On average,
> >> this modification significantly reduces the comparison count from
> >> 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37*n + o(n).
> >>
> >> Refs:
> >>   BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average,
> >>   QUICKSORT (if n is not very small)
> >>   Ingo Wegener
> >>   Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993
> >>   https://doi.org/10.1016/0304-3975(93)90364-Y
> >>
> >> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> >> ---
> >>  stdlib/qsort.c | 36 ++++++++++++++++++------------------
> >>  1 file changed, 18 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
> >> index be01fb5598..f5babef150 100644
> >> --- a/stdlib/qsort.c
> >> +++ b/stdlib/qsort.c
> >> @@ -132,26 +132,26 @@ static inline void
> >>  siftdown (void *base, size_t size, size_t k, size_t n,
> >>  	  enum swap_type_t swap_type, __compar_d_fn_t cmp, void *arg)
> >>  {
> >> -  /* There can only be a heap condition violation if there are
> >> -     children.  */
> >> -  while (2 * k + 1 <= n)
> >> -    {
> >> -      /* Left child.  */
> >> -      size_t j = 2 * k + 1;
> >> -      /* If the right child is larger, use it.  */
> >> -      if (j < n && cmp (base + (j * size), base + ((j + 1) * size), arg) < 0)
> >> -	j++;
> >> -
> >> -      /* If k is already >= to its children, we are done.  */
> >> -      if (j == k || cmp (base + (k * size), base + (j * size), arg) >= 0)
> >> -	break;
> >> +  size_t i, j;
> >> +
> >> +  /* Find the sift-down path all the way to the leaves. */
> >> +  for (i = k; j = 2 * i + 1, j + 1 <= n;)
> >> +    i = cmp (base + j * size, base + (j + 1) * size, arg) >= 0 ? j : (j + 1);
> >>  
> >> -      /* Heal the violation.  */
> >> -      do_swap (base + (size * j), base + (k * size), size, swap_type);
> >> +  /* Special case for the last leaf with no sibling. */
> >> +  if (j == n)
> >> +    i = j;
> >>  
> >> -      /* Swapping with j may have introduced a violation at j.  Fix
> >> -	 it in the next loop iteration.  */
> >> -      k = j;
> >> +  /* Backtrack to the correct location. */
> >> +  while (i != k && cmp (base + k * size, base + i * size, arg) > 0)
> >> +    i = (i - 1) / 2;
> >> +
> >> +  /* Shift the element into its correct place. */
> >> +  j = i;
> >> +  while (i != k)
> >> +    {
> >> +      i = (i - 1) / 2;
> >> +      do_swap (base + i * size, base + j * size, size, swap_type);
> >>      }
> >>  }
> >>  
> >> -- 
> >> 2.25.1
> >>
> > 
> > Since we still retain heapsort as a fallback for mergesort, should we
> > reconsider applying this patch?
> > 
> > Thanks,
> > Kuan-Wei
> 
> Yes, I think it is work and it looks great to me.

Should I resend a new patch, or can you directly apply this one?

Thanks,
Kuan-Wei
  
Adhemerval Zanella Netto March 27, 2024, 8:42 p.m. UTC | #15
On 27/03/24 16:59, Kuan-Wei Chiu wrote:
> On Wed, Mar 27, 2024 at 04:45:35PM -0300, Adhemerval Zanella Netto wrote:
>>
>>
>> On 16/02/24 04:08, Kuan-Wei Chiu wrote:
>>> On Sun, Dec 03, 2023 at 05:48:39AM +0800, Kuan-Wei Chiu wrote:
>>>> The current heapsort implementation in the siftdown function follows
>>>> the standard textbook version, necessitating two comparisons at each
>>>> level. Transitioning to the Bottom-up heapsort version allows us to
>>>> decrease the required comparisons to just one per level. On average,
>>>> this modification significantly reduces the comparison count from
>>>> 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37*n + o(n).
>>>>
>>>> Refs:
>>>>   BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average,
>>>>   QUICKSORT (if n is not very small)
>>>>   Ingo Wegener
>>>>   Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993
>>>>   https://doi.org/10.1016/0304-3975(93)90364-Y
>>>>
>>>> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
>>>> ---
>>>>  stdlib/qsort.c | 36 ++++++++++++++++++------------------
>>>>  1 file changed, 18 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/stdlib/qsort.c b/stdlib/qsort.c
>>>> index be01fb5598..f5babef150 100644
>>>> --- a/stdlib/qsort.c
>>>> +++ b/stdlib/qsort.c
>>>> @@ -132,26 +132,26 @@ static inline void
>>>>  siftdown (void *base, size_t size, size_t k, size_t n,
>>>>  	  enum swap_type_t swap_type, __compar_d_fn_t cmp, void *arg)
>>>>  {
>>>> -  /* There can only be a heap condition violation if there are
>>>> -     children.  */
>>>> -  while (2 * k + 1 <= n)
>>>> -    {
>>>> -      /* Left child.  */
>>>> -      size_t j = 2 * k + 1;
>>>> -      /* If the right child is larger, use it.  */
>>>> -      if (j < n && cmp (base + (j * size), base + ((j + 1) * size), arg) < 0)
>>>> -	j++;
>>>> -
>>>> -      /* If k is already >= to its children, we are done.  */
>>>> -      if (j == k || cmp (base + (k * size), base + (j * size), arg) >= 0)
>>>> -	break;
>>>> +  size_t i, j;
>>>> +
>>>> +  /* Find the sift-down path all the way to the leaves. */
>>>> +  for (i = k; j = 2 * i + 1, j + 1 <= n;)
>>>> +    i = cmp (base + j * size, base + (j + 1) * size, arg) >= 0 ? j : (j + 1);
>>>>  
>>>> -      /* Heal the violation.  */
>>>> -      do_swap (base + (size * j), base + (k * size), size, swap_type);
>>>> +  /* Special case for the last leaf with no sibling. */
>>>> +  if (j == n)
>>>> +    i = j;
>>>>  
>>>> -      /* Swapping with j may have introduced a violation at j.  Fix
>>>> -	 it in the next loop iteration.  */
>>>> -      k = j;
>>>> +  /* Backtrack to the correct location. */
>>>> +  while (i != k && cmp (base + k * size, base + i * size, arg) > 0)
>>>> +    i = (i - 1) / 2;
>>>> +
>>>> +  /* Shift the element into its correct place. */
>>>> +  j = i;
>>>> +  while (i != k)
>>>> +    {
>>>> +      i = (i - 1) / 2;
>>>> +      do_swap (base + i * size, base + j * size, size, swap_type);
>>>>      }
>>>>  }
>>>>  
>>>> -- 
>>>> 2.25.1
>>>>
>>>
>>> Since we still retain heapsort as a fallback for mergesort, should we
>>> reconsider applying this patch?
>>>
>>> Thanks,
>>> Kuan-Wei
>>
>> Yes, I think it is work and it looks great to me.
> 
> Should I resend a new patch, or can you directly apply this one?

It applies as-is, so it should be ok. I will run some tests and apply, thanks.
  

Patch

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index be01fb5598..f5babef150 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -132,26 +132,26 @@  static inline void
 siftdown (void *base, size_t size, size_t k, size_t n,
 	  enum swap_type_t swap_type, __compar_d_fn_t cmp, void *arg)
 {
-  /* There can only be a heap condition violation if there are
-     children.  */
-  while (2 * k + 1 <= n)
-    {
-      /* Left child.  */
-      size_t j = 2 * k + 1;
-      /* If the right child is larger, use it.  */
-      if (j < n && cmp (base + (j * size), base + ((j + 1) * size), arg) < 0)
-	j++;
-
-      /* If k is already >= to its children, we are done.  */
-      if (j == k || cmp (base + (k * size), base + (j * size), arg) >= 0)
-	break;
+  size_t i, j;
+
+  /* Find the sift-down path all the way to the leaves. */
+  for (i = k; j = 2 * i + 1, j + 1 <= n;)
+    i = cmp (base + j * size, base + (j + 1) * size, arg) >= 0 ? j : (j + 1);
 
-      /* Heal the violation.  */
-      do_swap (base + (size * j), base + (k * size), size, swap_type);
+  /* Special case for the last leaf with no sibling. */
+  if (j == n)
+    i = j;
 
-      /* Swapping with j may have introduced a violation at j.  Fix
-	 it in the next loop iteration.  */
-      k = j;
+  /* Backtrack to the correct location. */
+  while (i != k && cmp (base + k * size, base + i * size, arg) > 0)
+    i = (i - 1) / 2;
+
+  /* Shift the element into its correct place. */
+  j = i;
+  while (i != k)
+    {
+      i = (i - 1) / 2;
+      do_swap (base + i * size, base + j * size, size, swap_type);
     }
 }