[v7,2/2] Optimize bsearch() implementation for performance

Message ID 20240909162327.3163100-3-visitorckw@gmail.com
State Accepted
Headers
Series Optimization and benchmarking of bsearch() |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed

Commit Message

Kuan-Wei Chiu Sept. 9, 2024, 4:23 p.m. UTC
  Optimize the bsearch() function to improve binary search performance.
Although the code size grew by 8 bytes, the new implementation achieves
a 15% reduction in execution time on my x86 machine, according to the
bench-bsearch benchmark results.

Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
 bits/stdlib-bsearch.h | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)
  

Comments

DJ Delorie Sept. 11, 2024, 10:16 p.m. UTC | #1
Kuan-Wei Chiu <visitorckw@gmail.com> writes:
> Optimize the bsearch() function to improve binary search performance.
> Although the code size grew by 8 bytes, the new implementation achieves
> a 15% reduction in execution time on my x86 machine, according to the
> bench-bsearch benchmark results.

LGTM as-is but I noted something to try that might speed it up more.
Reviewed-by: DJ Delorie <dj@redhat.com>

> diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h

>  bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
>  	 __compar_fn_t __compar)
>  {
> -  size_t __l, __u, __idx;
>    const void *__p;
>    int __comparison;
>  
> -  __l = 0;
> -  __u = __nmemb;
> -  while (__l < __u)
> +  while (__nmemb)
>      {
> -      __idx = (__l + __u) / 2;
> -      __p = (const void *) (((const char *) __base) + (__idx * __size));
> +      __p = (const void *) (((const char *) __base) + ((__nmemb >> 1) * __size));

We're replacing an add/shift with just a shift, but using the existing
__base and __nmemb instead of the new __l and __u.  Ok.

After this, __p points to base[__nmemb>>1]

>        __comparison = (*__compar) (__key, __p);
> -      if (__comparison < 0)
> -	__u = __idx;
> -      else if (__comparison > 0)
> -	__l = __idx + 1;
> -      else
> +      if (__comparison == 0)

It seems to me performance would be better if the more likely cases went
first - checking "__comparison > 0" would, approximately 50% of the
time, skip the "__comparison == 0" check, but not the other way around.
Have you tried this with the ">0" test before the "==0" test, or with
marking the "==0" test as unlikely?

Keeping the >0 and <0 paths close to the top of the loop might help with
cache hits too, depending on link alignment.

>  	{
>  #if __GNUC_PREREQ(4, 6)
>  # pragma GCC diagnostic push
> @@ -46,6 +38,12 @@ bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
>  # pragma GCC diagnostic pop
>  #endif
>  	}
> +      if (__comparison > 0)
> +	{
> +	  __base = ((const char *) __p) + __size;
> +	  --__nmemb;
> +	}
> +      __nmemb >>= 1;

So if comparison < 0 (key is before probe):
  __nmemb is set to the number of elements before __p which is base[__nmemb>>1]
  __base is unchanged
    - range is 0..__nmemb-1

if comparison is > 0 (key is after probe):
  __base is set to the element after the probe
  If (__nmemb is odd)
    probe point is at (__nmemb-1) >> 1
    same numbers before and after probe
    subtracting one is unneeded but has no effect

0 1 2 3 4 5 6 7 8  (n == 9)  (n>>1 == 4)
        P
0 1 2 3 4 5 6  (n == 7)  (n>>1 == 3)
      P

  If (__nmemb is even)
    probe point is at (__nmemb) >> 1
    even numbers before but odd number after
    subtracting one is needed

0 1 2 3 4 5 6 7  (n == 8)  (n>>1 == 4)
        P
0 1 2 3 4 5  (n == 6)  (n>>1 == 3)
      P

Ok.

>      }
>  
>    return NULL;
  
Kuan-Wei Chiu Sept. 12, 2024, 4 p.m. UTC | #2
On Wed, Sep 11, 2024 at 06:16:32PM -0400, DJ Delorie wrote:
> Kuan-Wei Chiu <visitorckw@gmail.com> writes:
> > Optimize the bsearch() function to improve binary search performance.
> > Although the code size grew by 8 bytes, the new implementation achieves
> > a 15% reduction in execution time on my x86 machine, according to the
> > bench-bsearch benchmark results.
> 
> LGTM as-is but I noted something to try that might speed it up more.
> Reviewed-by: DJ Delorie <dj@redhat.com>

Thank you for your review and suggestions!
> 
> > diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h
> 
> >  bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
> >  	 __compar_fn_t __compar)
> >  {
> > -  size_t __l, __u, __idx;
> >    const void *__p;
> >    int __comparison;
> >  
> > -  __l = 0;
> > -  __u = __nmemb;
> > -  while (__l < __u)
> > +  while (__nmemb)
> >      {
> > -      __idx = (__l + __u) / 2;
> > -      __p = (const void *) (((const char *) __base) + (__idx * __size));
> > +      __p = (const void *) (((const char *) __base) + ((__nmemb >> 1) * __size));
> 
> We're replacing an add/shift with just a shift, but using the existing
> __base and __nmemb instead of the new __l and __u.  Ok.
> 
> After this, __p points to base[__nmemb>>1]
> 
> >        __comparison = (*__compar) (__key, __p);
> > -      if (__comparison < 0)
> > -	__u = __idx;
> > -      else if (__comparison > 0)
> > -	__l = __idx + 1;
> > -      else
> > +      if (__comparison == 0)
> 
> It seems to me performance would be better if the more likely cases went
> first - checking "__comparison > 0" would, approximately 50% of the
> time, skip the "__comparison == 0" check, but not the other way around.
> Have you tried this with the ">0" test before the "==0" test, or with
> marking the "==0" test as unlikely?
> 
> Keeping the >0 and <0 paths close to the top of the loop might help with
> cache hits too, depending on link alignment.

I tried adding unlikely(), but it seems that gcc generates the exact
same code (Note: I only checked on x86). As for moving the > 0 test
first, I also gave it a try, but it actually worsened the benchmark
results. I haven’t looked closely at the generated code to understand
why, but here are the benchmark results:

For the current patch:
{
 "timing_type": "hp_timing",
 "functions": {
  "bsearch": {
   "bench-variant": "default",
   "results": [
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "descending",
     "contained": "no",
     "simple": "yes",
     "timing": 110.092
    },
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "descending",
     "contained": "yes",
     "simple": "yes",
     "timing": 99.0945
    },
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "ascending",
     "contained": "no",
     "simple": "yes",
     "timing": 112.395
    },
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "ascending",
     "contained": "yes",
     "simple": "yes",
     "timing": 98.5953
    }]
  }
 }
}

With the > 0 test moved to the front:
{
 "timing_type": "hp_timing",
 "functions": {
  "bsearch": {
   "bench-variant": "default",
   "results": [
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "descending",
     "contained": "no",
     "simple": "yes",
     "timing": 113.875
    },
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "descending",
     "contained": "yes",
     "simple": "yes",
     "timing": 101.111
    },
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "ascending",
     "contained": "no",
     "simple": "yes",
     "timing": 113.713
    },
    {
     "array-size": 100000,
     "element-size": 4,
     "key-pattern": "ascending",
     "contained": "yes",
     "simple": "yes",
     "timing": 105.429
    }]
  }
 }
}
> 
> >  	{
> >  #if __GNUC_PREREQ(4, 6)
> >  # pragma GCC diagnostic push
> > @@ -46,6 +38,12 @@ bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
> >  # pragma GCC diagnostic pop
> >  #endif
> >  	}
> > +      if (__comparison > 0)
> > +	{
> > +	  __base = ((const char *) __p) + __size;
> > +	  --__nmemb;
> > +	}
> > +      __nmemb >>= 1;
> 
> So if comparison < 0 (key is before probe):
>   __nmemb is set to the number of elements before __p which is base[__nmemb>>1]
>   __base is unchanged
>     - range is 0..__nmemb-1
> 
> if comparison is > 0 (key is after probe):
>   __base is set to the element after the probe
>   If (__nmemb is odd)
>     probe point is at (__nmemb-1) >> 1
>     same numbers before and after probe
>     subtracting one is unneeded but has no effect
> 
> 0 1 2 3 4 5 6 7 8  (n == 9)  (n>>1 == 4)
>         P
> 0 1 2 3 4 5 6  (n == 7)  (n>>1 == 3)
>       P
> 
>   If (__nmemb is even)
>     probe point is at (__nmemb) >> 1
>     even numbers before but odd number after
>     subtracting one is needed
> 
> 0 1 2 3 4 5 6 7  (n == 8)  (n>>1 == 4)
>         P
> 0 1 2 3 4 5  (n == 6)  (n>>1 == 3)
>       P
> 
> Ok.
> 
> >      }
> >  
> >    return NULL;
>
  
DJ Delorie Sept. 12, 2024, 5:12 p.m. UTC | #3
Kuan-Wei Chiu <visitorckw@gmail.com> writes:
> I tried adding unlikely(), but it seems that gcc generates the exact
> same code (Note: I only checked on x86). As for moving the > 0 test
> first, I also gave it a try, but it actually worsened the benchmark
> results. I haven’t looked closely at the generated code to understand
> why, but here are the benchmark results:

Weird.  Maybe gcc is so complex that the usual ideas just don't work any
more?  Oh well, the numbers are what counts.
  

Patch

diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h
index 540d718952..57f060b504 100644
--- a/bits/stdlib-bsearch.h
+++ b/bits/stdlib-bsearch.h
@@ -20,22 +20,14 @@  __extern_inline void *
 bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
 	 __compar_fn_t __compar)
 {
-  size_t __l, __u, __idx;
   const void *__p;
   int __comparison;
 
-  __l = 0;
-  __u = __nmemb;
-  while (__l < __u)
+  while (__nmemb)
     {
-      __idx = (__l + __u) / 2;
-      __p = (const void *) (((const char *) __base) + (__idx * __size));
+      __p = (const void *) (((const char *) __base) + ((__nmemb >> 1) * __size));
       __comparison = (*__compar) (__key, __p);
-      if (__comparison < 0)
-	__u = __idx;
-      else if (__comparison > 0)
-	__l = __idx + 1;
-      else
+      if (__comparison == 0)
 	{
 #if __GNUC_PREREQ(4, 6)
 # pragma GCC diagnostic push
@@ -46,6 +38,12 @@  bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
 # pragma GCC diagnostic pop
 #endif
 	}
+      if (__comparison > 0)
+	{
+	  __base = ((const char *) __p) + __size;
+	  --__nmemb;
+	}
+      __nmemb >>= 1;
     }
 
   return NULL;