[v7,2/2] Optimize bsearch() implementation for performance
Checks
Context |
Check |
Description |
redhat-pt-bot/TryBot-apply_patch |
success
|
Patch applied to master at the time it was sent
|
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 |
success
|
Build passed
|
redhat-pt-bot/TryBot-32bit |
success
|
Build for i686
|
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 |
success
|
Test passed
|
linaro-tcwg-bot/tcwg_glibc_build--master-arm |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_glibc_check--master-arm |
success
|
Test passed
|
Commit Message
Optimize the bsearch() function to improve binary search performance.
Although the code size grew by 8 bytes, the new implementation achieves
a 15% reduction in execution time on my x86 machine, according to the
bench-bsearch benchmark results.
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
bits/stdlib-bsearch.h | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
Comments
Kuan-Wei Chiu <visitorckw@gmail.com> writes:
> Optimize the bsearch() function to improve binary search performance.
> Although the code size grew by 8 bytes, the new implementation achieves
> a 15% reduction in execution time on my x86 machine, according to the
> bench-bsearch benchmark results.
LGTM as-is but I noted something to try that might speed it up more.
Reviewed-by: DJ Delorie <dj@redhat.com>
> diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h
> bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
> __compar_fn_t __compar)
> {
> - size_t __l, __u, __idx;
> const void *__p;
> int __comparison;
>
> - __l = 0;
> - __u = __nmemb;
> - while (__l < __u)
> + while (__nmemb)
> {
> - __idx = (__l + __u) / 2;
> - __p = (const void *) (((const char *) __base) + (__idx * __size));
> + __p = (const void *) (((const char *) __base) + ((__nmemb >> 1) * __size));
We're replacing an add/shift with just a shift, but using the existing
__base and __nmemb instead of the new __l and __u. Ok.
After this, __p points to base[__nmemb>>1]
> __comparison = (*__compar) (__key, __p);
> - if (__comparison < 0)
> - __u = __idx;
> - else if (__comparison > 0)
> - __l = __idx + 1;
> - else
> + if (__comparison == 0)
It seems to me performance would be better if the more likely cases went
first - checking "__comparison > 0" would, approximately 50% of the
time, skip the "__comparison == 0" check, but not the other way around.
Have you tried this with the ">0" test before the "==0" test, or with
marking the "==0" test as unlikely?
Keeping the >0 and <0 paths close to the top of the loop might help with
cache hits too, depending on link alignment.
> {
> #if __GNUC_PREREQ(4, 6)
> # pragma GCC diagnostic push
> @@ -46,6 +38,12 @@ bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
> # pragma GCC diagnostic pop
> #endif
> }
> + if (__comparison > 0)
> + {
> + __base = ((const char *) __p) + __size;
> + --__nmemb;
> + }
> + __nmemb >>= 1;
So if comparison < 0 (key is before probe):
__nmemb is set to the number of elements before __p which is base[__nmemb>>1]
__base is unchanged
- range is 0..__nmemb-1
if comparison is > 0 (key is after probe):
__base is set to the element after the probe
If (__nmemb is odd)
probe point is at (__nmemb-1) >> 1
same numbers before and after probe
subtracting one is unneeded but has no effect
0 1 2 3 4 5 6 7 8 (n == 9) (n>>1 == 4)
P
0 1 2 3 4 5 6 (n == 7) (n>>1 == 3)
P
If (__nmemb is even)
probe point is at (__nmemb) >> 1
even numbers before but odd number after
subtracting one is needed
0 1 2 3 4 5 6 7 (n == 8) (n>>1 == 4)
P
0 1 2 3 4 5 (n == 6) (n>>1 == 3)
P
Ok.
> }
>
> return NULL;
On Wed, Sep 11, 2024 at 06:16:32PM -0400, DJ Delorie wrote:
> Kuan-Wei Chiu <visitorckw@gmail.com> writes:
> > Optimize the bsearch() function to improve binary search performance.
> > Although the code size grew by 8 bytes, the new implementation achieves
> > a 15% reduction in execution time on my x86 machine, according to the
> > bench-bsearch benchmark results.
>
> LGTM as-is but I noted something to try that might speed it up more.
> Reviewed-by: DJ Delorie <dj@redhat.com>
Thank you for your review and suggestions!
>
> > diff --git a/bits/stdlib-bsearch.h b/bits/stdlib-bsearch.h
>
> > bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
> > __compar_fn_t __compar)
> > {
> > - size_t __l, __u, __idx;
> > const void *__p;
> > int __comparison;
> >
> > - __l = 0;
> > - __u = __nmemb;
> > - while (__l < __u)
> > + while (__nmemb)
> > {
> > - __idx = (__l + __u) / 2;
> > - __p = (const void *) (((const char *) __base) + (__idx * __size));
> > + __p = (const void *) (((const char *) __base) + ((__nmemb >> 1) * __size));
>
> We're replacing an add/shift with just a shift, but using the existing
> __base and __nmemb instead of the new __l and __u. Ok.
>
> After this, __p points to base[__nmemb>>1]
>
> > __comparison = (*__compar) (__key, __p);
> > - if (__comparison < 0)
> > - __u = __idx;
> > - else if (__comparison > 0)
> > - __l = __idx + 1;
> > - else
> > + if (__comparison == 0)
>
> It seems to me performance would be better if the more likely cases went
> first - checking "__comparison > 0" would, approximately 50% of the
> time, skip the "__comparison == 0" check, but not the other way around.
> Have you tried this with the ">0" test before the "==0" test, or with
> marking the "==0" test as unlikely?
>
> Keeping the >0 and <0 paths close to the top of the loop might help with
> cache hits too, depending on link alignment.
I tried adding unlikely(), but it seems that gcc generates the exact
same code (Note: I only checked on x86). As for moving the > 0 test
first, I also gave it a try, but it actually worsened the benchmark
results. I haven’t looked closely at the generated code to understand
why, but here are the benchmark results:
For the current patch:
{
"timing_type": "hp_timing",
"functions": {
"bsearch": {
"bench-variant": "default",
"results": [
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "descending",
"contained": "no",
"simple": "yes",
"timing": 110.092
},
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "descending",
"contained": "yes",
"simple": "yes",
"timing": 99.0945
},
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "ascending",
"contained": "no",
"simple": "yes",
"timing": 112.395
},
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "ascending",
"contained": "yes",
"simple": "yes",
"timing": 98.5953
}]
}
}
}
With the > 0 test moved to the front:
{
"timing_type": "hp_timing",
"functions": {
"bsearch": {
"bench-variant": "default",
"results": [
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "descending",
"contained": "no",
"simple": "yes",
"timing": 113.875
},
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "descending",
"contained": "yes",
"simple": "yes",
"timing": 101.111
},
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "ascending",
"contained": "no",
"simple": "yes",
"timing": 113.713
},
{
"array-size": 100000,
"element-size": 4,
"key-pattern": "ascending",
"contained": "yes",
"simple": "yes",
"timing": 105.429
}]
}
}
}
>
> > {
> > #if __GNUC_PREREQ(4, 6)
> > # pragma GCC diagnostic push
> > @@ -46,6 +38,12 @@ bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
> > # pragma GCC diagnostic pop
> > #endif
> > }
> > + if (__comparison > 0)
> > + {
> > + __base = ((const char *) __p) + __size;
> > + --__nmemb;
> > + }
> > + __nmemb >>= 1;
>
> So if comparison < 0 (key is before probe):
> __nmemb is set to the number of elements before __p which is base[__nmemb>>1]
> __base is unchanged
> - range is 0..__nmemb-1
>
> if comparison is > 0 (key is after probe):
> __base is set to the element after the probe
> If (__nmemb is odd)
> probe point is at (__nmemb-1) >> 1
> same numbers before and after probe
> subtracting one is unneeded but has no effect
>
> 0 1 2 3 4 5 6 7 8 (n == 9) (n>>1 == 4)
> P
> 0 1 2 3 4 5 6 (n == 7) (n>>1 == 3)
> P
>
> If (__nmemb is even)
> probe point is at (__nmemb) >> 1
> even numbers before but odd number after
> subtracting one is needed
>
> 0 1 2 3 4 5 6 7 (n == 8) (n>>1 == 4)
> P
> 0 1 2 3 4 5 (n == 6) (n>>1 == 3)
> P
>
> Ok.
>
> > }
> >
> > return NULL;
>
Kuan-Wei Chiu <visitorckw@gmail.com> writes:
> I tried adding unlikely(), but it seems that gcc generates the exact
> same code (Note: I only checked on x86). As for moving the > 0 test
> first, I also gave it a try, but it actually worsened the benchmark
> results. I haven’t looked closely at the generated code to understand
> why, but here are the benchmark results:
Weird. Maybe gcc is so complex that the usual ideas just don't work any
more? Oh well, the numbers are what counts.
@@ -20,22 +20,14 @@ __extern_inline void *
bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
__compar_fn_t __compar)
{
- size_t __l, __u, __idx;
const void *__p;
int __comparison;
- __l = 0;
- __u = __nmemb;
- while (__l < __u)
+ while (__nmemb)
{
- __idx = (__l + __u) / 2;
- __p = (const void *) (((const char *) __base) + (__idx * __size));
+ __p = (const void *) (((const char *) __base) + ((__nmemb >> 1) * __size));
__comparison = (*__compar) (__key, __p);
- if (__comparison < 0)
- __u = __idx;
- else if (__comparison > 0)
- __l = __idx + 1;
- else
+ if (__comparison == 0)
{
#if __GNUC_PREREQ(4, 6)
# pragma GCC diagnostic push
@@ -46,6 +38,12 @@ bsearch (const void *__key, const void *__base, size_t __nmemb, size_t __size,
# pragma GCC diagnostic pop
#endif
}
+ if (__comparison > 0)
+ {
+ __base = ((const char *) __p) + __size;
+ --__nmemb;
+ }
+ __nmemb >>= 1;
}
return NULL;