[RFC,3/3] x86: Enable non-temporal memset for Hygon processors
Checks
Context | Check | Description
redhat-pt-bot/TryBot-apply_patch | success | Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit | fail | Patch series failed to build
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 | success | Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 | success | Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm | success | Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm | success | Test passed
Commit Message
This patch depends on the following patch, which adds a new flag:
https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
Once the new CPU flag 'Prefer_Non_Temporal' is added to glibc,
this patch enables the non-temporal memset implementation
for Hygon processors.
Test Results:
thread: 1
memset store value: 0

hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 byte (2x - 4x vec case)   1
256 byte (4x - 8x vec case)   1
512 byte (> 8x loop case)     1
1MB                           0.994
4MB                           0.996
8MB                           0.670
16MB                          0.343
32MB                          0.355

hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 byte (2x - 4x vec case)   1
256 byte (4x - 8x vec case)   0.653
512 byte (> 8x loop case)     0.713
1MB                           1
4MB                           0.887
8MB                           1.312
16MB                          0.822
32MB                          0.830

hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 byte (2x - 4x vec case)   1
256 byte (4x - 8x vec case)   1
512 byte (> 8x loop case)     1
1MB                           1
4MB                           0.990
8MB                           0.737
16MB                          0.390
32MB                          0.401

With this patch, Hygon processors show no performance degradation in the
'2x - 8x' branch cases despite the added branch, and non-temporal stores
improve performance by 20% - 65%.
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
---
sysdeps/x86/cpu-features.c | 6 ++++++
1 file changed, 6 insertions(+)
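
The percentage range quoted in the commit message can be related to the ratio column with a small calculation. The following is only a sketch: it assumes the "new performance / old performance" column is a ratio of run times (lower is better), which the commit message does not state explicitly, and the helper names `percent_time_saved` and `print_row` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: the "new performance / old performance" column is a ratio of
   run times, so a value below 1 means the patched memset is faster.  */
double
percent_time_saved (double new_over_old)
{
  assert (new_over_old > 0.0);
  return (1.0 - new_over_old) * 100.0;
}

/* Print one table row together with the derived percentage.  A negative
   percentage means the patched run took longer than the unpatched one.  */
void
print_row (const char *size, double new_over_old)
{
  printf ("%-6s ratio %.3f -> %+.1f%% time saved\n",
          size, new_over_old, percent_time_saved (new_over_old));
}
```

Under that assumption, the hygon1 16MB ratio of 0.343 corresponds to roughly 65.7% less time, while the hygon2 8MB ratio of 1.312 yields a negative value, meaning a slowdown at that size.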
Comments
On 12/08/24 03:48, Feifei Wang wrote:
> This patch is based on the following new flag patch:
> https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
>
This patch fails to build for 32-bit:
https://www.delorie.com/trybots/32bit/37310/make.tail.txt
On Sun, Aug 11, 2024 at 11:49 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
> This patch is based on the following new flag patch:
> https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
Please wait until the above patch has been reviewed and committed.
> -----Original Message-----
> From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
> Sent: August 12, 2024 21:02
> To: Feifei Wang <wangfeifei@hygon.cn>; libc-alpha@sourceware.org
> Cc: hjl.tools@gmail.com; carlos@redhat.com; fw@deneb.enyo.de;
> goldstein.w.n@gmail.com; Jing Li <lijing@hygon.cn>
> Subject: Re: [RFC PATCH 3/3] x86: Enable non-temporal memset for Hygon
> processors
>
>
>
> On 12/08/24 03:48, Feifei Wang wrote:
> > This patch is based on the following new flag patch:
> > https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
> >
>
> This patch fails to build for 32-bit:
>
> https://www.delorie.com/trybots/32bit/37310/make.tail.txt
This patch depends on the new flag patch above; once that patch is merged,
this one will build successfully.
> -----Original Message-----
> From: H.J. Lu <hjl.tools@gmail.com>
> Sent: August 12, 2024 21:12
> To: Feifei Wang <wangfeifei@hygon.cn>
> Cc: libc-alpha@sourceware.org; carlos@redhat.com; fw@deneb.enyo.de;
> goldstein.w.n@gmail.com; Jing Li <lijing@hygon.cn>
> Subject: Re: [RFC PATCH 3/3] x86: Enable non-temporal memset for Hygon
> processors
>
> On Sun, Aug 11, 2024 at 11:49 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> > This patch is based on the following new flag patch:
> > https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
>
> Please wait until the above patch has been reviewed and committed.
>
That's fine.
@@ -1098,6 +1098,12 @@ https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
get_extended_indices (cpu_features);
update_active (cpu_features);
+
+ /* Set the Prefer_Non_Temporal flag to select the non-temporal
+ memset implementation, because ERMS is disabled on Hygon
+ processors. */
+ cpu_features->preferred[index_arch_Prefer_Non_Temporal]
+ |= (bit_arch_Prefer_Non_Temporal);
}
else
{
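
The effect of the added lines can be illustrated with a standalone mock. This is a minimal sketch, not glibc code: `struct mock_cpu_features`, the `MOCK_*` macros, `mock_init_hygon`, and `mock_select_path` are hypothetical stand-ins, and the size comparison (`>=` against an 8MB threshold) is an assumption based on the threshold value quoted in the test results. The real selection happens inside glibc's memset implementation and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for glibc's index_arch_* / bit_arch_* macros;
   the real cpu_features layout is different.  */
#define MOCK_INDEX_PREFER_NON_TEMPORAL 0
#define MOCK_BIT_PREFER_NON_TEMPORAL (1u << 0)

struct mock_cpu_features
{
  unsigned int preferred[1];      /* Bitmask of "preferred" tuning flags.  */
  size_t non_temporal_threshold;  /* Size at which non-temporal stores start.  */
};

enum memset_path { PATH_REGULAR, PATH_NON_TEMPORAL };

/* Mirror the shape of the patch: OR the flag bit into the preferred mask
   during feature initialization for a Hygon CPU.  */
void
mock_init_hygon (struct mock_cpu_features *f)
{
  assert (f != NULL);
  memset (f, 0, sizeof *f);
  f->non_temporal_threshold = (size_t) 8 << 20;  /* 8MB, as in the tests.  */
  f->preferred[MOCK_INDEX_PREFER_NON_TEMPORAL]
    |= MOCK_BIT_PREFER_NON_TEMPORAL;
}

/* Assumed dispatch rule: take the non-temporal path only when the flag is
   set and the size reaches the threshold.  */
enum memset_path
mock_select_path (const struct mock_cpu_features *f, size_t n)
{
  assert (f != NULL);
  int prefer = (f->preferred[MOCK_INDEX_PREFER_NON_TEMPORAL]
                & MOCK_BIT_PREFER_NON_TEMPORAL) != 0;
  if (prefer && n >= f->non_temporal_threshold)
    return PATH_NON_TEMPORAL;
  return PATH_REGULAR;
}
```

In this mock, small sizes such as 512 bytes stay on the regular path, and sizes at or above the threshold switch to the non-temporal path only while the flag bit is set. Clearing the bit restores the regular path for every size.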