[RFC,3/3] x86: Enable non-temporal memset for Hygon processors

Message ID 1723445305-99403-4-git-send-email-wangfeifei@hygon.cn
State Superseded
Series x86: Add support for Hygon processors

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit fail Patch series failed to build
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed

Commit Message

Feifei Wang Aug. 12, 2024, 6:48 a.m. UTC
This patch is based on the following patch, which adds a new flag:
https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/

Once the new CPU flag 'Prefer_Non_Temporal' has been added to glibc,
this patch sets it for Hygon processors so that they use the
non-temporal memset implementation.
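
For readers unfamiliar with the technique, the following is a minimal sketch of a memset that switches to non-temporal (streaming) stores above a size threshold. It is not glibc's implementation: the name sketch_memset and the macro SKETCH_NT_THRESHOLD are hypothetical, the 8MB value only mirrors the threshold used in the test runs, and the SSE2 intrinsics are an assumption about the build target (the code falls back to plain memset without SSE2).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
# include <emmintrin.h>
#endif

/* Hypothetical threshold; it mirrors the 8MB used in the test runs.  */
#define SKETCH_NT_THRESHOLD ((size_t) 8 * 1024 * 1024)

/* Illustrative only: this is not glibc's memset.  */
static void *
sketch_memset (void *dst, int c, size_t n)
{
#if defined(__SSE2__)
  if (n >= SKETCH_NT_THRESHOLD)
    {
      unsigned char *p = dst;
      unsigned char *end = p + n;
      /* Fill the unaligned head (at most 15 bytes) with ordinary stores.  */
      size_t head = (size_t) (-(uintptr_t) p & 15);
      memset (p, c, head);
      p += head;
      __m128i v = _mm_set1_epi8 ((char) c);
      /* Stream whole 16-byte blocks with non-temporal stores.  */
      for (; (size_t) (end - p) >= 16; p += 16)
        _mm_stream_si128 ((__m128i *) p, v);
      /* Order the streaming stores before any later stores.  */
      _mm_sfence ();
      /* Fill the remaining tail with ordinary stores.  */
      memset (p, c, (size_t) (end - p));
      return dst;
    }
#endif
  return memset (dst, c, n);
}
```

Sizes below the threshold take the ordinary path unchanged, which is why only the large-size rows of a benchmark would be expected to move.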

Test Results:
thread: 1
memset store value: 0

hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 bytes (2x - 4x vec case)  1
256 bytes (4x - 8x vec case)  1
512 bytes (> 8x loop case)    1
1MB                           0.994
4MB                           0.996
8MB                           0.670
16MB                          0.343
32MB                          0.355

hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 bytes (2x - 4x vec case)  1
256 bytes (4x - 8x vec case)  0.653
512 bytes (> 8x loop case)    0.713
1MB                           1
4MB                           0.887
8MB                           1.312
16MB                          0.822
32MB                          0.830

hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 bytes (2x - 4x vec case)  1
256 bytes (4x - 8x vec case)  1
512 bytes (> 8x loop case)    1
1MB                           1
4MB                           0.990
8MB                           0.737
16MB                          0.390
32MB                          0.401

On Hygon processors, this patch causes no performance degradation in the
'2x - 8x' branch cases despite the extra branch it adds, and non-temporal
stores can improve performance by 20% - 65%.

Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
---
 sysdeps/x86/cpu-features.c | 6 ++++++
 1 file changed, 6 insertions(+)
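
As a reading aid for the ratio tables above, here is a small sketch that converts a "new / old" ratio into a percentage change. It assumes the ratio compares elapsed time, so values below 1 mean the new code is faster; the thread does not state the unit, but this reading is consistent with the quoted 20% - 65% improvement. The function names are hypothetical.

```c
#include <stdio.h>

/* Percentage change implied by a new/old ratio, assuming the ratio
   compares elapsed time (lower is better).  Positive means faster.  */
static double
ratio_to_gain_percent (double ratio)
{
  return (1.0 - ratio) * 100.0;
}

/* Print the implied gains for the large-size hygon1 rows.  */
static void
print_hygon1_gains (void)
{
  static const struct { const char *size; double ratio; } rows[] = {
    { "8MB", 0.670 }, { "16MB", 0.343 }, { "32MB", 0.355 },
  };
  for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
    printf ("%-5s %.1f%%\n", rows[i].size,
            ratio_to_gain_percent (rows[i].ratio));
}
```

Under the same assumption, a ratio above 1, such as the 1.312 reported for 8MB on hygon2, yields a negative value, i.e. a slowdown at that size.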
  

Comments

Adhemerval Zanella Netto Aug. 12, 2024, 1:02 p.m. UTC | #1
On 12/08/24 03:48, Feifei Wang wrote:
> This patch is based on the following new flag patch:
> https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
> 

This patch fails to build for 32-bit:

https://www.delorie.com/trybots/32bit/37310/make.tail.txt

  
H.J. Lu Aug. 12, 2024, 1:11 p.m. UTC | #2
On Sun, Aug 11, 2024 at 11:49 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
> This patch is based on the following new flag patch:
> https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/

Please wait until the above patch has been reviewed and committed.

  
Feifei Wang Aug. 13, 2024, 2:06 a.m. UTC | #3
> -----Original Message-----
> From: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
> Sent: August 12, 2024 21:02
> To: Feifei Wang <wangfeifei@hygon.cn>; libc-alpha@sourceware.org
> Cc: hjl.tools@gmail.com; carlos@redhat.com; fw@deneb.enyo.de;
> goldstein.w.n@gmail.com; Jing Li <lijing@hygon.cn>
> Subject: Re: [RFC PATCH 3/3] x86: Enable non-temporal memset for Hygon
> processors
> 
> 
> 
> On 12/08/24 03:48, Feifei Wang wrote:
> > This patch is based on the following new flag patch:
> > https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
> >
> 
> This patch fails to build for 32-bit:
> 
> https://www.delorie.com/trybots/32bit/37310/make.tail.txt
This patch is based on the new flag patch above; once that patch is merged,
this one will build successfully.
  
Feifei Wang Aug. 13, 2024, 2:07 a.m. UTC | #4
> -----Original Message-----
> From: H.J. Lu <hjl.tools@gmail.com>
> Sent: August 12, 2024 21:12
> To: Feifei Wang <wangfeifei@hygon.cn>
> Cc: libc-alpha@sourceware.org; carlos@redhat.com; fw@deneb.enyo.de;
> goldstein.w.n@gmail.com; Jing Li <lijing@hygon.cn>
> Subject: Re: [RFC PATCH 3/3] x86: Enable non-temporal memset for Hygon
> processors
> 
> On Sun, Aug 11, 2024 at 11:49 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> > This patch is based on the following new flag patch:
> > https://patchwork.sourceware.org/project/glibc/patch/20240811055619.2863839-1-goldstein.w.n@gmail.com/
> 
> Please wait until the above patch has been reviewed and committed.
> 
That's fine.
  

Patch

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 034dc28f64..cae26babc7 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -1098,6 +1098,12 @@  https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
       get_extended_indices (cpu_features);
 
       update_active (cpu_features);
+
+      /* Set the Prefer_Non_Temporal flag so that memset uses the
+	 non-temporal implementation, because ERMS is disabled on
+	 Hygon processors.  */
+      cpu_features->preferred[index_arch_Prefer_Non_Temporal]
+      |= (bit_arch_Prefer_Non_Temporal);
     }
   else
     {
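
The hunk above only sets a bit in the preferred-feature array. How glibc consumes that bit is defined by the Prefer_Non_Temporal patch this series depends on and is not shown in this thread. The following standalone sketch illustrates the general pattern of setting such a bit and testing it together with a size threshold; every name in it (sketch_cpu_features, sketch_init_hygon, sketch_use_non_temporal, the index and bit values) is a hypothetical stand-in, not glibc's internal definition.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the index and bit of a preferred flag.  */
enum { sketch_index_prefer_non_temporal = 0 };
#define SKETCH_BIT_PREFER_NON_TEMPORAL (1u << 3)

/* Hypothetical, much reduced feature structure.  */
struct sketch_cpu_features
{
  unsigned int preferred[1];
  size_t memset_non_temporal_threshold;
};

/* Mirror the shape of the hunk: OR the flag bit into its word.  */
static void
sketch_init_hygon (struct sketch_cpu_features *f)
{
  f->preferred[sketch_index_prefer_non_temporal]
    |= SKETCH_BIT_PREFER_NON_TEMPORAL;
}

/* Assumed consumer: take the non-temporal path only when the flag is
   set and the size reaches the configured threshold.  */
static int
sketch_use_non_temporal (const struct sketch_cpu_features *f, size_t n)
{
  return ((f->preferred[sketch_index_prefer_non_temporal]
           & SKETCH_BIT_PREFER_NON_TEMPORAL) != 0
          && n >= f->memset_non_temporal_threshold);
}
```

Using |= rather than plain assignment leaves any other preference bits in the same word untouched, which matches the form used in the patch.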