x86: Enable non-temporal memset without ERMS

Message ID 1720165937-11352-1-git-send-email-wangfeifei@hygon.cn
State New
Headers
Series x86: Enable non-temporal memset without ERMS |

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
redhat-pt-bot/TryBot-32bit success Build for i686
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Test passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Test passed

Commit Message

Feifei Wang July 5, 2024, 7:52 a.m. UTC
  Currently, large memset perf can be improved with non-temporal stores[1].
But it just supports the case in which ERMS feature is enabled.

For some architectures, ERMS is disabled in cpu feature as default
For example, AMD zen3-milan[2],
Or ERMS performance is worse than vectorized loop, and it is disabled.
For example, HygonGenuine arch.

Thus, to let these arch achieve perf improvement from non-temporal memset,
and enable non-temporal memset without ERMS, a 'non-temporal memset
branch jump' is added in 'more_2x_vec'case.

Test Results:
thread: 1
memset store value: 0
function: memset_avx2_unaligned

hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 byte(2x -4x vec case)     1
256 byte(4x - 8x vec case)    1
512 byte( > 8x loop case)     1
1MB                           0.994
4MB                           0.996
8MB                           0.670
16MB                          0.343
32MB                          0.355

hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 byte(2x -4x vec case)     1
256 byte(4x - 8x vec case)    0.653
512 byte( > 8x loop case)     0.713
1MB                           1
4MB                           0.887
8MB                           1.312
16MB                          0.822
32MB                          0.830

hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size                          new performance / old performance
128 byte(2x -4x vec case)     1
256 byte(4x - 8x vec case)    1
512 byte( > 8x loop case)     1
1MB                           1
4MB                           0.990
8MB                           0.737
16MB                          0.390
32MB                          0.401

For hygon arch with , no performance degradation on '2x - 8x branch case' when
extra branch jump added. And with this patch, non-temporal stores can improve
performance by 20% - 65%.

amd-zen2-rome arch
x86_memset_non_temporal_threshold = 12MB
size                          new performance / old performance
128 byte(2x -4x vec case)     0.986
256 byte(4x - 8x vec case)    1.057
512 byte( > 8x loop case)     1.005
1MB                           0.999
4MB                           0.998
8MB                           1.014
16MB                          1.156
32MB                          0.634

For amd-zen2-rome, no performance degradation on '2x - 8x branch case' when
extra branch jump added. When size >= x86_memset_non_temporal_threshold, for
'size = 16MB' case, performance degradation is due to that amd-zen2 non_temporal
perf is worse than temporal[3]. And in 32MB case, it can achieve performance
improvement by 37%.

For intel arch, intel arch ERMS feature is enabled as default, so this patch
has no effect for it.

Note:
Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing
high performance x86 processor for China server market.

Ref:
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5d94b094659094336da90f
[2] https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
[3] https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109

Jira: HES-210
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
---
 sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
  

Comments

Noah Goldstein July 8, 2024, 5:29 a.m. UTC | #1
On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
> Currently, large memset perf can be improved with non-temporal stores[1].
> But it just supports the case in which ERMS feature is enabled.
>
> For some architectures, ERMS is disabled in cpu feature as default
> For example, AMD zen3-milan[2],
> Or ERMS performance is worse than vectorized loop, and it is disabled.
> For example, HygonGenuine arch.
>
> Thus, to let these arch achieve perf improvement from non-temporal memset,
> and enable non-temporal memset without ERMS, a 'non-temporal memset
> branch jump' is added in 'more_2x_vec'case.
>
> Test Results:
> thread: 1
> memset store value: 0
> function: memset_avx2_unaligned
>
> hygon1 arch
> x86_memset_non_temporal_threshold = 8MB
> size                          new performance / old performance
> 128 byte(2x -4x vec case)     1
> 256 byte(4x - 8x vec case)    1
> 512 byte( > 8x loop case)     1
> 1MB                           0.994
> 4MB                           0.996
> 8MB                           0.670
> 16MB                          0.343
> 32MB                          0.355
>
> hygon2 arch
> x86_memset_non_temporal_threshold = 8MB
> size                          new performance / old performance
> 128 byte(2x -4x vec case)     1
> 256 byte(4x - 8x vec case)    0.653
> 512 byte( > 8x loop case)     0.713
> 1MB                           1
> 4MB                           0.887
> 8MB                           1.312
> 16MB                          0.822
> 32MB                          0.830
>
> hygon3 arch
> x86_memset_non_temporal_threshold = 8MB
> size                          new performance / old performance
> 128 byte(2x -4x vec case)     1
> 256 byte(4x - 8x vec case)    1
> 512 byte( > 8x loop case)     1
> 1MB                           1
> 4MB                           0.990
> 8MB                           0.737
> 16MB                          0.390
> 32MB                          0.401
>
> For hygon arch with , no performance degradation on '2x - 8x branch case' when
> extra branch jump added. And with this patch, non-temporal stores can improve
> performance by 20% - 65%.
>
> amd-zen2-rome arch
> x86_memset_non_temporal_threshold = 12MB
> size                          new performance / old performance
> 128 byte(2x -4x vec case)     0.986
> 256 byte(4x - 8x vec case)    1.057
> 512 byte( > 8x loop case)     1.005
> 1MB                           0.999
> 4MB                           0.998
> 8MB                           1.014
> 16MB                          1.156
> 32MB                          0.634
>
> For amd-zen2-rome, no performance degradation on '2x - 8x branch case' when
> extra branch jump added. When size >= x86_memset_non_temporal_threshold, for
> 'size = 16MB' case, performance degradation is due to that amd-zen2 non_temporal
> perf is worse than temporal[3]. And in 32MB case, it can achieve performance
> improvement by 37%.
>
> For intel arch, intel arch ERMS feature is enabled as default, so this patch
> has no effect for it.
>
> Note:
> Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing
> high performance x86 processor for China server market.
>
> Ref:
> [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5d94b094659094336da90f
> [2] https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
> [3] https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
>
> Jira: HES-210
> Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> Reviewed-by: Jing Li <lijing@hygon.cn>
> Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> ---
>  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 88bf08e..063dd51 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
>         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
>         ja      L(stosb_local)
>  #endif
> -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> +       /* If rdx is less than __x86_memset_non_temporal_threshold,
> +          fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
>            and (4x, 8x] jump to target.  */
>  L(more_2x_vec):
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +       /* Check non-temporal store threshold.
> +          This is for the hardware which can not support ERMS feature or
> +          ERMS fearure is disabled.  */
> +       cmp     __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> +       jae     L(nt_memset)
> +#endif
>         /* Store next 2x vec regardless.  */
>         VMOVU   %VMM(0), (%rdi)
>         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> --
> 2.7.4
>

I am opposed to this patch getting in as is. I strongly dislike the
essentially redundant checks for `__x86_rep_stosb_threshold` and
`__x86_memset_non_temporal_threshold` that will occur on most
processors. Further, in this case, it messes up the alignment of the
loop.

What I would like to do instead for processors which prefer
non-temporal but not erms, is to have them set
`__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold` and
prefer the `erms` version.

Can you implement it like that instead?
  
Feifei Wang July 9, 2024, 2:57 a.m. UTC | #2
Thanks for your kindly reply.

> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月8日 13:29
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> 
> On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> > Currently, large memset perf can be improved with non-temporal stores[1].
> > But it just supports the case in which ERMS feature is enabled.
> >
> > For some architectures, ERMS is disabled in cpu feature as default For
> > example, AMD zen3-milan[2], Or ERMS performance is worse than
> > vectorized loop, and it is disabled.
> > For example, HygonGenuine arch.
> >
> > Thus, to let these arch achieve perf improvement from non-temporal
> > memset, and enable non-temporal memset without ERMS, a 'non-temporal
> > memset branch jump' is added in 'more_2x_vec'case.
> >
> > Test Results:
> > thread: 1
> > memset store value: 0
> > function: memset_avx2_unaligned
> >
> > hygon1 arch
> > x86_memset_non_temporal_threshold = 8MB
> > size                          new performance / old performance
> > 128 byte(2x -4x vec case)     1
> > 256 byte(4x - 8x vec case)    1
> > 512 byte( > 8x loop case)     1
> > 1MB                           0.994
> > 4MB                           0.996
> > 8MB                           0.670
> > 16MB                          0.343
> > 32MB                          0.355
> >
> > hygon2 arch
> > x86_memset_non_temporal_threshold = 8MB
> > size                          new performance / old performance
> > 128 byte(2x -4x vec case)     1
> > 256 byte(4x - 8x vec case)    0.653
> > 512 byte( > 8x loop case)     0.713
> > 1MB                           1
> > 4MB                           0.887
> > 8MB                           1.312
> > 16MB                          0.822
> > 32MB                          0.830
> >
> > hygon3 arch
> > x86_memset_non_temporal_threshold = 8MB
> > size                          new performance / old performance
> > 128 byte(2x -4x vec case)     1
> > 256 byte(4x - 8x vec case)    1
> > 512 byte( > 8x loop case)     1
> > 1MB                           1
> > 4MB                           0.990
> > 8MB                           0.737
> > 16MB                          0.390
> > 32MB                          0.401
> >
> > For hygon arch with , no performance degradation on '2x - 8x branch
> > case' when extra branch jump added. And with this patch, non-temporal
> > stores can improve performance by 20% - 65%.
> >
> > amd-zen2-rome arch
> > x86_memset_non_temporal_threshold = 12MB
> > size                          new performance / old performance
> > 128 byte(2x -4x vec case)     0.986
> > 256 byte(4x - 8x vec case)    1.057
> > 512 byte( > 8x loop case)     1.005
> > 1MB                           0.999
> > 4MB                           0.998
> > 8MB                           1.014
> > 16MB                          1.156
> > 32MB                          0.634
> >
> > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > when extra branch jump added. When size >=
> > x86_memset_non_temporal_threshold, for 'size = 16MB' case, performance
> > degradation is due to that amd-zen2 non_temporal perf is worse than
> > temporal[3]. And in 32MB case, it can achieve performance improvement by
> 37%.
> >
> > For intel arch, intel arch ERMS feature is enabled as default, so this
> > patch has no effect for it.
> >
> > Note:
> > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > performance x86 processor for China server market.
> >
> > Ref:
> > [1]
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5
> > d94b094659094336da90f [2]
> > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
> > [3]
> >
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4
> > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> >
> > Jira: HES-210
> > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > Reviewed-by: Jing Li <lijing@hygon.cn>
> > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > ---
> >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index 88bf08e..063dd51 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> >         ja      L(stosb_local)
> >  #endif
> > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > +       /* If rdx is less than __x86_memset_non_temporal_threshold,
> > +          fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> > + 4x]
> >            and (4x, 8x] jump to target.  */
> >  L(more_2x_vec):
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> > +       /* Check non-temporal store threshold.
> > +          This is for the hardware which can not support ERMS feature or
> > +          ERMS fearure is disabled.  */
> > +       cmp
> __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > +       jae     L(nt_memset)
> > +#endif
> >         /* Store next 2x vec regardless.  */
> >         VMOVU   %VMM(0), (%rdi)
> >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > --
> > 2.7.4
> >
> 
> I am opposed to this patch getting in as is. I strongly dislike the essentially
> redundant checks for `__x86_rep_stosb_threshold` and
> `__x86_memset_non_temporal_threshold` that will occur on most processors.
> Further, in this case, it messes up the alignment of the loop.
> 
> What I would like to do instead for processors which prefer non-temporal but
> not erms, is to have them set `__x86_rep_stosb_threshold ==
> __x86_memset_non_temporal_threshold` and prefer the `erms` version.
> 
> Can you implement it like that instead?
[Feifei] This is also a good choice. But I have a question for this:
If my understand is right, you means that If __x86_rep_stosb_threshold ==
__x86_memset_non_temporal_threshold, we can set "Prefer_ERMS" when
init_cpu_features for servers without ERMS.

However, if like this, I think we ignore the small size case. Though we can use non temporal now,
But for "size < __x86_rep_stosb_threshold" case, it is forced to use erms whose performance is very bad
For example, hygon and amd zen2.

Thus we add a nt jump for temporal branch, and for this, we can support that :
1. small size can use temporal for good performance rather than erms.
2.big size can use non temporal to achieve better performance.


Best Regards
Feifei
  
Noah Goldstein July 9, 2024, 4:19 a.m. UTC | #3
On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
> Thanks for your kindly reply.
>
> > -----邮件原件-----
> > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > 发送时间: 2024年7月8日 13:29
> > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> >
> > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > >
> > > Currently, large memset perf can be improved with non-temporal stores[1].
> > > But it just supports the case in which ERMS feature is enabled.
> > >
> > > For some architectures, ERMS is disabled in cpu feature as default For
> > > example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > vectorized loop, and it is disabled.
> > > For example, HygonGenuine arch.
> > >
> > > Thus, to let these arch achieve perf improvement from non-temporal
> > > memset, and enable non-temporal memset without ERMS, a 'non-temporal
> > > memset branch jump' is added in 'more_2x_vec'case.
> > >
> > > Test Results:
> > > thread: 1
> > > memset store value: 0
> > > function: memset_avx2_unaligned
> > >
> > > hygon1 arch
> > > x86_memset_non_temporal_threshold = 8MB
> > > size                          new performance / old performance
> > > 128 byte(2x -4x vec case)     1
> > > 256 byte(4x - 8x vec case)    1
> > > 512 byte( > 8x loop case)     1
> > > 1MB                           0.994
> > > 4MB                           0.996
> > > 8MB                           0.670
> > > 16MB                          0.343
> > > 32MB                          0.355
> > >
> > > hygon2 arch
> > > x86_memset_non_temporal_threshold = 8MB
> > > size                          new performance / old performance
> > > 128 byte(2x -4x vec case)     1
> > > 256 byte(4x - 8x vec case)    0.653
> > > 512 byte( > 8x loop case)     0.713
> > > 1MB                           1
> > > 4MB                           0.887
> > > 8MB                           1.312
> > > 16MB                          0.822
> > > 32MB                          0.830
> > >
> > > hygon3 arch
> > > x86_memset_non_temporal_threshold = 8MB
> > > size                          new performance / old performance
> > > 128 byte(2x -4x vec case)     1
> > > 256 byte(4x - 8x vec case)    1
> > > 512 byte( > 8x loop case)     1
> > > 1MB                           1
> > > 4MB                           0.990
> > > 8MB                           0.737
> > > 16MB                          0.390
> > > 32MB                          0.401
> > >
> > > For hygon arch with , no performance degradation on '2x - 8x branch
> > > case' when extra branch jump added. And with this patch, non-temporal
> > > stores can improve performance by 20% - 65%.
> > >
> > > amd-zen2-rome arch
> > > x86_memset_non_temporal_threshold = 12MB
> > > size                          new performance / old performance
> > > 128 byte(2x -4x vec case)     0.986
> > > 256 byte(4x - 8x vec case)    1.057
> > > 512 byte( > 8x loop case)     1.005
> > > 1MB                           0.999
> > > 4MB                           0.998
> > > 8MB                           1.014
> > > 16MB                          1.156
> > > 32MB                          0.634
> > >
> > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > when extra branch jump added. When size >=
> > > x86_memset_non_temporal_threshold, for 'size = 16MB' case, performance
> > > degradation is due to that amd-zen2 non_temporal perf is worse than
> > > temporal[3]. And in 32MB case, it can achieve performance improvement by
> > 37%.
> > >
> > > For intel arch, intel arch ERMS feature is enabled as default, so this
> > > patch has no effect for it.
> > >
> > > Note:
> > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > performance x86 processor for China server market.
> > >
> > > Ref:
> > > [1]
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5
> > > d94b094659094336da90f [2]
> > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
> > > [3]
> > >
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4
> > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > >
> > > Jira: HES-210
> > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > ---
> > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 88bf08e..063dd51 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > >         ja      L(stosb_local)
> > >  #endif
> > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > +       /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > +          fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> > > + 4x]
> > >            and (4x, 8x] jump to target.  */
> > >  L(more_2x_vec):
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > +       /* Check non-temporal store threshold.
> > > +          This is for the hardware which can not support ERMS feature or
> > > +          ERMS fearure is disabled.  */
> > > +       cmp
> > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > +       jae     L(nt_memset)
> > > +#endif
> > >         /* Store next 2x vec regardless.  */
> > >         VMOVU   %VMM(0), (%rdi)
> > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > --
> > > 2.7.4
> > >
> >
> > I am opposed to this patch getting in as is. I strongly dislike the essentially
> > redundant checks for `__x86_rep_stosb_threshold` and
> > `__x86_memset_non_temporal_threshold` that will occur on most processors.
> > Further, in this case, it messes up the alignment of the loop.
> >
> > What I would like to do instead for processors which prefer non-temporal but
> > not erms, is to have them set `__x86_rep_stosb_threshold ==
> > __x86_memset_non_temporal_threshold` and prefer the `erms` version.
> >
> > Can you implement it like that instead?
> [Feifei] This is also a good choice. But I have a question for this:
> If my understand is right, you means that If __x86_rep_stosb_threshold ==
> __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS" when
> init_cpu_features for servers without ERMS.
>
Yes, if you don't want memcpy to use `erms`, set
`__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.

> However, if like this, I think we ignore the small size case. Though we can use non temporal now,
> But for "size < __x86_rep_stosb_threshold" case, it is forced to use erms whose performance is very bad
> For example, hygon and amd zen2.

Hmm, I'm not sure I understand what you mean.

What I am arguing for is that a processor that wants non-temporal but
doesn't want erms should do the following:

memset:
`__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
YOUR_PREFERED_NT_THRESHOLD`

This will have the exact same behavior as your patch. Below NT threshold it
will use the normal temporal loop.

Above NT threshold it will use nt-stores. It will never use `rep stosb`.
>
> Thus we add a nt jump for temporal branch, and for this, we can support that :
> 1. small size can use temporal for good performance rather than erms.
> 2.big size can use non temporal to achieve better performance.
>
>
> Best Regards
> Feifei
>
  
Feifei Wang July 9, 2024, 10:19 a.m. UTC | #4
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月9日 12:19
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> 
> On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> > Thanks for your kindly reply.
> >
> > > -----邮件原件-----
> > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > 发送时间: 2024年7月8日 13:29
> > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > >
> > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > >
> > > > Currently, large memset perf can be improved with non-temporal
> stores[1].
> > > > But it just supports the case in which ERMS feature is enabled.
> > > >
> > > > For some architectures, ERMS is disabled in cpu feature as default
> > > > For example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > > vectorized loop, and it is disabled.
> > > > For example, HygonGenuine arch.
> > > >
> > > > Thus, to let these arch achieve perf improvement from non-temporal
> > > > memset, and enable non-temporal memset without ERMS, a
> > > > 'non-temporal memset branch jump' is added in 'more_2x_vec'case.
> > > >
> > > > Test Results:
> > > > thread: 1
> > > > memset store value: 0
> > > > function: memset_avx2_unaligned
> > > >
> > > > hygon1 arch
> > > > x86_memset_non_temporal_threshold = 8MB
> > > > size                          new performance / old performance
> > > > 128 byte(2x -4x vec case)     1
> > > > 256 byte(4x - 8x vec case)    1
> > > > 512 byte( > 8x loop case)     1
> > > > 1MB                           0.994
> > > > 4MB                           0.996
> > > > 8MB                           0.670
> > > > 16MB                          0.343
> > > > 32MB                          0.355
> > > >
> > > > hygon2 arch
> > > > x86_memset_non_temporal_threshold = 8MB
> > > > size                          new performance / old performance
> > > > 128 byte(2x -4x vec case)     1
> > > > 256 byte(4x - 8x vec case)    0.653
> > > > 512 byte( > 8x loop case)     0.713
> > > > 1MB                           1
> > > > 4MB                           0.887
> > > > 8MB                           1.312
> > > > 16MB                          0.822
> > > > 32MB                          0.830
> > > >
> > > > hygon3 arch
> > > > x86_memset_non_temporal_threshold = 8MB
> > > > size                          new performance / old performance
> > > > 128 byte(2x -4x vec case)     1
> > > > 256 byte(4x - 8x vec case)    1
> > > > 512 byte( > 8x loop case)     1
> > > > 1MB                           1
> > > > 4MB                           0.990
> > > > 8MB                           0.737
> > > > 16MB                          0.390
> > > > 32MB                          0.401
> > > >
> > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > branch case' when extra branch jump added. And with this patch,
> > > > non-temporal stores can improve performance by 20% - 65%.
> > > >
> > > > amd-zen2-rome arch
> > > > x86_memset_non_temporal_threshold = 12MB
> > > > size                          new performance / old performance
> > > > 128 byte(2x -4x vec case)     0.986
> > > > 256 byte(4x - 8x vec case)    1.057
> > > > 512 byte( > 8x loop case)     1.005
> > > > 1MB                           0.999
> > > > 4MB                           0.998
> > > > 8MB                           1.014
> > > > 16MB                          1.156
> > > > 32MB                          0.634
> > > >
> > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > > when extra branch jump added. When size >=
> > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > performance degradation is due to that amd-zen2 non_temporal perf
> > > > is worse than temporal[3]. And in 32MB case, it can achieve
> > > > performance improvement by
> > > 37%.
> > > >
> > > > For intel arch, intel arch ERMS feature is enabled as default, so
> > > > this patch has no effect for it.
> > > >
> > > > Note:
> > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > > performance x86 processor for China server market.
> > > >
> > > > Ref:
> > > > [1]
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e
> > > > 4ae5
> > > > d94b094659094336da90f [2]
> > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.htm
> > > > l
> > > > [3]
> > > >
> > >
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEi
> > > l4
> > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > >
> > > > Jira: HES-210
> > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > ---
> > > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > index 88bf08e..063dd51 100644
> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > >         ja      L(stosb_local)
> > > >  #endif
> > > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > > +       /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > > +          fallthrough goes to L(loop_4x_vec). Tests for memset
> > > > + (2x, 4x]
> > > >            and (4x, 8x] jump to target.  */
> > > >  L(more_2x_vec):
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > +       /* Check non-temporal store threshold.
> > > > +          This is for the hardware which can not support ERMS feature
> or
> > > > +          ERMS fearure is disabled.  */
> > > > +       cmp
> > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > +       jae     L(nt_memset)
> > > > +#endif
> > > >         /* Store next 2x vec regardless.  */
> > > >         VMOVU   %VMM(0), (%rdi)
> > > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > --
> > > > 2.7.4
> > > >
> > >
> > > I am opposed to this patch getting in as is. I strongly dislike the
> > > essentially redundant checks for `__x86_rep_stosb_threshold` and
> > > `__x86_memset_non_temporal_threshold` that will occur on most
> processors.
> > > Further, in this case, it messes up the alignment of the loop.
> > >
> > > What I would like to do instead for processors which prefer
> > > non-temporal but not erms, is to have them set
> > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold`
> and prefer the `erms` version.
> > >
> > > Can you implement it like that instead?
> > [Feifei] This is also a good choice. But I have a question for this:
> > If my understand is right, you means that If __x86_rep_stosb_threshold
> > == __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> when
> > init_cpu_features for servers without ERMS.
> >
> Yes, if you don't want memcpy to use `erms`, set
> `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> 
> > However, if like this, I think we ignore the small size case. Though
> > we can use non temporal now, But for "size <
> > __x86_rep_stosb_threshold" case, it is forced to use erms whose performance
> is very bad For example, hygon and amd zen2.
> 
> Hmm, I'm not sure I understand what you mean.
> 
> What I am arguing for is that a processor that wants non-temporal but doesn't
> want erms should do the following:
> 
> memset:
> `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
> YOUR_PREFERED_NT_THRESHOLD`
> 
> This will have the exact same behavior as your patch. Below NT threshold it will
> use the normal temporal loop.

[Feifei]Ok, maybe I understand your meanings.
Consider we have a server support avx2, vec_size = 32, erms cpu flag is disabled.
1. set __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold == 1MB
2. set 'Prefer ERMS' to force server to select erms function in 'IFUNC_SELECTOR'
3. we need to use memset 0 for 1KB memory(size = 1KB).

Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms" function.
Second, in the memset_erms function, check size > 2 * vec_size, and then jump to 'stosb_more_2x_vec'
Third, check size < _x86_rep_stosb_threshold, then go to normal branch. Here, if size = 2MB,
we can jump to 'stosb_local', then check NT_threshold, we jump to NT loop.
Is my right?

By the way, I just did a test that If we setting 'Prefer_ERMS', cpu will call 'memset_erms' in memset_erms.S,
But we need call ' memset_avx2_unaligned_erms '. 
Thus, maybe we need to do a change for IFUNC_SELECTOR:
Line53: +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
       +	if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
       +	return OPTIMIZE (avx2_unaligned_erms);
       + }
       + else 
			if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
         	return OPTIMIZE (erms);

Or we add a new flag 'Prefer_VEC_ERMs'?

> 
> Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> >
> > Thus we add a nt jump for temporal branch, and for this, we can support that :
> > 1. small size can use temporal for good performance rather than erms.
> > 2.big size can use non temporal to achieve better performance.
> >
> >
> > Best Regards
> > Feifei
> >
  
Noah Goldstein July 9, 2024, 10:52 a.m. UTC | #5
On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
>
>
> > -----邮件原件-----
> > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > 发送时间: 2024年7月9日 12:19
> > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> >
> > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > >
> > > Thanks for your kindly reply.
> > >
> > > > -----邮件原件-----
> > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > 发送时间: 2024年7月8日 13:29
> > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > >
> > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > > >
> > > > > Currently, large memset perf can be improved with non-temporal
> > stores[1].
> > > > > But it just supports the case in which ERMS feature is enabled.
> > > > >
> > > > > For some architectures, ERMS is disabled in cpu feature as default
> > > > > For example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > > > vectorized loop, and it is disabled.
> > > > > For example, HygonGenuine arch.
> > > > >
> > > > > Thus, to let these arch achieve perf improvement from non-temporal
> > > > > memset, and enable non-temporal memset without ERMS, a
> > > > > 'non-temporal memset branch jump' is added in 'more_2x_vec'case.
> > > > >
> > > > > Test Results:
> > > > > thread: 1
> > > > > memset store value: 0
> > > > > function: memset_avx2_unaligned
> > > > >
> > > > > hygon1 arch
> > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > size                          new performance / old performance
> > > > > 128 byte(2x -4x vec case)     1
> > > > > 256 byte(4x - 8x vec case)    1
> > > > > 512 byte( > 8x loop case)     1
> > > > > 1MB                           0.994
> > > > > 4MB                           0.996
> > > > > 8MB                           0.670
> > > > > 16MB                          0.343
> > > > > 32MB                          0.355
> > > > >
> > > > > hygon2 arch
> > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > size                          new performance / old performance
> > > > > 128 byte(2x -4x vec case)     1
> > > > > 256 byte(4x - 8x vec case)    0.653
> > > > > 512 byte( > 8x loop case)     0.713
> > > > > 1MB                           1
> > > > > 4MB                           0.887
> > > > > 8MB                           1.312
> > > > > 16MB                          0.822
> > > > > 32MB                          0.830
> > > > >
> > > > > hygon3 arch
> > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > size                          new performance / old performance
> > > > > 128 byte(2x -4x vec case)     1
> > > > > 256 byte(4x - 8x vec case)    1
> > > > > 512 byte( > 8x loop case)     1
> > > > > 1MB                           1
> > > > > 4MB                           0.990
> > > > > 8MB                           0.737
> > > > > 16MB                          0.390
> > > > > 32MB                          0.401
> > > > >
> > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > branch case' when extra branch jump added. And with this patch,
> > > > > non-temporal stores can improve performance by 20% - 65%.
> > > > >
> > > > > amd-zen2-rome arch
> > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > size                          new performance / old performance
> > > > > 128 byte(2x -4x vec case)     0.986
> > > > > 256 byte(4x - 8x vec case)    1.057
> > > > > 512 byte( > 8x loop case)     1.005
> > > > > 1MB                           0.999
> > > > > 4MB                           0.998
> > > > > 8MB                           1.014
> > > > > 16MB                          1.156
> > > > > 32MB                          0.634
> > > > >
> > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > > > when extra branch jump added. When size >=
> > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > performance degradation is due to that amd-zen2 non_temporal perf
> > > > > is worse than temporal[3]. And in 32MB case, it can achieve
> > > > > performance improvement by
> > > > 37%.
> > > > >
> > > > > For intel arch, intel arch ERMS feature is enabled as default, so
> > > > > this patch has no effect for it.
> > > > >
> > > > > Note:
> > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > > > performance x86 processor for China server market.
> > > > >
> > > > > Ref:
> > > > > [1]
> > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e
> > > > > 4ae5
> > > > > d94b094659094336da90f [2]
> > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.htm
> > > > > l
> > > > > [3]
> > > > >
> > > >
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEi
> > > > l4
> > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > >
> > > > > Jira: HES-210
> > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > ---
> > > > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > +++++++++-
> > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > index 88bf08e..063dd51 100644
> > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > >         ja      L(stosb_local)
> > > > >  #endif
> > > > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > > > +       /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > > > +          fallthrough goes to L(loop_4x_vec). Tests for memset
> > > > > + (2x, 4x]
> > > > >            and (4x, 8x] jump to target.  */
> > > > >  L(more_2x_vec):
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +       /* Check non-temporal store threshold.
> > > > > +          This is for the hardware which can not support ERMS feature
> > or
> > > > > +          ERMS fearure is disabled.  */
> > > > > +       cmp
> > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > +       jae     L(nt_memset)
> > > > > +#endif
> > > > >         /* Store next 2x vec regardless.  */
> > > > >         VMOVU   %VMM(0), (%rdi)
> > > > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > > I am opposed to this patch getting in as is. I strongly dislike the
> > > > essentially redundant checks for `__x86_rep_stosb_threshold` and
> > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > processors.
> > > > Further, in this case, it messes up the alignment of the loop.
> > > >
> > > > What I would like to do instead for processors which prefer
> > > > non-temporal but not erms, is to have them set
> > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold`
> > and prefer the `erms` version.
> > > >
> > > > Can you implement it like that instead?
> > > [Feifei] This is also a good choice. But I have a question for this:
> > > If my understand is right, you means that If __x86_rep_stosb_threshold
> > > == __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > when
> > > init_cpu_features for servers without ERMS.
> > >
> > Yes, if you don't want memcpy to use `erms`, set
> > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> >
> > > However, if like this, I think we ignore the small size case. Though
> > > we can use non temporal now, But for "size <
> > > __x86_rep_stosb_threshold" case, it is forced to use erms whose performance
> > is very bad For example, hygon and amd zen2.
> >
> > Hmm, I'm not sure I understand what you mean.
> >
> > What I am arguing for is that a processor that wants non-temporal but doesn't
> > want erms should do the following:
> >
> > memset:
> > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
> > YOUR_PREFERED_NT_THRESHOLD`
> >
> > This will have the exact same behavior as your patch. Below NT threshold it will
> > use the normal temporal loop.
>
> [Feifei]Ok, maybe I understand your meanings.
> Consider we have a server support avx2, vec_size = 32, erms cpu flag is disabled.
> 1. set __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold == 1MB
> 2. set 'Prefer ERMS' to force server to select erms function in 'IFUNC_SELECTOR'
> 3. we need to use memset 0 for 1KB memory(size = 1KB).
>
> Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms" function.
> Second, in the memset_erms function, check size > 2 * vec_size, and then jump to 'stosb_more_2x_vec'
> Third, check size < _x86_rep_stosb_threshold, then go to normal branch. Here, if size = 2MB,
> we can jump to 'stosb_local', then check NT_threshold, we jump to NT loop.
> Is my right?

Yes.

>
> By the way, I just did a test that If we setting 'Prefer_ERMS', cpu will call 'memset_erms' in memset_erms.S,
> But we need call ' memset_avx2_unaligned_erms '.
> Thus, maybe we need to do a change for IFUNC_SELECTOR:
> Line53: +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
>        +        if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
>        +        return OPTIMIZE (avx2_unaligned_erms);
>        + }
>        + else
>                         if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
>                 return OPTIMIZE (erms);
>
> Or we add a new flag 'Prefer_VEC_ERMS'?
>

Yeah, so I need to do some plumbing here.
`Prefer_ERMS` controls the pure `rep stosb` impl for memset.
For deciding between memset_<isa>_unaligned... we use:
`CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.

Think we need to add a new flag along the lines of:
`CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)`
and make the memset/memmove_<isa>_unaligned_erms use that flag.

Then in dl-cacheinfo.h if we don't have ERMS, we setup
stosb/movsb thresholds s.t stosb/movsb code will just forward to
non-temporal codes.
> >
> > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > >
> > > Thus we add a nt jump for temporal branch, and for this, we can support that :
> > > 1. small size can use temporal for good performance rather than erms.
> > > 2.big size can use non temporal to achieve better performance.
> > >
> > >
> > > Best Regards
> > > Feifei
> > >
  
Noah Goldstein July 9, 2024, 10:54 a.m. UTC | #6
On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> >
> >
> > > -----邮件原件-----
> > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > 发送时间: 2024年7月9日 12:19
> > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > >
> > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > >
> > > > Thanks for your kindly reply.
> > > >
> > > > > -----邮件原件-----
> > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > 发送时间: 2024年7月8日 13:29
> > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > >
> > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > > > >
> > > > > > Currently, large memset perf can be improved with non-temporal
> > > stores[1].
> > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > >
> > > > > > For some architectures, ERMS is disabled in cpu feature as default
> > > > > > For example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > > > > vectorized loop, and it is disabled.
> > > > > > For example, HygonGenuine arch.
> > > > > >
> > > > > > Thus, to let these arch achieve perf improvement from non-temporal
> > > > > > memset, and enable non-temporal memset without ERMS, a
> > > > > > 'non-temporal memset branch jump' is added in 'more_2x_vec'case.
> > > > > >
> > > > > > Test Results:
> > > > > > thread: 1
> > > > > > memset store value: 0
> > > > > > function: memset_avx2_unaligned
> > > > > >
> > > > > > hygon1 arch
> > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > size                          new performance / old performance
> > > > > > 128 byte(2x -4x vec case)     1
> > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > 512 byte( > 8x loop case)     1
> > > > > > 1MB                           0.994
> > > > > > 4MB                           0.996
> > > > > > 8MB                           0.670
> > > > > > 16MB                          0.343
> > > > > > 32MB                          0.355
> > > > > >
> > > > > > hygon2 arch
> > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > size                          new performance / old performance
> > > > > > 128 byte(2x -4x vec case)     1
> > > > > > 256 byte(4x - 8x vec case)    0.653
> > > > > > 512 byte( > 8x loop case)     0.713
> > > > > > 1MB                           1
> > > > > > 4MB                           0.887
> > > > > > 8MB                           1.312
> > > > > > 16MB                          0.822
> > > > > > 32MB                          0.830
> > > > > >
> > > > > > hygon3 arch
> > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > size                          new performance / old performance
> > > > > > 128 byte(2x -4x vec case)     1
> > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > 512 byte( > 8x loop case)     1
> > > > > > 1MB                           1
> > > > > > 4MB                           0.990
> > > > > > 8MB                           0.737
> > > > > > 16MB                          0.390
> > > > > > 32MB                          0.401
> > > > > >
> > > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > > branch case' when extra branch jump added. And with this patch,
> > > > > > non-temporal stores can improve performance by 20% - 65%.
> > > > > >
> > > > > > amd-zen2-rome arch
> > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > size                          new performance / old performance
> > > > > > 128 byte(2x -4x vec case)     0.986
> > > > > > 256 byte(4x - 8x vec case)    1.057
> > > > > > 512 byte( > 8x loop case)     1.005
> > > > > > 1MB                           0.999
> > > > > > 4MB                           0.998
> > > > > > 8MB                           1.014
> > > > > > 16MB                          1.156
> > > > > > 32MB                          0.634
> > > > > >
> > > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > > > > when extra branch jump added. When size >=
> > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > > performance degradation is due to that amd-zen2 non_temporal perf
> > > > > > is worse than temporal[3]. And in 32MB case, it can achieve
> > > > > > performance improvement by
> > > > > 37%.
> > > > > >
> > > > > > For intel arch, intel arch ERMS feature is enabled as default, so
> > > > > > this patch has no effect for it.
> > > > > >
> > > > > > Note:
> > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > > > > performance x86 processor for China server market.
> > > > > >
> > > > > > Ref:
> > > > > > [1]
> > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e
> > > > > > 4ae5
> > > > > > d94b094659094336da90f [2]
> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.htm
> > > > > > l
> > > > > > [3]
> > > > > >
> > > > >
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEi
> > > > > l4
> > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > >
> > > > > > Jira: HES-210
> > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > ---
> > > > > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > > +++++++++-
> > > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > index 88bf08e..063dd51 100644
> > > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > >         ja      L(stosb_local)
> > > > > >  #endif
> > > > > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > > > > +       /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > > > > +          fallthrough goes to L(loop_4x_vec). Tests for memset
> > > > > > + (2x, 4x]
> > > > > >            and (4x, 8x] jump to target.  */
> > > > > >  L(more_2x_vec):
> > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > +       /* Check non-temporal store threshold.
> > > > > > +          This is for the hardware which can not support ERMS feature
> > > or
> > > > > > +          ERMS fearure is disabled.  */
> > > > > > +       cmp
> > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > +       jae     L(nt_memset)
> > > > > > +#endif
> > > > > >         /* Store next 2x vec regardless.  */
> > > > > >         VMOVU   %VMM(0), (%rdi)
> > > > > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > --
> > > > > > 2.7.4
> > > > > >
> > > > >
> > > > > I am opposed to this patch getting in as is. I strongly dislike the
> > > > > essentially redundant checks for `__x86_rep_stosb_threshold` and
> > > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > > processors.
> > > > > Further, in this case, it messes up the alignment of the loop.
> > > > >
> > > > > What I would like to do instead for processors which prefer
> > > > > non-temporal but not erms, is to have them set
> > > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold`
> > > and prefer the `erms` version.
> > > > >
> > > > > Can you implement it like that instead?
> > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > If my understand is right, you means that If __x86_rep_stosb_threshold
> > > > == __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > when
> > > > init_cpu_features for servers without ERMS.
> > > >
> > > Yes, if you don't want memcpy to use `erms`, set
> > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > >
> > > > However, if like this, I think we ignore the small size case. Though
> > > > we can use non temporal now, But for "size <
> > > > __x86_rep_stosb_threshold" case, it is forced to use erms whose performance
> > > is very bad For example, hygon and amd zen2.
> > >
> > > Hmm, I'm not sure I understand what you mean.
> > >
> > > What I am arguing for is that a processor that wants non-temporal but doesn't
> > > want erms should do the following:
> > >
> > > memset:
> > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
> > > YOUR_PREFERED_NT_THRESHOLD`
> > >
> > > This will have the exact same behavior as your patch. Below NT threshold it will
> > > use the normal temporal loop.
> >
> > [Feifei]Ok, maybe I understand your meanings.
> > Consider we have a server support avx2, vec_size = 32, erms cpu flag is disabled.
> > 1. set __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold == 1MB
> > 2. set 'Prefer ERMS' to force server to select erms function in 'IFUNC_SELECTOR'
> > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> >
> > Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms" function.
> > Second, in the memset_erms function, check size > 2 * vec_size, and then jump to 'stosb_more_2x_vec'
> > Third, check size < _x86_rep_stosb_threshold, then go to normal branch. Here, if size = 2MB,
> > we can jump to 'stosb_local', then check NT_threshold, we jump to NT loop.
> > Is my right?
>
> Yes.
>
> >
> > By the way, I just did a test that If we setting 'Prefer_ERMS', cpu will call 'memset_erms' in memset_erms.S,
> > But we need call ' memset_avx2_unaligned_erms '.
> > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > Line53: +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> >        +        if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
> >        +        return OPTIMIZE (avx2_unaligned_erms);
> >        + }
> >        + else
> >                         if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
> >                 return OPTIMIZE (erms);
> >
> > Or we add a new flag 'Prefer_VEC_ERMS'?
> >
>
> Yeah, so I need to do some plumbing here.
> `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> For deciding between memset_<isa>_unaligned... we use:
> `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
>
> Think we need to add a new flag along the lines of:
> `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)`
> and make the memset/memmove_<isa>_unaligned_erms use that flag.
>
> Then in dl-cacheinfo.h if we don't have ERMS, we setup
> stosb/movsb thresholds s.t stosb/movsb code will just forward to
> non-temporal codes.

I'm going to assume this is not going to make it into 2.40 w/ the freeze and
the fact that we will need to do some changes in cpu-features/dl-cacheinfo.

Before master re-opens (after the release) ill post patches to add support
for this usecase.
> > >
> > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > >
> > > > Thus we add a nt jump for temporal branch, and for this, we can support that :
> > > > 1. small size can use temporal for good performance rather than erms.
> > > > 2.big size can use non temporal to achieve better performance.
> > > >
> > > >
> > > > Best Regards
> > > > Feifei
> > > >
  
Feifei Wang July 10, 2024, 6:39 a.m. UTC | #7
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月9日 18:55
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> 
> On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein <goldstein.w.n@gmail.com>
> wrote:
> >
> > On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > >
> > >
> > >
> > > > -----邮件原件-----
> > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > 发送时间: 2024年7月9日 12:19
> > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > >
> > > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn>
> wrote:
> > > > >
> > > > > Thanks for your kindly reply.
> > > > >
> > > > > > -----邮件原件-----
> > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > 发送时间: 2024年7月8日 13:29
> > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > > >
> > > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn>
> wrote:
> > > > > > >
> > > > > > > Currently, large memset perf can be improved with
> > > > > > > non-temporal
> > > > stores[1].
> > > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > > >
> > > > > > > For some architectures, ERMS is disabled in cpu feature as
> > > > > > > default For example, AMD zen3-milan[2], Or ERMS performance
> > > > > > > is worse than vectorized loop, and it is disabled.
> > > > > > > For example, HygonGenuine arch.
> > > > > > >
> > > > > > > Thus, to let these arch achieve perf improvement from
> > > > > > > non-temporal memset, and enable non-temporal memset without
> > > > > > > ERMS, a 'non-temporal memset branch jump' is added in
> 'more_2x_vec'case.
> > > > > > >
> > > > > > > Test Results:
> > > > > > > thread: 1
> > > > > > > memset store value: 0
> > > > > > > function: memset_avx2_unaligned
> > > > > > >
> > > > > > > hygon1 arch
> > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > size                          new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > > 512 byte( > 8x loop case)     1
> > > > > > > 1MB                           0.994
> > > > > > > 4MB                           0.996
> > > > > > > 8MB                           0.670
> > > > > > > 16MB                          0.343
> > > > > > > 32MB                          0.355
> > > > > > >
> > > > > > > hygon2 arch
> > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > size                          new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > 256 byte(4x - 8x vec case)    0.653
> > > > > > > 512 byte( > 8x loop case)     0.713
> > > > > > > 1MB                           1
> > > > > > > 4MB                           0.887
> > > > > > > 8MB                           1.312
> > > > > > > 16MB                          0.822
> > > > > > > 32MB                          0.830
> > > > > > >
> > > > > > > hygon3 arch
> > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > size                          new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > > 512 byte( > 8x loop case)     1
> > > > > > > 1MB                           1
> > > > > > > 4MB                           0.990
> > > > > > > 8MB                           0.737
> > > > > > > 16MB                          0.390
> > > > > > > 32MB                          0.401
> > > > > > >
> > > > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > > > branch case' when extra branch jump added. And with this
> > > > > > > patch, non-temporal stores can improve performance by 20% - 65%.
> > > > > > >
> > > > > > > amd-zen2-rome arch
> > > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > > size                          new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case)     0.986
> > > > > > > 256 byte(4x - 8x vec case)    1.057
> > > > > > > 512 byte( > 8x loop case)     1.005
> > > > > > > 1MB                           0.999
> > > > > > > 4MB                           0.998
> > > > > > > 8MB                           1.014
> > > > > > > 16MB                          1.156
> > > > > > > 32MB                          0.634
> > > > > > >
> > > > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch
> case'
> > > > > > > when extra branch jump added. When size >=
> > > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > > > performance degradation is due to that amd-zen2 non_temporal
> > > > > > > perf is worse than temporal[3]. And in 32MB case, it can
> > > > > > > achieve performance improvement by
> > > > > > 37%.
> > > > > > >
> > > > > > > For intel arch, intel arch ERMS feature is enabled as
> > > > > > > default, so this patch has no effect for it.
> > > > > > >
> > > > > > > Note:
> > > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at
> > > > > > > providing high performance x86 processor for China server market.
> > > > > > >
> > > > > > > Ref:
> > > > > > > [1]
> > > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab805
> > > > > > > 73d66e
> > > > > > > 4ae5
> > > > > > > d94b094659094336da90f [2]
> > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg067
> > > > > > > 47.htm
> > > > > > > l
> > > > > > > [3]
> > > > > > >
> > > > > >
> > > >
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjA
> > > > Ei
> > > > > > l4
> > > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > > >
> > > > > > > Jira: HES-210
> > > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > > ---
> > > > > > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > > > +++++++++-
> > > > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git
> > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > index 88bf08e..063dd51 100644
> > > > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > >         ja      L(stosb_local)
> > > > > > >  #endif
> > > > > > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> 4x]
> > > > > > > +       /* If rdx is less than
> __x86_memset_non_temporal_threshold,
> > > > > > > +          fallthrough goes to L(loop_4x_vec). Tests for
> > > > > > > + memset (2x, 4x]
> > > > > > >            and (4x, 8x] jump to target.  */
> > > > > > >  L(more_2x_vec):
> > > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > > +       /* Check non-temporal store threshold.
> > > > > > > +          This is for the hardware which can not support
> > > > > > > +ERMS feature
> > > > or
> > > > > > > +          ERMS fearure is disabled.  */
> > > > > > > +       cmp
> > > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > > +       jae     L(nt_memset)
> > > > > > > +#endif
> > > > > > >         /* Store next 2x vec regardless.  */
> > > > > > >         VMOVU   %VMM(0), (%rdi)
> > > > > > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > > --
> > > > > > > 2.7.4
> > > > > > >
> > > > > >
> > > > > > I am opposed to this patch getting in as is. I strongly
> > > > > > dislike the essentially redundant checks for
> > > > > > `__x86_rep_stosb_threshold` and
> > > > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > > > processors.
> > > > > > Further, in this case, it messes up the alignment of the loop.
> > > > > >
> > > > > > What I would like to do instead for processors which prefer
> > > > > > non-temporal but not erms, is to have them set
> > > > > > `__x86_rep_stosb_threshold ==
> > > > > > __x86_memset_non_temporal_threshold`
> > > > and prefer the `erms` version.
> > > > > >
> > > > > > Can you implement it like that instead?
> > > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > > If my understand is right, you means that If
> > > > > __x86_rep_stosb_threshold ==
> __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > > when
> > > > > init_cpu_features for servers without ERMS.
> > > > >
> > > > Yes, if you don't want memcpy to use `erms`, set
> > > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > > >
> > > > > However, if like this, I think we ignore the small size case.
> > > > > Though we can use non temporal now, But for "size <
> > > > > __x86_rep_stosb_threshold" case, it is forced to use erms whose
> > > > > performance
> > > > is very bad For example, hygon and amd zen2.
> > > >
> > > > Hmm, I'm not sure I understand what you mean.
> > > >
> > > > What I am arguing for is that a processor that wants non-temporal
> > > > but doesn't want erms should do the following:
> > > >
> > > > memset:
> > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold
> > > > == YOUR_PREFERED_NT_THRESHOLD`
> > > >
> > > > This will have the exact same behavior as your patch. Below NT
> > > > threshold it will use the normal temporal loop.
> > >
> > > [Feifei]Ok, maybe I understand your meanings.
> > > Consider we have a server support avx2, vec_size = 32, erms cpu flag is
> disabled.
> > > 1. set __x86_rep_stosb_threshold ==
> > > __x86_memset_non_temporal_threshold == 1MB 2. set 'Prefer ERMS' to
> force server to select erms function in 'IFUNC_SELECTOR'
> > > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> > >
> > > Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms"
> function.
> > > Second, in the memset_erms function, check size > 2 * vec_size, and then
> jump to 'stosb_more_2x_vec'
> > > Third, check size < _x86_rep_stosb_threshold, then go to normal
> > > branch. Here, if size = 2MB, we can jump to 'stosb_local', then check
> NT_threshold, we jump to NT loop.
> > > Is my right?
> >
> > Yes.
> >
> > >
> > > By the way, I just did a test that If we setting 'Prefer_ERMS', cpu
> > > will call 'memset_erms' in memset_erms.S, But we need call '
> memset_avx2_unaligned_erms '.
> > > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > > Line53: +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > >        +        if (CPU_FEATURES_ARCH_P (cpu_features,
> Prefer_ERMS))
> > >        +        return OPTIMIZE (avx2_unaligned_erms);
> > >        + }
> > >        + else
> > >                         if (CPU_FEATURES_ARCH_P (cpu_features,
> Prefer_ERMS))
> > >                 return OPTIMIZE (erms);
> > >
> > > Or we add a new flag 'Prefer_VEC_ERMS'?
> > >
> >
> > Yeah, so I need to do some plumbing here.
> > `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> > For deciding between memset_<isa>_unaligned... we use:
> > `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
> >
> > Think we need to add a new flag along the lines of:
> > `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)` and make the
> > memset/memmove_<isa>_unaligned_erms use that flag.
> >
> > Then in dl-cacheinfo.h if we don't have ERMS, we setup stosb/movsb
> > thresholds s.t stosb/movsb code will just forward to non-temporal
> > codes.
[Feifei] Agree. NON_TEMPORAL flag is suitable for these usecase.
> 
> I'm going to assume this is not going to make it into 2.40 w/ the freeze and the
> fact that we will need to do some changes in cpu-features/dl-cacheinfo.
> 
> Before master re-opens (after the release) ill post patches to add support for
> this usecase.

[Feifei] OK, and I just have a question: when the master re-open again?
And I will prepare for the patch of this and push it to the community as soon as possible for reviewing.
In the next version, change will be as below:
1. add a new flag called "NON_TEMPORAL"
2. add Hygon arch branch in cpu-features.c and dl-cacheinfo.h, to split Hygon from AMD branch.
3.For Hygon arch branch, we enable NON_TEMPORAL flag
4. NON_TEMPORAL flag is set, __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold in dl-cacheinfo.h

> > > >
> > > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > > >
> > > > > Thus we add a nt jump for temporal branch, and for this, we can support
> that :
> > > > > 1. small size can use temporal for good performance rather than erms.
> > > > > 2.big size can use non temporal to achieve better performance.
> > > > >
> > > > >
> > > > > Best Regards
> > > > > Feifei
> > > > >
  
Noah Goldstein July 10, 2024, 6:53 a.m. UTC | #8
On Wed, Jul 10, 2024 at 2:39 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
>
>
> > -----邮件原件-----
> > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > 发送时间: 2024年7月9日 18:55
> > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> >
> > On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein <goldstein.w.n@gmail.com>
> > wrote:
> > >
> > > On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > >
> > > >
> > > >
> > > > > -----邮件原件-----
> > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > 发送时间: 2024年7月9日 12:19
> > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > >
> > > > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn>
> > wrote:
> > > > > >
> > > > > > Thanks for your kindly reply.
> > > > > >
> > > > > > > -----邮件原件-----
> > > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > > 发送时间: 2024年7月8日 13:29
> > > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > > > >
> > > > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn>
> > wrote:
> > > > > > > >
> > > > > > > > Currently, large memset perf can be improved with
> > > > > > > > non-temporal
> > > > > stores[1].
> > > > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > > > >
> > > > > > > > For some architectures, ERMS is disabled in cpu feature as
> > > > > > > > default For example, AMD zen3-milan[2], Or ERMS performance
> > > > > > > > is worse than vectorized loop, and it is disabled.
> > > > > > > > For example, HygonGenuine arch.
> > > > > > > >
> > > > > > > > Thus, to let these arch achieve perf improvement from
> > > > > > > > non-temporal memset, and enable non-temporal memset without
> > > > > > > > ERMS, a 'non-temporal memset branch jump' is added in
> > 'more_2x_vec'case.
> > > > > > > >
> > > > > > > > Test Results:
> > > > > > > > thread: 1
> > > > > > > > memset store value: 0
> > > > > > > > function: memset_avx2_unaligned
> > > > > > > >
> > > > > > > > hygon1 arch
> > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > size                          new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > > > 512 byte( > 8x loop case)     1
> > > > > > > > 1MB                           0.994
> > > > > > > > 4MB                           0.996
> > > > > > > > 8MB                           0.670
> > > > > > > > 16MB                          0.343
> > > > > > > > 32MB                          0.355
> > > > > > > >
> > > > > > > > hygon2 arch
> > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > size                          new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > > 256 byte(4x - 8x vec case)    0.653
> > > > > > > > 512 byte( > 8x loop case)     0.713
> > > > > > > > 1MB                           1
> > > > > > > > 4MB                           0.887
> > > > > > > > 8MB                           1.312
> > > > > > > > 16MB                          0.822
> > > > > > > > 32MB                          0.830
> > > > > > > >
> > > > > > > > hygon3 arch
> > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > size                          new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > > > 512 byte( > 8x loop case)     1
> > > > > > > > 1MB                           1
> > > > > > > > 4MB                           0.990
> > > > > > > > 8MB                           0.737
> > > > > > > > 16MB                          0.390
> > > > > > > > 32MB                          0.401
> > > > > > > >
> > > > > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > > > > branch case' when extra branch jump added. And with this
> > > > > > > > patch, non-temporal stores can improve performance by 20% - 65%.
> > > > > > > >
> > > > > > > > amd-zen2-rome arch
> > > > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > > > size                          new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case)     0.986
> > > > > > > > 256 byte(4x - 8x vec case)    1.057
> > > > > > > > 512 byte( > 8x loop case)     1.005
> > > > > > > > 1MB                           0.999
> > > > > > > > 4MB                           0.998
> > > > > > > > 8MB                           1.014
> > > > > > > > 16MB                          1.156
> > > > > > > > 32MB                          0.634
> > > > > > > >
> > > > > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch
> > case'
> > > > > > > > when extra branch jump added. When size >=
> > > > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > > > > performance degradation is due to that amd-zen2 non_temporal
> > > > > > > > perf is worse than temporal[3]. And in 32MB case, it can
> > > > > > > > achieve performance improvement by
> > > > > > > 37%.
> > > > > > > >
> > > > > > > > For intel arch, intel arch ERMS feature is enabled as
> > > > > > > > default, so this patch has no effect for it.
> > > > > > > >
> > > > > > > > Note:
> > > > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at
> > > > > > > > providing high performance x86 processor for China server market.
> > > > > > > >
> > > > > > > > Ref:
> > > > > > > > [1]
> > > > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab805
> > > > > > > > 73d66e
> > > > > > > > 4ae5
> > > > > > > > d94b094659094336da90f [2]
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg067
> > > > > > > > 47.htm
> > > > > > > > l
> > > > > > > > [3]
> > > > > > > >
> > > > > > >
> > > > >
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjA
> > > > > Ei
> > > > > > > l4
> > > > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > > > >
> > > > > > > > Jira: HES-210
> > > > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > > > ---
> > > > > > > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > > > > +++++++++-
> > > > > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git
> > > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > index 88bf08e..063dd51 100644
> > > > > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > > >         ja      L(stosb_local)
> > > > > > > >  #endif
> > > > > > > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> > 4x]
> > > > > > > > +       /* If rdx is less than
> > __x86_memset_non_temporal_threshold,
> > > > > > > > +          fallthrough goes to L(loop_4x_vec). Tests for
> > > > > > > > + memset (2x, 4x]
> > > > > > > >            and (4x, 8x] jump to target.  */
> > > > > > > >  L(more_2x_vec):
> > > > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > > > +       /* Check non-temporal store threshold.
> > > > > > > > +          This is for the hardware which can not support
> > > > > > > > +ERMS feature
> > > > > or
> > > > > > > > +          ERMS fearure is disabled.  */
> > > > > > > > +       cmp
> > > > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > > > +       jae     L(nt_memset)
> > > > > > > > +#endif
> > > > > > > >         /* Store next 2x vec regardless.  */
> > > > > > > >         VMOVU   %VMM(0), (%rdi)
> > > > > > > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > > > --
> > > > > > > > 2.7.4
> > > > > > > >
> > > > > > >
> > > > > > > I am opposed to this patch getting in as is. I strongly
> > > > > > > dislike the essentially redundant checks for
> > > > > > > `__x86_rep_stosb_threshold` and
> > > > > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > > > > processors.
> > > > > > > Further, in this case, it messes up the alignment of the loop.
> > > > > > >
> > > > > > > What I would like to do instead for processors which prefer
> > > > > > > non-temporal but not erms, is to have them set
> > > > > > > `__x86_rep_stosb_threshold ==
> > > > > > > __x86_memset_non_temporal_threshold`
> > > > > and prefer the `erms` version.
> > > > > > >
> > > > > > > Can you implement it like that instead?
> > > > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > > > If my understand is right, you means that If
> > > > > > __x86_rep_stosb_threshold ==
> > __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > > > when
> > > > > > init_cpu_features for servers without ERMS.
> > > > > >
> > > > > Yes, if you don't want memcpy to use `erms`, set
> > > > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > > > >
> > > > > > However, if like this, I think we ignore the small size case.
> > > > > > Though we can use non temporal now, But for "size <
> > > > > > __x86_rep_stosb_threshold" case, it is forced to use erms whose
> > > > > > performance
> > > > > is very bad For example, hygon and amd zen2.
> > > > >
> > > > > Hmm, I'm not sure I understand what you mean.
> > > > >
> > > > > What I am arguing for is that a processor that wants non-temporal
> > > > > but doesn't want erms should do the following:
> > > > >
> > > > > memset:
> > > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold
> > > > > == YOUR_PREFERED_NT_THRESHOLD`
> > > > >
> > > > > This will have the exact same behavior as your patch. Below NT
> > > > > threshold it will use the normal temporal loop.
> > > >
> > > > [Feifei]Ok, maybe I understand your meanings.
> > > > Consider we have a server support avx2, vec_size = 32, erms cpu flag is
> > disabled.
> > > > 1. set __x86_rep_stosb_threshold ==
> > > > __x86_memset_non_temporal_threshold == 1MB 2. set 'Prefer ERMS' to
> > force server to select erms function in 'IFUNC_SELECTOR'
> > > > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> > > >
> > > > Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms"
> > function.
> > > > Second, in the memset_erms function, check size > 2 * vec_size, and then
> > jump to 'stosb_more_2x_vec'
> > > > Third, check size < _x86_rep_stosb_threshold, then go to normal
> > > > branch. Here, if size = 2MB, we can jump to 'stosb_local', then check
> > NT_threshold, we jump to NT loop.
> > > > Is my right?
> > >
> > > Yes.
> > >
> > > >
> > > > By the way, I just did a test that If we setting 'Prefer_ERMS', cpu
> > > > will call 'memset_erms' in memset_erms.S, But we need call '
> > memset_avx2_unaligned_erms '.
> > > > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > > > Line53: +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > > >        +        if (CPU_FEATURES_ARCH_P (cpu_features,
> > Prefer_ERMS))
> > > >        +        return OPTIMIZE (avx2_unaligned_erms);
> > > >        + }
> > > >        + else
> > > >                         if (CPU_FEATURES_ARCH_P (cpu_features,
> > Prefer_ERMS))
> > > >                 return OPTIMIZE (erms);
> > > >
> > > > Or we add a new flag 'Prefer_VEC_ERMS'?
> > > >
> > >
> > > Yeah, so I need to do some plumbing here.
> > > `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> > > For deciding between memset_<isa>_unaligned... we use:
> > > `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
> > >
> > > Think we need to add a new flag along the lines of:
> > > `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)` and make the
> > > memset/memmove_<isa>_unaligned_erms use that flag.
> > >
> > > Then in dl-cacheinfo.h if we don't have ERMS, we setup stosb/movsb
> > > thresholds s.t stosb/movsb code will just forward to non-temporal
> > > codes.
> [Feifei] Agree. NON_TEMPORAL flag is suitable for these usecase.
> >
> > I'm going to assume this is not going to make it into 2.40 w/ the freeze and the
> > fact that we will need to do some changes in cpu-features/dl-cacheinfo.
> >
> > Before master re-opens (after the release) ill post patches to add support for
> > this usecase.
>
> [Feifei] OK, and I just have a question: when the master re-open again?
> And I will prepare for the patch of this and push it to the community as soon as possible for reviewing.
> In the next version, change will be as below:
> 1. add a new flag called "NON_TEMPORAL"
> 2. add Hygon arch branch in cpu-features.c and dl-cacheinfo.h, to split Hygon from AMD branch.
> 3.For Hygon arch branch, we enable NON_TEMPORAL flag
> 4. NON_TEMPORAL flag is set, __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold in dl-cacheinfo.h
>

I posted a patch for it, once it gets in you just need to set
`Prefer_Non_Temporal`
for your target.
> > > > >
> > > > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > > > >
> > > > > > Thus we add a nt jump for temporal branch, and for this, we can support
> > that :
> > > > > > 1. small size can use temporal for good performance rather than erms.
> > > > > > 2.big size can use non temporal to achieve better performance.
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > > Feifei
> > > > > >
  
Feifei Wang July 10, 2024, 7 a.m. UTC | #9
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月10日 14:53
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> 
> On Wed, Jul 10, 2024 at 2:39 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> >
> >
> > > -----邮件原件-----
> > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > 发送时间: 2024年7月9日 18:55
> > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > >
> > > On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein
> > > <goldstein.w.n@gmail.com>
> > > wrote:
> > > >
> > > > On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn>
> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----邮件原件-----
> > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > 发送时间: 2024年7月9日 12:19
> > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > > >
> > > > > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang
> > > > > > <wangfeifei@hygon.cn>
> > > wrote:
> > > > > > >
> > > > > > > Thanks for your kindly reply.
> > > > > > >
> > > > > > > > -----邮件原件-----
> > > > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > > > 发送时间: 2024年7月8日 13:29
> > > > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without
> > > > > > > > ERMS
> > > > > > > >
> > > > > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang
> > > > > > > > <wangfeifei@hygon.cn>
> > > wrote:
> > > > > > > > >
> > > > > > > > > Currently, large memset perf can be improved with
> > > > > > > > > non-temporal
> > > > > > stores[1].
> > > > > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > > > > >
> > > > > > > > > For some architectures, ERMS is disabled in cpu feature
> > > > > > > > > as default For example, AMD zen3-milan[2], Or ERMS
> > > > > > > > > performance is worse than vectorized loop, and it is disabled.
> > > > > > > > > For example, HygonGenuine arch.
> > > > > > > > >
> > > > > > > > > Thus, to let these arch achieve perf improvement from
> > > > > > > > > non-temporal memset, and enable non-temporal memset
> > > > > > > > > without ERMS, a 'non-temporal memset branch jump' is
> > > > > > > > > added in
> > > 'more_2x_vec'case.
> > > > > > > > >
> > > > > > > > > Test Results:
> > > > > > > > > thread: 1
> > > > > > > > > memset store value: 0
> > > > > > > > > function: memset_avx2_unaligned
> > > > > > > > >
> > > > > > > > > hygon1 arch
> > > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > > size                          new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > > > > 512 byte( > 8x loop case)     1
> > > > > > > > > 1MB                           0.994
> > > > > > > > > 4MB                           0.996
> > > > > > > > > 8MB                           0.670
> > > > > > > > > 16MB                          0.343
> > > > > > > > > 32MB                          0.355
> > > > > > > > >
> > > > > > > > > hygon2 arch
> > > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > > size                          new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > > > 256 byte(4x - 8x vec case)    0.653
> > > > > > > > > 512 byte( > 8x loop case)     0.713
> > > > > > > > > 1MB                           1
> > > > > > > > > 4MB                           0.887
> > > > > > > > > 8MB                           1.312
> > > > > > > > > 16MB                          0.822
> > > > > > > > > 32MB                          0.830
> > > > > > > > >
> > > > > > > > > hygon3 arch
> > > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > > size                          new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case)     1
> > > > > > > > > 256 byte(4x - 8x vec case)    1
> > > > > > > > > 512 byte( > 8x loop case)     1
> > > > > > > > > 1MB                           1
> > > > > > > > > 4MB                           0.990
> > > > > > > > > 8MB                           0.737
> > > > > > > > > 16MB                          0.390
> > > > > > > > > 32MB                          0.401
> > > > > > > > >
> > > > > > > > > For hygon arch with , no performance degradation on '2x
> > > > > > > > > - 8x branch case' when extra branch jump added. And with
> > > > > > > > > this patch, non-temporal stores can improve performance by 20%
> - 65%.
> > > > > > > > >
> > > > > > > > > amd-zen2-rome arch
> > > > > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > > > > size                          new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case)     0.986
> > > > > > > > > 256 byte(4x - 8x vec case)    1.057
> > > > > > > > > 512 byte( > 8x loop case)     1.005
> > > > > > > > > 1MB                           0.999
> > > > > > > > > 4MB                           0.998
> > > > > > > > > 8MB                           1.014
> > > > > > > > > 16MB                          1.156
> > > > > > > > > 32MB                          0.634
> > > > > > > > >
> > > > > > > > > For amd-zen2-rome, no performance degradation on '2x -
> > > > > > > > > 8x branch
> > > case'
> > > > > > > > > when extra branch jump added. When size >=
> > > > > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB'
> > > > > > > > > case, performance degradation is due to that amd-zen2
> > > > > > > > > non_temporal perf is worse than temporal[3]. And in 32MB
> > > > > > > > > case, it can achieve performance improvement by
> > > > > > > > 37%.
> > > > > > > > >
> > > > > > > > > For intel arch, intel arch ERMS feature is enabled as
> > > > > > > > > default, so this patch has no effect for it.
> > > > > > > > >
> > > > > > > > > Note:
> > > > > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at
> > > > > > > > > providing high performance x86 processor for China server
> market.
> > > > > > > > >
> > > > > > > > > Ref:
> > > > > > > > > [1]
> > > > > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0a
> > > > > > > > > b805
> > > > > > > > > 73d66e
> > > > > > > > > 4ae5
> > > > > > > > > d94b094659094336da90f [2]
> > > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/ms
> > > > > > > > > g067
> > > > > > > > > 47.htm
> > > > > > > > > l
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjA
> > > > > > Ei
> > > > > > > > l4
> > > > > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > > > > >
> > > > > > > > > Jira: HES-210
> > > > > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > > > > ---
> > > > > > > > >  sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S |
> > > > > > > > > 10
> > > > > > > > > +++++++++-
> > > > > > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > diff --git
> > > > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > > index 88bf08e..063dd51 100644
> > > > > > > > > ---
> > > > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms
> > > > > > > > > +++ .S
> > > > > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > > > >         cmp     __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > > > >         ja      L(stosb_local)
> > > > > > > > >  #endif
> > > > > > > > > -       /* Fallthrough goes to L(loop_4x_vec). Tests for memset
> (2x,
> > > 4x]
> > > > > > > > > +       /* If rdx is less than
> > > __x86_memset_non_temporal_threshold,
> > > > > > > > > +          fallthrough goes to L(loop_4x_vec). Tests for
> > > > > > > > > + memset (2x, 4x]
> > > > > > > > >            and (4x, 8x] jump to target.  */
> > > > > > > > >  L(more_2x_vec):
> > > > > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > > > > +       /* Check non-temporal store threshold.
> > > > > > > > > +          This is for the hardware which can not
> > > > > > > > > +support ERMS feature
> > > > > > or
> > > > > > > > > +          ERMS fearure is disabled.  */
> > > > > > > > > +       cmp
> > > > > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > > > > +       jae     L(nt_memset)
> > > > > > > > > +#endif
> > > > > > > > >         /* Store next 2x vec regardless.  */
> > > > > > > > >         VMOVU   %VMM(0), (%rdi)
> > > > > > > > >         VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > > > > --
> > > > > > > > > 2.7.4
> > > > > > > > >
> > > > > > > >
> > > > > > > > I am opposed to this patch getting in as is. I strongly
> > > > > > > > dislike the essentially redundant checks for
> > > > > > > > `__x86_rep_stosb_threshold` and
> > > > > > > > `__x86_memset_non_temporal_threshold` that will occur on
> > > > > > > > most
> > > > > > processors.
> > > > > > > > Further, in this case, it messes up the alignment of the loop.
> > > > > > > >
> > > > > > > > What I would like to do instead for processors which
> > > > > > > > prefer non-temporal but not erms, is to have them set
> > > > > > > > `__x86_rep_stosb_threshold ==
> > > > > > > > __x86_memset_non_temporal_threshold`
> > > > > > and prefer the `erms` version.
> > > > > > > >
> > > > > > > > Can you implement it like that instead?
> > > > > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > > > > If my understand is right, you means that If
> > > > > > > __x86_rep_stosb_threshold ==
> > > __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > > > > when
> > > > > > > init_cpu_features for servers without ERMS.
> > > > > > >
> > > > > > Yes, if you don't want memcpy to use `erms`, set
> > > > > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > > > > >
> > > > > > > However, if like this, I think we ignore the small size case.
> > > > > > > Though we can use non temporal now, But for "size <
> > > > > > > __x86_rep_stosb_threshold" case, it is forced to use erms
> > > > > > > whose performance
> > > > > > is very bad For example, hygon and amd zen2.
> > > > > >
> > > > > > Hmm, I'm not sure I understand what you mean.
> > > > > >
> > > > > > What I am arguing for is that a processor that wants
> > > > > > non-temporal but doesn't want erms should do the following:
> > > > > >
> > > > > > memset:
> > > > > > `__x86_rep_stosb_threshold ==
> > > > > > __x86_memset_non_temporal_threshold
> > > > > > == YOUR_PREFERED_NT_THRESHOLD`
> > > > > >
> > > > > > This will have the exact same behavior as your patch. Below NT
> > > > > > threshold it will use the normal temporal loop.
> > > > >
> > > > > [Feifei]Ok, maybe I understand your meanings.
> > > > > Consider we have a server support avx2, vec_size = 32, erms cpu
> > > > > flag is
> > > disabled.
> > > > > 1. set __x86_rep_stosb_threshold ==
> > > > > __x86_memset_non_temporal_threshold == 1MB 2. set 'Prefer ERMS'
> > > > > to
> > > force server to select erms function in 'IFUNC_SELECTOR'
> > > > > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> > > > >
> > > > > Firstly, no erms support, cpu can also call "
> memset_avx2_unaligned_erms"
> > > function.
> > > > > Second, in the memset_erms function, check size > 2 * vec_size,
> > > > > and then
> > > jump to 'stosb_more_2x_vec'
> > > > > Third, check size < _x86_rep_stosb_threshold, then go to normal
> > > > > branch. Here, if size = 2MB, we can jump to 'stosb_local', then
> > > > > check
> > > NT_threshold, we jump to NT loop.
> > > > > Is my right?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > By the way, I just did a test that If we setting 'Prefer_ERMS',
> > > > > cpu will call 'memset_erms' in memset_erms.S, But we need call '
> > > memset_avx2_unaligned_erms '.
> > > > > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > > > > Line53: +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > > > >        +        if (CPU_FEATURES_ARCH_P (cpu_features,
> > > Prefer_ERMS))
> > > > >        +        return OPTIMIZE (avx2_unaligned_erms);
> > > > >        + }
> > > > >        + else
> > > > >                         if (CPU_FEATURES_ARCH_P (cpu_features,
> > > Prefer_ERMS))
> > > > >                 return OPTIMIZE (erms);
> > > > >
> > > > > Or we add a new flag 'Prefer_VEC_ERMS'?
> > > > >
> > > >
> > > > Yeah, so I need to do some plumbing here.
> > > > `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> > > > For deciding between memset_<isa>_unaligned... we use:
> > > > `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
> > > >
> > > > Think we need to add a new flag along the lines of:
> > > > `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)` and make the
> > > > memset/memmove_<isa>_unaligned_erms use that flag.
> > > >
> > > > Then in dl-cacheinfo.h if we don't have ERMS, we setup stosb/movsb
> > > > thresholds s.t stosb/movsb code will just forward to non-temporal
> > > > codes.
> > [Feifei] Agree. NON_TEMPORAL flag is suitable for these usecase.
> > >
> > > I'm going to assume this is not going to make it into 2.40 w/ the
> > > freeze and the fact that we will need to do some changes in
> cpu-features/dl-cacheinfo.
> > >
> > > Before master re-opens (after the release) ill post patches to add
> > > support for this usecase.
> >
> > [Feifei] OK, and I just have a question: when the master re-open again?
> > And I will prepare for the patch of this and push it to the community as soon as
> possible for reviewing.
> > In the next version, change will be as below:
> > 1. add a new flag called "NON_TEMPORAL"
> > 2. add Hygon arch branch in cpu-features.c and dl-cacheinfo.h, to split Hygon
> from AMD branch.
> > 3.For Hygon arch branch, we enable NON_TEMPORAL flag 4.
> NON_TEMPORAL
> > flag is set, __x86_rep_stosb_threshold ==
> > __x86_memset_non_temporal_threshold in dl-cacheinfo.h
> >
> 
> I posted a patch for it, once it gets in you just need to set
> `Prefer_Non_Temporal` for your target.

[Feifei]Ok, that's good.
After it merge in, I will add Hygon branch and set Prefer_Non_Temporal branch.
> > > > > >
> > > > > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > > > > >
> > > > > > > Thus we add a nt jump for temporal branch, and for this, we
> > > > > > > can support
> > > that :
> > > > > > > 1. small size can use temporal for good performance rather than
> erms.
> > > > > > > 2.big size can use non temporal to achieve better performance.
> > > > > > >
> > > > > > >
> > > > > > > Best Regards
> > > > > > > Feifei
> > > > > > >
  

Patch

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 88bf08e..063dd51 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -229,9 +229,17 @@  L(stosb_more_2x_vec):
 	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb_local)
 #endif
-	/* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
+	/* If rdx is less than __x86_memset_non_temporal_threshold,
+	   fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
 	   and (4x, 8x] jump to target.  */
 L(more_2x_vec):
+#if defined USE_MULTIARCH && IS_IN (libc)
+	/* Check non-temporal store threshold.
+	   This is for the hardware which can not support ERMS feature or
+	   ERMS fearure is disabled.  */
+	cmp	__x86_memset_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
+#endif
 	/* Store next 2x vec regardless.  */
 	VMOVU	%VMM(0), (%rdi)
 	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdi)