x86: Enable non-temporal memset without ERMS
Checks
Context |
Check |
Description |
redhat-pt-bot/TryBot-apply_patch |
success
|
Patch applied to master at the time it was sent
|
redhat-pt-bot/TryBot-32bit |
success
|
Build for i686
|
linaro-tcwg-bot/tcwg_glibc_build--master-arm |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_glibc_check--master-arm |
success
|
Test passed
|
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 |
success
|
Build passed
|
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 |
success
|
Test passed
|
Commit Message
Currently, large memset perf can be improved with non-temporal stores[1].
But it just supports the case in which ERMS feature is enabled.
For some architectures, ERMS is disabled in cpu feature as default
For example, AMD zen3-milan[2],
Or ERMS performance is worse than vectorized loop, and it is disabled.
For example, HygonGenuine arch.
Thus, to let these arch achieve perf improvement from non-temporal memset,
and enable non-temporal memset without ERMS, a 'non-temporal memset
branch jump' is added in 'more_2x_vec'case.
Test Results:
thread: 1
memset store value: 0
function: memset_avx2_unaligned
hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size new performance / old performance
128 byte(2x -4x vec case) 1
256 byte(4x - 8x vec case) 1
512 byte( > 8x loop case) 1
1MB 0.994
4MB 0.996
8MB 0.670
16MB 0.343
32MB 0.355
hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size new performance / old performance
128 byte(2x -4x vec case) 1
256 byte(4x - 8x vec case) 0.653
512 byte( > 8x loop case) 0.713
1MB 1
4MB 0.887
8MB 1.312
16MB 0.822
32MB 0.830
hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size new performance / old performance
128 byte(2x -4x vec case) 1
256 byte(4x - 8x vec case) 1
512 byte( > 8x loop case) 1
1MB 1
4MB 0.990
8MB 0.737
16MB 0.390
32MB 0.401
For hygon arch with , no performance degradation on '2x - 8x branch case' when
extra branch jump added. And with this patch, non-temporal stores can improve
performance by 20% - 65%.
amd-zen2-rome arch
x86_memset_non_temporal_threshold = 12MB
size new performance / old performance
128 byte(2x -4x vec case) 0.986
256 byte(4x - 8x vec case) 1.057
512 byte( > 8x loop case) 1.005
1MB 0.999
4MB 0.998
8MB 1.014
16MB 1.156
32MB 0.634
For amd-zen2-rome, no performance degradation on '2x - 8x branch case' when
extra branch jump added. When size >= x86_memset_non_temporal_threshold, for
'size = 16MB' case, performance degradation is due to that amd-zen2 non_temporal
perf is worse than temporal[3]. And in 32MB case, it can achieve performance
improvement by 37%.
For intel arch, intel arch ERMS feature is enabled as default, so this patch
has no effect for it.
Note:
Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing
high performance x86 processor for China server market.
Ref:
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5d94b094659094336da90f
[2] https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
[3] https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
Jira: HES-210
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
---
sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
Comments
On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
> Currently, large memset perf can be improved with non-temporal stores[1].
> But it just supports the case in which ERMS feature is enabled.
>
> For some architectures, ERMS is disabled in cpu feature as default
> For example, AMD zen3-milan[2],
> Or ERMS performance is worse than vectorized loop, and it is disabled.
> For example, HygonGenuine arch.
>
> Thus, to let these arch achieve perf improvement from non-temporal memset,
> and enable non-temporal memset without ERMS, a 'non-temporal memset
> branch jump' is added in 'more_2x_vec'case.
>
> Test Results:
> thread: 1
> memset store value: 0
> function: memset_avx2_unaligned
>
> hygon1 arch
> x86_memset_non_temporal_threshold = 8MB
> size new performance / old performance
> 128 byte(2x -4x vec case) 1
> 256 byte(4x - 8x vec case) 1
> 512 byte( > 8x loop case) 1
> 1MB 0.994
> 4MB 0.996
> 8MB 0.670
> 16MB 0.343
> 32MB 0.355
>
> hygon2 arch
> x86_memset_non_temporal_threshold = 8MB
> size new performance / old performance
> 128 byte(2x -4x vec case) 1
> 256 byte(4x - 8x vec case) 0.653
> 512 byte( > 8x loop case) 0.713
> 1MB 1
> 4MB 0.887
> 8MB 1.312
> 16MB 0.822
> 32MB 0.830
>
> hygon3 arch
> x86_memset_non_temporal_threshold = 8MB
> size new performance / old performance
> 128 byte(2x -4x vec case) 1
> 256 byte(4x - 8x vec case) 1
> 512 byte( > 8x loop case) 1
> 1MB 1
> 4MB 0.990
> 8MB 0.737
> 16MB 0.390
> 32MB 0.401
>
> For hygon arch with , no performance degradation on '2x - 8x branch case' when
> extra branch jump added. And with this patch, non-temporal stores can improve
> performance by 20% - 65%.
>
> amd-zen2-rome arch
> x86_memset_non_temporal_threshold = 12MB
> size new performance / old performance
> 128 byte(2x -4x vec case) 0.986
> 256 byte(4x - 8x vec case) 1.057
> 512 byte( > 8x loop case) 1.005
> 1MB 0.999
> 4MB 0.998
> 8MB 1.014
> 16MB 1.156
> 32MB 0.634
>
> For amd-zen2-rome, no performance degradation on '2x - 8x branch case' when
> extra branch jump added. When size >= x86_memset_non_temporal_threshold, for
> 'size = 16MB' case, performance degradation is due to that amd-zen2 non_temporal
> perf is worse than temporal[3]. And in 32MB case, it can achieve performance
> improvement by 37%.
>
> For intel arch, intel arch ERMS feature is enabled as default, so this patch
> has no effect for it.
>
> Note:
> Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing
> high performance x86 processor for China server market.
>
> Ref:
> [1] https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5d94b094659094336da90f
> [2] https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
> [3] https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
>
> Jira: HES-210
> Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> Reviewed-by: Jing Li <lijing@hygon.cn>
> Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> ---
> sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 88bf08e..063dd51 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> ja L(stosb_local)
> #endif
> - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> + /* If rdx is less than __x86_memset_non_temporal_threshold,
> + fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> and (4x, 8x] jump to target. */
> L(more_2x_vec):
> +#if defined USE_MULTIARCH && IS_IN (libc)
> + /* Check non-temporal store threshold.
> + This is for the hardware which can not support ERMS feature or
> + ERMS fearure is disabled. */
> + cmp __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> + jae L(nt_memset)
> +#endif
> /* Store next 2x vec regardless. */
> VMOVU %VMM(0), (%rdi)
> VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> --
> 2.7.4
>
I am opposed to this patch getting in as is. I strongly dislike the
essentially redundant checks for `__x86_rep_stosb_threshold` and
`__x86_memset_non_temporal_threshold` that will occur on most
processors. Further, in this case, it messes up the alignment of the
loop.
What I would like to do instead for processors which prefer
non-temporal but not erms, is to have them set
`__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold` and
prefer the `erms` version.
Can you implement it like that instead?
Thanks for your kindly reply.
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月8日 13:29
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
>
> On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> > Currently, large memset perf can be improved with non-temporal stores[1].
> > But it just supports the case in which ERMS feature is enabled.
> >
> > For some architectures, ERMS is disabled in cpu feature as default For
> > example, AMD zen3-milan[2], Or ERMS performance is worse than
> > vectorized loop, and it is disabled.
> > For example, HygonGenuine arch.
> >
> > Thus, to let these arch achieve perf improvement from non-temporal
> > memset, and enable non-temporal memset without ERMS, a 'non-temporal
> > memset branch jump' is added in 'more_2x_vec'case.
> >
> > Test Results:
> > thread: 1
> > memset store value: 0
> > function: memset_avx2_unaligned
> >
> > hygon1 arch
> > x86_memset_non_temporal_threshold = 8MB
> > size new performance / old performance
> > 128 byte(2x -4x vec case) 1
> > 256 byte(4x - 8x vec case) 1
> > 512 byte( > 8x loop case) 1
> > 1MB 0.994
> > 4MB 0.996
> > 8MB 0.670
> > 16MB 0.343
> > 32MB 0.355
> >
> > hygon2 arch
> > x86_memset_non_temporal_threshold = 8MB
> > size new performance / old performance
> > 128 byte(2x -4x vec case) 1
> > 256 byte(4x - 8x vec case) 0.653
> > 512 byte( > 8x loop case) 0.713
> > 1MB 1
> > 4MB 0.887
> > 8MB 1.312
> > 16MB 0.822
> > 32MB 0.830
> >
> > hygon3 arch
> > x86_memset_non_temporal_threshold = 8MB
> > size new performance / old performance
> > 128 byte(2x -4x vec case) 1
> > 256 byte(4x - 8x vec case) 1
> > 512 byte( > 8x loop case) 1
> > 1MB 1
> > 4MB 0.990
> > 8MB 0.737
> > 16MB 0.390
> > 32MB 0.401
> >
> > For hygon arch with , no performance degradation on '2x - 8x branch
> > case' when extra branch jump added. And with this patch, non-temporal
> > stores can improve performance by 20% - 65%.
> >
> > amd-zen2-rome arch
> > x86_memset_non_temporal_threshold = 12MB
> > size new performance / old performance
> > 128 byte(2x -4x vec case) 0.986
> > 256 byte(4x - 8x vec case) 1.057
> > 512 byte( > 8x loop case) 1.005
> > 1MB 0.999
> > 4MB 0.998
> > 8MB 1.014
> > 16MB 1.156
> > 32MB 0.634
> >
> > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > when extra branch jump added. When size >=
> > x86_memset_non_temporal_threshold, for 'size = 16MB' case, performance
> > degradation is due to that amd-zen2 non_temporal perf is worse than
> > temporal[3]. And in 32MB case, it can achieve performance improvement by
> 37%.
> >
> > For intel arch, intel arch ERMS feature is enabled as default, so this
> > patch has no effect for it.
> >
> > Note:
> > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > performance x86 processor for China server market.
> >
> > Ref:
> > [1]
> > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5
> > d94b094659094336da90f [2]
> > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
> > [3]
> >
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4
> > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> >
> > Jira: HES-210
> > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > Reviewed-by: Jing Li <lijing@hygon.cn>
> > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > ---
> > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index 88bf08e..063dd51 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > ja L(stosb_local)
> > #endif
> > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > + /* If rdx is less than __x86_memset_non_temporal_threshold,
> > + fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> > + 4x]
> > and (4x, 8x] jump to target. */
> > L(more_2x_vec):
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> > + /* Check non-temporal store threshold.
> > + This is for the hardware which can not support ERMS feature or
> > + ERMS fearure is disabled. */
> > + cmp
> __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > + jae L(nt_memset)
> > +#endif
> > /* Store next 2x vec regardless. */
> > VMOVU %VMM(0), (%rdi)
> > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > --
> > 2.7.4
> >
>
> I am opposed to this patch getting in as is. I strongly dislike the essentially
> redundant checks for `__x86_rep_stosb_threshold` and
> `__x86_memset_non_temporal_threshold` that will occur on most processors.
> Further, in this case, it messes up the alignment of the loop.
>
> What I would like to do instead for processors which prefer non-temporal but
> not erms, is to have them set `__x86_rep_stosb_threshold ==
> __x86_memset_non_temporal_threshold` and prefer the `erms` version.
>
> Can you implement it like that instead?
[Feifei] This is also a good choice. But I have a question for this:
If my understand is right, you means that If __x86_rep_stosb_threshold ==
__x86_memset_non_temporal_threshold, we can set "Prefer_ERMS" when
init_cpu_features for servers without ERMS.
However, if like this, I think we ignore the small size case. Though we can use non temporal now,
But for "size < __x86_rep_stosb_threshold" case, it is forced to use erms whose performance is very bad
For example, hygon and amd zen2.
Thus we add a nt jump for temporal branch, and for this, we can support that :
1. small size can use temporal for good performance rather than erms.
2.big size can use non temporal to achieve better performance.
Best Regards
Feifei
On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
> Thanks for your kindly reply.
>
> > -----邮件原件-----
> > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > 发送时间: 2024年7月8日 13:29
> > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> >
> > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > >
> > > Currently, large memset perf can be improved with non-temporal stores[1].
> > > But it just supports the case in which ERMS feature is enabled.
> > >
> > > For some architectures, ERMS is disabled in cpu feature as default For
> > > example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > vectorized loop, and it is disabled.
> > > For example, HygonGenuine arch.
> > >
> > > Thus, to let these arch achieve perf improvement from non-temporal
> > > memset, and enable non-temporal memset without ERMS, a 'non-temporal
> > > memset branch jump' is added in 'more_2x_vec'case.
> > >
> > > Test Results:
> > > thread: 1
> > > memset store value: 0
> > > function: memset_avx2_unaligned
> > >
> > > hygon1 arch
> > > x86_memset_non_temporal_threshold = 8MB
> > > size new performance / old performance
> > > 128 byte(2x -4x vec case) 1
> > > 256 byte(4x - 8x vec case) 1
> > > 512 byte( > 8x loop case) 1
> > > 1MB 0.994
> > > 4MB 0.996
> > > 8MB 0.670
> > > 16MB 0.343
> > > 32MB 0.355
> > >
> > > hygon2 arch
> > > x86_memset_non_temporal_threshold = 8MB
> > > size new performance / old performance
> > > 128 byte(2x -4x vec case) 1
> > > 256 byte(4x - 8x vec case) 0.653
> > > 512 byte( > 8x loop case) 0.713
> > > 1MB 1
> > > 4MB 0.887
> > > 8MB 1.312
> > > 16MB 0.822
> > > 32MB 0.830
> > >
> > > hygon3 arch
> > > x86_memset_non_temporal_threshold = 8MB
> > > size new performance / old performance
> > > 128 byte(2x -4x vec case) 1
> > > 256 byte(4x - 8x vec case) 1
> > > 512 byte( > 8x loop case) 1
> > > 1MB 1
> > > 4MB 0.990
> > > 8MB 0.737
> > > 16MB 0.390
> > > 32MB 0.401
> > >
> > > For hygon arch with , no performance degradation on '2x - 8x branch
> > > case' when extra branch jump added. And with this patch, non-temporal
> > > stores can improve performance by 20% - 65%.
> > >
> > > amd-zen2-rome arch
> > > x86_memset_non_temporal_threshold = 12MB
> > > size new performance / old performance
> > > 128 byte(2x -4x vec case) 0.986
> > > 256 byte(4x - 8x vec case) 1.057
> > > 512 byte( > 8x loop case) 1.005
> > > 1MB 0.999
> > > 4MB 0.998
> > > 8MB 1.014
> > > 16MB 1.156
> > > 32MB 0.634
> > >
> > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > when extra branch jump added. When size >=
> > > x86_memset_non_temporal_threshold, for 'size = 16MB' case, performance
> > > degradation is due to that amd-zen2 non_temporal perf is worse than
> > > temporal[3]. And in 32MB case, it can achieve performance improvement by
> > 37%.
> > >
> > > For intel arch, intel arch ERMS feature is enabled as default, so this
> > > patch has no effect for it.
> > >
> > > Note:
> > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > performance x86 processor for China server market.
> > >
> > > Ref:
> > > [1]
> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5
> > > d94b094659094336da90f [2]
> > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.html
> > > [3]
> > >
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4
> > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > >
> > > Jira: HES-210
> > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > ---
> > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10 +++++++++-
> > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 88bf08e..063dd51 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > ja L(stosb_local)
> > > #endif
> > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > + /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > + fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> > > + 4x]
> > > and (4x, 8x] jump to target. */
> > > L(more_2x_vec):
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > + /* Check non-temporal store threshold.
> > > + This is for the hardware which can not support ERMS feature or
> > > + ERMS fearure is disabled. */
> > > + cmp
> > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > + jae L(nt_memset)
> > > +#endif
> > > /* Store next 2x vec regardless. */
> > > VMOVU %VMM(0), (%rdi)
> > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > --
> > > 2.7.4
> > >
> >
> > I am opposed to this patch getting in as is. I strongly dislike the essentially
> > redundant checks for `__x86_rep_stosb_threshold` and
> > `__x86_memset_non_temporal_threshold` that will occur on most processors.
> > Further, in this case, it messes up the alignment of the loop.
> >
> > What I would like to do instead for processors which prefer non-temporal but
> > not erms, is to have them set `__x86_rep_stosb_threshold ==
> > __x86_memset_non_temporal_threshold` and prefer the `erms` version.
> >
> > Can you implement it like that instead?
> [Feifei] This is also a good choice. But I have a question for this:
> If my understand is right, you means that If __x86_rep_stosb_threshold ==
> __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS" when
> init_cpu_features for servers without ERMS.
>
Yes, if you don't want memcpy to use `erms`, set
`__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> However, if like this, I think we ignore the small size case. Though we can use non temporal now,
> But for "size < __x86_rep_stosb_threshold" case, it is forced to use erms whose performance is very bad
> For example, hygon and amd zen2.
Hmm, I'm not sure I understand what you mean.
What I am arguing for is that a processor that wants non-temporal but
doesn't want erms should do the following:
memset:
`__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
YOUR_PREFERED_NT_THRESHOLD`
This will have the exact same behavior as your patch. Below NT threshold it
will use the normal temporal loop.
Above NT threshold it will use nt-stores. It will never use `rep stosb`.
>
> Thus we add a nt jump for temporal branch, and for this, we can support that :
> 1. small size can use temporal for good performance rather than erms.
> 2.big size can use non temporal to achieve better performance.
>
>
> Best Regards
> Feifei
>
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月9日 12:19
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
>
> On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> > Thanks for your kindly reply.
> >
> > > -----邮件原件-----
> > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > 发送时间: 2024年7月8日 13:29
> > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > >
> > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > >
> > > > Currently, large memset perf can be improved with non-temporal
> stores[1].
> > > > But it just supports the case in which ERMS feature is enabled.
> > > >
> > > > For some architectures, ERMS is disabled in cpu feature as default
> > > > For example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > > vectorized loop, and it is disabled.
> > > > For example, HygonGenuine arch.
> > > >
> > > > Thus, to let these arch achieve perf improvement from non-temporal
> > > > memset, and enable non-temporal memset without ERMS, a
> > > > 'non-temporal memset branch jump' is added in 'more_2x_vec'case.
> > > >
> > > > Test Results:
> > > > thread: 1
> > > > memset store value: 0
> > > > function: memset_avx2_unaligned
> > > >
> > > > hygon1 arch
> > > > x86_memset_non_temporal_threshold = 8MB
> > > > size new performance / old performance
> > > > 128 byte(2x -4x vec case) 1
> > > > 256 byte(4x - 8x vec case) 1
> > > > 512 byte( > 8x loop case) 1
> > > > 1MB 0.994
> > > > 4MB 0.996
> > > > 8MB 0.670
> > > > 16MB 0.343
> > > > 32MB 0.355
> > > >
> > > > hygon2 arch
> > > > x86_memset_non_temporal_threshold = 8MB
> > > > size new performance / old performance
> > > > 128 byte(2x -4x vec case) 1
> > > > 256 byte(4x - 8x vec case) 0.653
> > > > 512 byte( > 8x loop case) 0.713
> > > > 1MB 1
> > > > 4MB 0.887
> > > > 8MB 1.312
> > > > 16MB 0.822
> > > > 32MB 0.830
> > > >
> > > > hygon3 arch
> > > > x86_memset_non_temporal_threshold = 8MB
> > > > size new performance / old performance
> > > > 128 byte(2x -4x vec case) 1
> > > > 256 byte(4x - 8x vec case) 1
> > > > 512 byte( > 8x loop case) 1
> > > > 1MB 1
> > > > 4MB 0.990
> > > > 8MB 0.737
> > > > 16MB 0.390
> > > > 32MB 0.401
> > > >
> > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > branch case' when extra branch jump added. And with this patch,
> > > > non-temporal stores can improve performance by 20% - 65%.
> > > >
> > > > amd-zen2-rome arch
> > > > x86_memset_non_temporal_threshold = 12MB
> > > > size new performance / old performance
> > > > 128 byte(2x -4x vec case) 0.986
> > > > 256 byte(4x - 8x vec case) 1.057
> > > > 512 byte( > 8x loop case) 1.005
> > > > 1MB 0.999
> > > > 4MB 0.998
> > > > 8MB 1.014
> > > > 16MB 1.156
> > > > 32MB 0.634
> > > >
> > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > > when extra branch jump added. When size >=
> > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > performance degradation is due to that amd-zen2 non_temporal perf
> > > > is worse than temporal[3]. And in 32MB case, it can achieve
> > > > performance improvement by
> > > 37%.
> > > >
> > > > For intel arch, intel arch ERMS feature is enabled as default, so
> > > > this patch has no effect for it.
> > > >
> > > > Note:
> > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > > performance x86 processor for China server market.
> > > >
> > > > Ref:
> > > > [1]
> > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e
> > > > 4ae5
> > > > d94b094659094336da90f [2]
> > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.htm
> > > > l
> > > > [3]
> > > >
> > >
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEi
> > > l4
> > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > >
> > > > Jira: HES-210
> > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > ---
> > > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > +++++++++-
> > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > index 88bf08e..063dd51 100644
> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > ja L(stosb_local)
> > > > #endif
> > > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > > + /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > > + fallthrough goes to L(loop_4x_vec). Tests for memset
> > > > + (2x, 4x]
> > > > and (4x, 8x] jump to target. */
> > > > L(more_2x_vec):
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > + /* Check non-temporal store threshold.
> > > > + This is for the hardware which can not support ERMS feature
> or
> > > > + ERMS fearure is disabled. */
> > > > + cmp
> > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > + jae L(nt_memset)
> > > > +#endif
> > > > /* Store next 2x vec regardless. */
> > > > VMOVU %VMM(0), (%rdi)
> > > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > --
> > > > 2.7.4
> > > >
> > >
> > > I am opposed to this patch getting in as is. I strongly dislike the
> > > essentially redundant checks for `__x86_rep_stosb_threshold` and
> > > `__x86_memset_non_temporal_threshold` that will occur on most
> processors.
> > > Further, in this case, it messes up the alignment of the loop.
> > >
> > > What I would like to do instead for processors which prefer
> > > non-temporal but not erms, is to have them set
> > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold`
> and prefer the `erms` version.
> > >
> > > Can you implement it like that instead?
> > [Feifei] This is also a good choice. But I have a question for this:
> > If my understand is right, you means that If __x86_rep_stosb_threshold
> > == __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> when
> > init_cpu_features for servers without ERMS.
> >
> Yes, if you don't want memcpy to use `erms`, set
> `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
>
> > However, if like this, I think we ignore the small size case. Though
> > we can use non temporal now, But for "size <
> > __x86_rep_stosb_threshold" case, it is forced to use erms whose performance
> is very bad For example, hygon and amd zen2.
>
> Hmm, I'm not sure I understand what you mean.
>
> What I am arguing for is that a processor that wants non-temporal but doesn't
> want erms should do the following:
>
> memset:
> `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
> YOUR_PREFERED_NT_THRESHOLD`
>
> This will have the exact same behavior as your patch. Below NT threshold it will
> use the normal temporal loop.
[Feifei]Ok, maybe I understand your meanings.
Consider we have a server support avx2, vec_size = 32, erms cpu flag is disabled.
1. set __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold == 1MB
2. set 'Prefer ERMS' to force server to select erms function in 'IFUNC_SELECTOR'
3. we need to use memset 0 for 1KB memory(size = 1KB).
Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms" function.
Second, in the memset_erms function, check size > 2 * vec_size, and then jump to 'stosb_more_2x_vec'
Third, check size < _x86_rep_stosb_threshold, then go to normal branch. Here, if size = 2MB,
we can jump to 'stosb_local', then check NT_threshold, we jump to NT loop.
Is my right?
By the way, I just did a test that If we setting 'Prefer_ERMS', cpu will call 'memset_erms' in memset_erms.S,
But we need call ' memset_avx2_unaligned_erms '.
Thus, maybe we need to do a change for IFUNC_SELECTOR:
Line53: + if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
+ if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
+ return OPTIMIZE (avx2_unaligned_erms);
+ }
+ else
if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
return OPTIMIZE (erms);
Or we add a new flag 'Prefer_VEC_ERMs'?
>
> Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> >
> > Thus we add a nt jump for temporal branch, and for this, we can support that :
> > 1. small size can use temporal for good performance rather than erms.
> > 2.big size can use non temporal to achieve better performance.
> >
> >
> > Best Regards
> > Feifei
> >
On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
>
>
> > -----邮件原件-----
> > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > 发送时间: 2024年7月9日 12:19
> > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> >
> > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > >
> > > Thanks for your kindly reply.
> > >
> > > > -----邮件原件-----
> > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > 发送时间: 2024年7月8日 13:29
> > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > >
> > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > > >
> > > > > Currently, large memset perf can be improved with non-temporal
> > stores[1].
> > > > > But it just supports the case in which ERMS feature is enabled.
> > > > >
> > > > > For some architectures, ERMS is disabled in cpu feature as default
> > > > > For example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > > > vectorized loop, and it is disabled.
> > > > > For example, HygonGenuine arch.
> > > > >
> > > > > Thus, to let these arch achieve perf improvement from non-temporal
> > > > > memset, and enable non-temporal memset without ERMS, a
> > > > > 'non-temporal memset branch jump' is added in 'more_2x_vec'case.
> > > > >
> > > > > Test Results:
> > > > > thread: 1
> > > > > memset store value: 0
> > > > > function: memset_avx2_unaligned
> > > > >
> > > > > hygon1 arch
> > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > size new performance / old performance
> > > > > 128 byte(2x -4x vec case) 1
> > > > > 256 byte(4x - 8x vec case) 1
> > > > > 512 byte( > 8x loop case) 1
> > > > > 1MB 0.994
> > > > > 4MB 0.996
> > > > > 8MB 0.670
> > > > > 16MB 0.343
> > > > > 32MB 0.355
> > > > >
> > > > > hygon2 arch
> > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > size new performance / old performance
> > > > > 128 byte(2x -4x vec case) 1
> > > > > 256 byte(4x - 8x vec case) 0.653
> > > > > 512 byte( > 8x loop case) 0.713
> > > > > 1MB 1
> > > > > 4MB 0.887
> > > > > 8MB 1.312
> > > > > 16MB 0.822
> > > > > 32MB 0.830
> > > > >
> > > > > hygon3 arch
> > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > size new performance / old performance
> > > > > 128 byte(2x -4x vec case) 1
> > > > > 256 byte(4x - 8x vec case) 1
> > > > > 512 byte( > 8x loop case) 1
> > > > > 1MB 1
> > > > > 4MB 0.990
> > > > > 8MB 0.737
> > > > > 16MB 0.390
> > > > > 32MB 0.401
> > > > >
> > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > branch case' when extra branch jump added. And with this patch,
> > > > > non-temporal stores can improve performance by 20% - 65%.
> > > > >
> > > > > amd-zen2-rome arch
> > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > size new performance / old performance
> > > > > 128 byte(2x -4x vec case) 0.986
> > > > > 256 byte(4x - 8x vec case) 1.057
> > > > > 512 byte( > 8x loop case) 1.005
> > > > > 1MB 0.999
> > > > > 4MB 0.998
> > > > > 8MB 1.014
> > > > > 16MB 1.156
> > > > > 32MB 0.634
> > > > >
> > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > > > when extra branch jump added. When size >=
> > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > performance degradation is due to that amd-zen2 non_temporal perf
> > > > > is worse than temporal[3]. And in 32MB case, it can achieve
> > > > > performance improvement by
> > > > 37%.
> > > > >
> > > > > For intel arch, intel arch ERMS feature is enabled as default, so
> > > > > this patch has no effect for it.
> > > > >
> > > > > Note:
> > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > > > performance x86 processor for China server market.
> > > > >
> > > > > Ref:
> > > > > [1]
> > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e
> > > > > 4ae5
> > > > > d94b094659094336da90f [2]
> > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.htm
> > > > > l
> > > > > [3]
> > > > >
> > > >
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEi
> > > > l4
> > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > >
> > > > > Jira: HES-210
> > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > ---
> > > > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > +++++++++-
> > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > index 88bf08e..063dd51 100644
> > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > ja L(stosb_local)
> > > > > #endif
> > > > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > > > + /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > > > + fallthrough goes to L(loop_4x_vec). Tests for memset
> > > > > + (2x, 4x]
> > > > > and (4x, 8x] jump to target. */
> > > > > L(more_2x_vec):
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > + /* Check non-temporal store threshold.
> > > > > + This is for the hardware which can not support ERMS feature
> > or
> > > > > + ERMS fearure is disabled. */
> > > > > + cmp
> > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > + jae L(nt_memset)
> > > > > +#endif
> > > > > /* Store next 2x vec regardless. */
> > > > > VMOVU %VMM(0), (%rdi)
> > > > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > > I am opposed to this patch getting in as is. I strongly dislike the
> > > > essentially redundant checks for `__x86_rep_stosb_threshold` and
> > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > processors.
> > > > Further, in this case, it messes up the alignment of the loop.
> > > >
> > > > What I would like to do instead for processors which prefer
> > > > non-temporal but not erms, is to have them set
> > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold`
> > and prefer the `erms` version.
> > > >
> > > > Can you implement it like that instead?
> > > [Feifei] This is also a good choice. But I have a question for this:
> > > If my understand is right, you means that If __x86_rep_stosb_threshold
> > > == __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > when
> > > init_cpu_features for servers without ERMS.
> > >
> > Yes, if you don't want memcpy to use `erms`, set
> > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> >
> > > However, if like this, I think we ignore the small size case. Though
> > > we can use non temporal now, But for "size <
> > > __x86_rep_stosb_threshold" case, it is forced to use erms whose performance
> > is very bad For example, hygon and amd zen2.
> >
> > Hmm, I'm not sure I understand what you mean.
> >
> > What I am arguing for is that a processor that wants non-temporal but doesn't
> > want erms should do the following:
> >
> > memset:
> > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
> > YOUR_PREFERED_NT_THRESHOLD`
> >
> > This will have the exact same behavior as your patch. Below NT threshold it will
> > use the normal temporal loop.
>
> [Feifei]Ok, maybe I understand your meanings.
> Consider we have a server support avx2, vec_size = 32, erms cpu flag is disabled.
> 1. set __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold == 1MB
> 2. set 'Prefer ERMS' to force server to select erms function in 'IFUNC_SELECTOR'
> 3. we need to use memset 0 for 1KB memory(size = 1KB).
>
> Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms" function.
> Second, in the memset_erms function, check size > 2 * vec_size, and then jump to 'stosb_more_2x_vec'
> Third, check size < _x86_rep_stosb_threshold, then go to normal branch. Here, if size = 2MB,
> we can jump to 'stosb_local', then check NT_threshold, we jump to NT loop.
> Is my right?
Yes.
>
> By the way, I just did a test that If we setting 'Prefer_ERMS', cpu will call 'memset_erms' in memset_erms.S,
> But we need call ' memset_avx2_unaligned_erms '.
> Thus, maybe we need to do a change for IFUNC_SELECTOR:
> Line53: + if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> + if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
> + return OPTIMIZE (avx2_unaligned_erms);
> + }
> + else
> if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
> return OPTIMIZE (erms);
>
> Or we add a new flag 'Prefer_VEC_ERMS'?
>
Yeah, so I need to do some plumbing here.
`Prefer_ERMS` controls the pure `rep stosb` impl for memset.
For deciding between memset_<isa>_unaligned... we use:
`CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
Think we need to add a new flag along the lines of:
`CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)`
and make the memset/memmove_<isa>_unaligned_erms use that flag.
Then in dl-cacheinfo.h if we don't have ERMS, we setup
stosb/movsb thresholds s.t stosb/movsb code will just forward to
non-temporal codes.
> >
> > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > >
> > > Thus we add a nt jump for temporal branch, and for this, we can support that :
> > > 1. small size can use temporal for good performance rather than erms.
> > > 2.big size can use non temporal to achieve better performance.
> > >
> > >
> > > Best Regards
> > > Feifei
> > >
On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> >
> >
> > > -----邮件原件-----
> > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > 发送时间: 2024年7月9日 12:19
> > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > >
> > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > >
> > > > Thanks for your kindly reply.
> > > >
> > > > > -----邮件原件-----
> > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > 发送时间: 2024年7月8日 13:29
> > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > >
> > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > > > >
> > > > > > Currently, large memset perf can be improved with non-temporal
> > > stores[1].
> > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > >
> > > > > > For some architectures, ERMS is disabled in cpu feature as default
> > > > > > For example, AMD zen3-milan[2], Or ERMS performance is worse than
> > > > > > vectorized loop, and it is disabled.
> > > > > > For example, HygonGenuine arch.
> > > > > >
> > > > > > Thus, to let these arch achieve perf improvement from non-temporal
> > > > > > memset, and enable non-temporal memset without ERMS, a
> > > > > > 'non-temporal memset branch jump' is added in 'more_2x_vec'case.
> > > > > >
> > > > > > Test Results:
> > > > > > thread: 1
> > > > > > memset store value: 0
> > > > > > function: memset_avx2_unaligned
> > > > > >
> > > > > > hygon1 arch
> > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > size new performance / old performance
> > > > > > 128 byte(2x -4x vec case) 1
> > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > 512 byte( > 8x loop case) 1
> > > > > > 1MB 0.994
> > > > > > 4MB 0.996
> > > > > > 8MB 0.670
> > > > > > 16MB 0.343
> > > > > > 32MB 0.355
> > > > > >
> > > > > > hygon2 arch
> > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > size new performance / old performance
> > > > > > 128 byte(2x -4x vec case) 1
> > > > > > 256 byte(4x - 8x vec case) 0.653
> > > > > > 512 byte( > 8x loop case) 0.713
> > > > > > 1MB 1
> > > > > > 4MB 0.887
> > > > > > 8MB 1.312
> > > > > > 16MB 0.822
> > > > > > 32MB 0.830
> > > > > >
> > > > > > hygon3 arch
> > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > size new performance / old performance
> > > > > > 128 byte(2x -4x vec case) 1
> > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > 512 byte( > 8x loop case) 1
> > > > > > 1MB 1
> > > > > > 4MB 0.990
> > > > > > 8MB 0.737
> > > > > > 16MB 0.390
> > > > > > 32MB 0.401
> > > > > >
> > > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > > branch case' when extra branch jump added. And with this patch,
> > > > > > non-temporal stores can improve performance by 20% - 65%.
> > > > > >
> > > > > > amd-zen2-rome arch
> > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > size new performance / old performance
> > > > > > 128 byte(2x -4x vec case) 0.986
> > > > > > 256 byte(4x - 8x vec case) 1.057
> > > > > > 512 byte( > 8x loop case) 1.005
> > > > > > 1MB 0.999
> > > > > > 4MB 0.998
> > > > > > 8MB 1.014
> > > > > > 16MB 1.156
> > > > > > 32MB 0.634
> > > > > >
> > > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch case'
> > > > > > when extra branch jump added. When size >=
> > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > > performance degradation is due to that amd-zen2 non_temporal perf
> > > > > > is worse than temporal[3]. And in 32MB case, it can achieve
> > > > > > performance improvement by
> > > > > 37%.
> > > > > >
> > > > > > For intel arch, intel arch ERMS feature is enabled as default, so
> > > > > > this patch has no effect for it.
> > > > > >
> > > > > > Note:
> > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at providing high
> > > > > > performance x86 processor for China server market.
> > > > > >
> > > > > > Ref:
> > > > > > [1]
> > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e
> > > > > > 4ae5
> > > > > > d94b094659094336da90f [2]
> > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg06747.htm
> > > > > > l
> > > > > > [3]
> > > > > >
> > > > >
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEi
> > > > > l4
> > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > >
> > > > > > Jira: HES-210
> > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > ---
> > > > > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > > +++++++++-
> > > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > index 88bf08e..063dd51 100644
> > > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > ja L(stosb_local)
> > > > > > #endif
> > > > > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
> > > > > > + /* If rdx is less than __x86_memset_non_temporal_threshold,
> > > > > > + fallthrough goes to L(loop_4x_vec). Tests for memset
> > > > > > + (2x, 4x]
> > > > > > and (4x, 8x] jump to target. */
> > > > > > L(more_2x_vec):
> > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > + /* Check non-temporal store threshold.
> > > > > > + This is for the hardware which can not support ERMS feature
> > > or
> > > > > > + ERMS fearure is disabled. */
> > > > > > + cmp
> > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > + jae L(nt_memset)
> > > > > > +#endif
> > > > > > /* Store next 2x vec regardless. */
> > > > > > VMOVU %VMM(0), (%rdi)
> > > > > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > --
> > > > > > 2.7.4
> > > > > >
> > > > >
> > > > > I am opposed to this patch getting in as is. I strongly dislike the
> > > > > essentially redundant checks for `__x86_rep_stosb_threshold` and
> > > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > > processors.
> > > > > Further, in this case, it messes up the alignment of the loop.
> > > > >
> > > > > What I would like to do instead for processors which prefer
> > > > > non-temporal but not erms, is to have them set
> > > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold`
> > > and prefer the `erms` version.
> > > > >
> > > > > Can you implement it like that instead?
> > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > If my understand is right, you means that If __x86_rep_stosb_threshold
> > > > == __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > when
> > > > init_cpu_features for servers without ERMS.
> > > >
> > > Yes, if you don't want memcpy to use `erms`, set
> > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > >
> > > > However, if like this, I think we ignore the small size case. Though
> > > > we can use non temporal now, But for "size <
> > > > __x86_rep_stosb_threshold" case, it is forced to use erms whose performance
> > > is very bad For example, hygon and amd zen2.
> > >
> > > Hmm, I'm not sure I understand what you mean.
> > >
> > > What I am arguing for is that a processor that wants non-temporal but doesn't
> > > want erms should do the following:
> > >
> > > memset:
> > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold ==
> > > YOUR_PREFERED_NT_THRESHOLD`
> > >
> > > This will have the exact same behavior as your patch. Below NT threshold it will
> > > use the normal temporal loop.
> >
> > [Feifei]Ok, maybe I understand your meanings.
> > Consider we have a server support avx2, vec_size = 32, erms cpu flag is disabled.
> > 1. set __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold == 1MB
> > 2. set 'Prefer ERMS' to force server to select erms function in 'IFUNC_SELECTOR'
> > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> >
> > Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms" function.
> > Second, in the memset_erms function, check size > 2 * vec_size, and then jump to 'stosb_more_2x_vec'
> > Third, check size < _x86_rep_stosb_threshold, then go to normal branch. Here, if size = 2MB,
> > we can jump to 'stosb_local', then check NT_threshold, we jump to NT loop.
> > Is my right?
>
> Yes.
>
> >
> > By the way, I just did a test that If we setting 'Prefer_ERMS', cpu will call 'memset_erms' in memset_erms.S,
> > But we need call ' memset_avx2_unaligned_erms '.
> > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > Line53: + if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > + if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
> > + return OPTIMIZE (avx2_unaligned_erms);
> > + }
> > + else
> > if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS))
> > return OPTIMIZE (erms);
> >
> > Or we add a new flag 'Prefer_VEC_ERMS'?
> >
>
> Yeah, so I need to do some plumbing here.
> `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> For deciding between memset_<isa>_unaligned... we use:
> `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
>
> Think we need to add a new flag along the lines of:
> `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)`
> and make the memset/memmove_<isa>_unaligned_erms use that flag.
>
> Then in dl-cacheinfo.h if we don't have ERMS, we setup
> stosb/movsb thresholds s.t stosb/movsb code will just forward to
> non-temporal codes.
I'm going to assume this is not going to make it into 2.40 w/ the freeze and
the fact that we will need to do some changes in cpu-features/dl-cacheinfo.
Before master re-opens (after the release) ill post patches to add support
for this usecase.
> > >
> > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > >
> > > > Thus we add a nt jump for temporal branch, and for this, we can support that :
> > > > 1. small size can use temporal for good performance rather than erms.
> > > > 2.big size can use non temporal to achieve better performance.
> > > >
> > > >
> > > > Best Regards
> > > > Feifei
> > > >
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月9日 18:55
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
>
> On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein <goldstein.w.n@gmail.com>
> wrote:
> >
> > On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > >
> > >
> > >
> > > > -----邮件原件-----
> > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > 发送时间: 2024年7月9日 12:19
> > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > >
> > > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn>
> wrote:
> > > > >
> > > > > Thanks for your kindly reply.
> > > > >
> > > > > > -----邮件原件-----
> > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > 发送时间: 2024年7月8日 13:29
> > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > > >
> > > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn>
> wrote:
> > > > > > >
> > > > > > > Currently, large memset perf can be improved with
> > > > > > > non-temporal
> > > > stores[1].
> > > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > > >
> > > > > > > For some architectures, ERMS is disabled in cpu feature as
> > > > > > > default For example, AMD zen3-milan[2], Or ERMS performance
> > > > > > > is worse than vectorized loop, and it is disabled.
> > > > > > > For example, HygonGenuine arch.
> > > > > > >
> > > > > > > Thus, to let these arch achieve perf improvement from
> > > > > > > non-temporal memset, and enable non-temporal memset without
> > > > > > > ERMS, a 'non-temporal memset branch jump' is added in
> 'more_2x_vec'case.
> > > > > > >
> > > > > > > Test Results:
> > > > > > > thread: 1
> > > > > > > memset store value: 0
> > > > > > > function: memset_avx2_unaligned
> > > > > > >
> > > > > > > hygon1 arch
> > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > size new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > > 512 byte( > 8x loop case) 1
> > > > > > > 1MB 0.994
> > > > > > > 4MB 0.996
> > > > > > > 8MB 0.670
> > > > > > > 16MB 0.343
> > > > > > > 32MB 0.355
> > > > > > >
> > > > > > > hygon2 arch
> > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > size new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > 256 byte(4x - 8x vec case) 0.653
> > > > > > > 512 byte( > 8x loop case) 0.713
> > > > > > > 1MB 1
> > > > > > > 4MB 0.887
> > > > > > > 8MB 1.312
> > > > > > > 16MB 0.822
> > > > > > > 32MB 0.830
> > > > > > >
> > > > > > > hygon3 arch
> > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > size new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > > 512 byte( > 8x loop case) 1
> > > > > > > 1MB 1
> > > > > > > 4MB 0.990
> > > > > > > 8MB 0.737
> > > > > > > 16MB 0.390
> > > > > > > 32MB 0.401
> > > > > > >
> > > > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > > > branch case' when extra branch jump added. And with this
> > > > > > > patch, non-temporal stores can improve performance by 20% - 65%.
> > > > > > >
> > > > > > > amd-zen2-rome arch
> > > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > > size new performance / old
> performance
> > > > > > > 128 byte(2x -4x vec case) 0.986
> > > > > > > 256 byte(4x - 8x vec case) 1.057
> > > > > > > 512 byte( > 8x loop case) 1.005
> > > > > > > 1MB 0.999
> > > > > > > 4MB 0.998
> > > > > > > 8MB 1.014
> > > > > > > 16MB 1.156
> > > > > > > 32MB 0.634
> > > > > > >
> > > > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch
> case'
> > > > > > > when extra branch jump added. When size >=
> > > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > > > performance degradation is due to that amd-zen2 non_temporal
> > > > > > > perf is worse than temporal[3]. And in 32MB case, it can
> > > > > > > achieve performance improvement by
> > > > > > 37%.
> > > > > > >
> > > > > > > For intel arch, intel arch ERMS feature is enabled as
> > > > > > > default, so this patch has no effect for it.
> > > > > > >
> > > > > > > Note:
> > > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at
> > > > > > > providing high performance x86 processor for China server market.
> > > > > > >
> > > > > > > Ref:
> > > > > > > [1]
> > > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab805
> > > > > > > 73d66e
> > > > > > > 4ae5
> > > > > > > d94b094659094336da90f [2]
> > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg067
> > > > > > > 47.htm
> > > > > > > l
> > > > > > > [3]
> > > > > > >
> > > > > >
> > > >
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjA
> > > > Ei
> > > > > > l4
> > > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > > >
> > > > > > > Jira: HES-210
> > > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > > ---
> > > > > > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > > > +++++++++-
> > > > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git
> > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > index 88bf08e..063dd51 100644
> > > > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > > ja L(stosb_local)
> > > > > > > #endif
> > > > > > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> 4x]
> > > > > > > + /* If rdx is less than
> __x86_memset_non_temporal_threshold,
> > > > > > > + fallthrough goes to L(loop_4x_vec). Tests for
> > > > > > > + memset (2x, 4x]
> > > > > > > and (4x, 8x] jump to target. */
> > > > > > > L(more_2x_vec):
> > > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > > + /* Check non-temporal store threshold.
> > > > > > > + This is for the hardware which can not support
> > > > > > > +ERMS feature
> > > > or
> > > > > > > + ERMS fearure is disabled. */
> > > > > > > + cmp
> > > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > > + jae L(nt_memset)
> > > > > > > +#endif
> > > > > > > /* Store next 2x vec regardless. */
> > > > > > > VMOVU %VMM(0), (%rdi)
> > > > > > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > > --
> > > > > > > 2.7.4
> > > > > > >
> > > > > >
> > > > > > I am opposed to this patch getting in as is. I strongly
> > > > > > dislike the essentially redundant checks for
> > > > > > `__x86_rep_stosb_threshold` and
> > > > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > > > processors.
> > > > > > Further, in this case, it messes up the alignment of the loop.
> > > > > >
> > > > > > What I would like to do instead for processors which prefer
> > > > > > non-temporal but not erms, is to have them set
> > > > > > `__x86_rep_stosb_threshold ==
> > > > > > __x86_memset_non_temporal_threshold`
> > > > and prefer the `erms` version.
> > > > > >
> > > > > > Can you implement it like that instead?
> > > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > > If my understand is right, you means that If
> > > > > __x86_rep_stosb_threshold ==
> __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > > when
> > > > > init_cpu_features for servers without ERMS.
> > > > >
> > > > Yes, if you don't want memcpy to use `erms`, set
> > > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > > >
> > > > > However, if like this, I think we ignore the small size case.
> > > > > Though we can use non temporal now, But for "size <
> > > > > __x86_rep_stosb_threshold" case, it is forced to use erms whose
> > > > > performance
> > > > is very bad For example, hygon and amd zen2.
> > > >
> > > > Hmm, I'm not sure I understand what you mean.
> > > >
> > > > What I am arguing for is that a processor that wants non-temporal
> > > > but doesn't want erms should do the following:
> > > >
> > > > memset:
> > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold
> > > > == YOUR_PREFERED_NT_THRESHOLD`
> > > >
> > > > This will have the exact same behavior as your patch. Below NT
> > > > threshold it will use the normal temporal loop.
> > >
> > > [Feifei]Ok, maybe I understand your meanings.
> > > Consider we have a server support avx2, vec_size = 32, erms cpu flag is
> disabled.
> > > 1. set __x86_rep_stosb_threshold ==
> > > __x86_memset_non_temporal_threshold == 1MB 2. set 'Prefer ERMS' to
> force server to select erms function in 'IFUNC_SELECTOR'
> > > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> > >
> > > Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms"
> function.
> > > Second, in the memset_erms function, check size > 2 * vec_size, and then
> jump to 'stosb_more_2x_vec'
> > > Third, check size < _x86_rep_stosb_threshold, then go to normal
> > > branch. Here, if size = 2MB, we can jump to 'stosb_local', then check
> NT_threshold, we jump to NT loop.
> > > Is my right?
> >
> > Yes.
> >
> > >
> > > By the way, I just did a test that If we setting 'Prefer_ERMS', cpu
> > > will call 'memset_erms' in memset_erms.S, But we need call '
> memset_avx2_unaligned_erms '.
> > > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > > Line53: + if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > > + if (CPU_FEATURES_ARCH_P (cpu_features,
> Prefer_ERMS))
> > > + return OPTIMIZE (avx2_unaligned_erms);
> > > + }
> > > + else
> > > if (CPU_FEATURES_ARCH_P (cpu_features,
> Prefer_ERMS))
> > > return OPTIMIZE (erms);
> > >
> > > Or we add a new flag 'Prefer_VEC_ERMS'?
> > >
> >
> > Yeah, so I need to do some plumbing here.
> > `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> > For deciding between memset_<isa>_unaligned... we use:
> > `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
> >
> > Think we need to add a new flag along the lines of:
> > `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)` and make the
> > memset/memmove_<isa>_unaligned_erms use that flag.
> >
> > Then in dl-cacheinfo.h if we don't have ERMS, we setup stosb/movsb
> > thresholds s.t stosb/movsb code will just forward to non-temporal
> > codes.
[Feifei] Agree. NON_TEMPORAL flag is suitable for these usecase.
>
> I'm going to assume this is not going to make it into 2.40 w/ the freeze and the
> fact that we will need to do some changes in cpu-features/dl-cacheinfo.
>
> Before master re-opens (after the release) ill post patches to add support for
> this usecase.
[Feifei] OK, and I just have a question: when the master re-open again?
And I will prepare for the patch of this and push it to the community as soon as possible for reviewing.
In the next version, change will be as below:
1. add a new flag called "NON_TEMPORAL"
2. add Hygon arch branch in cpu-features.c and dl-cacheinfo.h, to split Hygon from AMD branch.
3.For Hygon arch branch, we enable NON_TEMPORAL flag
4. NON_TEMPORAL flag is set, __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold in dl-cacheinfo.h
> > > >
> > > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > > >
> > > > > Thus we add a nt jump for temporal branch, and for this, we can support
> that :
> > > > > 1. small size can use temporal for good performance rather than erms.
> > > > > 2.big size can use non temporal to achieve better performance.
> > > > >
> > > > >
> > > > > Best Regards
> > > > > Feifei
> > > > >
On Wed, Jul 10, 2024 at 2:39 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
>
>
>
> > -----邮件原件-----
> > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > 发送时间: 2024年7月9日 18:55
> > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> >
> > On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein <goldstein.w.n@gmail.com>
> > wrote:
> > >
> > > On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> > > >
> > > >
> > > >
> > > > > -----邮件原件-----
> > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > 发送时间: 2024年7月9日 12:19
> > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > >
> > > > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang <wangfeifei@hygon.cn>
> > wrote:
> > > > > >
> > > > > > Thanks for your kindly reply.
> > > > > >
> > > > > > > -----邮件原件-----
> > > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > > 发送时间: 2024年7月8日 13:29
> > > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > > > >
> > > > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang <wangfeifei@hygon.cn>
> > wrote:
> > > > > > > >
> > > > > > > > Currently, large memset perf can be improved with
> > > > > > > > non-temporal
> > > > > stores[1].
> > > > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > > > >
> > > > > > > > For some architectures, ERMS is disabled in cpu feature as
> > > > > > > > default For example, AMD zen3-milan[2], Or ERMS performance
> > > > > > > > is worse than vectorized loop, and it is disabled.
> > > > > > > > For example, HygonGenuine arch.
> > > > > > > >
> > > > > > > > Thus, to let these arch achieve perf improvement from
> > > > > > > > non-temporal memset, and enable non-temporal memset without
> > > > > > > > ERMS, a 'non-temporal memset branch jump' is added in
> > 'more_2x_vec'case.
> > > > > > > >
> > > > > > > > Test Results:
> > > > > > > > thread: 1
> > > > > > > > memset store value: 0
> > > > > > > > function: memset_avx2_unaligned
> > > > > > > >
> > > > > > > > hygon1 arch
> > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > size new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > > > 512 byte( > 8x loop case) 1
> > > > > > > > 1MB 0.994
> > > > > > > > 4MB 0.996
> > > > > > > > 8MB 0.670
> > > > > > > > 16MB 0.343
> > > > > > > > 32MB 0.355
> > > > > > > >
> > > > > > > > hygon2 arch
> > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > size new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > > 256 byte(4x - 8x vec case) 0.653
> > > > > > > > 512 byte( > 8x loop case) 0.713
> > > > > > > > 1MB 1
> > > > > > > > 4MB 0.887
> > > > > > > > 8MB 1.312
> > > > > > > > 16MB 0.822
> > > > > > > > 32MB 0.830
> > > > > > > >
> > > > > > > > hygon3 arch
> > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > size new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > > > 512 byte( > 8x loop case) 1
> > > > > > > > 1MB 1
> > > > > > > > 4MB 0.990
> > > > > > > > 8MB 0.737
> > > > > > > > 16MB 0.390
> > > > > > > > 32MB 0.401
> > > > > > > >
> > > > > > > > For hygon arch with , no performance degradation on '2x - 8x
> > > > > > > > branch case' when extra branch jump added. And with this
> > > > > > > > patch, non-temporal stores can improve performance by 20% - 65%.
> > > > > > > >
> > > > > > > > amd-zen2-rome arch
> > > > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > > > size new performance / old
> > performance
> > > > > > > > 128 byte(2x -4x vec case) 0.986
> > > > > > > > 256 byte(4x - 8x vec case) 1.057
> > > > > > > > 512 byte( > 8x loop case) 1.005
> > > > > > > > 1MB 0.999
> > > > > > > > 4MB 0.998
> > > > > > > > 8MB 1.014
> > > > > > > > 16MB 1.156
> > > > > > > > 32MB 0.634
> > > > > > > >
> > > > > > > > For amd-zen2-rome, no performance degradation on '2x - 8x branch
> > case'
> > > > > > > > when extra branch jump added. When size >=
> > > > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB' case,
> > > > > > > > performance degradation is due to that amd-zen2 non_temporal
> > > > > > > > perf is worse than temporal[3]. And in 32MB case, it can
> > > > > > > > achieve performance improvement by
> > > > > > > 37%.
> > > > > > > >
> > > > > > > > For intel arch, intel arch ERMS feature is enabled as
> > > > > > > > default, so this patch has no effect for it.
> > > > > > > >
> > > > > > > > Note:
> > > > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at
> > > > > > > > providing high performance x86 processor for China server market.
> > > > > > > >
> > > > > > > > Ref:
> > > > > > > > [1]
> > > > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab805
> > > > > > > > 73d66e
> > > > > > > > 4ae5
> > > > > > > > d94b094659094336da90f [2]
> > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg067
> > > > > > > > 47.htm
> > > > > > > > l
> > > > > > > > [3]
> > > > > > > >
> > > > > > >
> > > > >
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjA
> > > > > Ei
> > > > > > > l4
> > > > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > > > >
> > > > > > > > Jira: HES-210
> > > > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > > > ---
> > > > > > > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S | 10
> > > > > > > > +++++++++-
> > > > > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git
> > > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > index 88bf08e..063dd51 100644
> > > > > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > > > ja L(stosb_local)
> > > > > > > > #endif
> > > > > > > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x,
> > 4x]
> > > > > > > > + /* If rdx is less than
> > __x86_memset_non_temporal_threshold,
> > > > > > > > + fallthrough goes to L(loop_4x_vec). Tests for
> > > > > > > > + memset (2x, 4x]
> > > > > > > > and (4x, 8x] jump to target. */
> > > > > > > > L(more_2x_vec):
> > > > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > > > + /* Check non-temporal store threshold.
> > > > > > > > + This is for the hardware which can not support
> > > > > > > > +ERMS feature
> > > > > or
> > > > > > > > + ERMS fearure is disabled. */
> > > > > > > > + cmp
> > > > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > > > + jae L(nt_memset)
> > > > > > > > +#endif
> > > > > > > > /* Store next 2x vec regardless. */
> > > > > > > > VMOVU %VMM(0), (%rdi)
> > > > > > > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > > > --
> > > > > > > > 2.7.4
> > > > > > > >
> > > > > > >
> > > > > > > I am opposed to this patch getting in as is. I strongly
> > > > > > > dislike the essentially redundant checks for
> > > > > > > `__x86_rep_stosb_threshold` and
> > > > > > > `__x86_memset_non_temporal_threshold` that will occur on most
> > > > > processors.
> > > > > > > Further, in this case, it messes up the alignment of the loop.
> > > > > > >
> > > > > > > What I would like to do instead for processors which prefer
> > > > > > > non-temporal but not erms, is to have them set
> > > > > > > `__x86_rep_stosb_threshold ==
> > > > > > > __x86_memset_non_temporal_threshold`
> > > > > and prefer the `erms` version.
> > > > > > >
> > > > > > > Can you implement it like that instead?
> > > > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > > > If my understand is right, you means that If
> > > > > > __x86_rep_stosb_threshold ==
> > __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > > > when
> > > > > > init_cpu_features for servers without ERMS.
> > > > > >
> > > > > Yes, if you don't want memcpy to use `erms`, set
> > > > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > > > >
> > > > > > However, if like this, I think we ignore the small size case.
> > > > > > Though we can use non temporal now, But for "size <
> > > > > > __x86_rep_stosb_threshold" case, it is forced to use erms whose
> > > > > > performance
> > > > > is very bad For example, hygon and amd zen2.
> > > > >
> > > > > Hmm, I'm not sure I understand what you mean.
> > > > >
> > > > > What I am arguing for is that a processor that wants non-temporal
> > > > > but doesn't want erms should do the following:
> > > > >
> > > > > memset:
> > > > > `__x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold
> > > > > == YOUR_PREFERED_NT_THRESHOLD`
> > > > >
> > > > > This will have the exact same behavior as your patch. Below NT
> > > > > threshold it will use the normal temporal loop.
> > > >
> > > > [Feifei]Ok, maybe I understand your meanings.
> > > > Consider we have a server support avx2, vec_size = 32, erms cpu flag is
> > disabled.
> > > > 1. set __x86_rep_stosb_threshold ==
> > > > __x86_memset_non_temporal_threshold == 1MB 2. set 'Prefer ERMS' to
> > force server to select erms function in 'IFUNC_SELECTOR'
> > > > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> > > >
> > > > Firstly, no erms support, cpu can also call " memset_avx2_unaligned_erms"
> > function.
> > > > Second, in the memset_erms function, check size > 2 * vec_size, and then
> > jump to 'stosb_more_2x_vec'
> > > > Third, check size < _x86_rep_stosb_threshold, then go to normal
> > > > branch. Here, if size = 2MB, we can jump to 'stosb_local', then check
> > NT_threshold, we jump to NT loop.
> > > > Is my right?
> > >
> > > Yes.
> > >
> > > >
> > > > By the way, I just did a test that If we setting 'Prefer_ERMS', cpu
> > > > will call 'memset_erms' in memset_erms.S, But we need call '
> > memset_avx2_unaligned_erms '.
> > > > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > > > Line53: + if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > > > + if (CPU_FEATURES_ARCH_P (cpu_features,
> > Prefer_ERMS))
> > > > + return OPTIMIZE (avx2_unaligned_erms);
> > > > + }
> > > > + else
> > > > if (CPU_FEATURES_ARCH_P (cpu_features,
> > Prefer_ERMS))
> > > > return OPTIMIZE (erms);
> > > >
> > > > Or we add a new flag 'Prefer_VEC_ERMS'?
> > > >
> > >
> > > Yeah, so I need to do some plumbing here.
> > > `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> > > For deciding between memset_<isa>_unaligned... we use:
> > > `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
> > >
> > > Think we need to add a new flag along the lines of:
> > > `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)` and make the
> > > memset/memmove_<isa>_unaligned_erms use that flag.
> > >
> > > Then in dl-cacheinfo.h if we don't have ERMS, we setup stosb/movsb
> > > thresholds s.t stosb/movsb code will just forward to non-temporal
> > > codes.
> [Feifei] Agree. NON_TEMPORAL flag is suitable for these usecase.
> >
> > I'm going to assume this is not going to make it into 2.40 w/ the freeze and the
> > fact that we will need to do some changes in cpu-features/dl-cacheinfo.
> >
> > Before master re-opens (after the release) ill post patches to add support for
> > this usecase.
>
> [Feifei] OK, and I just have a question: when the master re-open again?
> And I will prepare for the patch of this and push it to the community as soon as possible for reviewing.
> In the next version, change will be as below:
> 1. add a new flag called "NON_TEMPORAL"
> 2. add Hygon arch branch in cpu-features.c and dl-cacheinfo.h, to split Hygon from AMD branch.
> 3.For Hygon arch branch, we enable NON_TEMPORAL flag
> 4. NON_TEMPORAL flag is set, __x86_rep_stosb_threshold == __x86_memset_non_temporal_threshold in dl-cacheinfo.h
>
I posted a patch for it, once it gets in you just need to set
`Prefer_Non_Temporal`
for your target.
> > > > >
> > > > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > > > >
> > > > > > Thus we add a nt jump for temporal branch, and for this, we can support
> > that :
> > > > > > 1. small size can use temporal for good performance rather than erms.
> > > > > > 2.big size can use non temporal to achieve better performance.
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > > Feifei
> > > > > >
> -----邮件原件-----
> 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> 发送时间: 2024年7月10日 14:53
> 收件人: Feifei Wang <wangfeifei@hygon.cn>
> 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
>
> On Wed, Jul 10, 2024 at 2:39 PM Feifei Wang <wangfeifei@hygon.cn> wrote:
> >
> >
> >
> > > -----邮件原件-----
> > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > 发送时间: 2024年7月9日 18:55
> > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > >
> > > On Tue, Jul 9, 2024 at 6:52 PM Noah Goldstein
> > > <goldstein.w.n@gmail.com>
> > > wrote:
> > > >
> > > > On Tue, Jul 9, 2024 at 6:20 PM Feifei Wang <wangfeifei@hygon.cn>
> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----邮件原件-----
> > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > 发送时间: 2024年7月9日 12:19
> > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without ERMS
> > > > > >
> > > > > > On Tue, Jul 9, 2024 at 10:58 AM Feifei Wang
> > > > > > <wangfeifei@hygon.cn>
> > > wrote:
> > > > > > >
> > > > > > > Thanks for your kindly reply.
> > > > > > >
> > > > > > > > -----邮件原件-----
> > > > > > > > 发件人: Noah Goldstein <goldstein.w.n@gmail.com>
> > > > > > > > 发送时间: 2024年7月8日 13:29
> > > > > > > > 收件人: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > > 抄送: libc-alpha@sourceware.org; Jing Li <lijing@hygon.cn>
> > > > > > > > 主题: Re: [PATCH] x86: Enable non-temporal memset without
> > > > > > > > ERMS
> > > > > > > >
> > > > > > > > On Fri, Jul 5, 2024 at 3:53 PM Feifei Wang
> > > > > > > > <wangfeifei@hygon.cn>
> > > wrote:
> > > > > > > > >
> > > > > > > > > Currently, large memset perf can be improved with
> > > > > > > > > non-temporal
> > > > > > stores[1].
> > > > > > > > > But it just supports the case in which ERMS feature is enabled.
> > > > > > > > >
> > > > > > > > > For some architectures, ERMS is disabled in cpu feature
> > > > > > > > > as default For example, AMD zen3-milan[2], Or ERMS
> > > > > > > > > performance is worse than vectorized loop, and it is disabled.
> > > > > > > > > For example, HygonGenuine arch.
> > > > > > > > >
> > > > > > > > > Thus, to let these arch achieve perf improvement from
> > > > > > > > > non-temporal memset, and enable non-temporal memset
> > > > > > > > > without ERMS, a 'non-temporal memset branch jump' is
> > > > > > > > > added in
> > > 'more_2x_vec'case.
> > > > > > > > >
> > > > > > > > > Test Results:
> > > > > > > > > thread: 1
> > > > > > > > > memset store value: 0
> > > > > > > > > function: memset_avx2_unaligned
> > > > > > > > >
> > > > > > > > > hygon1 arch
> > > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > > size new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > > > > 512 byte( > 8x loop case) 1
> > > > > > > > > 1MB 0.994
> > > > > > > > > 4MB 0.996
> > > > > > > > > 8MB 0.670
> > > > > > > > > 16MB 0.343
> > > > > > > > > 32MB 0.355
> > > > > > > > >
> > > > > > > > > hygon2 arch
> > > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > > size new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > > > 256 byte(4x - 8x vec case) 0.653
> > > > > > > > > 512 byte( > 8x loop case) 0.713
> > > > > > > > > 1MB 1
> > > > > > > > > 4MB 0.887
> > > > > > > > > 8MB 1.312
> > > > > > > > > 16MB 0.822
> > > > > > > > > 32MB 0.830
> > > > > > > > >
> > > > > > > > > hygon3 arch
> > > > > > > > > x86_memset_non_temporal_threshold = 8MB
> > > > > > > > > size new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case) 1
> > > > > > > > > 256 byte(4x - 8x vec case) 1
> > > > > > > > > 512 byte( > 8x loop case) 1
> > > > > > > > > 1MB 1
> > > > > > > > > 4MB 0.990
> > > > > > > > > 8MB 0.737
> > > > > > > > > 16MB 0.390
> > > > > > > > > 32MB 0.401
> > > > > > > > >
> > > > > > > > > For hygon arch with , no performance degradation on '2x
> > > > > > > > > - 8x branch case' when extra branch jump added. And with
> > > > > > > > > this patch, non-temporal stores can improve performance by 20%
> - 65%.
> > > > > > > > >
> > > > > > > > > amd-zen2-rome arch
> > > > > > > > > x86_memset_non_temporal_threshold = 12MB
> > > > > > > > > size new performance / old
> > > performance
> > > > > > > > > 128 byte(2x -4x vec case) 0.986
> > > > > > > > > 256 byte(4x - 8x vec case) 1.057
> > > > > > > > > 512 byte( > 8x loop case) 1.005
> > > > > > > > > 1MB 0.999
> > > > > > > > > 4MB 0.998
> > > > > > > > > 8MB 1.014
> > > > > > > > > 16MB 1.156
> > > > > > > > > 32MB 0.634
> > > > > > > > >
> > > > > > > > > For amd-zen2-rome, no performance degradation on '2x -
> > > > > > > > > 8x branch
> > > case'
> > > > > > > > > when extra branch jump added. When size >=
> > > > > > > > > x86_memset_non_temporal_threshold, for 'size = 16MB'
> > > > > > > > > case, performance degradation is due to that amd-zen2
> > > > > > > > > non_temporal perf is worse than temporal[3]. And in 32MB
> > > > > > > > > case, it can achieve performance improvement by
> > > > > > > > 37%.
> > > > > > > > >
> > > > > > > > > For intel arch, intel arch ERMS feature is enabled as
> > > > > > > > > default, so this patch has no effect for it.
> > > > > > > > >
> > > > > > > > > Note:
> > > > > > > > > Chengdu Haiguang IC Design Co., Ltd (Hygon) aims at
> > > > > > > > > providing high performance x86 processor for China server
> market.
> > > > > > > > >
> > > > > > > > > Ref:
> > > > > > > > > [1]
> > > > > > > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0a
> > > > > > > > > b805
> > > > > > > > > 73d66e
> > > > > > > > > 4ae5
> > > > > > > > > d94b094659094336da90f [2]
> > > > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2022-01/ms
> > > > > > > > > g067
> > > > > > > > > 47.htm
> > > > > > > > > l
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjA
> > > > > > Ei
> > > > > > > > l4
> > > > > > > > > P2uMjjQGLbLcU/edit?gid=120165109#gid=120165109
> > > > > > > > >
> > > > > > > > > Jira: HES-210
> > > > > > > > > Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
> > > > > > > > > Reviewed-by: Jing Li <lijing@hygon.cn>
> > > > > > > > > Change-Id: I5f9b032c848f4e9d2e54a08138af6b9ae71e1156
> > > > > > > > > ---
> > > > > > > > > sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S |
> > > > > > > > > 10
> > > > > > > > > +++++++++-
> > > > > > > > > 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > diff --git
> > > > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > > index 88bf08e..063dd51 100644
> > > > > > > > > ---
> > > > > > > > > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms
> > > > > > > > > +++ .S
> > > > > > > > > @@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
> > > > > > > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
> > > > > > > > > ja L(stosb_local)
> > > > > > > > > #endif
> > > > > > > > > - /* Fallthrough goes to L(loop_4x_vec). Tests for memset
> (2x,
> > > 4x]
> > > > > > > > > + /* If rdx is less than
> > > __x86_memset_non_temporal_threshold,
> > > > > > > > > + fallthrough goes to L(loop_4x_vec). Tests for
> > > > > > > > > + memset (2x, 4x]
> > > > > > > > > and (4x, 8x] jump to target. */
> > > > > > > > > L(more_2x_vec):
> > > > > > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > > > > > + /* Check non-temporal store threshold.
> > > > > > > > > + This is for the hardware which can not
> > > > > > > > > +support ERMS feature
> > > > > > or
> > > > > > > > > + ERMS fearure is disabled. */
> > > > > > > > > + cmp
> > > > > > > > __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> > > > > > > > > + jae L(nt_memset)
> > > > > > > > > +#endif
> > > > > > > > > /* Store next 2x vec regardless. */
> > > > > > > > > VMOVU %VMM(0), (%rdi)
> > > > > > > > > VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > > > > > --
> > > > > > > > > 2.7.4
> > > > > > > > >
> > > > > > > >
> > > > > > > > I am opposed to this patch getting in as is. I strongly
> > > > > > > > dislike the essentially redundant checks for
> > > > > > > > `__x86_rep_stosb_threshold` and
> > > > > > > > `__x86_memset_non_temporal_threshold` that will occur on
> > > > > > > > most
> > > > > > processors.
> > > > > > > > Further, in this case, it messes up the alignment of the loop.
> > > > > > > >
> > > > > > > > What I would like to do instead for processors which
> > > > > > > > prefer non-temporal but not erms, is to have them set
> > > > > > > > `__x86_rep_stosb_threshold ==
> > > > > > > > __x86_memset_non_temporal_threshold`
> > > > > > and prefer the `erms` version.
> > > > > > > >
> > > > > > > > Can you implement it like that instead?
> > > > > > > [Feifei] This is also a good choice. But I have a question for this:
> > > > > > > If my understand is right, you means that If
> > > > > > > __x86_rep_stosb_threshold ==
> > > __x86_memset_non_temporal_threshold, we can set "Prefer_ERMS"
> > > > > > when
> > > > > > > init_cpu_features for servers without ERMS.
> > > > > > >
> > > > > > Yes, if you don't want memcpy to use `erms`, set
> > > > > > `__x86_rep_movsb_stop_threshold == SIZE_MAX` as well.
> > > > > >
> > > > > > > However, if like this, I think we ignore the small size case.
> > > > > > > Though we can use non temporal now, But for "size <
> > > > > > > __x86_rep_stosb_threshold" case, it is forced to use erms
> > > > > > > whose performance
> > > > > > is very bad For example, hygon and amd zen2.
> > > > > >
> > > > > > Hmm, I'm not sure I understand what you mean.
> > > > > >
> > > > > > What I am arguing for is that a processor that wants
> > > > > > non-temporal but doesn't want erms should do the following:
> > > > > >
> > > > > > memset:
> > > > > > `__x86_rep_stosb_threshold ==
> > > > > > __x86_memset_non_temporal_threshold
> > > > > > == YOUR_PREFERED_NT_THRESHOLD`
> > > > > >
> > > > > > This will have the exact same behavior as your patch. Below NT
> > > > > > threshold it will use the normal temporal loop.
> > > > >
> > > > > [Feifei]Ok, maybe I understand your meanings.
> > > > > Consider we have a server support avx2, vec_size = 32, erms cpu
> > > > > flag is
> > > disabled.
> > > > > 1. set __x86_rep_stosb_threshold ==
> > > > > __x86_memset_non_temporal_threshold == 1MB 2. set 'Prefer ERMS'
> > > > > to
> > > force server to select erms function in 'IFUNC_SELECTOR'
> > > > > 3. we need to use memset 0 for 1KB memory(size = 1KB).
> > > > >
> > > > > Firstly, no erms support, cpu can also call "
> memset_avx2_unaligned_erms"
> > > function.
> > > > > Second, in the memset_erms function, check size > 2 * vec_size,
> > > > > and then
> > > jump to 'stosb_more_2x_vec'
> > > > > Third, check size < _x86_rep_stosb_threshold, then go to normal
> > > > > branch. Here, if size = 2MB, we can jump to 'stosb_local', then
> > > > > check
> > > NT_threshold, we jump to NT loop.
> > > > > Is my right?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > By the way, I just did a test that If we setting 'Prefer_ERMS',
> > > > > cpu will call 'memset_erms' in memset_erms.S, But we need call '
> > > memset_avx2_unaligned_erms '.
> > > > > Thus, maybe we need to do a change for IFUNC_SELECTOR:
> > > > > Line53: + if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)) {
> > > > > + if (CPU_FEATURES_ARCH_P (cpu_features,
> > > Prefer_ERMS))
> > > > > + return OPTIMIZE (avx2_unaligned_erms);
> > > > > + }
> > > > > + else
> > > > > if (CPU_FEATURES_ARCH_P (cpu_features,
> > > Prefer_ERMS))
> > > > > return OPTIMIZE (erms);
> > > > >
> > > > > Or we add a new flag 'Prefer_VEC_ERMS'?
> > > > >
> > > >
> > > > Yeah, so I need to do some plumbing here.
> > > > `Prefer_ERMS` controls the pure `rep stosb` impl for memset.
> > > > For deciding between memset_<isa>_unaligned... we use:
> > > > `CPU_FEATURE_USABLE_P (cpu_features, ERMS)`.
> > > >
> > > > Think we need to add a new flag along the lines of:
> > > > `CPU_FEATURES_ARCH_P (cpu_features, NON_TEMPORAL)` and make the
> > > > memset/memmove_<isa>_unaligned_erms use that flag.
> > > >
> > > > Then in dl-cacheinfo.h if we don't have ERMS, we setup stosb/movsb
> > > > thresholds s.t stosb/movsb code will just forward to non-temporal
> > > > codes.
> > [Feifei] Agree. NON_TEMPORAL flag is suitable for these usecase.
> > >
> > > I'm going to assume this is not going to make it into 2.40 w/ the
> > > freeze and the fact that we will need to do some changes in
> cpu-features/dl-cacheinfo.
> > >
> > > Before master re-opens (after the release) ill post patches to add
> > > support for this usecase.
> >
> > [Feifei] OK, and I just have a question: when the master re-open again?
> > And I will prepare for the patch of this and push it to the community as soon as
> possible for reviewing.
> > In the next version, change will be as below:
> > 1. add a new flag called "NON_TEMPORAL"
> > 2. add Hygon arch branch in cpu-features.c and dl-cacheinfo.h, to split Hygon
> from AMD branch.
> > 3.For Hygon arch branch, we enable NON_TEMPORAL flag 4.
> NON_TEMPORAL
> > flag is set, __x86_rep_stosb_threshold ==
> > __x86_memset_non_temporal_threshold in dl-cacheinfo.h
> >
>
> I posted a patch for it, once it gets in you just need to set
> `Prefer_Non_Temporal` for your target.
[Feifei]Ok, that's good.
After it merge in, I will add Hygon branch and set Prefer_Non_Temporal branch.
> > > > > >
> > > > > > Above NT threshold it will use nt-stores. It will never use `rep stosb`.
> > > > > > >
> > > > > > > Thus we add a nt jump for temporal branch, and for this, we
> > > > > > > can support
> > > that :
> > > > > > > 1. small size can use temporal for good performance rather than
> erms.
> > > > > > > 2.big size can use non temporal to achieve better performance.
> > > > > > >
> > > > > > >
> > > > > > > Best Regards
> > > > > > > Feifei
> > > > > > >
@@ -229,9 +229,17 @@ L(stosb_more_2x_vec):
cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
ja L(stosb_local)
#endif
- /* Fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
+ /* If rdx is less than __x86_memset_non_temporal_threshold,
+ fallthrough goes to L(loop_4x_vec). Tests for memset (2x, 4x]
and (4x, 8x] jump to target. */
L(more_2x_vec):
+#if defined USE_MULTIARCH && IS_IN (libc)
+ /* Check non-temporal store threshold.
+ This is for the hardware which can not support ERMS feature or
+ ERMS fearure is disabled. */
+ cmp __x86_memset_non_temporal_threshold(%rip), %RDX_LP
+ jae L(nt_memset)
+#endif
/* Store next 2x vec regardless. */
VMOVU %VMM(0), (%rdi)
VMOVU %VMM(0), (VEC_SIZE * 1)(%rdi)