[x86] Improve V[48]QI shifts on AVX512
Checks
Context | Check | Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm | success | Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 | success | Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm | success | Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 | success | Testing passed
Commit Message
The following one-line patch improves the code generated for V8QI and V4QI
shifts when AVX512BW and AVX512VL functionality is available.
For the testcase (from gcc.target/i386/vect-shiftv8qi.c):
typedef signed char v8qi __attribute__ ((__vector_size__ (8)));
v8qi foo (v8qi x) { return x >> 5; }
GCC with -O2 -march=cascadelake currently generates:
foo:    movl            $67372036, %eax
        vpsraw          $5, %xmm0, %xmm2
        vpbroadcastd    %eax, %xmm1
        movl            $117901063, %eax
        vpbroadcastd    %eax, %xmm3
        vmovdqa         %xmm1, %xmm0
        vmovdqa         %xmm3, -24(%rsp)
        vpternlogd      $120, -24(%rsp), %xmm2, %xmm0
        vpsubb          %xmm1, %xmm0, %xmm0
        ret
With this patch we now generate the much improved:
foo:    vpmovsxbw       %xmm0, %xmm0
        vpsraw          $5, %xmm0, %xmm0
        vpmovwb         %xmm0, %xmm0
        ret
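The three-instruction sequence widens each byte to a 16-bit lane, shifts there, and truncates back to bytes. A minimal sketch of that idea using GCC's generic vector extensions follows. The names v8hi, sra5_widen and count_mismatches are illustrative only and not part of the patch, and the sketch assumes a compiler that provides __builtin_convertvector (GCC 9 or later, or Clang) and shifts signed vector elements arithmetically, as GCC does.

```c
typedef signed char v8qi __attribute__ ((__vector_size__ (8)));
typedef short v8hi __attribute__ ((__vector_size__ (16)));

/* Widen, shift, narrow: the same shape as vpmovsxbw/vpsraw/vpmovwb.  */
v8qi
sra5_widen (v8qi x)
{
  v8hi wide = __builtin_convertvector (x, v8hi);  /* sign-extend bytes */
  wide = wide >> 5;                               /* 16-bit arithmetic shift */
  return __builtin_convertvector (wide, v8qi);    /* truncate to bytes */
}

/* Compare against the direct byte shift for every byte value.
   Returns the number of lanes that differ.  */
int
count_mismatches (void)
{
  int bad = 0;
  for (int v = -128; v <= 127; v++)
    {
      v8qi x = { 0 };
      /* Alternate v and -1 - v across lanes; both stay in [-128, 127].  */
      for (int i = 0; i < 8; i++)
        x[i] = (signed char) ((i & 1) ? v : -1 - v);
      v8qi got = sra5_widen (x);
      v8qi want = x >> 5;
      for (int i = 0; i < 8; i++)
        bad += got[i] != want[i];
    }
  return bad;
}
```

Calling count_mismatches () is expected to return 0, since a shifted sign-extended byte always fits back into a byte. Whether the compiler actually selects vpmovsxbw/vpsraw/vpmovwb for sra5_widen depends on the target flags and GCC version.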
This patch also fixes the FAILs of gcc.target/i386/vect-shiftv[48]qi.c
when run with the additional -march=cascadelake flag, by splitting these
tests into two; one form testing code generation with -msse2 (and
-mno-avx512vl) as originally intended, and the other testing AVX512
code generation with an explicit -march=cascadelake.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures. Ok for mainline?
2024-05-09 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
Don't attempt ix86_expand_vec_shift_qihi_constant on AVX512.
gcc/testsuite/ChangeLog
* gcc.target/i386/vect-shiftv4qi.c: Specify -mno-avx512vl.
* gcc.target/i386/vect-shiftv8qi.c: Likewise.
* gcc.target/i386/vect-shiftv4qi-2.c: New test case.
* gcc.target/i386/vect-shiftv8qi-2.c: Likewise.
Thanks in advance,
Roger
--
Comments
On Fri, May 10, 2024 at 6:26 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> The following one-line patch improves the code generated for V8QI and V4QI
> shifts when AVX512BW and AVX512VL functionality is available.
+ /* With AVX512 it's cheaper to do vpmovsxbw/op/vpmovwb. */
+ && !(TARGET_AVX512BW && TARGET_AVX512VL && TARGET_SSE4_1)
&& ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2))
I think TARGET_SSE4_1 alone is enough: with SSE4.1 and above it's always
better not to go into ix86_expand_vec_shift_qihi_constant.
Otherwise LGTM.
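To make the difference between the two guards concrete, here is a small stand-alone model. It is only a sketch: the function names and boolean parameters are hypothetical stand-ins for the TARGET_* macros in the hunk, and the second function is one reading of the suggestion above, not the committed code.

```c
#include <stdbool.h>

/* Guard as submitted: skip the constant-shift expander only when
   AVX512BW, AVX512VL and SSE4.1 are all available.  */
bool
try_constant_path_submitted (bool sse4_1, bool avx512bw, bool avx512vl)
{
  return !(avx512bw && avx512vl && sse4_1);
}

/* One reading of the suggestion: skip the constant-shift expander
   whenever SSE4.1 is available.  */
bool
try_constant_path_suggested (bool sse4_1, bool avx512bw, bool avx512vl)
{
  (void) avx512bw;
  (void) avx512vl;
  return !sse4_1;
}
```

Under this model the two guards agree for plain SSE2 and for AVX512BW+VL targets, and differ for targets with SSE4.1 but without AVX512BW+VL, where only the suggested form skips ix86_expand_vec_shift_qihi_constant.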
>
> For the testcase (from gcc.target/i386/vect-shiftv8qi.c):
>
> typedef signed char v8qi __attribute__ ((__vector_size__ (8)));
> v8qi foo (v8qi x) { return x >> 5; }
>
> GCC with -O2 -march=cascadelake currently generates:
>
> foo:    movl            $67372036, %eax
>         vpsraw          $5, %xmm0, %xmm2
>         vpbroadcastd    %eax, %xmm1
>         movl            $117901063, %eax
>         vpbroadcastd    %eax, %xmm3
>         vmovdqa         %xmm1, %xmm0
>         vmovdqa         %xmm3, -24(%rsp)
>         vpternlogd      $120, -24(%rsp), %xmm2, %xmm0
It looks like a missed optimization under AVX512, but it's a separate issue.
>         vpsubb          %xmm1, %xmm0, %xmm0
>         ret
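For readers decoding the quoted sequence: 67372036 is 0x04040404 and 117901063 is 0x07070707, so after the 16-bit vpsraw each byte is masked with 0x07, XORed with 0x04 and then has 0x04 subtracted, which sign-extends the three surviving bits. The sketch below emulates that per byte lane and checks it against a byte-wise arithmetic shift. The helper names are illustrative and not taken from GCC.

```c
#include <stdint.h>

/* Portable arithmetic shift right for a possibly negative int.  */
static int
asr (int v, int n)
{
  return v >= 0 ? v >> n : ~((~v) >> n);
}

/* One byte lane of the quoted sequence, given the byte as it sits in
   its 16-bit lane after vpsraw $5.  */
static uint8_t
fixup_lane (uint8_t shifted_byte)
{
  uint8_t t = 0x04 ^ (shifted_byte & 0x07);  /* vpternlogd $120 */
  return (uint8_t) (t - 0x04);               /* vpsubb */
}

/* Try every pair of adjacent bytes sharing a 16-bit lane.
   Returns the number of bytes that differ from a byte-wise shift.  */
int
count_mismatches_old_sequence (void)
{
  int bad = 0;
  for (int lo = 0; lo < 256; lo++)
    for (int hi = 0; hi < 256; hi++)
      {
        int lane = (hi << 8) | lo;
        int lane_signed = lane >= 0x8000 ? lane - 0x10000 : lane;
        unsigned shifted = (unsigned) asr (lane_signed, 5) & 0xffff;

        int slo = lo >= 128 ? lo - 256 : lo;  /* bytes as signed values */
        int shi = hi >= 128 ? hi - 256 : hi;
        bad += fixup_lane (shifted & 0xff) != (uint8_t) asr (slo, 5);
        bad += fixup_lane (shifted >> 8) != (uint8_t) asr (shi, 5);
      }
  return bad;
}
```

count_mismatches_old_sequence () is expected to return 0, which shows that the longer sequence computes the same result as the widen/shift/narrow form, only with more instructions and two broadcast constants.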
Many thanks for the speedy review and correction/improvement.
It's interesting that you spotted the ternlog "spill"...
I have a patch that rewrites ternlog handling that's been
waiting for stage 1, that would also fix this mem operand
issue. I hope to submit it for review this weekend.
Thanks again,
Roger
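As background on the ternlog instruction discussed here: the vpternlog immediate is an 8-entry truth table, indexed by one bit from each of the destination, first source and second source operands. The sketch below decodes an immediate and checks that 120 (0x78), the value in the quoted assembly, corresponds to a ^ (b & c). The function names are illustrative and are not GCC internals.

```c
/* Result bit of vpternlog for one bit position: the immediate is a
   truth table indexed by (a << 2) | (b << 1) | c, where a comes from
   the destination, b from the first source, c from the second.  */
static int
ternlog_bit (unsigned imm8, int a, int b, int c)
{
  return (imm8 >> ((a << 2) | (b << 1) | c)) & 1;
}

/* Returns 1 if IMM8 implements a ^ (b & c) for all eight inputs,
   0 otherwise.  */
int
imm_is_a_xor_b_and_c (unsigned imm8)
{
  for (int a = 0; a < 2; a++)
    for (int b = 0; b < 2; b++)
      for (int c = 0; c < 2; c++)
        if (ternlog_bit (imm8, a, b, c) != (a ^ (b & c)))
          return 0;
  return 1;
}
```

In the quoted sequence the c operand is the 0x07070707 constant read from the stack slot at -24(%rsp), which is the memory operand mentioned above.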
On Fri, May 10, 2024 at 3:41 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Many thanks for the speedy review and correction/improvement.
> It's interesting that you spotted the ternlog "spill"...
> I have a patch that rewrites ternlog handling that's been
> waiting for stage 1, that would also fix this mem operand
> issue. I hope to submit it for review this weekend.
I opened a PR for that. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
@@ -24212,6 +24212,8 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2)
   if (CONST_INT_P (op2)
       && (code == ASHIFT || code == LSHIFTRT || code == ASHIFTRT)
+      /* With AVX512 it's cheaper to do vpmovsxbw/op/vpmovwb.  */
+      && !(TARGET_AVX512BW && TARGET_AVX512VL && TARGET_SSE4_1)
       && ix86_expand_vec_shift_qihi_constant (code, qdest, qop1, qop2))
     {
       emit_move_insn (dest, gen_lowpart (qimode, qdest));
diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi-2.c
new file mode 100644
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=cascadelake" } */
+
+#define N 4
+
+typedef unsigned char __vu __attribute__ ((__vector_size__ (N)));
+typedef signed char __vi __attribute__ ((__vector_size__ (N)));
+
+__vu sll (__vu a, int n)
+{
+  return a << n;
+}
+
+__vu sll_c (__vu a)
+{
+  return a << 5;
+}
+
+/* { dg-final { scan-assembler-times "vpsllw" 2 } } */
+
+__vu srl (__vu a, int n)
+{
+  return a >> n;
+}
+
+__vu srl_c (__vu a)
+{
+  return a >> 5;
+}
+
+/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */
+
+__vi sra (__vi a, int n)
+{
+  return a >> n;
+}
+
+__vi sra_c (__vi a)
+{
+  return a >> 5;
+}
+
+/* { dg-final { scan-assembler-times "vpsraw" 2 } } */
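diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -msse2" } */
+/* { dg-options "-O2 -msse2 -mno-avx2 -mno-avx512vl" } */
 #define N 4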
diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi-2.c
new file mode 100644
@@ -0,0 +1,43 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=cascadelake" } */
+
+#define N 8
+
+typedef unsigned char __vu __attribute__ ((__vector_size__ (N)));
+typedef signed char __vi __attribute__ ((__vector_size__ (N)));
+
+__vu sll (__vu a, int n)
+{
+  return a << n;
+}
+
+__vu sll_c (__vu a)
+{
+  return a << 5;
+}
+
+/* { dg-final { scan-assembler-times "vpsllw" 2 } } */
+
+__vu srl (__vu a, int n)
+{
+  return a >> n;
+}
+
+__vu srl_c (__vu a)
+{
+  return a >> 5;
+}
+
+/* { dg-final { scan-assembler-times "vpsrlw" 2 } } */
+
+__vi sra (__vi a, int n)
+{
+  return a >> n;
+}
+
+__vi sra_c (__vi a)
+{
+  return a >> 5;
+}
+
+/* { dg-final { scan-assembler-times "vpsraw" 2 } } */
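diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -msse2" } */
+/* { dg-options "-O2 -msse2 -mno-avx2 -mno-avx512vl" } */
 #define N 8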