[2/6] Support Intel AMX-AVX512

Message ID 20241113084435.1784546-3-haochen.jiang@intel.com
State New
Series Support Intel Diamond Rapids AMX instructions

Checks

Context                                              Check  Description
linaro-tcwg-bot/tcwg_binutils_build--master-arm      fail   Patch failed to apply
linaro-tcwg-bot/tcwg_binutils_build--master-aarch64  fail   Patch failed to apply

Commit Message

Haochen Jiang Nov. 13, 2024, 8:44 a.m. UTC
  This patch adds support for AMX-AVX512. In the disassembler, we need to
manually change the names used for operand 3 to r32 when it is a register;
otherwise it will be printed as zmm.

gas/ChangeLog:

	* NEWS: Support Intel AMX-AVX512.
	* config/tc-i386.c: Add amx_avx512.
	* doc/c-i386.texi: Document .amx_avx512.
	* testsuite/gas/i386/i386.exp: Run AMX-AVX512 tests.
	* testsuite/gas/i386/x86-64.exp: Ditto.
	* testsuite/gas/i386/amx-avx512-inval.l: New test.
	* testsuite/gas/i386/amx-avx512-inval.s: Ditto.
	* testsuite/gas/i386/x86-64-amx-avx512-intel.d: Ditto.
	* testsuite/gas/i386/x86-64-amx-avx512.d: Ditto.
	* testsuite/gas/i386/x86-64-amx-avx512.s: Ditto.

opcodes/ChangeLog:

	* i386-dis-evex-len.h: Add EVEX_LEN_0F384A_X86_64_W_0,
	EVEX_LEN_0F386D_X86_64_W_0, EVEX_LEN_0F3A07_X86_64_W_0,
	EVEX_LEN_0F3A77_X86_64_W_0.
	* i386-dis-evex-prefix.h: Add PREFIX_EVEX_0F384A_W_0_L_2,
	PREFIX_EVEX_0F386D_W_0_L_2, PREFIX_EVEX_0F3A07_W_0_L_2,
	PREFIX_EVEX_0F3A77_W_0_L_2.
	* i386-dis-evex-w.h: Add EVEX_W_0F384A_X86_64, EVEX_W_0F386D_X86_64,
	EVEX_W_0F3A07_X86_64, EVEX_W_0F3A77_X86_64.
	* i386-dis-evex-x86-64.h: Add X86_64_EVEX_0F384A, X86_64_EVEX_0F386D,
	X86_64_EVEX_0F3A07, X86_64_EVEX_0F3A77.
	* i386-dis-evex.h: Ditto.
	* i386-dis.c (EVEX_LEN_0F384A_X86_64_W_0): New.
	(EVEX_LEN_0F386D_X86_64_W_0): Ditto.
	(EVEX_LEN_0F3A07_X86_64_W_0): Ditto.
	(EVEX_LEN_0F3A77_X86_64_W_0): Ditto.
	(MOD_EVEX_0F384A_X86_64_W_0): Ditto.
	(MOD_EVEX_0F386D_X86_64_W_0): Ditto.
	(MOD_EVEX_0F3A07_X86_64_W_0): Ditto.
	(MOD_EVEX_0F3A77_X86_64_W_0): Ditto.
	(PREFIX_EVEX_0F384A_W_0_L_2): Ditto.
	(PREFIX_EVEX_0F386D_W_0_L_2): Ditto.
	(PREFIX_EVEX_0F3A07_W_0_L_2): Ditto.
	(PREFIX_EVEX_0F3A77_W_0_L_2): Ditto.
	(EVEX_W_0F384A_X86_64): Ditto.
	(EVEX_W_0F386D_X86_64): Ditto.
	(EVEX_W_0F3A07_X86_64): Ditto.
	(EVEX_W_0F3A77_X86_64): Ditto.
	(X86_64_EVEX_0F384A): Ditto.
	(X86_64_EVEX_0F386D): Ditto.
	(X86_64_EVEX_0F3A07): Ditto.
	(X86_64_EVEX_0F3A77): Ditto.
	(OP_VEX): Add handler for dq_mode under EVEX512.
	* i386-gen.c (cpu_flag_init): Add CPU_AMX_AVX512_FLAGS and
	CPU_ANY_AMX_AVX512_FLAGS.
	* i386-init.h: Regenerated.
	* i386-mnem.h: Ditto.
	* i386-opc.h (CpuAMX_AVX512): New.
	(i386_cpu_flags): Add cpuamx_avx512.
	* i386-opc.tbl: Add AMX-AVX512 instructions.
	* i386-tbl.h: Regenerated.
---
 gas/NEWS                                      |    2 +
 gas/config/tc-i386.c                          |    1 +
 gas/doc/c-i386.texi                           |    4 +-
 gas/testsuite/gas/i386/amx-avx512-inval.l     |    7 +
 gas/testsuite/gas/i386/amx-avx512-inval.s     |   11 +
 gas/testsuite/gas/i386/i386.exp               |    1 +
 .../gas/i386/x86-64-amx-avx512-intel.d        |   35 +
 gas/testsuite/gas/i386/x86-64-amx-avx512.d    |   34 +
 gas/testsuite/gas/i386/x86-64-amx-avx512.s    |   55 +
 gas/testsuite/gas/i386/x86-64.exp             |    2 +
 opcodes/i386-dis-evex-len.h                   |   28 +
 opcodes/i386-dis-evex-prefix.h                |   27 +
 opcodes/i386-dis-evex-w.h                     |   16 +
 opcodes/i386-dis-evex-x86-64.h                |   20 +
 opcodes/i386-dis-evex.h                       |    8 +-
 opcodes/i386-dis.c                            |   22 +-
 opcodes/i386-gen.c                            |    3 +
 opcodes/i386-init.h                           |  766 ++---
 opcodes/i386-mnem.h                           | 2554 +++++++++--------
 opcodes/i386-opc.h                            |    3 +
 opcodes/i386-opc.tbl                          |   15 +
 opcodes/i386-tbl.h                            |  416 ++-
 22 files changed, 2244 insertions(+), 1786 deletions(-)
 create mode 100644 gas/testsuite/gas/i386/amx-avx512-inval.l
 create mode 100644 gas/testsuite/gas/i386/amx-avx512-inval.s
 create mode 100644 gas/testsuite/gas/i386/x86-64-amx-avx512-intel.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-amx-avx512.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-amx-avx512.s
  

Comments

Jan Beulich Nov. 14, 2024, 9:01 a.m. UTC | #1
On 13.11.2024 09:44, Haochen Jiang wrote:
> This patch adds support for AMX-AVX512. In the disassembler, we need to
> manually change the names used for operand 3 to r32 when it is a register;
> otherwise it will be printed as zmm.

Just to mention it in public as well, if only for posterity: The spec
looks suspicious in a number of aspects. Until its correctness is
confirmed, or it is corrected, there's no basis for approving these
changes. The last thing we want is to change encodings, perhaps even after
one form has been part of a release.

This also affects other parts of this series, for the avoidance of doubt.

Jan
  
Jan Beulich Nov. 15, 2024, 2:03 p.m. UTC | #2
On 13.11.2024 09:44, Haochen Jiang wrote:
> --- a/gas/NEWS
> +++ b/gas/NEWS
> @@ -1,5 +1,7 @@
>  -*- text -*-
>  
> +* Add support for Intel AMX-AVX512 instructions.
> +
>  * Add support for Intel AMX-TRANSPOSE instructions.

As they're closely related, I think all AMX-* together want to take just a
single list entry.

> --- a/gas/testsuite/gas/i386/i386.exp
> +++ b/gas/testsuite/gas/i386/i386.exp
> @@ -547,6 +547,7 @@ if [gas_32_check] then {
>      run_dump_test "avx10_2-256-miscs-intel"
>      run_list_test "msr_imm-inval"
>      run_list_test "amx-transpose-inval"
> +    run_list_test "amx-avx512-inval"

See comment on earlier patch.

> --- a/opcodes/i386-dis.c
> +++ b/opcodes/i386-dis.c
> @@ -592,6 +592,7 @@ fetch_error (const instr_info *ins)
>  #define VexGatherD { OP_VEX, vex_vsib_d_w_dq_mode }
>  #define VexGatherQ { OP_VEX, vex_vsib_q_w_dq_mode }
>  #define VexGdq { OP_VEX, dq_mode }
> +#define VexGd { OP_VEX, d_mode }

Why wouldn't VexGdq be suitable to use?

> @@ -13931,6 +13949,8 @@ OP_VEX (instr_info *ins, int bytemode, int sizeflag ATTRIBUTE_UNUSED)
>      case 512:
>        names = att_names_zmm;
>        ins->evex_used |= EVEX_len_used;
> +      if (bytemode == d_mode)
> +	names = att_names32;
>        break;
>      default:
>        abort ();

Irrespective of VexGd (i.e. d_mode) or VexGdq (dq_mode) - the GPR handling
imo simply wants pulling out of this switch().

> --- a/opcodes/i386-gen.c
> +++ b/opcodes/i386-gen.c
> @@ -265,6 +265,8 @@ static const dependency isa_dependencies[] =
>      "AMX_TILE" },
>    { "AMX_TRANSPOSE",
>      "AMX_TILE" },
> +  { "AMX_AVX512",
> +    "AMX_TILE|AVX10_2" },

This dependency looks certainly correct to add, yet how does that fit with
all insns only supporting VL=512, when AVX10 is specifically about permitting
vector lengths only up to 256 in hardware?

> --- a/opcodes/i386-opc.tbl
> +++ b/opcodes/i386-opc.tbl
> @@ -3204,6 +3204,19 @@ tconjtcmmimfp16ps, 0x6b, AMX_COMPLEX&AMX_TRANSPOSE, Modrm|Vex128|Space0F38|Src2V
>  
>  tconjtfp16, 0x666b, AMX_COMPLEX&AMX_TRANSPOSE, Modrm|Vex128|Space0F38|VexW0|NoSuf, { RegTMM, RegTMM }
>  
> +tcvtrowd2ps, 0xf34a, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
> +tcvtrowd2ps, 0xf307, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
> +
> +tcvtrowps2pbf16h, 0xf26d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
> +tcvtrowps2pbf16h, 0xf207, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
> +tcvtrowps2pbf16l, 0xf36d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
> +tcvtrowps2pbf16l, 0xf377, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
> +
> +tcvtrowps2phh, 0x6d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
> +tcvtrowps2phh, 0x07, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
> +tcvtrowps2phl, 0x666d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
> +tcvtrowps2phl, 0xf277, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
> +
>  tdpbf16ps, 0xf35c, AMX_BF16, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
>  tdpfp16ps, 0xf25c, AMX_FP16, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
>  tdpbssd, 0xf25e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
> @@ -3213,6 +3226,8 @@ tdpbsud, 0xf35e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM
>  
>  tileloadd, 0xf24b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex, RegTMM }
>  tileloaddt1, 0x664b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex, RegTMM }
> +tilemovrow, 0x664a, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
> +tilemovrow, 0x6607, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
>  tilestored, 0xf34b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { RegTMM, Unspecified|BaseIndex }
>  
>  tilerelease, 0x49c0, AMX_TILE, Vex128|Space0F38|VexW0|NoSuf, {}

Again I'm wondering why the additions are scattered around, rather than kept
together (and not going in the middle of other sub-groups). Hmm, now that I
look at this a 3rd time - is this perhaps an attempt to sort alphabetically?
Such sorting is imo fine as a secondary criterion; the first imo ought to be
the feature.

And just to mention it here again - this shouldn't go in without the encoding
anomalies sorted, one way or the other.

Jan
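
For readers checking the encodings under discussion, here is a minimal sketch of how the expected bytes in the patch's x86-64-amx-avx512.d test (for example 62 62 6e 48 4a f5 for tcvtrowd2ps %edx,%tmm5,%zmm30) map onto the EVEX fields named in the opcode table (Space0F38, EVex512, VexW0, and the F3 prefix carried in 0xf34a). The struct and function names are hypothetical and this is not binutils code; it assumes a register-register form with no displacement and ignores EVEX.V', masking and broadcast.

```c
#include <stdint.h>

/* Illustrative only: fields of a 4-byte EVEX prefix plus opcode and
   ModRM for a register-register instruction.  */
struct evex_fields
{
  unsigned map;      /* opcode map: 1 = 0F, 2 = 0F38, 3 = 0F3A */
  unsigned pp;       /* implied prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2 */
  unsigned w;        /* EVEX.W */
  unsigned vvvv;     /* register number from EVEX.vvvv (un-inverted) */
  unsigned vl_bits;  /* 128, 256 or 512, from EVEX.L'L */
  unsigned opcode;   /* opcode byte following the prefix */
  unsigned reg;      /* ModRM.reg extended by EVEX.R and EVEX.R' */
  unsigned rm;       /* ModRM.rm extended by EVEX.B */
};

/* Decode b[0..5] = 62 P1 P2 P3 opcode modrm.  Returns 0 on success,
   -1 if the bytes do not start an EVEX insn or use a reserved length.  */
static int
decode_evex_rr (const uint8_t *b, struct evex_fields *f)
{
  if (b[0] != 0x62)
    return -1;

  /* R, X, B and R' are stored inverted in the top nibble of P1.  */
  unsigned p1 = b[1] ^ 0xf0u;
  unsigned p2 = b[2];
  unsigned p3 = b[3];
  unsigned modrm = b[5];
  unsigned ll = (p3 >> 5) & 3;

  if (ll == 3)
    return -1;			/* Reserved vector length.  */

  f->map = p1 & 7;
  f->pp = p2 & 3;
  f->w = (p2 >> 7) & 1;
  f->vvvv = (~p2 >> 3) & 0xf;	/* vvvv is stored inverted as well.  */
  f->vl_bits = 128u << ll;
  f->opcode = b[4];
  f->reg = ((modrm >> 3) & 7) | (((p1 >> 7) & 1) << 3) | (((p1 >> 4) & 1) << 4);
  f->rm = (modrm & 7) | (((p1 >> 5) & 1) << 3);
  return 0;
}
```

Fed the six bytes above, the sketch yields map 2 (0F38), pp 2 (F3), W 0, vvvv 2 (%edx), a 512-bit length, opcode 0x4a, reg 30 (%zmm30) and rm 5 (%tmm5), which lines up with the tcvtrowd2ps entry in the quoted table.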
  
Haochen Jiang Nov. 19, 2024, 3:15 a.m. UTC | #3
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, November 15, 2024 10:04 PM
> 
> > --- a/opcodes/i386-dis.c
> > +++ b/opcodes/i386-dis.c
> > @@ -592,6 +592,7 @@ fetch_error (const instr_info *ins)
> >  #define VexGatherD { OP_VEX, vex_vsib_d_w_dq_mode }
> >  #define VexGatherQ { OP_VEX, vex_vsib_q_w_dq_mode }
> >  #define VexGdq { OP_VEX, dq_mode }
> > +#define VexGd { OP_VEX, d_mode }
> 
> Why wouldn't VexGdq be suitable to use?

VexGdq could be used, but its meaning is not as exact. I actually used VexGdq
at the very beginning, but eventually switched to d_mode for clarity, since
only r32 is permitted. I am OK to go either way.

> 
> > @@ -13931,6 +13949,8 @@ OP_VEX (instr_info *ins, int bytemode, int sizeflag ATTRIBUTE_UNUSED)
> >      case 512:
> >        names = att_names_zmm;
> >        ins->evex_used |= EVEX_len_used;
> > +      if (bytemode == d_mode)
> > +	names = att_names32;
> >        break;
> >      default:
> >        abort ();
> 
> Irrespective of VexGd (i.e. d_mode) or VexGdq (dq_mode) - the GPR handling
> imo simply wants pulling out of this switch().

Let me find a way to get it out. It falls to here actually due to VexGd/VexGdq.

> 
> > --- a/opcodes/i386-gen.c
> > +++ b/opcodes/i386-gen.c
> > @@ -265,6 +265,8 @@ static const dependency isa_dependencies[] =
> >      "AMX_TILE" },
> >    { "AMX_TRANSPOSE",
> >      "AMX_TILE" },
> > +  { "AMX_AVX512",
> > +    "AMX_TILE|AVX10_2" },
> 
> This dependency looks certainly correct to add, yet how does that fit with all
> insns only supporting VL=512, when AVX10 is specifically about permitting
> vector lengths only up to 256 in hardware?
>

I did not quite get the question. I guess your concern is whether there will
be an insn that only supports VL=128/256 for AVX10. I suppose there won't be,
or it would be quite disastrous.

> 
> And just to mention it here again - this shouldn't go in without the encoding
> anomalies sorted, one way or the other.

Yes, let's wait for that.

Thx,
Haochen
  
Jan Beulich Nov. 19, 2024, 8:56 a.m. UTC | #4
On 19.11.2024 04:15, Jiang, Haochen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Friday, November 15, 2024 10:04 PM
>>
>>> --- a/opcodes/i386-dis.c
>>> +++ b/opcodes/i386-dis.c
>>> @@ -592,6 +592,7 @@ fetch_error (const instr_info *ins)
>>>  #define VexGatherD { OP_VEX, vex_vsib_d_w_dq_mode }
>>>  #define VexGatherQ { OP_VEX, vex_vsib_q_w_dq_mode }
>>>  #define VexGdq { OP_VEX, dq_mode }
>>> +#define VexGd { OP_VEX, d_mode }
>>
>> Why wouldn't VexGdq be suitable to use?
> 
> VexGdq could be used but not that exact on meaning. I actually used VexGdq
> at the very beginning but eventually used d_mode due to only r32 is permitted
> for clearness. I am ok to go either way.

My take here is: Re-use what's available whenever possible, in preference
to adding something new.

>>> @@ -13931,6 +13949,8 @@ OP_VEX (instr_info *ins, int bytemode, int sizeflag ATTRIBUTE_UNUSED)
>>>      case 512:
>>>        names = att_names_zmm;
>>>        ins->evex_used |= EVEX_len_used;
>>> +      if (bytemode == d_mode)
>>> +	names = att_names32;
>>>        break;
>>>      default:
>>>        abort ();
>>
>> Irrespective of VexGd (i.e. d_mode) or VexGdq (dq_mode) - the GPR handling
>> imo simply wants pulling out of this switch().
> 
> Let me find a way to get it out. It falls to here actually due to VexGd/VexGdq.

Right, so taking care of the case e.g. ahead of the switch() would likely
cover both original and new use cases.

>>> --- a/opcodes/i386-gen.c
>>> +++ b/opcodes/i386-gen.c
>>> @@ -265,6 +265,8 @@ static const dependency isa_dependencies[] =
>>>      "AMX_TILE" },
>>>    { "AMX_TRANSPOSE",
>>>      "AMX_TILE" },
>>> +  { "AMX_AVX512",
>>> +    "AMX_TILE|AVX10_2" },
>>
>> This dependency looks certainly correct to add, yet how does that fit with all
>> insns only supporting VL=512, when AVX10 is specifically about permitting
>> vector lengths only up to 256 in hardware?
>>
> 
> I did not quite get the question. I guess your concern is whether it will be an
> insn only support VL=128/256 for AVX10. I suppose there won't be that or it
> will be quite disastrous.

It's a spec question: Why are 256 (and 128) bit forms not specified right
away? There's hardly any other insn in AVX10.2 that becomes unavailable
entirely when vsz512 is clear in the CPUID leaf. And those few insns then
disappear truly for a reason. Whereas the ones here "naturally" extend to
VL=256 and VL=128, given how their operation is described.

Jan
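
A minimal sketch of the shape suggested above: decide on general-purpose register names from the operand mode first, and only consult the vector length for genuine vector operands. This is a simplified stand-in, not the real OP_VEX; the enum, the name tables and vvvv_name() are hypothetical, and the dq_mode branch (64-bit vs. 32-bit names chosen by a W flag) is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical subset of operand modes.  */
enum bytemode { x_mode, d_mode, dq_mode };

/* Truncated name tables, enough for a demonstration.  */
static const char *const names32[] = { "%eax", "%ecx", "%edx", "%ebx" };
static const char *const names64[] = { "%rax", "%rcx", "%rdx", "%rbx" };
static const char *const names_xmm[] = { "%xmm0", "%xmm1", "%xmm2", "%xmm3" };
static const char *const names_ymm[] = { "%ymm0", "%ymm1", "%ymm2", "%ymm3" };
static const char *const names_zmm[] = { "%zmm0", "%zmm1", "%zmm2", "%zmm3" };

/* Return the name for the register encoded in vvvv, or NULL for an
   unsupported vector length.  */
static const char *
vvvv_name (enum bytemode mode, int rex_w, unsigned vl_bits, unsigned reg)
{
  const char *const *names;

  /* GPR operands are handled ahead of the switch: their names depend
     on the operand mode only, never on the vector length.  */
  if (mode == d_mode)
    names = names32;
  else if (mode == dq_mode)
    names = rex_w ? names64 : names32;
  else
    switch (vl_bits)
      {
      case 128: names = names_xmm; break;
      case 256: names = names_ymm; break;
      case 512: names = names_zmm; break;
      default: return NULL;
      }

  return names[reg & 3];
}
```

With this arrangement no per-length case needs a special exception for d_mode, so vvvv_name (d_mode, 0, 512, 2) gives %edx while vvvv_name (x_mode, 0, 512, 2) still gives %zmm2.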
  
Haochen Jiang Nov. 21, 2024, 6:01 a.m. UTC | #5
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, November 19, 2024 4:57 PM
> 
> On 19.11.2024 04:15, Jiang, Haochen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Friday, November 15, 2024 10:04 PM
> >>> --- a/opcodes/i386-gen.c
> >>> +++ b/opcodes/i386-gen.c
> >>> @@ -265,6 +265,8 @@ static const dependency isa_dependencies[] =
> >>>      "AMX_TILE" },
> >>>    { "AMX_TRANSPOSE",
> >>>      "AMX_TILE" },
> >>> +  { "AMX_AVX512",
> >>> +    "AMX_TILE|AVX10_2" },
> >>
> >> This dependency looks certainly correct to add, yet how does that fit with
> >> all insns only supporting VL=512, when AVX10 is specifically about permitting
> >> vector lengths only up to 256 in hardware?
> >>
> >
> > I did not quite get the question. I guess your concern is whether it will be an
> > insn only support VL=128/256 for AVX10. I suppose there won't be that or it
> > will be quite disastrous.
> 
> It's a spec question: Why are 256 (and 128) bit forms not specified right
> away? There's hardly any other insn in AVX10.2 that becomes unavailable
> entirely when vsz512 is clear in the CPUID leaf. And those few insns then
> disappear truly for a reason. Whereas the ones here "naturally" extend to
> VL=256 and VL=128, given how their operation is described.

Per my understanding, there needn't be a VL=128/256 form here. A row in tmm is
always 16 elements. With each element being FP32, it will always be 512 bits.

Thx,
Haochen
  
Jan Beulich Nov. 21, 2024, 10:14 a.m. UTC | #6
On 21.11.2024 07:01, Jiang, Haochen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, November 19, 2024 4:57 PM
>>
>> On 19.11.2024 04:15, Jiang, Haochen wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Friday, November 15, 2024 10:04 PM
>>>>> --- a/opcodes/i386-gen.c
>>>>> +++ b/opcodes/i386-gen.c
>>>>> @@ -265,6 +265,8 @@ static const dependency isa_dependencies[] =
>>>>>      "AMX_TILE" },
>>>>>    { "AMX_TRANSPOSE",
>>>>>      "AMX_TILE" },
>>>>> +  { "AMX_AVX512",
>>>>> +    "AMX_TILE|AVX10_2" },
>>>>
>>>> This dependency looks certainly correct to add, yet how does that fit with
>>>> all insns only supporting VL=512, when AVX10 is specifically about permitting
>>>> vector lengths only up to 256 in hardware?
>>>>
>>>
>>> I did not quite get the question. I guess your concern is whether it will be an
>>> insn only support VL=128/256 for AVX10. I suppose there won't be that or it
>>> will be quite disastrous.
>>
>> It's a spec question: Why are 256 (and 128) bit forms not specified right
>> away? There's hardly any other insn in AVX10.2 that becomes unavailable
>> entirely when vsz512 is clear in the CPUID leaf. And those few insns then
>> disappear truly for a reason. Whereas the ones here "naturally" extend to
>> VL=256 and VL=128, given how their operation is described.
> 
> Per my understanding, there needn't be VL=128/256 form here. A row in tmm is
> always 16 elements. With each element FP32, it will always be 512 bit.

That's not my understanding. For one the insns may move only part of a row
(controlled by the top two bits of the immediate). Plus the size of a row
also depends on tile configuration. See the "Operation" section of the insns.

Jan
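
The arithmetic behind the two positions can be sketched as follows. An AMX tile row holds at most 64 bytes (16 FP32 elements, i.e. 512 bits), but the configured row width in bytes can be smaller. The helper below is hypothetical and only does that arithmetic; it says nothing about how the instructions are actually specified, and it leaves the immediate's partial-row selection out entirely.

```c
/* Smallest vector length, in bits, that can hold one tile row of
   colsb bytes.  Returns 0 when colsb is outside the 1..64 range that
   a tile configuration allows.  Illustrative helper only.  */
static unsigned
min_vl_for_row (unsigned colsb)
{
  if (colsb == 0 || colsb > 64)
    return 0;
  if (colsb <= 16)
    return 128;
  if (colsb <= 32)
    return 256;
  return 512;
}
```

A full 64-byte row needs 512 bits, which is the case described as "always 512 bit"; a tile configured with 32-byte rows would fit in 256 bits, which is the case the review comment has in mind.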
  
Haochen Jiang Nov. 22, 2024, 2:43 a.m. UTC | #7
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Thursday, November 21, 2024 6:14 PM
> 
> On 21.11.2024 07:01, Jiang, Haochen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, November 19, 2024 4:57 PM
> >>
> >> It's a spec question: Why are 256 (and 128) bit forms not specified
> >> right away? There's hardly any other insn in AVX10.2 that becomes
> >> unavailable entirely when vsz512 is clear in the CPUID leaf. And
> >> those few insns then disappear truly for a reason. Whereas the ones
> >> here "naturally" extend to
> >> VL=256 and VL=128, given how their operation is described.
> >
> > Per my understanding, there needn't be VL=128/256 form here. A row in
> > tmm is always 16 elements. With each element FP32, it will always be 512 bit.
> 
> That's not my understanding. For one the insns may move only part of a row
> (controlled by the top two bits of the immediate). Plus the size of a row also
> depends on tile configuration. See the "Operation" section of the insns.

The problem is how you could know the row size and the immediate value before
you determine the register size. Otherwise it would overcomplicate the logic
for determining whether the inst is valid or not. The row size is in the config
and the immediate value needs extra interpretation. If either of the two is
configured wrongly, VL=128/256 will overflow. Zmm is the only safe choice here.

Even if the element does not hit the maximum of 16 rows here, the remaining
will be kept as zero, as a placeholder. So it is safe to move them all to zmm.

Thx,
Haochen
  
Jan Beulich Nov. 22, 2024, 9:29 a.m. UTC | #8
On 22.11.2024 03:43, Jiang, Haochen wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Thursday, November 21, 2024 6:14 PM
>>
>> On 21.11.2024 07:01, Jiang, Haochen wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Tuesday, November 19, 2024 4:57 PM
>>>>
>>>> It's a spec question: Why are 256 (and 128) bit forms not specified
>>>> right away? There's hardly any other insn in AVX10.2 that becomes
>>>> unavailable entirely when vsz512 is clear in the CPUID leaf. And
>>>> those few insns then disappear truly for a reason. Whereas the ones
>>>> here "naturally" extend to
>>>> VL=256 and VL=128, given how their operation is described.
>>>
>>> Per my understanding, there needn't be VL=128/256 form here. A row in
>>> tmm is always 16 elements. With each element FP32, it will always be 512 bit.
>>
>> That's not my understanding. For one the insns may move only part of a row
>> (controlled by the top two bits of the immediate). Plus the size of a row also
>> depends on tile configuration. See the "Operation" section of the insns.
> 
> The problem is how could you get the row size and immediate value before you
> determine the register size. Or it will over-complex the logic for determining
> whether the inst is valid or not. Row size is in config and immediate value needs
> extra interpretation.

All of this is, at least very likely, known to the person writing the code.

> One of two configured wrongly will cause VL=128/256
> get an overflow. Zmm is the only safe choice here. 

And the insn then entirely unavailable on AVX10.2/256.

Jan
  
Haochen Jiang Nov. 25, 2024, 3 a.m. UTC | #9
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, November 22, 2024 5:29 PM
> 
> On 22.11.2024 03:43, Jiang, Haochen wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Thursday, November 21, 2024 6:14 PM
> >>
> >> On 21.11.2024 07:01, Jiang, Haochen wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Tuesday, November 19, 2024 4:57 PM
> >>>>
> >>>> It's a spec question: Why are 256 (and 128) bit forms not specified
> >>>> right away? There's hardly any other insn in AVX10.2 that becomes
> >>>> unavailable entirely when vsz512 is clear in the CPUID leaf. And
> >>>> those few insns then disappear truly for a reason. Whereas the ones
> >>>> here "naturally" extend to
> >>>> VL=256 and VL=128, given how their operation is described.
> >>>
> >>> Per my understanding, there needn't be VL=128/256 form here. A row
> >>> in tmm is always 16 elements. With each element FP32, it will always be
> >>> 512 bit.
> >>
> >> That's not my understanding. For one the insns may move only part of
> >> a row (controlled by the top two bits of the immediate). Plus the
> >> size of a row also depends on tile configuration. See the "Operation"
> >> section of the insns.
> >
> > The problem is how could you get the row size and immediate value
> > before you determine the register size. Or it will over-complex the
> > logic for determining whether the inst is valid or not. Row size is in
> > config and immediate value needs extra interpretation.
> 
> All of this is, at least very likely, known to the person writing the code.
> 
> > One of two configured wrongly will cause VL=128/256 get an overflow.
> > Zmm is the only safe choice here.
> 
> And the insn then entirely unavailable on AVX10.2/256.
> 

That is true. It is not available with only AVX10.2/256, since it will always
need to use zmm. We need AVX10.2/512 enabled.

Thx,
Haochen
  
Jan Beulich Nov. 25, 2024, 8:12 a.m. UTC | #10
On 25.11.2024 04:00, Jiang, Haochen wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Friday, November 22, 2024 5:29 PM
>>
>> On 22.11.2024 03:43, Jiang, Haochen wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Thursday, November 21, 2024 6:14 PM
>>>>
>>>> On 21.11.2024 07:01, Jiang, Haochen wrote:
>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>> Sent: Tuesday, November 19, 2024 4:57 PM
>>>>>>
>>>>>> It's a spec question: Why are 256 (and 128) bit forms not specified
>>>>>> right away? There's hardly any other insn in AVX10.2 that becomes
>>>>>> unavailable entirely when vsz512 is clear in the CPUID leaf. And
>>>>>> those few insns then disappear truly for a reason. Whereas the ones
>>>>>> here "naturally" extend to
>>>>>> VL=256 and VL=128, given how their operation is described.
>>>>>
>>>>> Per my understanding, there needn't be VL=128/256 form here. A row
>>>>> in tmm is always 16 elements. With each element FP32, it will always be
>>>>> 512 bit.
>>>>
>>>> That's not my understanding. For one the insns may move only part of
>>>> a row (controlled by the top two bits of the immediate). Plus the
>>>> size of a row also depends on tile configuration. See the "Operation"
>>>> section of the insns.
>>>
>>> The problem is how could you get the row size and immediate value
>>> before you determine the register size. Or it will over-complex the
>>> logic for determining whether the inst is valid or not. Row size is in
>>> config and immediate value needs extra interpretation.
>>
>> All of this is, at least very likely, known to the person writing the code.
>>
>>> One of two configured wrongly will cause VL=128/256 get an overflow.
>>> Zmm is the only safe choice here.
>>
>> And the insn then entirely unavailable on AVX10.2/256.
>>
> 
> That is true. It is not available under AVX10.2/256 only since it will always need
> to use zmm. We need AVX10.2/512 enabled.

IOW a first stain on the shiny new AVX10, even before hardware is available.
Interesting, especially when it would be straightforward to avoid.

Jan
  

Patch

diff --git a/gas/NEWS b/gas/NEWS
index 2a8f79b6360..9575dcdaaa1 100644
--- a/gas/NEWS
+++ b/gas/NEWS
@@ -1,5 +1,7 @@ 
 -*- text -*-
 
+* Add support for Intel AMX-AVX512 instructions.
+
 * Add support for Intel AMX-TRANSPOSE instructions.
 
 * Add support for Intel MSR_IMM instructions.
diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
index c28d98109d3..9e54aae65fa 100644
--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ -1183,6 +1183,7 @@  static const arch_entry cpu_arch[] =
   SUBARCH (amx_fp16, AMX_FP16, ANY_AMX_FP16, false),
   SUBARCH (amx_complex, AMX_COMPLEX, ANY_AMX_COMPLEX, false),
   SUBARCH (amx_transpose, AMX_TRANSPOSE, ANY_AMX_TRANSPOSE, false),
+  SUBARCH (amx_avx512, AMX_AVX512, ANY_AMX_AVX512, false),
   SUBARCH (amx_tile, AMX_TILE, ANY_AMX_TILE, false),
   SUBARCH (movdiri, MOVDIRI, MOVDIRI, false),
   SUBARCH (movdir64b, MOVDIR64B, MOVDIR64B, false),
diff --git a/gas/doc/c-i386.texi b/gas/doc/c-i386.texi
index 84bb875791d..dd2e422e323 100644
--- a/gas/doc/c-i386.texi
+++ b/gas/doc/c-i386.texi
@@ -229,6 +229,7 @@  accept various extension mnemonics.  For example,
 @code{amx_fp16},
 @code{amx_complex},
 @code{amx_transpose},
+@code{amx_avx512},
 @code{amx_tile},
 @code{vmx},
 @code{vmfunc},
@@ -1701,7 +1702,8 @@  supported on the CPU specified.  The choices for @var{cpu_type} are:
 @item @samp{.shstk} @tab @samp{.gfni} @tab @samp{.vaes} @tab @samp{.vpclmulqdq}
 @item @samp{.movdiri} @tab @samp{.movdir64b} @tab @samp{.enqcmd} @tab @samp{.tsxldtrk}
 @item @samp{.amx_int8} @tab @samp{.amx_bf16} @tab @samp{.amx_fp16}
-@item @samp{.amx_complex} @tab @samp{.amx_transpose} @tab @samp{.amx_tile}
+@item @samp{.amx_complex} @tab @samp{.amx_transpose} @tab @samp{.amx_avx512}
+@item @samp{.amx_tile}
 @item @samp{.kl} @tab @samp{.widekl} @tab @samp{.uintr} @tab @samp{.hreset}
 @item @samp{.3dnow} @tab @samp{.3dnowa} @tab @samp{.sse4a} @tab @samp{.sse5}
 @item @samp{.syscall} @tab @samp{.rdtscp} @tab @samp{.svme}
diff --git a/gas/testsuite/gas/i386/amx-avx512-inval.l b/gas/testsuite/gas/i386/amx-avx512-inval.l
new file mode 100644
index 00000000000..0cfe3a7bf2f
--- /dev/null
+++ b/gas/testsuite/gas/i386/amx-avx512-inval.l
@@ -0,0 +1,7 @@ 
+.* Assembler messages:
+.*:6: Error: `tcvtrowd2ps' is only supported in 64-bit mode
+.*:7: Error: `tcvtrowps2pbf16h' is only supported in 64-bit mode
+.*:8: Error: `tcvtrowps2pbf16l' is only supported in 64-bit mode
+.*:9: Error: `tcvtrowps2phh' is only supported in 64-bit mode
+.*:10: Error: `tcvtrowps2phl' is only supported in 64-bit mode
+.*:11: Error: `tilemovrow' is only supported in 64-bit mode
diff --git a/gas/testsuite/gas/i386/amx-avx512-inval.s b/gas/testsuite/gas/i386/amx-avx512-inval.s
new file mode 100644
index 00000000000..2e7a6af2e60
--- /dev/null
+++ b/gas/testsuite/gas/i386/amx-avx512-inval.s
@@ -0,0 +1,11 @@ 
+# Check	Illegal AMX-AVX512 instructions
+
+	.allow_index_reg
+	.text
+_start:
+	tcvtrowd2ps	%edx, %tmm5, %zmm30
+	tcvtrowps2pbf16h	%edx, %tmm5, %zmm30
+	tcvtrowps2pbf16l	%edx, %tmm5, %zmm30
+	tcvtrowps2phh	%edx, %tmm5, %zmm30
+	tcvtrowps2phl	%edx, %tmm5, %zmm30
+	tilemovrow	%edx, %tmm5, %zmm30
diff --git a/gas/testsuite/gas/i386/i386.exp b/gas/testsuite/gas/i386/i386.exp
index 71abb0fc66c..acc1e2b9a63 100644
--- a/gas/testsuite/gas/i386/i386.exp
+++ b/gas/testsuite/gas/i386/i386.exp
@@ -547,6 +547,7 @@  if [gas_32_check] then {
     run_dump_test "avx10_2-256-miscs-intel"
     run_list_test "msr_imm-inval"
     run_list_test "amx-transpose-inval"
+    run_list_test "amx-avx512-inval"
     run_list_test "sg"
     run_dump_test "clzero"
     run_dump_test "invlpgb"
diff --git a/gas/testsuite/gas/i386/x86-64-amx-avx512-intel.d b/gas/testsuite/gas/i386/x86-64-amx-avx512-intel.d
new file mode 100644
index 00000000000..06ef2293bb9
--- /dev/null
+++ b/gas/testsuite/gas/i386/x86-64-amx-avx512-intel.d
@@ -0,0 +1,35 @@ 
+#objdump: -dw -Mintel
+#name: x86_64 AMX-AVX512 insns (Intel disassembly)
+#source: x86-64-amx-avx512.s
+
+.*: +file format .*
+
+Disassembly of section \.text:
+
+#...
+[a-f0-9]+ <_intel>:
+\s*[a-f0-9]+:\s*62 62 6e 48 4a f5\s+tcvtrowd2ps zmm30,tmm5,edx
+\s*[a-f0-9]+:\s*62 62 6e 48 4a f2\s+tcvtrowd2ps zmm30,tmm2,edx
+\s*[a-f0-9]+:\s*62 63 7e 48 07 f5 7b\s+tcvtrowd2ps zmm30,tmm5,0x7b
+\s*[a-f0-9]+:\s*62 63 7e 48 07 f2 7b\s+tcvtrowd2ps zmm30,tmm2,0x7b
+\s*[a-f0-9]+:\s*62 62 6f 48 6d f5\s+tcvtrowps2pbf16h zmm30,tmm5,edx
+\s*[a-f0-9]+:\s*62 62 6f 48 6d f2\s+tcvtrowps2pbf16h zmm30,tmm2,edx
+\s*[a-f0-9]+:\s*62 63 7f 48 07 f5 7b\s+tcvtrowps2pbf16h zmm30,tmm5,0x7b
+\s*[a-f0-9]+:\s*62 63 7f 48 07 f2 7b\s+tcvtrowps2pbf16h zmm30,tmm2,0x7b
+\s*[a-f0-9]+:\s*62 62 6e 48 6d f5\s+tcvtrowps2pbf16l zmm30,tmm5,edx
+\s*[a-f0-9]+:\s*62 62 6e 48 6d f2\s+tcvtrowps2pbf16l zmm30,tmm2,edx
+\s*[a-f0-9]+:\s*62 63 7e 48 77 f5 7b\s+tcvtrowps2pbf16l zmm30,tmm5,0x7b
+\s*[a-f0-9]+:\s*62 63 7e 48 77 f2 7b\s+tcvtrowps2pbf16l zmm30,tmm2,0x7b
+\s*[a-f0-9]+:\s*62 62 6c 48 6d f5\s+tcvtrowps2phh zmm30,tmm5,edx
+\s*[a-f0-9]+:\s*62 62 6c 48 6d f2\s+tcvtrowps2phh zmm30,tmm2,edx
+\s*[a-f0-9]+:\s*62 63 7c 48 07 f5 7b\s+tcvtrowps2phh zmm30,tmm5,0x7b
+\s*[a-f0-9]+:\s*62 63 7c 48 07 f2 7b\s+tcvtrowps2phh zmm30,tmm2,0x7b
+\s*[a-f0-9]+:\s*62 62 6d 48 6d f5\s+tcvtrowps2phl zmm30,tmm5,edx
+\s*[a-f0-9]+:\s*62 62 6d 48 6d f2\s+tcvtrowps2phl zmm30,tmm2,edx
+\s*[a-f0-9]+:\s*62 63 7f 48 77 f5 7b\s+tcvtrowps2phl zmm30,tmm5,0x7b
+\s*[a-f0-9]+:\s*62 63 7f 48 77 f2 7b\s+tcvtrowps2phl zmm30,tmm2,0x7b
+\s*[a-f0-9]+:\s*62 62 6d 48 4a f5\s+tilemovrow zmm30,tmm5,edx
+\s*[a-f0-9]+:\s*62 62 6d 48 4a f2\s+tilemovrow zmm30,tmm2,edx
+\s*[a-f0-9]+:\s*62 63 7d 48 07 f5 7b\s+tilemovrow zmm30,tmm5,0x7b
+\s*[a-f0-9]+:\s*62 63 7d 48 07 f2 7b\s+tilemovrow zmm30,tmm2,0x7b
+#pass
diff --git a/gas/testsuite/gas/i386/x86-64-amx-avx512.d b/gas/testsuite/gas/i386/x86-64-amx-avx512.d
new file mode 100644
index 00000000000..410588d494e
--- /dev/null
+++ b/gas/testsuite/gas/i386/x86-64-amx-avx512.d
@@ -0,0 +1,34 @@ 
+#objdump: -dw
+#name: x86_64 AMX-AVX512 insns
+#source: x86-64-amx-avx512.s
+
+.*: +file format .*
+
+Disassembly of section \.text:
+
+0+ <_start>:
+\s*[a-f0-9]+:\s*62 62 6e 48 4a f5\s+tcvtrowd2ps %edx,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 62 6e 48 4a f2\s+tcvtrowd2ps %edx,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 63 7e 48 07 f5 7b\s+tcvtrowd2ps \$0x7b,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 63 7e 48 07 f2 7b\s+tcvtrowd2ps \$0x7b,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 62 6f 48 6d f5\s+tcvtrowps2pbf16h %edx,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 62 6f 48 6d f2\s+tcvtrowps2pbf16h %edx,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 63 7f 48 07 f5 7b\s+tcvtrowps2pbf16h \$0x7b,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 63 7f 48 07 f2 7b\s+tcvtrowps2pbf16h \$0x7b,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 62 6e 48 6d f5\s+tcvtrowps2pbf16l %edx,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 62 6e 48 6d f2\s+tcvtrowps2pbf16l %edx,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 63 7e 48 77 f5 7b\s+tcvtrowps2pbf16l \$0x7b,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 63 7e 48 77 f2 7b\s+tcvtrowps2pbf16l \$0x7b,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 62 6c 48 6d f5\s+tcvtrowps2phh %edx,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 62 6c 48 6d f2\s+tcvtrowps2phh %edx,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 63 7c 48 07 f5 7b\s+tcvtrowps2phh \$0x7b,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 63 7c 48 07 f2 7b\s+tcvtrowps2phh \$0x7b,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 62 6d 48 6d f5\s+tcvtrowps2phl %edx,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 62 6d 48 6d f2\s+tcvtrowps2phl %edx,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 63 7f 48 77 f5 7b\s+tcvtrowps2phl \$0x7b,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 63 7f 48 77 f2 7b\s+tcvtrowps2phl \$0x7b,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 62 6d 48 4a f5\s+tilemovrow %edx,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 62 6d 48 4a f2\s+tilemovrow %edx,%tmm2,%zmm30
+\s*[a-f0-9]+:\s*62 63 7d 48 07 f5 7b\s+tilemovrow \$0x7b,%tmm5,%zmm30
+\s*[a-f0-9]+:\s*62 63 7d 48 07 f2 7b\s+tilemovrow \$0x7b,%tmm2,%zmm30
+#pass
diff --git a/gas/testsuite/gas/i386/x86-64-amx-avx512.s b/gas/testsuite/gas/i386/x86-64-amx-avx512.s
new file mode 100644
index 00000000000..0faecfde820
--- /dev/null
+++ b/gas/testsuite/gas/i386/x86-64-amx-avx512.s
@@ -0,0 +1,55 @@ 
+# Check 64bit AMX-AVX512 instructions
+
+	.text
+_start:
+	tcvtrowd2ps	%edx, %tmm5, %zmm30
+	tcvtrowd2ps	%edx, %tmm2, %zmm30
+	tcvtrowd2ps	$123, %tmm5, %zmm30
+	tcvtrowd2ps	$123, %tmm2, %zmm30
+	tcvtrowps2pbf16h	%edx, %tmm5, %zmm30
+	tcvtrowps2pbf16h	%edx, %tmm2, %zmm30
+	tcvtrowps2pbf16h	$123, %tmm5, %zmm30
+	tcvtrowps2pbf16h	$123, %tmm2, %zmm30
+	tcvtrowps2pbf16l	%edx, %tmm5, %zmm30
+	tcvtrowps2pbf16l	%edx, %tmm2, %zmm30
+	tcvtrowps2pbf16l	$123, %tmm5, %zmm30
+	tcvtrowps2pbf16l	$123, %tmm2, %zmm30
+	tcvtrowps2phh	%edx, %tmm5, %zmm30
+	tcvtrowps2phh	%edx, %tmm2, %zmm30
+	tcvtrowps2phh	$123, %tmm5, %zmm30
+	tcvtrowps2phh	$123, %tmm2, %zmm30
+	tcvtrowps2phl	%edx, %tmm5, %zmm30
+	tcvtrowps2phl	%edx, %tmm2, %zmm30
+	tcvtrowps2phl	$123, %tmm5, %zmm30
+	tcvtrowps2phl	$123, %tmm2, %zmm30
+	tilemovrow	%edx, %tmm5, %zmm30
+	tilemovrow	%edx, %tmm2, %zmm30
+	tilemovrow	$123, %tmm5, %zmm30
+	tilemovrow	$123, %tmm2, %zmm30
+
+_intel:
+	.intel_syntax noprefix
+	tcvtrowd2ps	zmm30, tmm5, edx
+	tcvtrowd2ps	zmm30, tmm2, edx
+	tcvtrowd2ps	zmm30, tmm5, 123
+	tcvtrowd2ps	zmm30, tmm2, 123
+	tcvtrowps2pbf16h	zmm30, tmm5, edx
+	tcvtrowps2pbf16h	zmm30, tmm2, edx
+	tcvtrowps2pbf16h	zmm30, tmm5, 123
+	tcvtrowps2pbf16h	zmm30, tmm2, 123
+	tcvtrowps2pbf16l	zmm30, tmm5, edx
+	tcvtrowps2pbf16l	zmm30, tmm2, edx
+	tcvtrowps2pbf16l	zmm30, tmm5, 123
+	tcvtrowps2pbf16l	zmm30, tmm2, 123
+	tcvtrowps2phh	zmm30, tmm5, edx
+	tcvtrowps2phh	zmm30, tmm2, edx
+	tcvtrowps2phh	zmm30, tmm5, 123
+	tcvtrowps2phh	zmm30, tmm2, 123
+	tcvtrowps2phl	zmm30, tmm5, edx
+	tcvtrowps2phl	zmm30, tmm2, edx
+	tcvtrowps2phl	zmm30, tmm5, 123
+	tcvtrowps2phl	zmm30, tmm2, 123
+	tilemovrow	zmm30, tmm5, edx
+	tilemovrow	zmm30, tmm2, edx
+	tilemovrow	zmm30, tmm5, 123
+	tilemovrow	zmm30, tmm2, 123
diff --git a/gas/testsuite/gas/i386/x86-64.exp b/gas/testsuite/gas/i386/x86-64.exp
index 1d54d7700e2..131e598e02a 100644
--- a/gas/testsuite/gas/i386/x86-64.exp
+++ b/gas/testsuite/gas/i386/x86-64.exp
@@ -527,6 +527,8 @@  run_list_test "x86-64-msr_imm-inval"
 run_dump_test "x86-64-amx-transpose"
 run_dump_test "x86-64-amx-transpose-intel"
 run_list_test "x86-64-amx-transpose-inval"
+run_dump_test "x86-64-amx-avx512"
+run_dump_test "x86-64-amx-avx512-intel"
 run_dump_test "x86-64-clzero"
 run_dump_test "x86-64-mwaitx-bdver4"
 run_list_test "x86-64-mwaitx-reg"
diff --git a/opcodes/i386-dis-evex-len.h b/opcodes/i386-dis-evex-len.h
index 24cc7b2e027..276749e1d54 100644
--- a/opcodes/i386-dis-evex-len.h
+++ b/opcodes/i386-dis-evex-len.h
@@ -44,6 +44,13 @@  static const struct dis386 evex_len_table[][3] = {
     { "vperm%DQ",	{ XM, Vex, EXx }, PREFIX_DATA },
   },
 
+  /* EVEX_LEN_0F384A_X86_64_W_0 */
+  {
+    { Bad_Opcode },
+    { Bad_Opcode },
+    { PREFIX_TABLE (PREFIX_EVEX_0F384A_X86_64_W_0_L_2) },
+  },
+
   /* EVEX_LEN_0F385A */
   {
     { Bad_Opcode },
@@ -58,6 +65,13 @@  static const struct dis386 evex_len_table[][3] = {
     { VEX_W_TABLE (EVEX_W_0F385B_L_2) },
   },
 
+  /* EVEX_LEN_0F386D_X86_64_W_0 */
+  {
+    { Bad_Opcode },
+    { Bad_Opcode },
+    { PREFIX_TABLE (PREFIX_EVEX_0F386D_X86_64_W_0_L_2) },
+  },
+
   /* EVEX_LEN_0F38C6 */
   {
     { Bad_Opcode },
@@ -86,6 +100,13 @@  static const struct dis386 evex_len_table[][3] = {
     { VEX_W_TABLE (VEX_W_0F3A01_L_1) },
   },
 
+  /* EVEX_LEN_0F3A07_X86_64_W_0 */
+  {
+    { Bad_Opcode },
+    { Bad_Opcode },
+    { PREFIX_TABLE (PREFIX_EVEX_0F3A07_X86_64_W_0_L_2) },
+  },
+
   /* EVEX_LEN_0F3A18 */
   {
     { Bad_Opcode },
@@ -156,6 +177,13 @@  static const struct dis386 evex_len_table[][3] = {
     { VEX_W_TABLE (EVEX_W_0F3A43_L_n) },
   },
 
+  /* EVEX_LEN_0F3A77_X86_64_W_0 */
+  {
+    { Bad_Opcode },
+    { Bad_Opcode },
+    { PREFIX_TABLE (PREFIX_EVEX_0F3A77_X86_64_W_0_L_2) },
+  },
+
   /* EVEX_LEN_MAP5_6E */
   {
     { PREFIX_TABLE (PREFIX_EVEX_MAP5_6E_L_0) },
diff --git a/opcodes/i386-dis-evex-prefix.h b/opcodes/i386-dis-evex-prefix.h
index 55d6d806ccb..a559b0b7c62 100644
--- a/opcodes/i386-dis-evex-prefix.h
+++ b/opcodes/i386-dis-evex-prefix.h
@@ -243,6 +243,12 @@ 
     { VEX_W_TABLE (EVEX_W_0F383A_P_1) },
     { "%XEvpminuw",	{ XM, Vex, EXx }, 0 },
   },
+  /* PREFIX_EVEX_0F384A_X86_64_W_0_L_2 */
+  {
+    { Bad_Opcode },
+    { "tcvtrowd2ps",	{ XM, Rtmm, VexGd }, 0 },
+    { "tilemovrow",	{ XM, Rtmm, VexGd }, 0 },
+  },
   /* PREFIX_EVEX_0F3852 */
   {
     { "vdpphp%XS",	{ XM, Vex, EXx }, 0 },
@@ -264,6 +270,13 @@ 
     { Bad_Opcode },
     { "vp2intersectY%DQ", { MaskG, Vex, EXx, EXxEVexS }, 0 },
   },
+  /* PREFIX_EVEX_0F386D_X86_64_W_0_L_2 */
+  {
+    { "tcvtrowps2phh",	{ XM, Rtmm, VexGd }, 0 },
+    { "tcvtrowps2pbf16l",	{ XM, Rtmm, VexGd }, 0 },
+    { "tcvtrowps2phl",	{ XM, Rtmm, VexGd }, 0 },
+    { "tcvtrowps2pbf16h",	{ XM, Rtmm, VexGd }, 0 },
+  },
   /* PREFIX_EVEX_0F3872 */
   {
     { Bad_Opcode },
@@ -306,6 +319,13 @@ 
     { "%XEvfmsub213s%XW",	{ XMScalar, VexScalar, EXdq, EXxEVexR }, 0 },
     { "v4fnmadds%XS",	{ XMScalar, VexScalar, Mxmm }, 0 },
   },
+  /* PREFIX_EVEX_0F3A07_X86_64_W_0_L_2 */
+  {
+    { "tcvtrowps2phh",	{ XM, Rtmm, Ib }, 0 },
+    { "tcvtrowd2ps",	{ XM, Rtmm, Ib }, 0 },
+    { "tilemovrow",	{ XM, Rtmm, Ib }, 0 },
+    { "tcvtrowps2pbf16h",	{ XM, Rtmm, Ib }, 0 },
+  },
   /* PREFIX_EVEX_0F3A08 */
   {
     { "vrndscalep%XH",  { XM, EXxh, EXxEVexS, Ib }, 0 },
@@ -377,6 +397,13 @@ 
     { Bad_Opcode },
     { "vfpclasss%XW",	{ MaskG, EXdq, Ib }, 0 },
   },
+  /* PREFIX_EVEX_0F3A77_X86_64_W_0_L_2 */
+  {
+    { Bad_Opcode },
+    { "tcvtrowps2pbf16l",	{ XM, Rtmm, Ib }, 0 },
+    { Bad_Opcode },
+    { "tcvtrowps2phl",	{ XM, Rtmm, Ib }, 0 },
+  },
   /* PREFIX_EVEX_0F3AC2 */
   {
     { "vcmpp%XH", { MaskG, Vex, EXxh, EXxEVexS, CMP }, 0 },
diff --git a/opcodes/i386-dis-evex-w.h b/opcodes/i386-dis-evex-w.h
index 36b8150cd2c..70f65dab96e 100644
--- a/opcodes/i386-dis-evex-w.h
+++ b/opcodes/i386-dis-evex-w.h
@@ -336,6 +336,10 @@ 
   {
     { "vpbroadcastmw2dY",	{ XM, MaskR }, 0 },
   },
+  /* EVEX_W_0F384A_X86_64 */
+  {
+    { EVEX_LEN_TABLE (EVEX_LEN_0F384A_X86_64_W_0) },
+  },
   /* EVEX_W_0F3859 */
   {
     { "vbroadcasti32x2",	{ XM, EXq }, PREFIX_DATA },
@@ -351,6 +355,10 @@ 
     { "vbroadcasti32x8",	{ XM, Mymm }, PREFIX_DATA },
     { "vbroadcasti64x4",	{ XM, Mymm }, PREFIX_DATA },
   },
+  /* EVEX_W_0F386D_X86_64 */
+  {
+    { EVEX_LEN_TABLE (EVEX_LEN_0F386D_X86_64_W_0) },
+  },
   /* EVEX_W_0F3870 */
   {
     { Bad_Opcode },
@@ -374,6 +382,10 @@ 
     { Bad_Opcode },
     { "vpmultishiftqb",	{ XM, Vex, EXx }, PREFIX_DATA },
   },
+  /* EVEX_W_0F3A07_X86_64 */
+  {
+    { EVEX_LEN_TABLE (EVEX_LEN_0F3A07_X86_64_W_0) },
+  },
   /* EVEX_W_0F3A18_L_n */
   {
     { "vinsertf32x4",	{ XM, Vex, EXxmm, Ib }, PREFIX_DATA },
@@ -442,6 +454,10 @@ 
     { Bad_Opcode },
     { "vpshrdw",   { XM, Vex, EXx, Ib }, 0 },
   },
+  /* EVEX_W_0F3A77_X86_64 */
+  {
+    { EVEX_LEN_TABLE (EVEX_LEN_0F3A77_X86_64_W_0) },
+  },
   /* EVEX_W_MAP4_8F_R_0 */
   {
     { "pop2", { { PUSH2_POP2_Fixup, q_mode}, Eq }, NO_PREFIX },
diff --git a/opcodes/i386-dis-evex-x86-64.h b/opcodes/i386-dis-evex-x86-64.h
index 4e52607d306..21bf3bf5e5d 100644
--- a/opcodes/i386-dis-evex-x86-64.h
+++ b/opcodes/i386-dis-evex-x86-64.h
@@ -1,3 +1,23 @@ 
+  /* X86_64_EVEX_0F384A */
+  {
+    { Bad_Opcode },
+    { VEX_W_TABLE (EVEX_W_0F384A_X86_64) },
+  },
+  /* X86_64_EVEX_0F386D */
+  {
+    { Bad_Opcode },
+    { VEX_W_TABLE (EVEX_W_0F386D_X86_64) },
+  },
+  /* X86_64_EVEX_0F3A07 */
+  {
+    { Bad_Opcode },
+    { VEX_W_TABLE (EVEX_W_0F3A07_X86_64) },
+  },
+  /* X86_64_EVEX_0F3A77 */
+  {
+    { Bad_Opcode },
+    { VEX_W_TABLE (EVEX_W_0F3A77_X86_64) },
+  },
   /* X86_64_EVEX_MAP5_6C_W_1_P_1 */
   {
     { Bad_Opcode },
diff --git a/opcodes/i386-dis-evex.h b/opcodes/i386-dis-evex.h
index d42b5af7b53..130b1da0272 100644
--- a/opcodes/i386-dis-evex.h
+++ b/opcodes/i386-dis-evex.h
@@ -376,7 +376,7 @@  static const struct dis386 evex_table[][256] = {
     /* 48 */
     { Bad_Opcode },
     { X86_64_EVEX_MEM_W_TABLE (VEX_W_0F3849_X86_64_L_0) },
-    { Bad_Opcode },
+    { X86_64_TABLE (X86_64_EVEX_0F384A) },
     { X86_64_EVEX_MEM_W_TABLE (VEX_W_0F384B_X86_64_L_0) },
     { "vrcp14p%XW",	{ XM, EXx }, PREFIX_DATA },
     { "vrcp14s%XW",	{ XMScalar, VexScalar, EXdq }, PREFIX_DATA },
@@ -415,7 +415,7 @@  static const struct dis386 evex_table[][256] = {
     { Bad_Opcode },
     { Bad_Opcode },
     { Bad_Opcode },
-    { Bad_Opcode },
+    { X86_64_TABLE (X86_64_EVEX_0F386D) },
     { Bad_Opcode },
     { Bad_Opcode },
     /* 70 */
@@ -591,7 +591,7 @@  static const struct dis386 evex_table[][256] = {
     { VEX_W_TABLE (VEX_W_0F3A04) },
     { "%XEvpermilp%XD", { XM, EXx, Ib }, PREFIX_DATA },
     { Bad_Opcode },
-    { Bad_Opcode },
+    { X86_64_TABLE (X86_64_EVEX_0F3A07) },
     /* 08 */
     { PREFIX_TABLE (PREFIX_EVEX_0F3A08) },
     { "vrndscalep%XD", { XM, EXx, EXxEVexS, Ib }, PREFIX_DATA },
@@ -717,7 +717,7 @@  static const struct dis386 evex_table[][256] = {
     { Bad_Opcode },
     { Bad_Opcode },
     { Bad_Opcode },
-    { Bad_Opcode },
+    { X86_64_TABLE (X86_64_EVEX_0F3A77) },
     /* 78 */
     { Bad_Opcode },
     { Bad_Opcode },
diff --git a/opcodes/i386-dis.c b/opcodes/i386-dis.c
index 2095bb65196..8f651f7a06f 100644
--- a/opcodes/i386-dis.c
+++ b/opcodes/i386-dis.c
@@ -592,6 +592,7 @@  fetch_error (const instr_info *ins)
 #define VexGatherD { OP_VEX, vex_vsib_d_w_dq_mode }
 #define VexGatherQ { OP_VEX, vex_vsib_q_w_dq_mode }
 #define VexGdq { OP_VEX, dq_mode }
+#define VexGd { OP_VEX, d_mode }
 #define VexGb { OP_VEX, b_mode }
 #define VexGv { OP_VEX, v_mode }
 #define VexTmm { OP_VEX, tmm_mode }
@@ -1200,9 +1201,11 @@  enum
   PREFIX_EVEX_0F3838,
   PREFIX_EVEX_0F3839,
   PREFIX_EVEX_0F383A,
+  PREFIX_EVEX_0F384A_X86_64_W_0_L_2,
   PREFIX_EVEX_0F3852,
   PREFIX_EVEX_0F3853,
   PREFIX_EVEX_0F3868,
+  PREFIX_EVEX_0F386D_X86_64_W_0_L_2,
   PREFIX_EVEX_0F3872,
   PREFIX_EVEX_0F3874,
   PREFIX_EVEX_0F389A,
@@ -1210,6 +1213,7 @@  enum
   PREFIX_EVEX_0F38AA,
   PREFIX_EVEX_0F38AB,
 
+  PREFIX_EVEX_0F3A07_X86_64_W_0_L_2,
   PREFIX_EVEX_0F3A08,
   PREFIX_EVEX_0F3A0A,
   PREFIX_EVEX_0F3A26,
@@ -1221,6 +1225,7 @@  enum
   PREFIX_EVEX_0F3A57,
   PREFIX_EVEX_0F3A66,
   PREFIX_EVEX_0F3A67,
+  PREFIX_EVEX_0F3A77_X86_64_W_0_L_2,
   PREFIX_EVEX_0F3AC2,
 
   PREFIX_EVEX_MAP4_4x,
@@ -1362,7 +1367,12 @@  enum
 
   X86_64_VEX_MAP7_F6_L_0_W_0_R_0,
   X86_64_VEX_MAP7_F8_L_0_W_0_R_0,
-  
+
+  X86_64_EVEX_0F384A,
+  X86_64_EVEX_0F386D,
+  X86_64_EVEX_0F3A07,
+  X86_64_EVEX_0F3A77,
+
   X86_64_EVEX_MAP5_6C_W_1_P_1,
   X86_64_EVEX_MAP5_6C_W_1_P_3,
   X86_64_EVEX_MAP5_6D_W_1_P_1,
@@ -1555,12 +1565,15 @@  enum
   EVEX_LEN_0F381A,
   EVEX_LEN_0F381B,
   EVEX_LEN_0F3836,
+  EVEX_LEN_0F384A_X86_64_W_0,
   EVEX_LEN_0F385A,
   EVEX_LEN_0F385B,
+  EVEX_LEN_0F386D_X86_64_W_0,
   EVEX_LEN_0F38C6,
   EVEX_LEN_0F38C7,
   EVEX_LEN_0F3A00,
   EVEX_LEN_0F3A01,
+  EVEX_LEN_0F3A07_X86_64_W_0,
   EVEX_LEN_0F3A18,
   EVEX_LEN_0F3A19,
   EVEX_LEN_0F3A1A,
@@ -1571,6 +1584,7 @@  enum
   EVEX_LEN_0F3A3A,
   EVEX_LEN_0F3A3B,
   EVEX_LEN_0F3A43,
+  EVEX_LEN_0F3A77_X86_64_W_0,
 
   EVEX_LEN_MAP5_6E,
   EVEX_LEN_MAP5_7E,
@@ -1779,15 +1793,18 @@  enum
   EVEX_W_0F3835_P_2,
   EVEX_W_0F3837,
   EVEX_W_0F383A_P_1,
+  EVEX_W_0F384A_X86_64,
   EVEX_W_0F3859,
   EVEX_W_0F385A_L_n,
   EVEX_W_0F385B_L_2,
+  EVEX_W_0F386D_X86_64,
   EVEX_W_0F3870,
   EVEX_W_0F3872_P_2,
   EVEX_W_0F387A,
   EVEX_W_0F387B,
   EVEX_W_0F3883,
 
+  EVEX_W_0F3A07_X86_64,
   EVEX_W_0F3A18_L_n,
   EVEX_W_0F3A19_L_n,
   EVEX_W_0F3A1A_L_2,
@@ -1802,6 +1819,7 @@  enum
   EVEX_W_0F3A43_L_n,
   EVEX_W_0F3A70,
   EVEX_W_0F3A72,
+  EVEX_W_0F3A77_X86_64,
 
   EVEX_W_MAP4_8F_R_0,
   EVEX_W_MAP4_F8_P1_M_1,
@@ -13931,6 +13949,8 @@  OP_VEX (instr_info *ins, int bytemode, int sizeflag ATTRIBUTE_UNUSED)
     case 512:
       names = att_names_zmm;
       ins->evex_used |= EVEX_len_used;
+      if (bytemode == d_mode)
+	names = att_names32;
       break;
     default:
       abort ();
diff --git a/opcodes/i386-gen.c b/opcodes/i386-gen.c
index be05b1be817..168dc565a60 100644
--- a/opcodes/i386-gen.c
+++ b/opcodes/i386-gen.c
@@ -265,6 +265,8 @@  static const dependency isa_dependencies[] =
     "AMX_TILE" },
   { "AMX_TRANSPOSE",
     "AMX_TILE" },
+  { "AMX_AVX512",
+    "AMX_TILE|AVX10_2" },
   { "KL",
     "SSE2" },
   { "WIDEKL",
@@ -432,6 +434,7 @@  static bitfield cpu_flags[] =
   BITFIELD (AMX_FP16),
   BITFIELD (AMX_COMPLEX),
   BITFIELD (AMX_TRANSPOSE),
+  BITFIELD (AMX_AVX512),
   BITFIELD (AMX_TILE),
   BITFIELD (MOVDIRI),
   BITFIELD (MOVDIR64B),
diff --git a/opcodes/i386-opc.h b/opcodes/i386-opc.h
index fd11f9f0cd8..91972954966 100644
--- a/opcodes/i386-opc.h
+++ b/opcodes/i386-opc.h
@@ -252,6 +252,8 @@  enum i386_cpu
   CpuAMX_FP16,
   /* AMX-COMPLEX instructions required.  */
   CpuAMX_COMPLEX,
+  /* AMX-AVX512 instructions required.  */
+  CpuAMX_AVX512,
   /* AMX-TILE instructions required */
   CpuAMX_TILE,
   /* GFNI instructions required */
@@ -500,6 +502,7 @@  typedef union i386_cpu_flags
       unsigned int cpuamx_bf16:1;
       unsigned int cpuamx_fp16:1;
       unsigned int cpuamx_complex:1;
+      unsigned int cpuamx_avx512:1;
       unsigned int cpuamx_tile:1;
       unsigned int cpugfni:1;
       unsigned int cpuvaes:1;
diff --git a/opcodes/i386-opc.tbl b/opcodes/i386-opc.tbl
index d8f2a180ba7..d17765aa0af 100644
--- a/opcodes/i386-opc.tbl
+++ b/opcodes/i386-opc.tbl
@@ -3204,6 +3204,19 @@  tconjtcmmimfp16ps, 0x6b, AMX_COMPLEX&AMX_TRANSPOSE, Modrm|Vex128|Space0F38|Src2V
 
 tconjtfp16, 0x666b, AMX_COMPLEX&AMX_TRANSPOSE, Modrm|Vex128|Space0F38|VexW0|NoSuf, { RegTMM, RegTMM }
 
+tcvtrowd2ps, 0xf34a, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
+tcvtrowd2ps, 0xf307, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
+
+tcvtrowps2pbf16h, 0xf26d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
+tcvtrowps2pbf16h, 0xf207, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
+tcvtrowps2pbf16l, 0xf36d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
+tcvtrowps2pbf16l, 0xf377, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
+
+tcvtrowps2phh, 0x6d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
+tcvtrowps2phh, 0x07, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
+tcvtrowps2phl, 0x666d, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
+tcvtrowps2phl, 0xf277, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
+
 tdpbf16ps, 0xf35c, AMX_BF16, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
 tdpfp16ps, 0xf25c, AMX_FP16, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
 tdpbssd, 0xf25e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
@@ -3213,6 +3226,8 @@  tdpbsud, 0xf35e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM
 
 tileloadd, 0xf24b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex, RegTMM }
 tileloaddt1, 0x664b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex, RegTMM }
+tilemovrow, 0x664a, AMX_AVX512, Modrm|EVex512|Space0F38|Src2VVVV|VexW0|NoSuf, { Reg32, RegTMM, RegZMM }
+tilemovrow, 0x6607, AMX_AVX512, Modrm|EVex512|Space0F3A|VexW0|NoSuf, { Imm8, RegTMM, RegZMM }
 tilestored, 0xf34b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { RegTMM, Unspecified|BaseIndex }
 
 tilerelease, 0x49c0, AMX_TILE, Vex128|Space0F38|VexW0|NoSuf, {}