[0/3] Refine AVX10.2 mnemonics

Message ID 20250113082614.1716559-1-haochen.jiang@intel.com
Series: Refine AVX10.2 mnemonics

Message

Jiang, Haochen Jan. 13, 2025, 8:26 a.m. UTC
  Hi all,

Since the AVX10.2 spec was published, there have been several discussions
regarding the mnemonics. After internal discussion, we will make three
changes to the AVX10.2 mnemonics.

Quick Conclusion:

  - NE will be removed from all new AVX10.2 insns
  - VCOMSBF16 -> VCOMISBF16
  - P for packed omitted for AI data types

The details of why we are doing so are at the end of this email.

All of them will be reflected in the refreshed AVX10.2 spec (ETA: published
this week). Since the Binutils 2.44 window is near, I am sending them out now.

The three patches that follow change the mnemonics according to this
conclusion: patch 1 covers the BF16 arithmetic insns, patch 2 covers
VCOMISBF16, and patch 3 covers the convert insns.

These three patches will probably be the last unsent patches for Intel
Diamond Rapids in Binutils 2.44, barring exceptions (the only potential
exception being the MOVRS APX_F EVEX.W issue). I really appreciate the
review and discussion on AVX10.2 and AMX since August. It has been a
really long run.

Ok for trunk?

Thx,
Haochen

Details for the change:

  - NE removal for the default rounding of AI data types
  After reviewing the instructions we currently have, NE turns out to be a
  total mess. The name itself is ambiguous: it is meant to be Rounding to
  Nearest Even, but can be misinterpreted as No Exception (which is how I
  interpreted it at the beginning of this year, and also Jan's understanding)
  or No Embedded Rounding. On top of the ambiguous name, it appears here and
  there without following a consistent rule. The biggest disaster is
  AVX-NE-CONVERT, where almost all of the insns are up-converts (only one
  insn under that CPUID is a down-convert that needs rounding) and should
  not carry NE, yet NE appears everywhere.

  Given the current inconsistency and mess, we intend to clean it up in the
  mnemonics starting with AVX10.2. Since NE itself is ambiguous and
  misleading, why keep it at all? Our decision is therefore to remove NE
  from ALL new instructions in AVX10.2, with BF16 documentation to be added
  to the SDM in the future (I would expect within this year). This shortens
  the mnemonics and makes them easier to use, since users no longer have to
  remember whether NE belongs in a mnemonic. And since all the rounding
  modes are spelled out in the insn descriptions, it does not leave anyone
  guessing about the rounding mode.
  
  For old insns, the plan for now is to leave them as-is, since the
  implementation has been in place for some time. Whether we could change
  them is open for discussion, but at least for Binutils 2.44 and GCC 15
  they won't be changed, due to the timing.

  - VCOMSBF16 renamed to VCOMISBF16
  VCOMSBF16 actually has the same functionality as the earlier VCOMISD/S/H,
  except for the data type: they all compare and set three EFLAGS. Thus, it
  should be VCOMISBF16, not a brand-new VCOMSBF16.

  - P prefix for packed omitted on BF16 and future AI data types
  For legacy double, float and FP16, we use PD/PS/PH for packed and
  SD/SS/SH for scalar. Since the very beginning of BF16, the P for packed
  has been omitted, and that omission continues to this day. We suppose the
  assumption was that AI data types like BF16 would always be packed in
  calculations, so the omission was safe at the time.

  However, in the AVX10.2 spec as first published, the P is not omitted for
  most instructions. Therefore, we decided to drop the P before BF16 in the
  AVX10.2 BF16 instructions to stay consistent. This also applies to the
  AMX-AVX512 TCVTROWPS2PBF16[H,L], which will change to TCVTROWPS2BF16[H,L]
  (already reflected in ISE056, and the AMX-AVX512 patch follows that). With
  the P omitted, we add an explicit S to indicate scalar use for BF16. For
  example, VCOMISBF16 stays VCOMISBF16.

  While going through all the instructions, we found a problem with
  VBCSTNEBF162PS in AVX-NE-CONVERT. Under this rule it should gain an S
  before BF16, but since BCST itself is already meaningful there, we will
  keep it as-is for now.

  This will also apply to future AI data types, including TF32 and FP8
  (packed by default, with an explicit S to indicate scalar).

At the end of the day, there will be some significant changes to the BF16
insns. For example, VADDNEPBF16 becomes VADDBF16, as in the short sketch
below.
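
A minimal GAS (AT&T syntax) sketch of the net effect; the register choices
are illustrative only, and the old spellings are shown as comments:

  # BF16 packed add: NE dropped, P omitted
  vaddbf16    %zmm2, %zmm1, %zmm0   # was vaddnepbf16
  # BF16 scalar compare; sets ZF/PF/CF like its existing siblings
  vcomisbf16  %xmm1, %xmm0          # was vcomsbf16
  vcomisd     %xmm1, %xmm0          # existing, for reference
  vcomish     %xmm1, %xmm0          # existing, for reference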
  

Comments

Christian Ludloff Jan. 13, 2025, 10:12 a.m. UTC | #1
On Mon, Jan 13, 2025 at 9:26 AM Haochen Jiang <haochen.jiang@intel.com> wrote:
> Since the AVX10.2 spec was published, there have been several discussions
> regarding the mnemonics. After internal discussion, we will make three
> changes to the AVX10.2 mnemonics.
>
> Quick Conclusion:
>
>   - NE will be removed from all new AVX10.2 insns
>   - VCOMSBF16 -> VCOMISBF16
>   - P for packed omitted for AI data types
>
> The details of why we are doing so are at the end of this email.
>
> All of them will be reflected in the refreshed AVX10.2 spec (ETA: published
> this week). Since the Binutils 2.44 window is near, I am sending them out now.
>
> The three patches that follow change the mnemonics according to this
> conclusion: patch 1 covers the BF16 arithmetic insns, patch 2 covers
> VCOMISBF16, and patch 3 covers the convert insns.
>
> These three patches will probably be the last unsent patches for Intel
> Diamond Rapids in Binutils 2.44, barring exceptions (the only potential
> exception being the MOVRS APX_F EVEX.W issue). I really appreciate the
> review and discussion on AVX10.2 and AMX since August. It has been a
> really long run.
>
> Ok for trunk?

One more: don't forget the 256-bit {er} variants of VCVTDQ2PD at
F3.0F__.W0.E6 and VCVTUDQ2PD at F3.0F__.W0.7A – they are
missing from the AVX10.2 spec; they should be supported, similar
to the existing 512-bit {er} variants: attempts to encode {er} should
be ignored, as documented in the SDM.
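
As a minimal sketch (register choices illustrative only), these are the two
256-bit forms in question in GAS (AT&T) syntax; int32 -> float64 conversion
is exact, so no rounding is ever needed, which is why an attempted {er}
encoding should be silently ignored rather than #UD:

  vcvtdq2pd   %xmm1, %ymm0   # EVEX.256.F3.0F.W0 E6: 4 x int32  -> 4 x double
  vcvtudq2pd  %xmm1, %ymm0   # EVEX.256.F3.0F.W0 7A: 4 x uint32 -> 4 x double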

--
C.
  
Jan Beulich Jan. 13, 2025, 3:25 p.m. UTC | #2
On 13.01.2025 09:26, Haochen Jiang wrote:
> Hi all,
> 
> Since the AVX10.2 spec was published, there have been several discussions
> regarding the mnemonics. After internal discussion, we will make three
> changes to the AVX10.2 mnemonics.
>
> Quick Conclusion:
>
>   - NE will be removed from all new AVX10.2 insns
>   - VCOMSBF16 -> VCOMISBF16
>   - P for packed omitted for AI data types
>
> The details of why we are doing so are at the end of this email.
>
> All of them will be reflected in the refreshed AVX10.2 spec (ETA: published
> this week). Since the Binutils 2.44 window is near, I am sending them out now.
>
> The three patches that follow change the mnemonics according to this
> conclusion: patch 1 covers the BF16 arithmetic insns, patch 2 covers
> VCOMISBF16, and patch 3 covers the convert insns.
>
> These three patches will probably be the last unsent patches for Intel
> Diamond Rapids in Binutils 2.44, barring exceptions (the only potential
> exception being the MOVRS APX_F EVEX.W issue). I really appreciate the
> review and discussion on AVX10.2 and AMX since August. It has been a
> really long run.
> 
> Ok for trunk?

Okay, on the assumption that the doc will be updated accordingly in due
course, and hence we're not going to end up with any back and forth.

Jan
  
Jiang, Haochen Jan. 14, 2025, 5:50 a.m. UTC | #3
> From: Christian Ludloff <ludloff@gmail.com>
> Sent: Monday, January 13, 2025 6:13 PM
> 
> 
> One more: don't forget the 256-bit {er} variants of VCVTDQ2PD at
> F3.0F__.W0.E6 and VCVTUDQ2PD at F3.0F__.W0.7A – they are missing from
> the AVX10.2 spec; they should be supported, similar to the existing 512-bit
> {er} variants: attempts to encode {er} should be ignored, as documented in the
> SDM.
> 

Let me find a way to fix that. I hope it can still make Binutils 2.44.

Thx,
Haochen
  
Christian Ludloff Jan. 14, 2025, 6:35 a.m. UTC | #4
On Tue, Jan 14, 2025, 06:51 Jiang, Haochen <haochen.jiang@intel.com> wrote:

> > One more: don't forget the 256-bit {er} variants of VCVTDQ2PD at
> > F3.0F__.W0.E6 and VCVTUDQ2PD at F3.0F__.W0.7A – they are missing from
> > the AVX10.2 spec; they should be supported, similar to the existing
> > 512-bit {er} variants: attempts to encode {er} should be ignored, as
> > documented in the SDM.
> >
>
> Let me find a way to fix that. I hope it can still make Binutils 2.44.


Hopefully all the AVX10.2 silicon got it right in the first place.

--
C.

  
Christian Ludloff Jan. 16, 2025, 8:23 p.m. UTC | #5
>> > One more: don't forget the 256-bit {er} variants of VCVTDQ2PD at
>> > F3.0F__.W0.E6 and VCVTUDQ2PD at F3.0F__.W0.7A – they are missing from
>> > the AVX10.2 spec; they should be supported, similar to the existing 512-bit
>> > {er} variants: attempts to encode {er} should be ignored, as documented in the
>> > SDM.

> Hopefully all the AVX10.2 silicon got it right in the first place.

Fwiw, the latest AVX10.2 spec (#361050-003 from Jan 14)
is still missing the 256-bit {er} variants of VCVT[,U]DQ2PD.

For now, we have the unofficial confirmation here:

  https://sourceware.org/pipermail/binutils/2025-January/138698.html

--
Christian
  
Jiang, Haochen Jan. 17, 2025, 2:13 a.m. UTC | #6
> From: Christian Ludloff <ludloff@gmail.com>
> Sent: Friday, January 17, 2025 4:24 AM
> 
> >> > One more: don't forget the 256-bit {er} variants of VCVTDQ2PD at
> >> > F3.0F__.W0.E6 and VCVTUDQ2PD at F3.0F__.W0.7A – they are missing
> >> > from the AVX10.2 spec; they should be supported, similar to the
> >> > existing 512-bit {er} variants: attempts to encode {er} should be
> >> > ignored, as documented in the SDM.
> 
> > Hopefully all the AVX10.2 silicon got it right in the first place.
> 
> Fwiw, the latest AVX10.2 spec (#361050-003 from Jan 14) is still missing the
> 256-bit {er} variants of VCVT[,U]DQ2PD.
> 
> For now, we have the unofficial confirmation here:
> 
>   https://sourceware.org/pipermail/binutils/2025-January/138698.html

I suppose it is not included because the instruction normally has no
rounding operand; we simply ignore the rounding if someone encodes the
bytecode with it set. That is why the doc does not list it. Maybe we can
find a better way to present that. (Once the AVX10.2 material is folded
into the SDM, everything will be aligned anyway.) Let me try to find a way.

Thx,
Haochen
  
Christian Ludloff Jan. 17, 2025, 3:34 a.m. UTC | #7
> > >> > One more: don't forget the 256-bit {er} variants of VCVTDQ2PD at
> > >> > F3.0F__.W0.E6 and VCVTUDQ2PD at F3.0F__.W0.7A – they are missing
> > >> > from the AVX10.2 spec; they should be supported, similar to the
> > >> > existing 512-bit {er} variants: attempts to encode {er} should be
> > >> > ignored, as documented in the SDM.
> >
> > > Hopefully all the AVX10.2 silicon got it right in the first place.
> >
> > Fwiw, the latest AVX10.2 spec (#361050-003 from Jan 14) is still missing the
> > 256-bit {er} variants of VCVT[,U]DQ2PD.
> >
> > For now, we have the unofficial confirmation here:
> >
> >   https://sourceware.org/pipermail/binutils/2025-January/138698.html
>
> I suppose it is not included because the instruction normally has no
> rounding operand; we simply ignore the rounding if someone encodes the
> bytecode with it set. That is why the doc does not list it. Maybe we can
> find a better way to present that. (Once the AVX10.2 material is folded
> into the SDM, everything will be aligned anyway.) Let me try to find a way.

The AVX10.2 spec went to great lengths to list all the existing
instructions whose LL=256 variant now permits U=0 – it included
CVT[,U]DQ2{PH,PS} but not CVT[,U]DQ2PD.

Perhaps clone table entries for PD from table entries for PH/PS,
and likewise clone instruction pages for PD from PH/PS but add
the SDM statement "Attempt to encode this instruction with EVEX
embedded rounding is ignored." on the PD pages.

For SDM, it is probably best to add {er} to the four affected CVTs,
perhaps as {er.ignored}. That way their "Attempt to..." statements
would no longer be dangling as much.

Fwiw, I have seen various tools emit the {er} for LL=512, as they
handle PH/PS/PD identically, rather than special-case PD. And I
expect to see this carry over to LL=256. Which is why I think it is
important that the spec and the chips get it right (no #UD).

Thanks for your hard work!

--
C.