[2/3] x86: Drop SwapSources

Message ID 20240424072356.2433122-3-lili.cui@intel.com
State New
Series x86: Optimize the encoder of the vvvv register

Checks

Context Check Description
linaro-tcwg-bot/tcwg_binutils_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_binutils_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_binutils_check--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_binutils_check--master-arm success Testing passed

Commit Message

Cui, Lili April 24, 2024, 7:23 a.m. UTC
  gas/ChangeLog:

        * config/tc-i386.c (build_modrm_byte): Dropped the use of
	SWAP_SOURCES to encode the vvvv register.

opcodes/ChangeLog:

        * i386-opc.h (SWAP_SOURCES): Dropped.
        (NO_DEFAULT_MASK): Adjusted the value.
        (ADDR_PREFIX_OP_REG): Ditto.
        (DISTINCT_DEST): Ditto.
        (IMPLICIT_STACK_OP): Ditto.
        (VexVVVV_SRC2): New.
        * i386-opc.tbl: Dropped SwapSources and replaced its Src2VVVV
	with Src1VVVV.
---
 gas/config/tc-i386.c | 16 ++++++------
 opcodes/i386-opc.h   | 12 ++++-----
 opcodes/i386-opc.tbl | 60 ++++++++++++++++++++++----------------------
 3 files changed, 44 insertions(+), 44 deletions(-)
  

Comments

Jan Beulich April 24, 2024, 8:07 a.m. UTC | #1
On 24.04.2024 09:23, Cui, Lili wrote:
> --- a/gas/config/tc-i386.c
> +++ b/gas/config/tc-i386.c
> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
>  
>    switch (i.tm.opcode_modifier.vexvvvv)
>      {
> +    case VexVVVV_SRC2:
> +      if (source != op)
> +	{
> +	  v = source++;
> +	  break;
> +	}
> +      /* For XOP: vpshl* and vpsha*.  */
> +      /* Fall through.  */
>      case VexVVVV_SRC1:

This falling-through is odd and hence needs a better comment (then also
covering vprot*, which afaict is similarly affected). The reason for this
is the XOP.W-controlled operand swapping, if I'm not mistaken? In which
case perhaps instead of the fall-through here the logic swapping the
operands should replace VexVVVV_SRC2 by VexVVVV_SRC1?

Jan
  
Cui, Lili April 26, 2024, 8:14 a.m. UTC | #2
> On 24.04.2024 09:23, Cui, Lili wrote:
> > --- a/gas/config/tc-i386.c
> > +++ b/gas/config/tc-i386.c
> > @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
> >
> >    switch (i.tm.opcode_modifier.vexvvvv)
> >      {
> > +    case VexVVVV_SRC2:
> > +      if (source != op)
> > +	{
> > +	  v = source++;
> > +	  break;
> > +	}
> > +      /* For XOP: vpshl* and vpsha*.  */
> > +      /* Fall through.  */
> >      case VexVVVV_SRC1:
> 
> This falling-through is odd and hence needs a better comment (then also
> covering vprot*, which afaict is similarly affected). The reason for this is the
> XOP.W-controlled operand swapping, if I'm not mistaken? In which case
> perhaps instead of the fall-through here the logic swapping the operands
> should replace VexVVVV_SRC2 by VexVVVV_SRC1?
> 

Yes, vprot* should be included, and it is related to XOP.W-controlled operand swapping. The comment says "/* Only the first two register operands need reversing, alongside flipping VEX.W.  */", but there is actually a memory operand involved, not two register operands.

I think VexVVVV_SRC2 makes more sense here; it matches the actual situation, where we want vvvv to encode the first operand.

Opcode table:
vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }

testcase:
vprotb (%rax),%xmm12,%xmm15
vprotb %xmm15,(%r12),%xmm0
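
A minimal sketch of how these two forms map onto the XOP fields, assuming the AMD VPROT variants where XOP.W selects whether ModRM.rm holds the data to rotate or the rotate counts (operands shown in AT&T order as above, with the Intel order noted; this is only an illustration, not part of the patch):

   vprotb (%rax),%xmm12,%xmm15    # Intel: vprotb xmm15, xmm12, [rax]
      counts come from memory -> W=1 form: ModRM.rm = (%rax), VEX.vvvv = %xmm12
   vprotb %xmm15,(%r12),%xmm0     # Intel: vprotb xmm0, [r12], xmm15
      data comes from memory  -> W=0 form: ModRM.rm = (%r12), VEX.vvvv = %xmm15

So the operand that ends up in vvvv sits at a different position in the two accepted operand orders, which is presumably where the Src1/Src2 distinction comes in.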

Thanks,
Lili.
  
Jan Beulich April 26, 2024, 10:37 a.m. UTC | #3
On 26.04.2024 10:14, Cui, Lili wrote:
>> On 24.04.2024 09:23, Cui, Lili wrote:
>>> --- a/gas/config/tc-i386.c
>>> +++ b/gas/config/tc-i386.c
>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
>>>
>>>    switch (i.tm.opcode_modifier.vexvvvv)
>>>      {
>>> +    case VexVVVV_SRC2:
>>> +      if (source != op)
>>> +	{
>>> +	  v = source++;
>>> +	  break;
>>> +	}
>>> +      /* For XOP: vpshl* and vpsha*.  */
>>> +      /* Fall through.  */
>>>      case VexVVVV_SRC1:
>>
>> This falling-through is odd and hence needs a better comment (then also
>> covering vprot*, which afaict is similarly affected). The reason for this is the
>> XOP.W-controlled operand swapping, if I'm not mistaken? In which case
>> perhaps instead of the fall-through here the logic swapping the operands
>> should replace VexVVVV_SRC2 by VexVVVV_SRC1?
>>
> 
> Yes, vprot* should be included, and it is related to XOP.W-controlled operand swapping, the comments says " /* Only the first two register operands need reversing, alongside flipping VEX.W.  */ ",  But there is actually a memory operand, not two register operands.
> 
> I think VexVVVV_SRC2 makes more sense here, it matches the actual situation, we want to use vvvv to encode the first operand.
> 
> Opcode table:
> vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
> 
> testcase:
> vprotb (%rax),%xmm12,%xmm15
> vprotb %xmm15,(%r12),%xmm0

VexVVVV_Src2 is appropriate for the latter, yes, but not for the former. That
uses VexVVVV_Src1 layout. Hence my suggestion to replace the attribute when
swapping operands.

Jan
  
Cui, Lili April 28, 2024, 4:47 a.m. UTC | #4
> On 26.04.2024 10:14, Cui, Lili wrote:
> >> On 24.04.2024 09:23, Cui, Lili wrote:
> >>> --- a/gas/config/tc-i386.c
> >>> +++ b/gas/config/tc-i386.c
> >>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
> >>>
> >>>    switch (i.tm.opcode_modifier.vexvvvv)
> >>>      {
> >>> +    case VexVVVV_SRC2:
> >>> +      if (source != op)
> >>> +	{
> >>> +	  v = source++;
> >>> +	  break;
> >>> +	}
> >>> +      /* For XOP: vpshl* and vpsha*.  */
> >>> +      /* Fall through.  */
> >>>      case VexVVVV_SRC1:
> >>
> >> This falling-through is odd and hence needs a better comment (then
> >> also covering vprot*, which afaict is similarly affected). The reason
> >> for this is the XOP.W-controlled operand swapping, if I'm not
> >> mistaken? In which case perhaps instead of the fall-through here the
> >> logic swapping the operands should replace VexVVVV_SRC2 by
> VexVVVV_SRC1?
> >>
> >
> > Yes, vprot* should be included, and it is related to XOP.W-controlled
> operand swapping, the comments says " /* Only the first two register
> operands need reversing, alongside flipping VEX.W.  */ ",  But there is
> actually a memory operand, not two register operands.
> >
> > I think VexVVVV_SRC2 makes more sense here, it matches the actual
> situation, we want to use vvvv to encode the first operand.
> >
> > Opcode table:
> > vprot<xop>, 0x90 | <xop:opc>, XOP,
> > D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf, { RegXMM,
> > RegXMM|Unspecified|BaseIndex, RegXMM }
> >
> > testcase:
> > vprotb (%rax),%xmm12,%xmm15
> > vprotb %xmm15,(%r12),%xmm0
> 
> VexVVVV_Src2 is appropriate for the latter, yes, but not for the former. That
> uses VexVVVV_Src1 layout. Hence my suggestion to replace the attribute
> when swapping operands.
> 

If we replace Src2VVVV|VexW0 with Src1VVVV|VexW1 and swap the operands, we can put VexVVVV_SRC1 before VexVVVV_SRC2, but we still need to add "(!is_cpu (&i.tm, CpuXOP) || source == op)" under VexVVVV_SRC1, and match_template also needs to be adjusted (I made a simple modification and it still failed; I think continuing like this may go against the original intention).

  switch (i.tm.opcode_modifier.vexvvvv)
    {
    /* VEX.vvvv encodes the first source register operand.  */
    case VexVVVV_SRC1:
      if (!is_cpu (&i.tm, CpuXOP) || source == op)
        {
          v =  dest - 1;
          break;
        }
    /* For XOP: vpshl*, vpsha* and vprot*.  */
    /* Fall through.  */
    /* VEX.vvvv encodes the last source register operand.  */
    case VexVVVV_SRC2:
      v = source++;
      break;
    /* VEX.vvvv encodes the destination register operand.  */
    case VexVVVV_DST:
      v = dest--;
      break;
    default:
      v = ~0;
      break;
     }

Do you think we should add a separate patch 4 for XOP that removes the special handling in match_template and completes its templates, so we don't have to add special handling for Src1VVVV or Src2VVVV? This might go against your desire to reduce template size, but it would help simplify the logic. I'd like to know your thoughts.

  vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
+vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }

Thanks,
Lili.
  
Jan Beulich April 29, 2024, 6:40 a.m. UTC | #5
On 28.04.2024 06:47, Cui, Lili wrote:
>> On 26.04.2024 10:14, Cui, Lili wrote:
>>>> On 24.04.2024 09:23, Cui, Lili wrote:
>>>>> --- a/gas/config/tc-i386.c
>>>>> +++ b/gas/config/tc-i386.c
>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
>>>>>
>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
>>>>>      {
>>>>> +    case VexVVVV_SRC2:
>>>>> +      if (source != op)
>>>>> +	{
>>>>> +	  v = source++;
>>>>> +	  break;
>>>>> +	}
>>>>> +      /* For XOP: vpshl* and vpsha*.  */
>>>>> +      /* Fall through.  */
>>>>>      case VexVVVV_SRC1:
>>>>
>>>> This falling-through is odd and hence needs a better comment (then
>>>> also covering vprot*, which afaict is similarly affected). The reason
>>>> for this is the XOP.W-controlled operand swapping, if I'm not
>>>> mistaken? In which case perhaps instead of the fall-through here the
>>>> logic swapping the operands should replace VexVVVV_SRC2 by
>> VexVVVV_SRC1?
>>>>
>>>
>>> Yes, vprot* should be included, and it is related to XOP.W-controlled
>> operand swapping, the comments says " /* Only the first two register
>> operands need reversing, alongside flipping VEX.W.  */ ",  But there is
>> actually a memory operand, not two register operands.
>>>
>>> I think VexVVVV_SRC2 makes more sense here, it matches the actual
>> situation, we want to use vvvv to encode the first operand.
>>>
>>> Opcode table:
>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf, { RegXMM,
>>> RegXMM|Unspecified|BaseIndex, RegXMM }
>>>
>>> testcase:
>>> vprotb (%rax),%xmm12,%xmm15
>>> vprotb %xmm15,(%r12),%xmm0
>>
>> VexVVVV_Src2 is appropriate for the latter, yes, but not for the former. That
>> uses VexVVVV_Src1 layout. Hence my suggestion to replace the attribute
>> when swapping operands.
>>
> 
> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we still need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under VexVVVV_SRC1 , and match_template also needs to be adjusted (I made a simple modification and it still failed, I think continuing like this may go against the original intention).
> 
>   switch (i.tm.opcode_modifier.vexvvvv)
>     {
>     /* VEX.vvvv encodes the first source register operand.  */
>     case VexVVVV_SRC1:
>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
>         {
>           v =  dest - 1;
>           break;
>         }
>     /* For XOP: vpshl*, vpsha* and vprot*.  */
>     /* Fall through.  */
>     /* VEX.vvvv encodes the last source register operand.  */
>     case VexVVVV_SRC2:
>       v = source++;
>       break;
>     /* VEX.vvvv encodes the destination register operand.  */
>     case VexVVVV_DST:
>       v = dest--;
>       break;
>     default:
>       v = ~0;
>       break;
>      }
> 
> Do you think we should add a separate patch 4 for XOP that removes the special handling in match_template and completes its template? so we don't have to add special handling for src1vvvv or src2vvvv. This might go against your desire to reduce template size, but it would help simplify the logic. I'd like to know your thoughts.

Indeed. You'd effectively revert earlier folding that I did. And the adjustment
I suggested earlier ought to be small/simple enough.

Jan
  
Cui, Lili April 29, 2024, 12:23 p.m. UTC | #6
> On 28.04.2024 06:47, Cui, Lili wrote:
> >> On 26.04.2024 10:14, Cui, Lili wrote:
> >>>> On 24.04.2024 09:23, Cui, Lili wrote:
> >>>>> --- a/gas/config/tc-i386.c
> >>>>> +++ b/gas/config/tc-i386.c
> >>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
> >>>>>
> >>>>>    switch (i.tm.opcode_modifier.vexvvvv)
> >>>>>      {
> >>>>> +    case VexVVVV_SRC2:
> >>>>> +      if (source != op)
> >>>>> +	{
> >>>>> +	  v = source++;
> >>>>> +	  break;
> >>>>> +	}
> >>>>> +      /* For XOP: vpshl* and vpsha*.  */
> >>>>> +      /* Fall through.  */
> >>>>>      case VexVVVV_SRC1:
> >>>>
> >>>> This falling-through is odd and hence needs a better comment (then
> >>>> also covering vprot*, which afaict is similarly affected). The
> >>>> reason for this is the XOP.W-controlled operand swapping, if I'm
> >>>> not mistaken? In which case perhaps instead of the fall-through
> >>>> here the logic swapping the operands should replace VexVVVV_SRC2 by
> >> VexVVVV_SRC1?
> >>>>
> >>>
> >>> Yes, vprot* should be included, and it is related to
> >>> XOP.W-controlled
> >> operand swapping, the comments says " /* Only the first two register
> >> operands need reversing, alongside flipping VEX.W.  */ ",  But there
> >> is actually a memory operand, not two register operands.
> >>>
> >>> I think VexVVVV_SRC2 makes more sense here, it matches the actual
> >> situation, we want to use vvvv to encode the first operand.
> >>>
> >>> Opcode table:
> >>> vprot<xop>, 0x90 | <xop:opc>, XOP,
> >>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
> { RegXMM,
> >>> RegXMM|Unspecified|BaseIndex, RegXMM }
> >>>
> >>> testcase:
> >>> vprotb (%rax),%xmm12,%xmm15
> >>> vprotb %xmm15,(%r12),%xmm0
> >>
> >> VexVVVV_Src2 is appropriate for the latter, yes, but not for the
> >> former. That uses VexVVVV_Src1 layout. Hence my suggestion to replace
> >> the attribute when swapping operands.
> >>
> >
> > If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping
> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we still
> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
> VexVVVV_SRC1 , and match_template also needs to be adjusted (I made a
> simple modification and it still failed, I think continuing like this may go
> against the original intention).
> >
> >   switch (i.tm.opcode_modifier.vexvvvv)
> >     {
> >     /* VEX.vvvv encodes the first source register operand.  */
> >     case VexVVVV_SRC1:
> >       if (!is_cpu (&i.tm, CpuXOP) || source == op)
> >         {
> >           v =  dest - 1;
> >           break;
> >         }
> >     /* For XOP: vpshl*, vpsha* and vprot*.  */
> >     /* Fall through.  */
> >     /* VEX.vvvv encodes the last source register operand.  */
> >     case VexVVVV_SRC2:
> >       v = source++;
> >       break;
> >     /* VEX.vvvv encodes the destination register operand.  */
> >     case VexVVVV_DST:
> >       v = dest--;
> >       break;
> >     default:
> >       v = ~0;
> >       break;
> >      }
> >
> > Do you think we should add a separate patch 4 for XOP that removes the
> special handling in match_template and completes its template? so we don't
> have to add special handling for src1vvvv or src2vvvv. This might go against
> your desire to reduce template size, but it would help simplify the logic. I'd
> like to know your thoughts.
> 
> Indeed. You'd effectively revert earlier folding that I did. And the adjustment I
> suggested earlier ought to be small/simple enough.
> 

So I continued working on the previous suggestion; with the following modification it worked.

@@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
                          || is_cpu (t, CpuAPX_F));
              if (!operand_type_match (overlap0, i.types[0])
                  || !operand_type_match (overlap1, i.types[j])
-                 || (t->operands == 3
+                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
                      && !operand_type_match (overlap2, i.types[1]))
                  || (check_register
                      && !operand_type_register_match (i.types[0],

But I found that 4 test files failed. I didn't find documentation on how vprotb is encoded, but I guess that is because I changed the default template from Src2VVVV|VexW0 to Src1VVVV|VexW1, so all the related test cases would need to be modified. Do you have any comments here?

regexp "^[      ]*[a-f0-9]+:    8f e9 40 90 d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"

Thanks,
Lili.
  
Jan Beulich April 29, 2024, 1:08 p.m. UTC | #7
On 29.04.2024 14:23, Cui, Lili wrote:
>> On 28.04.2024 06:47, Cui, Lili wrote:
>>>> On 26.04.2024 10:14, Cui, Lili wrote:
>>>>>> On 24.04.2024 09:23, Cui, Lili wrote:
>>>>>>> --- a/gas/config/tc-i386.c
>>>>>>> +++ b/gas/config/tc-i386.c
>>>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
>>>>>>>
>>>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
>>>>>>>      {
>>>>>>> +    case VexVVVV_SRC2:
>>>>>>> +      if (source != op)
>>>>>>> +	{
>>>>>>> +	  v = source++;
>>>>>>> +	  break;
>>>>>>> +	}
>>>>>>> +      /* For XOP: vpshl* and vpsha*.  */
>>>>>>> +      /* Fall through.  */
>>>>>>>      case VexVVVV_SRC1:
>>>>>>
>>>>>> This falling-through is odd and hence needs a better comment (then
>>>>>> also covering vprot*, which afaict is similarly affected). The
>>>>>> reason for this is the XOP.W-controlled operand swapping, if I'm
>>>>>> not mistaken? In which case perhaps instead of the fall-through
>>>>>> here the logic swapping the operands should replace VexVVVV_SRC2 by
>>>> VexVVVV_SRC1?
>>>>>>
>>>>>
>>>>> Yes, vprot* should be included, and it is related to
>>>>> XOP.W-controlled
>>>> operand swapping, the comments says " /* Only the first two register
>>>> operands need reversing, alongside flipping VEX.W.  */ ",  But there
>>>> is actually a memory operand, not two register operands.
>>>>>
>>>>> I think VexVVVV_SRC2 makes more sense here, it matches the actual
>>>> situation, we want to use vvvv to encode the first operand.
>>>>>
>>>>> Opcode table:
>>>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
>>>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
>> { RegXMM,
>>>>> RegXMM|Unspecified|BaseIndex, RegXMM }
>>>>>
>>>>> testcase:
>>>>> vprotb (%rax),%xmm12,%xmm15
>>>>> vprotb %xmm15,(%r12),%xmm0
>>>>
>>>> VexVVVV_Src2 is appropriate for the latter, yes, but not for the
>>>> former. That uses VexVVVV_Src1 layout. Hence my suggestion to replace
>>>> the attribute when swapping operands.
>>>>
>>>
>>> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping
>> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we still
>> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
>> VexVVVV_SRC1 , and match_template also needs to be adjusted (I made a
>> simple modification and it still failed, I think continuing like this may go
>> against the original intention).
>>>
>>>   switch (i.tm.opcode_modifier.vexvvvv)
>>>     {
>>>     /* VEX.vvvv encodes the first source register operand.  */
>>>     case VexVVVV_SRC1:
>>>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
>>>         {
>>>           v =  dest - 1;
>>>           break;
>>>         }
>>>     /* For XOP: vpshl*, vpsha* and vprot*.  */
>>>     /* Fall through.  */
>>>     /* VEX.vvvv encodes the last source register operand.  */
>>>     case VexVVVV_SRC2:
>>>       v = source++;
>>>       break;
>>>     /* VEX.vvvv encodes the destination register operand.  */
>>>     case VexVVVV_DST:
>>>       v = dest--;
>>>       break;
>>>     default:
>>>       v = ~0;
>>>       break;
>>>      }
>>>
>>> Do you think we should add a separate patch 4 for XOP that removes the
>> special handling in match_template and completes its template? so we don't
>> have to add special handling for src1vvvv or src2vvvv. This might go against
>> your desire to reduce template size, but it would help simplify the logic. I'd
>> like to know your thoughts.
>>
>> Indeed. You'd effectively revert earlier folding that I did. And the adjustment I
>> suggested earlier ought to be small/simple enough.
>>
> 
> So, I continued working on the previous suggestion. With the following modification and it worked.
> 
> @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
>                           || is_cpu (t, CpuAPX_F));
>               if (!operand_type_match (overlap0, i.types[0])
>                   || !operand_type_match (overlap1, i.types[j])
> -                 || (t->operands == 3
> +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
>                       && !operand_type_match (overlap2, i.types[1]))
>                   || (check_register
>                       && !operand_type_register_match (i.types[0],

Just to mention it - this certainly isn't what I suggested. In fact I seem to
vaguely recall that something similar was once proposed during the original
APX work as well, where I then objected, too.

> But I found that there are 4 test files that failed, I didn't find the doc on how to encode vprotb but I guess that is because I changed the default template from Src2VVVV| VexW0 to Src1VVVV| VexW1, then all the related test cases needed to be modified. Do you have any comments here? 
> 
> regexp "^[      ]*[a-f0-9]+:    8f e9 40 90 d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
> line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"

Well, while changing the templates is in principle possible, and the
resulting code would still be correct, changing encodings it usually not a
good idea. Thus when it can be avoided, it should be avoided, imo. Hence
why I didn't suggest this, but to amend the code doing the operand swapping
(for the case where operand order is controlled by XOP.W).

Jan
  
Cui, Lili April 29, 2024, 1:41 p.m. UTC | #8
> On 29.04.2024 14:23, Cui, Lili wrote:
> >> On 28.04.2024 06:47, Cui, Lili wrote:
> >>>> On 26.04.2024 10:14, Cui, Lili wrote:
> >>>>>> On 24.04.2024 09:23, Cui, Lili wrote:
> >>>>>>> --- a/gas/config/tc-i386.c
> >>>>>>> +++ b/gas/config/tc-i386.c
> >>>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
> >>>>>>>
> >>>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
> >>>>>>>      {
> >>>>>>> +    case VexVVVV_SRC2:
> >>>>>>> +      if (source != op)
> >>>>>>> +	{
> >>>>>>> +	  v = source++;
> >>>>>>> +	  break;
> >>>>>>> +	}
> >>>>>>> +      /* For XOP: vpshl* and vpsha*.  */
> >>>>>>> +      /* Fall through.  */
> >>>>>>>      case VexVVVV_SRC1:
> >>>>>>
> >>>>>> This falling-through is odd and hence needs a better comment
> >>>>>> (then also covering vprot*, which afaict is similarly affected).
> >>>>>> The reason for this is the XOP.W-controlled operand swapping, if
> >>>>>> I'm not mistaken? In which case perhaps instead of the
> >>>>>> fall-through here the logic swapping the operands should replace
> >>>>>> VexVVVV_SRC2 by
> >>>> VexVVVV_SRC1?
> >>>>>>
> >>>>>
> >>>>> Yes, vprot* should be included, and it is related to
> >>>>> XOP.W-controlled
> >>>> operand swapping, the comments says " /* Only the first two
> >>>> register operands need reversing, alongside flipping VEX.W.  */ ",
> >>>> But there is actually a memory operand, not two register operands.
> >>>>>
> >>>>> I think VexVVVV_SRC2 makes more sense here, it matches the actual
> >>>> situation, we want to use vvvv to encode the first operand.
> >>>>>
> >>>>> Opcode table:
> >>>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
> >>>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
> >> { RegXMM,
> >>>>> RegXMM|Unspecified|BaseIndex, RegXMM }
> >>>>>
> >>>>> testcase:
> >>>>> vprotb (%rax),%xmm12,%xmm15
> >>>>> vprotb %xmm15,(%r12),%xmm0
> >>>>
> >>>> VexVVVV_Src2 is appropriate for the latter, yes, but not for the
> >>>> former. That uses VexVVVV_Src1 layout. Hence my suggestion to
> >>>> replace the attribute when swapping operands.
> >>>>
> >>>
> >>> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping
> >> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we still
> >> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
> >> VexVVVV_SRC1 , and match_template also needs to be adjusted (I made a
> >> simple modification and it still failed, I think continuing like this
> >> may go against the original intention).
> >>>
> >>>   switch (i.tm.opcode_modifier.vexvvvv)
> >>>     {
> >>>     /* VEX.vvvv encodes the first source register operand.  */
> >>>     case VexVVVV_SRC1:
> >>>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
> >>>         {
> >>>           v =  dest - 1;
> >>>           break;
> >>>         }
> >>>     /* For XOP: vpshl*, vpsha* and vprot*.  */
> >>>     /* Fall through.  */
> >>>     /* VEX.vvvv encodes the last source register operand.  */
> >>>     case VexVVVV_SRC2:
> >>>       v = source++;
> >>>       break;
> >>>     /* VEX.vvvv encodes the destination register operand.  */
> >>>     case VexVVVV_DST:
> >>>       v = dest--;
> >>>       break;
> >>>     default:
> >>>       v = ~0;
> >>>       break;
> >>>      }
> >>>
> >>> Do you think we should add a separate patch 4 for XOP that removes
> >>> the
> >> special handling in match_template and completes its template? so we
> >> don't have to add special handling for src1vvvv or src2vvvv. This
> >> might go against your desire to reduce template size, but it would
> >> help simplify the logic. I'd like to know your thoughts.
> >>
> >> Indeed. You'd effectively revert earlier folding that I did. And the
> >> adjustment I suggested earlier ought to be small/simple enough.
> >>
> >
> > So, I continued working on the previous suggestion. With the following
> modification and it worked.
> >
> > @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
> >                           || is_cpu (t, CpuAPX_F));
> >               if (!operand_type_match (overlap0, i.types[0])
> >                   || !operand_type_match (overlap1, i.types[j])
> > -                 || (t->operands == 3
> > +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
> >                       && !operand_type_match (overlap2, i.types[1]))
> >                   || (check_register
> >                       && !operand_type_register_match (i.types[0],
> 
> Just to mention it - this certainly isn't what I suggested. In fact I seem to
> vaguely recall that something similar was once proposed during the original
> APX work as well, where I then objected, too.
> 
> > But I found that there are 4 test files that failed, I didn't find the doc on
> how to encode vprotb but I guess that is because I changed the default
> template from Src2VVVV| VexW0 to Src1VVVV| VexW1, then all the related
> test cases needed to be modified. Do you have any comments here?
> >
> > regexp "^[      ]*[a-f0-9]+:    8f e9 40 90
> d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
> > line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"
> 
> Well, while changing the templates is in principle possible, and the resulting
> code would still be correct, changing encodings it usually not a good idea.
> Thus when it can be avoided, it should be avoided, imo. Hence why I didn't
> suggest this, but to amend the code doing the operand swapping (for the
> case where operand order is controlled by XOP.W).
> 

I mistakenly thought this was what you wanted, and I also think this modification is unacceptable. Could you elaborate further on your original suggestion?

Thanks,
Lili.
  
Jan Beulich April 29, 2024, 1:49 p.m. UTC | #9
On 29.04.2024 15:41, Cui, Lili wrote:
>> On 29.04.2024 14:23, Cui, Lili wrote:
>>>> On 28.04.2024 06:47, Cui, Lili wrote:
>>>>>> On 26.04.2024 10:14, Cui, Lili wrote:
>>>>>>>> On 24.04.2024 09:23, Cui, Lili wrote:
>>>>>>>>> --- a/gas/config/tc-i386.c
>>>>>>>>> +++ b/gas/config/tc-i386.c
>>>>>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
>>>>>>>>>
>>>>>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
>>>>>>>>>      {
>>>>>>>>> +    case VexVVVV_SRC2:
>>>>>>>>> +      if (source != op)
>>>>>>>>> +	{
>>>>>>>>> +	  v = source++;
>>>>>>>>> +	  break;
>>>>>>>>> +	}
>>>>>>>>> +      /* For XOP: vpshl* and vpsha*.  */
>>>>>>>>> +      /* Fall through.  */
>>>>>>>>>      case VexVVVV_SRC1:
>>>>>>>>
>>>>>>>> This falling-through is odd and hence needs a better comment
>>>>>>>> (then also covering vprot*, which afaict is similarly affected).
>>>>>>>> The reason for this is the XOP.W-controlled operand swapping, if
>>>>>>>> I'm not mistaken? In which case perhaps instead of the
>>>>>>>> fall-through here the logic swapping the operands should replace
>>>>>>>> VexVVVV_SRC2 by
>>>>>> VexVVVV_SRC1?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, vprot* should be included, and it is related to
>>>>>>> XOP.W-controlled
>>>>>> operand swapping, the comments says " /* Only the first two
>>>>>> register operands need reversing, alongside flipping VEX.W.  */ ",
>>>>>> But there is actually a memory operand, not two register operands.
>>>>>>>
>>>>>>> I think VexVVVV_SRC2 makes more sense here, it matches the actual
>>>>>> situation, we want to use vvvv to encode the first operand.
>>>>>>>
>>>>>>> Opcode table:
>>>>>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
>>>>>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
>>>> { RegXMM,
>>>>>>> RegXMM|Unspecified|BaseIndex, RegXMM }
>>>>>>>
>>>>>>> testcase:
>>>>>>> vprotb (%rax),%xmm12,%xmm15
>>>>>>> vprotb %xmm15,(%r12),%xmm0
>>>>>>
>>>>>> VexVVVV_Src2 is appropriate for the latter, yes, but not for the
>>>>>> former. That uses VexVVVV_Src1 layout. Hence my suggestion to
>>>>>> replace the attribute when swapping operands.
>>>>>>
>>>>>
>>>>> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping
>>>> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we still
>>>> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
>>>> VexVVVV_SRC1 , and match_template also needs to be adjusted (I made a
>>>> simple modification and it still failed, I think continuing like this
>>>> may go against the original intention).
>>>>>
>>>>>   switch (i.tm.opcode_modifier.vexvvvv)
>>>>>     {
>>>>>     /* VEX.vvvv encodes the first source register operand.  */
>>>>>     case VexVVVV_SRC1:
>>>>>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
>>>>>         {
>>>>>           v =  dest - 1;
>>>>>           break;
>>>>>         }
>>>>>     /* For XOP: vpshl*, vpsha* and vprot*.  */
>>>>>     /* Fall through.  */
>>>>>     /* VEX.vvvv encodes the last source register operand.  */
>>>>>     case VexVVVV_SRC2:
>>>>>       v = source++;
>>>>>       break;
>>>>>     /* VEX.vvvv encodes the destination register operand.  */
>>>>>     case VexVVVV_DST:
>>>>>       v = dest--;
>>>>>       break;
>>>>>     default:
>>>>>       v = ~0;
>>>>>       break;
>>>>>      }
>>>>>
>>>>> Do you think we should add a separate patch 4 for XOP that removes
>>>>> the
>>>> special handling in match_template and completes its template? so we
>>>> don't have to add special handling for src1vvvv or src2vvvv. This
>>>> might go against your desire to reduce template size, but it would
>>>> help simplify the logic. I'd like to know your thoughts.
>>>>
>>>> Indeed. You'd effectively revert earlier folding that I did. And the
>>>> adjustment I suggested earlier ought to be small/simple enough.
>>>>
>>>
>>> So, I continued working on the previous suggestion. With the following
>> modification and it worked.
>>>
>>> @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
>>>                           || is_cpu (t, CpuAPX_F));
>>>               if (!operand_type_match (overlap0, i.types[0])
>>>                   || !operand_type_match (overlap1, i.types[j])
>>> -                 || (t->operands == 3
>>> +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
>>>                       && !operand_type_match (overlap2, i.types[1]))
>>>                   || (check_register
>>>                       && !operand_type_register_match (i.types[0],
>>
>> Just to mention it - this certainly isn't what I suggested. In fact I seem to
>> vaguely recall that something similar was once proposed during the original
>> APX work as well, where I then objected, too.
>>
>>> But I found that there are 4 test files that failed, I didn't find the doc on
>> how to encode vprotb but I guess that is because I changed the default
>> template from Src2VVVV| VexW0 to Src1VVVV| VexW1, then all the related
>> test cases needed to be modified. Do you have any comments here?
>>>
>>> regexp "^[      ]*[a-f0-9]+:    8f e9 40 90
>> d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
>>> line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"
>>
>> Well, while changing the templates is in principle possible, and the resulting
>> code would still be correct, changing encodings it usually not a good idea.
>> Thus when it can be avoided, it should be avoided, imo. Hence why I didn't
>> suggest this, but to amend the code doing the operand swapping (for the
>> case where operand order is controlled by XOP.W).
>>
> 
> I mistakenly thought this was what you wanted, and I also think this modification is unacceptable. Could you elaborate further on your original suggestion?

At the bottom of match_template() we have

    case Opcode_VexW:
      /* Only the first two register operands need reversing, alongside
	 flipping VEX.W.  */
      i.tm.opcode_modifier.vexw ^= VEXW0 ^ VEXW1;

This is where I think you further want to adjust i.tm.opcode_modifier.vexvvvv.
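
For illustration, a minimal sketch of what that adjustment might look like (not a committed change; it assumes the VexVVVV_SRC1/VexVVVV_SRC2 enumerators introduced by this series):

    case Opcode_VexW:
      /* Only the first two register operands need reversing, alongside
	 flipping VEX.W.  */
      i.tm.opcode_modifier.vexw ^= VEXW0 ^ VEXW1;
      /* When the operands are swapped this way, the register that ends up
	 in VEX.vvvv changes position too, so switch the attribute instead
	 of special-casing XOP in build_modrm_byte().  */
      if (i.tm.opcode_modifier.vexvvvv == VexVVVV_SRC2)
	i.tm.opcode_modifier.vexvvvv = VexVVVV_SRC1;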

Jan
  
Cui, Lili April 30, 2024, 2:56 a.m. UTC | #10
> On 29.04.2024 15:41, Cui, Lili wrote:
> >> On 29.04.2024 14:23, Cui, Lili wrote:
> >>>> On 28.04.2024 06:47, Cui, Lili wrote:
> >>>>>> On 26.04.2024 10:14, Cui, Lili wrote:
> >>>>>>>> On 24.04.2024 09:23, Cui, Lili wrote:
> >>>>>>>>> --- a/gas/config/tc-i386.c
> >>>>>>>>> +++ b/gas/config/tc-i386.c
> >>>>>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
> >>>>>>>>>
> >>>>>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
> >>>>>>>>>      {
> >>>>>>>>> +    case VexVVVV_SRC2:
> >>>>>>>>> +      if (source != op)
> >>>>>>>>> +	{
> >>>>>>>>> +	  v = source++;
> >>>>>>>>> +	  break;
> >>>>>>>>> +	}
> >>>>>>>>> +      /* For XOP: vpshl* and vpsha*.  */
> >>>>>>>>> +      /* Fall through.  */
> >>>>>>>>>      case VexVVVV_SRC1:
> >>>>>>>>
> >>>>>>>> This falling-through is odd and hence needs a better comment
> >>>>>>>> (then also covering vprot*, which afaict is similarly affected).
> >>>>>>>> The reason for this is the XOP.W-controlled operand swapping,
> >>>>>>>> if I'm not mistaken? In which case perhaps instead of the
> >>>>>>>> fall-through here the logic swapping the operands should
> >>>>>>>> replace
> >>>>>>>> VexVVVV_SRC2 by
> >>>>>> VexVVVV_SRC1?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes, vprot* should be included, and it is related to
> >>>>>>> XOP.W-controlled
> >>>>>> operand swapping, the comments says " /* Only the first two
> >>>>>> register operands need reversing, alongside flipping VEX.W.  */
> >>>>>> ", But there is actually a memory operand, not two register operands.
> >>>>>>>
> >>>>>>> I think VexVVVV_SRC2 makes more sense here, it matches the
> >>>>>>> actual
> >>>>>> situation, we want to use vvvv to encode the first operand.
> >>>>>>>
> >>>>>>> Opcode table:
> >>>>>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
> >>>>>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
> >>>> { RegXMM,
> >>>>>>> RegXMM|Unspecified|BaseIndex, RegXMM }
> >>>>>>>
> >>>>>>> testcase:
> >>>>>>> vprotb (%rax),%xmm12,%xmm15
> >>>>>>> vprotb %xmm15,(%r12),%xmm0
> >>>>>>
> >>>>>> VexVVVV_Src2 is appropriate for the latter, yes, but not for the
> >>>>>> former. That uses VexVVVV_Src1 layout. Hence my suggestion to
> >>>>>> replace the attribute when swapping operands.
> >>>>>>
> >>>>>
> >>>>> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping
> >>>> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we
> still
> >>>> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
> >>>> VexVVVV_SRC1 , and match_template also needs to be adjusted (I
> made
> >>>> a simple modification and it still failed, I think continuing like
> >>>> this may go against the original intention).
> >>>>>
> >>>>>   switch (i.tm.opcode_modifier.vexvvvv)
> >>>>>     {
> >>>>>     /* VEX.vvvv encodes the first source register operand.  */
> >>>>>     case VexVVVV_SRC1:
> >>>>>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
> >>>>>         {
> >>>>>           v =  dest - 1;
> >>>>>           break;
> >>>>>         }
> >>>>>     /* For XOP: vpshl*, vpsha* and vprot*.  */
> >>>>>     /* Fall through.  */
> >>>>>     /* VEX.vvvv encodes the last source register operand.  */
> >>>>>     case VexVVVV_SRC2:
> >>>>>       v = source++;
> >>>>>       break;
> >>>>>     /* VEX.vvvv encodes the destination register operand.  */
> >>>>>     case VexVVVV_DST:
> >>>>>       v = dest--;
> >>>>>       break;
> >>>>>     default:
> >>>>>       v = ~0;
> >>>>>       break;
> >>>>>      }
> >>>>>
> >>>>> Do you think we should add a separate patch 4 for XOP that removes
> >>>>> the
> >>>> special handling in match_template and completes its template? so
> >>>> we don't have to add special handling for src1vvvv or src2vvvv.
> >>>> This might go against your desire to reduce template size, but it
> >>>> would help simplify the logic. I'd like to know your thoughts.
> >>>>
> >>>> Indeed. You'd effectively revert earlier folding that I did. And
> >>>> the adjustment I suggested earlier ought to be small/simple enough.
> >>>>
> >>>
> >>> So, I continued working on the previous suggestion. With the
> >>> following
> >> modification and it worked.
> >>>
> >>> @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
> >>>                           || is_cpu (t, CpuAPX_F));
> >>>               if (!operand_type_match (overlap0, i.types[0])
> >>>                   || !operand_type_match (overlap1, i.types[j])
> >>> -                 || (t->operands == 3
> >>> +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
> >>>                       && !operand_type_match (overlap2, i.types[1]))
> >>>                   || (check_register
> >>>                       && !operand_type_register_match (i.types[0],
> >>
> >> Just to mention it - this certainly isn't what I suggested. In fact I
> >> seem to vaguely recall that something similar was once proposed
> >> during the original APX work as well, where I then objected, too.
> >>
> >>> But I found that there are 4 test files that failed, I didn't find
> >>> the doc on
> >> how to encode vprotb but I guess that is because I changed the
> >> default template from Src2VVVV| VexW0 to Src1VVVV| VexW1, then all
> >> the related test cases needed to be modified. Do you have any comments
> here?
> >>>
> >>> regexp "^[      ]*[a-f0-9]+:    8f e9 40 90
> >> d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
> >>> line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"
> >>
> >> Well, while changing the templates is in principle possible, and the
> >> resulting code would still be correct, changing encodings it usually not a
> good idea.
> >> Thus when it can be avoided, it should be avoided, imo. Hence why I
> >> didn't suggest this, but to amend the code doing the operand swapping
> >> (for the case where operand order is controlled by XOP.W).
> >>
> >
> > I mistakenly thought this was what you wanted, and I also think this
> modification is unacceptable. Could you elaborate further on your original
> suggestion?
> 
> At the bottom of match_template() we have
> 
>     case Opcode_VexW:
>       /* Only the first two register operands need reversing, alongside
> 	 flipping VEX.W.  */
>       i.tm.opcode_modifier.vexw ^= VEXW0 ^ VEXW1;
> 
> This is where I think you further want to adjust
> i.tm.opcode_modifier.vexvvvv.
> 

"vprotb %xmm7,%xmm0,%xmm3" doesn't go through this place, it finds the matching template directly (Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }). I inserted a judgment to solve this problem, it's a bit ugly. I'd like to abandon this optimization. 

To avoid confusion, I'm posting all the changes.

--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
                          || is_cpu (t, CpuAPX_F));
              if (!operand_type_match (overlap0, i.types[0])
                  || !operand_type_match (overlap1, i.types[j])
-                 || (t->operands == 3
+                 || (t->operands == 3 && !is_cpu (t, CpuXOP)     --> This place also needs to change.
                      && !operand_type_match (overlap2, i.types[1]))
                  || (check_register
                      && !operand_type_register_match (i.types[0],
@@ -9035,6 +9035,10 @@ match_template (char mnem_suffix)
                      specific_error = progress (i.error);
                      continue;
                    }
+                 if (is_cpu (t, CpuXOP) && operand_types[0].bitfield.baseindex   --> It's ugly, and still has some test cases failing for other XOP instructions.
+                     && i.types[0].bitfield.class == RegSIMD
+                     && t->opcode_modifier.vexw == VEXW1)
+                   found_reverse_match = Opcode_VexW;
                  break;
                }
            } 
@@ -10434,19 +10434,21 @@ build_modrm_byte (void)
   switch (i.tm.opcode_modifier.vexvvvv)
     {
-    /* VEX.vvvv encodes the first source register operand.  */
-    case VexVVVV_SRC2:
-      if (source != op)
+      /* VEX.vvvv encodes the first source register operand.  */
+    case VexVVVV_SRC1:
+      if (!is_cpu (&i.tm, CpuXOP) || op == 0)     --> changed Src1VVVV to Src2VVVV for "vprotb %xmm7,%xmm0,%xmm3".
        {
-         v = source++;
+         v =  dest - 1;
          break;
        }
-      /* For XOP: vpshl* and vpsha*.  */
+      /* For XOP: vpshl*, vpsha* and vprot*.  */
       /* Fall through.  */
-    case VexVVVV_SRC1:
-      v =  dest - 1;
+      /* VEX.vvvv encodes the last source register operand.  */
+    case VexVVVV_SRC2:
+      v = source++;
       break;
-    /* VEX.vvvv encodes the destination register operand.  */
+
+      /* VEX.vvvv encodes the destination register operand.  */
     case VexVVVV_DST:
       v = dest--;
       break;
diff --git a/opcodes/i386-opc.tbl b/opcodes/i386-opc.tbl
index 5d6f1c74449..5eef1cabd8c 100644
--- a/opcodes/i386-opc.tbl
+++ b/opcodes/i386-opc.tbl
@@ -1938,10 +1938,10 @@ vpmacsww, 0x95, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, Reg
 vpmadcsswd, 0xa6, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
 vpmadcswd, 0xb6, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
 vpperm, 0xa3, XOP, D|Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
-vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
+vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
 vprot<xop>, 0xc0 | <xop:opc>, XOP, Modrm|Vex128|SpaceXOP08|VexW0|NoSuf, { Imm8|Imm8S, RegXMM|Unspecified|BaseIndex, RegXMM }
-vpsha<xop>, 0x98 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
-vpshl<xop>, 0x94 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
+vpsha<xop>, 0x98 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
+vpshl<xop>, 0x94 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }

Thanks,
Lili.
  
Jan Beulich April 30, 2024, 6:18 a.m. UTC | #11
On 30.04.2024 04:56, Cui, Lili wrote:
>> On 29.04.2024 15:41, Cui, Lili wrote:
>>>> On 29.04.2024 14:23, Cui, Lili wrote:
>>>>>> On 28.04.2024 06:47, Cui, Lili wrote:
>>>>>>>> On 26.04.2024 10:14, Cui, Lili wrote:
>>>>>>>>>> On 24.04.2024 09:23, Cui, Lili wrote:
>>>>>>>>>>> --- a/gas/config/tc-i386.c
>>>>>>>>>>> +++ b/gas/config/tc-i386.c
>>>>>>>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
>>>>>>>>>>>
>>>>>>>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
>>>>>>>>>>>      {
>>>>>>>>>>> +    case VexVVVV_SRC2:
>>>>>>>>>>> +      if (source != op)
>>>>>>>>>>> +	{
>>>>>>>>>>> +	  v = source++;
>>>>>>>>>>> +	  break;
>>>>>>>>>>> +	}
>>>>>>>>>>> +      /* For XOP: vpshl* and vpsha*.  */
>>>>>>>>>>> +      /* Fall through.  */
>>>>>>>>>>>      case VexVVVV_SRC1:
>>>>>>>>>>
>>>>>>>>>> This falling-through is odd and hence needs a better comment
>>>>>>>>>> (then also covering vprot*, which afaict is similarly affected).
>>>>>>>>>> The reason for this is the XOP.W-controlled operand swapping,
>>>>>>>>>> if I'm not mistaken? In which case perhaps instead of the
>>>>>>>>>> fall-through here the logic swapping the operands should
>>>>>>>>>> replace
>>>>>>>>>> VexVVVV_SRC2 by
>>>>>>>> VexVVVV_SRC1?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, vprot* should be included, and it is related to
>>>>>>>>> XOP.W-controlled
>>>>>>>> operand swapping, the comments says " /* Only the first two
>>>>>>>> register operands need reversing, alongside flipping VEX.W.  */
>>>>>>>> ", But there is actually a memory operand, not two register operands.
>>>>>>>>>
>>>>>>>>> I think VexVVVV_SRC2 makes more sense here, it matches the
>>>>>>>>> actual
>>>>>>>> situation, we want to use vvvv to encode the first operand.
>>>>>>>>>
>>>>>>>>> Opcode table:
>>>>>>>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
>>>>>>>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
>>>>>> { RegXMM,
>>>>>>>>> RegXMM|Unspecified|BaseIndex, RegXMM }
>>>>>>>>>
>>>>>>>>> testcase:
>>>>>>>>> vprotb (%rax),%xmm12,%xmm15
>>>>>>>>> vprotb %xmm15,(%r12),%xmm0
>>>>>>>>
>>>>>>>> VexVVVV_Src2 is appropriate for the latter, yes, but not for the
>>>>>>>> former. That uses VexVVVV_Src1 layout. Hence my suggestion to
>>>>>>>> replace the attribute when swapping operands.
>>>>>>>>
>>>>>>>
>>>>>>> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and swapping
>>>>>> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we
>> still
>>>>>> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
>>>>>> VexVVVV_SRC1 , and match_template also needs to be adjusted (I
>> made
>>>>>> a simple modification and it still failed, I think continuing like
>>>>>> this may go against the original intention).
>>>>>>>
>>>>>>>   switch (i.tm.opcode_modifier.vexvvvv)
>>>>>>>     {
>>>>>>>     /* VEX.vvvv encodes the first source register operand.  */
>>>>>>>     case VexVVVV_SRC1:
>>>>>>>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
>>>>>>>         {
>>>>>>>           v =  dest - 1;
>>>>>>>           break;
>>>>>>>         }
>>>>>>>     /* For XOP: vpshl*, vpsha* and vprot*.  */
>>>>>>>     /* Fall through.  */
>>>>>>>     /* VEX.vvvv encodes the last source register operand.  */
>>>>>>>     case VexVVVV_SRC2:
>>>>>>>       v = source++;
>>>>>>>       break;
>>>>>>>     /* VEX.vvvv encodes the destination register operand.  */
>>>>>>>     case VexVVVV_DST:
>>>>>>>       v = dest--;
>>>>>>>       break;
>>>>>>>     default:
>>>>>>>       v = ~0;
>>>>>>>       break;
>>>>>>>      }
>>>>>>>
>>>>>>> Do you think we should add a separate patch 4 for XOP that removes
>>>>>>> the
>>>>>> special handling in match_template and completes its template? so
>>>>>> we don't have to add special handling for src1vvvv or src2vvvv.
>>>>>> This might go against your desire to reduce template size, but it
>>>>>> would help simplify the logic. I'd like to know your thoughts.
>>>>>>
>>>>>> Indeed. You'd effectively revert earlier folding that I did. And
>>>>>> the adjustment I suggested earlier ought to be small/simple enough.
>>>>>>
>>>>>
>>>>> So, I continued working on the previous suggestion. With the
>>>>> following
>>>> modification and it worked.
>>>>>
>>>>> @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
>>>>>                           || is_cpu (t, CpuAPX_F));
>>>>>               if (!operand_type_match (overlap0, i.types[0])
>>>>>                   || !operand_type_match (overlap1, i.types[j])
>>>>> -                 || (t->operands == 3
>>>>> +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
>>>>>                       && !operand_type_match (overlap2, i.types[1]))
>>>>>                   || (check_register
>>>>>                       && !operand_type_register_match (i.types[0],
>>>>
>>>> Just to mention it - this certainly isn't what I suggested. In fact I
>>>> seem to vaguely recall that something similar was once proposed
>>>> during the original APX work as well, where I then objected, too.
>>>>
>>>>> But I found that there are 4 test files that failed, I didn't find
>>>>> the doc on
>>>> how to encode vprotb but I guess that is because I changed the
>>>> default template from Src2VVVV| VexW0 to Src1VVVV| VexW1, then all
>>>> the related test cases needed to be modified. Do you have any comments
>> here?
>>>>>
>>>>> regexp "^[      ]*[a-f0-9]+:    8f e9 40 90
>>>> d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
>>>>> line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"
>>>>
>>>> Well, while changing the templates is in principle possible, and the
>>>> resulting code would still be correct, changing encodings it usually not a
>> good idea.
>>>> Thus when it can be avoided, it should be avoided, imo. Hence why I
>>>> didn't suggest this, but to amend the code doing the operand swapping
>>>> (for the case where operand order is controlled by XOP.W).
>>>>
>>>
>>> I mistakenly thought this was what you wanted, and I also think this
>> modification is unacceptable. Could you elaborate further on your original
>> suggestion?
>>
>> At the bottom of match_template() we have
>>
>>     case Opcode_VexW:
>>       /* Only the first two register operands need reversing, alongside
>> 	 flipping VEX.W.  */
>>       i.tm.opcode_modifier.vexw ^= VEXW0 ^ VEXW1;
>>
>> This is where I think you further want to adjust
>> i.tm.opcode_modifier.vexvvvv.
>>
> 
> "vprotb %xmm7,%xmm0,%xmm3" doesn't go through this place, it finds the matching template directly (Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }).

Of course, for not having a memory operand (and no pseudo prefix). But that's
not the problem here - the problem is that for build_modrm_byte() you want to
make adjustments when operands need swapping (compared to what the templates
say).

> I inserted a judgment to solve this problem, it's a bit ugly. I'd like to abandon this optimization. 

Didn't you have a working form earlier on? Could we use that, leaving the XOP
stuff untouched for now, for me to see about looking into later?

> To avoid confusion, I post all the changes.
> 
> --- a/gas/config/tc-i386.c
> +++ b/gas/config/tc-i386.c
> @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
>                           || is_cpu (t, CpuAPX_F));
>               if (!operand_type_match (overlap0, i.types[0])
>                   || !operand_type_match (overlap1, i.types[j])
> -                 || (t->operands == 3
> +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)     --> This place also needs to change.
>                       && !operand_type_match (overlap2, i.types[1]))
>                   || (check_register
>                       && !operand_type_register_match (i.types[0],
> @@ -9035,6 +9035,10 @@ match_template (char mnem_suffix)
>                       specific_error = progress (i.error);
>                       continue;
>                     }
> +                 if (is_cpu (t, CpuXOP) && operand_types[0].bitfield.baseindex   --> It's ugly, and still has some test cases failing for other XOP instructions.
> +                     && i.types[0].bitfield.class == RegSIMD
> +                     && t->opcode_modifier.vexw == VEXW1)
> +                   found_reverse_match = Opcode_VexW;
>                   break;
>                 }
>             } 

Yeah, as indicated before: These probably shouldn't be changed like this.

> --- a/opcodes/i386-opc.tbl
> +++ b/opcodes/i386-opc.tbl
> @@ -1938,10 +1938,10 @@ vpmacsww, 0x95, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, Reg
>  vpmadcsswd, 0xa6, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
>  vpmadcswd, 0xb6, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
>  vpperm, 0xa3, XOP, D|Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
> -vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
> +vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
>  vprot<xop>, 0xc0 | <xop:opc>, XOP, Modrm|Vex128|SpaceXOP08|VexW0|NoSuf, { Imm8|Imm8S, RegXMM|Unspecified|BaseIndex, RegXMM }
> -vpsha<xop>, 0x98 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
> -vpshl<xop>, 0x94 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
> +vpsha<xop>, 0x98 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
> +vpshl<xop>, 0x94 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|VexW1|NoSuf, { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }

Aiui (and as per earlier discussion) this leads to encoding changes, which we
want to avoid whenever possible.

Jan
  
Cui, Lili April 30, 2024, 7:34 a.m. UTC | #12
> On 30.04.2024 04:56, Cui, Lili wrote:
> >> On 29.04.2024 15:41, Cui, Lili wrote:
> >>>> On 29.04.2024 14:23, Cui, Lili wrote:
> >>>>>> On 28.04.2024 06:47, Cui, Lili wrote:
> >>>>>>>> On 26.04.2024 10:14, Cui, Lili wrote:
> >>>>>>>>>> On 24.04.2024 09:23, Cui, Lili wrote:
> >>>>>>>>>>> --- a/gas/config/tc-i386.c
> >>>>>>>>>>> +++ b/gas/config/tc-i386.c
> >>>>>>>>>>> @@ -10434,6 +10434,14 @@ build_modrm_byte (void)
> >>>>>>>>>>>
> >>>>>>>>>>>    switch (i.tm.opcode_modifier.vexvvvv)
> >>>>>>>>>>>      {
> >>>>>>>>>>> +    case VexVVVV_SRC2:
> >>>>>>>>>>> +      if (source != op)
> >>>>>>>>>>> +	{
> >>>>>>>>>>> +	  v = source++;
> >>>>>>>>>>> +	  break;
> >>>>>>>>>>> +	}
> >>>>>>>>>>> +      /* For XOP: vpshl* and vpsha*.  */
> >>>>>>>>>>> +      /* Fall through.  */
> >>>>>>>>>>>      case VexVVVV_SRC1:
> >>>>>>>>>>
> >>>>>>>>>> This falling-through is odd and hence needs a better comment
> >>>>>>>>>> (then also covering vprot*, which afaict is similarly affected).
> >>>>>>>>>> The reason for this is the XOP.W-controlled operand swapping,
> >>>>>>>>>> if I'm not mistaken? In which case perhaps instead of the
> >>>>>>>>>> fall-through here the logic swapping the operands should
> >>>>>>>>>> replace
> >>>>>>>>>> VexVVVV_SRC2 by
> >>>>>>>> VexVVVV_SRC1?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yes, vprot* should be included, and it is related to
> >>>>>>>>> XOP.W-controlled
> >>>>>>>> operand swapping, the comments says " /* Only the first two
> >>>>>>>> register operands need reversing, alongside flipping VEX.W.  */
> >>>>>>>> ", But there is actually a memory operand, not two register
> operands.
> >>>>>>>>>
> >>>>>>>>> I think VexVVVV_SRC2 makes more sense here, it matches the
> >>>>>>>>> actual
> >>>>>>>> situation, we want to use vvvv to encode the first operand.
> >>>>>>>>>
> >>>>>>>>> Opcode table:
> >>>>>>>>> vprot<xop>, 0x90 | <xop:opc>, XOP,
> >>>>>>>>> D|Modrm|Vex128|SpaceXOP09|VexVVVV_Src2|VexW0|NoSuf,
> >>>>>> { RegXMM,
> >>>>>>>>> RegXMM|Unspecified|BaseIndex, RegXMM }
> >>>>>>>>>
> >>>>>>>>> testcase:
> >>>>>>>>> vprotb (%rax),%xmm12,%xmm15
> >>>>>>>>> vprotb %xmm15,(%r12),%xmm0
> >>>>>>>>
> >>>>>>>> VexVVVV_Src2 is appropriate for the latter, yes, but not for
> >>>>>>>> the former. That uses VexVVVV_Src1 layout. Hence my suggestion
> >>>>>>>> to replace the attribute when swapping operands.
> >>>>>>>>
> >>>>>>>
> >>>>>>> If replace the Src2VVVV| VexW0 with Src1VVVV| VexW1 and
> swapping
> >>>>>> operands. We can put VexVVVV_SRC1 before VexVVVV_SRC2, but we
> >> still
> >>>>>> need to add "(!is_cpu (&i.tm, CpuXOP) || source == op" under
> >>>>>> VexVVVV_SRC1 , and match_template also needs to be adjusted (I
> >> made
> >>>>>> a simple modification and it still failed, I think continuing
> >>>>>> like this may go against the original intention).
> >>>>>>>
> >>>>>>>   switch (i.tm.opcode_modifier.vexvvvv)
> >>>>>>>     {
> >>>>>>>     /* VEX.vvvv encodes the first source register operand.  */
> >>>>>>>     case VexVVVV_SRC1:
> >>>>>>>       if (!is_cpu (&i.tm, CpuXOP) || source == op)
> >>>>>>>         {
> >>>>>>>           v =  dest - 1;
> >>>>>>>           break;
> >>>>>>>         }
> >>>>>>>     /* For XOP: vpshl*, vpsha* and vprot*.  */
> >>>>>>>     /* Fall through.  */
> >>>>>>>     /* VEX.vvvv encodes the last source register operand.  */
> >>>>>>>     case VexVVVV_SRC2:
> >>>>>>>       v = source++;
> >>>>>>>       break;
> >>>>>>>     /* VEX.vvvv encodes the destination register operand.  */
> >>>>>>>     case VexVVVV_DST:
> >>>>>>>       v = dest--;
> >>>>>>>       break;
> >>>>>>>     default:
> >>>>>>>       v = ~0;
> >>>>>>>       break;
> >>>>>>>      }
> >>>>>>>
> >>>>>>> Do you think we should add a separate patch 4 for XOP that removes
> >>>>>>> the special handling in match_template and completes its templates,
> >>>>>>> so that we don't have to add special handling for Src1VVVV or
> >>>>>>> Src2VVVV? This might go against your desire to reduce template size,
> >>>>>>> but it would help simplify the logic. I'd like to know your thoughts.
> >>>>>>
> >>>>>> Indeed. You'd effectively revert earlier folding that I did. And
> >>>>>> the adjustment I suggested earlier ought to be small/simple enough.
> >>>>>>
> >>>>>
> >>>>> So I continued working on the previous suggestion. With the following
> >>>>> modification, it worked.
> >>>>>
> >>>>> @@ -8932,7 +8932,7 @@ match_template (char mnem_suffix)
> >>>>>                           || is_cpu (t, CpuAPX_F));
> >>>>>               if (!operand_type_match (overlap0, i.types[0])
> >>>>>                   || !operand_type_match (overlap1, i.types[j])
> >>>>> -                 || (t->operands == 3
> >>>>> +                 || (t->operands == 3 && !is_cpu (t, CpuXOP)
> >>>>>                       && !operand_type_match (overlap2, i.types[1]))
> >>>>>                   || (check_register
> >>>>>                       && !operand_type_register_match (i.types[0],
> >>>>
> >>>> Just to mention it - this certainly isn't what I suggested. In fact
> >>>> I seem to vaguely recall that something similar was once proposed
> >>>> during the original APX work as well, where I then objected, too.
> >>>>
> >>>>> But I found that 4 test files failed. I didn't find documentation on
> >>>>> how to encode vprotb, but I guess that is because I changed the
> >>>>> default template from Src2VVVV|VexW0 to Src1VVVV|VexW1, so all the
> >>>>> related test cases needed to be modified. Do you have any comments
> >>>>> here?
> >>>>>
> >>>>> regexp "^[      ]*[a-f0-9]+:    8f e9 40 90 d8[         ]+vprotb %xmm7,%xmm0,%xmm3$"
> >>>>> line   "    11ed:       8f e9 f8 90 df          vprotb %xmm7,%xmm0,%xmm3"
> >>>>
> >>>> Well, while changing the templates is in principle possible, and the
> >>>> resulting code would still be correct, changing encodings is usually
> >>>> not a good idea. Thus when it can be avoided, it should be avoided,
> >>>> imo. Hence why I didn't suggest this, but to amend the code doing the
> >>>> operand swapping (for the case where operand order is controlled by
> >>>> XOP.W).
> >>>>
> >>>
> >>> I mistakenly thought this was what you wanted, and I also think this
> >>> modification is unacceptable. Could you elaborate further on your
> >>> original suggestion?
> >>
> >> At the bottom of match_template() we have
> >>
> >>     case Opcode_VexW:
> >>       /* Only the first two register operands need reversing, alongside
> >> 	 flipping VEX.W.  */
> >>       i.tm.opcode_modifier.vexw ^= VEXW0 ^ VEXW1;
> >>
> >> This is where I think you further want to adjust
> >> i.tm.opcode_modifier.vexvvvv.
> >>
> >
> > "vprotb %xmm7,%xmm0,%xmm3" doesn't go through this place; it finds the
> > matching template directly (Src1VVVV|VexW1|NoSuf,
> > { RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }).
> 
> Of course, for not having a memory operand (and no pseudo prefix). But
> that's not the problem here - the problem is that for build_modrm_byte()
> you want to make adjustments when operands need swapping (compared to
> what the templates say).
> 
> > I inserted a check to solve this problem, but it's a bit ugly. I'd like
> > to abandon this optimization.
> 
> Didn't you have a working form earlier on? Could we use that, leaving the
> XOP stuff untouched for now, for me to see about looking into later?
> 

Yes, I will continue with these patches, but without the XOP optimization. Thanks for offering to look at that part later (a rough sketch of the suggested adjustment follows below).

The next 5 days are our Labor Day holiday, so email responses will be slower.

Regards,
Lili.
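
A minimal sketch of the adjustment Jan suggests above, at the Opcode_VexW handling near the bottom of match_template(). This is untested and assumes the VexVVVV_SRC1/VexVVVV_SRC2 values introduced by this series:

    case Opcode_VexW:
      /* Only the first two register operands need reversing, alongside
         flipping VEX.W.  */
      i.tm.opcode_modifier.vexw ^= VEXW0 ^ VEXW1;
      /* Hypothetical addition: once the sources are swapped, XOP.vvvv
         names the first source rather than the second, so flip the
         attribute accordingly.  */
      if (i.tm.opcode_modifier.vexvvvv == VexVVVV_SRC2)
        i.tm.opcode_modifier.vexvvvv = VexVVVV_SRC1;

Since the register-only forms match their template directly and never reach this code, a change along these lines would only affect the memory-operand forms that actually need the swap, i.e. the case the fall-through in build_modrm_byte() currently handles.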
  
Jan Beulich April 30, 2024, 9:22 a.m. UTC | #13
On 30.04.2024 09:34, Cui, Lili wrote:
> The next 5 days are our Labor Day, and email responses will be slower.

NP. As a heads up, I'll be on FTO following that, until the 13th.

Jan
  

Patch

diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
index 640a83c8b2d..de1c08404cd 100644
--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ -10434,6 +10434,14 @@  build_modrm_byte (void)
 
   switch (i.tm.opcode_modifier.vexvvvv)
     {
+    case VexVVVV_SRC2:
+      if (source != op)
+	{
+	  v = source++;
+	  break;
+	}
+      /* For XOP: vpshl* and vpsha*.  */
+      /* Fall through.  */
     case VexVVVV_SRC1:
       v =  dest - 1;
       break;
@@ -10452,14 +10460,6 @@  build_modrm_byte (void)
       dest = ~0;
     }
   gas_assert (source < dest);
-  if (i.tm.opcode_modifier.operandconstraint == SWAP_SOURCES
-      && source != op)
-    {
-      unsigned int tmp = source;
-
-      source = v;
-      v = tmp;
-    }
 
   if (v < MAX_OPERANDS)
     i.vex.register_specifier = i.op[v].regs;
diff --git a/opcodes/i386-opc.h b/opcodes/i386-opc.h
index 4ae87a3dbd5..fa482ca3d37 100644
--- a/opcodes/i386-opc.h
+++ b/opcodes/i386-opc.h
@@ -570,17 +570,15 @@  enum
      It implicitly denotes the register group of {x,y,z}mmN - {x,y,z}mm(N + 3).
    */
 #define IMPLICIT_QUAD_GROUP 5
-  /* Two source operands are swapped.  */
-#define SWAP_SOURCES 6
   /* Default mask isn't allowed.  */
-#define NO_DEFAULT_MASK 7
+#define NO_DEFAULT_MASK 6
   /* Address prefix changes register operand */
-#define ADDR_PREFIX_OP_REG 8
+#define ADDR_PREFIX_OP_REG 7
   /* Instrucion requires that destination must be distinct from source
      registers.  */
-#define DISTINCT_DEST 9
+#define DISTINCT_DEST 8
   /* Instruction updates stack pointer implicitly.  */
-#define IMPLICIT_STACK_OP 10
+#define IMPLICIT_STACK_OP 9
   OperandConstraint,
   /* instruction ignores operand size prefix and in Intel mode ignores
      mnemonic size suffix check.  */
@@ -640,9 +638,11 @@  enum
   Vex,
   /* How to encode VEX.vvvv:
      1: VEX.vvvv encodes the src1 register operand.
+     2: VEX.vvvv encodes the src2 register operand.
      3: VEX.vvvv encodes the dest register operand.
    */
 #define VexVVVV_SRC1   1
+#define VexVVVV_SRC2   2
 #define VexVVVV_DST    3
 
   VexVVVV,
diff --git a/opcodes/i386-opc.tbl b/opcodes/i386-opc.tbl
index 31c769857dc..143ac4e5597 100644
--- a/opcodes/i386-opc.tbl
+++ b/opcodes/i386-opc.tbl
@@ -83,7 +83,6 @@ 
 #define ImplicitQuadGroup OperandConstraint=IMPLICIT_QUAD_GROUP
 #define NoDefMask         OperandConstraint=NO_DEFAULT_MASK
 #define RegKludge         OperandConstraint=REG_KLUDGE
-#define SwapSources       OperandConstraint=SWAP_SOURCES
 #define Ugh               OperandConstraint=UGH
 #define ImplicitStackOp   OperandConstraint=IMPLICIT_STACK_OP
 
@@ -142,6 +141,7 @@ 
 #define Disp8ShiftVL Disp8MemShift=DISP8_SHIFT_VL
 
 #define Src1VVVV VexVVVV=VexVVVV_SRC1
+#define Src2VVVV VexVVVV=VexVVVV_SRC2
 #define DstVVVV  VexVVVV=VexVVVV_DST
 
 
@@ -1798,18 +1798,18 @@  vpsravd, 0x6646, AVX2|AVX512F, Modrm|Vex|EVexDYN|Masking|Space0F38|Src1VVVV|VexW
 vpsrlv<dq>, 0x6645, AVX2|AVX512F, Modrm|Vex|EVexDYN|Masking|Space0F38|Src1VVVV|<dq:vexw>|Broadcast|Disp8ShiftVL|CheckOperandSize|NoSuf, { RegXMM|RegYMM|RegZMM|<dq:elem>|Unspecified|BaseIndex, RegXMM|RegYMM|RegZMM, RegXMM|RegYMM|RegZMM }
 
 // AVX gather instructions
-vgatherdpd, 0x6692, AVX2, Modrm|Vex|Space0F38|Src1VVVV|VexW1|SwapSources|CheckOperandSize|NoSuf|VecSIB128, { RegXMM|RegYMM, Qword|Unspecified|BaseIndex, RegXMM|RegYMM }
-vgatherdps, 0x6692, AVX2, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf|VecSIB128, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
-vgatherdps, 0x6692, AVX2, Modrm|Vex256|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf|VecSIB256, { RegYMM, Dword|Unspecified|BaseIndex, RegYMM }
-vgatherqp<sd>, 0x6693, AVX2, Modrm|Vex|Space0F38|Src1VVVV|<sd:vexw>|SwapSources|NoSuf|VecSIB128, { RegXMM, <sd:elem>|Unspecified|BaseIndex, RegXMM }
-vgatherqpd, 0x6693, AVX2, Modrm|Vex256|Space0F38|Src1VVVV|VexW1|SwapSources|NoSuf|VecSIB256, { RegYMM, Qword|Unspecified|BaseIndex, RegYMM }
-vgatherqps, 0x6693, AVX2, Modrm|Vex256|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf|VecSIB256, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
-vpgatherdd, 0x6690, AVX2, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf|VecSIB128, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
-vpgatherdd, 0x6690, AVX2, Modrm|Vex256|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf|VecSIB256, { RegYMM, Dword|Unspecified|BaseIndex, RegYMM }
-vpgatherdq, 0x6690, AVX2, Modrm|Vex|Space0F38|Src1VVVV|VexW1|SwapSources|CheckOperandSize|NoSuf|VecSIB128, { RegXMM|RegYMM, Qword|Unspecified|BaseIndex, RegXMM|RegYMM }
-vpgatherq<dq>, 0x6691, AVX2, Modrm|Vex128|Space0F38|Src1VVVV|<dq:vexw>|SwapSources|NoSuf|VecSIB128, { RegXMM, <dq:elem>|Unspecified|BaseIndex, RegXMM }
-vpgatherqd, 0x6691, AVX2, Modrm|Vex256|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf|VecSIB256, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
-vpgatherqq, 0x6691, AVX2, Modrm|Vex256|Space0F38|Src1VVVV|VexW1|SwapSources|NoSuf|VecSIB256, { RegYMM, Qword|Unspecified|BaseIndex, RegYMM }
+vgatherdpd, 0x6692, AVX2, Modrm|Vex|Space0F38|Src2VVVV|VexW1|CheckOperandSize|NoSuf|VecSIB128, { RegXMM|RegYMM, Qword|Unspecified|BaseIndex, RegXMM|RegYMM }
+vgatherdps, 0x6692, AVX2, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf|VecSIB128, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
+vgatherdps, 0x6692, AVX2, Modrm|Vex256|Space0F38|Src2VVVV|VexW0|NoSuf|VecSIB256, { RegYMM, Dword|Unspecified|BaseIndex, RegYMM }
+vgatherqp<sd>, 0x6693, AVX2, Modrm|Vex|Space0F38|Src2VVVV|<sd:vexw>|NoSuf|VecSIB128, { RegXMM, <sd:elem>|Unspecified|BaseIndex, RegXMM }
+vgatherqpd, 0x6693, AVX2, Modrm|Vex256|Space0F38|Src2VVVV|VexW1|NoSuf|VecSIB256, { RegYMM, Qword|Unspecified|BaseIndex, RegYMM }
+vgatherqps, 0x6693, AVX2, Modrm|Vex256|Space0F38|Src2VVVV|VexW0|NoSuf|VecSIB256, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
+vpgatherdd, 0x6690, AVX2, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf|VecSIB128, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
+vpgatherdd, 0x6690, AVX2, Modrm|Vex256|Space0F38|Src2VVVV|VexW0|NoSuf|VecSIB256, { RegYMM, Dword|Unspecified|BaseIndex, RegYMM }
+vpgatherdq, 0x6690, AVX2, Modrm|Vex|Space0F38|Src2VVVV|VexW1|CheckOperandSize|NoSuf|VecSIB128, { RegXMM|RegYMM, Qword|Unspecified|BaseIndex, RegXMM|RegYMM }
+vpgatherq<dq>, 0x6691, AVX2, Modrm|Vex128|Space0F38|Src2VVVV|<dq:vexw>|NoSuf|VecSIB128, { RegXMM, <dq:elem>|Unspecified|BaseIndex, RegXMM }
+vpgatherqd, 0x6691, AVX2, Modrm|Vex256|Space0F38|Src2VVVV|VexW0|NoSuf|VecSIB256, { RegXMM, Dword|Unspecified|BaseIndex, RegXMM }
+vpgatherqq, 0x6691, AVX2, Modrm|Vex256|Space0F38|Src2VVVV|VexW1|NoSuf|VecSIB256, { RegYMM, Qword|Unspecified|BaseIndex, RegYMM }
 
 // AES + AVX
 
@@ -1879,14 +1879,14 @@  xtest, 0xf01d6, HLE|RTM, NoSuf, {}
 
 // BMI2 instructions.
 
-bzhi, 0xf5, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|SwapSources|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
+bzhi, 0xf5, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src2VVVV|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 mulx, 0xf2f6, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
 pdep, 0xf2f5, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
 pext, 0xf3f5, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
 rorx, 0xf2f0, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F3A|No_bSuf|No_wSuf|No_sSuf, { Imm8|Imm8S, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
-sarx, 0xf3f7, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|SwapSources|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
-shlx, 0x66f7, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|SwapSources|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
-shrx, 0xf2f7, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|SwapSources|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
+sarx, 0xf3f7, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src2VVVV|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
+shlx, 0x66f7, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src2VVVV|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
+shrx, 0xf2f7, APX_F(BMI2), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src2VVVV|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 
 // FMA4 instructions
 
@@ -1938,10 +1938,10 @@  vpmacsww, 0x95, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, Reg
 vpmadcsswd, 0xa6, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
 vpmadcswd, 0xb6, XOP, Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
 vpperm, 0xa3, XOP, D|Modrm|Vex128|SpaceXOP08|Src1VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM, RegXMM }
-vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|SwapSources|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
+vprot<xop>, 0x90 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
 vprot<xop>, 0xc0 | <xop:opc>, XOP, Modrm|Vex128|SpaceXOP08|VexW0|NoSuf, { Imm8|Imm8S, RegXMM|Unspecified|BaseIndex, RegXMM }
-vpsha<xop>, 0x98 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|SwapSources|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
-vpshl<xop>, 0x94 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src1VVVV|SwapSources|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
+vpsha<xop>, 0x98 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
+vpshl<xop>, 0x94 | <xop:opc>, XOP, D|Modrm|Vex128|SpaceXOP09|Src2VVVV|VexW0|NoSuf, { RegXMM, RegXMM|Unspecified|BaseIndex, RegXMM }
 
 <xop>
 <irel>
@@ -1957,7 +1957,7 @@  lwpins, 0x12/0, LWP, Modrm|SpaceXOP0A|NoSuf|Src1VVVV|Vex, { Imm32|Imm32S, Reg32|
 // BMI instructions
 
 andn, 0xf2, APX_F(BMI), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
-bextr, 0xf7, APX_F(BMI), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|SwapSources|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
+bextr, 0xf7, APX_F(BMI), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src2VVVV|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64, Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 blsi, 0xf3/3, APX_F(BMI), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 blsmsk, 0xf3/2, APX_F(BMI), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 blsr, 0xf3/1, APX_F(BMI), Modrm|CheckOperandSize|Vex128|EVex128|Space0F38|Src1VVVV|No_bSuf|No_wSuf|No_sSuf|NF, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
@@ -3192,15 +3192,15 @@  xresldtrk, 0xf20f01e9, TSXLDTRK, NoSuf, {}
 ldtilecfg, 0x49/0, APX_F(AMX_TILE), Modrm|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex }
 sttilecfg, 0x6649/0, APX_F(AMX_TILE), Modrm|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex }
 
-tcmmimfp16ps, 0x666c, AMX_COMPLEX, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
-tcmmrlfp16ps, 0x6c, AMX_COMPLEX, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
+tcmmimfp16ps, 0x666c, AMX_COMPLEX, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
+tcmmrlfp16ps, 0x6c, AMX_COMPLEX, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
 
-tdpbf16ps, 0xf35c, AMX_BF16, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
-tdpfp16ps, 0xf25c, AMX_FP16, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
-tdpbssd, 0xf25e, AMX_INT8, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
-tdpbuud, 0x5e, AMX_INT8, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
-tdpbusd, 0x665e, AMX_INT8, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
-tdpbsud, 0xf35e, AMX_INT8, Modrm|Vex128|Space0F38|Src1VVVV|VexW0|SwapSources|NoSuf, { RegTMM, RegTMM, RegTMM }
+tdpbf16ps, 0xf35c, AMX_BF16, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
+tdpfp16ps, 0xf25c, AMX_FP16, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
+tdpbssd, 0xf25e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
+tdpbuud, 0x5e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
+tdpbusd, 0x665e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
+tdpbsud, 0xf35e, AMX_INT8, Modrm|Vex128|Space0F38|Src2VVVV|VexW0|NoSuf, { RegTMM, RegTMM, RegTMM }
 
 tileloadd, 0xf24b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex, RegTMM }
 tileloaddt1, 0x664b, APX_F(AMX_TILE), Sibmem|Vex128|EVex128|Space0F38|VexW0|NoSuf, { Unspecified|BaseIndex, RegTMM }
@@ -3372,7 +3372,7 @@  prefetchit1, 0xf18/6, PREFETCHI, Modrm|Anysize|IgnoreSize|NoSuf, { BaseIndex }
 
 // CMPCCXADD instructions.
 
-cmp<cc>xadd, 0x66e<cc:opc>, APX_F(CMPCCXADD), Modrm|Vex|EVex128|Space0F38|Src1VVVV|SwapSources|CheckOperandSize|NoSuf, { Reg32|Reg64, Reg32|Reg64, Dword|Qword|Unspecified|BaseIndex }
+cmp<cc>xadd, 0x66e<cc:opc>, APX_F(CMPCCXADD), Modrm|Vex|EVex128|Space0F38|Src2VVVV|CheckOperandSize|NoSuf, { Reg32|Reg64, Reg32|Reg64, Dword|Qword|Unspecified|BaseIndex }
 
 // CMPCCXADD instructions end.
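
For readers skimming the diff, here is the new vvvv selection in build_modrm_byte() again with its users annotated; the comments below merely restate which templates this patch moves to Src2VVVV and are an illustration, not part of the patch:

  switch (i.tm.opcode_modifier.vexvvvv)
    {
    case VexVVVV_SRC2:
      /* vvvv names the second source: bextr, bzhi, sarx/shlx/shrx,
         cmp<cc>xadd, the AMX tile arithmetic insns and the VEX gathers,
         i.e. everything that previously needed Src1VVVV|SwapSources.  */
      if (source != op)
        {
          v = source++;
          break;
        }
      /* For XOP: vpshl*, vpsha* and vprot*.  */
      /* Fall through.  */
    case VexVVVV_SRC1:
      /* vvvv names the first source: andn, mulx, pdep, pext and the
         bulk of the VEX/EVEX templates.  */
      v = dest - 1;
      break;
    case VexVVVV_DST:
      v = dest--;
      break;
    default:
      v = ~0;
      break;
    }

For example, for "bextr %ecx,(%rax),%edx" the operand that lands in VEX.vvvv is the control register %ecx (the second source in Intel operand order), which is why bextr moves from Src1VVVV|SwapSources to plain Src2VVVV above.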