[v1] x86-64: Replace `%ah` write with `%eax` read

Message ID 20230310024420.521941-1-goldstein.w.n@gmail.com
State Superseded
Series [v1] x86-64: Replace `%ah` write with `%eax` read

Checks

Context | Check | Description
dj/TryBot-apply_patch | success | Patch applied to master at the time it was sent
dj/TryBot-32bit | success | Build for i686

Commit Message

Noah Goldstein March 10, 2023, 2:44 a.m. UTC
  High8 partial registers can incur a stall when being modified (if not
renamed separately), or at the very least incur extra backend uops (if
renamed separately). Either way `testl $0x0400, %eax` is preferable to
`andb $0x04, %ah`.

Function size is unchanged when accounting for 16-byte padding.
---
 sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

H.J. Lu March 10, 2023, 4:38 p.m. UTC | #1
On Thu, Mar 9, 2023 at 6:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> High8 partial registers can incur a stall when being modified (if not
> renamed separately), or at the very least incur extra backend uops (if
> renamed separately). Either way `testl $0x0400, %eax` is preferable to
> `andb $0x04, %ah`.
>
> Function size is unchanged when accounting for 16-byte padding.
> ---
>  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
> index d754668bce..d45f984e1a 100644
> --- a/sysdeps/x86_64/fpu/e_fmodl.S
> +++ b/sysdeps/x86_64/fpu/e_fmodl.S
> @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
>         fldt    8(%rsp)
>  1:     fprem
>         fstsw   %ax
> -       and     $04,%ah
> +       testl   $0x400,%eax
>         jnz     1b
>         fstp    %st(1)
>         ret
> --
> 2.34.1
>

OK.

Thanks.
  
Florian Weimer March 13, 2023, 8:03 a.m. UTC | #2
* Noah Goldstein via Libc-alpha:

> High8 partial registers can incur a stall when being modified (if not
> renamed separately), or at the very least incur extra backend uops (if
> renamed separately). Either way `testl $0x0400, %eax` is preferable to
> `andb $0x04, %ah`.
>
> Function size is unchanged when accounting for 16-byte padding.
> ---
>  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
> index d754668bce..d45f984e1a 100644
> --- a/sysdeps/x86_64/fpu/e_fmodl.S
> +++ b/sysdeps/x86_64/fpu/e_fmodl.S
> @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
>  	fldt	8(%rsp)
>  1:	fprem
>  	fstsw	%ax
> -	and	$04,%ah
> +	testl	$0x400,%eax

Why not test $0x400,%ax or test $04,%ah?

Thanks,
Florian
  
Noah Goldstein March 13, 2023, 4:59 p.m. UTC | #3
On Mon, Mar 13, 2023 at 3:03 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein via Libc-alpha:
>
> > High8 partial registers can incur a stall when being modified (if not
> > renamed separately), or at the very least incur extra backend uops (if
> > renamed separately). Either way `testl $0x0400, %eax` is preferable to
> > `andb $0x04, %ah`.
> >
> > Function size is unchanged when accounting for 16-byte padding.
> > ---
> >  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
> > index d754668bce..d45f984e1a 100644
> > --- a/sysdeps/x86_64/fpu/e_fmodl.S
> > +++ b/sysdeps/x86_64/fpu/e_fmodl.S
> > @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
> >       fldt    8(%rsp)
> >  1:   fprem
> >       fstsw   %ax
> > -     and     $04,%ah
> > +     testl   $0x400,%eax
>
> Why not test $0x400,%ax or test $04,%ah?
`test $0x400,%ax` uses an imm16, which can cause length-changing-prefix
(LCP) stalls because of the `0x66` operand-size prefix.
`test $0x4,%ah` is more acceptable, but partial-register usage has several
delays associated with it, even for pure reads. The cost depends on the
architecture, but hsw/skl (Haswell/Skylake), for example, add 2c of
latency (in this case, where %ah is not being renamed separately).
In general, if you don't need the code size, it is best to stick with
32/64-bit instructions.
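
To make the size and prefix trade-off above concrete, here is a small Python sketch that lists hand-assembled encodings of the instructions under discussion. The byte values follow the standard x86 opcode map and were written out for illustration; they are not taken from the thread or from an assembler run on e_fmodl.S. `ENCODINGS` and `has_length_changing_prefix` are hypothetical names, and the helper only recognizes the single `66 A9 iw` pattern that appears in this table.

```python
# Hand-assembled encodings (AT&T syntax) of the candidate instructions.
# Assumption: standard x86 opcode map; not output copied from the thread.
ENCODINGS = {
    "and $0x4,%ah":      bytes([0x80, 0xE4, 0x04]),              # 80 /4 ib, writes %ah
    "test $0x4,%ah":     bytes([0xF6, 0xC4, 0x04]),              # F6 /0 ib, only reads %ah
    "test $0x400,%ax":   bytes([0x66, 0xA9, 0x00, 0x04]),        # 66 prefix + A9 + imm16
    "testl $0x400,%eax": bytes([0xA9, 0x00, 0x04, 0x00, 0x00]),  # A9 + imm32
}


def has_length_changing_prefix(code: bytes) -> bool:
    """Return True when a 0x66 prefix shrinks the immediate from 32 to 16 bits.

    Simplified on purpose: it only matches the `66 A9 iw` form in ENCODINGS.
    """
    return len(code) == 4 and code[0] == 0x66 and code[1] == 0xA9


for asm, code in ENCODINGS.items():
    hex_bytes = " ".join(f"{b:02x}" for b in code)
    lcp = has_length_changing_prefix(code)
    print(f"{asm:20} {hex_bytes:16} {len(code)} bytes  LCP={lcp}")
```

In this table the 32-bit form is two bytes longer than the original `and`, which fits the commit message's remark that the function size is unchanged only once 16-byte padding is taken into account.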

>
> Thanks,
> Florian
>
  
Florian Weimer March 13, 2023, 5:30 p.m. UTC | #4
* Noah Goldstein:

> On Mon, Mar 13, 2023 at 3:03 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Noah Goldstein via Libc-alpha:
>>
>> > High8 partial registers can incur a stall when being modified (if not
>> > renamed separately), or at the very least incur extra backend uops (if
>> > renamed separately). Either way `testl $0x0400, %eax` is preferable to
>> > `andb $0x04, %ah`.
>> >
>> > Function size is unchanged when accounting for 16-byte padding.
>> > ---
>> >  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
>> > index d754668bce..d45f984e1a 100644
>> > --- a/sysdeps/x86_64/fpu/e_fmodl.S
>> > +++ b/sysdeps/x86_64/fpu/e_fmodl.S
>> > @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
>> >       fldt    8(%rsp)
>> >  1:   fprem
>> >       fstsw   %ax
>> > -     and     $04,%ah
>> > +     testl   $0x400,%eax
>>
>> Why not test $0x400,%ax or test $04,%ah?
> `test $0x400,%ax` uses imm16 which can cause length-changing-prefix
> (`0x66` in the opcode) stalls.
> `test $0x4,%ah` is more okay, but partial register usage has several
> delays associated with it (even pure
> reads), depends on arch but for example hwl/skl have 2c latency added
> (in this case where %ah is not
> being renamed separately).
> In general, if you don't need the code size, best to stick with
> 32/64-bit instructions.

Do we need to clear %eax first to avoid a false dependency?

Thanks,
Florian
  
Noah Goldstein March 13, 2023, 8:49 p.m. UTC | #5
On Mon, Mar 13, 2023 at 12:30 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Noah Goldstein:
>
> > On Mon, Mar 13, 2023 at 3:03 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Noah Goldstein via Libc-alpha:
> >>
> >> > High8 partial registers can incur a stall when being modified (if not
> >> > renamed separately), or at the very least incur extra backend uops (if
> >> > renamed separately). Either way `testl $0x0400, %eax` is preferable to
> >> > `andb $0x04, %ah`.
> >> >
> >> > Function size is unchanged when accounting for 16-byte padding.
> >> > ---
> >> >  sysdeps/x86_64/fpu/e_fmodl.S | 2 +-
> >> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
> >> > index d754668bce..d45f984e1a 100644
> >> > --- a/sysdeps/x86_64/fpu/e_fmodl.S
> >> > +++ b/sysdeps/x86_64/fpu/e_fmodl.S
> >> > @@ -13,7 +13,7 @@ ENTRY(__ieee754_fmodl)
> >> >       fldt    8(%rsp)
> >> >  1:   fprem
> >> >       fstsw   %ax
> >> > -     and     $04,%ah
> >> > +     testl   $0x400,%eax
> >>
> >> Why not test $0x400,%ax or test $04,%ah?
> > `test $0x400,%ax` uses imm16 which can cause length-changing-prefix
> > (`0x66` in the opcode) stalls.
> > `test $0x4,%ah` is more okay, but partial register usage has several
> > delays associated with it (even pure
> > reads), depends on arch but for example hwl/skl have 2c latency added
> > (in this case where %ah is not
> > being renamed separately).
> > In general, if you don't need the code size, best to stick with
> > 32/64-bit instructions.
>
> Do we need to clear %eax first to avoid a false dependency?

Oh yeah, I guess you're right. `test $0x4,%ah` is probably best.
>
> Thanks,
> Florian
>
  

Patch

diff --git a/sysdeps/x86_64/fpu/e_fmodl.S b/sysdeps/x86_64/fpu/e_fmodl.S
index d754668bce..d45f984e1a 100644
--- a/sysdeps/x86_64/fpu/e_fmodl.S
+++ b/sysdeps/x86_64/fpu/e_fmodl.S
@@ -13,7 +13,7 @@  ENTRY(__ieee754_fmodl)
 	fldt	8(%rsp)
 1:	fprem
 	fstsw	%ax
-	and	$04,%ah
+	testl	$0x400,%eax
 	jnz	1b
 	fstp	%st(1)
 	ret
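
The patch above relies on the old and new checks testing the same bit: `fprem` sets condition flag C2 (bit 10 of the x87 status word) while the reduction is incomplete, and bit 10 of `%ax` is bit 2 of `%ah`. The following Python sketch models both checks and confirms they agree for every status word. The function names are hypothetical, and the model covers only the branch condition, not the fact that `and` also overwrites `%ah` while `test` leaves it intact.

```python
C2_MASK = 0x0400  # bit 10 of the x87 status word (condition flag C2)


def loop_again_old(status_word: int) -> bool:
    """Model of `and $04,%ah; jnz 1b`: test bit 2 of the high byte of %ax."""
    ah = (status_word >> 8) & 0xFF
    return (ah & 0x04) != 0


def loop_again_new(eax: int) -> bool:
    """Model of `testl $0x400,%eax; jnz 1b`: test bit 10 of the 32-bit register."""
    return (eax & C2_MASK) != 0


# `fstsw %ax` writes only the low 16 bits, so the upper half of %eax holds
# whatever was there before. The mask never touches those bits, so the two
# checks agree for any stale upper half.
for upper in (0x0000, 0xFFFF, 0xDEAD):  # arbitrary stale upper halves
    for sw in range(0x10000):
        eax = (upper << 16) | sw
        assert loop_again_old(sw) == loop_again_new(eax)
```

The stale upper half does not affect the result, but it is still read by the 32-bit `test`, which is the false-dependency concern raised in comment #4.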