x86-64: Compile branred.c with -mprefer-vector-width=128

Message ID CAMe9rOoP_293SM=sYpuqx5Yg9K1a-F9GtN6g0DDFjHTZM5OkcA@mail.gmail.com
State Committed
Headers

Commit Message

H.J. Lu June 7, 2019, 5:59 p.m. UTC
  -O3 with AVX vectorizes some loops in sysdeps/ieee754/dbl-64/branred.c
with 256-bit vector instructions, which leads to store forward stall:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579

There is no easy fix in compiler.  This patch limits vector width to
128 bits to work around this issue.  It improves performance of sin
and cos by more than 40% on Skylake compiled with -O3 -march=skylake.

OK for master branch?

* sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New.  Set
to -mprefer-vector-width=128.
  

Comments

Florian Weimer June 7, 2019, 6:53 p.m. UTC | #1
* H. J. Lu:

> -O3 with AVX vectorizes some loops in sysdeps/ieee754/dbl-64/branred.c
> with 256-bit vector instructions, which leads to store forward stall:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
>
> There is no easy fix in compiler.  This patch limits vector width to
> 128 bits to work around this issue.  It improves performance of sin
> and cos by more than 40% on Skylake compiled with -O3 -march=skylake.
>
> OK for master branch?
>
> * sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New.  Set
> to -mprefer-vector-width=128.

This is bug 24603, right?

Let's hope that the reproducer in the test case is misleadingly reduced,
and we can fix the actual issue in the compiler.  I updated the GCC PR.

Thanks,
Florian
  
H.J. Lu June 7, 2019, 7:26 p.m. UTC | #2
On Fri, Jun 7, 2019 at 11:53 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > -O3 with AVX vectorizes some loops in sysdeps/ieee754/dbl-64/branred.c
> > with 256-bit vector instructions, which leads to store forward stall:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
> >
> > There is no easy fix in compiler.  This patch limits vector width to
> > 128 bits to work around this issue.  It improves performance of sin
> > and cos by more than 40% on Skylake compiled with -O3 -march=skylake.
> >
> > OK for master branch?
> >
> > * sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New.  Set
> > to -mprefer-vector-width=128.
>
> This is bug 24603, right?

Yes.

> Let's hope that the reproducer in the test case is misleadingly reduced,
> and we can fix the actual issue in the compiler.  I updated the GCC PR.
>

It makes when arrays are 32-byte aligned.
  
Joseph Myers June 10, 2019, 3:45 p.m. UTC | #3
On Fri, 7 Jun 2019, H.J. Lu wrote:

> -O3 with AVX vectorizes some loops in sysdeps/ieee754/dbl-64/branred.c
> with 256-bit vector instructions, which leads to store forward stall:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
> 
> There is no easy fix in compiler.  This patch limits vector width to
> 128 bits to work around this issue.  It improves performance of sin
> and cos by more than 40% on Skylake compiled with -O3 -march=skylake.
> 
> OK for master branch?
> 
> * sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New.  Set
> to -mprefer-vector-width=128.

This is a case where the Makefile definition clearly needs a comment 
explaining the issue, stating what GCC version is known to have the issue 
and pointing to the GCC bug.
  
H.J. Lu June 11, 2019, 3:52 p.m. UTC | #4
On Mon, Jun 10, 2019 at 8:45 AM Joseph Myers <joseph@codesourcery.com> wrote:
>
> On Fri, 7 Jun 2019, H.J. Lu wrote:
>
> > -O3 with AVX vectorizes some loops in sysdeps/ieee754/dbl-64/branred.c
> > with 256-bit vector instructions, which leads to store forward stall:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
> >
> > There is no easy fix in compiler.  This patch limits vector width to
> > 128 bits to work around this issue.  It improves performance of sin
> > and cos by more than 40% on Skylake compiled with -O3 -march=skylake.
> >
> > OK for master branch?
> >
> > * sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New.  Set
> > to -mprefer-vector-width=128.
>
> This is a case where the Makefile definition clearly needs a comment
> explaining the issue, stating what GCC version is known to have the issue
> and pointing to the GCC bug.
>

Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
GCC 7 doesn't have this problem since it doesn't vectorize the second loop.

Here is the updated patch.

OK for master?

Thanks.
  
Florian Weimer June 11, 2019, 4:02 p.m. UTC | #5
* H. J. Lu:

> Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
> GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
>
> Here is the updated patch.
>
> OK for master?

I still don't understand why we need a workaround for a GCC performance
bug in glibc.  Is there any precedent for this?

Thanks,
Florian
  
H.J. Lu June 11, 2019, 4:07 p.m. UTC | #6
On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
> > GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
> >
> > Here is the updated patch.
> >
> > OK for master?
>
> I still don't understand why we need a workaround for a GCC performance
> bug in glibc.  Is there any precedent for this?
>

I don't know if we ever investigated performance impacts of different GCC
optimizations on glibc.
  
H.J. Lu June 11, 2019, 4:11 p.m. UTC | #7
On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * H. J. Lu:
> >
> > > Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
> > > GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
> > >
> > > Here is the updated patch.
> > >
> > > OK for master?
> >
> > I still don't understand why we need a workaround for a GCC performance
> > bug in glibc.  Is there any precedent for this?
> >
>
> I don't know if we ever investigated performance impacts of different GCC
> optimizations on glibc.
>

I found:

sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
max-variable-expansions-in-unroller=2 --param max-unroll-times=2
-funroll-loops -fpeel-loops
  
Florian Weimer June 11, 2019, 4:16 p.m. UTC | #8
* H. J. Lu:

> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
>> >
>> > * H. J. Lu:
>> >
>> > > Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
>> > > GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
>> > >
>> > > Here is the updated patch.
>> > >
>> > > OK for master?
>> >
>> > I still don't understand why we need a workaround for a GCC performance
>> > bug in glibc.  Is there any precedent for this?
>> >
>>
>> I don't know if we ever investigated performance impacts of different GCC
>> optimizations on glibc.
>>
>
> I found:
>
> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
> -funroll-loops -fpeel-loops

That's been there since before 2012.  This reflects my main concern
about such tweaks: That they will never be reviewed and removed when
they are no longer necessary (and probably have negative impact).

Thanks,
Florian
  
H.J. Lu June 11, 2019, 4:41 p.m. UTC | #9
On Tue, Jun 11, 2019 at 9:16 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>
> >> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
> >> >
> >> > * H. J. Lu:
> >> >
> >> > > Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
> >> > > GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
> >> > >
> >> > > Here is the updated patch.
> >> > >
> >> > > OK for master?
> >> >
> >> > I still don't understand why we need a workaround for a GCC performance
> >> > bug in glibc.  Is there any precedent for this?
> >> >
> >>
> >> I don't know if we ever investigated performance impacts of different GCC
> >> optimizations on glibc.
> >>
> >
> > I found:
> >
> > sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
> > max-variable-expansions-in-unroller=2 --param max-unroll-times=2
> > -funroll-loops -fpeel-loops
>
> That's been there since before 2012.  This reflects my main concern
> about such tweaks: That they will never be reviewed and removed when
> they are no longer necessary (and probably have negative impact).
>

In the case of sysdeps/ieee754/dbl-64/branred.c, 128-bit vector
instructions give better performance that  256-bit .  It won't change
with different or new compilers.
  
Adhemerval Zanella Netto June 11, 2019, 5:10 p.m. UTC | #10
On 11/06/2019 13:16, Florian Weimer wrote:
> * H. J. Lu:
> 
>> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>
>>>> * H. J. Lu:
>>>>
>>>>> Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
>>>>> GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
>>>>>
>>>>> Here is the updated patch.
>>>>>
>>>>> OK for master?
>>>>
>>>> I still don't understand why we need a workaround for a GCC performance
>>>> bug in glibc.  Is there any precedent for this?
>>>>
>>>
>>> I don't know if we ever investigated performance impacts of different GCC
>>> optimizations on glibc.
>>>
>>
>> I found:
>>
>> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
>> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
>> -funroll-loops -fpeel-loops
> 
> That's been there since before 2012.  This reflects my main concern
> about such tweaks: That they will never be reviewed and removed when
> they are no longer necessary (and probably have negative impact).
> 

In fact I have been playing in make powerpc sysdeps less convoluted and
I almost sure this snippet could be removed without performance differences,
as for some mpa.c hacks that tried to overcome some compiler not being
able to optimize that eventually turned out to produce worse results in
newer version and chips (c4c0848bbb7a4ad6ab8149abf982a0f10fd2821b).
  
Florian Weimer June 14, 2019, 1:58 p.m. UTC | #11
* Florian Weimer:

> * H. J. Lu:
>
>> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
>>> >
>>> > * H. J. Lu:
>>> >
>>> > > Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
>>> > > GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
>>> > >
>>> > > Here is the updated patch.
>>> > >
>>> > > OK for master?
>>> >
>>> > I still don't understand why we need a workaround for a GCC performance
>>> > bug in glibc.  Is there any precedent for this?
>>> >
>>>
>>> I don't know if we ever investigated performance impacts of different GCC
>>> optimizations on glibc.
>>>
>>
>> I found:
>>
>> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
>> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
>> -funroll-loops -fpeel-loops
>
> That's been there since before 2012.  This reflects my main concern
> about such tweaks: That they will never be reviewed and removed when
> they are no longer necessary (and probably have negative impact).

For the record, this is not a strong objection to your patch, H.J.

Thanks,
Florian
  
H.J. Lu June 14, 2019, 3:16 p.m. UTC | #12
On Fri, Jun 14, 2019 at 6:58 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Florian Weimer:
>
> > * H. J. Lu:
> >
> >> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>
> >>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>> >
> >>> > * H. J. Lu:
> >>> >
> >>> > > Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
> >>> > > GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
> >>> > >
> >>> > > Here is the updated patch.
> >>> > >
> >>> > > OK for master?
> >>> >
> >>> > I still don't understand why we need a workaround for a GCC performance
> >>> > bug in glibc.  Is there any precedent for this?
> >>> >
> >>>
> >>> I don't know if we ever investigated performance impacts of different GCC
> >>> optimizations on glibc.
> >>>
> >>
> >> I found:
> >>
> >> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
> >> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
> >> -funroll-loops -fpeel-loops
> >
> > That's been there since before 2012.  This reflects my main concern
> > about such tweaks: That they will never be reviewed and removed when
> > they are no longer necessary (and probably have negative impact).
>
> For the record, this is not a strong objection to your patch, H.J.
>

In this particular case, 256-bit vectorizer caused 40% slowdown in
sin/cos bench.
After the GCC bug is fixed, we should update it to check for the fixed GCC.

I'd like to have a resolution before glibc 2.31 is released.

Thanks.
  
Jeff Law June 14, 2019, 4:43 p.m. UTC | #13
On 6/14/19 7:58 AM, Florian Weimer wrote:
> * Florian Weimer:
> 
>> * H. J. Lu:
>>
>>> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>
>>>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>>
>>>>> * H. J. Lu:
>>>>>
>>>>>> Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
>>>>>> GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
>>>>>>
>>>>>> Here is the updated patch.
>>>>>>
>>>>>> OK for master?
>>>>>
>>>>> I still don't understand why we need a workaround for a GCC performance
>>>>> bug in glibc.  Is there any precedent for this?
>>>>>
>>>>
>>>> I don't know if we ever investigated performance impacts of different GCC
>>>> optimizations on glibc.
>>>>
>>>
>>> I found:
>>>
>>> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
>>> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
>>> -funroll-loops -fpeel-loops
>>
>> That's been there since before 2012.  This reflects my main concern
>> about such tweaks: That they will never be reviewed and removed when
>> they are no longer necessary (and probably have negative impact).
> 
> For the record, this is not a strong objection to your patch, H.J.
So one way to address this would be to somehow conditionalize the bits
on the GCC version where they were an issue.  If the version number is
higher then issue an error of some kind which would force someone to
look at the problem again.

But I certainly understand the concern here :-)

jeff
  
H.J. Lu June 17, 2019, 7:06 p.m. UTC | #14
On Fri, Jun 14, 2019 at 9:43 AM Jeff Law <law@redhat.com> wrote:
>
> On 6/14/19 7:58 AM, Florian Weimer wrote:
> > * Florian Weimer:
> >
> >> * H. J. Lu:
> >>
> >>> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >>>>
> >>>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>>>>
> >>>>> * H. J. Lu:
> >>>>>
> >>>>>> Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
> >>>>>> GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
> >>>>>>
> >>>>>> Here is the updated patch.
> >>>>>>
> >>>>>> OK for master?
> >>>>>
> >>>>> I still don't understand why we need a workaround for a GCC performance
> >>>>> bug in glibc.  Is there any precedent for this?
> >>>>>
> >>>>
> >>>> I don't know if we ever investigated performance impacts of different GCC
> >>>> optimizations on glibc.
> >>>>
> >>>
> >>> I found:
> >>>
> >>> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
> >>> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
> >>> -funroll-loops -fpeel-loops
> >>
> >> That's been there since before 2012.  This reflects my main concern
> >> about such tweaks: That they will never be reviewed and removed when
> >> they are no longer necessary (and probably have negative impact).
> >
> > For the record, this is not a strong objection to your patch, H.J.
> So one way to address this would be to somehow conditionalize the bits
> on the GCC version where they were an issue.  If the version number is
> higher then issue an error of some kind which would force someone to
> look at the problem again.
>
We know GCC 8 and 9 have this issue.  But we don't know when this issue
will be fixed in GCC and we should allow building glibc with GCC newer than
the current minimum version, including GCC 10.  How should GCC version
be checked here?
  
Jeff Law June 17, 2019, 7:09 p.m. UTC | #15
On 6/17/19 1:06 PM, H.J. Lu wrote:
> On Fri, Jun 14, 2019 at 9:43 AM Jeff Law <law@redhat.com> wrote:
>>
>> On 6/14/19 7:58 AM, Florian Weimer wrote:
>>> * Florian Weimer:
>>>
>>>> * H. J. Lu:
>>>>
>>>>> On Tue, Jun 11, 2019 at 9:07 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Jun 11, 2019 at 9:02 AM Florian Weimer <fweimer@redhat.com> wrote:
>>>>>>>
>>>>>>> * H. J. Lu:
>>>>>>>
>>>>>>>> Fixed.  I also added check for -mprefer-vector-width=128 which is new in GCC 8.
>>>>>>>> GCC 7 doesn't have this problem since it doesn't vectorize the second loop.
>>>>>>>>
>>>>>>>> Here is the updated patch.
>>>>>>>>
>>>>>>>> OK for master?
>>>>>>>
>>>>>>> I still don't understand why we need a workaround for a GCC performance
>>>>>>> bug in glibc.  Is there any precedent for this?
>>>>>>>
>>>>>>
>>>>>> I don't know if we ever investigated performance impacts of different GCC
>>>>>> optimizations on glibc.
>>>>>>
>>>>>
>>>>> I found:
>>>>>
>>>>> sysdeps/powerpc/powerpc64/power4/Makefile:CFLAGS-memmove.c += --param
>>>>> max-variable-expansions-in-unroller=2 --param max-unroll-times=2
>>>>> -funroll-loops -fpeel-loops
>>>>
>>>> That's been there since before 2012.  This reflects my main concern
>>>> about such tweaks: That they will never be reviewed and removed when
>>>> they are no longer necessary (and probably have negative impact).
>>>
>>> For the record, this is not a strong objection to your patch, H.J.
>> So one way to address this would be to somehow conditionalize the bits
>> on the GCC version where they were an issue.  If the version number is
>> higher then issue an error of some kind which would force someone to
>> look at the problem again.
>>
> We know GCC 8 and 9 have this issue.  But we don't know when this issue
> will be fixed in GCC and we should allow building glibc with GCC newer than
> the current minimum version, including GCC 10.  How should GCC version
> be checked here?
ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
which would trigger a reinvestigation roughly a year from now.

The glibc maintainers would have the final say about this -- it's just
an idea off the top of my head.

jeff
>
  
Florian Weimer June 17, 2019, 7:11 p.m. UTC | #16
* Jeff Law:

> ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
> which would trigger a reinvestigation roughly a year from now.

Do you suggest to put in a #warning for GCC 11, so that people can
configure with --disable-werror and still build with GCC 11?

Given that GCC 11 will probably add other warnings that break the build,
this proposal isn't entirely unreasonable (even though I generally
dislike such time bombs).

Thanks,
Florian
  
Joseph Myers June 17, 2019, 7:22 p.m. UTC | #17
On Mon, 17 Jun 2019, Florian Weimer wrote:

> * Jeff Law:
> 
> > ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
> > which would trigger a reinvestigation roughly a year from now.
> 
> Do you suggest to put in a #warning for GCC 11, so that people can
> configure with --disable-werror and still build with GCC 11?

I don't think that's a good idea.  We might want e.g. a helper script to 
find places in the source tree to revisit after some given GCC version is 
out / after some given GCC version is no longer supported for building 
glibc.  (E.g. DIAG_IGNORE_NEEDS_COMMENT calls where the version specified 
is now too old to build glibc and so we should see if the warning in 
question still appears with the oldest supported version.)  That implies 
having a limited number of standard ways to mark such places in the 
sources so they can be found later.
  
H.J. Lu June 17, 2019, 7:27 p.m. UTC | #18
On Mon, Jun 17, 2019 at 12:12 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Jeff Law:
>
> > ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
> > which would trigger a reinvestigation roughly a year from now.
>
> Do you suggest to put in a #warning for GCC 11, so that people can
> configure with --disable-werror and still build with GCC 11?
>
> Given that GCC 11 will probably add other warnings that break the build,
> this proposal isn't entirely unreasonable (even though I generally
> dislike such time bombs).
>

Since there is no issue in source, this requires an artificial warning
purely based on GCC version.   We can place GCC version check in
configure script with explicit -mprefer-vector-width=128 support.

--enable-mprefer-vector-width:

1.  Default.  For GCC 8, 9, 10, use -mprefer-vector-width=128.
For GCC 11 and above, issue a configure error.
2.  Enabled.  No GCC version check, use -mprefer-vector-width=128.
3.  Disabled.  No GCC version check, don't use -mprefer-vector-width=128.
  
Joseph Myers June 17, 2019, 9:37 p.m. UTC | #19
On Mon, 17 Jun 2019, H.J. Lu wrote:

> On Mon, Jun 17, 2019 at 12:12 PM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * Jeff Law:
> >
> > > ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
> > > which would trigger a reinvestigation roughly a year from now.
> >
> > Do you suggest to put in a #warning for GCC 11, so that people can
> > configure with --disable-werror and still build with GCC 11?
> >
> > Given that GCC 11 will probably add other warnings that break the build,
> > this proposal isn't entirely unreasonable (even though I generally
> > dislike such time bombs).
> >
> 
> Since there is no issue in source, this requires an artificial warning
> purely based on GCC version.   We can place GCC version check in
> configure script with explicit -mprefer-vector-width=128 support.
> 
> --enable-mprefer-vector-width:
> 
> 1.  Default.  For GCC 8, 9, 10, use -mprefer-vector-width=128.
> For GCC 11 and above, issue a configure error.

I don't think anything causing the build to break with newer GCC versions 
is appropriate.
  
Jeff Law June 17, 2019, 9:41 p.m. UTC | #20
On 6/17/19 3:37 PM, Joseph Myers wrote:
> On Mon, 17 Jun 2019, H.J. Lu wrote:
> 
>> On Mon, Jun 17, 2019 at 12:12 PM Florian Weimer <fweimer@redhat.com> wrote:
>>>
>>> * Jeff Law:
>>>
>>>> ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
>>>> which would trigger a reinvestigation roughly a year from now.
>>>
>>> Do you suggest to put in a #warning for GCC 11, so that people can
>>> configure with --disable-werror and still build with GCC 11?
>>>
>>> Given that GCC 11 will probably add other warnings that break the build,
>>> this proposal isn't entirely unreasonable (even though I generally
>>> dislike such time bombs).
>>>
>>
>> Since there is no issue in source, this requires an artificial warning
>> purely based on GCC version.   We can place GCC version check in
>> configure script with explicit -mprefer-vector-width=128 support.
>>
>> --enable-mprefer-vector-width:
>>
>> 1.  Default.  For GCC 8, 9, 10, use -mprefer-vector-width=128.
>> For GCC 11 and above, issue a configure error.
> 
> I don't think anything causing the build to break with newer GCC versions 
> is appropriate.
Fair enough -- hence a bit of a step away from the decision process in
my prior message on this subject.  I threw it out as a way to address
the issue Florian raised, but I strongly believe it's up to the glibc
team to decide how they want to go forward.

jeff
  
H.J. Lu June 20, 2019, 4:30 p.m. UTC | #21
On Mon, Jun 17, 2019 at 2:41 PM Jeff Law <law@redhat.com> wrote:
>
> On 6/17/19 3:37 PM, Joseph Myers wrote:
> > On Mon, 17 Jun 2019, H.J. Lu wrote:
> >
> >> On Mon, Jun 17, 2019 at 12:12 PM Florian Weimer <fweimer@redhat.com> wrote:
> >>>
> >>> * Jeff Law:
> >>>
> >>>> ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
> >>>> which would trigger a reinvestigation roughly a year from now.
> >>>
> >>> Do you suggest to put in a #warning for GCC 11, so that people can
> >>> configure with --disable-werror and still build with GCC 11?
> >>>
> >>> Given that GCC 11 will probably add other warnings that break the build,
> >>> this proposal isn't entirely unreasonable (even though I generally
> >>> dislike such time bombs).
> >>>
> >>
> >> Since there is no issue in source, this requires an artificial warning
> >> purely based on GCC version.   We can place GCC version check in
> >> configure script with explicit -mprefer-vector-width=128 support.
> >>
> >> --enable-mprefer-vector-width:
> >>
> >> 1.  Default.  For GCC 8, 9, 10, use -mprefer-vector-width=128.
> >> For GCC 11 and above, issue a configure error.
> >
> > I don't think anything causing the build to break with newer GCC versions
> > is appropriate.
> Fair enough -- hence a bit of a step away from the decision process in
> my prior message on this subject.  I threw it out as a way to address
> the issue Florian raised, but I strongly believe it's up to the glibc
> team to decide how they want to go forward.
>

I'd like to check in my patch to close the 40% performance gap with GCC
8, 9 and 10.  We will monitor performance changes in the future GCC.

Thanks.
  
H.J. Lu July 1, 2019, 7:20 p.m. UTC | #22
On Thu, Jun 20, 2019 at 9:30 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Mon, Jun 17, 2019 at 2:41 PM Jeff Law <law@redhat.com> wrote:
> >
> > On 6/17/19 3:37 PM, Joseph Myers wrote:
> > > On Mon, 17 Jun 2019, H.J. Lu wrote:
> > >
> > >> On Mon, Jun 17, 2019 at 12:12 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >>>
> > >>> * Jeff Law:
> > >>>
> > >>>> ISTM that 8, 9, 10 would use the new flag.  11 would issue an error
> > >>>> which would trigger a reinvestigation roughly a year from now.
> > >>>
> > >>> Do you suggest to put in a #warning for GCC 11, so that people can
> > >>> configure with --disable-werror and still build with GCC 11?
> > >>>
> > >>> Given that GCC 11 will probably add other warnings that break the build,
> > >>> this proposal isn't entirely unreasonable (even though I generally
> > >>> dislike such time bombs).
> > >>>
> > >>
> > >> Since there is no issue in source, this requires an artificial warning
> > >> purely based on GCC version.   We can place GCC version check in
> > >> configure script with explicit -mprefer-vector-width=128 support.
> > >>
> > >> --enable-mprefer-vector-width:
> > >>
> > >> 1.  Default.  For GCC 8, 9, 10, use -mprefer-vector-width=128.
> > >> For GCC 11 and above, issue a configure error.
> > >
> > > I don't think anything causing the build to break with newer GCC versions
> > > is appropriate.
> > Fair enough -- hence a bit of a step away from the decision process in
> > my prior message on this subject.  I threw it out as a way to address
> > the issue Florian raised, but I strongly believe it's up to the glibc
> > team to decide how they want to go forward.
> >
>
> I'd like to check in my patch to close the 40% performance gap with GCC
> 8, 9 and 10.  We will monitor performance changes in the future GCC.
>
> Thanks.
>

I will check in my patch tomorrow.

Thanks.
  

Patch

From 53f43ccf241896d37b759ac416df0ef0ccd2da0e Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Fri, 17 May 2019 14:23:03 -0700
Subject: [PATCH] x86-64: Compile branred.c with -mprefer-vector-width=128

-O3 with AVX vectorizes some loops in sysdeps/ieee754/dbl-64/branred.c
with 256-bit vector instructions, which leads to store forward stall:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579

There is no easy fix in compiler.  This patch limits vector width to
128 bits to work around this issue.  It improves performance of sin
and cos by more than 40% on Skylake compiled with -O3 -march=skylake.

	* sysdeps/x86_64/fpu/Makefile (CFLAGS-branred.c): New.  Set
	to -mprefer-vector-width=128.
---
 sysdeps/x86_64/fpu/Makefile | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/sysdeps/x86_64/fpu/Makefile b/sysdeps/x86_64/fpu/Makefile
index 2b7d69bb50..b5f9589021 100644
--- a/sysdeps/x86_64/fpu/Makefile
+++ b/sysdeps/x86_64/fpu/Makefile
@@ -237,3 +237,7 @@  CFLAGS-test-float-libmvec-sincosf-avx512.c = -DREQUIRE_AVX512F
 CFLAGS-test-float-libmvec-sincosf-avx512-main.c = $(libmvec-sincos-cflags) $(float-vlen16-arch-ext-cflags)
 endif
 endif
+
+ifeq ($(subdir),math)
+CFLAGS-branred.c = -mprefer-vector-width=128
+endif
-- 
2.20.1