New numbers in the benchtests.

Message ID 20171215042058.23705-1-woodard@redhat.com
State New, archived

Commit Message

Ben Woodard Dec. 15, 2017, 4:20 a.m. UTC
  A customer, who has asked not to be named, has had performance
problems with some inputs to certain libm functions. We felt it would
be a good idea to capture and preserve these test cases so that we can
track and measure the performance of these cases over time. These
values evidently came from analyzing performance problems in actual
running codes. The intent is that more test cases will be added over
time.

The fast and the slow cases are separated in the test results so that
we can not only track any improvements in the worst-case performance
but also detect any degradation in the performance of the optimal
cases should we change the algorithm or split the range differently.

Here are the results of that part of the modified tests:

  "exp": {
   "": {
    "duration": 3.10837e+10,
    "iterations": 5.654e+06,
    "max": 37073.6,
    "min": 61.904,
    "mean": 5497.65
   },
   "144bits": {
    "duration": 2.9752e+10,
    "iterations": 1.83e+06,
    "max": 36458.4,
    "min": 159.793,
    "mean": 16257.9
   },
   "768bits": {
    "duration": 2.98869e+10,
    "iterations": 105000,
    "max": 307814,
    "min": 277808,
    "mean": 284637
   },
   "redhat-slow-customer-cases": {
    "duration": 2.90641e+10,
    "iterations": 241000,
    "max": 173768,
    "min": 118393,
    "mean": 120598
   },
   "redhat-fast-customer-cases": {
    "duration": 2.89308e+10,
    "iterations": 1.94562e+08,
    "max": 285.431,
    "min": 143.328,
    "mean": 148.697
   }
  },

  "pow": {
   "": {
    "duration": 2.91114e+10,
    "iterations": 5.9899e+07,
    "max": 1340.38,
    "min": 149.229,
    "mean": 486.009
   },
   "240bits": {
    "duration": 3.76374e+10,
    "iterations": 400000,
    "max": 110951,
    "min": 76769.9,
    "mean": 94093.4
   },
   "768bits": {
    "duration": 1.54488e+11,
    "iterations": 101000,
    "max": 1.60541e+06,
    "min": 742756,
    "mean": 1.52959e+06
   },
   "redhat-slow-customer-cases": {
    "duration": 3.02998e+10,
    "iterations": 42000,
    "max": 790097,
    "min": 692942,
    "mean": 721425
   },
   "redhat-fast-customer-cases": {
    "duration": 2.89989e+10,
    "iterations": 4.8063e+07,
    "max": 1339.97,
    "min": 564.295,
    "mean": 603.352
   }
  },

Signed-off-by: Ben Woodard <woodard@redhat.com>
---
 benchtests/exp-inputs |  9 +++++++++
 benchtests/pow-inputs | 12 ++++++++++++
 2 files changed, 21 insertions(+)
  

Comments

Carlos O'Donell Dec. 15, 2017, 4:25 a.m. UTC | #1
On 12/14/2017 08:20 PM, Ben Woodard wrote:
> A customer, who has asked not to be named, has had performance
> problems with some inputs to certain libm functions. We felt it would
> be a good idea to capture and preserve these test cases so that we can
> track and measure the performance of these cases over time. These
> values evidently came from analyzing performance problems in actual
> running codes. The intent is that more test cases will be added over
> time.
> 
> The fast and the slow cases are separated in the test results so that
> we can not only track any improvements in the worst-case performance
> but also detect any degradation in the performance of the optimal
> cases should we change the algorithm or split the range differently.
This looks good to me. I've long suggested that hardware and software
vendors should contribute important numbers upstream so we can see
which classes of users are impacted, and I'm happy to see this kind of
thing go upstream. Some might argue these numbers should stay
downstream in RHEL, but I think that does a disservice. With my
volunteer hat on, I'd like to know how my changes impact all sorts of
users.

> Here are the results of that part of the modified tests:
> 
>   "exp": {
>    "": {
>     "duration": 3.10837e+10,
>     "iterations": 5.654e+06,
>     "max": 37073.6,
>     "min": 61.904,
>     "mean": 5497.65
>    },
>    "144bits": {
>     "duration": 2.9752e+10,
>     "iterations": 1.83e+06,
>     "max": 36458.4,
>     "min": 159.793,
>     "mean": 16257.9
>    },
>    "768bits": {
>     "duration": 2.98869e+10,
>     "iterations": 105000,
>     "max": 307814,
>     "min": 277808,
>     "mean": 284637
>    },
>    "redhat-slow-customer-cases": {
>     "duration": 2.90641e+10,
>     "iterations": 241000,
>     "max": 173768,
>     "min": 118393,
>     "mean": 120598
>    },
>    "redhat-fast-customer-cases": {
>     "duration": 2.89308e+10,
>     "iterations": 1.94562e+08,
>     "max": 285.431,
>     "min": 143.328,
>     "mean": 148.697
>    }
>   },
> 
>   "pow": {
>    "": {
>     "duration": 2.91114e+10,
>     "iterations": 5.9899e+07,
>     "max": 1340.38,
>     "min": 149.229,
>     "mean": 486.009
>    },
>    "240bits": {
>     "duration": 3.76374e+10,
>     "iterations": 400000,
>     "max": 110951,
>     "min": 76769.9,
>     "mean": 94093.4
>    },
>    "768bits": {
>     "duration": 1.54488e+11,
>     "iterations": 101000,
>     "max": 1.60541e+06,
>     "min": 742756,
>     "mean": 1.52959e+06
>    },
>    "redhat-slow-customer-cases": {
>     "duration": 3.02998e+10,
>     "iterations": 42000,
>     "max": 790097,
>     "min": 692942,
>     "mean": 721425
>    },
>    "redhat-fast-customer-cases": {
>     "duration": 2.89989e+10,
>     "iterations": 4.8063e+07,
>     "max": 1339.97,
>     "min": 564.295,
>     "mean": 603.352
>    }
>   },
> 
> Signed-off-by: Ben Woodard <woodard@redhat.com>

Modulo the nits below, you get my Reviewed-by.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

> ---
>  benchtests/exp-inputs |  9 +++++++++
>  benchtests/pow-inputs | 12 ++++++++++++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
> index aff3fb42f1..b94e19c4f5 100644
> --- a/benchtests/exp-inputs
> +++ b/benchtests/exp-inputs
> @@ -587,3 +587,12 @@
>  0x1.0000015853da7p0
>  0x1.0000098e5e007p0
>  0x1.0000099a1ac59p0
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com

Please use standard '<>' bracing to make this easier to parse in the future.

e.g.

Contact: Ben Woodard <woodard@redhat.com>

> +## name: redhat-slow-customer-cases
> +0x1.0p-53
> +## name: redhat-fast-customer-cases
> +0x1.0p-52
> +0x1.0p+0
> +0x1.999999999999Ap-4
> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
> index 78f8ac73d5..cea50edea7 100644
> --- a/benchtests/pow-inputs
> +++ b/benchtests/pow-inputs
> @@ -509,3 +509,15 @@
>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
> +
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com

Contact: Ben Woodard <woodard@redhat.com>

> +## name: redhat-slow-customer-cases
> +0x1.fffffffffffffp-1, 0x1.8p+0
> +0x1.ffffffffffffdp-1, 0x1.8p+0
> +0x1.ffffffffffff7p-1, 0x1.8p+0
> +## name: redhat-fast-customer-cases
> +0x1.ffffffffffffep-1, 0x1.8p+0
> +0x1.ffffffffffffcp-1, 0x1.8p+0
> +0x1.999999999999ap-4, 0x1.8p+0
>
  
Siddhesh Poyarekar Dec. 15, 2017, 7:43 a.m. UTC | #2
On Friday 15 December 2017 09:50 AM, Ben Woodard wrote:
> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
> index aff3fb42f1..b94e19c4f5 100644
> --- a/benchtests/exp-inputs
> +++ b/benchtests/exp-inputs
> @@ -587,3 +587,12 @@
>  0x1.0000015853da7p0
>  0x1.0000098e5e007p0
>  0x1.0000099a1ac59p0
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com
> +## name: redhat-slow-customer-cases
> +0x1.0p-53
> +## name: redhat-fast-customer-cases
> +0x1.0p-52
> +0x1.0p+0
> +0x1.999999999999Ap-4

Paul McGehearty has re-implemented exp which will likely change the
nature of these numbers.  What it means is that the slow path will be
gone for exp.  I suggest re-testing with the new implementation because
we will probably just end up dropping all of these partitions and you
can then just add these inputs as is.

> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
> index 78f8ac73d5..cea50edea7 100644
> --- a/benchtests/pow-inputs
> +++ b/benchtests/pow-inputs
> @@ -509,3 +509,15 @@
>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
> +
> +
> +# Contributed based on customer reports
> +# Ben Woodard woodard@redhat.com
> +## name: redhat-slow-customer-cases
> +0x1.fffffffffffffp-1, 0x1.8p+0
> +0x1.ffffffffffffdp-1, 0x1.8p+0
> +0x1.ffffffffffff7p-1, 0x1.8p+0
> +## name: redhat-fast-customer-cases
> +0x1.ffffffffffffep-1, 0x1.8p+0
> +0x1.ffffffffffffcp-1, 0x1.8p+0
> +0x1.999999999999ap-4, 0x1.8p+0
> 

Why do these inputs need to be in a different namespace?  They should go
into one of the existing namespaces, i.e. 768bits, 240bits, etc.  Their
performance relative to the existing namespaces should give you an
indication where they fit in.  If you want these inputs to look distinct
then you could add a comment indicating that these inputs came in from
RH; comments are just a single #.

Thanks,
Siddhesh
  
Joseph Myers Dec. 15, 2017, 4:04 p.m. UTC | #3
On Thu, 14 Dec 2017, Ben Woodard wrote:

> Here are the results of that part of the modified tests:

What do you get with the exp patch (version 8) that has been approved and 
now just needs to be committed?
  
Carlos O'Donell Dec. 15, 2017, 5:06 p.m. UTC | #4
On 12/14/2017 11:43 PM, Siddhesh Poyarekar wrote:
> On Friday 15 December 2017 09:50 AM, Ben Woodard wrote:
>> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
>> index aff3fb42f1..b94e19c4f5 100644
>> --- a/benchtests/exp-inputs
>> +++ b/benchtests/exp-inputs
>> @@ -587,3 +587,12 @@
>>  0x1.0000015853da7p0
>>  0x1.0000098e5e007p0
>>  0x1.0000099a1ac59p0
>> +
>> +# Contributed based on customer reports
>> +# Ben Woodard woodard@redhat.com
>> +## name: redhat-slow-customer-cases
>> +0x1.0p-53
>> +## name: redhat-fast-customer-cases
>> +0x1.0p-52
>> +0x1.0p+0
>> +0x1.999999999999Ap-4
> 
> Paul McGehearty has re-implemented exp which will likely change the
> nature of these numbers.  What it means is that the slow path will be
> gone for exp.  I suggest re-testing with the new implementation because
> we will probably just end up dropping all of these partitions and you
> can then just add these inputs as is.

That is good, and that will show up when it gets committed as a change
in the performance for these named sets.

>> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
>> index 78f8ac73d5..cea50edea7 100644
>> --- a/benchtests/pow-inputs
>> +++ b/benchtests/pow-inputs
>> @@ -509,3 +509,15 @@
>>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
>> +
>> +
>> +# Contributed based on customer reports
>> +# Ben Woodard woodard@redhat.com
>> +## name: redhat-slow-customer-cases
>> +0x1.fffffffffffffp-1, 0x1.8p+0
>> +0x1.ffffffffffffdp-1, 0x1.8p+0
>> +0x1.ffffffffffff7p-1, 0x1.8p+0
>> +## name: redhat-fast-customer-cases
>> +0x1.ffffffffffffep-1, 0x1.8p+0
>> +0x1.ffffffffffffcp-1, 0x1.8p+0
>> +0x1.999999999999ap-4, 0x1.8p+0
>>
> 
> Why do these inputs need to be in a different namespace?  They should go
> into one of the existing namespaces, i.e. 768bits, 240bits, etc.  Their
> performance relative to the existing namespaces should give you an
> indication where they fit in.  If you want these inputs to look distinct
> then you could add a comment indicating that these inputs came in from
> RH; comments are just a single #.

This is a question of outcome and expectations.

We have customers for whom certain inputs need to be performant relative
to others. If we mix those numbers into the existing namespace then it's
hard to see if they got better or worse on their own, since they are lost
in the mean.

We identify them as things that Red Hat as a downstream wants to keep
working well and we will put resources behind that.

Does that explain why we don't want to mix these with the other namespaces?
  
Ben Woodard Dec. 15, 2017, 6:19 p.m. UTC | #5
I think that this actually kind of gets at the nature of what these
benchtests currently show as opposed to what they could ultimately be
used for.

Right now, they demonstrate specific numbers that trigger the
implementation to go into different modes. With our current
implementation, this is the fast path, the slow path and the very slow
path. So these functions are there to show how well the current
implementation is running when it enters those modes. You can easily
see if a code change or a compiler change improves or impacts
performance. This is good for optimizing the implementation for
performance in the traditional way.

However, there are also certain inputs to functions which are known to
be problematic or more difficult to round correctly than others and
you must compute them out to N-digits to get correct rounding when
using a particular implementation. The quality of implementation is in
essence how often we hit those hard cases and how quickly we can deal
with them. These benchtests that I added are more along those lines.
For example, they include exp for the last subnormal and the first
normal floating point number. Together these represent a test case
which is known by the customer to identify quality of implementation
issues. They have indicated to us that this case comes up frequently
enough that they would like it to be tracked, and Carlos suggested
adding it to the benchtests.

When we change the implementation of exp, I still feel that keeping an
eye on cases like this, as well as on the hard-to-round cases, is
worthwhile. We want to make sure things like that don't blow up. That
was my intent with this.

-ben

On Thu, Dec 14, 2017 at 11:43 PM, Siddhesh Poyarekar
<siddhesh@gotplt.org> wrote:
> On Friday 15 December 2017 09:50 AM, Ben Woodard wrote:
>> diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
>> index aff3fb42f1..b94e19c4f5 100644
>> --- a/benchtests/exp-inputs
>> +++ b/benchtests/exp-inputs
>> @@ -587,3 +587,12 @@
>>  0x1.0000015853da7p0
>>  0x1.0000098e5e007p0
>>  0x1.0000099a1ac59p0
>> +
>> +# Contributed based on customer reports
>> +# Ben Woodard woodard@redhat.com
>> +## name: redhat-slow-customer-cases
>> +0x1.0p-53
>> +## name: redhat-fast-customer-cases
>> +0x1.0p-52
>> +0x1.0p+0
>> +0x1.999999999999Ap-4
>
> Paul McGehearty has re-implemented exp which will likely change the
> nature of these numbers.  What it means is that the slow path will be
> gone for exp.  I suggest re-testing with the new implementation because
> we will probably just end up dropping all of these partitions and you
> can then just add these inputs as is.
>
>> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
>> index 78f8ac73d5..cea50edea7 100644
>> --- a/benchtests/pow-inputs
>> +++ b/benchtests/pow-inputs
>> @@ -509,3 +509,15 @@
>>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
>> +
>> +
>> +# Contributed based on customer reports
>> +# Ben Woodard woodard@redhat.com
>> +## name: redhat-slow-customer-cases
>> +0x1.fffffffffffffp-1, 0x1.8p+0
>> +0x1.ffffffffffffdp-1, 0x1.8p+0
>> +0x1.ffffffffffff7p-1, 0x1.8p+0
>> +## name: redhat-fast-customer-cases
>> +0x1.ffffffffffffep-1, 0x1.8p+0
>> +0x1.ffffffffffffcp-1, 0x1.8p+0
>> +0x1.999999999999ap-4, 0x1.8p+0
>>
>
> Why do these inputs need to be in a different namespace?  They should go
> into one of the existing namespaces, i.e. 768bits, 240bits, etc.  Their
> performance relative to the existing namespaces should give you an
> indication where they fit in.  If you want these inputs to look distinct
> then you could add a comment indicating that these inputs came in from
> RH; comments are just a single #.
>
> Thanks,
> Siddhesh
  
Carlos O'Donell Dec. 15, 2017, 6:26 p.m. UTC | #6
On 12/15/2017 10:19 AM, Ben Woodard wrote:
> I think that this actually kind of gets at the nature of what these
> benchtests currently show as opposed to what they could ultimately be
> used for.
> 
> Right now, they demonstrate specific numbers that trigger the
> implementation to go into different modes. With our current
> implementation, this is the fast path, the slow path and the very slow
> path. So these functions are there to show how well the current
> implementation is running when it enters those modes. You can easily
> see if a code change or a compiler change improves or impacts
> performance. This is good for optimizing the implementation for
> performance in the traditional way.
> 
> However, there are also certain inputs to functions which are known to
> be problematic or more difficult to round correctly than others and
> you must compute them out to N-digits to get correct rounding when
> using a particular implementation. The quality of implementation is in
> essence how often we hit those hard cases and how quickly we can deal
> with them. These benchtests that I added are more along those lines.
> For example, they include exp for the last subnormal and the first
> normal floating point number. Together these represent a test case
> which is known by the customer to identify quality of implementation
> issues. They have indicated to us this case comes up frequently enough
> that they would like it to be tracked and Carlos suggested adding it to the
> benchtests.
> 
> When we change the implementation of exp, I still feel that keeping
> an eye on cases like this, as well as on the hard-to-round cases, is
> worthwhile. We want to make sure things like that don't blow up. That
> was my intent with this.

Agreed. That's a better explanation than mine :-)
  
Siddhesh Poyarekar Dec. 16, 2017, 5:25 a.m. UTC | #7
On Friday 15 December 2017 10:36 PM, Carlos O'Donell wrote:
> That is good, and that will show up when it gets committed as a change
> in the performance for these named sets.

Sure, no harm in adding the numbers, I just pointed it out since I
reckoned that Ben didn't notice that patch.

>>> diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
>>> index 78f8ac73d5..cea50edea7 100644
>>> --- a/benchtests/pow-inputs
>>> +++ b/benchtests/pow-inputs
>>> @@ -509,3 +509,15 @@
>>>  0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
>>>  0x1.fa5c677254961p133, -0x1.c91962524971ep-1
>>>  0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
>>> +
>>> +
>>> +# Contributed based on customer reports
>>> +# Ben Woodard woodard@redhat.com
>>> +## name: redhat-slow-customer-cases
>>> +0x1.fffffffffffffp-1, 0x1.8p+0
>>> +0x1.ffffffffffffdp-1, 0x1.8p+0
>>> +0x1.ffffffffffff7p-1, 0x1.8p+0
>>> +## name: redhat-fast-customer-cases
>>> +0x1.ffffffffffffep-1, 0x1.8p+0
>>> +0x1.ffffffffffffcp-1, 0x1.8p+0
>>> +0x1.999999999999ap-4, 0x1.8p+0
>>>
>>
>> Why do these inputs need to be in a different namespace?  They should go
>> into one of the existing namespaces, i.e. 768bits, 240bits, etc.  Their
>> performance relative to the existing namespaces should give you an
>> indication where they fit in.  If you want these inputs to look distinct
>> then you could add a comment indicating that these inputs came in from
>> RH; comments are just a single #.
> 
> This is a question of outcome and expectations.
> 
> We have customers for whom certain inputs need to be performant relative
> to others. If we mix those numbers into the existing namespace then it's
> hard to see if they got better or worse on their own, since they are lost
> in the mean.
> 
> We identify them as things that Red Hat as a downstream wants to keep
> working well and we will put resources behind that.
> 
> Does that explain why we don't want to mix these with the other namespaces?

Sure, but that's more a Red Hat specific use case than an upstream use
case.  From the upstream perspective we only care about the input and it
doesn't get treated any more specially than other inputs (my response to
Ben explains why).  Maybe one way to deal with this is to upstream this
with just a comment at the end of the upstream namespace it belongs to, and
then replace the comment with the ##name tag in a downstream patch for
Red Hat/Fedora.

Siddhesh
  
Siddhesh Poyarekar Dec. 16, 2017, 5:31 a.m. UTC | #8
On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
> On Thu, 14 Dec 2017, Ben Woodard wrote:
> 
>> Here are the results of that part of the modified tests:
> 
> What do you get with the exp patch (version 8) that has been approved and 
> now just needs to be committed?

I assume you'll be committing that as reviewer; I don't think Paul has
commit access.

Siddhesh
  
Siddhesh Poyarekar Dec. 16, 2017, 5:56 a.m. UTC | #9
On Friday 15 December 2017 11:49 PM, Ben Woodard wrote:
> Right now, they demonstrate specific numbers that trigger the
> implementation to go into different modes. With our current
> implementation, this is the fast path, the slow path and the very slow
> path. So these functions are there to show how well the current
> implementation is running when it enters those modes. You can easily
> see if a code change or a compiler change improves or impacts
> performance. This is good for optimizing the implementation for
> performance in the traditional way.
> 
> However, there are also certain inputs to functions which are known to
> be problematic or more difficult to round correctly than others and
> you must compute them out to N-digits to get correct rounding when
> using a particular implementation. The quality of implementation is in
> essence how often we hit those hard cases and how quickly we can deal
> with them. These benchtests that I added are more along those lines.
> For example, they include exp for the last subnormal and the first
> normal floating point number. Together these represent a test case
> which is known by the customer to identify quality of implementation
> issues. They have indicated to us this case comes up frequently enough
> that they would like it to be tracked and Carlos suggested adding it to the
> benchtests.

You're right in that specific inputs (or rather results) are more
important than others due to the number of bits required to ensure a
correct rounding regardless of the algorithm used.  However, they assume
importance from a practical standpoint only if they result in vastly
different behaviour in the algorithm.  Slow path vs fast path for
example makes sense if the algorithm actually has a slow and fast path.
In that context, the last subnormal and first normal number are
important, but not more important than other inputs that exercise the
exact same code path.

This is why I suggest merging these inputs into the code path that they
exercise since that sort of partitioning scheme makes more sense for
upstream code than 'redhat-fast-customer-cases' and
'redhat-slow-customer-cases'.  If you want to see those input results
differently then a downstream patch that converts the comment into a tag
"##name: " is a more appropriate solution.

Note that this is different from the concept of workloads that we add to
benchmark inputs where the sequence of calls is also important and hence
they're executed slightly differently.  See for example some CPU2006
inputs that we added to simulate that workload.  If your inputs emulate
a specific and interesting workload then that's interesting as a
separate partition even if it is vendor-specific but it doesn't seem
like that.

Siddhesh
  
Siddhesh Poyarekar Dec. 16, 2017, 6:06 a.m. UTC | #10
On Saturday 16 December 2017 10:55 AM, Siddhesh Poyarekar wrote:
> Sure, but that's more a Red Hat specific use case than an upstream use
> case.  From the upstream perspective we only care about the input and it
> doesn't get treated any more specially than other inputs (my response to
> Ben explains why).  Maybe one way to deal with this is to upstream this
> with just a comment at the end of the upstream namespace it belongs and
> then replace the comment with the ##name tag in a downstream patch for
> Red Hat/Fedora.

To be clear, I am all for adding the inputs, I am not convinced about
the separate vendor-specific partition for it.  Vendor-specific
partitions only make sense if they are for a specific workload, such as
for spec2006.wrf that Wilco added a while back for one of the math
functions.

Siddhesh
  
Patrick McGehearty Dec. 16, 2017, 8:10 p.m. UTC | #11
On 12/15/2017 11:31 PM, Siddhesh Poyarekar wrote:
> On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
>> On Thu, 14 Dec 2017, Ben Woodard wrote:
>>
>>> Here are the results of that part of the modified tests:
>> What do you get with the exp patch (version 8) that has been approved and
>> now just needs to be committed?
> I assume you'll be committing that as reviewer; I don't think Paul has
> commit access.
>
> Siddhesh
For clarity, the exp() submitter is Patrick McGehearty [not Paul 
McGehearty :-) ]
and I do not (yet) have commit access.

- patrick
  
Siddhesh Poyarekar Dec. 17, 2017, 3:54 a.m. UTC | #12
On Sunday 17 December 2017 01:40 AM, Patrick McGehearty wrote:
> For clarity, the exp() submitter is Patrick McGehearty [not Paul
> McGehearty :-) ]
> and I do not (yet) have commit access.

Ugh, I'm so sorry.  I guess I kept thinking Paul because it sounded so
much like Paul McCartney to me :)

Siddhesh
  
Joseph Myers Dec. 18, 2017, 5:13 p.m. UTC | #13
On Sat, 16 Dec 2017, Siddhesh Poyarekar wrote:

> You're right in that specific inputs (or rather results) are more
> important than others due to the number of bits required to ensure a
> correct rounding regardless of the algorithm used.  However, they assume

And such a correct rounding is not part of the accuracy goals for 
functions such as exp (only for a few functions such as sqrt and fma that 
are directly bound to IEEE 754 operations).

This does not rule out support for TS 18661-4 reserved function names such 
as crexp, though that was not part of my TS 18661-4 proposal.  If based on 
exhaustive searches for worst cases for correct rounding, such 
implementations would not actually need anywhere near the number of bits 
used by the old exp implementation in glibc.
  
Joseph Myers Dec. 18, 2017, 5:19 p.m. UTC | #14
On Sat, 16 Dec 2017, Patrick McGehearty wrote:

> On 12/15/2017 11:31 PM, Siddhesh Poyarekar wrote:
> > On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
> > > On Thu, 14 Dec 2017, Ben Woodard wrote:
> > > 
> > > > Here are the results of that part of the modified tests:
> > > What do you get with the exp patch (version 8) that has been approved and
> > > now just needs to be committed?
> > I assume you'll be committing that as reviewer; I don't think Paul has
> > commit access.
> > 
> > Siddhesh
> For clarity, the exp() submitter is Patrick McGehearty [not Paul McGehearty
> :-) ]
> and I do not (yet) have commit access.

If you'd like the patch committed for you, please provide a GNU ChangeLog 
entry for it.
  
Patrick McGehearty Dec. 19, 2017, 4:53 p.m. UTC | #15
On 12/18/2017 11:19 AM, Joseph Myers wrote:
> On Sat, 16 Dec 2017, Patrick McGehearty wrote:
>
>> On 12/15/2017 11:31 PM, Siddhesh Poyarekar wrote:
>>> On Friday 15 December 2017 09:34 PM, Joseph Myers wrote:
>>>> On Thu, 14 Dec 2017, Ben Woodard wrote:
>>>>
>>>>> Here are the results of that part of the modified tests:
>>>> What do you get with the exp patch (version 8) that has been approved and
>>>> now just needs to be committed?
>>> I assume you'll be committing that as reviewer; I don't think Paul has
>>> commit access.
>>>
>>> Siddhesh
>> For clarity, the exp() submitter is Patrick McGehearty [not Paul McGehearty
>> :-) ]
>> and I do not (yet) have commit access.
> If you'd like the patch committed for you, please provide a GNU ChangeLog
> entry for it.
>
Thank you for the offer.
Proposed ChangeLog entry:

2017-12-19  Patrick McGehearty <patrick.mcgehearty@oracle.com>

         * sysdeps/ieee754/dbl-64/e_exp.c: 5x faster __ieee754_exp() routine.
         * sysdeps/ieee754/dbl-64/eexp.tbl: New file for e_exp.
         * sysdeps/ieee754/dbl-64/slowexp.c: Removed.
         * sysdeps/x86_64/fpu/multiarch/slowexp-avx.c: Removed.
         * sysdeps/x86_64/fpu/multiarch/slowexp-fma.c: Removed.
         * sysdeps/x86_64/fpu/multiarch/slowexp-fma4.c: Removed.
         * sysdeps/generic/math_private.h: Remove slowexp.
         * sysdeps/ieee754/dbl-64/e_pow.c: Likewise.
         * sysdeps/powerpc/power4/fpu/Makefile: Likewise.
         * sysdeps/x86_64/fpu/multiarch/Makefile: Likewise.
         * sysdeps/x86_64/fpu/multiarch/e_exp-avx.c: Likewise.
         * sysdeps/x86_64/fpu/multiarch/e_exp-fma.c: Likewise.
         * sysdeps/x86_64/fpu/multiarch/e_exp-fma4.c: Likewise.
         * math/Makefile: Likewise.
         * manual/probes.texi: Remove slowexp documentation.
  
Joseph Myers Dec. 19, 2017, 5:29 p.m. UTC | #16
I have now committed the exp patch.  Note for future patches the changes 
made to the ChangeLog entry.  I added removals of 
sysdeps/i386/fpu/slowexp.c, sysdeps/ia64/fpu/slowexp.c and 
sysdeps/m68k/m680x0/fpu/slowexp.c, three files overriding the build of 
slowexp.c for architectures with their own exp implementations, which are 
no longer needed now nothing builds slowexp.c.
  
Adhemerval Zanella Netto Dec. 19, 2017, 6:05 p.m. UTC | #17
On 19/12/2017 15:29, Joseph Myers wrote:
> I have now committed the exp patch.  Note for future patches the changes 
> made to the ChangeLog entry.  I added removals of 
> sysdeps/i386/fpu/slowexp.c, sysdeps/ia64/fpu/slowexp.c and 
> sysdeps/m68k/m680x0/fpu/slowexp.c, three files overriding the build of 
> slowexp.c for architectures with their own exp implementations, which are 
> no longer needed now nothing builds slowexp.c.
> 

I am seeing now on master with GCC 7.1.1:

$ cat math/test-double-tgamma.out 
testing double (without inline functions)
Failure: tgamma_upward (-0xb.4ffffp+4): errno set to 0, expected 34 (ERANGE)
Failure: tgamma_upward (-0xb.60001p+4): errno set to 0, expected 34 (ERANGE)
Failure: tgamma_upward (-0xb.6ffffp+4): errno set to 0, expected 34 (ERANGE)

Test suite completed:
  1412 test cases plus 1408 tests for exception flags and
    1408 tests for errno executed.
  3 errors occurred.

Is it a known issue?
  
Joseph Myers Dec. 19, 2017, 6:13 p.m. UTC | #18
On Tue, 19 Dec 2017, Adhemerval Zanella wrote:

> I am seeing now on master with GCC 7.1.1:
> 
> $ cat math/test-double-tgamma.out 
> testing double (without inline functions)
> Failure: tgamma_upward (-0xb.4ffffp+4): errno set to 0, expected 34 (ERANGE)
> Failure: tgamma_upward (-0xb.60001p+4): errno set to 0, expected 34 (ERANGE)
> Failure: tgamma_upward (-0xb.6ffffp+4): errno set to 0, expected 34 (ERANGE)
> 
> Test suite completed:
>   1412 test cases plus 1408 tests for exception flags and
>     1408 tests for errno executed.
>   3 errors occurred.
> 
> Is it a known issue?

Thanks for pointing this out.  I've reverted the exp patch and ulps 
update.  As and when whatever caused those failures is fixed it can go 
back in.
  
Joseph Myers Dec. 19, 2017, 6:54 p.m. UTC | #19
Here is a possible cause for the failures, not verified as such:

The exp implementation was using get_rounding_mode.  But tgamma uses 
SET_RESTORE_ROUND and calls exp within the scope of SET_RESTORE_ROUND.  
SET_RESTORE_ROUND, for x86_64, only sets the SSE rounding mode, but 
get_rounding_mode gets the x87 rounding mode.  The effect would have been 
that the code in exp found that the x87 rounding mode was not to-nearest, 
then redundantly set the SSE rounding mode to to-nearest, then ended up 
trying to restore the rounding mode and actually setting the SSE rounding 
mode to the not-to-nearest value of the x87 rounding modes.

If this hypothesis is correct, I advise resubmitting the patch in a form 
that just uses SET_RESTORE_ROUND (FE_TONEAREST) like other libm functions 
- the performance improvement was sufficient that presumably even this 
form would still perform better than the existing code.  Then, once that 
is properly validated and checked in, it would be possible to restore the 
optimization that the use of get_rounding_mode and separate code paths 
were intended to achieve.  To do that, you'd need libc_fegetround{,f,l} 
which by default just use get_rounding_mode, but on x86 __SSE2_MATH__ do 
something different for the float and double versions to get the SSE 
rounding mode instead; exp would use libc_fegetround instead of using 
get_rounding_mode directly.
  
Patrick McGehearty Dec. 20, 2017, 4:04 p.m. UTC | #20
On 12/19/2017 12:54 PM, Joseph Myers wrote:
> Here is a possible cause for the failures, not verified as such:
>
> The exp implementation was using get_rounding_mode.  But tgamma uses
> SET_RESTORE_ROUND and calls exp within the scope of SET_RESTORE_ROUND.
> SET_RESTORE_ROUND, for x86_64, only sets the SSE rounding mode, but
> get_rounding_mode gets the x87 rounding mode.  The effect would have been
> that the code in exp found that the x87 rounding mode was not to-nearest,
> then redundantly set the SSE rounding mode to to-nearest, then ended up
> trying to restore the rounding mode and actually setting the SSE rounding
> mode to the not-to-nearest value of the x87 rounding modes.
>
> If this hypothesis is correct, I advise resubmitting the patch in a form
> that just uses SET_RESTORE_ROUND (FE_TONEAREST) like other libm functions
> - the performance improvement was sufficient that presumably even this
> form would still perform better than the existing code.  Then, once that
> is properly validated and checked in, it would be possible to restore the
> optimization that the use of get_rounding_mode and separate code paths
> were intended to achieve.  To do that, you'd need libc_fegetround{,f,l}
> which by default just use get_rounding_mode, but on x86 __SSE2_MATH__ do
> something different for the float and double versions to get the SSE
> rounding mode instead; exp would use libc_fegetround instead of using
> get_rounding_mode directly.
>

I did note in the comments (fairly far down) to the patch that
"For x86, tgamma showed a few values where the ulp increased to
6 (max ulp for tgamma is 5). Sparc tgamma did not show these failures."

As noted tgamma as currently computed is seriously non-linear in its
behavior with 1-bit differences in exp() likely to result in many-bit
differences in computed values. Nelson Beebe's recent (Aug 2017) book
"The Mathematical Function Computation Handbook" (over 1100 pgs) gives
an extended discussion of how to reduce the ulp diffs. But that's
beyond the scope of the current project.

I'll investigate Joseph's hypothesis.
If the hypothesis is correct, I find it odd for get_rounding_mode
to not have matching behavior with libc_fesetround for x86.

- patrick
  
Joseph Myers Dec. 20, 2017, 5:56 p.m. UTC | #21
On Wed, 20 Dec 2017, Patrick McGehearty wrote:

> I did note in the comments (fairly far down) to the patch that
> "For x86, tgamma showed a few values where the ulp increased to
> 6 (max ulp for tgamma is 5). Sparc tgamma did not show these failures."

The problem isn't ulps changes, the problem is that there were test 
failures (missing errno setting in this case) *other than* ones for which 
a libm-test-ulps regeneration sufficed, and such failures always need more 
investigation.

> If the hypothesis is correct, I find it odd for get_rounding_mode
> to not have matching behavior with libc_fesetround for x86.

Well, get_rounding_mode has previously only been used (on x86_64 / x86) in 
interfaces such as strtod / printf (for which it was originally added), 
for which results when the SSE and x87 rounding modes are different don't 
matter.  Whereas libm functions can be called internally when those modes 
are different, as part of optimizations that rely on knowing which 
rounding mode is relevant for computations in a particular type and 
avoiding changing the other part of the floating-point state.  Thus, you 
need type-specific interfaces in libm (libc_fegetround{,f,l}) whereas you 
don't in strtod / printf.  (Of course this doesn't make any difference for 
architectures other than x86_64 / x86, because that's the only case where 
you have two different sets of floating-point state at all.)
  
Patrick McGehearty Dec. 20, 2017, 11:12 p.m. UTC | #22
On 12/20/2017 11:56 AM, Joseph Myers wrote:
> On Wed, 20 Dec 2017, Patrick McGehearty wrote:
>
>> I did note in the comments (fairly far down) to the patch that
>> "For x86, tgamma showed a few values where the ulp increased to
>> 6 (max ulp for tgamma is 5). Sparc tgamma did not show these failures."
> The problem isn't ulps changes, the problem is that there were test
> failures (missing errno setting in this case) *other than* ones for which
> a libm-test-ulps regeneration sufficed, and such failures always need more
> investigation.
>
>> If the hypothesis is correct, I find it odd for get_rounding_mode
>> to not have matching behavior with libc_fesetround for x86.
> Well, get_rounding_mode has previously only been used (on x86_64 / x86) in
> interfaces such as strtod / printf (for which it was originally added),
> for which results when the SSE and x87 rounding modes are different don't
> matter.  Whereas libm functions can be called internally when those modes
> are different, as part of optimizations that rely on knowing which
> rounding mode is relevant for computations in a particular type and
> avoiding changing the other part of the floating-point state.  Thus, you
> need type-specific interfaces in libm (libc_fegetround{,f,l}) whereas you
> don't in strtod / printf.  (Of course this doesn't make any difference for
> architectures other than x86_64 / x86, because that's the only case where
> you have two different sets of floating-point state at all.)
>
Summary of results:
Tested replacing the get_rounding_mode / libc_fesetround-only-when-not-in-
FE_TONEAREST approach with SET_RESTORE_ROUND (FE_TONEAREST).

The tgamma failures disappeared, giving strong support to Joseph's
hypothesis.
For x86, there is roughly a 5% performance cost, which would be tolerable.
Unfortunately, for Sparc, there is a 76% performance cost, which is
less tolerable. When Sparc changes the rounding mode, the instruction
pipeline is flushed, which has a non-trivial cost. I suspect the
performance cost will be higher on all platforms which have a high cost
for changing the rounding mode.

Alternatives:
1) Accept the current behavior. - least work.

2) Just use SET_RESTORE_ROUND for now. - almost no additional work,
but slower on some platforms.

3) Define a macro either within e_exp.c or in an include file that selects
get_rounding_mode and libc_fesetround for all platforms except x86.
It selects SET_RESTORE_ROUND for x86.
Putting platform specific macros inside ieee754 branch seems
undesirable, but I thought I should mention it as a possibility.

4) Put the SET_RESTORE_ROUND version of e_exp.c as an x86 specific
version in the sysdeps tree and the get_rounding_mode version in
the ieee754 part of the tree.
Some more work when I'm ready to go on winter break. :-)

Or, do (2) now and maybe (4) later.


Also, as a separate experiment, I tested doubling the size of TBL[] again
to use step sizes of 1/128 instead of 1/64.
It reduces the error rate from 16.1/10000 to 9.7/10000.
It does not remove the 1-bit errors in the current "make check" tests.
To fix those will likely require increasing the size of TBL2[].
For my next patch I'm going to go with the larger TBL[] size unless
someone objects. Might as well reduce even small errors when we can.

- patrick

ps Apologies for continuing to accidentally send incomplete messages
while composing. I apparently have an unconscious trained reflex to
type ^S whenever I pause for thought. That is beneficial for editors
that interpret it as "save file", not so much for mail programs
that interpret ^S as "send". I will try to remember to compose
outside of mail in the future and then copy/paste into the mailer.
  
Joseph Myers Dec. 20, 2017, 11:42 p.m. UTC | #23
On Wed, 20 Dec 2017, Patrick McGehearty wrote:

> The tgamma failures disappeared, giving strong support to Joseph's hypothesis.
> For x86, there is roughly a 5% performance cost which would be tolerable.
> Unfortunately, for Sparc, there is a 76% performance cost which is
> less tolerable. When Sparc changes the rounding mode, the instruction pipeline

But is presumably still better than the existing code (as you mentioned a 
5x improvement), so is a reasonable incremental step.

> 3) Define a macro either within e_exp.c or in an include file that selects
> get_rounding_mode and libc_fesetround for all platforms except x86.
> It selects SET_RESTORE_ROUND for x86.
> Putting platform specific macros inside ieee754 branch seems
> undesirable, but I thought I should mention it as a possibility.

The correct thing to do is as I said: add libc_fegetround, 
libc_fegetroundf and libc_fegetroundl to the large set of math_private.h / 
fenv_private.h libc_fe* macros.  All of these would default to using 
get_rounding_mode, but sysdeps/i386/fpu/fenv_private.h would, in the 
__SSE_MATH__ case, use the SSE rounding mode for libc_fegetroundf, and in 
the __SSE2_MATH__ case use it also for libc_fegetround.  Then you could 
use libc_fegetround where you previously used get_rounding_mode.

However, using SET_RESTORE_ROUND as an incremental step still makes sense 
before adding libc_fegetround* as an improvement on top of that.
  
Patrick McGehearty Dec. 29, 2017, 11:46 p.m. UTC | #24
On 12/20/2017 5:42 PM, Joseph Myers wrote:
> On Wed, 20 Dec 2017, Patrick McGehearty wrote:
>
>> The tgamma failures disappeared, giving strong support to Joseph's hypothesis.
>> For x86, there is roughly a 5% performance cost which would be tolerable.
>> Unfortunately, for Sparc, there is a 76% performance cost which is
>> less tolerable. When Sparc changes the rounding mode, the instruction pipeline
> But is presumably still better than the existing code (as you mentioned a
> 5x improvement), so is a reasonable incremental step.
>
>> 3) Define a macro either within e_exp.c or in an include file that selects
>> get_rounding_mode and libc_fesetround for all platforms except x86.
>> It selects SET_RESTORE_ROUND for x86.
>> Putting platform specific macros inside ieee754 branch seems
>> undesirable, but I thought I should mention it as a possibility.
> The correct thing to do is as I said: add libc_fegetround,
> libc_fegetroundf and libc_fegetroundl to the large set of math_private.h /
> fenv_private.h libc_fe* macros.  All of these would default to using
> get_rounding_mode, but sysdeps/i386/fpu/fenv_private.h would, in the
> __SSE_MATH__ case, use the SSE rounding mode for libc_fegetroundf, and in
> the __SSE2_MATH__ case use it also for libc_fegetround.  Then you could
> use libc_fegetround where you previously used get_rounding_mode.
>
> However, using SET_RESTORE_ROUND as an incremental step still makes sense
> before adding libc_fegetround* as an improvement on top of that.
>
While investigating SET_RESTORE_ROUND, I found two cases which were
missing the rounding mode correction. Fixing those removed the cexp()
test case failures. I also increased the table size to 256 from 64
to reduce the 1 ulp failures to less than 1 in 1000 by informal testing.
Those remaining failures are likely to be in the range of 0.51 ulp.
Investigating those issues delayed my revised submission until now.

I agree it would be desirable to add libc_fegetround* where appropriate,
but I also agree it is better to get the revised patch in the code
base first, then see about the rounding changes.

- Patrick McGehearty
  

Patch

diff --git a/benchtests/exp-inputs b/benchtests/exp-inputs
index aff3fb42f1..b94e19c4f5 100644
--- a/benchtests/exp-inputs
+++ b/benchtests/exp-inputs
@@ -587,3 +587,12 @@ 
 0x1.0000015853da7p0
 0x1.0000098e5e007p0
 0x1.0000099a1ac59p0
+
+# Contributed based on customer reports
+# Ben Woodard woodard@redhat.com
+## name: redhat-slow-customer-cases
+0x1.0p-53
+## name: redhat-fast-customer-cases
+0x1.0p-52
+0x1.0p+0
+0x1.999999999999Ap-4
diff --git a/benchtests/pow-inputs b/benchtests/pow-inputs
index 78f8ac73d5..cea50edea7 100644
--- a/benchtests/pow-inputs
+++ b/benchtests/pow-inputs
@@ -509,3 +509,15 @@ 
 0x1.f8b79758182dap-884, 0x1.ed6174093fca4p-6
 0x1.fa5c677254961p133, -0x1.c91962524971ep-1
 0x1.ff0544adacb78p649, -0x1.6c17c3a7210e2p-1
+
+
+# Contributed based on customer reports
+# Ben Woodard woodard@redhat.com
+## name: redhat-slow-customer-cases
+0x1.fffffffffffffp-1, 0x1.8p+0
+0x1.ffffffffffffdp-1, 0x1.8p+0
+0x1.ffffffffffff7p-1, 0x1.8p+0
+## name: redhat-fast-customer-cases
+0x1.ffffffffffffep-1, 0x1.8p+0
+0x1.ffffffffffffcp-1, 0x1.8p+0
+0x1.999999999999ap-4, 0x1.8p+0