[0/3] sparc M7 optimized memcpy/memset

Message ID da5e9e31-4323-0415-4d35-d956d5f9546b@linaro.org
State Dropped
Headers

Commit Message

Adhemerval Zanella Oct. 26, 2017, 7:39 p.m. UTC
  On 19/10/2017 14:46, Patrick McGehearty wrote:
> On 10/5/2017 7:19 AM, Adhemerval Zanella wrote:
>>
>> On 04/10/2017 18:07, Patrick McGehearty wrote:
>>> On 9/27/2017 11:41 PM, David Miller wrote:
>>>> From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
>>>> Date: Wed, 27 Sep 2017 16:09:56 -0400
>>>>
>>>>> The following patch set provides optimized versions of
>>>>> memcpy/mempcpy/memmove/memset/bzero for Sparc versions M7 (and later).
>>>>> Support for recognizing M7 is also provided.
>>>>> An assembly version of memmove for ultra1+ is provided.
>>>>>
>>>>> Jose E. Marchesi (2):
>>>>>     sparc: support the ADP hw capability.
>>>>>     sparc: assembly version of memmove for ultra1+
>>>>>
>>>>> Patrick McGehearty (1):
>>>>>     sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
>>>> Looks good to me.
>>> Could someone install the 9/28/2017 sparc M7 optimized memcpy/memset
>>> patches to the repository?
>>>
>>> As a new submitter, I don't have permissions.
>> What about https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html ?
>> I am finishing up the ifunc C implementation and I plan to send it
>> to review today.
> In https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html,
> I see three issues which I will cover here.
> 
> Performance:
> 
> I reported:
>>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>>> Long word aligned data
>>>>    0-127 bytes - minimal changes
>>>>    128-1023 bytes - 7-30% gain
>>>>    1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>>> Word aligned data
>>>>    0-127 bytes - 50%+ gain
>>>>    128-1023 bytes - 10-200% gain
>>>>    1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>>> Unaligned data
>>>>    0-127 bytes - 0-70%+ gain
>>>>    128-447 bytes - 40-80%+ gain
>>>>    448-511 bytes - 1-3% loss
>>>>    512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>>>    4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>>
>>>> For memset/bzero, performance comparison with niagara4 code:
>>>> For memset nonzero data,
>>>>    256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>>>    1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>>> For memset zero data (and bzero),
>>>>    256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>>>    1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)
> 
> Adhemerval asked:
>>Which benchmark did you use to get theses values? If it where obtained
>>with benchtests one please attach the resulting files.  If not, please
>>either indicate how to reproduce the data or work to improve benchtests
>>with required datapoints.
> 
> A fair question. As noted, the performance comparisions are with the
> niagara4 code for memcpy and memset, the current default for m7/t7/s7
> sparc platforms.
> 
> I used a personal perf test suite that I developed over a period of
> years while tuning memcpy/memset for Sparc/Solaris for most Sparc
> platforms going back to UltraSparc IV. The tests are run in isolation
> and not quickly adaptable to the existing Linux bench tests
> structure. I also ran the Linux bench tests. The results from
> the Linux bench tests are generally similar to the numbers shown
> above, with some individual differences due to different ordering
> of test sizes which causes different branch prediction and cache
> warmth behavior. Because these tests give similar results, I see
> no particular need to add my tests to the Linux bench tests.
> I've continued to use my tests because of my long familiarity them.

I presume by 'Linux bench tests' you referring to glibc benchtests,
right? Usually with performance enhancement is usual to send the
benchmark data from improved symbols along with the patch submission.

> 
> As possible interest for those designing "out of cache" tests, the
> approach used in my personal tests is to allocate a "cache clearing
> buffer" at least 4 times the size of the largest cache on the system
> under test. In these tests, the L3 cache size is 64 MBytes and my
> buffer size was 256 MBytes. Before each set of timing runs,
> my program wrote new data to each long word address of the
> cache clearing buffer. Then, during the timing test for each
> combination of src and destination alignments and test lengths,
> successive copies were executed with each successive src and dest
> started at least 512 bytes after the end of the previous copy.
> The innermost loop did 20 copies. Each timing test was repeated
> 7 times with the median being used to minimize noise effects
> of other system activity.

If you could improve the mem* benchtests with this suggestion by
sending patches to either benchmark changes or new benchmarks
it will be welcome.

> 
>> Adhemerval asked that memcpy and memset be in separate patches
>> within the patch set.
> 
> If it is necessary to resubmit the patch set for other reasons,
> I will split memcpy and memset. Without the need to resubmit,
> it feels like 'make work' in my opinion. The optimizations to both
> files are logically similar and driven by the architectural
> changes in the Block Init Store (BIS) instruction.
> 
>> Adhemerval asked that the ifunc selector code use C instead of assembly.
>> He has prepared changes to replace all assembly uses of ifunc with C.
> 
> The existing ifunc code in that directory uses assembly.
> Adhemerval's changes were not available when I submitted the patch.
> As far as I understand, they are still not available in the main branch.
> Making those changes increases risk for my memcpy/memset optimizations.
> I consider such work, while worthy in its own right, to be out of
> scope for my patch set.
> 
> Respectively,
> - Patrick McGehearty

I just sent a patchset to refactor all remaining IFUNC resolver still
in assembly to C, including all the missing sparc ones.  

Also, I adjusted your patches to my refactor in personal branch [2] and
I also split the patch in memcpy/memmove and memset/bzero. It simplifies 
a lot new ifunc inclusions, for instance the memcpy part is just:

---
---

So if you could help with any review I will be thankful. I would expect
the memcpy/memmove and memset/bzero refactor to be straightforward.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7
  

Comments

Patrick McGehearty Nov. 2, 2017, 3:17 p.m. UTC | #1
On 10/26/2017 2:39 PM, Adhemerval Zanella wrote:
>
> On 19/10/2017 14:46, Patrick McGehearty wrote:
>
>>> Adhemerval asked that the ifunc selector code use C instead of assembly.
>>> He has prepared changes to replace all assembly uses of ifunc with C.
>> The existing ifunc code in that directory uses assembly.
>> Adhemerval's changes were not available when I submitted the patch.
>> As far as I understand, they are still not available in the main branch.
>> Making those changes increases risk for my memcpy/memset optimizations.
>> I consider such work, while worthy in its own right, to be out of
>> scope for my patch set.
>>
>> Respectively,
>> - Patrick McGehearty
> I just sent a patchset to refactor all remaining IFUNC resolver still
> in assembly to C, including all the missing sparc ones.
>
> Also, I adjusted your patches to my refactor in personal branch [2] and
> I also split the patch in memcpy/memmove and memset/bzero. It simplifies
> a lot new ifunc inclusions, for instance the memcpy part is just:
>
> ---
> diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
> index 46f3795..dbdad2d 100644 (file)
> --- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
> +++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
> @@ -19,6 +19,7 @@
>   
>   #include <ifunc-init.h>
>   
> +extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden;
>   extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden;
>   extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden;
>   extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden;
> @@ -28,6 +29,8 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden;
>   static inline void *
>   IFUNC_SELECTOR (int hwcap)
>   {
> +  if (hwcap & HWCAP_SPARC_ADP)
> +    return OPTIMIZE (niagara7);
>     if (hwcap & HWCAP_SPARC_CRYPTO)
>       return OPTIMIZE (niagara4);
>     if (hwcap & HWCAP_SPARC_N2)
> ---
>
> So if you could help with any review I will be thankful. I would expect
> the memcpy/memmove and memset/bzero refactor to be straightforward.
>
> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7
I've reviewed Adhemerval's changes including building
and running on my sparc s7 system. They look good to me.
Only issue I see is a naming difference in sysdeps/sparc/sparc64/multiarch
between:
ifunc-memmove.c
memcpy.c
mempcpy.c
memset.c

Perhaps ifunc-memmove.c should be named memmove.c for consistency
with memset.c, mempcpy.c and memcpy.c?  Or all named ifunc-mem*.c?

- Patrick McGehearty
  
Adhemerval Zanella Nov. 3, 2017, 11:51 p.m. UTC | #2
On 02/11/2017 13:17, Patrick McGehearty wrote:
> On 10/26/2017 2:39 PM, Adhemerval Zanella wrote:
>>
>> On 19/10/2017 14:46, Patrick McGehearty wrote:
>>
>>>> Adhemerval asked that the ifunc selector code use C instead of assembly.
>>>> He has prepared changes to replace all assembly uses of ifunc with C.
>>> The existing ifunc code in that directory uses assembly.
>>> Adhemerval's changes were not available when I submitted the patch.
>>> As far as I understand, they are still not available in the main branch.
>>> Making those changes increases risk for my memcpy/memset optimizations.
>>> I consider such work, while worthy in its own right, to be out of
>>> scope for my patch set.
>>>
>>> Respectively,
>>> - Patrick McGehearty
>> I just sent a patchset to refactor all remaining IFUNC resolver still
>> in assembly to C, including all the missing sparc ones.
>>
>> Also, I adjusted your patches to my refactor in personal branch [2] and
>> I also split the patch in memcpy/memmove and memset/bzero. It simplifies
>> a lot new ifunc inclusions, for instance the memcpy part is just:
>>
>> ---
>> diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
>> index 46f3795..dbdad2d 100644 (file)
>> --- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
>> +++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
>> @@ -19,6 +19,7 @@
>>     #include <ifunc-init.h>
>>   +extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden;
>>   extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden;
>>   extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden;
>>   extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden;
>> @@ -28,6 +29,8 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden;
>>   static inline void *
>>   IFUNC_SELECTOR (int hwcap)
>>   {
>> +  if (hwcap & HWCAP_SPARC_ADP)
>> +    return OPTIMIZE (niagara7);
>>     if (hwcap & HWCAP_SPARC_CRYPTO)
>>       return OPTIMIZE (niagara4);
>>     if (hwcap & HWCAP_SPARC_N2)
>> ---
>>
>> So if you could help with any review I will be thankful. I would expect
>> the memcpy/memmove and memset/bzero refactor to be straightforward.
>>
>> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7
> I've reviewed Adhemerval's changes including building
> and running on my sparc s7 system. They look good to me.
> Only issue I see is a naming difference in sysdeps/sparc/sparc64/multiarch
> between:
> ifunc-memmove.c
> memcpy.c
> mempcpy.c
> memset.c
> 
> Perhaps ifunc-memmove.c should be named memmove.c for consistency
> with memset.c, mempcpy.c and memcpy.c?  Or all named ifunc-mem*.c?

Indeed this naming is a mistake and I will correct it before send it upstream
(sysdeps/sparc/sparc64/multiarch/ifunc-memmove.c should be indded
sysdeps/sparc/sparc64/multiarch/memmove.c).
It also shown a missing #endif at the end the file, thanks.
  

Patch

diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
index 46f3795..dbdad2d 100644 (file)
--- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
+++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
@@ -19,6 +19,7 @@ 
 
 #include <ifunc-init.h>
 
+extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden;
@@ -28,6 +29,8 @@  extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden;
 static inline void *
 IFUNC_SELECTOR (int hwcap)
 {
+  if (hwcap & HWCAP_SPARC_ADP)
+    return OPTIMIZE (niagara7);
   if (hwcap & HWCAP_SPARC_CRYPTO)
     return OPTIMIZE (niagara4);
   if (hwcap & HWCAP_SPARC_N2)