From patchwork Thu Oct 26 19:39:19 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Adhemerval Zanella Netto
 <adhemerval.zanella@linaro.org>
X-Patchwork-Id: 23893
Received: (qmail 48060 invoked by alias); 26 Oct 2017 19:39:30 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 46858 invoked by uid 89); 26 Oct 2017 19:39:29 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-26.4 required=5.0 tests=BAYES_00, GIT_PATCH_0,
	GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE,
	RCVD_IN_SORBS_SPAM,
	SPF_PASS autolearn=ham version=3.3.2 spammy=1035, developed,
	thankful, 020
X-HELO: mail-qk0-f196.google.com
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:subject:to:references:from:message-id:date
	:user-agent:mime-version:in-reply-to:content-language
	:content-transfer-encoding;
	bh=r7M6uutptInd/d+4ORhlLy7COCrgM75l5UXpmsSVGkM=;
	b=g1YaL2of6IWCa/roJhRypTBrd+6LsDn3qjgeG8tCM9YKPyFo4mvnKnU1X9Vc85eQ05
	qiI/Cixi+OvGhSKAVCCAdNBaDR+JAY/zn56OR5/zYVyLDBMue66DJllhJ6QfnjfT8/HX
	VRvG2woYMb9U6gNK7jOIPtecVTsZoLm6o57os5mQ+AgEF4ziXeCJkZf/Ue2bGsav4s4U
	lrLFBF/rLDXQWdiP9MwVEUKfgBpHqf2oDH0Y7tqkVZ1LDFk6cfWt86BRToCTiBD2bmn/
	VeGE/8Pri9jeYL09msyZ/bQaKnyq3phBREr4ge1XrZKSjR2VuWUDn8kQa7UJW9fHwmmB
	m4Jw==
X-Gm-Message-State: AMCzsaWHXIJy3LXXqexP1fABuIGPu4XW/Dw2au/vPdnPAZihzWIYyI81
	b6gJ9o1GAOK3gqWBEOa4a3nTav5hD6A=
X-Google-Smtp-Source: 
 ABhQp+QiAW+4JNvRG62RTpL9IRdiizrhwRA8I9i6nYj/0NPqwEaM5ZCPh5urR4k/ry+71+D/sXjdFg==
X-Received: by 10.55.156.210 with SMTP id f201mr9169300qke.283.1509046764426;
	Thu, 26 Oct 2017 12:39:24 -0700 (PDT)
Subject: Re: [PATCH 0/3] sparc M7 optimized memcpy/memset
To: libc-alpha@sourceware.org
References: <1506542999-97895-1-git-send-email-patrick.mcgehearty@oracle.com>
	<20170927.214103.2236958753795210322.davem@davemloft.net>
	<1f543a82-5816-11f1-d615-a5b4a07c8392@oracle.com>
	<5de85da5-b994-1619-a682-7d1fcd2179f6@linaro.org>
	<c3b177a8-f20e-b5c7-16dd-9b4440b31bd2@oracle.com>
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Message-ID: <da5e9e31-4323-0415-4d35-d956d5f9546b@linaro.org>
Date: Thu, 26 Oct 2017 17:39:19 -0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.4.0
MIME-Version: 1.0
In-Reply-To: <c3b177a8-f20e-b5c7-16dd-9b4440b31bd2@oracle.com>

On 19/10/2017 14:46, Patrick McGehearty wrote:
> On 10/5/2017 7:19 AM, Adhemerval Zanella wrote:
>>
>> On 04/10/2017 18:07, Patrick McGehearty wrote:
>>> On 9/27/2017 11:41 PM, David Miller wrote:
>>>> From: Patrick McGehearty <patrick.mcgehearty@oracle.com>
>>>> Date: Wed, 27 Sep 2017 16:09:56 -0400
>>>>
>>>>> The following patch set provides optimized versions of
>>>>> memcpy/mempcpy/memmove/memset/bzero for Sparc versions M7 (and later).
>>>>> Support for recognizing M7 is also provided.
>>>>> An assembly version of memmove for ultra1+ is provided.
>>>>>
>>>>> Jose E. Marchesi (2):
>>>>>     sparc: support the ADP hw capability.
>>>>>     sparc: assembly version of memmove for ultra1+
>>>>>
>>>>> Patrick McGehearty (1):
>>>>>     sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero.
>>>> Looks good to me.
>>> Could someone install the 9/28/2017 sparc M7 optimized memcpy/memset
>>> patches to the repository?
>>>
>>> As a new submitter, I don't have permissions.
>> What about https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html ?
>> I am finishing up the ifunc C implementation and I plan to send it
>> to review today.
> In https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html,
> I see three issues which I will cover here.
> 
> Performance:
> 
> I reported:
>>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code:
>>>> Long word aligned data
>>>>    0-127 bytes - minimal changes
>>>>    128-1023 bytes - 7-30% gain
>>>>    1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache)
>>>> Word aligned data
>>>>    0-127 bytes - 50%+ gain
>>>>    128-1023 bytes - 10-200% gain
>>>>    1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache)
>>>> Unaligned data
>>>>    0-127 bytes - 0-70%+ gain
>>>>    128-447 bytes - 40-80%+ gain
>>>>    448-511 bytes - 1-3% loss
>>>>    512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache)
>>>>    4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache)
>>>>
>>>> For memset/bzero, performance comparison with niagara4 code:
>>>> For memset nonzero data,
>>>>    256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache)
>>>>    1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache)
>>>> For memset zero data (and bzero),
>>>>    256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache)
>>>>    1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache)
> 
> Adhemerval asked:
>>Which benchmark did you use to get theses values? If it where obtained
>>with benchtests one please attach the resulting files.  If not, please
>>either indicate how to reproduce the data or work to improve benchtests
>>with required datapoints.
> 
> A fair question. As noted, the performance comparisions are with the
> niagara4 code for memcpy and memset, the current default for m7/t7/s7
> sparc platforms.
> 
> I used a personal perf test suite that I developed over a period of
> years while tuning memcpy/memset for Sparc/Solaris for most Sparc
> platforms going back to UltraSparc IV. The tests are run in isolation
> and not quickly adaptable to the existing Linux bench tests
> structure. I also ran the Linux bench tests. The results from
> the Linux bench tests are generally similar to the numbers shown
> above, with some individual differences due to different ordering
> of test sizes which causes different branch prediction and cache
> warmth behavior. Because these tests give similar results, I see
> no particular need to add my tests to the Linux bench tests.
> I've continued to use my tests because of my long familiarity them.

I presume by 'Linux bench tests' you referring to glibc benchtests,
right? Usually with performance enhancement is usual to send the
benchmark data from improved symbols along with the patch submission.

> 
> As possible interest for those designing "out of cache" tests, the
> approach used in my personal tests is to allocate a "cache clearing
> buffer" at least 4 times the size of the largest cache on the system
> under test. In these tests, the L3 cache size is 64 MBytes and my
> buffer size was 256 MBytes. Before each set of timing runs,
> my program wrote new data to each long word address of the
> cache clearing buffer. Then, during the timing test for each
> combination of src and destination alignments and test lengths,
> successive copies were executed with each successive src and dest
> started at least 512 bytes after the end of the previous copy.
> The innermost loop did 20 copies. Each timing test was repeated
> 7 times with the median being used to minimize noise effects
> of other system activity.

If you could improve the mem* benchtests with this suggestion by
sending patches to either benchmark changes or new benchmarks
it will be welcome.

> 
>> Adhemerval asked that memcpy and memset be in separate patches
>> within the patch set.
> 
> If it is necessary to resubmit the patch set for other reasons,
> I will split memcpy and memset. Without the need to resubmit,
> it feels like 'make work' in my opinion. The optimizations to both
> files are logically similar and driven by the architectural
> changes in the Block Init Store (BIS) instruction.
> 
>> Adhemerval asked that the ifunc selector code use C instead of assembly.
>> He has prepared changes to replace all assembly uses of ifunc with C.
> 
> The existing ifunc code in that directory uses assembly.
> Adhemerval's changes were not available when I submitted the patch.
> As far as I understand, they are still not available in the main branch.
> Making those changes increases risk for my memcpy/memset optimizations.
> I consider such work, while worthy in its own right, to be out of
> scope for my patch set.
> 
> Respectively,
> - Patrick McGehearty

I just sent a patchset to refactor all remaining IFUNC resolver still
in assembly to C, including all the missing sparc ones.  

Also, I adjusted your patches to my refactor in personal branch [2] and
I also split the patch in memcpy/memmove and memset/bzero. It simplifies 
a lot new ifunc inclusions, for instance the memcpy part is just:
---
---

So if you could help with any review I will be thankful. I would expect
the memcpy/memmove and memset/bzero refactor to be straightforward.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7

diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
index 46f3795..dbdad2d 100644 (file)
--- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
+++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h
@@ -19,6 +19,7 @@
 
 #include <ifunc-init.h>
 
+extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden;
@@ -28,6 +29,8 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden;
 static inline void *
 IFUNC_SELECTOR (int hwcap)
 {
+  if (hwcap & HWCAP_SPARC_ADP)
+    return OPTIMIZE (niagara7);
   if (hwcap & HWCAP_SPARC_CRYPTO)
     return OPTIMIZE (niagara4);
   if (hwcap & HWCAP_SPARC_N2)