From patchwork Thu Oct 26 19:39:19 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Adhemerval Zanella Netto X-Patchwork-Id: 23893 Received: (qmail 48060 invoked by alias); 26 Oct 2017 19:39:30 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 46858 invoked by uid 89); 26 Oct 2017 19:39:29 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-26.4 required=5.0 tests=BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, RCVD_IN_SORBS_SPAM, SPF_PASS autolearn=ham version=3.3.2 spammy=1035, developed, thankful, 020 X-HELO: mail-qk0-f196.google.com X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=r7M6uutptInd/d+4ORhlLy7COCrgM75l5UXpmsSVGkM=; b=g1YaL2of6IWCa/roJhRypTBrd+6LsDn3qjgeG8tCM9YKPyFo4mvnKnU1X9Vc85eQ05 qiI/Cixi+OvGhSKAVCCAdNBaDR+JAY/zn56OR5/zYVyLDBMue66DJllhJ6QfnjfT8/HX VRvG2woYMb9U6gNK7jOIPtecVTsZoLm6o57os5mQ+AgEF4ziXeCJkZf/Ue2bGsav4s4U lrLFBF/rLDXQWdiP9MwVEUKfgBpHqf2oDH0Y7tqkVZ1LDFk6cfWt86BRToCTiBD2bmn/ VeGE/8Pri9jeYL09msyZ/bQaKnyq3phBREr4ge1XrZKSjR2VuWUDn8kQa7UJW9fHwmmB m4Jw== X-Gm-Message-State: AMCzsaWHXIJy3LXXqexP1fABuIGPu4XW/Dw2au/vPdnPAZihzWIYyI81 b6gJ9o1GAOK3gqWBEOa4a3nTav5hD6A= X-Google-Smtp-Source: ABhQp+QiAW+4JNvRG62RTpL9IRdiizrhwRA8I9i6nYj/0NPqwEaM5ZCPh5urR4k/ry+71+D/sXjdFg== X-Received: by 10.55.156.210 with SMTP id f201mr9169300qke.283.1509046764426; Thu, 26 Oct 2017 12:39:24 -0700 (PDT) Subject: Re: [PATCH 0/3] sparc M7 optimized memcpy/memset To: libc-alpha@sourceware.org References: <1506542999-97895-1-git-send-email-patrick.mcgehearty@oracle.com> <20170927.214103.2236958753795210322.davem@davemloft.net> <1f543a82-5816-11f1-d615-a5b4a07c8392@oracle.com> <5de85da5-b994-1619-a682-7d1fcd2179f6@linaro.org> From: Adhemerval Zanella Message-ID: Date: Thu, 26 Oct 2017 17:39:19 -0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 In-Reply-To: On 19/10/2017 14:46, Patrick McGehearty wrote: > On 10/5/2017 7:19 AM, Adhemerval Zanella wrote: >> >> On 04/10/2017 18:07, Patrick McGehearty wrote: >>> On 9/27/2017 11:41 PM, David Miller wrote: >>>> From: Patrick McGehearty >>>> Date: Wed, 27 Sep 2017 16:09:56 -0400 >>>> >>>>> The following patch set provides optimized versions of >>>>> memcpy/mempcpy/memmove/memset/bzero for Sparc versions M7 (and later). >>>>> Support for recognizing M7 is also provided. >>>>> An assembly version of memmove for ultra1+ is provided. >>>>> >>>>> Jose E. Marchesi (2): >>>>>     sparc: support the ADP hw capability. >>>>>     sparc: assembly version of memmove for ultra1+ >>>>> >>>>> Patrick McGehearty (1): >>>>>     sparc: M7 optimized memcpy/mempcpy/memmove/memset/bzero. >>>> Looks good to me. >>> Could someone install the 9/28/2017 sparc M7 optimized memcpy/memset >>> patches to the repository? >>> >>> As a new submitter, I don't have permissions. >> What about https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html ? >> I am finishing up the ifunc C implementation and I plan to send it >> to review today. > In https://sourceware.org/ml/libc-alpha/2017-10/msg00125.html, > I see three issues which I will cover here. > > Performance: > > I reported: >>>> For memcpy/mempcpy/memmove, performance comparison with niagara4 code: >>>> Long word aligned data >>>>    0-127 bytes - minimal changes >>>>    128-1023 bytes - 7-30% gain >>>>    1024+ bytes - 1-7% gain (in cache); 30-100% gain (out of cache) >>>> Word aligned data >>>>    0-127 bytes - 50%+ gain >>>>    128-1023 bytes - 10-200% gain >>>>    1024+ bytes - 0-15% gain (in cache); 5-50% gain (out of cache) >>>> Unaligned data >>>>    0-127 bytes - 0-70%+ gain >>>>    128-447 bytes - 40-80%+ gain >>>>    448-511 bytes - 1-3% loss >>>>    512-4096 bytes - 2-3% gain (in cache); 0-20% gain (out of cache) >>>>    4096+ bytes - +/- 3% (in cache); 20-50% gain (out of cache) >>>> >>>> For memset/bzero, performance comparison with niagara4 code: >>>> For memset nonzero data, >>>>    256-1023 bytes - 60-90% gain (in cache); 5% gain (out of cache) >>>>    1K+ bytes - 80-260% gain (in cache); 40-80% gain (out of cache) >>>> For memset zero data (and bzero), >>>>    256-1023 bytes - 80-120% gain (in cache), 0% gain (out of cache) >>>>    1024+ bytes - 2-4x gain (in cache), 10-35% gain (out of cache) > > Adhemerval asked: >>Which benchmark did you use to get theses values? If it where obtained >>with benchtests one please attach the resulting files.  If not, please >>either indicate how to reproduce the data or work to improve benchtests >>with required datapoints. > > A fair question. As noted, the performance comparisions are with the > niagara4 code for memcpy and memset, the current default for m7/t7/s7 > sparc platforms. > > I used a personal perf test suite that I developed over a period of > years while tuning memcpy/memset for Sparc/Solaris for most Sparc > platforms going back to UltraSparc IV. The tests are run in isolation > and not quickly adaptable to the existing Linux bench tests > structure. I also ran the Linux bench tests. The results from > the Linux bench tests are generally similar to the numbers shown > above, with some individual differences due to different ordering > of test sizes which causes different branch prediction and cache > warmth behavior. Because these tests give similar results, I see > no particular need to add my tests to the Linux bench tests. > I've continued to use my tests because of my long familiarity them. I presume by 'Linux bench tests' you referring to glibc benchtests, right? Usually with performance enhancement is usual to send the benchmark data from improved symbols along with the patch submission. > > As possible interest for those designing "out of cache" tests, the > approach used in my personal tests is to allocate a "cache clearing > buffer" at least 4 times the size of the largest cache on the system > under test. In these tests, the L3 cache size is 64 MBytes and my > buffer size was 256 MBytes. Before each set of timing runs, > my program wrote new data to each long word address of the > cache clearing buffer. Then, during the timing test for each > combination of src and destination alignments and test lengths, > successive copies were executed with each successive src and dest > started at least 512 bytes after the end of the previous copy. > The innermost loop did 20 copies. Each timing test was repeated > 7 times with the median being used to minimize noise effects > of other system activity. If you could improve the mem* benchtests with this suggestion by sending patches to either benchmark changes or new benchmarks it will be welcome. > >> Adhemerval asked that memcpy and memset be in separate patches >> within the patch set. > > If it is necessary to resubmit the patch set for other reasons, > I will split memcpy and memset. Without the need to resubmit, > it feels like 'make work' in my opinion. The optimizations to both > files are logically similar and driven by the architectural > changes in the Block Init Store (BIS) instruction. > >> Adhemerval asked that the ifunc selector code use C instead of assembly. >> He has prepared changes to replace all assembly uses of ifunc with C. > > The existing ifunc code in that directory uses assembly. > Adhemerval's changes were not available when I submitted the patch. > As far as I understand, they are still not available in the main branch. > Making those changes increases risk for my memcpy/memset optimizations. > I consider such work, while worthy in its own right, to be out of > scope for my patch set. > > Respectively, > - Patrick McGehearty I just sent a patchset to refactor all remaining IFUNC resolver still in assembly to C, including all the missing sparc ones. Also, I adjusted your patches to my refactor in personal branch [2] and I also split the patch in memcpy/memmove and memset/bzero. It simplifies a lot new ifunc inclusions, for instance the memcpy part is just: --- --- So if you could help with any review I will be thankful. I would expect the memcpy/memmove and memset/bzero refactor to be straightforward. [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ifunc-c-sparc-m7 diff --git a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h index 46f3795..dbdad2d 100644 (file) --- a/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h +++ b/sysdeps/sparc/sparc64/multiarch/ifunc-memcpy.h @@ -19,6 +19,7 @@ #include +extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara7) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara4) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara2) attribute_hidden; extern __typeof (REDIRECT_NAME) OPTIMIZE (niagara1) attribute_hidden; @@ -28,6 +29,8 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ultra1) attribute_hidden; static inline void * IFUNC_SELECTOR (int hwcap) { + if (hwcap & HWCAP_SPARC_ADP) + return OPTIMIZE (niagara7); if (hwcap & HWCAP_SPARC_CRYPTO) return OPTIMIZE (niagara4); if (hwcap & HWCAP_SPARC_N2)