[AArch64] Adjust writeback in non-zero memset


Message ID

DB5PR08MB103082464EB664B31AFC5C5783C30@DB5PR08MB1030.eurprd08.prod.outlook.com

State

New, archived

Headers

Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: 'GNU C Library' <libc-alpha@sourceware.org>, Szabolcs Nagy
	<Szabolcs.Nagy@arm.com>
CC: nd <nd@arm.com>
Subject: Re: [PATCH][AArch64] Adjust writeback in non-zero memset
Date: Wed, 14 Nov 2018 14:45:13 +0000
Message-ID: <DB5PR08MB103082464EB664B31AFC5C5783C30@DB5PR08MB1030.eurprd08.prod.outlook.com>
References: <DB5PR08MB103042C88641FD9D3A04C3F983C40@DB5PR08MB1030.eurprd08.prod.outlook.com>
In-Reply-To: <DB5PR08MB103042C88641FD9D3A04C3F983C40@DB5PR08MB1030.eurprd08.prod.outlook.com>
received-spf: None (protection.outlook.com: arm.com does not designate
	permitted sender hosts)
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Commit Message

Wilco Dijkstra Nov. 14, 2018, 2:45 p.m. UTC

  v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.

This fixes an inefficiency in the non-zero memset.  Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.

Tested against the GLIBC testsuite, OK for commit?

ChangeLog:
2018-11-14  Wilco Dijkstra  <wdijkstr@arm.com>

        * sysdeps/aarch64/memset.S (MEMSET): Improve non-zero memset loop.

---

Comments

Siddhesh Poyarekar Nov. 16, 2018, 3:24 p.m. UTC | #1

On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
> 
> This fixes an inefficiency in the non-zero memset.  Delaying the writeback
> until the end of the loop is slightly faster on some cores - this shows
> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
> 
> Tested against the GLIBC testsuite, OK for commit?

Can you please also summarize the performance results for other processors?

Thanks,
Siddhesh

Wilco Dijkstra Nov. 16, 2018, 5:04 p.m. UTC | #2

Hi Siddhesh,

>On 14/11/18 8:15 PM, Wilco Dijkstra wrote:
>> v2: Also bias dst in the zva_other code to avoid issues with zva sizes >= 256.
>> 
>> This fixes an inefficiency in the non-zero memset.  Delaying the writeback
>> until the end of the loop is slightly faster on some cores - this shows
>> ~5% performance gain on Cortex-A53 when doing large non-zero memsets.
>> 
>> Tested against the GLIBC testsuite, OK for commit?
>
> Can you please also summarize the performance results for other processors?

I ran it on Cortex-A72, and there isn't a performance difference on the memset bench
beyond measurement errors. On any out-of-order core, address increments are
executed in parallel with loads/stores, i.e. they have no measurable latency, so
their exact placement is irrelevant.

Wilco

Siddhesh Poyarekar Nov. 16, 2018, 5:16 p.m. UTC | #3

On 16/11/18 10:34 PM, Wilco Dijkstra wrote:
> I ran it on Cortex-A72, and there isn't a performance difference on the memset bench
> beyond measurement errors. On any out-of-order core, address increments are
> executed in parallel with loads/stores, i.e. they have no measurable latency, so
> their exact placement is irrelevant.

Thanks for the confirmation.  Looks OK to me.

Siddhesh


Patch

diff --git a/sysdeps/aarch64/memset.S b/sysdeps/aarch64/memset.S
index 4a454593618f78e22c55520d56737fab5d8f63a4..9738cf5fd55a1d937fb3392cec46f37b4d5fb51d 100644
--- a/sysdeps/aarch64/memset.S
+++ b/sysdeps/aarch64/memset.S
@@ -89,10 +89,10 @@  L(set_long):
 	b.eq	L(try_zva)
 L(no_zva):
 	sub	count, dstend, dst	/* Count is 16 too large.  */
-	add	dst, dst, 16
+	sub	dst, dst, 16		/* Dst is biased by -32.  */
 	sub	count, count, 64 + 16	/* Adjust count and bias for loop.  */
-1:	stp	q0, q0, [dst], 64
-	stp	q0, q0, [dst, -32]
+1:	stp	q0, q0, [dst, 32]
+	stp	q0, q0, [dst, 64]!
 L(tail64):
 	subs	count, count, 64
 	b.hi	1b
@@ -183,6 +183,7 @@  L(zva_other):
 	subs	count, count, zva_len
 	b.hs	3b
 4:	add	count, count, zva_len
+	sub	dst, dst, 32		/* Bias dst for tail loop.  */
 	b	L(tail64)
 #endif
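
For readers following the address arithmetic, here is a rough C model of the
two loop variants in the first hunk.  It is only a sketch and not part of the
patch: the names trace, old_loop, new_loop and loops_match are hypothetical, dst
is treated as a plain integer offset, and only the stp addresses and the final
value of dst are tracked.  Under those assumptions, both variants issue the same
32-byte stores in the same order, and the new loop leaves dst 32 bytes lower.
That appears to be why zva_other subtracts 32 before branching to L(tail64).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STORES 64

/* Offsets of the 32-byte stores issued by one loop variant.  */
struct trace {
    long offsets[MAX_STORES];
    size_t n;
    long final_dst;
};

/* Old loop: the first stp uses post-index writeback.  */
static struct trace old_loop(long dst, long count)
{
    struct trace t = { .n = 0 };
    dst += 16;                          /* add  dst, dst, 16           */
    count -= 64 + 16;                   /* sub  count, count, 64 + 16  */
    do {
        assert(t.n + 2 <= MAX_STORES);
        t.offsets[t.n++] = dst;         /* stp  q0, q0, [dst], 64      */
        dst += 64;                      /* ...post-index writeback     */
        t.offsets[t.n++] = dst - 32;    /* stp  q0, q0, [dst, -32]     */
        count -= 64;                    /* subs count, count, 64       */
    } while (count > 0);                /* b.hi 1b                     */
    t.final_dst = dst;
    return t;
}

/* New loop: the writeback moves to the second stp (pre-index).  */
static struct trace new_loop(long dst, long count)
{
    struct trace t = { .n = 0 };
    dst -= 16;                          /* sub  dst, dst, 16           */
    count -= 64 + 16;                   /* sub  count, count, 64 + 16  */
    do {
        assert(t.n + 2 <= MAX_STORES);
        t.offsets[t.n++] = dst + 32;    /* stp  q0, q0, [dst, 32]      */
        dst += 64;                      /* stp  q0, q0, [dst, 64]!     */
        t.offsets[t.n++] = dst;         /* ...store at the updated dst */
        count -= 64;                    /* subs count, count, 64       */
    } while (count > 0);                /* b.hi 1b                     */
    t.final_dst = dst;
    return t;
}

/* Returns 1 if both variants issue identical stores and the new
   variant ends with dst exactly 32 bytes below the old one.  */
static int loops_match(long dst, long count)
{
    struct trace a = old_loop(dst, count);
    struct trace b = new_loop(dst, count);
    if (a.n != b.n || a.final_dst - b.final_dst != 32)
        return 0;
    for (size_t i = 0; i < a.n; i++)
        if (a.offsets[i] != b.offsets[i])
            return 0;
    return 1;
}
```

In this model, loops_match(0, 512) returns 1: both variants store at offsets 16,
48, 80 and so on, and only the bookkeeping of dst differs.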