Message ID: 085f3ec3cbe41e0a377b1d26089a871f04ffd5d6.camel@espressif.com
State: New
Headers:
Return-Path: <newlib-bounces~patchwork=sourceware.org@sourceware.org>
From: Alexey Lapshin <alexey.lapshin@espressif.com>
To: "newlib@sourceware.org" <newlib@sourceware.org>
CC: Alexey Gerenkov <alexey.gerenkov@espressif.com>,
    Ivan Grokhotkov <ivan@espressif.com>
Subject: [PATCH 3/6] newlib: mem[p]cpy/memmove improve performance for
    optimized versions
Date: Mon, 27 Jan 2025 10:45:55 +0000
Message-ID: <085f3ec3cbe41e0a377b1d26089a871f04ffd5d6.camel@espressif.com>
In-Reply-To: <4ca70bc28f5edbc5a23c747313151ac5d290f54b.camel@espressif.com>
References: <4ca70bc28f5edbc5a23c747313151ac5d290f54b.camel@espressif.com>
List-Id: Newlib mailing list <newlib.sourceware.org>
Series: Refactor and optimize string/memory functions
Commit Message
Alexey Lapshin
Jan. 27, 2025, 10:45 a.m. UTC
This change improves performance on memory blocks with sizes in the
range [4..15]. Performance measurements were made on a RISC-V machine
(memset):

size  4, CPU cycles change:  50 -> 37
size  5, CPU cycles change:  57 -> 40
size  6, CPU cycles change:  64 -> 47
size  7, CPU cycles change:  71 -> 54
size  8, CPU cycles change:  78 -> 44
size  9, CPU cycles change:  85 -> 47
size 10, CPU cycles change:  92 -> 54
size 11, CPU cycles change:  99 -> 61
size 12, CPU cycles change: 106 -> 51
size 13, CPU cycles change: 113 -> 54
size 14, CPU cycles change: 120 -> 61
size 15, CPU cycles change: 127 -> 68
---
 newlib/libc/string/memcpy.c  | 2 +-
 newlib/libc/string/memmove.c | 2 +-
 newlib/libc/string/mempcpy.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
--
2.43.0
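[Editor's note] For readers without the earlier patches of this series at
hand, here is a minimal sketch of the generic C word-copy path that the
one-line change targets. The names TOO_SMALL_BIG_BLOCK,
TOO_SMALL_LITTLE_BLOCK and UNALIGNED_X_Y come from this series'
refactoring as seen in the diff below; their definitions here follow the
classic newlib pattern (BIGBLOCKSIZE = four words, LITTLEBLOCKSIZE = one
word) and are an assumption, not copied from the patched tree:

#include <stddef.h>

/* Assumed definitions following the classic newlib pattern; only the
   macro names are taken from this series' refactoring.  */
#define LITTLEBLOCKSIZE (sizeof (long))
#define BIGBLOCKSIZE    (sizeof (long) << 2)
#define TOO_SMALL_BIG_BLOCK(LEN)    ((LEN) < BIGBLOCKSIZE)
#define TOO_SMALL_LITTLE_BLOCK(LEN) ((LEN) < LITTLEBLOCKSIZE)
#define UNALIGNED_X_Y(X, Y) \
  (((long)(X) & (LITTLEBLOCKSIZE - 1)) | ((long)(Y) & (LITTLEBLOCKSIZE - 1)))

void *
sketch_memcpy (void *__restrict dst0, const void *__restrict src0, size_t len0)
{
  char *dst = dst0;
  const char *src = src0;

  /* Before this patch the guard was TOO_SMALL_BIG_BLOCK (four words);
     now a single aligned word is enough to enter the fast path.  */
  if (!TOO_SMALL_LITTLE_BLOCK (len0) && !UNALIGNED_X_Y (src, dst))
    {
      long *aligned_dst = (long *) dst;
      const long *aligned_src = (const long *) src;

      /* Unrolled copy, four words per iteration.  */
      while (len0 >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len0 -= BIGBLOCKSIZE;
        }

      /* One word at a time for what remains.  */
      while (len0 >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len0 -= LITTLEBLOCKSIZE;
        }

      dst = (char *) aligned_dst;
      src = (const char *) aligned_src;
    }

  /* Byte copy: the tail, and everything for small or unaligned inputs.  */
  while (len0--)
    *dst++ = *src++;

  return dst0;
}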
Comments
On Jan 27 10:45, Alexey Lapshin wrote:
> This change improves performance on memory blocks with sizes in the
> range [4..15]. Performance measurements were made on a RISC-V machine
> (memset):
>
> size  4, CPU cycles change:  50 -> 37
> size  5, CPU cycles change:  57 -> 40
> size  6, CPU cycles change:  64 -> 47
> size  7, CPU cycles change:  71 -> 54
> size  8, CPU cycles change:  78 -> 44
> size  9, CPU cycles change:  85 -> 47
> size 10, CPU cycles change:  92 -> 54
> size 11, CPU cycles change:  99 -> 61
> size 12, CPU cycles change: 106 -> 51
> size 13, CPU cycles change: 113 -> 54
> size 14, CPU cycles change: 120 -> 61
> size 15, CPU cycles change: 127 -> 68

But is that generally true for other architectures as well?


Corinna
On 28/01/2025 16:11, Corinna Vinschen wrote:
> On Jan 27 10:45, Alexey Lapshin wrote:
>> This change improves performance on memory blocks with sizes in the
>> range [4..15]. Performance measurements were made on a RISC-V machine
>> (memset):
>>
>> size  4, CPU cycles change:  50 -> 37
>> size  5, CPU cycles change:  57 -> 40
>> size  6, CPU cycles change:  64 -> 47
>> size  7, CPU cycles change:  71 -> 54
>> size  8, CPU cycles change:  78 -> 44
>> size  9, CPU cycles change:  85 -> 47
>> size 10, CPU cycles change:  92 -> 54
>> size 11, CPU cycles change:  99 -> 61
>> size 12, CPU cycles change: 106 -> 51
>> size 13, CPU cycles change: 113 -> 54
>> size 14, CPU cycles change: 120 -> 61
>> size 15, CPU cycles change: 127 -> 68
>
> But is that generally true for other architectures as well?

No, it can be very dependent on the microarchitecture. I know of Arm
implementations where it would be better and implementations where it
would be (much) worse.

The other variable is that for misaligned copies there's a choice of
bringing the source data to alignment or the target data (you really
don't want to do a large copy with both misaligned). That can also vary
by microarchitecture.

But we have custom assembler versions for Arm, so it probably doesn't
matter for us, except at -Os, and there I wouldn't expect us to want
large expanded chunks of code for all the cases that misaligned copies
might involve.

R.
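[Editor's note] To illustrate the choice Richard describes, below is a
rough sketch (not newlib code, and not part of this patch) of one of the
two strategies: byte-copy until the destination is word-aligned, then
build each destination word from two shifted loads of the misaligned
source. The opposite strategy aligns the loads and performs misaligned
stores instead; which one wins depends on the microarchitecture. The
sketch assumes a little-endian target and tolerates reading a few bytes
beyond src+len within the final aligned word, which a production version
would have to guarantee safe or avoid:

#include <stddef.h>
#include <stdint.h>

static void
copy_align_dst (unsigned char *dst, const unsigned char *src, size_t len)
{
  /* Byte-copy until the destination is word-aligned.  */
  while (len && ((uintptr_t) dst & (sizeof (uintptr_t) - 1)))
    {
      *dst++ = *src++;
      len--;
    }

  size_t shift = (uintptr_t) src & (sizeof (uintptr_t) - 1);
  if (shift == 0)
    {
      /* Both sides aligned now: plain word copy.  */
      while (len >= sizeof (uintptr_t))
        {
          *(uintptr_t *) dst = *(const uintptr_t *) src;
          dst += sizeof (uintptr_t);
          src += sizeof (uintptr_t);
          len -= sizeof (uintptr_t);
        }
    }
  else
    {
      /* Aligned loads from the word containing src, combined with
         shifts into aligned stores (little-endian byte order).  */
      const uintptr_t *asrc = (const uintptr_t *) (src - shift);
      uintptr_t lo = *asrc++;
      while (len >= sizeof (uintptr_t))
        {
          uintptr_t hi = *asrc++;
          *(uintptr_t *) dst = (lo >> (8 * shift))
                             | (hi << (8 * (sizeof (uintptr_t) - shift)));
          lo = hi;
          dst += sizeof (uintptr_t);
          len -= sizeof (uintptr_t);
        }
      src = (const unsigned char *) asrc - sizeof (uintptr_t) + shift;
    }

  /* Byte tail.  */
  while (len--)
    *dst++ = *src++;
}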
On Jan 28 16:33, Richard Earnshaw (lists) wrote:
> On 28/01/2025 16:11, Corinna Vinschen wrote:
>> On Jan 27 10:45, Alexey Lapshin wrote:
>>> This change improves performance on memory blocks with sizes in the
>>> range [4..15]. Performance measurements were made on a RISC-V machine
>>> (memset):
>>>
>>> size  4, CPU cycles change:  50 -> 37
>>> size  5, CPU cycles change:  57 -> 40
>>> size  6, CPU cycles change:  64 -> 47
>>> size  7, CPU cycles change:  71 -> 54
>>> size  8, CPU cycles change:  78 -> 44
>>> size  9, CPU cycles change:  85 -> 47
>>> size 10, CPU cycles change:  92 -> 54
>>> size 11, CPU cycles change:  99 -> 61
>>> size 12, CPU cycles change: 106 -> 51
>>> size 13, CPU cycles change: 113 -> 54
>>> size 14, CPU cycles change: 120 -> 61
>>> size 15, CPU cycles change: 127 -> 68
>>
>> But is that generally true for other architectures as well?
>
> No, it can be very dependent on the microarchitecture. I know of Arm
> implementations where it would be better and implementations where it
> would be (much) worse.

Ok, we're talking about the case that memcpy runs the optimization
based on the fact that the size of the block to copy is at least
sizeof(long) vs. at least sizeof(long)*4, while the check for being
aligned is based on sizeof(long) alone.

So, assuming sizeof(long) is 4, the optimization doesn't kick in for
blocks < 16 bytes right now, while Alexey's change allows the
optimization to run even for 4 byte blocks.

As I understand it, the additional length checks in the optimizing code
*may* have a bigger performance hit than the time saved by copying 4
bytes at once rather than bytewise. Alexey's tests above show that even
for a 4 byte copy, the optimization still gives a performance boost
compared to a bytewise copy on RISC-V.

This part is interesting. Do we really have a supported architecture
where one additional `while (len0 >= BIGBLOCKSIZE)' check has such an
impact that running the optimizing code is worse than a byte copy for
small, but aligned blocks?

> The other variable is that for misaligned copies there's a choice of
> bringing the source data to alignment or the target data (you really
> don't want to do a large copy with both misaligned). That can also
> vary by microarchitecture.

Yeah, but our simple fallback memcpy doesn't try to align, it just runs
the optimizing code block if both blocks are already aligned on input.
Alexey's patch doesn't change this.

> But we have custom assembler versions for Arm, so it probably doesn't
> matter for us, except at -Os

-Os isn't affected because it runs the PREFER_SIZE_OVER_SPEED code
which only does byte copy anyway.


Corinna
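[Editor's note] For reference, cycle numbers like those in this thread
could plausibly be gathered with a harness along the lines below. This is
a sketch under stated assumptions, not Alexey's actual benchmark: it
assumes a RISC-V target where the cycle CSR is readable from user mode,
made-up buffer sizes and repeat counts, and compilation with -fno-builtin
so the library memcpy is measured rather than an inlined compiler builtin:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Read the RISC-V cycle CSR.  Needs user-mode counter access to be
   enabled (mcounteren/scounteren) or this will trap.  */
static inline unsigned long
rdcycle (void)
{
  unsigned long c;
  __asm__ volatile ("rdcycle %0" : "=r" (c) :: "memory");
  return c;
}

int
main (void)
{
  static char src[64] = "payload", dst[64];

  for (size_t len = 4; len <= 15; len++)
    {
      unsigned long best = (unsigned long) -1;

      /* Keep the minimum over many repeats to filter interrupt noise.  */
      for (int rep = 0; rep < 1000; rep++)
        {
          unsigned long t0 = rdcycle ();
          memcpy (dst, src, len);
          unsigned long t1 = rdcycle ();
          if (t1 - t0 < best)
            best = t1 - t0;
        }
      printf ("size %zu, CPU cycles: %lu\n", len, best);
    }
  return 0;
}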
diff --git a/newlib/libc/string/memcpy.c b/newlib/libc/string/memcpy.c
index 1bbd4e0bf..e680c444d 100644
--- a/newlib/libc/string/memcpy.c
+++ b/newlib/libc/string/memcpy.c
@@ -57,7 +57,7 @@ memcpy (void *__restrict dst0,
 
   /* If the size is small, or either SRC or DST is unaligned,
      then punt into the byte copy loop.  This should be rare.  */
-  if (!TOO_SMALL_BIG_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
+  if (!TOO_SMALL_LITTLE_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
     {
       aligned_dst = (long*)dst;
       aligned_src = (long*)src;
diff --git a/newlib/libc/string/memmove.c b/newlib/libc/string/memmove.c
index a82744c7d..4c5ec6f83 100644
--- a/newlib/libc/string/memmove.c
+++ b/newlib/libc/string/memmove.c
@@ -85,7 +85,7 @@ memmove (void *dst_void,
   /* Use optimizing algorithm for a non-destructive copy to closely
      match memcpy. If the size is small or either SRC or DST is unaligned,
      then punt into the byte copy loop.  This should be rare.  */
-  if (!TOO_SMALL_BIG_BLOCK(length) && !UNALIGNED_X_Y(src, dst))
+  if (!TOO_SMALL_LITTLE_BLOCK(length) && !UNALIGNED_X_Y(src, dst))
     {
       aligned_dst = (long*)dst;
       aligned_src = (long*)src;
diff --git a/newlib/libc/string/mempcpy.c b/newlib/libc/string/mempcpy.c
index 06e97de85..561892199 100644
--- a/newlib/libc/string/mempcpy.c
+++ b/newlib/libc/string/mempcpy.c
@@ -53,7 +53,7 @@ mempcpy (void *dst0,
 
   /* If the size is small, or either SRC or DST is unaligned,
      then punt into the byte copy loop.  This should be rare.  */
-  if (!TOO_SMALL_BIG_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
+  if (!TOO_SMALL_LITTLE_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
     {
       aligned_dst = (long*)dst;
       aligned_src = (long*)src;
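[Editor's note] Since the change lowers the fast-path threshold for all
three functions, a small self-contained harness like the following can
sanity-check correctness across the newly affected sizes and alignments.
This is a hypothetical test sketch, not part of the patch; it checks
memcpy and mempcpy against a bytewise reference and memmove on an
overlapping shift:

#define _GNU_SOURCE          /* for mempcpy's declaration in <string.h> */
#include <assert.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Enough room for every small size/offset combination below.  */
  char src[64], dst[64], ref[64];

  for (size_t len = 0; len <= 16; len++)
    for (size_t off = 0; off < sizeof (long); off++)
      {
        for (size_t i = 0; i < sizeof src; i++)
          src[i] = (char) (i * 31 + 7);

        /* memcpy against a bytewise reference copy.  */
        memset (dst, 0, sizeof dst);
        memset (ref, 0, sizeof ref);
        for (size_t i = 0; i < len; i++)
          ref[off + i] = src[off + i];
        memcpy (dst + off, src + off, len);
        assert (memcmp (dst, ref, sizeof dst) == 0);

        /* mempcpy must additionally return one past the last byte.  */
        char *end = mempcpy (dst + off, src + off, len);
        assert (end == dst + off + len);

        /* memmove with an overlapping one-byte forward shift.  */
        memcpy (dst, src, sizeof dst);
        memcpy (ref, src, sizeof ref);
        for (size_t i = len; i-- > 0; )
          ref[off + 1 + i] = ref[off + i];
        memmove (dst + off + 1, dst + off, len);
        assert (memcmp (dst, ref, sizeof dst) == 0);
      }

  puts ("all small-size copy checks passed");
  return 0;
}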