Message ID: 20230901235224.3304592-1-evan@rivosinc.com
From: Evan Green <evan@rivosinc.com>
To: libc-alpha@sourceware.org
Cc: Florian Weimer <fweimer@redhat.com>, vineetg@rivosinc.com
Subject: [PATCH v8 0/6] RISC-V: ifunced memcpy using new kernel hwprobe interface
Date: Fri, 1 Sep 2023 16:52:18 -0700
Series: RISC-V: ifunced memcpy using new kernel hwprobe interface
Message
Evan Green
Sept. 1, 2023, 11:52 p.m. UTC
This series illustrates the use of a recently accepted Linux syscall that
enumerates architectural information about the RISC-V cores the system is
running on. In this series we expose a small wrapper function around the
syscall. An ifunc selector for memcpy queries it to see if unaligned access
is "fast" on this hardware. If it is, it selects a newly provided
implementation of memcpy that doesn't work hard at aligning the src and
destination buffers.

For applications and libraries outside of glibc that want to use
__riscv_hwprobe() in ifunc selectors, this series also passes a pointer to
the __riscv_hwprobe() function as the second argument to ifunc selectors. A
new inline convenience function helps application and library callers check
the pointer for validity and quickly probe a single key.

The memcpy implementation is independent enough from the rest of the series
that it can be omitted safely if desired.

Performance numbers were compared using a small test program [1], run on a
D1 Nezha board, which supports fast unaligned access. "Fast" here means
copying unaligned words is faster than copying byte-wise, but still slower
than copying aligned words. Here's the speed of various memcpy()s with the
generic implementation. The numbers below were taken with v4's memcpy
implementation; with the "copy last byte via overlapping misaligned word"
fix, these should get slightly better.

memcpy size 1 count 1000000 offset 0 took 109564 us
memcpy size 3 count 1000000 offset 0 took 138425 us
memcpy size 4 count 1000000 offset 0 took 148374 us
memcpy size 7 count 1000000 offset 0 took 178433 us
memcpy size 8 count 1000000 offset 0 took 188430 us
memcpy size f count 1000000 offset 0 took 266118 us
memcpy size f count 1000000 offset 1 took 265940 us
memcpy size f count 1000000 offset 3 took 265934 us
memcpy size f count 1000000 offset 7 took 266215 us
memcpy size f count 1000000 offset 8 took 265954 us
memcpy size f count 1000000 offset 9 took 265886 us
memcpy size 10 count 1000000 offset 0 took 195308 us
memcpy size 11 count 1000000 offset 0 took 205161 us
memcpy size 17 count 1000000 offset 0 took 274376 us
memcpy size 18 count 1000000 offset 0 took 199188 us
memcpy size 19 count 1000000 offset 0 took 209258 us
memcpy size 1f count 1000000 offset 0 took 278263 us
memcpy size 20 count 1000000 offset 0 took 207364 us
memcpy size 21 count 1000000 offset 0 took 217143 us
memcpy size 3f count 1000000 offset 0 took 300023 us
memcpy size 40 count 1000000 offset 0 took 231063 us
memcpy size 41 count 1000000 offset 0 took 241259 us
memcpy size 7c count 100000 offset 0 took 32807 us
memcpy size 7f count 100000 offset 0 took 36274 us
memcpy size ff count 100000 offset 0 took 47818 us
memcpy size ff count 100000 offset 0 took 47932 us
memcpy size 100 count 100000 offset 0 took 40468 us
memcpy size 200 count 100000 offset 0 took 64245 us
memcpy size 27f count 100000 offset 0 took 82549 us
memcpy size 400 count 100000 offset 0 took 111254 us
memcpy size 407 count 100000 offset 0 took 119364 us
memcpy size 800 count 100000 offset 0 took 203899 us
memcpy size 87f count 100000 offset 0 took 222465 us
memcpy size 87f count 100000 offset 3 took 222289 us
memcpy size 1000 count 100000 offset 0 took 388846 us
memcpy size 1000 count 100000 offset 1 took 468827 us
memcpy size 1000 count 100000 offset 3 took 397098 us
memcpy size 1000 count 100000 offset 4 took 397379 us
memcpy size 1000 count 100000 offset 5 took 397368 us
memcpy size 1000 count 100000 offset 7 took 396867 us
memcpy size 1000 count 100000 offset 8 took 389227 us
memcpy size 1000 count 100000 offset 9 took 395949 us
memcpy size 3000 count 50000 offset 0 took 674837 us
memcpy size 3000 count 50000 offset 1 took 676944 us
memcpy size 3000 count 50000 offset 3 took 679709 us
memcpy size 3000 count 50000 offset 4 took 680829 us
memcpy size 3000 count 50000 offset 5 took 678024 us
memcpy size 3000 count 50000 offset 7 took 681097 us
memcpy size 3000 count 50000 offset 8 took 670004 us
memcpy size 3000 count 50000 offset 9 took 674553 us

Here is that same test run with the assembly memcpy() in this series:

memcpy size 1 count 1000000 offset 0 took 92703 us
memcpy size 3 count 1000000 offset 0 took 112527 us
memcpy size 4 count 1000000 offset 0 took 120481 us
memcpy size 7 count 1000000 offset 0 took 149558 us
memcpy size 8 count 1000000 offset 0 took 90617 us
memcpy size f count 1000000 offset 0 took 174373 us
memcpy size f count 1000000 offset 1 took 178615 us
memcpy size f count 1000000 offset 3 took 178845 us
memcpy size f count 1000000 offset 7 took 178636 us
memcpy size f count 1000000 offset 8 took 174442 us
memcpy size f count 1000000 offset 9 took 178660 us
memcpy size 10 count 1000000 offset 0 took 99845 us
memcpy size 11 count 1000000 offset 0 took 112522 us
memcpy size 17 count 1000000 offset 0 took 179735 us
memcpy size 18 count 1000000 offset 0 took 110870 us
memcpy size 19 count 1000000 offset 0 took 121472 us
memcpy size 1f count 1000000 offset 0 took 188231 us
memcpy size 20 count 1000000 offset 0 took 119571 us
memcpy size 21 count 1000000 offset 0 took 132429 us
memcpy size 3f count 1000000 offset 0 took 227021 us
memcpy size 40 count 1000000 offset 0 took 166416 us
memcpy size 41 count 1000000 offset 0 took 180206 us
memcpy size 7c count 100000 offset 0 took 28602 us
memcpy size 7f count 100000 offset 0 took 31676 us
memcpy size ff count 100000 offset 0 took 39257 us
memcpy size ff count 100000 offset 0 took 39176 us
memcpy size 100 count 100000 offset 0 took 21928 us
memcpy size 200 count 100000 offset 0 took 35814 us
memcpy size 27f count 100000 offset 0 took 60315 us
memcpy size 400 count 100000 offset 0 took 63652 us
memcpy size 407 count 100000 offset 0 took 73160 us
memcpy size 800 count 100000 offset 0 took 121532 us
memcpy size 87f count 100000 offset 0 took 147269 us
memcpy size 87f count 100000 offset 3 took 144744 us
memcpy size 1000 count 100000 offset 0 took 232057 us
memcpy size 1000 count 100000 offset 1 took 254319 us
memcpy size 1000 count 100000 offset 3 took 256973 us
memcpy size 1000 count 100000 offset 4 took 257655 us
memcpy size 1000 count 100000 offset 5 took 259456 us
memcpy size 1000 count 100000 offset 7 took 260849 us
memcpy size 1000 count 100000 offset 8 took 232347 us
memcpy size 1000 count 100000 offset 9 took 254330 us
memcpy size 3000 count 50000 offset 0 took 382376 us
memcpy size 3000 count 50000 offset 1 took 389872 us
memcpy size 3000 count 50000 offset 3 took 385310 us
memcpy size 3000 count 50000 offset 4 took 389748 us
memcpy size 3000 count 50000 offset 5 took 391707 us
memcpy size 3000 count 50000 offset 7 took 386778 us
memcpy size 3000 count 50000 offset 8 took 385691 us
memcpy size 3000 count 50000 offset 9 took 392030 us

The assembly routine is measurably better.

[1] https://pastebin.com/DRyECNQW

Changes in v8:
- Fix missed 2.39 in abilists (Joseph)
- Just return -r (Florian)

Changes in v7:
- Bumped Versions up to 2.39 (Joseph)
- Used INTERNAL_SYSCALL_CALL, and return positive errno to match
  pthreads API (Florian).
- Remove __THROW since it creates a warning in combination with the
  fortified access decorators.
- Use INTERNAL_VSYSCALL_CALL (Florian)
- Remove __THROW from function pointer type, as it creates warnings
  together with __fortified_attr_access.
- Introduced static inline helper (Richard)
- Use new helper function in memcpy ifunc selector (Richard)

Changes in v6:
- Prefixed __riscv_hwprobe() parameter names with __ to avoid user
  macro namespace pollution (Joseph)
- Introduced riscv-ifunc.h for multi-arg ifunc selectors.
- Fix a couple regressions in the assembly from v5 :/
- Use passed hwprobe pointer in memcpy ifunc selector.

Changes in v5:
- Do unaligned word access for final trailing bytes (Richard)

Changes in v4:
- Remove __USE_GNU (Florian)
- __nonnull, __wur, __THROW, and __fortified_attr_access decorations
  (Florian)
- change long to long int (Florian)
- Fix comment formatting (Florian)
- Update backup kernel header content copy.
- Fix function declaration formatting (Florian)
- Changed export versions to 2.38
- Fixed comment style (Florian)

Changes in v3:
- Update argument types to match v4 kernel interface
- Add the "return" to the vsyscall
- Fix up vdso arg types to match kernel v4 version
- Remove ifdef around INLINE_VSYSCALL (Adhemerval)
- Word align dest for large memcpy()s.
- Add tags
- Remove spurious blank line from sysdeps/riscv/memcpy.c

Changes in v2:
- hwprobe.h: Use __has_include and duplicate Linux content to make
  compilation work when Linux headers are absent (Adhemerval)
- hwprobe.h: Put declaration under __USE_GNU (Adhemerval)
- Use INLINE_SYSCALL_CALL (Adhemerval)
- Update versions
- Update UNALIGNED_MASK to match kernel v3 series.
- Add vDSO interface
- Used _MASK instead of _FAST value itself.

Evan Green (6):
  riscv: Add Linux hwprobe syscall support
  riscv: Add hwprobe vdso call support
  riscv: Add __riscv_hwprobe pointer to ifunc calls
  riscv: Enable multi-arg ifunc resolvers
  riscv: Add ifunc helper method to hwprobe.h
  riscv: Add and use alignment-ignorant memcpy

 include/libc-symbols.h                      |  28 ++--
 sysdeps/riscv/dl-irel.h                     |   8 +-
 sysdeps/riscv/memcopy.h                     |  26 ++++
 sysdeps/riscv/memcpy.c                      |  63 ++++++++
 sysdeps/riscv/memcpy_noalignment.S          | 138 ++++++++++++++++++
 sysdeps/riscv/riscv-ifunc.h                 |  27 ++++
 sysdeps/unix/sysv/linux/dl-vdso-setup.c     |  10 ++
 sysdeps/unix/sysv/linux/dl-vdso-setup.h     |   3 +
 sysdeps/unix/sysv/linux/riscv/Makefile      |   8 +-
 sysdeps/unix/sysv/linux/riscv/Versions      |   3 +
 sysdeps/unix/sysv/linux/riscv/hwprobe.c     |  47 ++++++
 .../unix/sysv/linux/riscv/memcpy-generic.c  |  24 +++
 .../unix/sysv/linux/riscv/rv32/libc.abilist |   1 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist |   1 +
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h | 106 ++++++++++++++
 sysdeps/unix/sysv/linux/riscv/sysdep.h      |   1 +
 16 files changed, 477 insertions(+), 17 deletions(-)
 create mode 100644 sysdeps/riscv/memcopy.h
 create mode 100644 sysdeps/riscv/memcpy.c
 create mode 100644 sysdeps/riscv/memcpy_noalignment.S
 create mode 100644 sysdeps/riscv/riscv-ifunc.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hwprobe.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
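For third-party code, the selector pattern described above can be sketched as
follows. This is an illustration, not code from the series itself: the
my_memcpy* names are made up, while __riscv_hwprobe_t, __riscv_hwprobe_one()
and the RISCV_HWPROBE_* constants are the names proposed by these patches and
the Linux uapi header (their exact final signatures are defined by the
patches themselves):

#include <stddef.h>
#include <sys/hwprobe.h>   /* Added by this series.  */

/* Two hypothetical implementations to choose between.  */
extern void *my_memcpy_generic (void *, const void *, size_t);
extern void *my_memcpy_noalignment (void *, const void *, size_t);

/* With this series, RISC-V ifunc resolvers receive dl_hwcap plus a
   pointer to __riscv_hwprobe; the pointer may be null on older
   glibc/kernel combinations, which the helper below checks for.  */
static __typeof (my_memcpy_generic) *
my_memcpy_resolver (unsigned long int dl_hwcap,
                    __riscv_hwprobe_t hwprobe_func)
{
  unsigned long long int value;

  /* __riscv_hwprobe_one is the new inline convenience helper: it
     validates hwprobe_func and probes a single key, returning 0 on
     success.  */
  if (__riscv_hwprobe_one (hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0,
                           &value) == 0
      && (value & RISCV_HWPROBE_MISALIGNED_MASK)
         == RISCV_HWPROBE_MISALIGNED_FAST)
    return my_memcpy_noalignment;

  return my_memcpy_generic;
}

void *my_memcpy (void *, const void *, size_t)
     __attribute__ ((ifunc ("my_memcpy_resolver")));

Outside of a resolver, callers would instead fill an array of struct
riscv_hwprobe key/value pairs and call __riscv_hwprobe (pairs, pair_count,
0, NULL, 0) directly; per the v7 changelog above, the wrapper returns 0 on
success or a positive errno value.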
Comments
On Fri, 01 Sep 2023 16:52:18 PDT (-0700), Evan Green wrote:

> This series illustrates the use of a recently accepted Linux syscall that
> enumerates architectural information about the RISC-V cores the system
> is running on. In this series we expose a small wrapper function around
> the syscall. An ifunc selector for memcpy queries it to see if unaligned
> access is "fast" on this hardware. If it is, it selects a newly provided
> implementation of memcpy that doesn't work hard at aligning the src and
> destination buffers.
>
> For applications and libraries outside of glibc that want to use
> __riscv_hwprobe() in ifunc selectors, this series also sends a pointer
> to the riscv_hwprobe() function in as the second argument to ifunc
> selectors. A new inline convenience function can help application and
> library callers to check for validity and quickly probe a single key.

This came up during the Cauldron. It seems like everyone's OK with this
interface? It's fine with me, I just hadn't really felt qualified to
review it as I don't really understand the rules around IFUNC
resolution. Sounds like they're really quite complicated, though, so

Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>

Let's give folks a week to reply, as there's likely still a lot of
travel going on (Cauldron ended yesterday).

> The memcpy implementation is independent enough from the rest of the
> series that it can be omitted safely if desired.
>
> Performance numbers were compared using a small test program [1], run on
> a D1 Nezha board, which supports fast unaligned access. "Fast" here
> means copying unaligned words is faster than copying byte-wise, but
> still slower than copying aligned words. Here's the speed of various
> memcpy()s with the generic implementation. The numbers before are using
> v4's memcpy implementation, with the "copy last byte via overlapping
> misaligned word" fix this should get slightly better.
> [... benchmark results quoted in full; identical to the numbers in the
> cover letter above, snipped ...]
>
> The assembly routine is measurably better.
>
> [1] https://pastebin.com/DRyECNQW

IIRC we're not supposed to have pastebin links, as they'll eventually
disappear. I suppose we could just add this program as a testcase?

> [... v8 through v2 changelogs, patch list, and diffstat snipped ...]
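For reference, the output format above suggests a harness of roughly this
shape. This is only a reconstruction for illustration; the actual program is
the one behind [1] and may well differ:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Large enough for the biggest (size + offset) measured above.  */
static char src[0x4000] __attribute__ ((aligned (64)));
static char dst[0x4000] __attribute__ ((aligned (64)));

int
main (int argc, char **argv)
{
  if (argc != 4)
    {
      fprintf (stderr, "usage: %s <size-hex> <count> <offset>\n", argv[0]);
      return 1;
    }
  size_t size = strtoul (argv[1], NULL, 16);
  long int count = strtol (argv[2], NULL, 10);
  size_t offset = strtoul (argv[3], NULL, 10);
  if (size + offset > sizeof dst)
    return 1;

  /* Time COUNT copies of SIZE bytes at the given misalignment.  */
  struct timeval start, end;
  gettimeofday (&start, NULL);
  for (long int i = 0; i < count; i++)
    memcpy (dst + offset, src + offset, size);
  gettimeofday (&end, NULL);

  long int us = (end.tv_sec - start.tv_sec) * 1000000L
                + (end.tv_usec - start.tv_usec);
  printf ("memcpy size %zx count %ld offset %zu took %ld us\n",
          size, count, offset, us);
  return 0;
}

Taking the parameters from argv also keeps the compiler from expanding the
memcpy() calls inline for known constant sizes, which matters when the point
is to measure the ifunc-selected implementation.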
yeah, i doubt i have a vote here, but -- although i thought the inline
was a bit weird at first -- i think passing the function pointer to the
resolver is a clever idea, and even the inline grew on me... it's
harmless if you don't want it, and potentially handy if you don't know
what you're doing.

(the only reason i've held off on passing the function pointer in bionic
is because no-one's using it _outside_ bionic yet, and tbh until we have
FMV support in clang in a few years, i don't actually expect any more
users --- everything else i've unearthed will be parsing /proc/cpuinfo
and handling their own function pointers. but this API lgtm from the
perspective of a _different_ libc :-) )

On Mon, Sep 25, 2023 at 3:28 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
> [... full quote of Palmer's reply snipped ...]