From: Evan Green
To: libc-alpha@sourceware.org
Cc: vineetg@rivosinc.com, palmer@rivosinc.com, slewis@rivosinc.com, Evan Green
Subject: [PATCH v3 0/3] RISC-V: ifunced memcpy using new kernel hwprobe interface
Date: Fri, 7 Apr 2023 16:07:07 -0700
Message-Id: <20230407230711.2621614-1-evan@rivosinc.com>
X-Patchwork-Id: 55671

This series illustrates the use of a proposed Linux
syscall that enumerates architectural information about the RISC-V cores
the system is running on. In this series we expose a small wrapper
function around the syscall. An ifunc selector for memcpy queries it to
see if unaligned access is "fast" on this hardware. If it is, it selects
a newly provided implementation of memcpy that doesn't work hard at
aligning the source and destination buffers.

Performance numbers were compared using a small test program [1], run on
a D1 Nezha board, which supports fast unaligned access. "Fast" here
means copying unaligned words is faster than copying byte-wise, but
still slower than copying aligned words. Here's the speed of various
memcpy()s with the generic implementation:

memcpy size 1 count 1000000 offset 0 took 109564 us
memcpy size 3 count 1000000 offset 0 took 138425 us
memcpy size 4 count 1000000 offset 0 took 148374 us
memcpy size 7 count 1000000 offset 0 took 178433 us
memcpy size 8 count 1000000 offset 0 took 188430 us
memcpy size f count 1000000 offset 0 took 266118 us
memcpy size f count 1000000 offset 1 took 265940 us
memcpy size f count 1000000 offset 3 took 265934 us
memcpy size f count 1000000 offset 7 took 266215 us
memcpy size f count 1000000 offset 8 took 265954 us
memcpy size f count 1000000 offset 9 took 265886 us
memcpy size 10 count 1000000 offset 0 took 195308 us
memcpy size 11 count 1000000 offset 0 took 205161 us
memcpy size 17 count 1000000 offset 0 took 274376 us
memcpy size 18 count 1000000 offset 0 took 199188 us
memcpy size 19 count 1000000 offset 0 took 209258 us
memcpy size 1f count 1000000 offset 0 took 278263 us
memcpy size 20 count 1000000 offset 0 took 207364 us
memcpy size 21 count 1000000 offset 0 took 217143 us
memcpy size 3f count 1000000 offset 0 took 300023 us
memcpy size 40 count 1000000 offset 0 took 231063 us
memcpy size 41 count 1000000 offset 0 took 241259 us
memcpy size 7c count 100000 offset 0 took 32807 us
memcpy size 7f count 100000 offset 0 took 36274 us
memcpy size ff count 100000 offset 0 took 47818 us
memcpy size ff count 100000 offset 0 took 47932 us
memcpy size 100 count 100000 offset 0 took 40468 us
memcpy size 200 count 100000 offset 0 took 64245 us
memcpy size 27f count 100000 offset 0 took 82549 us
memcpy size 400 count 100000 offset 0 took 111254 us
memcpy size 407 count 100000 offset 0 took 119364 us
memcpy size 800 count 100000 offset 0 took 203899 us
memcpy size 87f count 100000 offset 0 took 222465 us
memcpy size 87f count 100000 offset 3 took 222289 us
memcpy size 1000 count 100000 offset 0 took 388846 us
memcpy size 1000 count 100000 offset 1 took 468827 us
memcpy size 1000 count 100000 offset 3 took 397098 us
memcpy size 1000 count 100000 offset 4 took 397379 us
memcpy size 1000 count 100000 offset 5 took 397368 us
memcpy size 1000 count 100000 offset 7 took 396867 us
memcpy size 1000 count 100000 offset 8 took 389227 us
memcpy size 1000 count 100000 offset 9 took 395949 us
memcpy size 3000 count 50000 offset 0 took 674837 us
memcpy size 3000 count 50000 offset 1 took 676944 us
memcpy size 3000 count 50000 offset 3 took 679709 us
memcpy size 3000 count 50000 offset 4 took 680829 us
memcpy size 3000 count 50000 offset 5 took 678024 us
memcpy size 3000 count 50000 offset 7 took 681097 us
memcpy size 3000 count 50000 offset 8 took 670004 us
memcpy size 3000 count 50000 offset 9 took 674553 us

Here is that same test run with the assembly memcpy() in this series:

memcpy size 1 count 1000000 offset 0 took 92703 us
memcpy size 3 count 1000000 offset 0 took 112527 us
memcpy size 4 count 1000000 offset 0 took 120481 us
memcpy size 7 count 1000000 offset 0 took 149558 us
memcpy size 8 count 1000000 offset 0 took 90617 us
memcpy size f count 1000000 offset 0 took 174373 us
memcpy size f count 1000000 offset 1 took 178615 us
memcpy size f count 1000000 offset 3 took 178845 us
memcpy size f count 1000000 offset 7 took 178636 us
memcpy size f count 1000000 offset 8 took 174442 us
memcpy size f count 1000000 offset 9 took 178660 us
memcpy size 10 count 1000000 offset 0 took 99845 us
memcpy size 11 count 1000000 offset 0 took 112522 us
memcpy size 17 count 1000000 offset 0 took 179735 us
memcpy size 18 count 1000000 offset 0 took 110870 us
memcpy size 19 count 1000000 offset 0 took 121472 us
memcpy size 1f count 1000000 offset 0 took 188231 us
memcpy size 20 count 1000000 offset 0 took 119571 us
memcpy size 21 count 1000000 offset 0 took 132429 us
memcpy size 3f count 1000000 offset 0 took 227021 us
memcpy size 40 count 1000000 offset 0 took 166416 us
memcpy size 41 count 1000000 offset 0 took 180206 us
memcpy size 7c count 100000 offset 0 took 28602 us
memcpy size 7f count 100000 offset 0 took 31676 us
memcpy size ff count 100000 offset 0 took 39257 us
memcpy size ff count 100000 offset 0 took 39176 us
memcpy size 100 count 100000 offset 0 took 21928 us
memcpy size 200 count 100000 offset 0 took 35814 us
memcpy size 27f count 100000 offset 0 took 60315 us
memcpy size 400 count 100000 offset 0 took 63652 us
memcpy size 407 count 100000 offset 0 took 73160 us
memcpy size 800 count 100000 offset 0 took 121532 us
memcpy size 87f count 100000 offset 0 took 147269 us
memcpy size 87f count 100000 offset 3 took 144744 us
memcpy size 1000 count 100000 offset 0 took 232057 us
memcpy size 1000 count 100000 offset 1 took 254319 us
memcpy size 1000 count 100000 offset 3 took 256973 us
memcpy size 1000 count 100000 offset 4 took 257655 us
memcpy size 1000 count 100000 offset 5 took 259456 us
memcpy size 1000 count 100000 offset 7 took 260849 us
memcpy size 1000 count 100000 offset 8 took 232347 us
memcpy size 1000 count 100000 offset 9 took 254330 us
memcpy size 3000 count 50000 offset 0 took 382376 us
memcpy size 3000 count 50000 offset 1 took 389872 us
memcpy size 3000 count 50000 offset 3 took 385310 us
memcpy size 3000 count 50000 offset 4 took 389748 us
memcpy size 3000 count 50000 offset 5 took 391707 us
memcpy size 3000 count 50000 offset 7 took 386778 us
memcpy size 3000 count 50000 offset 8 took 385691 us
memcpy size 3000 count 50000 offset 9 took 392030 us

The assembly routine is measurably better: at offset 0 it takes about
15% less time for size 1 (92703 us vs. 109564 us) and roughly 40% less
time for sizes 0x1000 (232057 us vs. 388846 us) and 0x3000 (382376 us
vs. 674837 us).

v5 of the Linux series can be found at [2]. v6 will be out momentarily,
and will be compatible with this iteration of this series.

[1] https://pastebin.com/DRyECNQW
[2] https://lore.kernel.org/lkml/20230327163203.2918455-1-evan@rivosinc.com/

Changes in v3:
 - Update argument types to match v4 kernel interface
 - Add the "return" to the vsyscall
 - Fix up vdso arg types to match kernel v4 version
 - Remove ifdef around INLINE_VSYSCALL (Adhemerval)
 - Word align dest for large memcpy()s.
 - Add tags
 - Remove spurious blank line from sysdeps/riscv/memcpy.c

Changes in v2:
 - hwprobe.h: Use __has_include and duplicate Linux content to make
   compilation work when Linux headers are absent (Adhemerval)
 - hwprobe.h: Put declaration under __USE_GNU (Adhemerval)
 - Use INLINE_SYSCALL_CALL (Adhemerval)
 - Update versions
 - Update UNALIGNED_MASK to match kernel v3 series.
 - Add vDSO interface
 - Used _MASK instead of _FAST value itself.
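
For readers who want a feel for the selection logic described in this
letter, here is a minimal, portable sketch. It is not the code in
sysdeps/riscv/memcpy.c. The two HWPROBE_* constants are local stand-ins
that, to the best of my understanding, mirror the misaligned-access
field of the kernel's hwprobe interface; treat the values as an
assumption. The function names are hypothetical, and both "variants"
simply forward to memcpy so the sketch builds on any host. The probe
result is passed in as a plain value rather than obtained from the
syscall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values for the misaligned-access field; local names only.  */
#define HWPROBE_MISALIGNED_FAST (3 << 0)
#define HWPROBE_MISALIGNED_MASK (7 << 0)

typedef void *(*memcpy_fn) (void *, const void *, size_t);

/* Hypothetical stand-in for the generic, alignment-careful memcpy.  */
void *
memcpy_generic_sketch (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Hypothetical stand-in for the alignment-ignorant assembly memcpy.  */
void *
memcpy_noalignment_sketch (void *dst, const void *src, size_t n)
{
  return memcpy (dst, src, n);
}

/* Pick a variant from a probe result.  probe_ok is nonzero when the
   probe succeeded; value is the reported performance word.  The field
   is masked before the comparison so that unrelated bits in the word
   cannot affect the decision.  */
memcpy_fn
select_memcpy (int probe_ok, uint64_t value)
{
  if (probe_ok
      && (value & HWPROBE_MISALIGNED_MASK) == HWPROBE_MISALIGNED_FAST)
    return memcpy_noalignment_sketch;

  /* Fall back to the generic routine on failure or any other state.  */
  return memcpy_generic_sketch;
}

/* Copy through whichever variant the selector picked.  */
void *
copy_via_selector (int probe_ok, uint64_t value,
                   void *dst, const void *src, size_t n)
{
  memcpy_fn fn = select_memcpy (probe_ok, value);
  assert (fn != NULL);
  return fn (dst, src, n);
}
```

A real ifunc resolver would run once at relocation time and return the
chosen function pointer; the sketch keeps the decision in an ordinary
function so it can be exercised directly.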

Evan Green (3):
  riscv: Add Linux hwprobe syscall support
  riscv: Add hwprobe vdso call support
  riscv: Add and use alignment-ignorant memcpy

 sysdeps/riscv/memcopy.h                       |  28 ++++
 sysdeps/riscv/memcpy.c                        |  64 +++++++++
 sysdeps/riscv/memcpy_noalignment.S            | 121 ++++++++++++++++++
 sysdeps/unix/sysv/linux/dl-vdso-setup.c       |  10 ++
 sysdeps/unix/sysv/linux/dl-vdso-setup.h       |   3 +
 sysdeps/unix/sysv/linux/riscv/Makefile        |   8 +-
 sysdeps/unix/sysv/linux/riscv/Versions        |   3 +
 sysdeps/unix/sysv/linux/riscv/hwprobe.c       |  31 +++++
 .../unix/sysv/linux/riscv/memcpy-generic.c    |  24 ++++
 .../unix/sysv/linux/riscv/rv32/arch-syscall.h |   1 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   1 +
 .../unix/sysv/linux/riscv/rv64/arch-syscall.h |   1 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   1 +
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h   |  68 ++++++++++
 sysdeps/unix/sysv/linux/riscv/sysdep.h        |   1 +
 sysdeps/unix/sysv/linux/syscall-names.list    |   1 +
 16 files changed, 364 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/memcopy.h
 create mode 100644 sysdeps/riscv/memcpy.c
 create mode 100644 sysdeps/riscv/memcpy_noalignment.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hwprobe.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/memcpy-generic.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
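
The timing lines in this letter have the shape "memcpy size S count C
offset O took T us". The actual test program lives at [1] and is not
reproduced here. The following is only a guess at an equivalent
harness, for anyone who wants comparable numbers on other hardware. The
function names are hypothetical, and applying the offset to both
buffers is an assumption, as is printing the offset in decimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Time COUNT copies of SIZE bytes, with both buffers shifted by OFFSET
   bytes, and return the elapsed wall time in microseconds.  */
long
time_memcpy_us (size_t size, unsigned int count, size_t offset)
{
  static unsigned char src[0x4000], dst[0x4000];
  struct timespec start, end;

  /* The largest case in the letter is size 0x3000 at offset 9.  */
  assert (offset + size <= sizeof src);
  memset (src, 0xa5, sizeof src);

  clock_gettime (CLOCK_MONOTONIC, &start);
  for (unsigned int i = 0; i < count; i++)
    {
      memcpy (dst + offset, src + offset, size);
      /* Compiler barrier so the copies are not optimized away.  */
      __asm__ volatile ("" : : "r" (dst) : "memory");
    }
  clock_gettime (CLOCK_MONOTONIC, &end);

  return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_nsec - start.tv_nsec) / 1000L;
}

/* Format one result line in the style used in the letter: size in hex,
   count and offset in decimal.  */
int
format_result (char *buf, size_t len, size_t size, unsigned int count,
               size_t offset, long us)
{
  return snprintf (buf, len,
                   "memcpy size %zx count %u offset %zu took %ld us",
                   size, count, offset, us);
}
```

Calling time_memcpy_us (0x1000, 100000, 1) and passing the result to
format_result would produce a line comparable to the "size 1000 offset
1" rows, though absolute numbers depend entirely on the machine and on
which memcpy the program resolves to.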