Message ID | 20220413202401.408267-1-adhemerval.zanella@linaro.org |
---|---|
Headers | To: libc-alpha@sourceware.org Subject: [PATCH 0/7] Add arc4random support Date: Wed, 13 Apr 2022 17:23:54 -0300 From: Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Adhemerval Zanella <adhemerval.zanella@linaro.org> |
Series | Add arc4random support |
Message
Adhemerval Zanella
April 13, 2022, 8:23 p.m. UTC
This patch adds arc4random, arc4random_buf, and arc4random_uniform, along with optimized versions for x86_64, aarch64, and powerpc64.

The generic implementation is based on scalar ChaCha20, with a global cache and locking. It uses getrandom, with /dev/urandom as a fallback, to get the initial entropy, and reseeds the internal state after every 16MB of consumed entropy.

It maintains an internal buffer which consumes at most one page on most systems (assuming 4k pages). The internal buffer amortizes the cipher encrypt calls across arc4random calls (where both the function call and the lock cost are the dominating factors).

Fork detection is done by checking whether MADV_WIPEONFORK is supported. If it is not, the fork callback resets the state on the fork call. It does not handle direct clone calls, nor vfork or _Fork (arc4random is not async-signal-safe due to the internal lock usage, although the implementation does try to handle fork cases).

The generic ChaCha20 implementation is based on RFC8439 [1], which is a simple memcpy with xor implementation. The optimized ones for x86_64, aarch64, and powerpc64 use vectorized instructions and are based on libgcrypt code.

This patchset differs from the previous ones by using a much simpler scheme of fork detection (there is no attempt to use a global shared counter to detect direct clone usage), and by using ChaCha20 instead of AES.

ChaCha20 is used because it is the standard cipher in different arc4random implementations (BSDs, MacOSX), and recently in the Linux random subsystem. It is also a much simpler implementation than AES and shows better performance when no specialized instructions are present.

One possible improvement, not implemented in this patchset, is to use a per-thread cache, since on some architectures the lock cost is somewhat high.
Ideally it would reside in the TCB, to avoid requiring tuning of the static TLS size, and it would work similarly to the malloc tcache, where arc4random would initially consume any thread-local entropy and thus avoid any locking.

[1] https://sourceware.org/pipermail/libc-alpha/2018-June/094879.html

Adhemerval Zanella (7):
  stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417)
  stdlib: Add arc4random tests
  benchtests: Add arc4random benchtest
  x86: Add SSSE3 optimized chacha20
  x86: Add AVX2 optimized chacha20
  aarch64: Add optimized chacha20
  powerpc64: Add optimized chacha20

 LICENSES | 21 ++
 NEWS | 4 +-
 benchtests/Makefile | 6 +-
 benchtests/bench-arc4random.c | 243 ++++++++++++
 include/stdlib.h | 13 +
 posix/fork.c | 2 +
 stdlib/Makefile | 6 +
 stdlib/Versions | 5 +
 stdlib/arc4random.c | 242 ++++++++++++
 stdlib/arc4random_uniform.c | 152 ++++++++
 stdlib/chacha20.c | 214 +++++++++++
 stdlib/stdlib.h | 14 +
 stdlib/tst-arc4random-chacha20.c | 225 +++++++++++
 stdlib/tst-arc4random-fork.c | 174 +++++++++
 stdlib/tst-arc4random-stats.c | 146 +++++++
 stdlib/tst-arc4random-thread.c | 278 ++++++++++++++
 sysdeps/aarch64/Makefile | 4 +
 sysdeps/aarch64/chacha20.S | 357 ++++++++++++++++++
 sysdeps/aarch64/chacha20_arch.h | 43 +++
 sysdeps/generic/chacha20_arch.h | 24 ++
 sysdeps/generic/not-cancel.h | 2 +
 sysdeps/mach/hurd/i386/libc.abilist | 3 +
 sysdeps/mach/hurd/not-cancel.h | 3 +
 sysdeps/powerpc/powerpc64/Makefile | 3 +
 sysdeps/powerpc/powerpc64/chacha-ppc.c | 254 +++++++++++++
 sysdeps/powerpc/powerpc64/chacha20_arch.h | 53 +++
 sysdeps/unix/sysv/linux/aarch64/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/arc/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/arm/be/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/arm/le/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/csky/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/hppa/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/i386/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist | 3 +
 .../sysv/linux/m68k/coldfire/libc.abilist | 3 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist | 3 +
 .../sysv/linux/microblaze/be/libc.abilist | 3 +
 .../sysv/linux/microblaze/le/libc.abilist | 3 +
 .../sysv/linux/mips/mips32/fpu/libc.abilist | 3 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist | 3 +
 .../sysv/linux/mips/mips64/n32/libc.abilist | 3 +
 .../sysv/linux/mips/mips64/n64/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/nios2/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/not-cancel.h | 7 +
 sysdeps/unix/sysv/linux/or1k/libc.abilist | 3 +
 .../linux/powerpc/powerpc32/fpu/libc.abilist | 3 +
 .../powerpc/powerpc32/nofpu/libc.abilist | 3 +
 .../linux/powerpc/powerpc64/be/libc.abilist | 3 +
 .../linux/powerpc/powerpc64/le/libc.abilist | 3 +
 .../unix/sysv/linux/riscv/rv32/libc.abilist | 3 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist | 3 +
 .../unix/sysv/linux/s390/s390-32/libc.abilist | 3 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/sh/be/libc.abilist | 3 +
 sysdeps/unix/sysv/linux/sh/le/libc.abilist | 3 +
 .../sysv/linux/sparc/sparc32/libc.abilist | 3 +
 .../sysv/linux/sparc/sparc64/libc.abilist | 3 +
 .../unix/sysv/linux/x86_64/64/libc.abilist | 3 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist | 3 +
 sysdeps/x86_64/Makefile | 7 +
 sysdeps/x86_64/chacha20-avx2.S | 317 ++++++++++++++++
 sysdeps/x86_64/chacha20-ssse3.S | 330 ++++++++++++++++
 sysdeps/x86_64/chacha20_arch.h | 56 +++
 64 files changed, 3305 insertions(+), 2 deletions(-)
 create mode 100644 benchtests/bench-arc4random.c
 create mode 100644 stdlib/arc4random.c
 create mode 100644 stdlib/arc4random_uniform.c
 create mode 100644 stdlib/chacha20.c
 create mode 100644 stdlib/tst-arc4random-chacha20.c
 create mode 100644 stdlib/tst-arc4random-fork.c
 create mode 100644 stdlib/tst-arc4random-stats.c
 create mode 100644 stdlib/tst-arc4random-thread.c
 create mode 100644 sysdeps/aarch64/chacha20.S
 create mode 100644 sysdeps/aarch64/chacha20_arch.h
 create mode 100644 sysdeps/generic/chacha20_arch.h
 create mode 100644 sysdeps/powerpc/powerpc64/chacha-ppc.c
 create mode 100644 sysdeps/powerpc/powerpc64/chacha20_arch.h
 create mode 100644 sysdeps/x86_64/chacha20-avx2.S
 create mode 100644 sysdeps/x86_64/chacha20-ssse3.S
 create mode 100644 sysdeps/x86_64/chacha20_arch.h
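The fork-detection scheme described in the cover letter can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: `struct rng_state`, `rng_state_init`, `rng_needs_seed`, and `child_needs_seed` are hypothetical names, and the sketch assumes a Linux system where `mmap`, `madvise`, and `pthread_atfork` are available. The state lives in its own page; if `MADV_WIPEONFORK` is accepted, the kernel zeroes that page in the child, otherwise a fork callback zeroes it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical generator state; the layout in the patch may differ.  */
struct rng_state
{
  unsigned char key[32];
  bool seeded;          /* Reads as false again once the page is zeroed.  */
};

static struct rng_state *state;

/* Fallback path: zero the state by hand in the child after fork.  */
static void
reset_in_child (void)
{
  if (state != NULL)
    memset (state, 0, sizeof *state);
}

/* Map one page for the state and arrange for it to be zeroed in every
   child created by fork.  Returns 0 on success, -1 on failure.  */
int
rng_state_init (void)
{
  long pagesize = sysconf (_SC_PAGESIZE);
  if (pagesize < (long) sizeof (struct rng_state))
    return -1;

  void *p = mmap (NULL, (size_t) pagesize, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return -1;
  state = p;

#ifdef MADV_WIPEONFORK
  /* The kernel zeroes the page in the child; no callback is needed.  */
  if (madvise (p, (size_t) pagesize, MADV_WIPEONFORK) == 0)
    return 0;
#endif
  /* Headers or kernel lack MADV_WIPEONFORK: register a fork callback.  */
  return pthread_atfork (NULL, NULL, reset_in_child) == 0 ? 0 : -1;
}

/* True when the state has to be seeded before use.  */
bool
rng_needs_seed (void)
{
  assert (state != NULL);
  return !state->seeded;
}

/* Demo helper: fork and report whether the child saw a wiped state.
   Returns 1 if it did, 0 if it did not, -1 on error.  */
int
child_needs_seed (void)
{
  pid_t pid = fork ();
  if (pid < 0)
    return -1;
  if (pid == 0)
    _exit (rng_needs_seed () ? 0 : 1);

  int status;
  if (waitpid (pid, &status, 0) != pid || !WIFEXITED (status))
    return -1;
  return WEXITSTATUS (status) == 0 ? 1 : 0;
}
```

In this sketch the parent keeps its seeded state while the child is forced to reseed. The callback path only runs for fork, which matches the limitation stated above for direct clone calls, vfork, and _Fork.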
Comments
Hi,

On 13/04/2022 at 22:23, Adhemerval Zanella via Libc-alpha wrote:
> The generic ChaCha20 implementation is based on the RFC8439 [1], which
> is a simple memcpy with xor implementation.

The xor (with 0) is a waste of CPU cycles, as the ChaCha20 keystream is the PRNG output.

Regards.
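To make the point about the xor concrete, here is a minimal scalar sketch of the ChaCha20 block function following RFC 8439 sections 2.1 and 2.3. It is not the code from the patch; `chacha20_block` and `chacha20_xor_block` are hypothetical names chosen for illustration. Encrypting an all-zero buffer with the cipher-style variant yields exactly the bytes the keystream-only variant writes directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* ChaCha quarter round, RFC 8439 section 2.1.  */
#define QR(a, b, c, d)                        \
  do {                                        \
    a += b; d ^= a; d = ROTL32 (d, 16);       \
    c += d; b ^= c; b = ROTL32 (b, 12);       \
    a += b; d ^= a; d = ROTL32 (d, 8);        \
    c += d; b ^= c; b = ROTL32 (b, 7);        \
  } while (0)

static uint32_t
load32_le (const uint8_t *p)
{
  return (uint32_t) p[0] | (uint32_t) p[1] << 8
         | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
}

/* Keystream-only variant: write one 64-byte block straight to OUT.  */
void
chacha20_block (uint8_t out[64], const uint8_t key[32], uint32_t counter,
                const uint8_t nonce[12])
{
  /* "expand 32-byte k" constants; the remaining words are filled below.  */
  uint32_t init[16] = { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
  uint32_t x[16];

  for (int i = 0; i < 8; i++)
    init[4 + i] = load32_le (key + 4 * i);
  init[12] = counter;
  for (int i = 0; i < 3; i++)
    init[13 + i] = load32_le (nonce + 4 * i);

  memcpy (x, init, sizeof x);
  for (int i = 0; i < 10; i++)
    {
      /* Column rounds.  */
      QR (x[0], x[4], x[8], x[12]);
      QR (x[1], x[5], x[9], x[13]);
      QR (x[2], x[6], x[10], x[14]);
      QR (x[3], x[7], x[11], x[15]);
      /* Diagonal rounds.  */
      QR (x[0], x[5], x[10], x[15]);
      QR (x[1], x[6], x[11], x[12]);
      QR (x[2], x[7], x[8], x[13]);
      QR (x[3], x[4], x[9], x[14]);
    }

  /* Add the input state and serialize as little-endian words.  */
  for (int i = 0; i < 16; i++)
    {
      uint32_t v = x[i] + init[i];
      out[4 * i] = (uint8_t) v;
      out[4 * i + 1] = (uint8_t) (v >> 8);
      out[4 * i + 2] = (uint8_t) (v >> 16);
      out[4 * i + 3] = (uint8_t) (v >> 24);
    }
}

/* Cipher-style variant: OUT = IN xor keystream.  */
void
chacha20_xor_block (uint8_t out[64], const uint8_t in[64],
                    const uint8_t key[32], uint32_t counter,
                    const uint8_t nonce[12])
{
  uint8_t ks[64];
  chacha20_block (ks, key, counter, nonce);
  for (size_t i = 0; i < sizeof ks; i++)
    out[i] = in[i] ^ ks[i];
}
```

When the input is all zeros, the extra pass in `chacha20_xor_block` changes nothing in the output; it only costs an additional buffer and loop.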
If this interface is going to be added, GNU extensions of arc4random and arc4random_uniform that return uint64_t would be extremely cool. Even cooler if there is no global state.
On 14/04/2022 04:36, Yann Droneaud wrote:
> On 13/04/2022 at 22:23, Adhemerval Zanella via Libc-alpha wrote:
>> The generic ChaCha20 implementation is based on the RFC8439 [1], which
>> is a simple memcpy with xor implementation.
>
> The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is the PRNG output.

I don't have a strong feeling about it, although it seems that every other ChaCha20 implementation I have checked does it (libgcrypt, Linux, BSD). The BSDs also do it for arc4random, although most if not all of those come from OpenBSD, and they are usually paranoid about security hardening.

I am no security expert, so I will keep it as is for the generic interface (the arch optimizations also do it, so I think it might be a good idea to keep the implementations with similar semantics).
On Thu, Apr 14, 2022 at 1:39 PM Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org> wrote:
> On 14/04/2022 04:36, Yann Droneaud wrote:
> > The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is the PRNG output.
>
> I am no security expert, so I will keep it as is for generic interface
> (also the arch optimization also does it, so I think it might be a
> good idea to keep the implementation with similar semantic).

Does the arc4random use case require the xor zeroing, though? I think it would be a mistake to guarantee it, as it seems like a pretty reasonable thing to want to optimize out if we need better performance.
On 14/04/2022 08:49, Cristian Rodríguez wrote:
> If this interface is gonna added, GNU extensions that return uint64_t
> of arc4random and arc4random_uniform will be extremely cool.
> Even cooler if there is no global state.

I don't think adding a uint64_t interface for arc4random would improve much, especially because a simple wrapper using arc4random_buf should suffice. It would also require portable code to handle another GNU extension over a BSD-defined interface that is present on multiple systems. Also, performance-wise, I don't think it would be much different from arc4random_buf. It makes some sense for arc4random_uniform, but I don't have a strong opinion.

The global state adds some hardening by 'slicing up the stream', since multiple consumers getting different pieces adds backtracking and prediction resistance. Theo de Raadt explains a bit why OpenBSD added this concept to its arc4random implementation [1] (check around minute 26). As he puts it, there is no formal proof, but I agree that the ideas are reasonable.

Also, not using a global state means we would need to add a per-thread or per-CPU state, which is at least one page (due to MADV_WIPEONFORK). The per-CPU state is only really possible on newer Linux kernels that support rseq. We might just not care about MADV_WIPEONFORK and use a malloc buffer, which would be reset by the internal atfork handler.

[1] https://www.youtube.com/watch?v=gp_90-3R0pE
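The "simple wrapper using arc4random_buf" mentioned above could look roughly like this. It is a sketch with hypothetical names (`fill_fn`, `random_u64`, `uniform_u64`, `demo_fill`), not a proposed interface. The byte source is passed in as a function pointer with the same signature as `arc4random_buf`, so the sketch also builds on a libc that does not provide it; `demo_fill` is a deterministic, insecure stand-in used only to exercise the code. The bounded variant uses rejection sampling to avoid modulo bias.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as arc4random_buf (void *buf, size_t nbytes).  */
typedef void (*fill_fn) (void *buf, size_t len);

/* 64-bit value built from the byte-oriented interface.  */
uint64_t
random_u64 (fill_fn fill)
{
  uint64_t v;
  fill (&v, sizeof v);
  return v;
}

/* Uniform value in [0, upper) by rejection sampling.  Hypothetical
   helper, not part of the patch.  */
uint64_t
uniform_u64 (fill_fn fill, uint64_t upper)
{
  if (upper < 2)
    return 0;

  /* 2^64 mod upper, computed in 64-bit unsigned arithmetic.  Values
     below this bound are rejected so that the accepted range is an
     exact multiple of upper.  */
  uint64_t min = -upper % upper;
  for (;;)
    {
      uint64_t v = random_u64 (fill);
      if (v >= min)
        return v % upper;
    }
}

/* Deterministic stand-in for tests only.  NOT random and NOT secure.  */
void
demo_fill (void *buf, size_t len)
{
  static uint64_t s = 1;
  unsigned char *p = buf;
  for (size_t i = 0; i < len; i++)
    {
      s = s * 6364136223846793005ULL + 1442695040888963407ULL;
      p[i] = (unsigned char) (s >> 56);
    }
}
```

On a system whose libc declares `arc4random_buf`, a caller would pass that function instead of `demo_fill`, for example `uniform_u64 (arc4random_buf, 1000)`.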
On Thu, Apr 14, 2022 at 2:26 PM Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org> wrote:
> Also, not using a global state means we will need to add a per-thread or
> per-cpu state which is at least one page (due MADV_WIPEONFORK). The
> per-cpu state is only actually possible on newer Linux kernels that
> support rseq. We might just not care about MADV_WIPEONFORK and use
> a malloc buffer which would be reset by the atfork internal handler.

We could do best-effort per-CPU without rseq (select an arena based on the current CPU) and have a truly optimized version if rseq is supported. Either way, it's likely to be an improvement if this function is hot.
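A best-effort per-CPU selection without rseq might be sketched as below. This is an assumption-laden illustration, not something proposed in code in this thread: `NSLOTS`, `struct rng_slot`, `slot_acquire`, and `slot_release` are hypothetical, and the sketch assumes Linux with glibc's `sched_getcpu`. Because a thread can migrate right after `sched_getcpu` returns, each slot still needs its own lock; the CPU number only spreads contention across slots.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define NSLOTS 4   /* Hypothetical slot count, not taken from the thread.  */

struct rng_slot
{
  pthread_mutex_t lock;
  unsigned char cache[64];   /* Cached generator output would live here.  */
  size_t avail;              /* Unconsumed bytes left in cache.  */
};

#define SLOT_INIT { PTHREAD_MUTEX_INITIALIZER, { 0 }, 0 }
static struct rng_slot slots[NSLOTS] =
  { SLOT_INIT, SLOT_INIT, SLOT_INIT, SLOT_INIT };

/* Lock and return the slot associated with the CPU this thread is
   running on at the time of the call.  */
struct rng_slot *
slot_acquire (void)
{
  int cpu = sched_getcpu ();
  if (cpu < 0)
    cpu = 0;   /* Call not supported: everyone shares slot 0.  */

  struct rng_slot *s = &slots[(size_t) cpu % NSLOTS];
  pthread_mutex_lock (&s->lock);
  return s;
}

/* Unlock a slot obtained from slot_acquire.  */
void
slot_release (struct rng_slot *s)
{
  pthread_mutex_unlock (&s->lock);
}
```

A caller would take bytes from `s->cache` between `slot_acquire` and `slot_release`. With rseq, the lock could in principle be replaced by a restartable sequence, which is the "truly optimized version" the comment refers to.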
Hi,

On 14/04/2022 at 20:39, Adhemerval Zanella wrote:
> On 14/04/2022 04:36, Yann Droneaud wrote:
>> The xor (with 0) is a waste of CPU cycles as the ChaCha20 keystream is the PRNG output.
>
> I don't have a strong feeling about, although it seems that any other
> ChaCha20 implementation I have checked does it (libgcrypt, Linux,
> BSD). The BSD also does it for arc4random, although most if not
> all come from OpenBSD and they are usually paranoid with security
> hardening.

Check #define KEYSTREAM_ONLY:

https://github.com/openbsd/src/blob/master/lib/libc/crypt/arc4random.c#L36
https://github.com/openbsd/src/blob/master/lib/libc/crypt/chacha_private.h#L166

Regards.