Message ID | 55D0BDA7.40402@panix.com |
---|---|
State | Superseded |
Headers |
Subject: Re: [PATCH RFC] explicit_bzero, again
To: GNU C Library <libc-alpha@sourceware.org>
From: Zack Weinberg <zackw@panix.com>
Message-ID: <55D0BDA7.40402@panix.com>
In-Reply-To: <55C7E246.3000006@panix.com>
Date: Sun, 16 Aug 2015 12:43:19 -0400 |
Commit Message
Zack Weinberg
Aug. 16, 2015, 4:43 p.m. UTC
On 08/09/2015 07:29 PM, Zack Weinberg wrote:
> I'd like to resume the discussion of adding explicit_bzero or similar
> to glibc.  This was originally proposed by Nick Mathewson last
> December (https://sourceware.org/ml/libc-alpha/2014-12/msg00506.html)
> and I have prepared an updated patchset, which is attached to this
> message.  One patch adds the function, and the second adds several
> intra-libc uses; I suspect I haven't found all of the places where
> it should be used.

This updated patchset corrects a number of minor errors (thanks,
Joseph); adds tests for the fortification of explicit_bzero; and
revises the new manual section a little.

zw
Comments
I haven't reviewed the patch in detail, but I support the principle of adding this new API. Any other views on that?
On 16-08-2015 13:43, Zack Weinberg wrote:
> On 08/09/2015 07:29 PM, Zack Weinberg wrote:
>> I'd like to resume the discussion of adding explicit_bzero or similar
>> to glibc.  This was originally proposed by Nick Mathewson last
>> December (https://sourceware.org/ml/libc-alpha/2014-12/msg00506.html)
>> and I have prepared an updated patchset, which is attached to this
>> message.
>
> This updated patchset corrects a number of minor errors (thanks,
> Joseph); adds tests for the fortification of explicit_bzero; and revises
> the new manual section a little.

Like Joseph, I support the principle of adding this new API.  Some
comments on the patch below.  It would be better if you sent it inline
instead of as an attachment, and sent one message per patch.

> +#ifdef __USE_MISC
> +# if __GNUC_PREREQ (4, 0) && !defined __clang__ && !defined __cplusplus
> +__STRING_INLINE void
> +__explicit_bzero_constn (void *__s, size_t __n)
> +{
> +  typedef struct {char __x[__n];} __memblk;
> +  memset (__s, 0, __n);
> +  __asm__ ("" : : "m" (*(__memblk __attribute__ ((__may_alias__)) *)__s));
> +}
> +
> +# define explicit_bzero(s, n) \
> +  (__extension__ (__builtin_constant_p (n) && (n) > 0 \
> +                  ? __explicit_bzero_constn (s, n) \
> +                  : explicit_bzero (s, n)))
> +# endif

I noted that the Linux kernel uses the following definition:

#define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")

with the justification that:

 * This version is i.e. to prevent dead stores elimination on @ptr
 * where gcc and llvm may behave differently when otherwise using
 * normal barrier(): while gcc behavior gets along with a normal
 * barrier(), llvm needs an explicit input variable to be assumed
 * clobbered.

Would it be better to use the kernel definition and drop the !clang
check instead?  The same applies to explicit_bzero in
'string/bits/string3.h'.
> +extern void __explicit_bzero_chk (void *__dest, size_t __len, size_t __dstlen);
> +extern void __REDIRECT_NTH (__explicit_bzero_alias,
> +                            (void *__dest, size_t __len),
> +                            explicit_bzero);
> +
> +__fortify_function void
> +__NTH (explicit_bzero (void *__dest, size_t __len))
> +{
> +# if __GNUC_PREREQ (4, 0) && !defined __clang__ && !defined __cplusplus
> +  if (__builtin_constant_p (__len) && __len > 0)
> +    {
> +      (void) __builtin___memset_chk (__dest, '\0', __len, __bos0 (__dest));
> +      typedef struct { char __x[__len]; } __memblk;
> +      __asm__ ("" :: "m" (*(__memblk __attribute__ ((__may_alias__)) *)__dest));
> +    }
> +  else
> +# endif
> +  if (__bos0 (__dest) != (size_t) -1
> +      && (!__builtin_constant_p (__len) || __len > __bos0 (__dest)))
> +    __explicit_bzero_chk (__dest, __len, __bos0 (__dest));
> +  else
> +    __explicit_bzero_alias (__dest, __len);
> +}

Do we want to 'chk' version being use even for non-check?

Also, I see that OpenBSD added a patch to try to disable this
optimization for LTO [1].  Do you think it could be worth adding in
your version as well?

[1] http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/explicit_bzero.c?rev=1.3&content-type=text/x-cvsweb-markup
On Thu, Aug 20, 2015 at 6:24 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
> As Joseph I also support the principle of adding this new API.

I am pleased to hear that.

> Regarding the patch some comments below. It would be better if you
> send it as inline instead of attachment, and send on message for
> each patch.

Regrettably, this is not a thing I can do - I have tried several mail
clients and they all mangle patches if I put them inline.

>> + typedef struct {char __x[__n];} __memblk;
>> + memset (__s, 0, __n);
>> + __asm__ ("" : : "m" (*(__memblk __attribute__ ((__may_alias__)) *)__s));
>
> I noted that Linux kernel uses the following definition:
>
> #define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")

This is a much more aggressive optimization barrier than is necessary.
A "memory" clobber tells the compiler that it can assume _nothing_
about what the asm statement does to memory -- so it has to discard
_all_ its knowledge of what's in memory where.  For explicit_bzero we
don't need to invalidate _anything_ - we just need to tell the
compiler that certain stores may not be eliminated even though they
appear to be dead.  When we can't do that precisely, leaving
explicit_bzero as an out-of-line function call should be more
efficient than a whole-memory clobber.

I intend to start a conversation with the GCC and LLVM people about
adding an intrinsic that does exactly what is needed in this context,
but I do not think that should hold up the patch.

(I am not qualified to speculate on whether the kernel is also being
too aggressive; they might be using this for completely different
stuff.)

> Do we want to 'chk' version being use even for non-check?

I do not understand this sentence.

> Also I see that OpenBSD added a patch to try disable this option for LTO [1].
> Do you think could it be worth to add as well in your version?
I could be wrong about this, but I am fairly sure that neither GCC nor
LLVM is capable of LTO-optimizing through a call to a shared library,
therefore this additional complication should not be necessary.

zw
On 08/16/2015 06:43 PM, Zack Weinberg wrote:
> +@strong{Warning:} The compiler is free to make additional copies of
> +any object, or parts of it, in temporary storage areas (such as
> +registers and ``scratch'' stack space). @code{explicit_bzero} does
> +not guarantee that temporary copies of sensitive data are destroyed.

Perhaps you should add that explicit_bzero can create the copy which it
is about to overwrite, leaving the original untouched.  A partial
countermeasure could be a barrier with register clobbers for as many
caller-saved registers as possible.
On Fri, Aug 21, 2015 at 2:37 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 08/16/2015 06:43 PM, Zack Weinberg wrote:
>> +@strong{Warning:} The compiler is free to make additional copies of
>> +any object, or parts of it, in temporary storage areas (such as
>> +registers and ``scratch'' stack space). @code{explicit_bzero} does
>> +not guarantee that temporary copies of sensitive data are destroyed.
>
> Perhaps you should add that explicit_bzero can create the copy which it
> is about to overwrite, leaving the original untouched. A partial
> countermeasure could be a barrier with register clobbers for as many
> caller-saved registers as possible.

I'm not doubting you, but that's surprising enough that if I am going
to put it in the manual, it needs an example of the situation where it
can happen, and I am failing to think of one; what do you have in
mind?

Also, are you saying that this barrier should be part of
explicit_bzero itself, or something the application needs to do?

zw
On 08/21/2015 04:24 PM, Zack Weinberg wrote:
> On Fri, Aug 21, 2015 at 2:37 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> Perhaps you should add that explicit_bzero can create the copy which it
>> is about to overwrite, leaving the original untouched. A partial
>> countermeasure could be a barrier with register clobbers for as many
>> caller-saved registers as possible.
>
> I'm not doubting you, but that's surprising enough that if I am going
> to put it in the manual, it needs an example of the situation where it
> can happen, and I am failing to think of one; what do you have in
> mind?

Oh, if you find it surprising, it is certainly worth documenting.
Here's an example:

<https://gcc.gnu.org/ml/gcc-help/2014-10/msg00071.html>

Any variable which is not already addressable for another reason is
likely to trigger such behavior.

> Also, are you saying that this barrier should be part of
> explicit_bzero itself, or something the application needs to do?

Calling explicit_bzero should take care of that, I think.  I'm not
completely sure how to achieve that.  It might be sufficient to put the
barrier into the implementation, or it might have to be in a wrapper in
a header file.
On 20-08-2015 23:49, Zack Weinberg wrote:
> On Thu, Aug 20, 2015 at 6:24 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> Regarding the patch some comments below. It would be better if you
>> send it as inline instead of attachment, and send on message for
>> each patch.
>
> Regrettably, this is not a thing I can do - I have tried several mail
> clients and they all mangle patches if I put them inline.

You can do this easily with Thunderbird (I use it) with some
configuration.

> This is a much more aggressive optimization barrier than is necessary.
> A "memory" clobber tells the compiler that it can assume _nothing_
> about what the asm statement does to memory -- so it has to discard
> _all_ its knowledge of what's in memory where.
...
> When we can't do that precisely, leaving
> explicit_bzero as an out-of-line function call should be more
> efficient than a whole-memory clobber.

Right, but the way you are proposing limits this API to only C
programs and only GNU compilers.
Using the kernel's whole-memory clobber avoids the '__memblk' trick
(and thus compiles for C++ as well) and also guarantees it works on
other C compilers (clang).

>> Do we want to 'chk' version being use even for non-check?
>
> I do not understand this sentence.

My mistake here.

>> Also I see that OpenBSD added a patch to try disable this option for LTO [1].
>> Do you think could it be worth to add as well in your version?
>
> I could be wrong about this, but I am fairly sure that neither GCC nor
> LLVM is capable of LTO-optimizing through a call to a shared library,
> therefore this additional complication should not be necessary.

I agree; I was only curious whether you knew the background of this
change.
On Wed, Aug 26, 2015 at 5:06 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> On 20-08-2015 23:49, Zack Weinberg wrote:
>> This is a much more aggressive optimization barrier than is necessary.
...
> Right, but this way you are proposing limits this API to only C programs
> and only for GNU compilers. Using kernel way to use the whole-memory
> clobber avoids the '__memblk' (and thus compiles for C++ as well) and also
> guarantees it works on other C compilers (clang).

The *optimization* (replacing `explicit_bzero` with `memset` + vacuous
use) is limited to C programs and GCC.  The *API* works just fine
regardless of compiler.  I believe this is sufficient as a starting
point.  As and when appropriate ways to express a vacuous use become
available in other compilers, we can add them.

>>> Also I see that OpenBSD added a patch to try disable this option for LTO [1].
>>> Do you think could it be worth to add as well in your version?
>>
>> I could be wrong about this, but I am fairly sure that neither GCC nor
>> LLVM is capable of LTO-optimizing through a call to a shared library,
>> therefore this additional complication should not be necessary.
>
> I agree, I was only curious if you know the background about this change.

It might be necessary for a C library intended to be linked
statically, and compiled with -flto or equivalent.  That's the only
thing I can think of.  If that means glibc should have something like
this in the !SHARED case I'm willing to add it.

zw
On 26-08-2015 18:13, Zack Weinberg wrote:
> On Wed, Aug 26, 2015 at 5:06 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> Right, but this way you are proposing limits this API to only C programs
>> and only for GNU compilers.
>
> The *optimization* (replacing `explicit_bzero` with `memset` + vacuous
> use) is limited to C programs and GCC.  The *API* works just fine
> regardless of compiler.  I believe this is sufficient as a starting
> point.  As and when appropriate ways to express a vacuous use become
> available in other compilers, we can add them.

Right, but do you know what kind of optimization the 'memory' clobber
prevents that your suggestion allows?  I understand that the 'memory'
clobber is indeed a more restrictive memory barrier, but for most
targets, avoiding a function call where possible is a much bigger gain
than having some memory operations scheduled more restrictively.

>>>> Also I see that OpenBSD added a patch to try disable this option for LTO [1].
>>>> Do you think could it be worth to add as well in your version?
>>>
>>> I could be wrong about this, but I am fairly sure that neither GCC nor
>>> LLVM is capable of LTO-optimizing through a call to a shared library,
>>> therefore this additional complication should not be necessary.
>>
>> I agree, I was only curious if you know the background about this change.
>
> It might be necessary for a C library intended to be linked
> statically, and compiled with -flto or equivalent.  That's the only
> thing I can think of.  If that means glibc should have something like
> this in the !SHARED case I'm willing to add it.
I am not sure either; I will leave it for future iterations (if someone
finds a scenario where it might be an issue).
On Fri, Aug 21, 2015 at 10:29 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 08/21/2015 04:24 PM, Zack Weinberg wrote:
>> On Fri, Aug 21, 2015 at 2:37 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> Perhaps you should add that explicit_bzero can create the copy which it
>>> is about to overwrite, leaving the original untouched.
...
> Oh, if you find it surprising, it is certainly worth documenting.
> Here's an example:
>
> <https://gcc.gnu.org/ml/gcc-help/2014-10/msg00071.html>
>
> Any variable which is not already addressable for another reason is
> likely to trigger such behavior.

Thanks for the example.  I will add this to the manual in the next
revision of the patch.

>>> A partial countermeasure could be a barrier with register clobbers
>>> for as many caller-saved registers as possible.
...
>> are you saying that this barrier should be part of
>> explicit_bzero itself, or something the application needs to do?
>
> Calling explicit_bzero should take care of that, I think.  I'm not
> completely sure how to achieve that.  It might be sufficient to put the
> barrier into the implementation, or it might have to be in a wrapper in
> a header file.

asm ("" ::: /* all call-preserved registers */) won't actually *do*
anything to erase old values.  If they are live, they will get spilled
to the stack (another copy!) and pulled back in when needed.  If they
are dead, the compiler will just shrug and move on.

OK, so we have to actually emit assembly instructions to clear
call-preserved registers.  That could be a substantial deoptimization,
particularly in cases where there are several calls to explicit_bzero
in a row (e.g.
my changes to crypt/) and on architectures with large register files.
For ia64 (is that still supported?) I'm not sure it's *possible* to
write an appropriate assembly insert, because how do I even know how
many registers are live in the current call frame?

I could see adding a *separate* macro (I think it has to be a macro)
like so:

#define clear_register_file() do {              \
    __asm__("xorl %eax, %eax" ::: "eax");       \
    __asm__("xorl %ebx, %ebx" ::: "ebx");       \
    __asm__("xorl %ecx, %ecx" ::: "ecx");       \
    __asm__("xorl %edx, %edx" ::: "edx");       \
    __asm__("xorl %esi, %esi" ::: "esi");       \
    __asm__("xorl %edi, %edi" ::: "edi");       \
    __asm__("xorl %ebp, %ebp" ::: "ebp");       \
    /* ... etc etc FPU and vector registers */  \
  } while (0)

You'd use this once right before returning from a function that
manipulated sensitive data.  It would still be a pretty hefty
deoptimization, and it wouldn't do anything about scratch stack slots.
I actually think scratch stack slots are a bigger concern; I've heard
about a lot more unintended data leaks via memory than via the
register file.

All in all I feel that the Right Thing is to not worry about this
beyond mentioning it in documentation -- again, the situation is not
worse than what people are hand-rolling right now -- and put the
effort into adding compiler intrinsics that will let us do the job
_properly_.

zw
On Wed, Aug 26, 2015 at 5:34 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> On 26-08-2015 18:13, Zack Weinberg wrote:
>> The *optimization* (replacing `explicit_bzero` with `memset` + vacuous
>> use) is limited to C programs and GCC.  The *API* works just fine
>> regardless of compiler.
>
> Right, but do you know what kind of optimization the 'memory' clobber
> avoids and your suggestion allows?

That's a good question.  I can tell you one general type of code that
would have a problem...

struct complicated_object
{
  int many, other, fields;
  char sensitive[32];
};

where code temporarily fills in `sensitive` and uses explicit_bzero to
erase it when it's done; the "memory" clobber would force the compiler
to flush all the other fields of `complicated_object` to memory and
reload them afterward.  They don't have to be tied together with a
`struct`, even; the sensitive data could be on the stack while all the
other state being manipulated was on the heap, and it would still have
to flush everything.

What I don't know is what would make for better optimization in
practice, and unfortunately I do not have time to go digging through
existing users of secure-memory-clear functions to figure out where
the tradeoff lies.  And I don't feel that picking one kludge over
another is a great use of anyone's time.  Instead, I want to get
something in with known, reliable semantics (but not optimal and with
documented limits), and then move over to the compiler side to work on
optimization and more thorough sanitization.
I almost didn't bother with the string[23].h optimizations at all, and
now I'm kind of regretting having done them.

zw
On 08/26/2015 05:34 PM, Adhemerval Zanella wrote:
> Right, but do you know what kind of optimization the 'memory'
> clobber avoids and your suggestion allows?

I found some time to look into this in detail, by doing experiments on
LibreSSL, which makes extensive use of explicit_bzero.

There are 77 object files in the LibreSSL 2.2.3-portable release which
contain at least one call to explicit_bzero, either directly or via
the wrapper OPENSSL_cleanse, not counting the definition of
OPENSSL_cleanse itself.  I compiled the entire library three times
against glibc 2.19 (therefore, using LibreSSL's internal definition of
explicit_bzero), using gcc 5.2, with --disable-shared, as follows:

1) "funcall" -- baseline: all internal uses of the wrapper were
   replaced with direct calls to explicit_bzero (using a macro), but
   there were no other changes.

2) "memread" -- add to libressl's compat/string.h my original proposal
   for the bits/string2.h optimization of explicit_bzero, which uses
   the

     asm ("" :: "m" (*(struct { char x[size]; } *)ptr))

   construct.  This is expected to produce better local optimization
   than either (1) or (3) when it works, but degenerates to (1) when
   it doesn't work.

3) "memclob" -- instead of the above optimization, use

     asm ("" :: "r" (ptr) : "memory")
This can be applied unconditionally, but is expected to produce worse
code than (2) in the cases where (2) works.

In the baseline compilation, there are 37 objects that make a call to
memset.  In the "memread" compilation, the number of objects that make
a call to explicit_bzero drops to 38, but the number of objects that
make a call to memset only goes up by one; this is partially because
some of the changed objects already called memset, and partially
because the "memread" optimization works in exactly the cases where
GCC is most likely to replace a call to memset with inline code
(i.e. the size parameter is constant).  In the "memclob" compilation,
there are no surviving calls to explicit_bzero (as expected) and 63
object files call memset.

First off, I may have missed something, but it appears to me that the
"memclob" approach does not replace any *more* explicit_bzero calls
with inline code than "memread" does.  Again, this is expected, since
the "memread" optimization applies to the cases where GCC is most
likely to inline a call to memset.

Now let's look at a typical case where "memread" enables the compiler
to replace memset with inline code: this is _rs_stir in arc4random.c:

     movq    $0, 96(%rax)
     movq    rs(%rip), %rax
     movq    $984, (%rax)
 .L23:
-    movq    %rsp, %rdi
-    movl    $40, %esi
-    call    explicit_bzero
+    movq    $0, (%rsp)
+    movq    $0, 8(%rsp)
+    movq    $0, 16(%rsp)
+    movq    $0, 24(%rsp)
+    movq    $0, 32(%rsp)
     movq    rs(%rip), %rax
     movq    $0, (%rax)

With "memclob" instead, the same code is generated for the actual
memory clear, but we get slightly worse code for the rest of the
function, because it's now considered to require a frame pointer.
(- memread, + memclob; debug information stripped, slightly
descheduled to emphasize the semantic change)

 _rs_stir:
+    pushq   %r12
     pushq   %rbp
     pushq   %rbx
     movl    $40, %esi
-    subq    $56, %rsp
+    subq    $48, %rsp
     movq    %rsp, %rdi
+    movq    %rsp, %rbp
     movq    %fs:40, %rax
     movq    %rax, 40(%rsp)
     xorl    %eax, %eax
     call    getentropy
     cmpl    $-1, %eax
@@
     movq    rsx(%rip), %rdi
     leaq    64(%rdi), %rsi
     movq    %rsi, %rdx
     call    chacha_encrypt_bytes.constprop.2
     movq    rsx(%rip), %rbx
     xorl    %eax, %eax
 .L24:
-    movzbl  (%rsp,%rax), %edx
+    movzbl  0(%rbp,%rax), %edx
     xorb    %dl, 64(%rbx,%rax)
     addq    $1, %rax
     cmpq    $40, %rax
     jne     .L24
     cmpq    $0, rs(%rip)
-    leaq    64(%rbx), %rbp
+    leaq    64(%rbx), %r12
     movq    %rbx, %rdi
     je      .L46
 .L25:
-    movq    %rbp, %rsi
+    movq    %r12, %rsi
     call    chacha_keysetup.isra.0.constprop.3

and so on.

Here's another example where memclob causes slightly worse register
allocation than memread.  In this case, I think it's mostly because
memset takes one more argument than explicit_bzero.  This is
tls_config_set_key_mem in tls/tls_config.c:

 tls_config_set_key_mem:
+    pushq   %r13
     pushq   %r12
     pushq   %rbp
     pushq   %rbx
+    movq    %rdx, %r13
-    movq    %rdx, %r12
+    movq    %rsi, %r12
     movq    %rdi, %rbx
+    subq    $8, %rsp
-    movq    80(%rdi), %rdi
+    movq    80(%rdi), %rbp
-    movq    %rsi, %rbp
-    testq   %rdi, %rdi
+    testq   %rbp, %rbp
     je      .L105
-    movq    88(%rbx), %rsi
+    movq    88(%rdi), %rdx
+    xorl    %esi, %esi
+    movq    %rbp, %rdi
-    call    explicit_bzero
+    call    memset
 .L105:
+    addq    $8, %rsp
     leaq    88(%rbx), %rsi
     leaq    80(%rbx), %rdi
-    movq    %r12, %rcx
+    movq    %r13, %rcx
+    movq    %r12, %rdx
     popq    %rbx
-    movq    %rbp, %rdx
     popq    %rbp
     popq    %r12
+    popq    %r13
     jmp     set_mem

Finally, the large, complicated apps/ca.c reads in part:

    pkey = load_key(bio_err, keyfile, keyform, 0, key, e,
                    "CA private key");
    if (key)
        OPENSSL_cleanse(key, strlen(key));
    if (pkey == NULL) {
        /* load_key() has already printed an appropriate message */
        goto err;
    }

With -memread, GCC decides that the OPENSSL_cleanse operation belongs
out-of-line in .text.unlikely.  With either -memclob or -funcall, this
does not happen.
This is a mis-optimization; I think control can't actually reach this
point with a NULL key.  However, it illustrates how -memread exposes
more information to the compiler and lets it make more sophisticated
decisions.

I've tarred up all three build trees (with intermediate .s files
included) and put them on my website at
https://hacks.owlfolio.org/scratchpad/libressl-ebzero-experiments.tar.xz.
I invite other people to dig into them and discuss the effects of the
various choices.  (Please note that it is 53MB and I can't leave it
there forever due to hosting costs.)

It remains my opinion that the patch as I posted it (with an update to
the manual -- I will get to that in the next few days) is Good Enough
and we should not hold off on making the API available until the ideal
bits/string2.h optimization for it is possible.  I will be sending a
message to the gcc and clang mailing lists in the next few days to
open a discussion of that ideal, but I don't have any plans to
implement it myself.

zw
On 07-09-2015 14:18, Zack Weinberg wrote:
> It remains my opinion that the patch as I posted it (with an update to
> the manual -- I will get to that in the next few days) is Good Enough
> and we should not hold off on making the API available until the ideal
> bits/string2.h optimization for it is possible.  I will be sending a
> message to the gcc and clang mailing lists in the next few days to
> open a discussion of that ideal, but I don't have any plans to
> implement it myself.
>
> zw

Thanks for the detailed analysis; it indeed shows that 'memread' is
the better approach.  In fact I was not really obstructing your patch,
just curious about the 'memread' approach, and I apologize if I gave
the impression I was trying to stall it based on this specific
optimization.
Use explicit_bzero where appropriate

I *believe* these are the only places where memset was being used to
clear buffers containing sensitive data.  The compiler probably
couldn't optimize *all* of them out, but it seems best to change them
all.  The legacy DES implementation wasn't bothering to clear its
buffers, so I added that, mostly for consistency's sake.

[ChangeLog:]
YYYY-MM-DD  Zack Weinberg  <zackw@panix.com>

	* crypt/crypt-entry.c (__crypt_r): Clear key-dependent
	intermediate data before returning, using explicit_bzero.
	* crypt/md5-crypt.c (__md5_crypt_r): Likewise.
	* crypt/sha256-crypt.c (__sha256_crypt_r): Likewise.
	* crypt/sha512-crypt.c (__sha512_crypt_r): Likewise.
---
 crypt/crypt-entry.c  | 11 +++++++++++
 crypt/md5-crypt.c    |  8 ++++----
 crypt/sha256-crypt.c | 14 +++++++-------
 crypt/sha512-crypt.c | 14 +++++++-------
 4 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/crypt/crypt-entry.c b/crypt/crypt-entry.c
index 7e655ba..47db32c 100644
--- a/crypt/crypt-entry.c
+++ b/crypt/crypt-entry.c
@@ -143,6 +143,17 @@ __crypt_r (key, salt, data)
    * And convert back to 6 bit ASCII
    */
   _ufc_output_conversion_r (res[0], res[1], salt, data);
+
+#ifdef _LIBC
+  /*
+   * Erase key-dependent intermediate data.  Data dependent only on
+   * the salt is not considered sensitive.
+   */
+  explicit_bzero (ktab, sizeof (ktab));
+  explicit_bzero (data->keysched, sizeof (data->keysched));
+  explicit_bzero (res, sizeof (res));
+#endif
+
   return data->crypt_3_buf;
 }
 weak_alias (__crypt_r, crypt_r)

diff --git a/crypt/md5-crypt.c b/crypt/md5-crypt.c
index 1b890bc..b95766d 100644
--- a/crypt/md5-crypt.c
+++ b/crypt/md5-crypt.c
@@ -292,13 +292,13 @@ __md5_crypt_r (key, salt, buffer, buflen)
 #ifndef USE_NSS
   __md5_init_ctx (&ctx);
   __md5_finish_ctx (&ctx, alt_result);
-  memset (&ctx, '\0', sizeof (ctx));
-  memset (&alt_ctx, '\0', sizeof (alt_ctx));
+  explicit_bzero (&ctx, sizeof (ctx));
+  explicit_bzero (&alt_ctx, sizeof (alt_ctx));
 #endif
   if (copied_key != NULL)
-    memset (copied_key, '\0', key_len);
+    explicit_bzero (copied_key, key_len);
   if (copied_salt != NULL)
-    memset (copied_salt, '\0', salt_len);
+    explicit_bzero (copied_salt, salt_len);
 
   free (free_key);
   return buffer;

diff --git a/crypt/sha256-crypt.c b/crypt/sha256-crypt.c
index d90e291..189183a 100644
--- a/crypt/sha256-crypt.c
+++ b/crypt/sha256-crypt.c
@@ -375,16 +375,16 @@ __sha256_crypt_r (key, salt, buffer, buflen)
 #ifndef USE_NSS
   __sha256_init_ctx (&ctx);
   __sha256_finish_ctx (&ctx, alt_result);
-  memset (&ctx, '\0', sizeof (ctx));
-  memset (&alt_ctx, '\0', sizeof (alt_ctx));
+  explicit_bzero (&ctx, sizeof (ctx));
+  explicit_bzero (&alt_ctx, sizeof (alt_ctx));
 #endif
-  memset (temp_result, '\0', sizeof (temp_result));
-  memset (p_bytes, '\0', key_len);
-  memset (s_bytes, '\0', salt_len);
+  explicit_bzero (temp_result, sizeof (temp_result));
+  explicit_bzero (p_bytes, key_len);
+  explicit_bzero (s_bytes, salt_len);
   if (copied_key != NULL)
-    memset (copied_key, '\0', key_len);
+    explicit_bzero (copied_key, key_len);
   if (copied_salt != NULL)
-    memset (copied_salt, '\0', salt_len);
+    explicit_bzero (copied_salt, salt_len);
 
   free (free_key);
   free (free_pbytes);

diff --git a/crypt/sha512-crypt.c b/crypt/sha512-crypt.c
index 9c581ab..bbd5e24 100644
--- a/crypt/sha512-crypt.c
+++ b/crypt/sha512-crypt.c
@@ -397,16 +397,16 @@ __sha512_crypt_r (key, salt, buffer, buflen)
 #ifndef USE_NSS
   __sha512_init_ctx (&ctx);
   __sha512_finish_ctx (&ctx, alt_result);
-  memset (&ctx, '\0', sizeof (ctx));
-  memset (&alt_ctx, '\0', sizeof (alt_ctx));
+  explicit_bzero (&ctx, sizeof (ctx));
+  explicit_bzero (&alt_ctx, sizeof (alt_ctx));
 #endif
-  memset (temp_result, '\0', sizeof (temp_result));
-  memset (p_bytes, '\0', key_len);
-  memset (s_bytes, '\0', salt_len);
+  explicit_bzero (temp_result, sizeof (temp_result));
+  explicit_bzero (p_bytes, key_len);
+  explicit_bzero (s_bytes, salt_len);
   if (copied_key != NULL)
-    memset (copied_key, '\0', key_len);
+    explicit_bzero (copied_key, key_len);
   if (copied_salt != NULL)
-    memset (copied_salt, '\0', salt_len);
+    explicit_bzero (copied_salt, salt_len);
 
   free (free_key);
   free (free_pbytes);
-- 
2.5.0