[RFC] explicit_bzero, again

Message ID 55D0BDA7.40402@panix.com
State Superseded

Commit Message

Zack Weinberg Aug. 16, 2015, 4:43 p.m. UTC
  On 08/09/2015 07:29 PM, Zack Weinberg wrote:
> I'd like to resume the discussion of adding explicit_bzero or similar
> to glibc.  This was originally proposed by Nick Mathewson last
> December (https://sourceware.org/ml/libc-alpha/2014-12/msg00506.html)
> and I have prepared an updated patchset, which is attached to this
> message.  One patch adds the function, and the second adds several
> intra-libc uses; I suspect I haven't found all of the places where
> it should be used.

This updated patchset corrects a number of minor errors (thanks,
Joseph); adds tests for the fortification of explicit_bzero; and revises
the new manual section a little.

zw
  

Comments

Joseph Myers Aug. 19, 2015, 1 p.m. UTC | #1
I haven't reviewed the patch in detail, but I support the principle of 
adding this new API.  Any other views on that?
  
Adhemerval Zanella Aug. 20, 2015, 10:24 p.m. UTC | #2
On 16-08-2015 13:43, Zack Weinberg wrote:
> On 08/09/2015 07:29 PM, Zack Weinberg wrote:
>> I'd like to resume the discussion of adding explicit_bzero or similar
>> to glibc.  This was originally proposed by Nick Mathewson last
>> December (https://sourceware.org/ml/libc-alpha/2014-12/msg00506.html)
>> and I have prepared an updated patchset, which is attached to this
>> message.  One patch adds the function, and the second adds several
>> intra-libc uses; I suspect I haven't found all of the places where
>> it should be used.
> 
> This updated patchset corrects a number of minor errors (thanks,
> Joseph); adds tests for the fortification of explicit_bzero; and revises
> the new manual section a little.
> 
> zw
> 

Like Joseph, I also support the principle of adding this new API.  Regarding the
patch, some comments below.  It would be better if you sent it inline instead of
as an attachment, and sent one message per patch.

> +#ifdef __USE_MISC
> +# if __GNUC_PREREQ (4, 0) && !defined __clang__ && !defined __cplusplus
> +__STRING_INLINE void
> +__explicit_bzero_constn (void *__s, size_t __n)
> +{
> +  typedef struct {char __x[__n];} __memblk;
> +  memset (__s, 0, __n);
> +  __asm__ ("" : : "m" (*(__memblk __attribute__ ((__may_alias__)) *)__s));
> +}
> +
> +#  define explicit_bzero(s, n)                          \
> +  (__extension__ (__builtin_constant_p (n) && (n) > 0   \
> +                  ? __explicit_bzero_constn (s, n)      \
> +                  : explicit_bzero (s, n)))
> +# endif

I noted that the Linux kernel uses the following definition:

#define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")

With the justification that:

 * This version is i.e. to prevent dead stores elimination on @ptr
 * where gcc and llvm may behave differently when otherwise using
 * normal barrier(): while gcc behavior gets along with a normal
 * barrier(), llvm needs an explicit input variable to be assumed
 * clobbered.

Would it be better to use the kernel definition and drop the !clang check
instead?  The same applies to explicit_bzero in 'string/bits/string3.h'.
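
For illustration, a minimal sketch (mine, untested, not a posted patch) of
what that variant could look like:

    __STRING_INLINE void
    __explicit_bzero_constn (void *__s, size_t __n)
    {
      memset (__s, 0, __n);
      /* Kernel-style barrier: understood by both gcc and clang, at the
         cost of a whole-memory clobber.  */
      __asm__ __volatile__ ("" : : "r" (__s) : "memory");
    }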

> +extern void __explicit_bzero_chk (void *__dest, size_t __len, size_t __dstlen);
> +extern void __REDIRECT_NTH (__explicit_bzero_alias,
> +                            (void *__dest, size_t __len),
> +                            explicit_bzero);
> +
> +__fortify_function void
> +__NTH (explicit_bzero (void *__dest, size_t __len))
> +{
> +# if __GNUC_PREREQ (4, 0) && !defined __clang__ && !defined __cplusplus
> +  if (__builtin_constant_p (__len) && __len > 0)
> +    {
> +      (void) __builtin___memset_chk (__dest, '\0', __len, __bos0 (__dest));
> +      typedef struct { char __x[__len]; } __memblk;
> +      __asm__ ("" :: "m" (*(__memblk __attribute__ ((__may_alias__)) *)__dest));
> +    }
> +  else
> +# endif
> +  if (__bos0 (__dest) != (size_t) -1
> +      && (!__builtin_constant_p (__len) || __len > __bos0 (__dest)))
> +    __explicit_bzero_chk (__dest, __len, __bos0 (__dest));
> +  else
> +    __explicit_bzero_alias (__dest, __len);
> +}

Do we want to 'chk' version being use even for non-check? Also I see that
OpenBSD added a patch to try to disable this optimization under LTO [1].  Do
you think it could be worth adding in your version as well?

[1] http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/explicit_bzero.c?rev=1.3&content-type=text/x-cvsweb-markup
  
Zack Weinberg Aug. 21, 2015, 2:49 a.m. UTC | #3
On Thu, Aug 20, 2015 at 6:24 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
> Like Joseph, I also support the principle of adding this new API.

I am pleased to hear that.

> Regarding the
> patch, some comments below.  It would be better if you sent it inline instead
> of as an attachment, and sent one message per patch.

Regrettably, this is not a thing I can do - I have tried several mail
clients and they all mangle patches if I put them inline.

>> +  typedef struct {char __x[__n];} __memblk;
>> +  memset (__s, 0, __n);
>> +  __asm__ ("" : : "m" (*(__memblk __attribute__ ((__may_alias__)) *)__s));
>
> I noted that the Linux kernel uses the following definition:
>
> #define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")

This is a much more aggressive optimization barrier than is necessary.
A "memory" clobber tells the compiler that it can assume _nothing_
about what the asm statement does to memory -- so it has to discard
_all_ its knowledge of what's in memory where.  For explicit_bzero we
don't need to invalidate _anything_ - we just need to tell the
compiler that certain stores may not be eliminated even though they
appear to be dead.  When we can't do that precisely, leaving
explicit_bzero as an out-of-line function call should be more
efficient than a whole-memory clobber.
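
To make the contrast concrete, here are the two idioms side by side (a
sketch; `secret' is a stand-in):

    char secret[16];
    /* ... fill and use secret ... */
    memset (secret, 0, sizeof secret);

    /* Precise form (as in the patch): exactly these 16 bytes are marked
       as read, so nothing else the compiler has cached is invalidated.  */
    __asm__ ("" : : "m" (*(char (*)[16]) secret));

    /* Kernel form: the compiler must assume arbitrary memory was
       touched and discard everything it knew.  */
    __asm__ __volatile__ ("" : : "r" (secret) : "memory");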

I intend to start a conversation with the GCC and LLVM people about
adding an intrinsic that does exactly what is needed in this context,
but I do not think that should hold up the patch.

(I am not qualified to speculate on whether the kernel is also being
too aggressive; they might be using this for completely different
stuff.)

> Do we want to 'chk' version being use even for non-check?

I do not understand this sentence.

> Also I see that OpenBSD added a patch to try to disable this optimization under LTO [1].
> Do you think it could be worth adding in your version as well?

I could be wrong about this, but I am fairly sure that neither GCC nor
LLVM is capable of LTO-optimizing through a call to a shared library,
therefore this additional complication should not be necessary.

zw
  
Florian Weimer Aug. 21, 2015, 6:37 a.m. UTC | #4
On 08/16/2015 06:43 PM, Zack Weinberg wrote:
> +@strong{Warning:} The compiler is free to make additional copies of
> +any object, or parts of it, in temporary storage areas (such as
> +registers and ``scratch'' stack space).  @code{explicit_bzero} does
> +not guarantee that temporary copies of sensitive data are destroyed.

Perhaps you should add that explicit_bzero can create the copy which it
is about to overwrite, leaving the original untouched.  A partial
countermeasure could be a barrier with register clobbers for as many
caller-saved registers as possible.
  
Zack Weinberg Aug. 21, 2015, 2:24 p.m. UTC | #5
On Fri, Aug 21, 2015 at 2:37 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 08/16/2015 06:43 PM, Zack Weinberg wrote:
>> +@strong{Warning:} The compiler is free to make additional copies of
>> +any object, or parts of it, in temporary storage areas (such as
>> +registers and ``scratch'' stack space).  @code{explicit_bzero} does
>> +not guarantee that temporary copies of sensitive data are destroyed.
>
> Perhaps you should add that explicit_bzero can create the copy which it
> is about to overwrite, leaving the original untouched.  A partial
> countermeasure could be a barrier with register clobbers for as many
> caller-saved registers as possible.

I'm not doubting you, but that's surprising enough that if I am going
to put it in the manual, it needs an example of the situation where it
can happen, and I am failing to think of one; what do you have in
mind?

Also, are you saying that this barrier should be part of
explicit_bzero itself, or something the application needs to do?

zw
  
Florian Weimer Aug. 21, 2015, 2:29 p.m. UTC | #6
On 08/21/2015 04:24 PM, Zack Weinberg wrote:
> On Fri, Aug 21, 2015 at 2:37 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 08/16/2015 06:43 PM, Zack Weinberg wrote:
>>> +@strong{Warning:} The compiler is free to make additional copies of
>>> +any object, or parts of it, in temporary storage areas (such as
>>> +registers and ``scratch'' stack space).  @code{explicit_bzero} does
>>> +not guarantee that temporary copies of sensitive data are destroyed.
>>
>> Perhaps you should add that explicit_bzero can create the copy which it
>> is about to overwrite, leaving the original untouched.  A partial
>> countermeasure could be a barrier with register clobbers for as many
>> caller-saved registers as possible.
> 
> I'm not doubting you, but that's surprising enough that if I am going
> to put it in the manual, it needs an example of the situation where it
> can happen, and I am failing to think of one; what do you have in
> mind?

Oh, if you find it surprising, it is certainly worth documenting.
Here's an example:

  <https://gcc.gnu.org/ml/gcc-help/2014-10/msg00071.html>

Any variable which is not already addressable for another reason is
likely to trigger such behavior.
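
For instance (a minimal sketch; get_secret and use are hypothetical):

    void f (void)
    {
      long key = get_secret ();   /* may live only in a register */
      use (key);
      /* Taking the address forces a stack slot: the register value is
         copied to the stack, explicit_bzero clears the stack copy, and
         the register copy of the secret can survive.  */
      explicit_bzero (&key, sizeof key);
    }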

> Also, are you saying that this barrier should be part of
> explicit_bzero itself, or something the application needs to do?

Calling explicit_bzero should take care of that, I think.  I'm not
completely sure how to achieve that.  It might be sufficient to put the
barrier into the implementation, or it might have to be in a wrapper in
a header file.
  
Adhemerval Zanella Aug. 26, 2015, 9:06 p.m. UTC | #7
On 20-08-2015 23:49, Zack Weinberg wrote:
> On Thu, Aug 20, 2015 at 6:24 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>>
>> Like Joseph, I also support the principle of adding this new API.
> 
> I am pleased to hear that.
> 
>> Regarding the
>> patch, some comments below.  It would be better if you sent it inline instead
>> of as an attachment, and sent one message per patch.
> 
> Regrettably, this is not a thing I can do - I have tried several mail
> clients and they all mangle patches if I put them inline.

You can do it easily with Thunderbird (I use it), given some configuration.

> 
>>> +  typedef struct {char __x[__n];} __memblk;
>>> +  memset (__s, 0, __n);
>>> +  __asm__ ("" : : "m" (*(__memblk __attribute__ ((__may_alias__)) *)__s));
>>
>> I noted that the Linux kernel uses the following definition:
>>
>> #define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")
> 
> This is a much more aggressive optimization barrier than is necessary.
> A "memory" clobber tells the compiler that it can assume _nothing_
> about what the asm statement does to memory -- so it has to discard
> _all_ its knowledge of what's in memory where.  For explicit_bzero we
> don't need to invalidate _anything_ - we just need to tell the
> compiler that certain stores may not be eliminated even though they
> appear to be dead.  When we can't do that precisely, leaving
> explicit_bzero as an out-of-line function call should be more
> efficient than a whole-memory clobber.
> 
> I intend to start a conversation with the GCC and LLVM people about
> adding an intrinsic that does exactly what is needed in this context,
> but I do not think that should hold up the patch.
> 
> (I am not qualified to speculate on whether the kernel is also being
> too aggressive; they might be using this for completely different
> stuff.)

Right, but the way you are proposing limits this API to C programs and
GNU compilers only.  Using the kernel's whole-memory clobber avoids the
'__memblk' (and thus compiles for C++ as well) and also guarantees it
works with other C compilers (clang).

> 
>> Do we want to 'chk' version being use even for non-check?
> 
> I do not understand this sentence.

My mistake here.

> 
>> Also I see that OpenBSD added a patch to try to disable this optimization under LTO [1].
>> Do you think it could be worth adding in your version as well?
> 
> I could be wrong about this, but I am fairly sure that neither GCC nor
> LLVM is capable of LTO-optimizing through a call to a shared library,
> therefore this additional complication should not be necessary.

I agree, I was only curious if you know the background about this change.

> 
> zw
>
  
Zack Weinberg Aug. 26, 2015, 9:13 p.m. UTC | #8
On Wed, Aug 26, 2015 at 5:06 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> On 20-08-2015 23:49, Zack Weinberg wrote:
>> This is a much more aggressive optimization barrier than is necessary.
...
> Right, but the way you are proposing limits this API to C programs and
> GNU compilers only.  Using the kernel's whole-memory clobber avoids the
> '__memblk' (and thus compiles for C++ as well) and also guarantees it
> works with other C compilers (clang).

The *optimization* (replacing `explicit_bzero` with `memset` + vacuous
use) is limited to C programs and GCC.  The *API* works just fine
regardless of compiler.  I believe this is sufficient as a starting
point.  As and when appropriate ways to express a vacuous use become
available in other compilers, we can add them.

>>> Also I see that OpenBSD added a patch to try to disable this optimization under LTO [1].
>>> Do you think it could be worth adding in your version as well?
>>
>> I could be wrong about this, but I am fairly sure that neither GCC nor
>> LLVM is capable of LTO-optimizing through a call to a shared library,
>> therefore this additional complication should not be necessary.
>
> I agree, I was only curious if you know the background about this change.

It might be necessary for a C library intended to be linked
statically, and compiled with -flto or equivalent.  That's the only
thing I can think of.  If that means glibc should have something like
this in the !SHARED case I'm willing to add it.
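
(For reference, as I read the cited revision, the OpenBSD change amounts
to roughly the following -- an empty weak hook that the optimizer cannot
see through; sketched from memory, so treat the names as approximate:)

    __attribute__ ((weak)) void
    __explicit_bzero_hook (void *buf, size_t len)
    {
    }

    void
    explicit_bzero (void *buf, size_t len)
    {
      memset (buf, 0, len);
      __explicit_bzero_hook (buf, len);
    }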

zw
  
Adhemerval Zanella Aug. 26, 2015, 9:34 p.m. UTC | #9
On 26-08-2015 18:13, Zack Weinberg wrote:
> On Wed, Aug 26, 2015 at 5:06 PM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> On 20-08-2015 23:49, Zack Weinberg wrote:
>>> This is a much more aggressive optimization barrier than is necessary.
> ...
>> Right, but the way you are proposing limits this API to C programs and
>> GNU compilers only.  Using the kernel's whole-memory clobber avoids the
>> '__memblk' (and thus compiles for C++ as well) and also guarantees it
>> works with other C compilers (clang).
> 
> The *optimization* (replacing `explicit_bzero` with `memset` + vacuous
> use) is limited to C programs and GCC.  The *API* works just fine
> regardless of compiler.  I believe this is sufficient as a starting
> point.  As and when appropriate ways to express a vacuous use become
> available in other compilers, we can add them.

Right, but do you know what kind of optimization the 'memory' clobber
prevents but your suggestion allows?  I do understand that the 'memory'
clobber is indeed a more restrictive memory barrier, but on most targets
avoiding a function call where possible is a much bigger gain than having
some memory operations handled with restrictive scheduling.

> 
>>>> Also I see that OpenBSD added a patch to try to disable this optimization under LTO [1].
>>>> Do you think it could be worth adding in your version as well?
>>>
>>> I could be wrong about this, but I am fairly sure that neither GCC nor
>>> LLVM is capable of LTO-optimizing through a call to a shared library,
>>> therefore this additional complication should not be necessary.
>>
>> I agree, I was only curious if you know the background about this change.
> 
> It might be necessary for a C library intended to be linked
> statically, and compiled with -flto or equivalent.  That's the only
> thing I can think of.  If that means glibc should have something like
> this in the !SHARED case I'm willing to add it.

I am not sure either; I will leave it for future iterations (if someone
finds a scenario where it might be an issue).

> 
> zw
>
  
Zack Weinberg Aug. 26, 2015, 9:38 p.m. UTC | #10
On Fri, Aug 21, 2015 at 10:29 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 08/21/2015 04:24 PM, Zack Weinberg wrote:
>> On Fri, Aug 21, 2015 at 2:37 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> On 08/16/2015 06:43 PM, Zack Weinberg wrote:
>>>> +@strong{Warning:} The compiler is free to make additional copies of
>>>> +any object, or parts of it, in temporary storage areas (such as
>>>> +registers and ``scratch'' stack space).  @code{explicit_bzero} does
>>>> +not guarantee that temporary copies of sensitive data are destroyed.
>>>
>>> Perhaps you should add that explicit_bzero can create the copy which it
>>> is about to overwrite, leaving the original untouched.
...
> Oh, if you find it surprising, it is certainly worth documenting.
> Here's an example:
>
>   <https://gcc.gnu.org/ml/gcc-help/2014-10/msg00071.html>
>
> Any variable which is not already addressable for another reason is
> likely to trigger such behavior.

Thanks for the example.  I will add this to the manual in the next
revision of the patch.

>>> A partial
>>> countermeasure could be a barrier with register clobbers for as many
>>> caller-saved registers as possible.
...
>> are you saying that this barrier should be part of
>> explicit_bzero itself, or something the application needs to do?
>
> Calling explicit_bzero should take care of that, I think.  I'm not
> completely sure how to achieve that.  It might be sufficient to put the
> barrier into the implementation, or it might have to be in a wrapper in
> a header file.

    asm ("" ::: /* all call-preserved registers */)

won't actually *do* anything to erase old values.  If they are live,
they will get spilled to the stack (another copy!) and pulled back in
when needed.  If they are dead, the compiler will just shrug and move
on.  OK, so we have to actually emit assembly instructions to clear
call-preserved registers.  That could be a substantial deoptimization,
particularly in cases where there are several calls to explicit_bzero
in a row (e.g. my changes to crypt/), and on architectures with large
register files.  For ia64 (is that still supported?) I'm not sure it's
*possible* to write an appropriate assembly insert because how do I
even know how many registers are live in the current call frame?

I could see adding a *separate* macro (I think it has to be a macro) like so

    #define clear_register_file() do { \
        __asm__ ("xorl %%eax, %%eax" ::: "eax"); \
        __asm__ ("xorl %%ebx, %%ebx" ::: "ebx"); \
        __asm__ ("xorl %%ecx, %%ecx" ::: "ecx"); \
        __asm__ ("xorl %%edx, %%edx" ::: "edx"); \
        __asm__ ("xorl %%esi, %%esi" ::: "esi"); \
        __asm__ ("xorl %%edi, %%edi" ::: "edi"); \
        __asm__ ("xorl %%ebp, %%ebp" ::: "ebp"); \
        /* ... etc etc FPU and vector registers */ \
    } while (0)

You'd use this once right before returning from a function that
manipulated sensitive data.  It would still be a pretty hefty
deoptimization, and it wouldn't do anything about scratch stack slots.
I actually think scratch stack slots are a bigger concern; I've heard
about a lot more unintended data leaks via memory than via the
register file.

All in all I feel that the Right Thing is to not worry about this
beyond mentioning it in documentation -- again, the situation is not
worse than what people are hand-rolling right now -- and put the
effort into adding compiler intrinsics that will let us do the job
_properly_.

zw
  
Zack Weinberg Aug. 26, 2015, 9:54 p.m. UTC | #11
On Wed, Aug 26, 2015 at 5:34 PM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> On 26-08-2015 18:13, Zack Weinberg wrote:
>> The *optimization* (replacing `explicit_bzero` with `memset` + vacuous
>> use) is limited to C programs and GCC.  The *API* works just fine
>> regardless of compiler.  I believe this is sufficient as a starting
>> point.  As and when appropriate ways to express a vacuous use become
>> available in other compilers, we can add them.
>
> Right, but do you know what kind of optimization the 'memory' clobber
> prevents but your suggestion allows?  I do understand that the 'memory'
> clobber is indeed a more restrictive memory barrier, but on most targets
> avoiding a function call where possible is a much bigger gain than having
> some memory operations handled with restrictive scheduling.

That's a good question.  I can tell you one general type of code that
would have a problem...

    struct complicated_object {
        int many, other, fields;
        char sensitive[32];
    };

where code temporarily fills in `sensitive` and uses explicit_bzero to
erase it when it's done; the "memory" clobber would force the compiler
to flush all the other fields of `complicated_object` to memory and
reload them afterward.  They don't have to be tied together with a
`struct` even; the sensitive data could be on the stack while all the
other state being manipulated was on the heap, and it would still have
to flush everything.
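
Concretely, the pattern I have in mind looks like this (fill_sensitive
and do_work are hypothetical):

    void with_secret (struct complicated_object *obj)
    {
      fill_sensitive (obj->sensitive, sizeof obj->sensitive);
      do_work (obj);
      explicit_bzero (obj->sensitive, sizeof obj->sensitive);
      /* A "memory" clobber here would force `many', `other', and
         `fields' out of registers and back; the "m"-operand form
         touches only `sensitive'.  */
    }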

What I don't know is what would make for better optimization in
practice, and unfortunately I do not have time to go digging through
existing users of secure-memory-clear functions to figure out where
the tradeoff lies.  And I don't feel that picking one kludge over
another is a great use of anyone's time.  Instead, I want to get
something in with known, reliable semantics (but not optimal and with
documented limits), and then move over to the compiler side to work on
optimization and more thorough sanitization.  I almost didn't bother
with string[23].h optimizations at all and now I'm kind of regretting
having done them.

zw
  
Zack Weinberg Sept. 7, 2015, 5:18 p.m. UTC | #12
On 08/26/2015 05:34 PM, Adhemerval Zanella wrote:
> On 26-08-2015 18:13, Zack Weinberg wrote:
>> The *optimization* (replacing `explicit_bzero` with `memset` +
>> vacuous use) is limited to C programs and GCC.  The *API* works
>> just fine regardless of compiler.  I believe this is sufficient as
>> a starting point.  As and when appropriate ways to express a
>> vacuous use become available in other compilers, we can add them.
>
> Right, but do you know what kind of optimization the 'memory' clobber
> prevents but your suggestion allows?  I do understand that the
> 'memory' clobber is indeed a more restrictive memory barrier, but on
> most targets avoiding a function call where possible is a much bigger
> gain than having some memory operations handled with restrictive
> scheduling.

I found some time to look into this in detail, by doing experiments on
LibreSSL, which makes extensive use of explicit_bzero.  There are 77
object files in the LibreSSL 2.2.3-portable release which contain at
least one call to explicit_bzero, either directly or via the wrapper
OPENSSL_cleanse, not counting the definition of OPENSSL_cleanse
itself.  I compiled the entire library three times against glibc 2.19
(therefore, using LibreSSL's internal definition of explicit_bzero),
using gcc 5.2, with --disable-shared, as follows:

1) "funcall" -- baseline: all internal uses of the wrapper were
   replaced with direct calls to explicit_bzero (using a macro), but
   there were no other changes.

2) "memread" -- add to libressl's compat/string.h my original proposal
   for bits/string2.h optimization of explicit_bzero, which uses the
   asm ("" :: "m" (*(struct { char x[size]; } *)ptr)) construct.  This
   is expected to produce better local optimization than either (1) or
   (3) when it works, but degenerates to (1) when it doesn't work.

3) "memclob" -- instead of the above optimization, use
   asm ("" :: "r" (ptr) : "memory").  This can be applied
   unconditionally, but is expected to produce worse code than (2) in
   the cases where (2) works.

In the baseline compilation, there are 37 objects that make a call to
memset.  In the "memread" compilation, the number of objects that make
a call to explicit_bzero drops to 38, but the number of objects that
make a call to memset only goes up by one; this is partially because
some of the changed objects already called memset, and partially
because the "memread" optimization works in exactly the cases where
GCC is most likely to replace a call to memset with inline code
(i.e. the size parameter is constant).  In the "memclob" compilation,
there are no surviving calls to explicit_bzero (as expected) and 63
object files call memset.
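
In tabular form (number of object files containing at least one call):

    build     explicit_bzero   memset
    funcall         77           37
    memread         38           38
    memclob          0           63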

First off, I may have missed something, but it appears to me that the
"memclob" approach does not replace any *more* explicit_bzero calls with
inline code than "memread" does.  Again, this is expected, since the
"memread" optimization applies to the cases where GCC is most likely
to inline a call to memset.

Now let's look at a typical case where "memread" enables the compiler
to replace memset with inline code: this is _rs_stir in arc4random.c:

        movq    $0, 96(%rax)
        movq    rs(%rip), %rax
        movq    $984, (%rax)
 .L23:
-       movq    %rsp, %rdi
-       movl    $40, %esi
-       call    explicit_bzero
+       movq    $0, (%rsp)
+       movq    $0, 8(%rsp)
+       movq    $0, 16(%rsp)
+       movq    $0, 24(%rsp)
+       movq    $0, 32(%rsp)
        movq    rs(%rip), %rax
        movq    $0, (%rax)

With "memclob" instead, the same code is generated for the actual
memory clear, but we get slightly worse code for the rest of the
function, because it's now considered to require a frame pointer.
(- memread, + memclob; debug information stripped, slightly
descheduled to emphasize the semantic change)

 _rs_stir:
+       pushq   %r12
        pushq   %rbp
        pushq   %rbx
        movl    $40, %esi
-       subq    $56, %rsp
+       subq    $48, %rsp
        movq    %rsp, %rdi
+       movq    %rsp, %rbp
        movq    %fs:40, %rax
        movq    %rax, 40(%rsp)
        xorl    %eax, %eax
        call    getentropy
        cmpl    $-1, %eax
@@
        movq    rsx(%rip), %rdi
        leaq    64(%rdi), %rsi
        movq    %rsi, %rdx
        call    chacha_encrypt_bytes.constprop.2
        movq    rsx(%rip), %rbx
        xorl    %eax, %eax
 .L24:
-       movzbl  (%rsp,%rax), %edx
+       movzbl  0(%rbp,%rax), %edx
        xorb    %dl, 64(%rbx,%rax)
        addq    $1, %rax
        cmpq    $40, %rax
        jne     .L24
        cmpq    $0, rs(%rip)
-       leaq    64(%rbx), %rbp
+       leaq    64(%rbx), %r12
        movq    %rbx, %rdi
        je      .L46
 .L25:
-       movq    %rbp, %rsi
+       movq    %r12, %rsi
        call    chacha_keysetup.isra.0.constprop.3

and so on.

Here's another example where memclob causes slightly worse register
allocation than memread.  In this case, I think it's mostly because
memset takes one more argument than explicit_bzero.  This is
tls_config_set_key_mem in tls/tls_config.c.

 tls_config_set_key_mem:
+       pushq   %r13
        pushq   %r12
        pushq   %rbp
        pushq   %rbx
+       movq    %rdx, %r13
-       movq    %rdx, %r12
+       movq    %rsi, %r12
        movq    %rdi, %rbx
+       subq    $8, %rsp
-       movq    80(%rdi), %rdi
+       movq    80(%rdi), %rbp
-       movq    %rsi, %rbp
-       testq   %rdi, %rdi
+       testq   %rbp, %rbp
        je      .L105
-       movq    88(%rbx), %rsi
+       movq    88(%rdi), %rdx
+       xorl    %esi, %esi
+       movq    %rbp, %rdi
-       call    explicit_bzero
+       call    memset
 .L105:
+       addq    $8, %rsp
        leaq    88(%rbx), %rsi
        leaq    80(%rbx), %rdi
-       movq    %r12, %rcx
+       movq    %r13, %rcx
+       movq    %r12, %rdx
        popq    %rbx
-       movq    %rbp, %rdx
        popq    %rbp
        popq    %r12
+       popq    %r13
        jmp     set_mem

Finally, the large, complicated apps/ca.c reads in part

    pkey = load_key(bio_err, keyfile, keyform, 0, key, e, "CA private key");
    if (key)
            OPENSSL_cleanse(key, strlen(key));
    if (pkey == NULL) {
            /* load_key() has already printed an appropriate message */
            goto err;
    }

With -memread, GCC decides that the OPENSSL_cleanse operation belongs
out-of-line in .text.unlikely.  With either -memclob or -funcall, this
does not happen.  This is a mis-optimization; I think control can't
actually reach this point with a NULL key.  However, it illustrates
how -memread exposes more information to the compiler and lets it make
more sophisticated decisions.

I've tarred up all three build trees (with intermediate .s files
included) and put them on my website at
https://hacks.owlfolio.org/scratchpad/libressl-ebzero-experiments.tar.xz.
I invite other people to dig into them and discuss the effects of the
various choices.  (Please note that it is 53MB and I can't leave it
there forever due to hosting costs.)

It remains my opinion that the patch as I posted it (with an update to
the manual -- I will get to that in the next few days) is Good Enough
and we should not hold off on making the API available until the ideal
bits/string2.h optimization for it is possible.  I will be sending a
message to the gcc and clang mailing lists in the next few days to
open a discussion of that ideal, but I don't have any plans to
implement it myself.

zw
  
Adhemerval Zanella Sept. 7, 2015, 10:31 p.m. UTC | #13
On 07-09-2015 14:18, Zack Weinberg wrote:
> It remains my opinion that the patch as I posted it (with an update to
> the manual -- I will get to that in the next few days) is Good Enough
> and we should not hold off on making the API available until the ideal
> bits/string2.h optimization for it is possible.  I will be sending a
> message to the gcc and clang mailing lists in the next few days to
> open a discussion of that ideal, but I don't have any plans to
> implement it myself.
> 
> zw

Thanks for the detailed analysis; it indeed shows that 'memread' is the
better approach.  In fact I was not really obstructing your patch, just
curious about the 'memread' approach, and I apologize if I gave the
impression I was trying to stall it over this specific optimization.
  

Patch

Use explicit_bzero where appropriate

I *believe* these are the only places where memset was being used
to clear buffers containing sensitive data.  The compiler probably
couldn't optimize *all* of them out but it seems best to change them all.

The legacy DES implementation wasn't bothering to clear its buffers,
so I added that, mostly for consistency's sake.

[ChangeLog:]
YYYY-MM-DD  Zack Weinberg <zackw@panix.com>

	* crypt/crypt-entry.c (__crypt_r): Clear key-dependent intermediate
	data before returning, using explicit_bzero.
	* crypt/md5-crypt.c (__md5_crypt_r): Likewise.
	* crypt/sha256-crypt.c (__sha256_crypt_r): Likewise.
	* crypt/sha512-crypt.c (__sha512_crypt_r): Likewise.
---
 crypt/crypt-entry.c  | 11 +++++++++++
 crypt/md5-crypt.c    |  8 ++++----
 crypt/sha256-crypt.c | 14 +++++++-------
 crypt/sha512-crypt.c | 14 +++++++-------
 4 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/crypt/crypt-entry.c b/crypt/crypt-entry.c
index 7e655ba..47db32c 100644
--- a/crypt/crypt-entry.c
+++ b/crypt/crypt-entry.c
@@ -143,6 +143,17 @@  __crypt_r (key, salt, data)
    * And convert back to 6 bit ASCII
    */
   _ufc_output_conversion_r (res[0], res[1], salt, data);
+
+#ifdef _LIBC
+  /*
+   * Erase key-dependent intermediate data.  Data dependent only on
+   * the salt is not considered sensitive.
+   */
+  explicit_bzero (ktab, sizeof (ktab));
+  explicit_bzero (data->keysched, sizeof (data->keysched));
+  explicit_bzero (res, sizeof (res));
+#endif
+
   return data->crypt_3_buf;
 }
 weak_alias (__crypt_r, crypt_r)
diff --git a/crypt/md5-crypt.c b/crypt/md5-crypt.c
index 1b890bc..b95766d 100644
--- a/crypt/md5-crypt.c
+++ b/crypt/md5-crypt.c
@@ -292,13 +292,13 @@  __md5_crypt_r (key, salt, buffer, buflen)
 #ifndef USE_NSS
   __md5_init_ctx (&ctx);
   __md5_finish_ctx (&ctx, alt_result);
-  memset (&ctx, '\0', sizeof (ctx));
-  memset (&alt_ctx, '\0', sizeof (alt_ctx));
+  explicit_bzero (&ctx, sizeof (ctx));
+  explicit_bzero (&alt_ctx, sizeof (alt_ctx));
 #endif
   if (copied_key != NULL)
-    memset (copied_key, '\0', key_len);
+    explicit_bzero (copied_key, key_len);
   if (copied_salt != NULL)
-    memset (copied_salt, '\0', salt_len);
+    explicit_bzero (copied_salt, salt_len);
 
   free (free_key);
   return buffer;
diff --git a/crypt/sha256-crypt.c b/crypt/sha256-crypt.c
index d90e291..189183a 100644
--- a/crypt/sha256-crypt.c
+++ b/crypt/sha256-crypt.c
@@ -375,16 +375,16 @@  __sha256_crypt_r (key, salt, buffer, buflen)
 #ifndef USE_NSS
   __sha256_init_ctx (&ctx);
   __sha256_finish_ctx (&ctx, alt_result);
-  memset (&ctx, '\0', sizeof (ctx));
-  memset (&alt_ctx, '\0', sizeof (alt_ctx));
+  explicit_bzero (&ctx, sizeof (ctx));
+  explicit_bzero (&alt_ctx, sizeof (alt_ctx));
 #endif
-  memset (temp_result, '\0', sizeof (temp_result));
-  memset (p_bytes, '\0', key_len);
-  memset (s_bytes, '\0', salt_len);
+  explicit_bzero (temp_result, sizeof (temp_result));
+  explicit_bzero (p_bytes, key_len);
+  explicit_bzero (s_bytes, salt_len);
   if (copied_key != NULL)
-    memset (copied_key, '\0', key_len);
+    explicit_bzero (copied_key, key_len);
   if (copied_salt != NULL)
-    memset (copied_salt, '\0', salt_len);
+    explicit_bzero (copied_salt, salt_len);
 
   free (free_key);
   free (free_pbytes);
diff --git a/crypt/sha512-crypt.c b/crypt/sha512-crypt.c
index 9c581ab..bbd5e24 100644
--- a/crypt/sha512-crypt.c
+++ b/crypt/sha512-crypt.c
@@ -397,16 +397,16 @@  __sha512_crypt_r (key, salt, buffer, buflen)
 #ifndef USE_NSS
   __sha512_init_ctx (&ctx);
   __sha512_finish_ctx (&ctx, alt_result);
-  memset (&ctx, '\0', sizeof (ctx));
-  memset (&alt_ctx, '\0', sizeof (alt_ctx));
+  explicit_bzero (&ctx, sizeof (ctx));
+  explicit_bzero (&alt_ctx, sizeof (alt_ctx));
 #endif
-  memset (temp_result, '\0', sizeof (temp_result));
-  memset (p_bytes, '\0', key_len);
-  memset (s_bytes, '\0', salt_len);
+  explicit_bzero (temp_result, sizeof (temp_result));
+  explicit_bzero (p_bytes, key_len);
+  explicit_bzero (s_bytes, salt_len);
   if (copied_key != NULL)
-    memset (copied_key, '\0', key_len);
+    explicit_bzero (copied_key, key_len);
   if (copied_salt != NULL)
-    memset (copied_salt, '\0', salt_len);
+    explicit_bzero (copied_salt, salt_len);
 
   free (free_key);
   free (free_pbytes);
-- 
2.5.0