memcmp-sse4.S EqualHappy bug

Message ID 20150618145202.GG14955@redhat.com
State Dropped

Commit Message

Andrea Arcangeli June 18, 2015, 2:52 p.m. UTC
  On Thu, Jun 18, 2015 at 04:23:35PM +0200, Torvald Riegel wrote:
> There are a few differences to your memcmp test case though.  If
> barrier() is a compiler barrier, this should "remove" prior knowledge
> the compiler has about memory.  Second, you access volatile-qualified
> data, which at least requires to do the accesses byte-wise or similar.
> Does the memcmp test case also fail when you access volatiles?

Actually the kernel code in the __builtin_memcpy case removes the
volatile qualifier.

On a side note, it's not normal to use READ_ONCE/WRITE_ONCE on
objects larger than sizeof(long); it would warn at build time but
it would still build successfully.

Changing my testcase like this just produces warnings about volatile
(warning: passing argument 1 of ‘memset’ discards ‘volatile’ qualifier
from pointer target type, etc.); it still builds and fails at runtime
like before. I added another barrier as well, but it can't make a
difference: the problem is very clear, and how to prevent this
practical problem is clear too. As long as memcmp-sse4.S is called,
there's nothing the compiler can do to prevent this problem from
materializing.


> Note that I'm not saying that this is guaranteed to work.  It's still UB
> if there is a data race.  However, in practice, this might just continue
> to work.

READ_ONCE/WRITE_ONCE are only used if there is going to be a data
race, otherwise they're never used.

> Userland programs should really try to transition to atomics and the
> C11/C++11 memory model or just the __atomic builtins, if possible.

We should look into that.

Without a way to access non-atomic regions with C, RCU in the kernel
cannot work. But if memcpy in the kernel has to behave sanely (and not
break out of the loop too happily) in the presence of non-atomic changes
to the memory, then I don't see why memcmp in userland is different.

Also note there is a userspace RCU too; I don't know exactly how
it works because I never used it, but the whole concept of RCU is to
allow the C language to access non-atomic regions of memory and to get
away with it. And if that way is memcpy, I don't see what's
fundamentally different about memcpy being required to cope with that,
but memcmp not.

I totally see why memcmp-sse4.S is safe by the standard, but the
standard is not the only way to gain performance: RCU is required to
scale 100% in SMP, pthread_mutex_rwlock wouldn't scale, so it's
unthinkable to get rid of READ_ONCE/WRITE_ONCE. Switching to __atomic
would be fine, and if __atomic can make memcpy (and IMHO memcmp)
behave in a way where they won't happily return too soon on __atomic
builtins, it'd be great.
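
For a long-sized object, the rough userland counterpart would be
something like this (a sketch only, assuming GCC's __atomic builtins;
the real kernel macros also do size checking):

    /* Sketch: approximate userland counterparts of the kernel's
       READ_ONCE/WRITE_ONCE, i.e. relaxed-order atomic accesses. */
    #define READ_ONCE(x)     __atomic_load_n(&(x), __ATOMIC_RELAXED)
    #define WRITE_ONCE(x, v) __atomic_store_n(&(x), (v), __ATOMIC_RELAXED)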
  

Comments

Torvald Riegel June 18, 2015, 3:50 p.m. UTC | #1
On Thu, 2015-06-18 at 16:52 +0200, Andrea Arcangeli wrote:
> On Thu, Jun 18, 2015 at 04:23:35PM +0200, Torvald Riegel wrote:
> > There are a few differences to your memcmp test case though.  If
> > barrier() is a compiler barrier, this should "remove" prior knowledge
> > the compiler has about memory.  Second, you access volatile-qualified
> > data, which at least requires to do the accesses byte-wise or similar.
> > Does the memcmp test case also fail when you access volatiles?
> 
> Actually the kernel code in the __builtin_memcpy case removes the
> volatile qualifier.

yeah, never mind...

> On a side note, it's not normal to use READ_ONCE/WRITE_ONCE on
> objects larger than sizeof(long); it would warn at build time but
> it would still build successfully.
> 
> Changing my testcase like this just produces warnings about volatile
> (warning: passing argument 1 of ‘memset’ discards ‘volatile’ qualifier
> from pointer target type, etc.); it still builds and fails at runtime
> like before. I added another barrier as well, but it can't make a
> difference: the problem is very clear, and how to prevent this
> practical problem is clear too. As long as memcmp-sse4.S is called,
> there's nothing the compiler can do to prevent this problem from
> materializing.

Yeah, but if there were a memcmp that would respect volatile, then it
could make a difference.

> > Note that I'm not saying that this is guaranteed to work.  It's still UB
> > if there is a data race.  However, in practice, this might just continue
> > to work.
> 
> READ_ONCE/WRITE_ONCE are only used if there is going to be a data
> race, otherwise they're never used.

I know.  Technically, data races are UB, yet it can still happen that
a particular compiler does something that works for you.

> > Userland programs should really try to transition to atomics and the
> > C11/C++11 memory model or just the __atomic builtins, if possible.
> 
> We should look into that.
> 
> Without a way to access non-atomic regions with C

I assume regions means memory regions not of atomic type(s).

What do you mean by "with C"?  Do you mean plain accesses, or just C
code?  If you just want to snapshot data for RCU, using atomic loads
with memory_order_relaxed is fine.  This could be put into a custom
memcpy.
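
A minimal sketch of such a custom memcpy (assumptions: C11
<stdatomic.h>, a byte-wise copy, and a source buffer of atomic type;
illustrative only, not a tuned implementation):

    #include <stdatomic.h>
    #include <stddef.h>

    /* Snapshot possibly-racing memory with relaxed atomic loads instead
       of plain accesses.  No ordering is implied: the snapshot may still
       mix bytes from different concurrent writes. */
    static void memcpy_relaxed(unsigned char *dst,
                               _Atomic unsigned char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = atomic_load_explicit(&src[i], memory_order_relaxed);
    }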

> , RCU in the kernel
> cannot work. But if memcpy in the kernel has to behave sanely (and not
> break out of the loop too happily) in the presence of non-atomic changes
> to the memory, then I don't see why memcmp in userland is different.

I was distinguishing userland because userland should just rely on C11
and its memory model, IMO.  I don't have a strong opinion on what the
kernel should do, so I simply didn't want to comment on the latter.

> Also note there is a userspace RCU too; I don't know exactly how
> it works because I never used it, but the whole concept of RCU is to
> allow the C language to access non-atomic regions of memory and to get
> away with it.

In general, this cannot work.  There might be compiler optimizations
that expect that they can safely reload data that is not accessed
through atomics nor is of atomic type.  Those can make a lot of sense
considering average code, but would break in this particular case. 

> And if that way is memcpy, I don't see what's
> fundamentally different about memcpy being required to cope with that,
> but memcmp not.

If there is a way for the compiler to see that the data accessed
speculatively in RCU can be modified concurrently, things are fine.
This could be explicitly atomic accesses, a custom memcpy or such, etc.
The stock memcpy assumes non-atomic data though, and data-race-freedom.

> I totally see why memcmp-sse4.S is safe by the standard, but the
> standard is not the only way to gain performance: RCU is required to
> scale 100% in SMP, pthread_mutex_rwlock wouldn't scale, so it's
> unthinkable to get rid of READ_ONCE/WRITE_ONCE. Switching to __atomic
> would be fine, and if __atomic can make memcpy (and IMHO memcmp)
> behave in a way where they won't happily return too soon on __atomic
> builtins, it'd be great.

The __atomic builtins are all you need to implement concurrent code.
The kernel's memory model might be more fine-grained than what C11
offers currently (e.g., data dependences: memory_order_consume is
broken, and the kernel keeps its fingers crossed that the compiler will
not do what it would actually be allowed to do; I've been working with
Paul McKenney on this...).  Nonetheless, besides those potential
performance differences on some archs, it's all there.

BTW, I'm wondering what your memcmp would actually do internally, and
what it would guarantee precisely.  It can't just use relaxed-MO reads /
read_once without further barriers, because this won't guarantee
anything in the face of multiple concurrent updaters; if using just
relaxed-MO reads, memcmp might read a snapshot that never existed in any
seq-cst ordering of all individual loads and stores on this memory
region.
  
Andrea Arcangeli June 18, 2015, 4:19 p.m. UTC | #2
On Thu, Jun 18, 2015 at 05:50:35PM +0200, Torvald Riegel wrote:
> BTW, I'm wondering what your memcmp would actually do internally, and
> what it would guarantee precisely.  It can't just use relaxed-MO reads /

That's very simple: it would never return 0 if it didn't finish
comparing all the data according to the length parameter.

If the minimal atomic granularity of the arch is 1 byte, and I guarantee
the last byte is never changing and always different between the src and
dst regions, it would never return 0. It could return any other value but
0. It's undefined whether it returns above or below zero (depending on the
modifications), but in no way could my memcmp return 0 for such a
workload.  That it can only return non zero is defined by the hardware
and trivial to enforce in the workload of my testcase.
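
A minimal sketch of such a workload (illustrative names and sizes, not
the original testcase; the data race on the leading bytes is deliberate):

    #include <assert.h>
    #include <pthread.h>
    #include <string.h>

    #define SIZE 32
    static char a[SIZE], b[SIZE];

    /* Racy writer: keeps flipping the leading bytes of a.  The last byte
       of each buffer is written once below and never touched again. */
    static void *writer(void *arg)
    {
        (void)arg;
        for (;;) {
            memset(a, 1, SIZE - 1);
            memset(a, 0, SIZE - 1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        a[SIZE - 1] = 1;    /* always differs from b[SIZE - 1], */
        b[SIZE - 1] = 2;    /* so a correct memcmp can never see 0 */
        pthread_create(&t, NULL, writer, NULL);
        for (;;)
            assert(memcmp(a, b, SIZE) != 0);  /* fires with the bug */
    }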
  
Andrea Arcangeli June 18, 2015, 5:22 p.m. UTC | #3
On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote:
> workload.  That it can only return non zero is defined by the hardware
> and trivial to enforce in the workload of my testcase.

Instead of thinking in hardware terms, our colleague David (CC'ed)
suggested we can explain why there is a problem in C terms as well.

While the compiler would be free to reorder the load/stores if you
wrote the memcmp in C, and it would be free to duplicate a read, it
would never be free to terminate the comparison loop due to any equal
result.
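
Put differently, a byte-wise memcmp written in plain C (an illustrative
sketch, not glibc's implementation) has an early exit only on
inequality; 0 is reachable only after the full length has been walked:

    #include <stddef.h>

    static int memcmp_bytewise(const void *p1, const void *p2, size_t n)
    {
        const unsigned char *s1 = p1, *s2 = p2;
        for (size_t i = 0; i < n; i++)
            if (s1[i] != s2[i])                /* early exit: inequality only */
                return s1[i] < s2[i] ? -1 : 1;
        return 0;                              /* all n bytes were compared */
    }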
  
Torvald Riegel June 18, 2015, 5:49 p.m. UTC | #4
On Thu, 2015-06-18 at 19:22 +0200, Andrea Arcangeli wrote:
> On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote:
> > workload.  That it can only return non zero is defined by the hardware
> > and trivial to enforce in the workload of my testcase.
> 
> Instead of thinking in hardware terms, our colleague David (CC'ed)
> suggested we can explain why there is a problem in C terms as well.
> 
> While the compiler would be free to reorder the load/stores if you
> wrote the memcmp in C, and it would be free to duplicate a read, it
> would never be free to terminate the comparison loop due to any equal
> result.

If there are data races, this is undefined behavior and *all* bets are
off (according to the standard).  This includes results such as
terminating the loop, stopping to do anything, etc.

This may seem overly drastic, but it's a useful line to draw because we
don't want to have to reason based on details of a compiler's
implementation.
  
Dr. David Alan Gilbert June 18, 2015, 6:12 p.m. UTC | #5
* Torvald Riegel (triegel@redhat.com) wrote:
> On Thu, 2015-06-18 at 19:22 +0200, Andrea Arcangeli wrote:
> > On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote:
> > > workload.  That it can only return non zero is defined by the hardware
> > > and trivial to enforce in the workload of my testcase.
> > 
> > Instead of thinking in hardware terms, our colleague David (CC'ed)
> > suggested we can explain why there is a problem in C terms as well.
> > 
> > While the compiler would be free to reorder the load/stores if you
> > wrote the memcmp in C, and it would be free to duplicate a read, it
> > would never be free to terminate the comparison loop due to any equal
> > result.
> 
> If there are data races, this is undefined behavior and *all* bets are
> off (according to the standard).  This includes results such as
> terminating the loop, stopping to do anything, etc.
> 
> This may seem overly drastic, but it's a useful line to draw because we
> don't want to have to reason based on details of a compiler's
> implementation.

Yes, I guess what I find interesting about this case is that:
   a) Having found a difference and then reread it and found it identical,
    you *know* you've hit an 'undefined' case.

   b) The action taken in that case is an outcome that seems impossible
    from any memory reordering or pure-rereading of data; so it's more
    surprising than normal - people are already used to the compiler
    or the CPU reordering stuff, and having static mismatched data
    in the block would seem to be a guarantee that 0 could never
    happen.

   c) I worry if there's anything where a strcmp could get optimised
    into a memcmp and then hit this case, but I agree it's difficult to
    find a situation where it would have been safe anyway.

IMHO it would be safer to abort than to return 0, but that's probably
an equally controversial result.

Dave
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
  
Andrea Arcangeli June 19, 2015, 1:20 p.m. UTC | #6
On Thu, Jun 18, 2015 at 07:49:45PM +0200, Torvald Riegel wrote:
> On Thu, 2015-06-18 at 19:22 +0200, Andrea Arcangeli wrote:
> > On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote:
> > > workload.  That it can only return non zero is defined by the hardware
> > > and trivial to enforce in the workload of my testcase.
> > 
> > Instead of thinking in hardware terms, our colleague David (CC'ed)
> > suggested we can explain why there is a problem in C terms as well.
> > 
> > While the compiler would be free to reorder the load/stores if you
> > wrote the memcmp in C, and it would be free to duplicate a read, it
> > would never be free to terminate the comparison loop due to any equal
> > result.
> 
> If there are data races, this is undefined behavior and *all* bets are
> off (according to the standard).  This includes results such as
> terminating the loop, stopping to do anything, etc.
> 
> This may seem overly drastic, but it's a useful line to draw because we
> don't want to have to reason based on details of a compiler's
> implementation.

My view is more practical: memcmp equality retvals can give false
negatives now. So any code using memcmp or bcmp to validate security
canary values (canary areas are always checked for equality) could
also return false negatives. It sounds unlikely that this will trigger,
but in practice by now memcmp isn't usable for canary value checks or
anything else related to security unless this gets fixed.
  
Carlos O'Donell June 19, 2015, 1:38 p.m. UTC | #7
On 06/18/2015 02:12 PM, Dr. David Alan Gilbert wrote:
>    c) I worry if there's anything where a strcmp could get optimised
>     into a memcmp and then hit this case, but I agree it's difficult to
>     find a situation where it would have been safe anyway.

Exactly. It would not have been safe.
 
> IMHO it would be safer to abort than to return 0, but that's probably
> an equally controversial result.

Not really.

It's undefined behaviour; we can do anything.

I agree with aborting, but only as long as the hot path's performance
is not impacted and I haven't thought about how to do that.

Per "Bugs in the user program" convention:
https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program

---
9.2. Bugs in the user program

If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common than library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug.
---

Cheers,
Carlos.
  
Andrea Arcangeli June 19, 2015, 2:07 p.m. UTC | #8
On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> I agree with aborting, but only as long as the hot path's performance
> is not impacted and I haven't thought about how to do that.
> 
> Per "Bugs in the user program" convention:
> https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program
> 
> ---
> 9.2. Bugs in the user program
> 
> If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common than library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug.
> ---

Aborting should be a satisfactory measure too and I think this issue
fits the worthwhile tradeoff described by the above document, as the
cost should be negligible. It's just a matter of propagating rdx in
addition to rdi/rsi and it won't affect the sse4 unrolled loops fast
paths.

If the testcase reproduces the bug frequently the abort would give a
better chance to fix the bug so the app becomes fully C compliant.

Aborting should still leave memcmp usable for security purposes
if the app can safely detect such a failure and shut down (failing
early and catastrophically sounds good to me, but then not everyone may
agree; if it's a canary value that is being checked, it may already have
a proper failure-notification path that may not get invoked). Overall,
as long as the current false negative is not returned, it should be
enough as far as security is concerned.

If the race condition is very rarely triggered and it ends up
triggering in production, an abort wouldn't be welcome, but it's still
preferable and safer than the current behavior.

The cost of detecting the UB and aborting is practically identical to
the cost of making memcmp not return 0. The only difference is
that instead of jumping to the start of the function and continuing
to evaluate the rest of the buffer, we'd abort.
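
In C terms, both options share the same detection point (a hypothetical
sketch; chunk_mismatch() stands in for the SSE4 wide compare plus the
re-read that locates the differing byte, and is not a real glibc
function):

    #include <stddef.h>

    #define CHUNK 16  /* the sse4 code compares 16-byte chunks */

    /* Hypothetical stand-in for the wide compare + locating re-read;
       returns CHUNK if the (re-)read sees no difference. */
    static size_t chunk_mismatch(const unsigned char *a,
                                 const unsigned char *b)
    {
        for (size_t j = 0; j < CHUNK; j++)
            if (a[j] != b[j])
                return j;
        return CHUNK;
    }

    int memcmp_sketch(const unsigned char *s1, const unsigned char *s2,
                      size_t n)
    {
        while (n >= CHUNK) {
            size_t j = chunk_mismatch(s1, s2);
            if (j < CHUNK)
                return s1[j] < s2[j] ? -1 : 1;
            /* A racing writer may make the re-read see equality even
               though the wide compare saw a difference: don't return 0
               here; keep comparing the remaining length (which is why
               rdx must be propagated), or abort() instead. */
            s1 += CHUNK; s2 += CHUNK; n -= CHUNK;
        }
        for (; n > 0; n--, s1++, s2++)
            if (*s1 != *s2)
                return *s1 < *s2 ? -1 : 1;
        return 0;
    }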
  
Simo Sorce June 19, 2015, 2:42 p.m. UTC | #9
On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > I agree with aborting, but only as long as the hot path's performance
> > is not impacted and I haven't thought about how to do that.
> > 
> > Per "Bugs in the user program" convention:
> > https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program
> > 
> > ---
> > 9.2. Bugs in the user program
> > 
> > If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common than library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug.
> > ---
> 
> Aborting should be a satisfactory measure too and I think this issue
> fits the worthwhile tradeoff described by the above document, as the
> cost should be negligible. It's just a matter of propagating rdx in
> addition to rdi/rsi and it won't affect the sse4 unrolled loops fast
> paths.
> 
> If the testcase reproduces the bug frequently the abort would give a
> better chance to fix the bug so the app becomes fully C compliant.
> 
> Aborting should still leave memcmp usable for security purposes
> if the app can safely detect such a failure and shut down (failing
> early and catastrophically sounds good to me, but then not everyone may
> agree; if it's a canary value that is being checked, it may already have
> a proper failure-notification path that may not get invoked). Overall,
> as long as the current false negative is not returned, it should be
> enough as far as security is concerned.
> 
> If the race condition is very rarely triggered and it ends up
> triggering in production, an abort wouldn't be welcome, but it's still
> preferable and safer than the current behavior.
> 
> The cost of detecting the UB and aborting is practically identical to
> the cost of making memcmp not return 0. The only difference is
> that instead of jumping to the start of the function and continuing
> to evaluate the rest of the buffer, we'd abort.

My 2c,
although it is true that technically the spec says memcmp can return
arbitrary values if the tested areas change concurrently, it would be
really hostile to abort() in such a case.
The only real problem is returning an equal result when the underlying
memory never was, nor could be equal.
I do not think it matters what value is returned in that case, as long
as it is not 0.
If returning anything but 0 can be accomplished w/o sacrificing speed
then that is something that should be done.

There are many programs that use memory checks to handle lock-less
barriers. For those programs it doesn't matter at all what (or when)
changed. All that matters is that the memory regions were different
during the test, and then they'll go on with their business. If you were to
require locking in order to do lockless checks, then the whole purpose
would be defeated.
Clearly people are better using atomic comparisons on canary values
instead, but it seems easy to avoid false positives (returning 0 when
memory is clearly different) and keep these things working, so why not
do it?

Simo.
  
Torvald Riegel June 19, 2015, 3:32 p.m. UTC | #10
On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > I agree with aborting, but only as long as the hot path's performance
> > > is not impacted and I haven't thought about how to do that.
> > > 
> > > Per "Bugs in the user program" convention:
> > > https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program
> > > 
> > > ---
> > > 9.2. Bugs in the user program
> > > 
> > > If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common than library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug.
> > > ---
> > 
> > Aborting should be a satisfactory measure too and I think this issue
> > fits the worthwhile tradeoff described by the above document, as the
> > cost should be negligible. It's just a matter of propagating rdx in
> > addition to rdi/rsi and it won't affect the sse4 unrolled loops fast
> > paths.
> > 
> > If the testcase reproduces the bug frequently the abort would give a
> > better chance to fix the bug so the app becomes fully C compliant.
> > 
> > Aborting should still leave memcmp usable for security purposes
> > if the app can safely detect such a failure and shut down (failing
> > early and catastrophically sounds good to me, but then not everyone may
> > agree; if it's a canary value that is being checked, it may already have
> > a proper failure-notification path that may not get invoked). Overall,
> > as long as the current false negative is not returned, it should be
> > enough as far as security is concerned.
> > 
> > If the race condition is very rarely triggered and it ends up
> > triggering in production, an abort wouldn't be welcome, but it's still
> > preferable and safer than the current behavior.
> > 
> > The cost of detecting the UB and aborting is practically identical to
> > the cost of making memcmp not return 0. The only difference is
> > that instead of jumping to the start of the function and continuing
> > to evaluate the rest of the buffer, we'd abort.
> 
> My 2c,
> although it is true that technically the spec says memcmp can return
> arbitrary values if the tested areas change concurrently, it would be
> really hostile to abort() in such a case.
> The only real problem is returning an equal result when the underlying
> memory never was, nor could be equal.

This last sentence you wrote makes it look simple, but defining this
properly is more complicated.  The condition Andrea gave is simpler,
because it requires that there are no data races on some parts of the
checked memory, and those parts are guaranteed to be different.

In contrast, "never was nor could be equal" is not simple to define in
weak memory models; what I mean is that an intuitive understanding of
this can easily differ from what would be possible in general for
executions on a particular hardware.

> I do not think it matters what value is returned in that case, as long
> as it is not 0.
> If returning anything but 0 can be accomplished w/o sacrificing speed
> then that is something that should be done.
> 
> There are many programs that use memory checks to handle lock-less
> barriers.

What kind of barriers are you referring to?  Fences like an "mfence"
instruction on x86?  If so, what do you mean by "handle"?

> For those programs it doesn't matter at all what (or when)
> changed. All that matters is that the memory regions were different
> during the test, and then they'll go on with their business. If you were to
> require locking in order to do lockless checks, then the whole purpose
> would be defeated.

The discussion is about whether we can indeed rely on data race freedom,
or whether we want to handle data races gracefully in some situations.
That applies to both code using locks and code synchronizing
differently.

> Clearly people are better using atomic comparisons on canary values
> instead, but it seems easy to avoid false positives (returning 0 when
> memory is clearly different) and keep these things working, so why not
> do it?

I see two separate issues here.  First, where do we draw the line, and
what do we guarantee.  I strongly believe that programs must not have
data races, and that they should use atomics or other synchronization
properly (this doesn't mean locks, but relaxed memory order atomics, for
example).
Second, do we try to keep buggy programs working.  If it has no cost to
do so (e.g., like it might be true in this case), then doing that can
help to trigger fewer errors.  But that doesn't mean the buggy programs
shouldn't get fixed eventually.
  
Dr. David Alan Gilbert June 19, 2015, 3:59 p.m. UTC | #11
* Torvald Riegel (triegel@redhat.com) wrote:
> On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > I agree with aborting, but only as long as the hot path's performance
> > > > is not impacted and I haven't thought about how to do that.
> > > > 

> > Clearly people are better using atomic comparisons on canary values
> > instead, but it seems easy to avoid false positives (returning 0 when
> > memory is clearly different) and keep these things working, so why not
> > do it?
> 
> I see two separate issues here.  First, where do we draw the line, and
> what do we guarantee.  I strongly believe that programs must not have
> data races, and that they should use atomics or other synchronization
> properly (this doesn't mean locks, but relaxed memory order atomics, for
> example).
> Second, do we try to keep buggy programs working.  If it has no cost to
> do so (e.g., like it might be true in this case), then doing that can
> help to trigger fewer errors.  But that doesn't mean the buggy programs
> shouldn't get fixed eventually.

I find it difficult to understand the boundaries of what the C library
is allowed to do in this type of optimisation.

For example, consider the following:

    char a[128];
    char b[128];

    put some static data in a[0-63]
    put some static data in b[0-63]
    a[64]=0;
    b[64]=0;
    start a thread doing stuff in a[65..]
    start a thread doing stuff in b[65..]
    
    if (!strcmp(a,b)) {
      /* Do something */
    }

   a) Is that behaviour defined?
   b) Is it defined with strncmp(a,b,64) ?
   c) Is it defined with strncmp(a,b,128)?
   d) Is it defined with memcmp(a,b,64)?
   e) Is it defined with memcmp(a,b,128)?
   f) If I moved that boundary off a nice round % 8 mark would
      it matter?

I can imagine there may be lots of things that terminate
a string and let other stuff happen in the allocated space
after the end of the string in the belief that at the end
of that string all is unknown.   Andrea's case is a bit different
in that it's the later data that's static, but that doesn't
sound like it should change the answer as to what's allowed.

Dave
    
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
  
Ondrej Bilka June 19, 2015, 4:23 p.m. UTC | #12
On Fri, Jun 19, 2015 at 04:59:08PM +0100, Dr. David Alan Gilbert wrote:
> * Torvald Riegel (triegel@redhat.com) wrote:
> > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > is not impacted and I haven't thought about how to do that.
> > > > > 
> 
> > > Clearly people are better using atomic comparisons on canary values
> > > instead, but it seems easy to avoid false positives (returning 0 when
> > > memory is clearly different) and keep these things working, so why not
> > > do it?
> > 
> > I see two separate issues here.  First, where do we draw the line, and
> > what do we guarantee.  I strongly believe that programs must not have
> > data races, and that they should use atomics or other synchronization
> > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > example).
> > Second, do we try to keep buggy programs working.  If it has no cost to
> > do so (e.g., like it might be true in this case), then doing that can
> > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > shouldn't get fixed eventually.
> 
> I find it difficult to understand the boundaries of what the C library
> is allowed to do in this type of optimisation.
> 
> For example, consider the following:
> 
>     char a[128];
>     char b[128];
> 
>     put some static data in a[0-63]
>     put some static data in b[0-63]
>     a[64]=0;
>     b[64]=0;
>     start a thread doing stuff in a[65..]
>     start a thread doing stuff in b[65..]
>     
>     if (!strcmp(a,b)) {
>       /* Do something */
>     }
> 
>    a) Is that behaviour defined?
>    b) Is it defined with strncmp(a,b,64) ?
>    c) Is it defined with strncmp(a,b,128)?
>    d) Is it defined with memcmp(a,b,64)?
>    e) Is it defined with memcmp(a,b,128)?
>    f) If I moved that boundary off a nice round % 8 mark would
>       it matter?
>
Yes, this is correct, as the memory is indistinguishable from using

struct foo
{
  char start[65];
  char end[63];
};

where end is modified in parallel.

The only thing that could differ is performance, as we may read past end
(as long as it doesn't cross a page) and then discard the data after end.
  
Torvald Riegel June 19, 2015, 4:56 p.m. UTC | #13
On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> * Torvald Riegel (triegel@redhat.com) wrote:
> > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > is not impacted and I haven't thought about how to do that.
> > > > > 
> 
> > > Clearly people are better using atomic comparisons on canary values
> > > instead, but it seems easy to avoid false positives (returning 0 when
> > > memory is clearly different) and keep these things working, so why not
> > > do it?
> > 
> > I see two separate issues here.  First, where do we draw the line, and
> > what do we guarantee.  I strongly believe that programs must not have
> > data races, and that they should use atomics or other synchronization
> > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > example).
> > Second, do we try to keep buggy programs working.  If it has no cost to
> > do so (e.g., like it might be true in this case), then doing that can
> > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > shouldn't get fixed eventually.
> 
> I find it difficult to understand the boundaries of what the C library
> is allowed to do in this type of optimisation.
> 
> For example, consider the following:
> 
>     char a[128];
>     char b[128];
> 
>     put some static data in a[0-63]
>     put some static data in b[0-63]
>     a[64]=0;
>     b[64]=0;
>     start a thread doing stuff in a[65..]
>     start a thread doing stuff in b[65..]
>     
>     if (!strcmp(a,b)) {
>       /* Do something */
>     }
> 
>    a) Is that behaviour defined?

Good question.  I think it should be.  This depends on both the data
race definition and what strcmp/strncmp/memcmp are specified to do using
the abstract machine.

The data race definition uses memory locations as granularity, which is
in 3.14 in C11.  Separate characters in an array should be separate
memory locations.

C11 isn't very specific regarding strcmp, and just says that it
"compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
regarding basic_string::compare (21.4.7.9), saying that first the lengths
of the strings are determined, and then a strncmp is run using the
smaller of the two lengths.

Assuming the C++ specs, only the array indices [0..64] should be
accessed by the abstract machine.  So no data race with the stuff going
on in [65..).

>    b) Is it defined with strncmp(a,b,64) ?

Yes.

>    c) Is it defined with strncmp(a,b,128)?

Not sure.  C11 says that not more than "n characters" are compared, and
characters that follow a null character aren't compared either.
This indicates it wouldn't be different from strncmp(a,b,64) in the
particular case.
Regarding C++11, I'm not sure.  The closest equivalent copies a substring
(conceptually) and then compares, but the substring copying has to
determine the length of the string and then subtract the max length.
This would do a strlen first, which wouldn't access past index 64.
Thus, should be fine too.

>    d) Is it defined with memcmp(a,b,64)?

No data race, IMO.

>    e) Is it defined with memcmp(a,b,128)?

Data race.  Undefined behavior.

>    f) If I moved that boundary off a nice round % 8 mark would
>       it matter?

No difference as far as the standard is concerned.

> I can imagine there may be lots of things that terminate
> a string and let other stuff happen in the allocated space
> after the end of the string in the belief that at the end
> of that string all is unknown.   Andrea's case is a bit different
> in that it's the later data that's static, but that doesn't
> sound like it should change the answer as to what's allowed.

I think it does, because the question is whether there is a data race on
the memory locations that the abstract machine would access.
  
Simo Sorce June 19, 2015, 5:08 p.m. UTC | #14
It is true you identify 2 different issues, but I am not interested in
either; I merely care about failing safe.

On Fri, 2015-06-19 at 17:32 +0200, Torvald Riegel wrote:
> I see two separate issues here.  First, where do we draw the line, and
> what do we guarantee.  I strongly believe that programs must not have
> data races, and that they should use atomics or other synchronization
> properly (this doesn't mean locks, but relaxed memory order atomics,
> for example).

Programs should not have data races, true, but sometimes they do;
failing safe makes 'bad' programs not be worse.

> Second, do we try to keep buggy programs working.  If it has no cost
> to do so (e.g., like it might be true in this case), then doing that
> can help to trigger less errors.  But that doesn't mean the buggy
> programs should get fixed eventually.

Sure, but sometimes it is hard to define something as a bug.

In the specific case Andrea reported (which I pondered on), the race was
clearly there, but also clearly the underlying memory was never
identical at any given point, yet memcmp reported 0, which means the
whole memory area was checked and matched. This was clearly not the
case: memcmp didn't even check the whole memory area, and returned 0.

Now, if the 2 memory areas ever were (even for a nanosecond) identical,
then returning zero would have been perfectly acceptable. During a race
you can't know what the result is. But a function that supposedly checks
a data area for equality and instead stops short and returns "equal" seems
just wrong, and it is certainly unexpected.

I understand that the cause for that is the program is operating on part
of the area currently and it shouldn't, but still I would expect memcmp
to at least always go through the whole area it is expected to check and
not stop short on a fluke and return a "random" result.

Simo.
  
Simo Sorce June 19, 2015, 5:16 p.m. UTC | #15
On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > * Torvald Riegel (triegel@redhat.com) wrote:
> > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > 
> > 
> > > > Clearly people are better using atomic comparisons on canary values
> > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > memory is clearly different) and keep these things working, so why not
> > > > do it?
> > > 
> > > I see two separate issues here.  First, where do we draw the line, and
> > > what do we guarantee.  I strongly believe that programs must not have
> > > data races, and that they should use atomics or other synchronization
> > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > example).
> > > Second, do we try to keep buggy programs working.  If it has no cost to
> > > do so (e.g., like it might be true in this case), then doing that can
> > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > shouldn't get fixed eventually.
> > 
> > I find it difficult to understand the boundaries of what the C library
> > is allowed to do in this type of optimisation.
> > 
> > For example, consider the following:
> > 
> >     char a[128];
> >     char b[128];
> > 
> >     put some static data in a[0-63]
> >     put some static data in b[0-63]
> >     a[64]=0;
> >     b[64]=0;
> >     start a thread doing stuff in a[65..]
> >     start a thread doing stuff in b[65..]
> >     
> >     if (!strcmp(a,b)) {
> >       /* Do something */
> >     }
> > 
> >    a) Is that behaviour defined?
> 
> Good question.  I think it should be.  This depends on both the data
> race definition and what strcmp/strncmp/memcmp are specified to do using
> the abstract machine.
> 
> The data race definition uses memory locations as granularity, which is
> in 3.14 in C11.  Separate characters in an array should be separate
> memory locations.
> 
> C11 isn't very specific regarding strcmp, and just says that it
> "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> regarding basic_string::compare (21.4.7.9), saying that first the lengths
> of the strings are determined, and then a strncmp is run using the
> smaller of the two lengths.
> 
> Assuming the C++ specs, only the array indices [0..64] should be
> accessed by the abstract machine.  So no data race with the stuff going
> on in [65..).
> 
> >    b) Is it defined with strncmp(a,b,64) ?
> 
> Yes.
> 
> >    c) Is it defined with strncmp(a,b,128)?
> 
> Not sure.  C11 says that not more than "n characters" are compared, and
> characters that follow a null character aren't compared either.
> This indicates it wouldn't be different from strncmp(a,b,64) in the
> particular case.
> Regarding C++11, I'm not sure.  The closest equivalent copies a substring
> (conceptually) and then compares, but the substring copying has to
> determine the length of the string and then subtract the max length.
> This would do a strlen first, which wouldn't access past index 64.
> Thus, should be fine too.
> 
> >    d) Is it defined with memcmp(a,b,64)?
> 
> No data race, IMO.
> 
> >    e) Is it defined with memcmp(a,b,128)?
> 
> Data race.  Undefined behavior.
> 
> >    f) If I moved that boundary off a nice round % 8 mark would
> >       it matter?
> 
> No difference as far as the standard is concerned.
> 
> > I can imagine there may be lots of things that terminate
> > a string and let other stuff happen in the allocated space
> > after the end of the string in the belief that at the end
> > of that string all is unknown.   Andrea's case is a bit different
> > in that it's the later data that's static, but that doesn't
> > sound like it should change the answer as to what's allowed.
> 
> I think it does, because the question is whether there is a data race on
> the memory locations that the abstract machine would access.

Well, given we are making examples.

Assume 2 structures like this:

struct test {
    void *pointer;
    char start[16];
    char end[240];
};

and

struct test a;
struct test b;

memset(a.end, 1, 240);
memset(b.end, 2, 240);

In what case would it be expected/legal for
memcmp(a.start, b.start, 256) to ever return 0?

Simo.
  
Simo Sorce June 19, 2015, 5:18 p.m. UTC | #16
On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote:
> On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > > 
> > > 
> > > > > Clearly people are better using atomic comparisons on canary values
> > > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > > memory is clearly different) and keep these things working, so why not
> > > > > do it?
> > > > 
> > > > I see two separate issues here.  First, where do we draw the line, and
> > > > what do we guarantee.  I strongly believe that programs must not have
> > > > data races, and that they should use atomics or other synchronization
> > > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > > example).
> > > > Second, do we try to keep buggy programs working.  If it has no cost to
> > > > do so (e.g., like it might be true in this case), then doing that can
> > > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > > shouldn't get fixed eventually.
> > > 
> > > I find it difficult to understand the boundaries of what the C library
> > > is allowed to do in this type of optimisation.
> > > 
> > > For example, consider the following:
> > > 
> > >     char a[128];
> > >     char b[128];
> > > 
> > >     put some static data in a[0-63]
> > >     put some static data in b[0-63]
> > >     a[64]=0;
> > >     b[64]=0;
> > >     start a thread doing stuff in a[65..]
> > >     start a thread doing stuff in b[65..]
> > >     
> > >     if (!strcmp(a,b)) {
> > >       /* Do something */
> > >     }
> > > 
> > >    a) Is that behaviour defined?
> > 
> > Good question.  I think it should be.  This depends on both the data
> > race definition and what strcmp/strncmp/memcmp are specified to do using
> > the abstract machine.
> > 
> > The data race definition uses memory locations as granularity, which is
> > in 3.14 in C11.  Separate characters in an array should be separate
> > memory locations.
> > 
> > C11 isn't very specific regarding strcmp, and just says that it
> > "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> > regarding basic_string::compare (21.4.7.9), saying that first the lengths
> > of the strings are determined, and then a strncmp is run using the
> > smaller of the two lengths.
> > 
> > Assuming the C++ specs, only the array indices [0..64] should be
> > accessed by the abstract machine.  So no data race with the stuff going
> > on in [65..).
> > 
> > >    b) Is it defined with strncmp(a,b,64) ?
> > 
> > Yes.
> > 
> > >    c) Is it defined with strncmp(a,b,128)?
> > 
> > Not sure.  C11 says that not more than "n characters" are compared, and
> > characters that follow a null character aren't compared either.
> > This indicates it wouldn't be different from strncmp(a,b,64) in the
> > particular case.
> > Regarding C++11, I'm not sure.  The closest equivalent copies a substring
> > (conceptually) and then compares, but the substring copying has to
> > determine the length of the string and then subtract the max length.
> > This would do a strlen first, which wouldn't access past index 64.
> > Thus, should be fine too.
> > 
> > >    d) Is it defined with memcmp(a,b,64)?
> > 
> > No data race, IMO.
> > 
> > >    e) Is it defined with memcmp(a,b,128)?
> > 
> > Data race.  Undefined behavior.
> > 
> > >    f) If I moved that boundary off a nice round % 8 mark would
> > >       it matter?
> > 
> > No difference as far as the standard is concerned.
> > 
> > > I can imagine there may be lots of things that terminate
> > > a string and let other stuff happen in the allocated space
> > > after the end of the string in the belief that at the end
> > > of that string all is unknown.   Andrea's case is a bit different
> > > in that it's the later data that's static, but that doesn't
> > > sound like it should change the answer as to what's allowed.
> > 
> > I think it does, because the question is whether there is a data race on
> > the memory locations that the abstract machine would access.
> 
> Well, given we are making examples.
> 
> Assume 2 structures like this:
> 
> struct test {
>     void *pointer;
>     char start[16];
>     char end[240];
> };
> 
> and
> 
> struct test a;
> struct test b;
> 
> memset(a.end, 1, 240);
> memset(b.end, 2, 240);
> 
> In what case would it be expected/legal for
> memcmp(a.start, b.start, 256) to ever return 0?

Just to be clear, if there are data-races in [ab].pointer or [ab].start
I see absolutely no problem in stopping short and returning a
positive/negative value. I just do not see how returning 0 would ever
make sense.

Simo.
  
Torvald Riegel June 19, 2015, 5:36 p.m. UTC | #17
On Fri, 2015-06-19 at 13:08 -0400, Simo Sorce wrote:
> It is true you identify 2 different issues, but I am not interested in
> either; I merely care about failing safe.
> 
> On Fri, 2015-06-19 at 17:32 +0200, Torvald Riegel wrote:
> > I see two separate issues here.  First, where do we draw the line, and
> > what do we guarantee.  I strongly believe that programs must not have
> > data races, and that they should use atomics or other synchronization
> > properly (this doesn't mean locks, but relaxed memory order atomics,
> > for example).
> 
> Programs should not have data races, true, but sometimes they do;
> failing safe makes 'bad' programs not be worse.
> 
> > Second, do we try to keep buggy programs working.  If it has no cost
> > to do so (e.g., like it might be true in this case), then doing that
> > can help to trigger less errors.  But that doesn't mean the buggy
> > programs should get fixed eventually.
> 
> Sure, but sometimes it is hard to define something as a bug.
> 
> In the specific case Andrea reported (which I pondered on), the race was
> clearly there, but also clearly the underlying memory was never
> identical at any given point, yet memcmp reported 0, which means the
> whole memory area was checked and matched. This was clearly not the
> case: memcmp didn't even check the whole memory area, and returned 0.

This is allowed if something is undefined behavior.  A return value of 0
only has the specified meaning if there's no undefined behavior
elsewhere.

> Now, if the 2 memory areas ever were (even for a nanosecond) identical,
> then returning zero would have been perfectly acceptable.

How do you know that when there are data races?  Making this assumption
requires making assumptions about how the modification happened.  For
example, if the modifications used sequential code too, this could
potentially write temporary values.

Second, you say "even for a nanosecond", and that seems to say you're
thinking about some kind of total order of events.  This isn't
necessarily the case though; even if you'd use relaxed memory-order
atomic stores and loads to access the array, what the memcpy using the
atomics would see would not necessarily be ordered as if the stores
would happen one at a time, or in some way that can be totally ordered.

> During a race
> you can't know what the result is.

The standard "specifies" undefined behavior, which allows for much more
than not knowing what the result is.  It could be conforming to burn
your computer, for example ;) 

> But a function that supposedly checks
> a data area for equality and instead stops short and returns "equal" seems
> just wrong, and it is certainly unexpected.

I do understand that it could be surprising behavior, especially if
thinking at the level of what a straight-forward implementation may or
may not do.  But we have to draw the line somewhere to enable compiler
optimizations, and we don't want to describe compiler optimizations in
detail nor reason about that detail, so the big hammer of undefined
behavior is a sensible choice.

> I understand that the cause for that is the program is operating on part
> of the area currently and it shouldn't, but still I would expect memcmp
> to at least always go through the whole area it is expected to check and
> not stop short on a fluke and return a "random" result.

I am sympathetic to this viewpoint, but a random result, or undefined
behavior, is really not that unrealistic if you consider more complex C
code with data races, that may be subject to more compiler optimization
than one would expect for memcmp or such.

Just to be clear, I don't want to be simply nitpicking here, nor say
that the standard is the only true design choice, or something like
that.  I just think that it's dangerous to start considering some data
races to be "benign" because that's a truly slippery slope and the
reasoning required for that can be really complicated.
  
Torvald Riegel June 19, 2015, 5:40 p.m. UTC | #18
On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote:
> On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote:
> > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > > > 
> > > > 
> > > > > > Clearly people are better using atomic comparisons on canary values
> > > > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > > > memory is clearly different) and keep these things working, so why not
> > > > > > do it?
> > > > > 
> > > > > I see two separate issues here.  First, where do we draw the line, and
> > > > > what do we guarantee.  I strongly believe that programs must not have
> > > > > data races, and that they should use atomics or other synchronization
> > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > > > example).
> > > > > Second, do we try to keep buggy programs working.  If it has no cost to
> > > > > do so (e.g., like it might be true in this case), then doing that can
> > > > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > > > shouldn't get fixed eventually.
> > > > 
> > > > I find it difficult to understand the boundaries of what the C library
> > > > is allowed to do in this type of optimisation.
> > > > 
> > > > For example, consider the following:
> > > > 
> > > >     char a[128];
> > > >     char b[128];
> > > > 
> > > >     put some static data in a[0-63]
> > > >     put some static data in b[0-63]
> > > >     a[64]=0;
> > > >     b[64]=0;
> > > >     start a thread doing stuff in a[65..]
> > > >     start a thread doing stuff in b[65..]
> > > >     
> > > >     if (!strcmp(a,b)) {
> > > >       /* Do something */
> > > >     }
> > > > 
> > > >    a) Is that behaviour defined?
> > > 
> > > Good question.  I think it should be.  This depends on both the data
> > > race definition and what strcmp/strncmp/memcmp are specified to do using
> > > the abstract machine.
> > > 
> > > The data race definition uses memory locations as granularity, which is
> > > in 3.14 in C11.  Separate characters in an array should be separate
> > > memory locations.
> > > 
> > > C11 isn't very specific regarding strcmp, and just says that it
> > > "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> > > regarding basic_string::compare (21.4.7.9), saying that first the lengths
> > > of the strings are determined, and then a strncmp is run using the
> > > smaller of the two lengths.
> > > 
> > > Assuming the C++ specs, only the array indices [0..64] should be
> > > accessed by the abstract machine.  So no data race with the stuff going
> > > on in [65..).
> > > 
> > > >    b) Is it defined with strncmp(a,b,64) ?
> > > 
> > > Yes.
> > > 
> > > >    c) Is it defined with strncmp(a,b,128)?
> > > 
> > > Not sure.  C11 says that not more than "n characters" are compared, and
> > > characters that follow a null character aren't compared either.
> > > This indicates it wouldn't be different from strncmp(a,b,64) in the
> > > particular case.
> > > Regarding C++11, I'm not sure.  The closest equivalent copies a substring
> > > (conceptually) and then compares, but the substring copying has to
> > > determine the length of the string and then subtract the max length.
> > > This would do a strlen first, which wouldn't access past index 64.
> > > Thus, should be fine too.
> > > 
> > > >    d) Is it defined with memcmp(a,b,64)?
> > > 
> > > No data race, IMO.
> > > 
> > > >    e) Is it defined with memcmp(a,b,128)?
> > > 
> > > Data race.  Undefined behavior.
> > > 
> > > >    f) If I moved that boundary off a nice round % 8 mark would
> > > >       it matter?
> > > 
> > > No difference as far as the standard is concerned.
> > > 
> > > > I can imagine there may be lots of things that terminate
> > > > a string and let other stuff happen in the allocated space
> > > > after the end of the string in the belief that at the end
> > > > of that string all is unknown.   Andrea's case is a bit different
> > > > in that it's the later data that's static, but that doesn't
> > > > sound like it should change the answer as to what's allowed.
> > > 
> > > I think it does, because the question is whether there is a data race on
> > > the memory locations that the abstract machine would access.
> > 
> > Well, given we are making examples.
> > 
> > Assume 2 structures like this:
> > 
> > struct test {
> >     void *pointer;
> >     char start[16];
> >     char end[240];
> > };
> > 
> > and
> > 
> > struct test a;
> > struct test b;
> > 
> > memset(a.end, 1, 240);
> > memset(b.end, 2, 240);
> > 
> > In what case would it be expected/legal for
> > memcmp(a.start, b.start, 256) to ever return 0?
> 
> Just to be clear, if there are data-races in [ab].pointer or [ab].start
> I see absolutely no problem in stopping short and returning a
> positive/negative value. I just do not see how returning 0 would ever
> make sense.

It would be legal if the invocation of memcmp and at least one of the
invocations of memset were concurrent (i.e., not ordered by
happens-before).

I do see that this wouldn't be an intuitive outcome if one reasons based
on assuming some simplified (or straightforward, or natural, or whatever
the personal opinion might be...) implementation.  But a data race is
undefined behavior, and that's where we're drawing the line.
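
To make this concrete, here is a minimal sketch of the kind of caller we
are discussing (hypothetical code, not Andrea's testcase; whether the 0
ever shows up in practice depends on the memcmp implementation):

#include <pthread.h>
#include <string.h>
#include <stdio.h>

static char a[256], b[256];
static volatile int stop;

/* Racy stores to a[0] only; a[16..255] and b[16..255] differ the
 * whole time.  The single racing byte makes the memcmp below UB. */
static void *writer(void *arg)
{
    (void)arg;
    while (!stop) {
        *(volatile char *)&a[0] = 1;
        *(volatile char *)&a[0] = 0;
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    memset(a + 16, 1, 240);    /* permanent difference in [16..255] */
    memset(b + 16, 2, 240);
    pthread_create(&t, NULL, writer, NULL);
    for (long i = 0; i < 100000000; i++) {
        if (!memcmp(a, b, 256)) {    /* "cannot happen" intuitively */
            printf("EqualHappy: memcmp returned 0 at iteration %ld\n", i);
            break;
        }
    }
    stop = 1;
    pthread_join(t, NULL);
    return 0;
}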
  
Simo Sorce June 19, 2015, 5:55 p.m. UTC | #19
On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote:
> [...]
> > > In what case it would be expected/legal for 
> > > memcmp(a.start, b.start, 256); to ever return 0 ?
> > 
> > Just to be clear, if there are data-races in [ab].pointer or [ab].start
> > I see absolutely no problem in stopping short and returning a
> > positive/negative value. I just do not see how returning 0 would ever
> > make sense.
> 
> It would be legal if the invocation of memcmp and at least one of the
> invocations of memset were concurrent (i.e., not ordered by
> happens-before).

Nope, sorry, there is a misunderstanding: consider the memsets part of
initialization, *never* run in parallel with memcmp.

> I do see that this wouldn't be an intuitive outcome if one reasons based
> on assuming some simplified (or straightforward, or natural, or whatever
> the personal opinion might be...) implementation.  But a data race is
> undefined behavior, and that's where we're drawing the line.

Sure, I can understand that if the *whole* area under consideration is
racing. But here we are talking about a large chunk of the area that is
different and *never* raced upon.

I can accept that the C standard may still decide this is undefined
behavior even if just one small part of the total area sees concurrent
modifications.  But again, you are returning "short" from the optimized
code, so it is unclear to me how you can return 0 when the memcmp code
hasn't checked all the memory.  Wouldn't it make more sense to return
the result at the current position?  It is still undefined, but at
least it won't cause false positives.
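
Something like the following sketch has the property I mean; it is
illustrative only, not the optimized glibc routine:

#include <stddef.h>

/* Illustrative only: early exit on a mismatch is fine, but 0 is
 * returned only after every one of the n bytes has been examined. */
int memcmp_no_false_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i;

    for (i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return p1[i] < p2[i] ? -1 : 1;    /* stop short, nonzero */
    return 0;    /* all n bytes really were equal when read */
}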

Simo.
  
Dr. David Alan Gilbert June 19, 2015, 6:06 p.m. UTC | #21
* Torvald Riegel (triegel@redhat.com) wrote:
> On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> [...]
> >    d) Is it defined with memcmp(a,b,64)?
> 
> No data race, IMO.

OK, so you mean that's defined behavior?

> >    e) Is it defined with memcmp(a,b,128)?
> 
> Data race.  Undefined behavior.

OK, so it's interesting that your answer to (e) is different from (c);
aren't there compiler optimisations that replace some strcmp calls with
memcmp calls? (I'm not sure whether that would happen in this case.)

> >    f) If I moved that boundary off a nice round % 8 mark would
> >       it matter?
> 
> No difference as far as the standard is concerned.

OK, and I've not looked at this x86 code in detail, but where the
memcmp length finishes in the middle of a word, are we sure the bytes
after the end can't cause a match/mismatch that could later change and
be classed as undefined?


Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
  
Torvald Riegel June 20, 2015, 12:43 p.m. UTC | #22
On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote:
> [...]
> Nope, sorry, there is a misunderstanding: consider the memsets part of
> initialization, *never* run in parallel with memcmp.

Ah, I see.  I'm not really sure what the rules are in this case; my
guess would be that this is not UB, because the access is not beyond the
object that is struct test.

Perhaps this survey is interesting to you:  https://goo.gl/2TY0H5

> > I do see that this wouldn't be an intuitive outcome if one reasons based
> > on assuming some simplified (or straightforward, or natural, or whatever
> > the personal opinion might be...) implementation.  But a data race is
> > undefined behavior, and that's where we're drawing the line.
> 
> > Sure, I can understand that if the *whole* area under consideration is
> racing. But here we are talking about a large chunk of the area that is
> different and *never* raced upon.

Well, undefined behavior is not constrained to some "neighborhood" or
such around the cause of the undefined behavior.
It makes sense that the standard doesn't constrain it, because then it
would have to guarantee the constraints and those would require
reasoning about a particular implementation, which a language standard
is meant to prevent :)
Consider a hypothetical GPU that wants to run C code.  It accepts code
in a virtual ISA that requires using special atomic operations for all
memory accesses that would be data races otherwise.  The implementation
of this GPU uses a fast JIT compiler to transform to the native ISA of
the GPU.  Now this JIT compiler detects in a program that a whole page
is compared using memcmp (or a custom memcmp using non-atomics).  It
decides to move this page into local memory, assuming because of the
data-race-freedom requirement that no other thread will write to the
page.  After the memcmp, it will move it back.  But your program isn't
data-race-free, and another thread writes to one byte in the page.  This
byte is then lost when the page is moved back from local memory.

This is a made-up example (although such a virtual ISA is not
unrealistic) and maybe this implementation would try to fail gracefully
instead of just losing the update.  But I think it helps show that it
isn't quite clear in general which constraints would be realistic for
undefined behavior.

> I can accept that the C standard may still decide this is undefined
> behavior even if just one small part of the total area sees concurrent
> modifications.  But again, you are returning "short" from the optimized
> code, so it is unclear to me how you can return 0 when the memcmp code
> hasn't checked all the memory.  Wouldn't it make more sense to return
> the result at the current position?  It is still undefined, but at
> least it won't cause false positives.

Yes, that's the second issue: trying to fail gracefully or mask failures in
buggy programs, provided the runtime overheads and maintenance costs
aren't too high.
  
Torvald Riegel June 20, 2015, 1 p.m. UTC | #23
On Fri, 2015-06-19 at 19:06 +0100, Dr. David Alan Gilbert wrote:
> * Torvald Riegel (triegel@redhat.com) wrote:
> > [...]
> > >    d) Is it defined with memcmp(a,b,64)?
> > 
> > No data race, IMO.
> 
> OK, so you mean that's defined behavior?

Yes, there's no undefined behavior, so memcmp should do what it's
specified to do in the normal case.

> > >    e) Is it defined with memcmp(a,b,128)?
> > 
> > Data race.  Undefined behavior.
> 
> OK, so it's interesting that your answer to (e) is different from (c);
> aren't there compiler optimisations that replace some strcmp calls with
> memcmp calls? (I'm not sure whether that would happen in this case.)

I believe it only does that when the length of the strings is known.

> > >    f) If I moved that boundary off a nice round % 8 mark would
> > >       it matter?
> > 
> > No difference as far as the standard is concerned.
> 
> OK, and I've not looked at this x86 code in detail, but where the
> memcmp length finishes in the middle of a word, are we sure the bytes
> after the end can't cause a match/mismatch that could later change and
> be classed as undefined?

I haven't looked at the code either, but this would be a bug.  It could
also introduce data races, which the implementation must not do :)
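
For illustration, a word-wise loop can avoid that kind of bug by never
touching bytes past n; this is only a sketch, not the sse4 code:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

int memcmp_sketch(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i = 0;

    /* Whole words first; memcpy keeps the loads alignment-safe. */
    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t w1, w2;
        memcpy(&w1, p1 + i, sizeof w1);
        memcpy(&w2, p2 + i, sizeof w2);
        if (w1 != w2)
            break;    /* locate the differing byte below */
    }
    /* Tail (and mismatch localization) in bytes: never reads
     * anything at or beyond p1 + n / p2 + n. */
    for (; i < n; i++)
        if (p1[i] != p2[i])
            return p1[i] < p2[i] ? -1 : 1;
    return 0;
}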
  
Dr. David Alan Gilbert June 22, 2015, 5:43 p.m. UTC | #24
* Torvald Riegel (triegel@redhat.com) wrote:
> [...]
> Well, undefined behavior is not constrained to some "neighborhood" or
> such around the cause of the undefined behavior.
> It makes sense that the standard doesn't constrain it, because then it
> would have to guarantee the constraints and those would require
> reasoning about a particular implementation, which a language standard
> is meant to prevent :)
> Consider a hypothetical GPU that wants to run C code.  It accepts code
> in a virtual ISA that requires using special atomic operations for all
> memory accesses that would be data races otherwise.  The implementation
> of this GPU uses a fast JIT compiler to transform to the native ISA of
> the GPU.  Now this JIT compiler detects in a program that a whole page
> is compared using memcmp (or a custom memcmp using non-atomics).  It
> decides to move this page into local memory, assuming because of the
> data-race-freedom requirement that no other thread will write to the
> page.  After the memcmp, it will move it back.  But your program isn't
> data-race-free, and another thread writes to one byte in the page.  This
> byte is then lost when the page is moved back from local memory.
> 
> This is a made-up example (although such a virtual ISA is not
> unrealistic) and maybe this implementation would try to fail gracefully
> instead of just losing the update.  But I think it helps show that it
> isn't quite clear in general which constraints would be realistic for
> undefined behavior.

Isn't that illegal, since the parameters to memcmp are 'const void *',
so the memory shouldn't be written?

Going back to an earlier part of the thread; I came up with a reason
why I really wouldn't want these routines to 'assert()'.

In emulation/virtualisation we generally haven't got a clue what our
guests are doing, and we don't want to stop them most of the time.
So there are a couple of fairly common types of structure that
break these rules:

1) (e.g. framebuffer emulation)
  (Many untrusted threads writing to memory)
  One control thread
     do {
       read memory -> display it
     } while (!end)
     barrier;
     display it once more

  That 'display it' can be quite reasonably expected to get
inconsistent data, but at the higher level it will be safe because
we know we're always going to do one more 'display it' at the end
when things settle down.

similarly.
2) (VM migration)

   Many untrusted threads
      do {
        n=some-value;
        change some data[n];
        write_barrier();
        atomically_set(dirty[n]);
      }

    One control thread
      do {
        for (all i) {
          read_barrier();
          if (atomic_read(dirty[i])) {
              send(data[i]);
          }
        }
      } while (!end)
      barrier();
      for (all i) {
        if (atomic_read(dirty[i])) {
            send(data[i]);
        }
      }

   and often that 'send' process involves looking for compressible data;
I had a look and I don't think we're currently using libc routines
in there on the data for the comparison in (2), but I'm not 100% sure
about (1).

In these cases we expect the data to be inconsistent for some period of
time, but guarantee eventual consistency, since we know we're going to read it
again at some better time.  We have no sensible way to stop all the
guest threads writing to memory.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
  
Torvald Riegel June 23, 2015, 8:36 a.m. UTC | #25
On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote:
> [...]
> Isn't that illegal, since the parameters to memcmp are 'const void *',
> so the memory shouldn't be written?

That's what your program guarantees, but it's not automatically a
constraint on what the implementation does.  If the compiler can detect
that the memory is completely managed and provided by the implementation
(e.g., never accessed through volatile accesses, never escaping to
functions the compiler isn't sure won't access it through
volatile, ...), then the additional writes aren't visible in terms of
observable behavior of the abstract machine.

> Going back to an earlier part of the thread; I came up with a reason
> why I really wouldn't want these routines to 'assert()'.

So, put generally, you don't want to fail fast because there is code
that is not data-race-free but seems to work right now?
I think that this would be a convincing reason.

> In emulation/virtualisation we generally haven't got a clue what our
> guests are doing, and we don't want to stop them most of the time.
> So there are a couple of fairly common types of structure that
> break these rules:
> 
> 1) (e.g. framebuffer emulation)
>   (Many untrusted threads writing to memory)
>   One control thread
>      do {
>        read memory -> display it
>      } while (!end)
>      barrier;
>      display it once more
> 
>   That 'display it' can be quite reasonably expected to get
> inconsistent data, but at the higher level it will be safe because
> we know we're always going to do one more 'display it' at the end
> when things settle down.

Using atomic (memory_order_relaxed) accesses for reads of "memory" (and
"end", I suppose) would clean this up.  The barrier is am
memory_order_acquire fence, I suppose?
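
Concretely, the display loop from (1) could look like this sketch (the
names are made up, and C11 atomics are assumed):

#include <stdatomic.h>
#include <stddef.h>

#define FB_SIZE 4096

/* Hypothetical state: the shared framebuffer and the stop flag. */
extern _Atomic unsigned char fb[FB_SIZE];
extern atomic_bool done;

extern void display(const unsigned char *buf, size_t n);

void control_thread(void)
{
    unsigned char snap[FB_SIZE];
    size_t i;

    do {
        /* Relaxed loads: the data is still racy, but there is no
         * undefined behavior anymore. */
        for (i = 0; i < FB_SIZE; i++)
            snap[i] = atomic_load_explicit(&fb[i], memory_order_relaxed);
        display(snap, FB_SIZE);
    } while (!atomic_load_explicit(&done, memory_order_relaxed));

    atomic_thread_fence(memory_order_acquire);    /* the "barrier" */
    for (i = 0; i < FB_SIZE; i++)    /* one more, now-settled pass */
        snap[i] = atomic_load_explicit(&fb[i], memory_order_relaxed);
    display(snap, FB_SIZE);
}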

> similarly.
> 2) (VM migration)
> 
>    Many untrusted threads
>       do {
>         n=some-value;
>         change some data[n];
>         write_barrier();
>         atomically_set(dirty[n]);

Same as above, using relaxed atomic accesses for data and a
memory_order_release store for setting dirty (or use relaxed MO and keep
the barrier / release fence) would clean this up.

>       }
> 
>     One control thread
>       do {
>         for (all i) {
>           read_barrier();

Hmm, why is the barrier here, not between the loads from dirty and data?

>           if (atomic_read(dirty[i])) {
>               send(data[i]);
>           }
>         }
>       } while (!end)
>       barrier();
>       for (all i) {
>         if (atomic_read(dirty[i])) {
>             send(data[i]);
>         }
>       }
> 
>    and often that 'send' process involves looking for compressible data;
> I had a look and I don't think we're currently using libc routines
> in there on the data for the comparison in (2), but I'm not 100% sure
> about (1).
> 
> In these cases we expect the data to be inconsistent for some period of
> time, but guarantee the consistency since we know we're going to read it
> again at some better time.  We have no sensible way to stop all the
> guest threads writing to memory.

The uncontrolled writes of the guest threads are not really covered by
the memory model, but if we assume that they are effectively a
collection of memory_order_relaxed atomic stores of some granularity,
then you can "avoid" the data race if using relaxed atomic loads at the
consumer side too.  And then it's just using compatible fences and
atomic accesses for "end" and "dirty" to make this work cleanly.
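
The consumer side, matching the writer sketch above (send() is a
placeholder), could be as follows; the acquire load of dirty[i] takes the
place of the read barrier and sits exactly between the dirty check and
the data loads:

  #include <stdatomic.h>

  #define N    1024
  #define PAGE 4096

  extern _Atomic(unsigned char) data[N][PAGE];
  extern atomic_bool dirty[N];

  void send(const unsigned char *buf, int len);

  static void scan_pass(void)
  {
      static unsigned char buf[PAGE];
      int i, j;

      for (i = 0; i < N; i++) {
          /* Acquire load: pairs with the writer's release store, so the
             page contents written before dirty[i] was set are visible.  */
          if (atomic_load_explicit(&dirty[i], memory_order_acquire)) {
              for (j = 0; j < PAGE; j++)
                  buf[j] = atomic_load_explicit(&data[i][j],
                                                memory_order_relaxed);
              send(buf, PAGE);
          }
      }
  }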
  
Dr. David Alan Gilbert June 23, 2015, 9:02 a.m. UTC | #26
* Torvald Riegel (triegel@redhat.com) wrote:
> On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote:
> > * Torvald Riegel (triegel@redhat.com) wrote:
> > > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote:
> > > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote:
> > > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote:
> > > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote:
> > > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> > > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > > > > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > > Clearly people are better off using atomic comparisons on canary values
> > > > > > > > > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > > > > > > > > memory is clearly different) and keep these things working, so why not
> > > > > > > > > > > do it?
> > > > > > > > > > 
> > > > > > > > > > I see two separate issues here.  First, where do we draw the line, and
> > > > > > > > > > what do we guarantee.  I strongly believe that programs must not have
> > > > > > > > > > data races, and that they should use atomics or other synchronization
> > > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > > > > > > > > example).
> > > > > > > > > > Second, do we try to keep buggy programs working.  If it has no cost to
> > > > > > > > > > do so (e.g., like it might be true in this case), then doing that can
> > > > > > > > > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > > > > > > > > shouldn't get fixed eventually.
> > > > > > > > > 
> > > > > > > > > I find it difficult to understand the boundaries of what the C library
> > > > > > > > > is allowed to do in this type of optimisation.
> > > > > > > > > 
> > > > > > > > > For example, consider the following:
> > > > > > > > > 
> > > > > > > > >     char a[128];
> > > > > > > > >     char b[128];
> > > > > > > > > 
> > > > > > > > >     put some static data in a[0-63]
> > > > > > > > >     put some static data in b[0-63]
> > > > > > > > >     a[64]=0;
> > > > > > > > >     b[64]=0;
> > > > > > > > >     start a thread doing stuff in a[65..]
> > > > > > > > >     start a thread doing stuff in b[65..]
> > > > > > > > >     
> > > > > > > > >     if (!strcmp(a,b)) {
> > > > > > > > >       /* Do something */
> > > > > > > > >     }
> > > > > > > > > 
> > > > > > > > >    a) Is that behaviour defined?
> > > > > > > > 
> > > > > > > > Good question.  I think it should be.  This depends on both the data
> > > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using
> > > > > > > > the abstract machine.
> > > > > > > > 
> > > > > > > > The data race definition uses memory locations as granularity, which is
> > > > > > > > in 3.14 in C11.  Separate characters in an array should be separate
> > > > > > > > memory locations.
> > > > > > > > 
> > > > > > > > C11 isn't very specific regarding strcmp, and just says that it
> > > > > > > > "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> > > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length
> > > > > > > > of the strings are determined, and then a strncmp is run using the
> > > > > > > > smaller of the two lengths.
> > > > > > > > 
> > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be
> > > > > > > > accessed by the abstract machine.  So no data race with the stuff going
> > > > > > > > on in [65..).
> > > > > > > > 
> > > > > > > > >    b) Is it defined with strncmp(a,b,64) ?
> > > > > > > > 
> > > > > > > > Yes.
> > > > > > > > 
> > > > > > > > >    c) Is it defined with strncmp(a,b,128)?
> > > > > > > > 
> > > > > > > > Not sure.  C11 says that not more than "n characters" are compared, and
> > > > > > > > characters that follow a null character aren't compared either.
> > > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the
> > > > > > > > particular case.
> > > > > > > > Regarding C++11, I'm not sure.  The closest copies a substring
> > > > > > > > (conceptually) and then compares, but the substring copying has to
> > > > > > > > determine the length of the string and then clamp it to the max length.
> > > > > > > > This would do a strlen first, which wouldn't access past index 64.
> > > > > > > > Thus, should be fine too.
> > > > > > > > 
> > > > > > > > >    d) Is it defined with memcmp(a,b,64)?
> > > > > > > > 
> > > > > > > > No data race, IMO.
> > > > > > > > 
> > > > > > > > >    e) Is it defined with memcmp(a,b,128)?
> > > > > > > > 
> > > > > > > > Data race.  Undefined behavior.
> > > > > > > > 
> > > > > > > > >    f) If I moved that boundary off a nice round % 8 mark would
> > > > > > > > >       it matter?
> > > > > > > > 
> > > > > > > > No difference as far as the standard is concerned.
> > > > > > > > 
> > > > > > > > > I can imagine there may be lots of things that terminate
> > > > > > > > > a string and let other stuff happen in the allocated space
> > > > > > > > > after the end of the string in the belief that at the end
> > > > > > > > > of that string all is unknown.   Andrea's case is a bit different
> > > > > > > > > in that it's the later data that's static, but that doesn't
> > > > > > > > > sound like it should change the answer as to what's allowed.
> > > > > > > > 
> > > > > > > > I think it does, because the question is whether there is a data race on
> > > > > > > > the memory locations that the abstract machine would access.
> > > > > > > 
> > > > > > > Well given we are making examples.
> > > > > > > 
> > > > > > > Assume 2 structures like this:
> > > > > > > 
> > > > > > > struct test {
> > > > > > >     void *pointer;
> > > > > > >     char start[16];
> > > > > > >     char end[240];
> > > > > > > }
> > > > > > > 
> > > > > > > and
> > > > > > > 
> > > > > > > struct test a;
> > > > > > > struct test b;
> > > > > > > 
> > > > > > > memset(a.end, 1, 240);
> > > > > > > memset(b.end, 2, 240);
> > > > > > > 
> > > > > > > In what case would it be expected/legal for
> > > > > > > memcmp(a.start, b.start, 256) to ever return 0?
> > > > > > 
> > > > > > Just to be clear, if there are data-races in [ab].pointer or [ab].start
> > > > > > I see absolutely no problem in stopping short and returning a
> > > > > > positive/negative value. I just do not see how returning 0 would ever
> > > > > > make sense.
> > > > > 
> > > > > It would be legal if the invocation of memcmp and at least one of the
> > > > > invocations of memset would be concurrent (ie, not ordered by
> > > > > happens-before).
> > > > 
> > > > Nope, sorry, there is a misunderstanding, consider the memsets part of
> > > > initialization, *never* run in parallel with memcmp.
> > > 
> > > Ah, I see.  I'm not really sure what the rules are in this case; my
> > > guess would be that this is not UB, because the access is not beyond the
> > > object that is struct test.
> > > 
> > > Perhaps this survey is interesting for you:  https://goo.gl/2TY0H5
> > > 
> > > > > I do see that this wouldn't be an intuitive outcome if one reasons based
> > > > > on assuming some simplified (or straightforward, or natural, or whatever
> > > > > the personal opinion might be...) implementation.  But a data race is
> > > > > undefined behavior, and that's where we're drawing the line.
> > > > 
> > > > Sure, I can understand that if the *whole* area under consideration is
> > > > racing. But here we are talking about a large chunk of the area that is
> > > > different and *never* raced upon.
> > > 
> > > Well, undefined behavior is not constrained to some "neighborhood" or
> > > such around the cause of the undefined behavior.
> > > It makes sense that the standard doesn't constrain it, because then it
> > > would have to guarantee the constraints and those would require
> > > reasoning about a particular implementation, which a language standard
> > > is meant to prevent :)
> > > Consider a hypothetical GPU that wants to run C code.  It accepts code
> > > in a virtual ISA that requires using special atomic operations for all
> > > memory accesses that would be data races otherwise.  The implementation
> > > of this GPU uses a fast JIT compiler to transform to the native ISA of
> > > the GPU.  Now this JIT compiler detects in a program that a whole page
> > > is compared using memcmp (or a custom memcmp using non-atomics).  It
> > > decides to move this page into local memory, assuming because of the
> > > data-race-freedom requirement that no other thread will write to the
> > > page.  After the memcmp, it will move it back.  But your program isn't
> > > data-race-free, and another thread writes to one byte in the page.  This
> > > byte is then lost when the page is moved back from local memory.
> > > 
> > > This is a made-up example (although such a virtual ISA is not
> > > unrealistic) and maybe this implementation would try to fail gracefully
> > > instead of just losing the update.  But I think it helps show that it
> > > isn't quite clear in general which constraints would be realistic for
> > > undefined behavior.
> > 
> > Isn't that illegal since the parameters to memcmp are 'const void *',
> > so it shouldn't be written?
> 
> That's what your program guarantees, but it's not automatically a
> constraint on what the implementation does.  If the compiler can detect
> that the memory is completely managed and provided by the implementation
> (e.g., never accessed through volatile accesses, never escaping to
> functions that the compiler can't be sure won't access it through
> volatile, ...), then the additional writes aren't visible in terms of
> observable behavior of the abstract machine.

However, this is a library interface; the implementation of memcmp in the
library can't get that much information from the compiler, so the standard
memcmp implementation can't do that write: given that the parameter is
const, the address space it's comparing may be read-only.

> > Going back to an earlier part of the thread: I came up with a reason
> > why I really wouldn't want these routines to 'assert()'.
> 
> So, put generally, you don't want to fail fast because there is code
> that is not data-race-free but seems to work right now?
> I think that this would be a convincing reason.

:-)  It's not a 'seems to work' - the design of both of these algorithms
is specifically allowing data races to happen with the intent of containing
the effect of the races.  The racy reads are expected to produce indeterminate
data, but the algorithm knows that, if that happens, the data is always read
again in the future at a time when a race can't happen.  This is significantly
faster than trying to stop all the other threads whenever the read is
taking place.

> > In emulation/virtualisation we generally haven't got a clue what our
> > guests are doing, and we don't want to stop them most of the time.
> > So there are a couple of fairly common types of structure that
> > break these rules:
> > 
> > 1) (e.g. framebuffer emulation)
> >   (Many untrusted threads writing to memory)
> >   One control thread
> >      do {
> >        read memory -> display it
> >      } while (!end)
> >      barrier;
> >      display it once more
> > 
> >   That 'display it' can be quite reasonably expected to get
> > inconsistent data, but at the higher level it will be safe because
> > we know we're always going to do one more 'display it' at the end
> > when things settle down.
> 
> Using atomic (memory_order_relaxed) accesses for reads of "memory" (and
> "end", I suppose) would clean this up.

  1a) Note that this data isn't necessarily word aligned etc - so I'm not
      sure why using atomic accesses would help here.  Also, my problem here
      is that I have no efficient C library call to use that is legal
      because by your definition of 'undefined' any library call might
      assert or do anything else - even if I don't care about the data
      they produce.
>  The barrier is a
> memory_order_acquire fence, I suppose?
  1b) Actually I've not found the barrier yet; I was just boiling these
      examples down from code I know we've got - I'm more familiar with
      the 2nd case.

> > similarly.
> > 2) (VM migration)
> > 
> >    Many untrusted threads
> >       do {
> >         n=some-value;
> >         change some data[n];
> >         write_barrier();
> >         atomically_set(dirty[n]);
> 
> Same as above, using relaxed atomic accesses for data and a
> memory_order_release store for setting dirty (or use relaxed MO and keep
> the barrier / release fence) would clean this up.

   Note the thing changing data[n] is totally out of the control of our
code; it's just normal applications, user land code; we've got no control
over it at all - it's using normal writes, it could be malicious.
 The marking of dirty[n] is something that the kernel does for us, and we
do get control over.

> >       }
> > 
> >     One control thread
> >       do {
> >         for (all i) {
> >           read_barrier();
> 
> Hmm, why is the barrier here, not between the loads from dirty and data?

Yes, that probably would be best.

> >           if (atomic_read(dirty[i])) {
> >               send(data[i]);
> >           }
> >         }
> >       } while (!end)
> >       barrier();
> >       for (all i) {
> >         if (atomic_read(dirty[i])) {
> >             send(data[i]);
> >         }
> >       }
> > 
> >    and often that 'send' process involves looking for compressible data;
> > I had a look and I don't think we're currently using libc routines
> > in there on the data for the comparison in (2), but I'm not 100% sure
> > about (1).
> > 
> > In these cases we expect the data to be inconsistent for some period of
> > time, but guarantee the consistency since we know we're going to read it
> > again at some better time.  We have no sensible way to stop all the
> > guest threads writing to memory.
> 
> The uncontrolled writes of the guest threads are not really covered by
> the memory model, but if we assume that they are effectively a
> collection of memory_order_relaxed atomic stores of some granularity,
> then you can "avoid" the data race by using relaxed atomic loads at the
> consumer side too.  And then it's just using compatible fences and
> atomic accesses for "end" and "dirty" to make this work cleanly.

Again my problem, and why I'm included in this thread, is the use of C library
routines; if I want to copy that data[n] as part of the send, I'd expect
to be able to use a standard memcpy() call, and if I want to see if it's the
same as the previous version I sent, I'd want memcmp() - hence the reason
I mention it.  (Actually, the way our compression of data[n] works, the
comparison in this case doesn't quite fit a memcmp or memchr.)

In short I want the benefit of the highly optimised library routines,
knowing that the actual results they return for data that's changing under
them might be wrong, but with the knowledge they're not going to blow up.
Hence I changed my mind and said assert() is unreasonable.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
  
Torvald Riegel June 23, 2015, 9:41 a.m. UTC | #27
On Tue, 2015-06-23 at 10:02 +0100, Dr. David Alan Gilbert wrote:
> * Torvald Riegel (triegel@redhat.com) wrote:
> > On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote:
> > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote:
> > > > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote:
> > > > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote:
> > > > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote:
> > > > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote:
> > > > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote:
> > > > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote:
> > > > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > > > > > > > > > > is not impacted and I haven't thought about how to do that.
> > > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > > Clearly people are better off using atomic comparisons on canary values
> > > > > > > > > > > > instead, but it seems easy to avoid false positives (returning 0 when
> > > > > > > > > > > > memory is clearly different) and keep these things working, so why not
> > > > > > > > > > > > do it?
> > > > > > > > > > > 
> > > > > > > > > > > I see two separate issues here.  First, where do we draw the line, and
> > > > > > > > > > > what do we guarantee.  I strongly believe that programs must not have
> > > > > > > > > > > data races, and that they should use atomics or other synchronization
> > > > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > > > > > > > > > > example).
> > > > > > > > > > > Second, do we try to keep buggy programs working.  If it has no cost to
> > > > > > > > > > > do so (e.g., like it might be true in this case), then doing that can
> > > > > > > > > > > help to trigger fewer errors.  But that doesn't mean the buggy programs
> > > > > > > > > > > shouldn't get fixed eventually.
> > > > > > > > > > 
> > > > > > > > > > I find it difficult to understand the boundaries of what the C library
> > > > > > > > > > is allowed to do in this type of optimisation.
> > > > > > > > > > 
> > > > > > > > > > For example, consider the following:
> > > > > > > > > > 
> > > > > > > > > >     char a[128];
> > > > > > > > > >     char b[128];
> > > > > > > > > > 
> > > > > > > > > >     put some static data in a[0-63]
> > > > > > > > > >     put some static data in b[0-63]
> > > > > > > > > >     a[64]=0;
> > > > > > > > > >     b[64]=0;
> > > > > > > > > >     start a thread doing stuff in a[65..]
> > > > > > > > > >     start a thread doing stuff in b[65..]
> > > > > > > > > >     
> > > > > > > > > >     if (!strcmp(a,b)) {
> > > > > > > > > >       /* Do something */
> > > > > > > > > >     }
> > > > > > > > > > 
> > > > > > > > > >    a) Is that behaviour defined?
> > > > > > > > > 
> > > > > > > > > Good question.  I think it should be.  This depends on both the data
> > > > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using
> > > > > > > > > the abstract machine.
> > > > > > > > > 
> > > > > > > > > The data race definition uses memory locations as granularity, which is
> > > > > > > > > in 3.14 in C11.  Separate characters in an array should be separate
> > > > > > > > > memory locations.
> > > > > > > > > 
> > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it
> > > > > > > > > "compares the string[s]" (7.24.4.2).  C++14 is a bit more specific
> > > > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length
> > > > > > > > > of the strings are determined, and then a strncmp is run using the
> > > > > > > > > smaller of the two lengths.
> > > > > > > > > 
> > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be
> > > > > > > > > accessed by the abstract machine.  So no data race with the stuff going
> > > > > > > > > on in [65..).
> > > > > > > > > 
> > > > > > > > > >    b) Is it defined with strncmp(a,b,64) ?
> > > > > > > > > 
> > > > > > > > > Yes.
> > > > > > > > > 
> > > > > > > > > >    c) Is it defined with strncmp(a,b,128)?
> > > > > > > > > 
> > > > > > > > > Not sure.  C11 says that not more than "n characters" are compared, and
> > > > > > > > > characters that follow a null character aren't compared either.
> > > > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the
> > > > > > > > > particular case.
> > > > > > > > > Regarding C++11, I'm not sure.  The closest copies a substring
> > > > > > > > > (conceptually) and then compares, but the substring copying has to
> > > > > > > > > determine the length of the string and then clamp it to the max length.
> > > > > > > > > This would do a strlen first, which wouldn't access past index 64.
> > > > > > > > > Thus, should be fine too.
> > > > > > > > > 
> > > > > > > > > >    d) Is it defined with memcmp(a,b,64)?
> > > > > > > > > 
> > > > > > > > > No data race, IMO.
> > > > > > > > > 
> > > > > > > > > >    e) Is it defined with memcmp(a,b,128)?
> > > > > > > > > 
> > > > > > > > > Data race.  Undefined behavior.
> > > > > > > > > 
> > > > > > > > > >    f) If I moved that boundary off a nice round % 8 mark would
> > > > > > > > > >       it matter?
> > > > > > > > > 
> > > > > > > > > No difference as far as the standard is concerned.
> > > > > > > > > 
> > > > > > > > > > I can imagine there may be lots of things that terminate
> > > > > > > > > > a string and let other stuff happen in the allocated space
> > > > > > > > > > after the end of the string in the belief that at the end
> > > > > > > > > > of that string all is unknown.   Andrea's case is a bit different
> > > > > > > > > > in that it's the later data that's static, but that doesn't
> > > > > > > > > > sound like it should change the answer as to what's allowed.
> > > > > > > > > 
> > > > > > > > > I think it does, because the question is whether there is a data race on
> > > > > > > > > the memory locations that the abstract machine would access.
> > > > > > > > 
> > > > > > > > Well given we are making examples.
> > > > > > > > 
> > > > > > > > Assume 2 structures like this:
> > > > > > > > 
> > > > > > > > struct test {
> > > > > > > >     void *pointer;
> > > > > > > >     char start[16];
> > > > > > > >     char end[240];
> > > > > > > > }
> > > > > > > > 
> > > > > > > > and
> > > > > > > > 
> > > > > > > > struct test a;
> > > > > > > > struct test b;
> > > > > > > > 
> > > > > > > > memset(a.end, 1, 240);
> > > > > > > > memset(b.end, 2, 240);
> > > > > > > > 
> > > > > > > > In what case would it be expected/legal for
> > > > > > > > memcmp(a.start, b.start, 256) to ever return 0?
> > > > > > > 
> > > > > > > Just to be clear, if there are data-races in [ab].pointer or [ab].start
> > > > > > > I see absolutely no problem in stopping short and returning a
> > > > > > > positive/negative value. I just do not see how returning 0 would ever
> > > > > > > make sense.
> > > > > > 
> > > > > > It would be legal if the invocation of memcmp and at least one of the
> > > > > > invocations of memset would be concurrent (ie, not ordered by
> > > > > > happens-before).
> > > > > 
> > > > > Nope, sorry, there is a misunderstanding, consider the memsets part of
> > > > > initialization, *never* run in parallel with memcmp.
> > > > 
> > > > Ah, I see.  I'm not really sure what the rules are in this case; my
> > > > guess would be that this is not UB, because the access is not beyond the
> > > > object that is struct test.
> > > > 
> > > > Perhaps this survey is interesting for you:  https://goo.gl/2TY0H5
> > > > 
> > > > > > I do see that this wouldn't be an intuitive outcome if one reasons based
> > > > > > on assuming some simplified (or straightforward, or natural, or whatever
> > > > > > the personal opinion might be...) implementation.  But a data race is
> > > > > > undefined behavior, and that's where we're drawing the line.
> > > > > 
> > > > > Sure, I can understand that if the *whole* area under consideration is
> > > > > racing. But here we are talking about a large chunk of the area that is
> > > > > different and *never* raced upon.
> > > > 
> > > > Well, undefined behavior is not constrained to some "neighborhood" or
> > > > such around the cause of the undefined behavior.
> > > > It makes sense that the standard doesn't constrain it, because then it
> > > > would have to guarantee the constraints and those would require
> > > > reasoning about a particular implementation, which a language standard
> > > > is meant to prevent :)
> > > > Consider a hypothetical GPU that wants to run C code.  It accepts code
> > > > in a virtual ISA that requires using special atomic operations for all
> > > > memory accesses that would be data races otherwise.  The implementation
> > > > of this GPU uses a fast JIT compiler to transform to the native ISA of
> > > > the GPU.  Now this JIT compiler detects in a program that a whole page
> > > > is compared using memcmp (or a custom memcmp using non-atomics).  It
> > > > decides to move this page into local memory, assuming because of the
> > > > data-race-freedom requirement that no other thread will write to the
> > > > page.  After the memcmp, it will move it back.  But your program isn't
> > > > data-race-free, and another thread writes to one byte in the page.  This
> > > > byte is then lost when the page is moved back from local memory.
> > > > 
> > > > This is a made-up example (although such a virtual ISA is not
> > > > unrealistic) and maybe this implementation would try to fail gracefully
> > > > instead of just losing the update.  But I think it helps show that it
> > > > isn't quite clear in general which constraints would be realistic for
> > > > undefined behavior.
> > > 
> > > Isn't that illegal since the parameters to memcmp are 'const void *',
> > > so it shouldn't be written?
> > 
> > That's what your program guarantees, but it's not automatically a
> > constraint on what the implementation does.  If the compiler can detect
> > that the memory is completely managed and provided by the implementation
> > (e.g., never accessed through volatile accesses, never escaping to
> > functions that the compiler can't be sure won't access it through
> > volatile, ...), then the additional writes aren't visible in terms of
> > observable behavior of the abstract machine.
> 
> However, this is a library interface; the implementation of memcmp in the
> library can't get that much information from the compiler, so the standard
> memcmp implementation can't do that write: given that the parameter is
> const, the address space it's comparing may be read-only.

I'm assuming in my example that the compiler isn't calling an opaque
memcmp, but knows about the semantics of memcmp and uses a built-in
memcmp or such.
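
For example (illustrative only; whether this happens depends on the
compiler, target, and options):

  #include <string.h>

  int same_header(const char *a, const char *b)
  {
      /* A compiler that knows memcmp's semantics may expand this inline
         as a word-sized load and compare, or fold it away entirely if it
         can see the buffers' contents; no library call need occur.  */
      return memcmp(a, b, 8) == 0;
  }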

> > > Going back to an earlier part of the thread: I came up with a reason
> > > why I really wouldn't want these routines to 'assert()'.
> > 
> > So, put generally, you don't want to fail fast because there is code
> > that is not data-race-free but seems to work right now?
> > I think that this would be a convincing reason.
> 
> :-)  It's not a 'seems to work'

It is not guaranteed by the language standards to work.  There's no
explicit guarantee by the compiler either.  Therefore I'm putting this
in the "seems to work" category, which isn't meant to be saying that it
doesn't, or would be always fragile in practice.  I'm just trying to be
clear here regarding when we have guaranteed behavior (eg, by the
language standard), and when we don't.

> - the design of both of these algorithms
> is specifically allowing data races to happen with the intent of containing
> the effect of the races.  The racy reads are expected to produce indeterminate
> data, but the algorithm knows that, if that happens, the data is always read
> again in the future at a time when a race can't happen.  This is significantly
> faster than trying to stop all the other threads whenever the read is
> taking place.

Let's try to not overload the term "race".  From the language's memory
model perspective, data races are cases when the program is not
expressing concurrent code correctly.  Using atomics, for example,
avoids this because you tell the compiler that the accesses can be
concurrent with accesses by other threads.  Avoiding data races does not
mean you need to use coarser-grain mutual exclusion such as through
using locks.
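
A minimal made-up example of the distinction:

  #include <stdatomic.h>

  int plain;           /* unsynchronized concurrent access: data race, UB  */
  _Atomic int shared;  /* concurrent relaxed access: no data race, defined */

  int peek(void)
  {
      /* No lock needed; relaxed ordering is enough to avoid the data
         race.  The value may be stale, but the behavior is defined.  */
      return atomic_load_explicit(&shared, memory_order_relaxed);
  }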

At higher levels of abstraction, there can still be race conditions in
the code, of course.  But as you say, you're taking care of these. 

(Just to be clear, I'm nitpicking on the terminology here because IMO,
we need to be clear what kind of "race" we are talking about (e.g.,
whether it's at the higher level of an algorithm or at lower levels
where the compiler is involved).  I do understand what you're trying to
do in these algorithms, and how you want to synchronize.) 

> > > In emulation/virtualisation we generally haven't got a clue what our
> > > guests are doing, and we don't want to stop them most of the time.
> > > So there are a couple of fairly common types of structure that
> > > break these rules:
> > > 
> > > 1) (e.g. framebuffer emulation)
> > >   (Many untrusted threads writing to memory)
> > >   One control thread
> > >      do {
> > >        read memory -> display it
> > >      } while (!end)
> > >      barrier;
> > >      display it once more
> > > 
> > >   That 'display it' can be quite reasonably expected to get
> > > inconsistent data, but at the higher level it will be safe because
> > > we know we're always going to do one more 'display it' at the end
> > > when things settle down.
> > 
> > Using atomic (memory_order_relaxed) accesses for reads of "memory" (and
> > "end", I suppose) would clean this up.
> 
>   1a) Note that this data isn't necessarily word aligned etc - so I'm not
>       sure why using atomic accesses would help here.  Also, my problem here
>       is that I have no efficient C library call to use that is legal
>       because by your definition of 'undefined' any library call might
>       assert or do anything else - even if I don't care about the data
>       they produce.

I agree that what's currently provided by the standard isn't quite
sufficient to cover this use case.  I also think that it's an important
use case, and so we should have a solution for it that is specified
to work correctly (either in the standard, or as a GNU extension).

Specifically, we want to snapshot memory in a data-race-free way, and
the caller makes sure through additional synchronization that it only
uses snapshots that actually happened when the snapshotted data was in a
quiescent state.
(I have the same issue in GCC's libitm (transactional memory runtime
library), and opted to use plain loads there, with a clear comment that
this was outside the standard.)

I'm not yet sure which interface we should offer for such snapshotting.
There's a related proposal being discussed in the C++ committee (Atomic
View), but that's about atomic accesses.  We just need ones that don't
create data races but don't need to be atomic; essentially, we just have
to tell the compiler to not assume data-race-freedom for the snapshot.
Having variants for memcpy, memcmp, etc. might be one way to do it.
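
As a strawman for such a variant (an invented interface, not an existing
or proposed function), here is a volatile-based sketch, stricter than the
plain loads libitm uses but in the same spirit; it is still outside what
the standard guarantees and just shows the shape:

  #include <stddef.h>

  /* Snapshot n bytes that other threads may be writing concurrently.
     The volatile accesses force byte-wise loads that the compiler won't
     widen, refetch, or invent.  The result may mix old and new bytes;
     the caller must revalidate it before trusting it.  */
  static void *memcpy_snapshot(void *dst, const void *src, size_t n)
  {
      unsigned char *d = dst;
      const volatile unsigned char *s = src;

      while (n--)
          *d++ = *s++;
      return dst;
  }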

A first step in your code could be to at least document the cases where
you rely on that.  (IMHO, it would also be useful as documentation for
your existing concurrent code.)
  

Patch

diff --git a/equalhappybug.c b/equalhappybug.c
index 6c5fac5..ed51fd7 100644
--- a/equalhappybug.c
+++ b/equalhappybug.c
@@ -18,7 +18,7 @@ 
 
 static long nr_cpus, page_size;
 #define NON_ZERO_OFFSET 16U
-static char *page, *zeropage;
+static volatile char *page, *zeropage;
 volatile int finished;
 
 static int my_bcmp(char *str1, char *str2, unsigned long n)
@@ -47,6 +47,7 @@  static void *locking_thread(void *arg)
 #else
 		unsigned long loops = 0;
 		//while (!bcmp(page, zeropage, page_size))
+		asm volatile("" : : : "memory");
 		while (!memcmp(page, zeropage, page_size)) {
 			loops += 1;
 			asm volatile("" : : : "memory");