Message ID | 20150618145202.GG14955@redhat.com
---|---
State | Dropped
Date | Thu, 18 Jun 2015 16:52:02 +0200
From | Andrea Arcangeli <aarcange@redhat.com>
To | Torvald Riegel <triegel@redhat.com>
Cc | Szabolcs Nagy <nsz@port70.net>, libc-alpha@sourceware.org, "H.J. Lu" <hongjiu.lu@intel.com>
Subject | Re: memcmp-sse4.S EqualHappy bug
In-Reply-To | <1434637415.5250.271.camel@localhost.localdomain>
Commit Message
Andrea Arcangeli
June 18, 2015, 2:52 p.m. UTC
On Thu, Jun 18, 2015 at 04:23:35PM +0200, Torvald Riegel wrote:
> There are a few differences to your memcmp test case though. If
> barrier() is a compiler barrier, this should "remove" prior knowledge
> the compiler has about memory. Second, you access volatile-qualified
> data, which at least requires to do the accesses byte-wise or similar.
> Does the memcmp test case also fail when you access volatiles?

Actually the kernel code in the __builtin_memcpy case removes the volatile qualifier. On a side note, it's not normal to use READ_ONCE/WRITE_ONCE on objects larger than sizeof(long); it would warn at build time, but it would still build successfully.

Changing my testcase like that just produces volatile warnings (warning: passing argument 1 of ‘memset’ discards ‘volatile’ qualifier from pointer target type, etc.); it still builds and fails at runtime like before. I added another barrier as well, but it can't make a difference: the problem is very clear, and how to prevent this practical problem is clear too. As long as memcmp-sse4.S is called, there's nothing the compiler can do to prevent this problem from materializing.

> Note that I'm not saying that this is guaranteed to work. It's still UB
> if there is a data race. However, in practice, this might just continue
> to work.

READ_ONCE/WRITE_ONCE are only used when there is going to be a data race; otherwise they're never used.

> Userland programs should really try to transition to atomics and the
> C11/C++11 memory model or just the __atomic builtins, if possible.

We should look into that.

Without a way to access non-atomic memory regions from C, RCU in the kernel cannot work. But if memcpy in the kernel has to behave sanely (and not break out of the loop too happily) in the presence of non-atomic changes to the memory, then I don't see why memcmp in userland is different. Also note there is a userspace RCU as well; I don't know exactly how it works because I never used it, but the whole concept of RCU is to let C code access non-atomic regions of memory and get away with it. And if that way is memcpy, I don't see what's fundamentally different about memcpy being required to cope with that, but memcmp not.

I totally see why memcmp-sse4.S is safe by the standard, but the standard is not the only way to gain performance: RCU is required to scale 100% in SMP, pthread_mutex_rwlock wouldn't scale, so it's unthinkable to get rid of READ_ONCE/WRITE_ONCE. Switching to __atomic would be fine, and if __atomic can make memcpy (and IMHO memcmp) behave in a way where they won't happily return too soon, it'd be great.
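For illustration, a minimal user-space reduction of the scenario being described might look as follows (hypothetical code, not the actual testcase from the thread): one thread keeps rewriting the front of dst while its last byte stays permanently different from src, so a return value of 0 from memcmp is always wrong for this workload. The program deliberately contains a data race and is therefore undefined behavior by the standard; whether it actually misbehaves depends on the memcmp implementation selected at run time.

    /* Hypothetical reduction of the reported scenario (not the original testcase).
     * dst's last byte is always different from src's, yet an optimized memcmp
     * that reads bytes in a different order, or re-reads racing bytes, may in
     * practice return 0. This program has a deliberate data race (UB). */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define LEN 32
    static char src[LEN], dst[LEN];

    static void *writer(void *arg)
    {
        (void)arg;
        for (;;) {
            memset(dst, 0, LEN - 1);   /* races with the memcmp in main() */
            memset(dst, 1, LEN - 1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        memset(src, 1, LEN);
        memset(dst, 1, LEN);
        dst[LEN - 1] = 2;              /* never equal to src[LEN - 1] == 1 */
        pthread_create(&t, NULL, writer, NULL);
        for (;;) {
            if (memcmp(src, dst, LEN) == 0) {
                fprintf(stderr, "memcmp returned 0 despite a permanent difference\n");
                abort();
            }
        }
    }

Build with something like cc -O2 -pthread; the interesting question is whether the libc's optimized memcmp can ever report "equal" here.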
Comments
On Thu, 2015-06-18 at 16:52 +0200, Andrea Arcangeli wrote: > On Thu, Jun 18, 2015 at 04:23:35PM +0200, Torvald Riegel wrote: > > There are a few differences to your memcmp test case though. If > > barrier() is a compiler barrier, this should "remove" prior knowledge > > the compiler has about memory. Second, you access volatile-qualified > > data, which at least requires to do the accesses byte-wise or similar. > > Does the memcmp test case also fail when you acess volatiles? > > Actually the kernel code in the __builtin_memcpy case removes the > volatile qualifier. yeah, never mind... > On a side note it's not normal to use READ_ONCE/WRITE_ONCE on > objects larger than sizeof(long) and it would warn at build time but > it would build successfully. > > Changing my testcase like this just creates warning about volatile > warning: passing argument 1 of ‘memset’ discards ‘volatile’ qualifier > from pointer target type, etc... it still builds and fail at runtime > like before. I added another barrier as well but it can't make a > difference, the problem is very clear and how to prevent this > practical problem is clear too. As long as memcmp-sse4.S is called > there's nothing the compiler can do to prevent this problem from > materializing. Yeah, but if there were a memcmp that would respect volatile, then it could make a difference. > > Note that I'm not saying that this is guaranteed to work. It's still UB > > if there is a data race. However, in practice, this might just continue > > to work. > > READ_ONCE/WRITE_ONCE are only used if there is going to be a data > race, otherwise they're never used. I know. Technically, data races are UB, yet that a particular compiler does something that works for you can still happen. > > Userland programs should really try to transition to atomics and the > > C11/C++11 memory model or just the __atomic builtins, if possible. > > We should look into that. > > Without a way to access non-atomic regions with C I assume regions means memory regions not of atomic type(s). What do you mean by "with C"? Do you mean plain accesses, or just C code? If you just want to snapshot data for RCU, using atomic loads with memory_order_relaxed is fine. This could be put into a custom memcpy. > , RCU in the kernel > cannot work. But if memcpy in the kernel has to behave sane (and not > break out of the loop too happily) in presence of not-atomic changes > to the memory, then I don't see why memcmp in userland is different. I was distinguishing userland because userland should just rely on C11 and its memory model, IMO. I don't have a strong opinion on what the kernel should do, so I simply didn't want to comment on the latter. > Also note there is also an userspace RCU, I don't know exactly how > that works because I never used it, but the whole concept of RCU is to > allow C language to access not-atomic regions of memory and to get > away with it. In general, this cannot work. There might be compiler optimizations that expect that they can safely reload data that is not accessed through atomics nor is of atomic type. Those can make a lot of sense considering average code, but would break in this particular case. > And if that way is memcpy, I don't see what's > fundamentally different about memcpy being required to cope with that, > but memcmp not. If there is a way for the compiler to see that the data accessed speculatively in RCU can be modified concurrently, things are fine. This could be explicitly atomic accesses, a custom memcpy or such, etc. 
The stock memcpy assumes non-atomic data though, and data-race-freedom. > I totally see why memcmp-sse4.S is safe by the standard, just the > standard is not the only way to gain performance, RCU is required to > scale 100% in SMP, pthread_mutex_rwlock wouldn't scale, so it's > unthinkable to get rid of READ_ONCE/WRITE_ONCE. Switching to __atomic > would be fine, and if __atomic can make memcpy (and IMHO memcmp) > behave in a way they won't happily return too soon on __atomic > builtins, it'd be great. The __atomic builtins are all you need to implement concurrent code. The kernel's memory model might be more fine-grained than what C11 offers currently (e.g., data dependences: memory_order_consume is broken, and the kernel keeps its fingers crossed that the compiler will not do what it would be actually allowed to do; I've been working with Paul McKenney on this...). Nonetheless, besides those potential performance differences on some archs, it's all there. BTW, I'm wondering what your memcmp would actually do internally, and what it would guarantee precisely. It can't just use relaxed-MO reads / read_once without further barriers, because this won't guarantee anything in face of multiple concurrent updaters; if using just relaxed-MO reads, memcmp might read a snapshot that never existed in any seq-cst ordering of all individual loads and stores on this memory region.
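As a concrete illustration of the "custom memcpy" suggestion above, a byte-wise snapshot copy using the GCC/Clang __atomic builtins could look roughly like this (the helper name is made up, and it assumes the concurrent writer also uses atomic stores, otherwise the race remains):

    #include <stddef.h>

    /* Copy a snapshot byte-by-byte with relaxed atomic loads. If the concurrent
     * writer also uses (relaxed) atomic stores, the reads here are data-race-free,
     * although the snapshot may still interleave bytes from different writes. */
    static void snapshot_memcpy_relaxed(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = __atomic_load_n(&s[i], __ATOMIC_RELAXED);
    }

As the last paragraph above points out, relaxed loads provide no ordering by themselves: the resulting snapshot need not correspond to any state that existed at a single point in a sequentially consistent ordering of the writers' stores.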
On Thu, Jun 18, 2015 at 05:50:35PM +0200, Torvald Riegel wrote:
> BTW, I'm wondering what your memcmp would actually do internally, and
> what it would guarantee precisely. It can't just use relaxed-MO reads /

That's very simple: it would never return 0 if it didn't finish comparing all the data according to the length parameter. If the minimal atomic granularity of the arch is 1, and I guarantee the last byte never changes and is always different between the src and dst regions, it would never return 0. It could return any other value but 0. Whether it returns above or below zero is undefined (depending on the modifications), but in no way could my memcmp return 0 for such a workload. That it can only return non-zero is defined by the hardware and trivial to enforce in the workload of my testcase.
On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote:
> workload. That it can only return non-zero is defined by the hardware
> and trivial to enforce in the workload of my testcase.

Instead of thinking in hardware terms, our colleague David (CC'ed) suggested we can explain why there is a problem in C terms as well.

While the compiler would be free to reorder the loads/stores if you wrote the memcmp in C, and it would be free to duplicate a read, it would never be free to terminate the comparison loop early due to any equal result.
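To make the structural point concrete, a naive byte-wise comparison loop can be sketched as below: it can only return 0 after its index has covered the whole length, which is the intuition the discussion relies on (though, as the next reply stresses, a data race still makes even this loop undefined behavior as far as the standard is concerned).

    #include <stddef.h>

    /* Naive reference comparison, for illustration only. Structurally, 0 can
     * only be returned after some value has been loaded from every index,
     * which is the property the sse4 implementation does not preserve. */
    static int naive_memcmp(const void *a, const void *b, size_t n)
    {
        const unsigned char *p = a, *q = b;
        for (size_t i = 0; i < n; i++) {
            if (p[i] != q[i])
                return p[i] < q[i] ? -1 : 1;
        }
        return 0;
    }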
On Thu, 2015-06-18 at 19:22 +0200, Andrea Arcangeli wrote: > On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote: > > workload. That it can only return non zero is defined by the hardware > > and trivial to enforce in the workload of my testcase. > > Instead of thinking in hardware terms, our colleague David (CC'ed) > suggested we can explain why there is a problem in C terms as well. > > While the compiler would be free to reorder the load/stores if you > wrote the memcmp in C, and it would be free to duplicate a read, it > would never be free to terminate the comparison loop due any equal > result. If there are data races, this is undefined behavior and *all* bets are off (according to the standard). This includes results such as terminating the loop, stopping to do anything, etc. This may seem overly drastic, but it's a useful line to draw because we don't want to have to reason based on details of a compiler's implementation.
* Torvald Riegel (triegel@redhat.com) wrote:
> On Thu, 2015-06-18 at 19:22 +0200, Andrea Arcangeli wrote:
> > On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote:
> > > workload. That it can only return non-zero is defined by the hardware
> > > and trivial to enforce in the workload of my testcase.
> >
> > Instead of thinking in hardware terms, our colleague David (CC'ed)
> > suggested we can explain why there is a problem in C terms as well.
> >
> > While the compiler would be free to reorder the load/stores if you
> > wrote the memcmp in C, and it would be free to duplicate a read, it
> > would never be free to terminate the comparison loop due to any equal
> > result.
>
> If there are data races, this is undefined behavior and *all* bets are
> off (according to the standard). This includes results such as
> terminating the loop, stopping to do anything, etc.
>
> This may seem overly drastic, but it's a useful line to draw because we
> don't want to have to reason based on details of a compiler's
> implementation.

Yes, I guess what I find interesting about this case is that:

a) Having found a difference and then reread it and found it identical, you *know* you've hit an 'undefined' case.

b) The action taken in that case is an outcome that seems impossible from any memory reordering or pure re-reading of data, so it's more surprising than normal - people are already used to the compiler or the CPU reordering stuff, and having static mismatched data in the block would seem to be a guarantee that 0 could never happen.

c) I worry whether there's anything where a strcmp could get optimised into a memcmp and then hit this case, but I agree it's difficult to find a situation where it would have been safe anyway.

IMHO it would be safer to abort than to return 0, but that's probably an equally controversial result.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Thu, Jun 18, 2015 at 07:49:45PM +0200, Torvald Riegel wrote: > On Thu, 2015-06-18 at 19:22 +0200, Andrea Arcangeli wrote: > > On Thu, Jun 18, 2015 at 06:19:43PM +0200, Andrea Arcangeli wrote: > > > workload. That it can only return non zero is defined by the hardware > > > and trivial to enforce in the workload of my testcase. > > > > Instead of thinking in hardware terms, our colleague David (CC'ed) > > suggested we can explain why there is a problem in C terms as well. > > > > While the compiler would be free to reorder the load/stores if you > > wrote the memcmp in C, and it would be free to duplicate a read, it > > would never be free to terminate the comparison loop due any equal > > result. > > If there are data races, this is undefined behavior and *all* bets are > off (according to the standard). This includes results such as > terminating the loop, stopping to do anything, etc. > > This may seem overly drastic, but it's a useful line to draw because we > don't want to have to reason based on details of a compiler's > implementation. My view is more practical, memcmp equality retvals can give false negatives now. So any code using memcmp or bcmp to validate security canary values (canary areas are always checked for equality) could also return false negatives. It sounds unlikely that this will trigger but in practice by now memcmp isn't usable for canary value checks or anything else related to security unless this gets fixed.
On 06/18/2015 02:12 PM, Dr. David Alan Gilbert wrote: > c) I worry if there's anything where a strcmp could get optimised > into a memcmp and then hit this case, but I agree it's difficult to > find a situation where it would have been safe anyway. Exactly. It would not have been safe. > IMHO it would be safer to abort than to return 0, but that's probably > an equally controvercial result. Not really. It's undefined behaviour, we can do anything. I agree with aborting, but only as long as the hot path's performance is not impacted and I haven't thought about how to do that. Per "Bugs in the user program" convention: https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program --- 9.2. Bugs in the user program If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common that library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug. --- Cheers, Carlos.
On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > I agree with aborting, but only as long as the hot path's performance > is not impacted and I haven't thought about how to do that. > > Per "Bugs in the user program" convention: > https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program > > --- > 9.2. Bugs in the user program > > If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common that library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug. > --- Aborting should be a satisfactory measure too and I think this issue fits the worthwhile tradeoff described by the above document, as the cost should be negligible. It's just a matter of propagating rdx in addition to rdi/rsi and it won't affect the sse4 unrolled loops fast paths. If the testcase reproduces the bug frequently the abort would give a better chance to fix the bug so the app becomes fully C compliant. Aborting memcmp should still leave memcmp usable for security purposes if the app can safely detect such a failure and shutdown (failing early and catastrophically sounds good to me but then not everyone may agree, if it's canary value that is being checked it may already have a proper notification failure path that may not get invoked). Overall as long as the current false negative is not returned, it should be enough as far as security is concerned. If the race condition is very rarely triggered and it ends up being triggering in production, an abort wouldn't be welcome, but it's still preferable and safer than the current behavior. The cost of detecting the UB and aborting is practically identical to the cost of making the memcmp not return 0. The only difference is that instead of jumping at the start of the function and continuing evaluating the rest of the buffer, we'd abort.
On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > I agree with aborting, but only as long as the hot path's performance > > is not impacted and I haven't thought about how to do that. > > > > Per "Bugs in the user program" convention: > > https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program > > > > --- > > 9.2. Bugs in the user program > > > > If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common that library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug. > > --- > > Aborting should be a satisfactory measure too and I think this issue > fits the worthwhile tradeoff described by the above document, as the > cost should be negligible. It's just a matter of propagating rdx in > addition to rdi/rsi and it won't affect the sse4 unrolled loops fast > paths. > > If the testcase reproduces the bug frequently the abort would give a > better chance to fix the bug so the app becomes fully C compliant. > > Aborting memcmp should still leave memcmp usable for security purposes > if the app can safely detect such a failure and shutdown (failing > early and catastrophically sounds good to me but then not everyone may > agree, if it's canary value that is being checked it may already have > a proper notification failure path that may not get invoked). Overall > as long as the current false negative is not returned, it should be > enough as far as security is concerned. > > If the race condition is very rarely triggered and it ends up being > triggering in production, an abort wouldn't be welcome, but it's still > preferable and safer than the current behavior. > > The cost of detecting the UB and aborting is practically identical to > the cost of making the memcmp not return 0. The only difference is > that instead of jumping at the start of the function and continuing > evaluating the rest of the buffer, we'd abort. My 2c, although it is true that technically the spec says memcmp can return arbitrary values if the tested areas change concurrently, it would be really hostile to abort() in such a case. The only real problem is returning an equal result when the underlying memory never was, nor could be equal. I do not think it matters what value is returned in that case, as long as it is not 0. If returning anything but 0 can be accomplished w/o sacrificing speed then that is something that should be done. There are many programs that use memory checks to handle lock-less barriers. For those programs it doesn't matter at all what (or when) changed). All that matters is that the memory region were different during the test, and then they'll go on their business. If you were to require locking in order to do lockless checks, then the whole purpose would be defeated. 
Clearly people would be better off using atomic comparisons on canary values instead, but it seems easy to avoid false positives (returning 0 when the memory is clearly different) and keep these things working, so why not do it?

Simo.
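A rough sketch of the "atomic comparisons on canary values" idea (hypothetical helper, not an existing glibc or kernel interface): every byte of the guarded region is read with an atomic load, so the check itself contains no data race and cannot be short-circuited into a spurious "equal".

    #include <stdbool.h>
    #include <stddef.h>

    /* Compare a canary region against its expected value using atomic loads.
     * Relaxed ordering is enough for the comparison itself; callers that need
     * ordering with other state would add the appropriate fences. */
    static bool canary_matches(const unsigned char *region,
                               const unsigned char *expected, size_t n)
    {
        bool equal = true;
        for (size_t i = 0; i < n; i++) {
            if (__atomic_load_n(&region[i], __ATOMIC_RELAXED) != expected[i])
                equal = false;   /* keep scanning rather than returning early */
        }
        return equal;
    }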
On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > I agree with aborting, but only as long as the hot path's performance > > > is not impacted and I haven't thought about how to do that. > > > > > > Per "Bugs in the user program" convention: > > > https://sourceware.org/glibc/wiki/Style_and_Conventions#Bugs_in_the_user_program > > > > > > --- > > > 9.2. Bugs in the user program > > > > > > If it's user code invoking undefined behavior, then it should fail early and catastrophically so that developers don't get the false impression that their code is OK when it happens not to break the use cases they test adequately. (Said another way, so that we avoid giving developers an excuse to complain when a future implementation change "breaks" their programs that were always broken, but theretofore ignorably so.) That too trades off against any runtime cost of detecting the case. I'd say the allowance for cost of detection is marginally higher than in the case of library bugs, because we expect user bugs to be more common that library bugs. But it's still not much, since correct programs performing better is more important to us than buggy programs being easier to debug. > > > --- > > > > Aborting should be a satisfactory measure too and I think this issue > > fits the worthwhile tradeoff described by the above document, as the > > cost should be negligible. It's just a matter of propagating rdx in > > addition to rdi/rsi and it won't affect the sse4 unrolled loops fast > > paths. > > > > If the testcase reproduces the bug frequently the abort would give a > > better chance to fix the bug so the app becomes fully C compliant. > > > > Aborting memcmp should still leave memcmp usable for security purposes > > if the app can safely detect such a failure and shutdown (failing > > early and catastrophically sounds good to me but then not everyone may > > agree, if it's canary value that is being checked it may already have > > a proper notification failure path that may not get invoked). Overall > > as long as the current false negative is not returned, it should be > > enough as far as security is concerned. > > > > If the race condition is very rarely triggered and it ends up being > > triggering in production, an abort wouldn't be welcome, but it's still > > preferable and safer than the current behavior. > > > > The cost of detecting the UB and aborting is practically identical to > > the cost of making the memcmp not return 0. The only difference is > > that instead of jumping at the start of the function and continuing > > evaluating the rest of the buffer, we'd abort. > > My 2c, > although it is true that technically the spec says memcmp can return > arbitrary values if the tested areas change concurrently, it would be > really hostile to abort() in such a case. > The only real problem is returning an equal result when the underlying > memory never was, nor could be equal. This last sentence you wrote makes it look simple, but defining this properly is more complicated. The condition Andrea gave is simpler, because it requires that there are no data races on some parts of the checked memory, and those parts are guaranteed to be different. 
In contrast, "never was nor could be equal" is not simple to define in weak memory models; what I mean is that an intuitive understanding of this can easily differ from what would be possible in general for executions on a particular hardware. > I do not think it matters what value is returned in that case, as long > as it is not 0. > If returning anything but 0 can be accomplished w/o sacrificing speed > then that is something that should be done. > > There are many programs that use memory checks to handle lock-less > barriers. What kind of barriers are you referring to? Fences like an "mfence" instruction on x86? If so, what do you mean by "handle"? > For those programs it doesn't matter at all what (or when) > changed). All that matters is that the memory region were different > during the test, and then they'll go on their business. If you were to > require locking in order to do lockless checks, then the whole purpose > would be defeated. The discussion is about whether we can indeed rely on data race freedom, or whether we want to handle data races gracefully in some situations. That applies to both code using locks and code synchronizing differently. > Clearly people are better using atomic comparisons on canary values > instead, but it seem easy to avoid false positives (returning 0 when > memory is clearly different) and keep these things working, so why not > do it ? I see two separate issues here. First, where do we draw the line, and what do we guarantee. I strongly believe that programs must not have data races, and that they should use atomics or other synchronization properly (this doesn't mean locks, but relaxed memory order atomics, for example). Second, do we try to keep buggy programs working. If it has no cost to do so (e.g., like it might be true in this case), then doing that can help to trigger less errors. But that doesn't mean the buggy programs should get fixed eventually.
* Torvald Riegel (triegel@redhat.com) wrote: > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > I agree with aborting, but only as long as the hot path's performance > > > > is not impacted and I haven't thought about how to do that. > > > > > > Clearly people are better using atomic comparisons on canary values > > instead, but it seem easy to avoid false positives (returning 0 when > > memory is clearly different) and keep these things working, so why not > > do it ? > > I see two separate issues here. First, where do we draw the line, and > what do we guarantee. I strongly believe that programs must not have > data races, and that they should use atomics or other synchronization > properly (this doesn't mean locks, but relaxed memory order atomics, for > example). > Second, do we try to keep buggy programs working. If it has no cost to > do so (e.g., like it might be true in this case), then doing that can > help to trigger less errors. But that doesn't mean the buggy programs > should get fixed eventually. I find it difficult to understand the boundaries of what the C library is allowed to do in this type of optimisation. For example, consider the following: char a[128]; char b[128]; put some static data in a[0-63] put some static data in b[0-63] a[64]=0; b[64]=0; start a thread doing stuff in a[65..] start a thread doing stuff in b[65..] if (!strcmp(a,b)) { /* Do something */ } a) Is that behaviour defined? b) Is it defined with strncmp(a,b,64) ? c) Is it defined with strncmp(a,b,128)? d) Is it defined with memcmp(a,b,64)? e) Is it defined with memcmp(a,b,128)? f) If I moved that boundary off a nice round % 8 mark would it matter? I can imagine there may be lots of things that terminate a string and let other stuff happen in the allocated space after the end of the string in the belief that at the end of that string all is unknown. Andrea's case is a bit different in that it's the later data that's static, but that doesn't sound like it should change the answer as to what's allowed. Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
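For concreteness, the pseudocode in the question above could be fleshed out roughly as follows (a sketch assuming pthreads, with arbitrary writes as the "stuff" in the upper halves; which of the variants (a)-(e) is defined is exactly what the replies below discuss):

    #include <pthread.h>
    #include <string.h>

    static char a[128];
    static char b[128];

    /* Keep rewriting indices [65..127] of the given buffer, forever. */
    static void *churn(void *p)
    {
        char *buf = p;
        for (unsigned v = 0; ; v++)
            memset(buf + 65, (int)(v & 0xff), 128 - 65);
        return NULL;
    }

    int main(void)
    {
        pthread_t ta, tb;
        memset(a, 'x', 64);          /* static data in a[0..63] */
        memset(b, 'x', 64);          /* static data in b[0..63] */
        a[64] = 0;
        b[64] = 0;
        pthread_create(&ta, NULL, churn, a);
        pthread_create(&tb, NULL, churn, b);

        if (!strcmp(a, b)) {         /* case (a); the strncmp/memcmp variants follow */
            /* Do something */
        }
        return 0;
    }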
On Fri, Jun 19, 2015 at 04:59:08PM +0100, Dr. David Alan Gilbert wrote:
> * Torvald Riegel (triegel@redhat.com) wrote:
> > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote:
> > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote:
> > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote:
> > > > > I agree with aborting, but only as long as the hot path's performance
> > > > > is not impacted and I haven't thought about how to do that.
> > >
> > > Clearly people are better using atomic comparisons on canary values
> > > instead, but it seem easy to avoid false positives (returning 0 when
> > > memory is clearly different) and keep these things working, so why not
> > > do it ?
> >
> > I see two separate issues here. First, where do we draw the line, and
> > what do we guarantee. I strongly believe that programs must not have
> > data races, and that they should use atomics or other synchronization
> > properly (this doesn't mean locks, but relaxed memory order atomics, for
> > example).
> > Second, do we try to keep buggy programs working. If it has no cost to
> > do so (e.g., like it might be true in this case), then doing that can
> > help to trigger less errors. But that doesn't mean the buggy programs
> > should get fixed eventually.
>
> I find it difficult to understand the boundaries of what the C library
> is allowed to do in this type of optimisation.
>
> For example, consider the following:
>
> char a[128];
> char b[128];
>
> put some static data in a[0-63]
> put some static data in b[0-63]
> a[64]=0;
> b[64]=0;
> start a thread doing stuff in a[65..]
> start a thread doing stuff in b[65..]
>
> if (!strcmp(a,b)) {
>   /* Do something */
> }
>
> a) Is that behaviour defined?
> b) Is it defined with strncmp(a,b,64) ?
> c) Is it defined with strncmp(a,b,128)?
> d) Is it defined with memcmp(a,b,64)?
> e) Is it defined with memcmp(a,b,128)?
> f) If I moved that boundary off a nice round % 8 mark would
>    it matter?

Yes, this is correct, as the memory is indistinguishable from using

struct foo {
    char start[65];
    char end[63];
};

where end is modified in parallel. The only thing that could differ is performance, as we read past end as long as it doesn't cross a page, then discard the data after end.
On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > * Torvald Riegel (triegel@redhat.com) wrote: > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > instead, but it seem easy to avoid false positives (returning 0 when > > > memory is clearly different) and keep these things working, so why not > > > do it ? > > > > I see two separate issues here. First, where do we draw the line, and > > what do we guarantee. I strongly believe that programs must not have > > data races, and that they should use atomics or other synchronization > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > example). > > Second, do we try to keep buggy programs working. If it has no cost to > > do so (e.g., like it might be true in this case), then doing that can > > help to trigger less errors. But that doesn't mean the buggy programs > > should get fixed eventually. > > I find it difficult to understand the boundaries of what the C library > is allowed to do in this type of optimisation. > > For example, consider the following: > > char a[128]; > char b[128]; > > put some static data in a[0-63] > put some static data in b[0-63] > a[64]=0; > b[64]=0; > start a thread doing stuff in a[65..] > start a thread doing stuff in b[65..] > > if (!strcmp(a,b)) { > /* Do something */ > } > > a) Is that behaviour defined? Good question. I think it should be. This depends on both the data race definition and what strcmp/strncmp/memcmp are specified to do using the abstract machine. The data race definition uses memory locations as granularity, which is in 3.14 in C11. Separate characters in an array should be separate memory locations. C11 isn't very specific regarding strcmp, and just says that it "compares the string[s]" (7.24.4.2). C++14 is a bit more specific regarding basic_string::compare (21.4.7.9), saying that first the length of the strings are determined, and then a strncmp is run using the smaller of the two lengths. Assuming the C++ specs, only the array indices [0..64] should be accessed by the abstract machine. So no data race with the stuff going on in [65..). > b) Is it defined with strncmp(a,b,64) ? Yes. > c) Is it defined with strncmp(a,b,128)? Not sure. C11 says that not more than "n characters" are compared, and characters that follow the a null character aren't compared either. This indicates it wouldn't be different from strncmp(a,b,64) in the particular case. Regarding C++11, I'm not sure. The closest copies a substring (conceptually) and then compares, but the substring copying has to determine length of the string and then subtracting the max length. This would do a strlen first, which wouldn't access past index 64. Thus, should be fine too. > d) Is it defined with memcmp(a,b,64)? No data race, IMO. > e) Is it defined with memcmp(a,b,128)? Data race. Undefined behavior. > f) If I moved that boundary off a nice round % 8 mark would > it matter? No difference as far as the standard is concerned. 
> I can imagine there may be lots of things that terminate > a string and let other stuff happen in the allocated space > after the end of the string in the belief that at the end > of that string all is unknown. Andrea's case is a bit different > in that it's the later data that's static, but that doesn't > sound like it should change the answer as to what's allowed. I think it does, because the question is whether there is a data race on the memory locations that the abstract machine would access.
It is true you identify 2 different issues, but I am not interested in either; I merely care about failing safe.

On Fri, 2015-06-19 at 17:32 +0200, Torvald Riegel wrote:
> I see two separate issues here. First, where do we draw the line, and
> what do we guarantee. I strongly believe that programs must not have
> data races, and that they should use atomics or other synchronization
> properly (this doesn't mean locks, but relaxed memory order atomics,
> for example).

Programs should not have data races, true, but sometimes they do; failing safe makes 'bad' programs not be worse.

> Second, do we try to keep buggy programs working. If it has no cost
> to do so (e.g., like it might be true in this case), then doing that
> can help to trigger less errors. But that doesn't mean the buggy
> programs should get fixed eventually.

Sure, but sometimes it is hard to define something as a bug.

In the specific case Andrea reported (which I pondered on), the race was clearly there, but also clearly the underlying memory was never identical at any given point, yet memcmp reported 0, which means: the whole memory area was checked and matched. This was clearly not the case; memcmp didn't even check the whole memory area, and returned 0.

Now, if the 2 memory areas ever were (even for a nanosecond) identical, then returning zero would have been perfectly acceptable. During a race you can't know what the result is. But a function that supposedly checks a data area for equality and instead stops short and returns "equal" seems just wrong, and it is certainly unexpected.

I understand that the cause for that is that the program is operating on part of the area concurrently when it shouldn't, but still, I would expect memcmp to at least always go through the whole area it is expected to check and not stop short on a fluke and return a "random" result.

Simo.
On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > * Torvald Riegel (triegel@redhat.com) wrote: > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > memory is clearly different) and keep these things working, so why not > > > > do it ? > > > > > > I see two separate issues here. First, where do we draw the line, and > > > what do we guarantee. I strongly believe that programs must not have > > > data races, and that they should use atomics or other synchronization > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > example). > > > Second, do we try to keep buggy programs working. If it has no cost to > > > do so (e.g., like it might be true in this case), then doing that can > > > help to trigger less errors. But that doesn't mean the buggy programs > > > should get fixed eventually. > > > > I find it difficult to understand the boundaries of what the C library > > is allowed to do in this type of optimisation. > > > > For example, consider the following: > > > > char a[128]; > > char b[128]; > > > > put some static data in a[0-63] > > put some static data in b[0-63] > > a[64]=0; > > b[64]=0; > > start a thread doing stuff in a[65..] > > start a thread doing stuff in b[65..] > > > > if (!strcmp(a,b)) { > > /* Do something */ > > } > > > > a) Is that behaviour defined? > > Good question. I think it should be. This depends on both the data > race definition and what strcmp/strncmp/memcmp are specified to do using > the abstract machine. > > The data race definition uses memory locations as granularity, which is > in 3.14 in C11. Separate characters in an array should be separate > memory locations. > > C11 isn't very specific regarding strcmp, and just says that it > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > regarding basic_string::compare (21.4.7.9), saying that first the length > of the strings are determined, and then a strncmp is run using the > smaller of the two lengths. > > Assuming the C++ specs, only the array indices [0..64] should be > accessed by the abstract machine. So no data race with the stuff going > on in [65..). > > > b) Is it defined with strncmp(a,b,64) ? > > Yes. > > > c) Is it defined with strncmp(a,b,128)? > > Not sure. C11 says that not more than "n characters" are compared, and > characters that follow the a null character aren't compared either. > This indicates it wouldn't be different from strncmp(a,b,64) in the > particular case. > Regarding C++11, I'm not sure. The closest copies a substring > (conceptually) and then compares, but the substring copying has to > determine length of the string and then subtracting the max length. > This would do a strlen first, which wouldn't access past index 64. > Thus, should be fine too. > > > d) Is it defined with memcmp(a,b,64)? > > No data race, IMO. > > > e) Is it defined with memcmp(a,b,128)? > > Data race. Undefined behavior. > > > f) If I moved that boundary off a nice round % 8 mark would > > it matter? 
> > No difference as far as the standard is concerned. > > > I can imagine there may be lots of things that terminate > > a string and let other stuff happen in the allocated space > > after the end of the string in the belief that at the end > > of that string all is unknown. Andrea's case is a bit different > > in that it's the later data that's static, but that doesn't > > sound like it should change the answer as to what's allowed. > > I think it does, because the question is whether there is a data race on > the memory locations that the abstract machine would access.

Well, given we are making examples, assume 2 structures like this:

struct test {
    void *pointer;
    char start[16];
    char end[240];
};

and

struct test a;
struct test b;

memset(a.end, 1, 240);
memset(b.end, 2, 240);

In what case would it be expected/legal for memcmp(a.start, b.start, 256) to ever return 0?

Simo.
On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > memory is clearly different) and keep these things working, so why not > > > > > do it ? > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > what do we guarantee. I strongly believe that programs must not have > > > > data races, and that they should use atomics or other synchronization > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > example). > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > do so (e.g., like it might be true in this case), then doing that can > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > should get fixed eventually. > > > > > > I find it difficult to understand the boundaries of what the C library > > > is allowed to do in this type of optimisation. > > > > > > For example, consider the following: > > > > > > char a[128]; > > > char b[128]; > > > > > > put some static data in a[0-63] > > > put some static data in b[0-63] > > > a[64]=0; > > > b[64]=0; > > > start a thread doing stuff in a[65..] > > > start a thread doing stuff in b[65..] > > > > > > if (!strcmp(a,b)) { > > > /* Do something */ > > > } > > > > > > a) Is that behaviour defined? > > > > Good question. I think it should be. This depends on both the data > > race definition and what strcmp/strncmp/memcmp are specified to do using > > the abstract machine. > > > > The data race definition uses memory locations as granularity, which is > > in 3.14 in C11. Separate characters in an array should be separate > > memory locations. > > > > C11 isn't very specific regarding strcmp, and just says that it > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > regarding basic_string::compare (21.4.7.9), saying that first the length > > of the strings are determined, and then a strncmp is run using the > > smaller of the two lengths. > > > > Assuming the C++ specs, only the array indices [0..64] should be > > accessed by the abstract machine. So no data race with the stuff going > > on in [65..). > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > Yes. > > > > > c) Is it defined with strncmp(a,b,128)? > > > > Not sure. C11 says that not more than "n characters" are compared, and > > characters that follow the a null character aren't compared either. > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > particular case. > > Regarding C++11, I'm not sure. The closest copies a substring > > (conceptually) and then compares, but the substring copying has to > > determine length of the string and then subtracting the max length. > > This would do a strlen first, which wouldn't access past index 64. > > Thus, should be fine too. 
> > > > > d) Is it defined with memcmp(a,b,64)? > > > > No data race, IMO. > > > > > e) Is it defined with memcmp(a,b,128)? > > > > Data race. Undefined behavior. > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > it matter? > > > > No difference as far as the standard is concerned. > > > > > I can imagine there may be lots of things that terminate > > > a string and let other stuff happen in the allocated space > > > after the end of the string in the belief that at the end > > > of that string all is unknown. Andrea's case is a bit different > > > in that it's the later data that's static, but that doesn't > > > sound like it should change the answer as to what's allowed. > > > > I think it does, because the question is whether there is a data race on > > the memory locations that the abstract machine would access. > > Well given we are making examples. > > Assume 2 structures like this: > > struct test { > void *pointer; > char start[16]; > char end[240]; > } > > and > > struct test a; > struct test b; > > memset(a.end, 1, 240); > memset(b.end, 2, 240); > > In what case it would be expected/legal for > memcmp(a.start, b.start, 256); to ever return 0 ? Just to be clear, if there are data-races in [ab].pointer or [as].start I see absolutely no problem in stopping short and returning a positive/negative value. I just do not see how returning 0 would ever make sense. Simo.
On Fri, 2015-06-19 at 13:08 -0400, Simo Sorce wrote: > It is true you identify 2 different issues, but I am not interested in > either, I merely care about failing safe. > > On Fri, 2015-06-19 at 17:32 +0200, Torvald Riegel wrote: > > I see two separate issues here. First, where do we draw the line, and > > what do we guarantee. I strongly believe that programs must not have > > data races, and that they should use atomics or other synchronization > > properly (this doesn't mean locks, but relaxed memory order atomics, > > for example). > > Programs should not have data races, true, but sometimes they do, > failing safe makes 'bad' programs not be worse. > > > Second, do we try to keep buggy programs working. If it has no cost > > to do so (e.g., like it might be true in this case), then doing that > > can help to trigger less errors. But that doesn't mean the buggy > > programs should get fixed eventually. > > Sure, but sometimes it is hard to define something as a bug. > > In the specific case Andrea reported (which I pondered on), the race was > clearly there, but also clearly the underlying memory was never > identical at any given point, yet memcmp reported 0 which means: the > whole memory area was checked and matched. This was clearly not the > case, memcmp didn't even check the whole memory area, and returned 0. This is allowed if something is undefined behavior. A return value of 0 only has the specified meaning if there's no undefined behavior elsewhere. > Now, if the 2 memory areas ever where (even for a nanosecond) identical, > then returning zero would have been perfectly acceptable. How do you know that when there are data races? Making this assumption requires making assumptions about how the modification happened. For example, if the modifications used sequential code too, this could potentially write temporary values. Second, you say "even for a nanosecond", and that seems to say to you're thinking about some kind of total order of events. This isn't necessarily the case though; even if you'd use relaxed memory-order atomic stores and loads to access the array, what a the memcpy using the atomics would see would not necessarily be ordered as if the stores would happen one at a time, or in some way that can be totally ordered. > During a race > you can't know what the result is. The standard "specifies" undefined behavior, which allows for much more than not knowing what the result is. It could we conforming to burn your computer, for example ;) > But a function that supposedly checks > a data area for equality and instead stops short and return "equal" seem > just wrong, and it is certainly unexpected. I do understand that it could be surprising behavior, especially if thinking at the level of what a straight-forward implementation may or may not do. But we have to draw the line somewhere to enable compiler optimizations, and we don't want to describe compiler optimizations in detail nor reason about that detail, so the big hammer of undefined behavior is a sensible choice. > I understand that the cause for that is the program is operating on part > of the area currently and it shouldn't but still, I would expect memcmp > to at least always go through the whole area it is expected to check and > not stop short on a fluke and return a "random" result. 
I am sympathetic to this viewpoint, but a random result, or undefined behavior, is really not that unrealistic if you consider more complex C code with data races, that may be subject to more compiler optimization than one would expect for memcmp or such. Just to be clear, I don't want to be simply nitpicking here, nor say that the standard is the only true design choice, or something like that. I just think that it's dangerous to start considering some data races to be "benign" because that's a truly slippery slope and the reasoning required for that can be really complicated.
On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > do it ? > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > data races, and that they should use atomics or other synchronization > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > example). > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > should get fixed eventually. > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > is allowed to do in this type of optimisation. > > > > > > > > For example, consider the following: > > > > > > > > char a[128]; > > > > char b[128]; > > > > > > > > put some static data in a[0-63] > > > > put some static data in b[0-63] > > > > a[64]=0; > > > > b[64]=0; > > > > start a thread doing stuff in a[65..] > > > > start a thread doing stuff in b[65..] > > > > > > > > if (!strcmp(a,b)) { > > > > /* Do something */ > > > > } > > > > > > > > a) Is that behaviour defined? > > > > > > Good question. I think it should be. This depends on both the data > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > the abstract machine. > > > > > > The data race definition uses memory locations as granularity, which is > > > in 3.14 in C11. Separate characters in an array should be separate > > > memory locations. > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > of the strings are determined, and then a strncmp is run using the > > > smaller of the two lengths. > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > accessed by the abstract machine. So no data race with the stuff going > > > on in [65..). > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > Yes. > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > characters that follow the a null character aren't compared either. > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > particular case. > > > Regarding C++11, I'm not sure. 
The closest copies a substring > > > (conceptually) and then compares, but the substring copying has to > > > determine length of the string and then subtracting the max length. > > > This would do a strlen first, which wouldn't access past index 64. > > > Thus, should be fine too. > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > No data race, IMO. > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > Data race. Undefined behavior. > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > it matter? > > > > > > No difference as far as the standard is concerned. > > > > > > > I can imagine there may be lots of things that terminate > > > > a string and let other stuff happen in the allocated space > > > > after the end of the string in the belief that at the end > > > > of that string all is unknown. Andrea's case is a bit different > > > > in that it's the later data that's static, but that doesn't > > > > sound like it should change the answer as to what's allowed. > > > > > > I think it does, because the question is whether there is a data race on > > > the memory locations that the abstract machine would access. > > > > Well given we are making examples. > > > > Assume 2 structures like this: > > > > struct test { > > void *pointer; > > char start[16]; > > char end[240]; > > } > > > > and > > > > struct test a; > > struct test b; > > > > memset(a.end, 1, 240); > > memset(b.end, 2, 240); > > > > In what case it would be expected/legal for > > memcmp(a.start, b.start, 256); to ever return 0 ? > > Just to be clear, if there are data-races in [ab].pointer or [as].start > I see absolutely no problem in stopping short and returning a > positive/negative value. I just do not see how returning 0 would ever > make sense. It would be legal if the invocation of memcmp and at least one of the invocations of memset would be concurrent (ie, not ordered by happens-before). I do see that this wouldn't be an intuitive outcome if one reasons based on assuming some simplified (or straightforward, or natural, or whatever the personal opinion might be...) implementation. But a data race is undefined behavior, and that's where we're drawing the line.
On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > do it ? > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > data races, and that they should use atomics or other synchronization > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > example). > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > should get fixed eventually. > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > For example, consider the following: > > > > > > > > > > char a[128]; > > > > > char b[128]; > > > > > > > > > > put some static data in a[0-63] > > > > > put some static data in b[0-63] > > > > > a[64]=0; > > > > > b[64]=0; > > > > > start a thread doing stuff in a[65..] > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > if (!strcmp(a,b)) { > > > > > /* Do something */ > > > > > } > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > Good question. I think it should be. This depends on both the data > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > the abstract machine. > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > memory locations. > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > of the strings are determined, and then a strncmp is run using the > > > > smaller of the two lengths. > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > accessed by the abstract machine. So no data race with the stuff going > > > > on in [65..). > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > Yes. > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > characters that follow the a null character aren't compared either. 
> > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > particular case. > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > (conceptually) and then compares, but the substring copying has to > > > > determine length of the string and then subtracting the max length. > > > > This would do a strlen first, which wouldn't access past index 64. > > > > Thus, should be fine too. > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > No data race, IMO. > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > Data race. Undefined behavior. > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > it matter? > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > a string and let other stuff happen in the allocated space > > > > > after the end of the string in the belief that at the end > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > in that it's the later data that's static, but that doesn't > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > I think it does, because the question is whether there is a data race on > > > > the memory locations that the abstract machine would access. > > > > > > Well given we are making examples. > > > > > > Assume 2 structures like this: > > > > > > struct test { > > > void *pointer; > > > char start[16]; > > > char end[240]; > > > } > > > > > > and > > > > > > struct test a; > > > struct test b; > > > > > > memset(a.end, 1, 240); > > > memset(b.end, 2, 240); > > > > > > In what case it would be expected/legal for > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > I see absolutely no problem in stopping short and returning a > > positive/negative value. I just do not see how returning 0 would ever > > make sense. > > It would be legal if the invocation of memcmp and at least one of the > invocations of memset would be concurrent (ie, not ordered by > happens-before). Nope, sorry , there is a misunderstanding, consider the memsets part of initialization, *never* run in parallel with memcmp. > I do see that this wouldn't be an intuitive outcome if one reasons based > on assuming some simplified (or straightforward, or natural, or whatever > the personal opinion might be...) implementation. But a data race is > undefined behavior, and that's where we're drawing the line. Sure, I can understand that is the *whole* are under consideration is racing. But here we are talking about a large chunk of the area that is different and *never* raced upon. I can accept that the C standard may still decide this is undefined behavior even if just one small part of the total area sees concurrent modifications, but again, you are returning "short" from the optimized code, so it is unclear to me how you can return 0 when the memcmp code hasn't checked all the memory, wouldn't it make more sense to return the current position ? It is still undefined but at least it won't cause false positives. Simo.
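For concreteness, here is a minimal sketch of the scenario described above (not code from this thread; it assumes POSIX threads and reuses the field names from the struct test example). The .end arrays are filled once during initialization and always differ, while a second thread keeps writing to .start concurrently with the comparison. The program deliberately contains a data race on .start, so its behaviour is undefined under C11; the question raised above is whether a practical memcmp may then ever report the two objects as equal.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

struct test {
    void *pointer;
    char start[16];
    char end[240];
};

static struct test a, b;
static atomic_int done;

/* Writes race with memcmp() on .start only; .end is never touched here. */
static void *writer(void *arg)
{
    unsigned char v = 0;
    (void)arg;
    while (!atomic_load(&done)) {
        /* a.start is updated first, so there is a window in which the two
         * prefixes differ, after which they become equal again. */
        memset(a.start, v, sizeof a.start);
        memset(b.start, v, sizeof b.start);
        v++;
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    /* Initialization: .end always differs and is never written again. */
    memset(a.end, 1, sizeof a.end);
    memset(b.end, 2, sizeof b.end);

    pthread_create(&t, NULL, writer, NULL);
    for (long i = 0; i < 10000000; i++) {
        if (memcmp(a.start, b.start, sizeof a.start + sizeof a.end) == 0) {
            puts("memcmp returned 0 although .end always differs");
            break;
        }
    }
    atomic_store(&done, 1);
    pthread_join(t, NULL);
    return 0;
}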
On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote: > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > > do it ? > > > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > > data races, and that they should use atomics or other synchronization > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > > example). > > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > > should get fixed eventually. > > > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > > > For example, consider the following: > > > > > > > > > > > > char a[128]; > > > > > > char b[128]; > > > > > > > > > > > > put some static data in a[0-63] > > > > > > put some static data in b[0-63] > > > > > > a[64]=0; > > > > > > b[64]=0; > > > > > > start a thread doing stuff in a[65..] > > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > > > if (!strcmp(a,b)) { > > > > > > /* Do something */ > > > > > > } > > > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > > > Good question. I think it should be. This depends on both the data > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > > the abstract machine. > > > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > > memory locations. > > > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > > of the strings are determined, and then a strncmp is run using the > > > > > smaller of the two lengths. > > > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > > accessed by the abstract machine. So no data race with the stuff going > > > > > on in [65..). > > > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > > > Yes. 
> > > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > > characters that follow the a null character aren't compared either. > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > > particular case. > > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > > (conceptually) and then compares, but the substring copying has to > > > > > determine length of the string and then subtracting the max length. > > > > > This would do a strlen first, which wouldn't access past index 64. > > > > > Thus, should be fine too. > > > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > > > No data race, IMO. > > > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > > > Data race. Undefined behavior. > > > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > > it matter? > > > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > > a string and let other stuff happen in the allocated space > > > > > > after the end of the string in the belief that at the end > > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > > in that it's the later data that's static, but that doesn't > > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > > > I think it does, because the question is whether there is a data race on > > > > > the memory locations that the abstract machine would access. > > > > > > > > Well given we are making examples. > > > > > > > > Assume 2 structures like this: > > > > > > > > struct test { > > > > void *pointer; > > > > char start[16]; > > > > char end[240]; > > > > } > > > > > > > > and > > > > > > > > struct test a; > > > > struct test b; > > > > > > > > memset(a.end, 1, 240); > > > > memset(b.end, 2, 240); > > > > > > > > In what case it would be expected/legal for > > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > > I see absolutely no problem in stopping short and returning a > > > positive/negative value. I just do not see how returning 0 would ever > > > make sense. > > > > It would be legal if the invocation of memcmp and at least one of the > > invocations of memset would be concurrent (ie, not ordered by > > happens-before). > > Nope, sorry , there is a misunderstanding, consider the memsets part of > initialization, *never* run in parallel with memcmp. > > > I do see that this wouldn't be an intuitive outcome if one reasons based > > on assuming some simplified (or straightforward, or natural, or whatever > > the personal opinion might be...) implementation. But a data race is > > undefined behavior, and that's where we're drawing the line. > > Sure, I can understand that is the *whole* are under consideration is Sorry for the typos, the above should read ".. that IF the *whole* AREA under consideration ..." > racing. But here we are talking about a large chunk of the area that is > different and *never* raced upon. 
> > I can accept that the C standard may still decide this is undefined > behavior even if just one small part of the total area sees concurrent > modifications, but again, you are returning "short" from the optimized > code, so it is unclear to me how you can return 0 when the memcmp code > hasn't checked all the memory, wouldn't it make more sense to return the > current position ? It is still undefined but at least it won't cause > false positives. > > Simo. >
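The property being appealed to here is one that a plain byte-wise comparison has by construction: it can only return 0 after it has examined, and found equal, every byte it was asked to compare, so a never-written region that differs can never be masked. The following is only an illustrative sketch of that property; it is not glibc's memcmp, and it says nothing about what an optimized, vectorized implementation is permitted to do under the standard.

#include <stddef.h>

/*
 * Illustrative byte-wise comparison: it returns a nonzero value as soon as
 * it observes a difference, and can return 0 only after every one of the n
 * byte pairs has been read and found equal at the moment it was read.
 */
static int bytewise_cmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;

    for (size_t i = 0; i < n; i++) {
        if (p1[i] != p2[i])
            return p1[i] < p2[i] ? -1 : 1;
    }
    return 0;
}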
* Torvald Riegel (triegel@redhat.com) wrote: > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > * Torvald Riegel (triegel@redhat.com) wrote: > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > memory is clearly different) and keep these things working, so why not > > > > do it ? > > > > > > I see two separate issues here. First, where do we draw the line, and > > > what do we guarantee. I strongly believe that programs must not have > > > data races, and that they should use atomics or other synchronization > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > example). > > > Second, do we try to keep buggy programs working. If it has no cost to > > > do so (e.g., like it might be true in this case), then doing that can > > > help to trigger less errors. But that doesn't mean the buggy programs > > > should get fixed eventually. > > > > I find it difficult to understand the boundaries of what the C library > > is allowed to do in this type of optimisation. > > > > For example, consider the following: > > > > char a[128]; > > char b[128]; > > > > put some static data in a[0-63] > > put some static data in b[0-63] > > a[64]=0; > > b[64]=0; > > start a thread doing stuff in a[65..] > > start a thread doing stuff in b[65..] > > > > if (!strcmp(a,b)) { > > /* Do something */ > > } > > > > a) Is that behaviour defined? > > Good question. I think it should be. This depends on both the data > race definition and what strcmp/strncmp/memcmp are specified to do using > the abstract machine. > > The data race definition uses memory locations as granularity, which is > in 3.14 in C11. Separate characters in an array should be separate > memory locations. > > C11 isn't very specific regarding strcmp, and just says that it > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > regarding basic_string::compare (21.4.7.9), saying that first the length > of the strings are determined, and then a strncmp is run using the > smaller of the two lengths. > > Assuming the C++ specs, only the array indices [0..64] should be > accessed by the abstract machine. So no data race with the stuff going > on in [65..). > > > b) Is it defined with strncmp(a,b,64) ? > > Yes. > > > c) Is it defined with strncmp(a,b,128)? > > Not sure. C11 says that not more than "n characters" are compared, and > characters that follow the a null character aren't compared either. > This indicates it wouldn't be different from strncmp(a,b,64) in the > particular case. > Regarding C++11, I'm not sure. The closest copies a substring > (conceptually) and then compares, but the substring copying has to > determine length of the string and then subtracting the max length. > This would do a strlen first, which wouldn't access past index 64. > Thus, should be fine too. > > > d) Is it defined with memcmp(a,b,64)? > > No data race, IMO. OK, so you mean that's defined behavior? > > e) Is it defined with memcmp(a,b,128)? > > Data race. Undefined behavior. 
OK, so it's interesting your answer to (e) is different from (c); aren't there compiler optimisations that replace some strcmp calls with memcmp calls? (I'm not sure whether it would in this case?) > > f) If I moved that boundary off a nice round % 8 mark would > > it matter? > > No difference as far as the standard is concerned. OK, and I've not looked at this x86 code in detail, but are we sure that, where the memcmp length finishes in the middle of a word, the bytes after the end can't cause a match/mismatch that would later change and be classed as undefined? > > I can imagine there may be lots of things that terminate > > a string and let other stuff happen in the allocated space > > after the end of the string in the belief that at the end > > of that string all is unknown. Andrea's case is a bit different > > in that it's the later data that's static, but that doesn't > > sound like it should change the answer as to what's allowed. > > I think it does, because the question is whether there is a data race on > the memory locations that the abstract machine would access. Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
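For reference, a compilable rendering of the example under discussion (a sketch assuming POSIX threads; the comments record the verdicts given in this thread, not wording from the standard):

#include <pthread.h>
#include <string.h>

static char a[128];
static char b[128];

/* "doing stuff" in the tail of one buffer: touches indices [65..127] only. */
static void *scribble(void *p)
{
    char *buf = p;
    for (unsigned v = 0; v < 1000000; v++)
        memset(buf + 65, (int)v, 128 - 65);
    return NULL;
}

int main(void)
{
    pthread_t ta, tb;

    memset(a, 'x', 64);  a[64] = 0;   /* static data in [0..63], NUL at 64 */
    memset(b, 'x', 64);  b[64] = 0;

    pthread_create(&ta, NULL, scribble, a);
    pthread_create(&tb, NULL, scribble, b);

    (void)strcmp(a, b);        /* (a) argued defined: accesses [0..64] only     */
    (void)strncmp(a, b, 64);   /* (b) defined                                   */
    (void)strncmp(a, b, 128);  /* (c) probably like (b): stops at the NUL       */
    (void)memcmp(a, b, 64);    /* (d) defined: no access to the raced bytes     */
    (void)memcmp(a, b, 128);   /* (e) data race on [65..): undefined behavior   */

    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}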
On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote: > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > > do it ? > > > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > > data races, and that they should use atomics or other synchronization > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > > example). > > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > > should get fixed eventually. > > > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > > > For example, consider the following: > > > > > > > > > > > > char a[128]; > > > > > > char b[128]; > > > > > > > > > > > > put some static data in a[0-63] > > > > > > put some static data in b[0-63] > > > > > > a[64]=0; > > > > > > b[64]=0; > > > > > > start a thread doing stuff in a[65..] > > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > > > if (!strcmp(a,b)) { > > > > > > /* Do something */ > > > > > > } > > > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > > > Good question. I think it should be. This depends on both the data > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > > the abstract machine. > > > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > > memory locations. > > > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > > of the strings are determined, and then a strncmp is run using the > > > > > smaller of the two lengths. > > > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > > accessed by the abstract machine. So no data race with the stuff going > > > > > on in [65..). > > > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > > > Yes. 
> > > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > > characters that follow the a null character aren't compared either. > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > > particular case. > > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > > (conceptually) and then compares, but the substring copying has to > > > > > determine length of the string and then subtracting the max length. > > > > > This would do a strlen first, which wouldn't access past index 64. > > > > > Thus, should be fine too. > > > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > > > No data race, IMO. > > > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > > > Data race. Undefined behavior. > > > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > > it matter? > > > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > > a string and let other stuff happen in the allocated space > > > > > > after the end of the string in the belief that at the end > > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > > in that it's the later data that's static, but that doesn't > > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > > > I think it does, because the question is whether there is a data race on > > > > > the memory locations that the abstract machine would access. > > > > > > > > Well given we are making examples. > > > > > > > > Assume 2 structures like this: > > > > > > > > struct test { > > > > void *pointer; > > > > char start[16]; > > > > char end[240]; > > > > } > > > > > > > > and > > > > > > > > struct test a; > > > > struct test b; > > > > > > > > memset(a.end, 1, 240); > > > > memset(b.end, 2, 240); > > > > > > > > In what case it would be expected/legal for > > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > > I see absolutely no problem in stopping short and returning a > > > positive/negative value. I just do not see how returning 0 would ever > > > make sense. > > > > It would be legal if the invocation of memcmp and at least one of the > > invocations of memset would be concurrent (ie, not ordered by > > happens-before). > > Nope, sorry , there is a misunderstanding, consider the memsets part of > initialization, *never* run in parallel with memcmp. Ah, I see. I'm not really sure what the rules are in this case; my guess would be that this is not UB, because the access is not beyond the object that is struct test. Perhaps this survey is interesting for you: https://goo.gl/2TY0H5 > > I do see that this wouldn't be an intuitive outcome if one reasons based > > on assuming some simplified (or straightforward, or natural, or whatever > > the personal opinion might be...) implementation. But a data race is > > undefined behavior, and that's where we're drawing the line. > > Sure, I can understand that is the *whole* are under consideration is > racing. But here we are talking about a large chunk of the area that is > different and *never* raced upon. Well, undefined behavior is not constrained to some "neighborhood" or such around the cause of the undefined behavior. 
It makes sense that the standard doesn't constrain it, because then it would have to guarantee the constraints and those would require reasoning about a particular implementation, which a language standard is meant to prevent :) Consider a hypothetical GPU that wants to run C code. It accepts code in a virtual ISA that requires using special atomic operations for all memory accesses that would be data races otherwise. The implementation of this GPU uses a fast JIT compiler to transform to the native ISA of the GPU. Now this JIT compiler detects in a program that a whole page is compared using memcmp (or a custom memcmp using non-atomics). It decides to move this page into local memory, assuming, because of the data-race-freedom requirement, that no other thread will write to the page. After the memcmp, it will move it back. But your program isn't data-race-free, and another thread writes to one byte in the page. This byte is then lost when the page is moved back from local memory. This is a made-up example (although such a virtual ISA is not unrealistic) and maybe this implementation would try to fail gracefully instead of just losing the update. But I think it helps show that it isn't quite clear in general which constraints would be realistic for undefined behavior. > I can accept that the C standard may still decide this is undefined > behavior even if just one small part of the total area sees concurrent > modifications, but again, you are returning "short" from the optimized > code, so it is unclear to me how you can return 0 when the memcmp code > hasn't checked all the memory, wouldn't it make more sense to return the > current position ? It is still undefined but at least it won't cause > false positives. Yes, that's the second issue: trying to fail gracefully or mask failures in buggy programs, provided the runtime overheads and maintenance costs aren't too high.
On Fri, 2015-06-19 at 19:06 +0100, Dr. David Alan Gilbert wrote: > * Torvald Riegel (triegel@redhat.com) wrote: > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > memory is clearly different) and keep these things working, so why not > > > > > do it ? > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > what do we guarantee. I strongly believe that programs must not have > > > > data races, and that they should use atomics or other synchronization > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > example). > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > do so (e.g., like it might be true in this case), then doing that can > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > should get fixed eventually. > > > > > > I find it difficult to understand the boundaries of what the C library > > > is allowed to do in this type of optimisation. > > > > > > For example, consider the following: > > > > > > char a[128]; > > > char b[128]; > > > > > > put some static data in a[0-63] > > > put some static data in b[0-63] > > > a[64]=0; > > > b[64]=0; > > > start a thread doing stuff in a[65..] > > > start a thread doing stuff in b[65..] > > > > > > if (!strcmp(a,b)) { > > > /* Do something */ > > > } > > > > > > a) Is that behaviour defined? > > > > Good question. I think it should be. This depends on both the data > > race definition and what strcmp/strncmp/memcmp are specified to do using > > the abstract machine. > > > > The data race definition uses memory locations as granularity, which is > > in 3.14 in C11. Separate characters in an array should be separate > > memory locations. > > > > C11 isn't very specific regarding strcmp, and just says that it > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > regarding basic_string::compare (21.4.7.9), saying that first the length > > of the strings are determined, and then a strncmp is run using the > > smaller of the two lengths. > > > > Assuming the C++ specs, only the array indices [0..64] should be > > accessed by the abstract machine. So no data race with the stuff going > > on in [65..). > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > Yes. > > > > > c) Is it defined with strncmp(a,b,128)? > > > > Not sure. C11 says that not more than "n characters" are compared, and > > characters that follow the a null character aren't compared either. > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > particular case. > > Regarding C++11, I'm not sure. The closest copies a substring > > (conceptually) and then compares, but the substring copying has to > > determine length of the string and then subtracting the max length. > > This would do a strlen first, which wouldn't access past index 64. > > Thus, should be fine too. 
> > > > > d) Is it defined with memcmp(a,b,64)? > > > > No data race, IMO. > > OK, so you mean that's defined behavior? Yes, there's no undefined behavior, so memcmp should do what it's specified to do in the normal case. > > > e) Is it defined with memcmp(a,b,128)? > > > > Data race. Undefined behavior. > > OK, so it's interesting your answer to (e) is different from (c); > aren't there compiler optimisations that replace some strcmp s with > memcmp s? (I'm not sure whether it would in this case?) I believe it only does that when the length of the strings is known. > > > f) If I moved that boundary off a nice round % 8 mark would > > > it matter? > > > > No difference as far as the standard is concerned. > > OK, and I've not looked at this x86 code in detail, but are we sure > that where the memcmp length finishes in the middle of a word > the bytes after the end can't cause a match/mismatch that would > later change and be classed as undefined? I haven't looked at the code either, but this would be a bug. It could also introduce data races, which the implementation must not do :)
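As a contrast, the race-free formulation mentioned earlier in the thread, atomic comparisons on canary values using relaxed memory order rather than locks, could look roughly like the following C11 sketch; the names and the single-word canary are made up for illustration:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One canary per region that other threads may update concurrently. */
static _Atomic uint64_t canary_a;
static _Atomic uint64_t canary_b;

/* Writers publish a new canary value with a relaxed atomic store... */
static void update_canary(_Atomic uint64_t *c, uint64_t v)
{
    atomic_store_explicit(c, v, memory_order_relaxed);
}

/* ...and the comparison reads both canaries atomically, so there is no
 * data race and no reliance on what memcmp does with racing bytes. */
static bool canaries_equal(void)
{
    return atomic_load_explicit(&canary_a, memory_order_relaxed) ==
           atomic_load_explicit(&canary_b, memory_order_relaxed);
}

Relaxed ordering is enough here because only the canary values themselves are compared; if the canary were meant to publish other data, a release/acquire pair would be needed instead.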
* Torvald Riegel (triegel@redhat.com) wrote: > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote: > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > > > do it ? > > > > > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > > > data races, and that they should use atomics or other synchronization > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > > > example). > > > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > > > should get fixed eventually. > > > > > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > > > > > For example, consider the following: > > > > > > > > > > > > > > char a[128]; > > > > > > > char b[128]; > > > > > > > > > > > > > > put some static data in a[0-63] > > > > > > > put some static data in b[0-63] > > > > > > > a[64]=0; > > > > > > > b[64]=0; > > > > > > > start a thread doing stuff in a[65..] > > > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > > > > > if (!strcmp(a,b)) { > > > > > > > /* Do something */ > > > > > > > } > > > > > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > > > > > Good question. I think it should be. This depends on both the data > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > > > the abstract machine. > > > > > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > > > memory locations. > > > > > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > > > of the strings are determined, and then a strncmp is run using the > > > > > > smaller of the two lengths. > > > > > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > > > accessed by the abstract machine. 
So no data race with the stuff going > > > > > > on in [65..). > > > > > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > > > > > Yes. > > > > > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > > > characters that follow the a null character aren't compared either. > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > > > particular case. > > > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > > > (conceptually) and then compares, but the substring copying has to > > > > > > determine length of the string and then subtracting the max length. > > > > > > This would do a strlen first, which wouldn't access past index 64. > > > > > > Thus, should be fine too. > > > > > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > > > > > No data race, IMO. > > > > > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > > > > > Data race. Undefined behavior. > > > > > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > > > it matter? > > > > > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > > > a string and let other stuff happen in the allocated space > > > > > > > after the end of the string in the belief that at the end > > > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > > > in that it's the later data that's static, but that doesn't > > > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > > > > > I think it does, because the question is whether there is a data race on > > > > > > the memory locations that the abstract machine would access. > > > > > > > > > > Well given we are making examples. > > > > > > > > > > Assume 2 structures like this: > > > > > > > > > > struct test { > > > > > void *pointer; > > > > > char start[16]; > > > > > char end[240]; > > > > > } > > > > > > > > > > and > > > > > > > > > > struct test a; > > > > > struct test b; > > > > > > > > > > memset(a.end, 1, 240); > > > > > memset(b.end, 2, 240); > > > > > > > > > > In what case it would be expected/legal for > > > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > > > I see absolutely no problem in stopping short and returning a > > > > positive/negative value. I just do not see how returning 0 would ever > > > > make sense. > > > > > > It would be legal if the invocation of memcmp and at least one of the > > > invocations of memset would be concurrent (ie, not ordered by > > > happens-before). > > > > Nope, sorry , there is a misunderstanding, consider the memsets part of > > initialization, *never* run in parallel with memcmp. > > Ah, I see. I'm not really sure what the rules are in this case; my > guess would be that this is not UB, because the access is not beyond the > object that is struct test. > > Perhaps this survey is interesting for you: https://goo.gl/2TY0H5 > > > > I do see that this wouldn't be an intuitive outcome if one reasons based > > > on assuming some simplified (or straightforward, or natural, or whatever > > > the personal opinion might be...) implementation. But a data race is > > > undefined behavior, and that's where we're drawing the line. 
> > > > Sure, I can understand that is the *whole* are under consideration is > > racing. But here we are talking about a large chunk of the area that is > > different and *never* raced upon. > > Well, undefined behavior is not constrained to some "neighborhood" or > such around the cause of the undefined behavior. > It makes sense that the standard doesn't constrain it, because then it > would have to guarantee the constraints and those would require > reasoning about a particular implementation, which a language standard > is meant to prevent :) > Consider a hypothetical GPU that wants to run C code. It accepts code > in a virtual ISA that requires using special atomic operations for all > memory accesses that would be data races otherwise. The implementation > of this GPU uses a fast JIT compiler to transform to the native ISA of > the GPU. Now this JIT ocmpiler detects in a program that a whole page > is compared using memcmp (or a custom memcmp using non-atomics). It > decides to move this page into local memory, assuming because of the > data-race-freedom requirement that no other thread will write to the > page. After the memcmp, it will move it back. But your program isn't > data-race-free, and another thread writes to one byte in the page. This > byte is then lost when the page is moved back from local memory. > > This is a made-up example (although such a virtual ISA is not > unrealistic) and maybe this implementation would try to fail gracefully > instead of just losing the update. But I think it helps show that it > isn't quite clear in general which constraints would be realistic for > undefined behavior.

Isn't that illegal, since the parameters to memcmp are 'const void *', so it shouldn't be written?

Going back to an earlier part of the thread: I came up with a reason why I really wouldn't want these routines to 'assert()'.

In emulation/virtualisation we generally haven't got a clue what our guests are doing, and we don't want to stop them most of the time. So there are a couple of fairly common types of structure that break these rules:

1) (e.g. framebuffer emulation)
   (Many untrusted threads writing to memory)
   One control thread
   do {
      read memory -> display it
   } while (!end)
   barrier;
   display it once more

That 'display it' can be quite reasonably expected to get inconsistent data, but at the higher level it will be safe because we know we're always going to do one more 'display it' at the end when things settle down.

Similarly:

2) (VM migration)

   Many untrusted threads
   do {
      n=some-value;
      change some data[n];
      write_barrier();
      atomically_set(dirty[n]);
   }

   One control thread
   do {
      for (all i) {
         read_barrier();
         if (atomic_read(dirty[i])) {
            send(data[n]);
         }
      }
   } while (!end)
   barrier();
   for (all i) {
      if (atomic_read(dirty[i])) {
         send(data[n]);
      }
   }

and often that 'send' process involves looking for compressible data; I had a look and I don't think we're currently using libc routines in there on the data for the comparison in (2), but I'm not 100% sure about (1).

In these cases we expect the data to be inconsistent for some period of time, but guarantee the consistency since we know we're going to read it again at some better time. We have no sensible way to stop all the guest threads writing to memory.
Dave > > I can accept that the C standard may still decide this is undefined > > behavior even if just one small part of the total area sees concurrent > > modifications, but again, you are returning "short" from the optimized > > code, so it is unclear to me how you can return 0 when the memcmp code > > hasn't checked all the memory, wouldn't it make more sense to return the > > current position ? It is still undefined but at least it won't cause > > false positives. > > Yes, that's the second issue: Trying fail gracefully or mask failures in > buggy programs, provided the runtime overheads and maintenance costs > aren't too high. > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote: > * Torvald Riegel (triegel@redhat.com) wrote: > > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote: > > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > > > > do it ? > > > > > > > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > > > > data races, and that they should use atomics or other synchronization > > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > > > > example). > > > > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > > > > should get fixed eventually. > > > > > > > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > > > > > > > For example, consider the following: > > > > > > > > > > > > > > > > char a[128]; > > > > > > > > char b[128]; > > > > > > > > > > > > > > > > put some static data in a[0-63] > > > > > > > > put some static data in b[0-63] > > > > > > > > a[64]=0; > > > > > > > > b[64]=0; > > > > > > > > start a thread doing stuff in a[65..] > > > > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > > > > > > > if (!strcmp(a,b)) { > > > > > > > > /* Do something */ > > > > > > > > } > > > > > > > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > > > > > > > Good question. I think it should be. This depends on both the data > > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > > > > the abstract machine. > > > > > > > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > > > > memory locations. > > > > > > > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > > > > "compares the string[s]" (7.24.4.2). 
C++14 is a bit more specific > > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > > > > of the strings are determined, and then a strncmp is run using the > > > > > > > smaller of the two lengths. > > > > > > > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > > > > accessed by the abstract machine. So no data race with the stuff going > > > > > > > on in [65..). > > > > > > > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > > > > characters that follow the a null character aren't compared either. > > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > > > > particular case. > > > > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > > > > (conceptually) and then compares, but the substring copying has to > > > > > > > determine length of the string and then subtracting the max length. > > > > > > > This would do a strlen first, which wouldn't access past index 64. > > > > > > > Thus, should be fine too. > > > > > > > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > > > > > > > No data race, IMO. > > > > > > > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > > > > > > > Data race. Undefined behavior. > > > > > > > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > > > > it matter? > > > > > > > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > > > > a string and let other stuff happen in the allocated space > > > > > > > > after the end of the string in the belief that at the end > > > > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > > > > in that it's the later data that's static, but that doesn't > > > > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > > > > > > > I think it does, because the question is whether there is a data race on > > > > > > > the memory locations that the abstract machine would access. > > > > > > > > > > > > Well given we are making examples. > > > > > > > > > > > > Assume 2 structures like this: > > > > > > > > > > > > struct test { > > > > > > void *pointer; > > > > > > char start[16]; > > > > > > char end[240]; > > > > > > } > > > > > > > > > > > > and > > > > > > > > > > > > struct test a; > > > > > > struct test b; > > > > > > > > > > > > memset(a.end, 1, 240); > > > > > > memset(b.end, 2, 240); > > > > > > > > > > > > In what case it would be expected/legal for > > > > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > > > > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > > > > I see absolutely no problem in stopping short and returning a > > > > > positive/negative value. I just do not see how returning 0 would ever > > > > > make sense. > > > > > > > > It would be legal if the invocation of memcmp and at least one of the > > > > invocations of memset would be concurrent (ie, not ordered by > > > > happens-before). > > > > > > Nope, sorry , there is a misunderstanding, consider the memsets part of > > > initialization, *never* run in parallel with memcmp. > > > > Ah, I see. 
I'm not really sure what the rules are in this case; my > > guess would be that this is not UB, because the access is not beyond the > > object that is struct test. > > > > Perhaps this survey is interesting for you: https://goo.gl/2TY0H5 > > > > > > I do see that this wouldn't be an intuitive outcome if one reasons based > > > > on assuming some simplified (or straightforward, or natural, or whatever > > > > the personal opinion might be...) implementation. But a data race is > > > > undefined behavior, and that's where we're drawing the line. > > > > > > Sure, I can understand that is the *whole* are under consideration is > > > racing. But here we are talking about a large chunk of the area that is > > > different and *never* raced upon. > > > > Well, undefined behavior is not constrained to some "neighborhood" or > > such around the cause of the undefined behavior. > > It makes sense that the standard doesn't constrain it, because then it > > would have to guarantee the constraints and those would require > > reasoning about a particular implementation, which a language standard > > is meant to prevent :) > > Consider a hypothetical GPU that wants to run C code. It accepts code > > in a virtual ISA that requires using special atomic operations for all > > memory accesses that would be data races otherwise. The implementation > > of this GPU uses a fast JIT compiler to transform to the native ISA of > > the GPU. Now this JIT ocmpiler detects in a program that a whole page > > is compared using memcmp (or a custom memcmp using non-atomics). It > > decides to move this page into local memory, assuming because of the > > data-race-freedom requirement that no other thread will write to the > > page. After the memcmp, it will move it back. But your program isn't > > data-race-free, and another thread writes to one byte in the page. This > > byte is then lost when the page is moved back from local memory. > > > > This is a made-up example (although such a virtual ISA is not > > unrealistic) and maybe this implementation would try to fail gracefully > > instead of just losing the update. But I think it helps show that it > > isn't quite clear in general which constraints would be realistic for > > undefined behavior. > > Isn't that's illegal since the parameters to memcmp are 'const void *', > so it shouldn't be written? That's what your program guarantees, but it's not automatically a constraint on what the implementation does. If the compiler can detect that the memory is completely managed and provided by the implementation (e.g., never accessed through volatile accesses, never escaping to functions the compiler isn't sure won't access it through volatile, ...), then the additional writes aren't visible in terms of observable behavior of the abstract machine. > Going back to an earlier part of the thread; I came up with a reason > why I really wouldn't want these routines to 'assert()'. So, put generally, you don't want to fail fast because there is code that is not data-race-free but seems to work right now? I think that this would be a convincing reason. > In emulation/virtualisation we generally haven't got a clue what our > guests are doing, and we don't want to stop them most of the time. > So there are a couple of failry common types of structure that > break these rules: > > 1) (e.g. 
framebuffer emulation) > (Many untrusted threads writing to memory) > One control thread > do { > read memory -> display it > } while (!end) > barrier; > display it once more > > That 'display it' can be quite reasonably expected to get > inconsistent data, but at the higher level it will be safe because > we know we're always going to do one more 'display it' at the end > when things settle down. Using atomic (memory_order_relaxed) accesses for reads of "memory" (and "end", I suppose) would clean this up. The barrier is a memory_order_acquire fence, I suppose? > similarly. > 2) (VM migration) > > Many untrusted threads > do { > n=some-value; > change some data[n]; > write_barrier(); > atomically_set(dirty[n]); Same as above, using relaxed atomic accesses for data and a memory_order_release store for setting dirty (or use relaxed MO and keep the barrier / release fence) would clean this up. > } > > One control thread > do { > for (all i) { > read_barrier(); Hmm, why is the barrier here, not between the loads from dirty and data? > if (atomic_read(dirty[i])) { > send(data[n]); > } > } > } while (!end) > barrier(); > for (all i) { > if (atomic_read(dirty[i])) { > send(data[n]); > } > } > > and often that 'send' process involves looking for compressable data; > I had a look and I don't think we're currently using libc routines > in there on the data for the comparison in (2), but I'm not 100% sure > about (1). > > In these cases we expect the data to be inconsistent for some period of > time, but guarantee the consistency since we know we're going to read it > again at some better time. We have no sensible way to stop all the > guest threads writing to memory. The uncontrolled writes of the guest threads are not really covered by the memory model, but if we assume that they are effectively a collection of memory_order_relaxed atomic stores of some granularity, then you can "avoid" the data race by using relaxed atomic loads at the consumer side too. And then it's just using compatible fences and atomic accesses for "end" and "dirty" to make this work cleanly.
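A sketch of pattern (2) along the lines suggested above: relaxed atomic accesses for the data, a release store when setting the dirty flag, and an acquire load of the flag on the consumer side, so that the ordering sits between the dirty read and the data read. The types and names are simplified placeholders, not the actual migration code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 1024

static _Atomic unsigned char data[NSLOTS];   /* stand-in for the guest data */
static atomic_bool dirty[NSLOTS];

/* Untrusted/producer side: write the data, then publish the dirty flag. */
static void producer_update(size_t n, unsigned char v)
{
    atomic_store_explicit(&data[n], v, memory_order_relaxed);
    atomic_store_explicit(&dirty[n], true, memory_order_release);
}

/* Control/consumer side: an acquire load of dirty[n] guarantees that the
 * data store made before the matching release store is visible here. */
static int consumer_poll(size_t n, unsigned char *out)
{
    if (!atomic_load_explicit(&dirty[n], memory_order_acquire))
        return 0;
    *out = atomic_load_explicit(&data[n], memory_order_relaxed);
    return 1;
}

As in the pseudocode above, the consumer does not clear the flag; a final pass after the writers have stopped is still needed to pick up anything published after the last poll.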
* Torvald Riegel (triegel@redhat.com) wrote: > On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote: > > * Torvald Riegel (triegel@redhat.com) wrote: > > > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote: > > > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > > > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > > > > > do it ? > > > > > > > > > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > > > > > data races, and that they should use atomics or other synchronization > > > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > > > > > example). > > > > > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > > > > > should get fixed eventually. > > > > > > > > > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > > > > > > > > > For example, consider the following: > > > > > > > > > > > > > > > > > > char a[128]; > > > > > > > > > char b[128]; > > > > > > > > > > > > > > > > > > put some static data in a[0-63] > > > > > > > > > put some static data in b[0-63] > > > > > > > > > a[64]=0; > > > > > > > > > b[64]=0; > > > > > > > > > start a thread doing stuff in a[65..] > > > > > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > > > > > > > > > if (!strcmp(a,b)) { > > > > > > > > > /* Do something */ > > > > > > > > > } > > > > > > > > > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > > > > > > > > > Good question. I think it should be. This depends on both the data > > > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > > > > > the abstract machine. > > > > > > > > > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > > > > > memory locations. > > > > > > > > > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > > > > > "compares the string[s]" (7.24.4.2). 
C++14 is a bit more specific > > > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > > > > > of the strings are determined, and then a strncmp is run using the > > > > > > > > smaller of the two lengths. > > > > > > > > > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > > > > > accessed by the abstract machine. So no data race with the stuff going > > > > > > > > on in [65..). > > > > > > > > > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > > > > > characters that follow the a null character aren't compared either. > > > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > > > > > particular case. > > > > > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > > > > > (conceptually) and then compares, but the substring copying has to > > > > > > > > determine length of the string and then subtracting the max length. > > > > > > > > This would do a strlen first, which wouldn't access past index 64. > > > > > > > > Thus, should be fine too. > > > > > > > > > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > > > > > > > > > No data race, IMO. > > > > > > > > > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > > > > > > > > > Data race. Undefined behavior. > > > > > > > > > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > > > > > it matter? > > > > > > > > > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > > > > > a string and let other stuff happen in the allocated space > > > > > > > > > after the end of the string in the belief that at the end > > > > > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > > > > > in that it's the later data that's static, but that doesn't > > > > > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > > > > > > > > > I think it does, because the question is whether there is a data race on > > > > > > > > the memory locations that the abstract machine would access. > > > > > > > > > > > > > > Well given we are making examples. > > > > > > > > > > > > > > Assume 2 structures like this: > > > > > > > > > > > > > > struct test { > > > > > > > void *pointer; > > > > > > > char start[16]; > > > > > > > char end[240]; > > > > > > > } > > > > > > > > > > > > > > and > > > > > > > > > > > > > > struct test a; > > > > > > > struct test b; > > > > > > > > > > > > > > memset(a.end, 1, 240); > > > > > > > memset(b.end, 2, 240); > > > > > > > > > > > > > > In what case it would be expected/legal for > > > > > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > > > > > > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > > > > > I see absolutely no problem in stopping short and returning a > > > > > > positive/negative value. I just do not see how returning 0 would ever > > > > > > make sense. > > > > > > > > > > It would be legal if the invocation of memcmp and at least one of the > > > > > invocations of memset would be concurrent (ie, not ordered by > > > > > happens-before). 
> > > > > > > > Nope, sorry, there is a misunderstanding, consider the memsets part of > > > > initialization, *never* run in parallel with memcmp. > > > Ah, I see. I'm not really sure what the rules are in this case; my > > > guess would be that this is not UB, because the access is not beyond the > > > object that is struct test. > > > > > > Perhaps this survey is interesting for you: https://goo.gl/2TY0H5 > > > > > > > > I do see that this wouldn't be an intuitive outcome if one reasons based > > > > > on assuming some simplified (or straightforward, or natural, or whatever > > > > > the personal opinion might be...) implementation. But a data race is > > > > > undefined behavior, and that's where we're drawing the line. > > > > > > > > Sure, I can understand that if the *whole* area under consideration is > > > > racing. But here we are talking about a large chunk of the area that is > > > > different and *never* raced upon. > > > > > > Well, undefined behavior is not constrained to some "neighborhood" or > > > such around the cause of the undefined behavior. > > > It makes sense that the standard doesn't constrain it, because then it > > > would have to guarantee the constraints and those would require > > > reasoning about a particular implementation, which a language standard > > > is meant to prevent :) > > > Consider a hypothetical GPU that wants to run C code. It accepts code > > > in a virtual ISA that requires using special atomic operations for all > > > memory accesses that would be data races otherwise. The implementation > > > of this GPU uses a fast JIT compiler to transform to the native ISA of > > > the GPU. Now this JIT compiler detects in a program that a whole page > > > is compared using memcmp (or a custom memcmp using non-atomics). It > > > decides to move this page into local memory, assuming because of the > > > data-race-freedom requirement that no other thread will write to the > > > page. After the memcmp, it will move it back. But your program isn't > > > data-race-free, and another thread writes to one byte in the page. This > > > byte is then lost when the page is moved back from local memory. > > > > > > This is a made-up example (although such a virtual ISA is not > > > unrealistic) and maybe this implementation would try to fail gracefully > > > instead of just losing the update. But I think it helps show that it > > > isn't quite clear in general which constraints would be realistic for > > > undefined behavior. > > > > Isn't that illegal, since the parameters to memcmp are 'const void *', > > so it shouldn't be written? > > That's what your program guarantees, but it's not automatically a > constraint on what the implementation does. If the compiler can detect > that the memory is completely managed and provided by the implementation > (e.g., never accessed through volatile accesses, never escaping to > functions the compiler isn't sure won't access it through > volatile, ...), then the additional writes aren't visible in terms of > observable behavior of the abstract machine. However, this is a library interface; the implementation of memcmp in the library can't get that much information from the compiler, so the standard memcmp implementation can't do that write because, given that it's const, the address space it's comparing may be read-only. > > Going back to an earlier part of the thread; I came up with a reason > > why I really wouldn't want these routines to 'assert()'. 
> > So, put generally, you don't want to fail fast because there is code > that is not data-race-free but seems to work right now? > I think that this would be a convincing reason. :-) It's not a 'seems to work' - the design of both of these algorithms is specifically allowing data races to happen, with the intent of containing the effect of the races. The racy reads are expected to produce indeterminate data, but the algorithm knows that, if that happens, that data is always read again in the future at a time when a race can't happen. This is significantly faster than trying to stop all the other threads whenever the read is taking place. > > In emulation/virtualisation we generally haven't got a clue what our > > guests are doing, and we don't want to stop them most of the time. > > So there are a couple of fairly common types of structure that > > break these rules: > > > > 1) (e.g. framebuffer emulation) > > (Many untrusted threads writing to memory) > > One control thread > > do { > > read memory -> display it > > } while (!end) > > barrier; > > display it once more > > > > That 'display it' can be quite reasonably expected to get > > inconsistent data, but at the higher level it will be safe because > > we know we're always going to do one more 'display it' at the end > > when things settle down. > > Using atomic (memory_order_relaxed) accesses for reads of "memory" (and > "end", I suppose) would clean this up. 1a) Note that this data isn't necessarily word aligned etc - so I'm not sure why using atomic accesses would help here. Also, my problem here is that I have no efficient C library call to use that is legal, because by your definition of 'undefined' any library call might assert or do anything else - even if I don't care about the data they produce. > The barrier is a > memory_order_acquire fence, I suppose? 1b) Actually I've not found the barrier yet; I was just boiling these examples down from code I know we've got - I'm more familiar with the 2nd case. > > similarly. > > 2) (VM migration) > > > > Many untrusted threads > > do { > > n=some-value; > > change some data[n]; > > write_barrier(); > > atomically_set(dirty[n]); > > Same as above, using relaxed atomic accesses for data and a > memory_order_release store for setting dirty (or use relaxed MO and keep > the barrier / release fence) would clean this up. Note the thing changing data[n] is totally out of the control of our code; it's just normal applications, user land code; we've got no control over it at all - it's using normal writes, it could be malicious. The marking of dirty[n] is something that the kernel does for us, and we do get control over. > > } > > > > One control thread > > do { > > for (all i) { > > read_barrier(); > > Hmm, why is the barrier here, not between the loads from dirty and data? Yes, that probably would be best. > > if (atomic_read(dirty[i])) { > > send(data[n]); > > } > > } > > } while (!end) > > barrier(); > > for (all i) { > > if (atomic_read(dirty[i])) { > > send(data[n]); > > } > > } > > > > and often that 'send' process involves looking for compressible data; > > I had a look and I don't think we're currently using libc routines > > in there on the data for the comparison in (2), but I'm not 100% sure > > about (1). > > > > In these cases we expect the data to be inconsistent for some period of > > time, but guarantee the consistency since we know we're going to read it > > again at some better time. We have no sensible way to stop all the > > guest threads writing to memory. 
> > The uncontrolled writes of the guest threads are not really covered by > the memory model, but if we assume that they are effectively a > collection of memory_order_relaxed atomic stores of some granularity, > then you can "avoid" the data race if using relaxed atomic loads at the > consumer side too. And then it's just using compatible fences and > atomic accesses for "end" and "dirty" to make this work cleanly. Again, my problem, and why I'm included in this thread, is the use of C library routines; if I want to copy that data[n] as part of the send, I'd expect to be able to use a standard memcpy() call, and if I want to see whether it's the same as the previous version I sent, I'd want memcmp() - hence the reason I mention it. (Actually, the way our compression of data[n] works, the comparison in this case doesn't quite fit a memcmp or memchr.) In short, I want the benefit of the highly optimised library routines, knowing that the actual results they return for data that's changing under them might be wrong, but with the knowledge that they're not going to blow up. Hence why I changed my mind and said assert() is unreasonable. Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
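One way to get the "won't blow up" property asked for above, without relying on the libc routines, is a byte-wise comparison built on relaxed atomic loads via the GCC/clang __atomic builtins. A minimal sketch (racy_memcmp is an invented name, and this is of course far slower than memcmp-sse4.S):

  #include <stddef.h>

  /* Sketch: compare n bytes, reading each byte exactly once with a
   * relaxed atomic load. Bytes that differ and never change during the
   * call are reported as a difference; bytes being modified concurrently
   * may compare either way, which the caller tolerates by re-checking
   * once the writers have quiesced. */
  static int racy_memcmp(const void *s1, const void *s2, size_t n)
  {
      const unsigned char *a = s1;
      const unsigned char *b = s2;

      for (size_t i = 0; i < n; i++) {
          unsigned char ca = __atomic_load_n(&a[i], __ATOMIC_RELAXED);
          unsigned char cb = __atomic_load_n(&b[i], __ATOMIC_RELAXED);
          if (ca != cb)
              return ca < cb ? -1 : 1;
      }
      return 0;
  }

On the writer side the guest's plain stores are still outside the C11 model, so this is only race-free in the formal sense if the writers also use atomic stores; in practice its main effect is to stop the compiler and the library from assuming the buffers are stable during the call.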
On Tue, 2015-06-23 at 10:02 +0100, Dr. David Alan Gilbert wrote: > * Torvald Riegel (triegel@redhat.com) wrote: > > On Mon, 2015-06-22 at 18:43 +0100, Dr. David Alan Gilbert wrote: > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > On Fri, 2015-06-19 at 13:55 -0400, Simo Sorce wrote: > > > > > On Fri, 2015-06-19 at 19:40 +0200, Torvald Riegel wrote: > > > > > > On Fri, 2015-06-19 at 13:18 -0400, Simo Sorce wrote: > > > > > > > On Fri, 2015-06-19 at 13:16 -0400, Simo Sorce wrote: > > > > > > > > On Fri, 2015-06-19 at 18:56 +0200, Torvald Riegel wrote: > > > > > > > > > On Fri, 2015-06-19 at 16:59 +0100, Dr. David Alan Gilbert wrote: > > > > > > > > > > * Torvald Riegel (triegel@redhat.com) wrote: > > > > > > > > > > > On Fri, 2015-06-19 at 10:42 -0400, Simo Sorce wrote: > > > > > > > > > > > > On Fri, 2015-06-19 at 16:07 +0200, Andrea Arcangeli wrote: > > > > > > > > > > > > > On Fri, Jun 19, 2015 at 09:38:51AM -0400, Carlos O'Donell wrote: > > > > > > > > > > > > > > I agree with aborting, but only as long as the hot path's performance > > > > > > > > > > > > > > is not impacted and I haven't thought about how to do that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Clearly people are better using atomic comparisons on canary values > > > > > > > > > > > > instead, but it seem easy to avoid false positives (returning 0 when > > > > > > > > > > > > memory is clearly different) and keep these things working, so why not > > > > > > > > > > > > do it ? > > > > > > > > > > > > > > > > > > > > > > I see two separate issues here. First, where do we draw the line, and > > > > > > > > > > > what do we guarantee. I strongly believe that programs must not have > > > > > > > > > > > data races, and that they should use atomics or other synchronization > > > > > > > > > > > properly (this doesn't mean locks, but relaxed memory order atomics, for > > > > > > > > > > > example). > > > > > > > > > > > Second, do we try to keep buggy programs working. If it has no cost to > > > > > > > > > > > do so (e.g., like it might be true in this case), then doing that can > > > > > > > > > > > help to trigger less errors. But that doesn't mean the buggy programs > > > > > > > > > > > should get fixed eventually. > > > > > > > > > > > > > > > > > > > > I find it difficult to understand the boundaries of what the C library > > > > > > > > > > is allowed to do in this type of optimisation. > > > > > > > > > > > > > > > > > > > > For example, consider the following: > > > > > > > > > > > > > > > > > > > > char a[128]; > > > > > > > > > > char b[128]; > > > > > > > > > > > > > > > > > > > > put some static data in a[0-63] > > > > > > > > > > put some static data in b[0-63] > > > > > > > > > > a[64]=0; > > > > > > > > > > b[64]=0; > > > > > > > > > > start a thread doing stuff in a[65..] > > > > > > > > > > start a thread doing stuff in b[65..] > > > > > > > > > > > > > > > > > > > > if (!strcmp(a,b)) { > > > > > > > > > > /* Do something */ > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > a) Is that behaviour defined? > > > > > > > > > > > > > > > > > > Good question. I think it should be. This depends on both the data > > > > > > > > > race definition and what strcmp/strncmp/memcmp are specified to do using > > > > > > > > > the abstract machine. > > > > > > > > > > > > > > > > > > The data race definition uses memory locations as granularity, which is > > > > > > > > > in 3.14 in C11. Separate characters in an array should be separate > > > > > > > > > memory locations. 
> > > > > > > > > > > > > > > > > > C11 isn't very specific regarding strcmp, and just says that it > > > > > > > > > "compares the string[s]" (7.24.4.2). C++14 is a bit more specific > > > > > > > > > regarding basic_string::compare (21.4.7.9), saying that first the length > > > > > > > > > of the strings are determined, and then a strncmp is run using the > > > > > > > > > smaller of the two lengths. > > > > > > > > > > > > > > > > > > Assuming the C++ specs, only the array indices [0..64] should be > > > > > > > > > accessed by the abstract machine. So no data race with the stuff going > > > > > > > > > on in [65..). > > > > > > > > > > > > > > > > > > > b) Is it defined with strncmp(a,b,64) ? > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > c) Is it defined with strncmp(a,b,128)? > > > > > > > > > > > > > > > > > > Not sure. C11 says that not more than "n characters" are compared, and > > > > > > > > > characters that follow the a null character aren't compared either. > > > > > > > > > This indicates it wouldn't be different from strncmp(a,b,64) in the > > > > > > > > > particular case. > > > > > > > > > Regarding C++11, I'm not sure. The closest copies a substring > > > > > > > > > (conceptually) and then compares, but the substring copying has to > > > > > > > > > determine length of the string and then subtracting the max length. > > > > > > > > > This would do a strlen first, which wouldn't access past index 64. > > > > > > > > > Thus, should be fine too. > > > > > > > > > > > > > > > > > > > d) Is it defined with memcmp(a,b,64)? > > > > > > > > > > > > > > > > > > No data race, IMO. > > > > > > > > > > > > > > > > > > > e) Is it defined with memcmp(a,b,128)? > > > > > > > > > > > > > > > > > > Data race. Undefined behavior. > > > > > > > > > > > > > > > > > > > f) If I moved that boundary off a nice round % 8 mark would > > > > > > > > > > it matter? > > > > > > > > > > > > > > > > > > No difference as far as the standard is concerned. > > > > > > > > > > > > > > > > > > > I can imagine there may be lots of things that terminate > > > > > > > > > > a string and let other stuff happen in the allocated space > > > > > > > > > > after the end of the string in the belief that at the end > > > > > > > > > > of that string all is unknown. Andrea's case is a bit different > > > > > > > > > > in that it's the later data that's static, but that doesn't > > > > > > > > > > sound like it should change the answer as to what's allowed. > > > > > > > > > > > > > > > > > > I think it does, because the question is whether there is a data race on > > > > > > > > > the memory locations that the abstract machine would access. > > > > > > > > > > > > > > > > Well given we are making examples. > > > > > > > > > > > > > > > > Assume 2 structures like this: > > > > > > > > > > > > > > > > struct test { > > > > > > > > void *pointer; > > > > > > > > char start[16]; > > > > > > > > char end[240]; > > > > > > > > } > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > struct test a; > > > > > > > > struct test b; > > > > > > > > > > > > > > > > memset(a.end, 1, 240); > > > > > > > > memset(b.end, 2, 240); > > > > > > > > > > > > > > > > In what case it would be expected/legal for > > > > > > > > memcmp(a.start, b.start, 256); to ever return 0 ? > > > > > > > > > > > > > > Just to be clear, if there are data-races in [ab].pointer or [as].start > > > > > > > I see absolutely no problem in stopping short and returning a > > > > > > > positive/negative value. 
I just do not see how returning 0 would ever > > > > > > > make sense. > > > > > > > > > > > > It would be legal if the invocation of memcmp and at least one of the > > > > > > invocations of memset would be concurrent (ie, not ordered by > > > > > > happens-before). > > > > > > > > > > Nope, sorry , there is a misunderstanding, consider the memsets part of > > > > > initialization, *never* run in parallel with memcmp. > > > > > > > > Ah, I see. I'm not really sure what the rules are in this case; my > > > > guess would be that this is not UB, because the access is not beyond the > > > > object that is struct test. > > > > > > > > Perhaps this survey is interesting for you: https://goo.gl/2TY0H5 > > > > > > > > > > I do see that this wouldn't be an intuitive outcome if one reasons based > > > > > > on assuming some simplified (or straightforward, or natural, or whatever > > > > > > the personal opinion might be...) implementation. But a data race is > > > > > > undefined behavior, and that's where we're drawing the line. > > > > > > > > > > Sure, I can understand that is the *whole* are under consideration is > > > > > racing. But here we are talking about a large chunk of the area that is > > > > > different and *never* raced upon. > > > > > > > > Well, undefined behavior is not constrained to some "neighborhood" or > > > > such around the cause of the undefined behavior. > > > > It makes sense that the standard doesn't constrain it, because then it > > > > would have to guarantee the constraints and those would require > > > > reasoning about a particular implementation, which a language standard > > > > is meant to prevent :) > > > > Consider a hypothetical GPU that wants to run C code. It accepts code > > > > in a virtual ISA that requires using special atomic operations for all > > > > memory accesses that would be data races otherwise. The implementation > > > > of this GPU uses a fast JIT compiler to transform to the native ISA of > > > > the GPU. Now this JIT ocmpiler detects in a program that a whole page > > > > is compared using memcmp (or a custom memcmp using non-atomics). It > > > > decides to move this page into local memory, assuming because of the > > > > data-race-freedom requirement that no other thread will write to the > > > > page. After the memcmp, it will move it back. But your program isn't > > > > data-race-free, and another thread writes to one byte in the page. This > > > > byte is then lost when the page is moved back from local memory. > > > > > > > > This is a made-up example (although such a virtual ISA is not > > > > unrealistic) and maybe this implementation would try to fail gracefully > > > > instead of just losing the update. But I think it helps show that it > > > > isn't quite clear in general which constraints would be realistic for > > > > undefined behavior. > > > > > > Isn't that's illegal since the parameters to memcmp are 'const void *', > > > so it shouldn't be written? > > > > That's what your program guarantees, but it's not automatically a > > constraint on what the implementation does. If the compiler can detect > > that the memory is completely managed and provided by the implementation > > (e.g., never accessed through volatile accesses, never escaping to > > functions the compiler isn't sure won't access it through > > volatile, ...), then the additional writes aren't visible in terms of > > observable behavior of the abstract machine. 
> > However, this is a library interface; the implentation of memcmp in the > library can't get that much information from the compiler, so the standard > memcmp implementation can't do that write because, given that it's const, > that the address space it's comparing may be read-only. I'm assuming in my example that the compiler isn't calling an opaque memcmp, but knows about the semantics of memcmp and uses a built-in memcmp or such. > > > Going back to an earlier part of the thread; I came up with a reason > > > why I really wouldn't want these routines to 'assert()'. > > > > So, put generally, you don't want to fail fast because there is code > > that is not data-race-free but seems to work right now? > > I think that this would be a convincing reason. > > :-) It's not a 'seems to work' It is not guaranteed by the language standards to work. There's no explicit guarantee by the compiler either. Therefore I'm putting this in the "seems to work" category, which isn't meant to be saying that it doesn't, or would be always fragile in practice. I'm just trying to be clear here regarding when we have guaranteed behavior (eg, by the language standard), and when we don't. > - the design of both of these algorithms > is specifically allowing data races to happen on the intent of containing > the effect of the races. The racy reads are expected to produce indeterminate > data but the algorithm knows that if that happens that data is always read > again in the future at a time when a race can't happen. This is significantly > faster than trying to stop all the other threads whenever the read is > taking place. Let's try to not overload the term "race". From the language's memory model perspective, data races are cases when the program is not expressing concurrent code correctly. Using atomics, for example, avoids this because you tell the compiler that the accesses can be concurrent with accesses by other threads. Avoiding data races does not mean you need to use coarser-grain mutual exclusion such as through using locks. At higher levels of abstractions, there can still be race conditions in the code, of course. But as you say, you're taking care of these. (Just to be clear, I'm nitpicking on the terminology here because IMO, we need to be clear what kind of "race" we are talking about (e.g., whether it's at the higher level of an algorithm or at lower levels where the compiler is involved). I do understand what you're trying to do in these algorithms, and how you want to synchronize.) > > > In emulation/virtualisation we generally haven't got a clue what our > > > guests are doing, and we don't want to stop them most of the time. > > > So there are a couple of failry common types of structure that > > > break these rules: > > > > > > 1) (e.g. framebuffer emulation) > > > (Many untrusted threads writing to memory) > > > One control thread > > > do { > > > read memory -> display it > > > } while (!end) > > > barrier; > > > display it once more > > > > > > That 'display it' can be quite reasonably expected to get > > > inconsistent data, but at the higher level it will be safe because > > > we know we're always going to do one more 'display it' at the end > > > when things settle down. > > > > Using atomic (memory_order_relaxed) accesses for reads of "memory" (and > > "end", I suppose) would clean this up. > > 1a) Note that this data isn't necessarily word aligned etc - so I'm not > sure why using atomic accesses would help here. 
Also, my problem here > is that I have no efficient C library call to use that is legal > because by your definition of 'undefined' any library call might > assert or do anything else - even if I don't care about the data > they produce. I agree that what's currently provided by the standard isn't quite sufficient to cover this use case. I also think that it's an important use case, and so we should have some solution for that that is specified to work correctly (either in the standard, or as a GNU extension). Specifically, we want to snapshot memory in a data-race-free way, and the caller makes sure through additional synchronization that it only uses snapshots that actually happened when the snapshotted data was in a quiescent state. (I have the same issue in GCC's libitm (transactional memory runtime library), and opted to use plain loads there, with a clear comment that this was outside the standard.) I'm not yet sure which interface we should offer for such snapshotting. There's a related proposal being discussed in the C++ committee (Atomic View), but that's about atomic accesses. We just need ones that don't create data races but don't need to be atomic; essentially, we just have to tell the compiler to not assume data-race-freedom for the snapshot. Having variants for memcpy, memcmp, etc. might be one way to do it. A first step in your code could be to at least document the cases where you rely on that. (IMHO, it would also be useful as documentation for your existing concurrent code.)
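One possible shape for such a snapshotting primitive, sketched with per-byte relaxed atomic loads; snapshot_bytes and compare_snapshots are invented names for illustration, not a proposed glibc interface. The idea is to take race-tolerant private copies first and then run the ordinary, optimized memcmp on those copies, which is data-race-free:

  #include <stddef.h>
  #include <string.h>

  /* Copy n bytes from a possibly concurrently-modified buffer into a
   * private one, one relaxed atomic load per byte. */
  static void snapshot_bytes(void *dst, const void *src, size_t n)
  {
      unsigned char *d = dst;
      const unsigned char *s = src;

      for (size_t i = 0; i < n; i++)
          d[i] = __atomic_load_n(&s[i], __ATOMIC_RELAXED);
  }

  /* Compare two shared buffers via private snapshots; inconsistencies in
   * the snapshots are acceptable because the caller re-checks once the
   * writers are known to be quiescent, as in the loops discussed above. */
  static int compare_snapshots(const void *a, const void *b,
                               unsigned char *tmpa, unsigned char *tmpb,
                               size_t n)
  {
      snapshot_bytes(tmpa, a, n);
      snapshot_bytes(tmpb, b, n);
      return memcmp(tmpa, tmpb, n);
  }

The snapshot step is the only part that touches racy memory, so only it needs to be specified (or documented) as tolerating concurrent modification; everything downstream can keep using the standard library at full speed.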
diff --git a/equalhappybug.c b/equalhappybug.c
index 6c5fac5..ed51fd7 100644
--- a/equalhappybug.c
+++ b/equalhappybug.c
@@ -18,7 +18,7 @@
 static long nr_cpus, page_size;
 #define NON_ZERO_OFFSET 16U
 
-static char *page, *zeropage;
+static volatile char *page, *zeropage;
 volatile int finished;
 
 static int my_bcmp(char *str1, char *str2, unsigned long n)
@@ -47,6 +47,7 @@ static void *locking_thread(void *arg)
 #else
         unsigned long loops = 0;
         //while (!bcmp(page, zeropage, page_size))
+        asm volatile("" : : : "memory");
 	while (!memcmp(page, zeropage, page_size)) {
         	loops += 1;
         	asm volatile("" : : : "memory");
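For reference, the hunks above only make page/zeropage volatile and add a compiler barrier before the memcmp loop; the my_bcmp visible in the first hunk's context is the hand-rolled byte-wise comparison the test case can fall back to instead of libc's memcmp. Its body is not part of the diff; a sketch of such a helper, assuming volatile byte accesses (my_bcmp_sketch is an invented name, not the actual test's code):

  /* Sketch only. Volatile accesses force one byte-sized load per element,
   * so the comparison cannot be widened or hoisted by the compiler; it may
   * still observe a mix of old and new data if the buffers change during
   * the call. */
  static int my_bcmp_sketch(volatile const char *s1, volatile const char *s2,
                            unsigned long n)
  {
      for (unsigned long i = 0; i < n; i++)
          if (s1[i] != s2[i])
              return 1;
      return 0;
  }

Unlike the vectorized memcmp, a loop like this reports "equal" only if every byte pair compared equal at the moment it was read, which is why the hand-rolled path does not reproduce the EqualHappy result when a byte that always differs is present.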