Message ID | 1440439895-11812-1-git-send-email-tuliom@linux.vnet.ibm.com |
---|---|
State | Superseded |
Headers |
From: "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com>
To: carlos@redhat.com, adhemerval.zanella@linaro.org
Cc: triegel@redhat.com, libc-alpha@sourceware.org, munroesj@linux.vnet.ibm.com
Subject: [PATCHv3] PowerPC: Fix a race condition when eliding a lock
Date: Mon, 24 Aug 2015 15:11:35 -0300
Message-Id: <1440439895-11812-1-git-send-email-tuliom@linux.vnet.ibm.com>
In-Reply-To: <55D742D3.9050600@redhat.com>
References: <55D742D3.9050600@redhat.com> |
Commit Message
Tulio Magno Quites Machado Filho
Aug. 24, 2015, 6:11 p.m. UTC
Changes since v2:
- Moved part of the source code comments to the commit message.
- Added O'Donell's suggestion to the source code comments.

Changes since v1:
- Improved commit message.
- Added new comments in the source code alerting to the concurrency details
  intrinsic to this code.
- Removed compiler barriers from this patch, which will be treated in another
  patch and will be synchronized with the GCC implementation.

8<----------

The previous code evaluated the preprocessor token is_lock_free into
a variable before starting a transaction.  This can cause an error if
another thread acquires the lock (without using a transaction) between
the evaluation of the token and the beginning of the transaction.

This bug can be triggered with the following order of events:
1. The lock accessed by is_lock_free is free.
2. Thread T1 evaluates is_lock_free and stores into register R1 that the
   lock is free.
3. Thread T2 acquires the same lock used in is_lock_free.
4. T1 begins the transaction, creating a memory barrier where is_lock_free
   is false, but R1 is true.
5. T1 reads R1 and doesn't abort the transaction.
6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides
   to unlock a lock acquired by T2, leading to undefined behavior.

This patch delays the evaluation of is_lock_free to inside a transaction
by moving this part of the code to the macro ELIDE_LOCK.

2015-08-24  Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>

	[BZ #18743]
	* sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this
	code to...
	(ELIDE_LOCK): ...here.
	(__get_new_count): New function with part of the code from
	__elide_lock that updates the value of adapt_count after a
	transaction abort.
	(__elided_trylock): Moved this code to...
	(ELIDE_TRYLOCK): ...here.
---
 NEWS                         |   3 +-
 sysdeps/powerpc/nptl/elide.h | 113 +++++++++++++++++++++++--------------------
 2 files changed, 63 insertions(+), 53 deletions(-)
Comments
On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
> Changes since v2:
> - Moved part of the source code comments to the commit message.
> - Added O'Donell's suggestion to the source code comments.
>
> Changes since v1:
> - Improved commit message.
> - Added new comments in the source code alerting to the concurrency details
>   intrinsic to this code.
> - Removed compiler barriers from this patch, which will be treated in another
>   patch and will be synchronized with the GCC implementation.

> +	if (__builtin_tbegin (0))					\
> +	  {								\
> +	    if (is_lock_free)						\
> +	      {								\
> +		ret = 1;						\
> +		break;							\
> +	      }								\
> +	    __builtin_tabort (_ABORT_LOCK_BUSY);			\
> +	  }								\
> +	else								\
> +	  if (!__get_new_count(&adapt_count))				\
> +	    break;							\

Torvald still suggests an atomic_load_relaxed here:
https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html

Is there any technical objection to that request?

It does highlight, as he notes, the transactional and non-transactional accesses for that evaluated value.

Cheers,
Carlos.
"Carlos O'Donell" <carlos@redhat.com> writes:
> On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
>> Changes since v2:
>> - Moved part of the source code comments to the commit message.
>> - Added O'Donell's suggestion to the source code comments.
>>
>> Changes since v1:
>> - Improved commit message.
>> - Added new comments in the source code alerting to the concurrency details
>>   intrinsic to this code.
>> - Removed compiler barriers from this patch, which will be treated in another
>>   patch and will be synchronized with the GCC implementation.
>
>> +	if (__builtin_tbegin (0))					\
>> +	  {								\
>> +	    if (is_lock_free)						\
>> +	      {								\
>> +		ret = 1;						\
>> +		break;							\
>> +	      }								\
>> +	    __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	  }								\
>> +	else								\
>> +	  if (!__get_new_count(&adapt_count))				\
>> +	    break;							\
>
> Torvald still suggests an atomic_load_relaxed here:
> https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
>
> Is there any technical objection to that request?
>
> It does highlight, as he notes, the transactional and non-transactional
> accesses for that evaluated value.

I'm not objecting, but I don't see any value in adding this, as I explained at the end of this message [1].

AFAIU, there are 2 reasons for doing this:
1. Make the code easier to understand.
2. Help moving to proper atomic types in the future.

IMHO, reason #1 is very personal, and I could change my opinion if I saw an example where this kind of change helped make the code easier to understand.

Reason #2 is a valid point, but it is unrelated to this patch, i.e. I wouldn't backport the atomic access to glibc 2.21 and 2.22 if the only reason for it is #2.  So it would be better as a separate patch.

[1] https://sourceware.org/ml/libc-alpha/2015-08/msg00885.html
On 24-08-2015 15:58, Carlos O'Donell wrote:
> On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote:
>> Changes since v2:
>> - Moved part of the source code comments to the commit message.
>> - Added O'Donell's suggestion to the source code comments.
>>
>> Changes since v1:
>> - Improved commit message.
>> - Added new comments in the source code alerting to the concurrency details
>>   intrinsic to this code.
>> - Removed compiler barriers from this patch, which will be treated in another
>>   patch and will be synchronized with the GCC implementation.
>
>> +	if (__builtin_tbegin (0))					\
>> +	  {								\
>> +	    if (is_lock_free)						\
>> +	      {								\
>> +		ret = 1;						\
>> +		break;							\
>> +	      }								\
>> +	    __builtin_tabort (_ABORT_LOCK_BUSY);			\
>> +	  }								\
>> +	else								\
>> +	  if (!__get_new_count(&adapt_count))				\
>> +	    break;							\
>
> Torvald still suggests an atomic_load_relaxed here:
> https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html
>
> Is there any technical objection to that request?
>
> It does highlight, as he notes, the transactional and non-transactional
> accesses for that evaluated value.

No 'technical' objections, only that the transaction idea is exactly to avoid the use of atomic accesses altogether.  For relaxed accesses it won't change anything, but if the idea is to highlight every access within a transaction with atomics, it might be the case that the algorithm will require another semantic (like an acquire or release) and it will require dropping the atomic usage to avoid generating subpar code.

>
> Cheers,
> Carlos.
>
On 08/24/2015 03:28 PM, Adhemerval Zanella wrote: > > > On 24-08-2015 15:58, Carlos O'Donell wrote: >> On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote: >>> Changes since v2: >>> - Moved part of the source code comments to the commit message. >>> - Added O'Donnel's suggestion to the source code comments. >>> >>> Changes since v1: >>> - Improved commit message. >>> - Added new comments in the source code alerting to the concurrency details >>> intrinsic to this code. >>> - Removed compiler barriers from this patch, which will be treated in another >>> patch and will be synchronized with the GCC implementation. >> >>> + if (__builtin_tbegin (0)) \ >>> + { \ >>> + if (is_lock_free) \ >>> + { \ >>> + ret = 1; \ >>> + break; \ >>> + } \ >>> + __builtin_tabort (_ABORT_LOCK_BUSY); \ >>> + } \ >>> + else \ >>> + if (!__get_new_count(&adapt_count)) \ >>> + break; \ >> >> Toravld still suggests an atomic_load_relaxed here: >> https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html >> >> Is there any technical objection to that request? >> >> It does highlight, as he notes, the transactional and non-transactional >> accesses for that evaluated value. > > No 'technical' objections, only that the transaction idea is exactly to > avoid the use of atomic accesses altogether. For relaxed accesses it > won't change anything, but if the idea is to highlight every access > within transaction with atomics, it might be case where the algorithm > will require another semantic (like a acquire or release) and it will > require to drop the atomic usage to avoid generate subpar code. I'll respond to Torvald and ask him to explain in more detail. Cheers, Carlos.
On Mon, 2015-08-24 at 16:28 -0300, Adhemerval Zanella wrote: > > On 24-08-2015 15:58, Carlos O'Donell wrote: > > On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote: > >> Changes since v2: > >> - Moved part of the source code comments to the commit message. > >> - Added O'Donnel's suggestion to the source code comments. > >> > >> Changes since v1: > >> - Improved commit message. > >> - Added new comments in the source code alerting to the concurrency details > >> intrinsic to this code. > >> - Removed compiler barriers from this patch, which will be treated in another > >> patch and will be synchronized with the GCC implementation. > > > >> + if (__builtin_tbegin (0)) \ > >> + { \ > >> + if (is_lock_free) \ > >> + { \ > >> + ret = 1; \ > >> + break; \ > >> + } \ > >> + __builtin_tabort (_ABORT_LOCK_BUSY); \ > >> + } \ > >> + else \ > >> + if (!__get_new_count(&adapt_count)) \ > >> + break; \ > > > > Toravld still suggests an atomic_load_relaxed here: > > https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html > > > > Is there any technical objection to that request? > > > > It does highlight, as he notes, the transactional and non-transactional > > accesses for that evaluated value. > > No 'technical' objections, only that the transaction idea is exactly to > avoid the use of atomic accesses altogether. For relaxed accesses it > won't change anything, Yes. > but if the idea is to highlight every access > within transaction with atomics, No. The rule I'd like to follow is that if we need atomic accesses to a memory object somewhere in the code base, that we then consistently use atomic accesses for all accesses to this object (with the exception of initialization code, where things like memset can be simpler). Thus, if you use just transactions to prevent data races for an object, no atomics are required. This holds for the vast majority of objects accessed in transactions (except lock implementations and such). 
IMO, not using atomics for accessing is_lock_free in the txn creates an exception in our rules, and given that there's no problem to use an MO-relaxed access there, it's just easier to avoid creating an exception.  And it makes it possible to use an atomic type for it in the future.

> it might be the case that the algorithm
> will require another semantic (like an acquire or release) and it will
> require dropping the atomic usage to avoid generating subpar code.

I don't quite understand what you mean, sorry.
On Mon, 2015-08-24 at 16:19 -0300, Tulio Magno Quites Machado Filho wrote: > "Carlos O'Donell" <carlos@redhat.com> writes: > > > On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote: > >> Changes since v2: > >> - Moved part of the source code comments to the commit message. > >> - Added O'Donnel's suggestion to the source code comments. > >> > >> Changes since v1: > >> - Improved commit message. > >> - Added new comments in the source code alerting to the concurrency details > >> intrinsic to this code. > >> - Removed compiler barriers from this patch, which will be treated in another > >> patch and will be synchronized with the GCC implementation. > > > >> + if (__builtin_tbegin (0)) \ > >> + { \ > >> + if (is_lock_free) \ > >> + { \ > >> + ret = 1; \ > >> + break; \ > >> + } \ > >> + __builtin_tabort (_ABORT_LOCK_BUSY); \ > >> + } \ > >> + else \ > >> + if (!__get_new_count(&adapt_count)) \ > >> + break; \ > > > > Toravld still suggests an atomic_load_relaxed here: > > https://www.sourceware.org/ml/libc-alpha/2015-08/msg00893.html > > > > Is there any technical objection to that request? > > > > It does highlight, as he notes, the transactional and non-transactional > > accesses for that evaluated value. > > I'm not objecting, but I don't see any value in adding this as I explained at > the end of this message [1]. > > AFAIU, there are 2 reasons for doing this: > 1. Make the code easier to understand. > 2. Help moving to proper atomic types in the future. > > IMHO, reason #1 is very personal and I could change my opinion if I had seen an > example where this kind of changes helped to make the code easier to > understand. The synchronization around is_lock_free is mixed txnal/nontxnal synchronization. Given that, it's not as simple as purely txnal/txnal synchronization. 
As I mentioned in my reply to Carlos, we do have the general rule that we don't use atomic types right now but assume that every object that needs to be atomically accessed at some place is atomically accessed everywhere (ie, what a C11/C++11 atomic type would provide). Making an exception for mixed txnal/nontxnal synchronization is creating more complexity, just because there is another exception. > Reason #2 is a valid point, but is unrelated to this patch, i.e. I wouldn't > backport the atomic access to glibc 2.21 and 2.22 if the only reason for it > is #2. So, it would be better as a separate patch. I wouldn't object if you split this into two parts, and only backport those patches that are necessary for correctness.
On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote: > Changes since v2: > - Moved part of the source code comments to the commit message. > - Added O'Donnel's suggestion to the source code comments. > > Changes since v1: > - Improved commit message. > - Added new comments in the source code alerting to the concurrency details > intrinsic to this code. > - Removed compiler barriers from this patch, which will be treated in another > patch and will be synchronized with the GCC implementation. > > 8<---------- > > The previous code used to evaluate the preprocessor token is_lock_free to > a variable before starting a transaction. This behavior can cause an > error if another thread got the lock (without using a transaction) > between the evaluation of the token and the beginning of the transaction. I find the use of the preprocessor to capture an evaluation really ugly. And it is error-prone, as the bug fixed by this patch shows. I would appreciate a follow-up patch that replaces is_lock_free with either a function that is called (relying on the compiler to inline it if the function pointer is constant, which it should be), or an address to a value that is compared to some caller-provided constant (which would limit flexibility for the condition somewhat, but is cleaner). > This bug can be triggered with the following order of events: > 1. The lock accessed by is_lock_free is free. > 2. Thread T1 evaluates is_lock_free and stores into register R1 that the > lock is free. > 3. Thread T2 acquires the same lock used in is_lock_free. > 4. T1 begins the transaction, creating a memory barrier where is_lock_free > is false, but R1 is true. > 5. T1 reads R1 and doesn't abort the transaction. > 6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides > to unlock a lock acquired by T2, leading to undefined behavior. 
> > This patch delays the evaluation of is_lock_free to inside a transaction > by moving this part of the code to the macro ELIDE_LOCK. > > 2015-08-24 Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> > > [BZ #18743] > * sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this > code to... > (ELIDE_LOCK): ...here. > (__get_new_count): New function with part of the code from > __elide_lock that updates the value of adapt_count after a > transaction abort. > (__elided_trylock): Moved this code to... > (ELIDE_TRYLOCK): ...here. > --- > NEWS | 3 +- > sysdeps/powerpc/nptl/elide.h | 113 +++++++++++++++++++++++-------------------- > 2 files changed, 63 insertions(+), 53 deletions(-) > > diff --git a/NEWS b/NEWS > index 6b36c08..76df3fd 100644 > --- a/NEWS > +++ b/NEWS > @@ -11,7 +11,8 @@ Version 2.23 > > 14341, 16517, 16519, 16520, 16734, 16973, 17787, 17905, 18084, 18086, > 18265, 18370, 18421, 18480, 18525, 18618, 18647, 18661, 18674, 18681, > - 18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824. > + 18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824, > + 18743. > > * The obsolete header <regexp.h> has been removed. Programs that require > this header must be updated to use <regex.h> instead. > diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h > index 389f5a5..1cc9890 100644 > --- a/sysdeps/powerpc/nptl/elide.h > +++ b/sysdeps/powerpc/nptl/elide.h > @@ -23,67 +23,76 @@ > # include <htm.h> > # include <elision-conf.h> > > -/* Returns true if the lock defined by is_lock_free as elided. > - ADAPT_COUNT is a pointer to per-lock state variable. */ > - > +/* Get the new value of adapt_count according to the elision > + configurations. Returns true if the system should retry again or false > + otherwise. 
*/ > static inline bool > -__elide_lock (uint8_t *adapt_count, int is_lock_free) > +__get_new_count (uint8_t *adapt_count) > { > - if (*adapt_count > 0) > + /* A persistent failure indicates that a retry will probably > + result in another failure. Use normal locking now and > + for the next couple of calls. */ > + if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) > { > - (*adapt_count)--; > + if (__elision_aconf.skip_lock_internal_abort > 0) > + *adapt_count = __elision_aconf.skip_lock_internal_abort; > return false; > } > - > - for (int i = __elision_aconf.try_tbegin; i > 0; i--) > - { > - if (__builtin_tbegin (0)) > - { > - if (is_lock_free) > - return true; > - /* Lock was busy. */ > - __builtin_tabort (_ABORT_LOCK_BUSY); > - } > - else > - { > - /* A persistent failure indicates that a retry will probably > - result in another failure. Use normal locking now and > - for the next couple of calls. */ > - if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) > - { > - if (__elision_aconf.skip_lock_internal_abort > 0) > - *adapt_count = __elision_aconf.skip_lock_internal_abort; > - break; > - } > - /* Same logic as above, but for a number of temporary failures in a > - a row. */ > - else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 > - && __elision_aconf.try_tbegin > 0) > - *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; > - } > - } > - > - return false; > + /* Same logic as above, but for a number of temporary failures in a > + a row. 
*/ > + else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 > + && __elision_aconf.try_tbegin > 0) > + *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; > + return true; > } > > -# define ELIDE_LOCK(adapt_count, is_lock_free) \ > - __elide_lock (&(adapt_count), is_lock_free) > - > - > -static inline bool > -__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write) > -{ > - if (__elision_aconf.try_tbegin > 0) > - { > - if (write) > - __builtin_tabort (_ABORT_NESTED_TRYLOCK); > - return __elide_lock (adapt_count, is_lock_free); > - } > - return false; > -} > +/* CONCURRENCY NOTES: > + > + The value of is_lock_free is read and possibly written to by multiple As you write above, is_lock_free is not a value, it is an expression. The expression will read from a memory location, that is then modified by other threads. Please clarify that so this is clear to readers (this misunderstanding seems to be related to the bug you're fixing...). > + threads, either during lock acquisition, or elision. In order to avoid > + this evaluation from becoming a data race the access of is_lock_free It could be a data race because you're not using atomics there, but that's not the whole point. (We use the term "data race" specifically to refer to the C11/C++11 memory model concept of the same name.) You want to ensure the lock elision synchronization scheme, and thus are moving it inside the txn. > + is placed *inside* the transaction. Within the transaction is_lock_free > + can be accessed with relaxed ordering. We are assured by the transaction Please use "memory order" to be specific, or abbreviate with MO as mentioned on the Concurrency wiki page. > + that the value of is_lock_free is consistent with the state of the lock, > + otherwise the hardware will abort the transaction. */ That's not the really why this lock elisions synchronization scheme works :) (At least you're not giving sufficient reason why it is.) 
Txn/txn synchronization is relatively simple, basically txns that access the same memory locations are atomic and ordered. Txnal/nontxnal synchronization is more complex. I don't remember how IBM has specified txn semantics in detail, but the way it's basically supposed to work is: * If the txn reads-from data written in a critical section, then it will also observe the lock acquisition of this critical section or a later write to the same lock memory locations by this critical section (e.g., a lock release). * If the txn reads-from a lock release, it will observe all prior writes in the respective critical section. * If a txn reads-from a memory location, it picks a write to read-from. It won't change this choice if it reads-from the same location again. If it can't read-from the same write, it aborts. (Here, "observe" means that it reads-from those locations, then the respective writes are visible side effects.) The result of this (rough) set of rules is that in all executions in which it doesn't abort, it will not read intermediate state produced by other critical sections, but is ordered wrt to them and their writes to application data. It's hard to describe that in our current terminology because txns aren't covered in the C11 memory model. Also note that this implementation would not be correct: if (tbegin(0)) { <application code> if (!is_lock_free) tabort(); tend(); } This is because the application code could call tend on some path, for example when reading an intermediate state of a concurrent non-elided critical section. I'm mentioning this because this shows that is_lock_free must be "evaluated" early, and that the compiler must not reorder it to the end of the critical section. We could use atomic MO on the load in is_lock_free for that, but then we get the unnecessary HW acquire fence. 
The reason why it should work without that is that the compiler shouldn't execute side effects that the abstract machine wouldn't do if is_lock_free evaluated false, and application code that may call tend() is such a side effect. Thus, you can't in practice avoid the is_lock_free check, and once you do it the rules above take care of correctness. Intel TSX and IBM's HTMs (IIRC) don't let any side effects, such as page faults, escape from an aborted txn. In contrast, AMD's ASF proposal did let page faults escape, in which case the acquire MO would be required. I hope this clarifies why I'm saying that mixed txnal/nontxnal synchronization isn't all that simple. > +/* Returns 0 if the lock defined by is_lock_free was elided. > + ADAPT_COUNT is a per-lock state variable. */ > +# define ELIDE_LOCK(adapt_count, is_lock_free) \ > + ({ \ > + int ret = 0; \ > + if (adapt_count > 0) \ > + (adapt_count)--; \ > + else \ > + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ > + { \ > + if (__builtin_tbegin (0)) \ > + { \ > + if (is_lock_free) \ This evaluation should use relaxed MO atomic accesses, IMO. But that can be changed by a follow-up patch. > + { \ > + ret = 1; \ > + break; \ > + } \ > + __builtin_tabort (_ABORT_LOCK_BUSY); \ > + } \ > + else \ > + if (!__get_new_count(&adapt_count)) \ > + break; \ > + } \ > + ret; \ > + }) > > # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) \ > - __elide_trylock (&(adapt_count), is_lock_free, write) > + ({ \ > + int ret = 0; \ > + if (__elision_aconf.try_tbegin > 0) \ > + { \ I'd prefer a comment here why you abort if write is true. Or a reference to another source file / function where this is explained. Can be in a follow-up patch. > + if (write) \ > + __builtin_tabort (_ABORT_NESTED_TRYLOCK); \ > + ret = ELIDE_LOCK (adapt_count, is_lock_free); \ > + } \ > + ret; \ > + }) > > > static inline bool
On 25-08-2015 08:34, Torvald Riegel wrote:
>
> No. The rule I'd like to follow is that if we need atomic accesses to a
> memory object somewhere in the code base, that we then consistently use
> atomic accesses for all accesses to this object (with the exception of
> initialization code, where things like memset can be simpler).
>
> Thus, if you use just transactions to prevent data races for an object,
> no atomics are required.  This holds for the vast majority of objects
> accessed in transactions (except lock implementations and such).
>
> IMO, not using atomics for accessing is_lock_free in the txn creates an
> exception in our rules, and given that there's no problem to use an
> MO-relaxed access there, it's just easier to avoid creating an
> exception.  And it makes it possible to use an atomic type for it in the
> future.

The problem I see is that all hardware transactional memories either do not explicitly guarantee forward progress for transactional code or require a fallback path for performance reasons, in which case non-transactional code is required anyway.

>
>> it might be the case that the algorithm
>> will require another semantic (like an acquire or release) and it will
>> require dropping the atomic usage to avoid generating subpar code.
>
> I don't quite understand what you mean, sorry.
>

What I am trying to avoid is requiring an atomic access with a non-relaxed MO inside a transaction just to follow the atomic-access rule.  Anyway, I do see the value of making atomic accesses through the atomic API inside transactions, and we can handle future cases individually.
On Tue, 2015-08-25 at 11:41 -0300, Adhemerval Zanella wrote: > > On 25-08-2015 08:34, Torvald Riegel wrote: > > > > No. The rule I'd like to follow is that if we need atomic accesses to a > > memory object somewhere in the code base, that we then consistently use > > atomic accesses for all accesses to this object (with the exception of > > initialization code, where things like memset can be simpler). > > > > Thus, if you use just transactions to prevent data races for an object, > > no atomics are required. This holds for the vast majority of objects > > accessed in transactions (except lock implementations and such). > > > > IMO, not using atomics for accessing is_lock_free in the txn creates an > > exception in our rules, and given that there's no problem to used an > > MO-relaxed access there, it's just easier to avoid creating an > > exception. And it makes it possible to use an atomic type for it in the > > future. > > The problem I see it all hardware transactional either do not explicit > guarantee progress forward for transactional code or require a fallback > code for performance reason, and then non-transactional code would be > required. Agreed current HTMs are almost always best-effort, but I don't think a relaxed-MO atomic should abort a txn. > > > > >> it might be case where the algorithm > >> will require another semantic (like a acquire or release) and it will > >> require to drop the atomic usage to avoid generate subpar code. > > > > I don't quite understand what you mean, sorry. > > > > And what I am trying to avoid is to require some atomic access in a non > MO-relaxed way in a transactional just to follow the atomic access rule. Agreed. > Anyway, I do see the value of making atomic access using atomic api > insides transactions and we can handle futures cases in specific cases. Good.
Torvald Riegel <triegel@redhat.com> writes: > On Mon, 2015-08-24 at 16:19 -0300, Tulio Magno Quites Machado Filho > wrote: >> "Carlos O'Donell" <carlos@redhat.com> writes: >> >> > On 08/24/2015 02:11 PM, Tulio Magno Quites Machado Filho wrote: >> >> Changes since v2: >> >> - Moved part of the source code comments to the commit message. >> >> - Added O'Donnel's suggestion to the source code comments. >> >> >> >> Changes since v1: >> >> - Improved commit message. >> >> - Added new comments in the source code alerting to the concurrency details >> >> intrinsic to this code. >> >> - Removed compiler barriers from this patch, which will be treated in another >> >> patch and will be synchronized with the GCC implementation. >> > >> Reason #2 is a valid point, but is unrelated to this patch, i.e. I wouldn't >> backport the atomic access to glibc 2.21 and 2.22 if the only reason for it >> is #2. So, it would be better as a separate patch. > > I wouldn't object if you split this into two parts, and only backport > those patches that are necessary for correctness. Great! So, let's split this discussion in 2: one to review the patch for bug #18743 and another which is an RFC to create a rule to consistently use atomic access to a memory object that need at least one atomic access. Both discussions are completely unrelated, except that the former started the latter.
The discussion this patch launched has been quite informative. Circling back to the patch; it should fix the bug in question while we determine how to improve the common code. My only trivial nit is the x86 version might benefit from a similar comment regarding concurrency, otherwise LGTM. On 08/24/2015 01:11 PM, Tulio Magno Quites Machado Filho wrote: > Changes since v2: > - Moved part of the source code comments to the commit message. > - Added O'Donnel's suggestion to the source code comments. > > Changes since v1: > - Improved commit message. > - Added new comments in the source code alerting to the concurrency details > intrinsic to this code. > - Removed compiler barriers from this patch, which will be treated in another > patch and will be synchronized with the GCC implementation. > > 8<---------- > > The previous code used to evaluate the preprocessor token is_lock_free to > a variable before starting a transaction. This behavior can cause an > error if another thread got the lock (without using a transaction) > between the evaluation of the token and the beginning of the transaction. > > This bug can be triggered with the following order of events: > 1. The lock accessed by is_lock_free is free. > 2. Thread T1 evaluates is_lock_free and stores into register R1 that the > lock is free. > 3. Thread T2 acquires the same lock used in is_lock_free. > 4. T1 begins the transaction, creating a memory barrier where is_lock_free > is false, but R1 is true. > 5. T1 reads R1 and doesn't abort the transaction. > 6. T1 calls ELIDE_UNLOCK, which reads false from is_lock_free and decides > to unlock a lock acquired by T2, leading to undefined behavior. > > This patch delays the evaluation of is_lock_free to inside a transaction > by moving this part of the code to the macro ELIDE_LOCK. > > 2015-08-24 Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> > > [BZ #18743] > * sysdeps/powerpc/nptl/elide.h (__elide_lock): Move most of this > code to... 
> (ELIDE_LOCK): ...here. > (__get_new_count): New function with part of the code from > __elide_lock that updates the value of adapt_count after a > transaction abort. > (__elided_trylock): Moved this code to... > (ELIDE_TRYLOCK): ...here. > --- > NEWS | 3 +- > sysdeps/powerpc/nptl/elide.h | 113 +++++++++++++++++++++++-------------------- > 2 files changed, 63 insertions(+), 53 deletions(-) > > diff --git a/NEWS b/NEWS > index 6b36c08..76df3fd 100644 > --- a/NEWS > +++ b/NEWS > @@ -11,7 +11,8 @@ Version 2.23 > > 14341, 16517, 16519, 16520, 16734, 16973, 17787, 17905, 18084, 18086, > 18265, 18370, 18421, 18480, 18525, 18618, 18647, 18661, 18674, 18681, > - 18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824. > + 18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824, > + 18743. > > * The obsolete header <regexp.h> has been removed. Programs that require > this header must be updated to use <regex.h> instead. > diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h > index 389f5a5..1cc9890 100644 > --- a/sysdeps/powerpc/nptl/elide.h > +++ b/sysdeps/powerpc/nptl/elide.h > @@ -23,67 +23,76 @@ > # include <htm.h> > # include <elision-conf.h> > > -/* Returns true if the lock defined by is_lock_free as elided. > - ADAPT_COUNT is a pointer to per-lock state variable. */ > - > +/* Get the new value of adapt_count according to the elision > + configurations. Returns true if the system should retry again or false > + otherwise. */ > static inline bool > -__elide_lock (uint8_t *adapt_count, int is_lock_free) > +__get_new_count (uint8_t *adapt_count) > { > - if (*adapt_count > 0) > + /* A persistent failure indicates that a retry will probably > + result in another failure. Use normal locking now and > + for the next couple of calls. 
*/ > + if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) > { > - (*adapt_count)--; > + if (__elision_aconf.skip_lock_internal_abort > 0) > + *adapt_count = __elision_aconf.skip_lock_internal_abort; > return false; > } > - > - for (int i = __elision_aconf.try_tbegin; i > 0; i--) > - { > - if (__builtin_tbegin (0)) > - { > - if (is_lock_free) > - return true; > - /* Lock was busy. */ > - __builtin_tabort (_ABORT_LOCK_BUSY); > - } > - else > - { > - /* A persistent failure indicates that a retry will probably > - result in another failure. Use normal locking now and > - for the next couple of calls. */ > - if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) > - { > - if (__elision_aconf.skip_lock_internal_abort > 0) > - *adapt_count = __elision_aconf.skip_lock_internal_abort; > - break; > - } > - /* Same logic as above, but for a number of temporary failures in a > - a row. */ > - else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 > - && __elision_aconf.try_tbegin > 0) > - *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; > - } > - } > - > - return false; > + /* Same logic as above, but for a number of temporary failures in a > + a row. */ > + else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 > + && __elision_aconf.try_tbegin > 0) > + *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; > + return true; > } > > -# define ELIDE_LOCK(adapt_count, is_lock_free) \ > - __elide_lock (&(adapt_count), is_lock_free) > - > - > -static inline bool > -__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write) > -{ > - if (__elision_aconf.try_tbegin > 0) > - { > - if (write) > - __builtin_tabort (_ABORT_NESTED_TRYLOCK); > - return __elide_lock (adapt_count, is_lock_free); > - } > - return false; > -} > +/* CONCURRENCY NOTES: > + > + The value of is_lock_free is read and possibly written to by multiple > + threads, either during lock acquisition, or elision. 
In order to avoid > + this evaluation from becoming a data race the access of is_lock_free > + is placed *inside* the transaction. Within the transaction is_lock_free > + can be accessed with relaxed ordering. We are assured by the transaction > + that the value of is_lock_free is consistent with the state of the lock, > + otherwise the hardware will abort the transaction. */ > + > +/* Returns 0 if the lock defined by is_lock_free was elided. > + ADAPT_COUNT is a per-lock state variable. */ > +# define ELIDE_LOCK(adapt_count, is_lock_free) \ > + ({ \ > + int ret = 0; \ > + if (adapt_count > 0) \ > + (adapt_count)--; \ > + else \ > + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ > + { \ > + if (__builtin_tbegin (0)) \ > + { \ > + if (is_lock_free) \ > + { \ > + ret = 1; \ > + break; \ > + } \ > + __builtin_tabort (_ABORT_LOCK_BUSY); \ > + } \ > + else \ > + if (!__get_new_count(&adapt_count)) \ > + break; \ > + } \ > + ret; \ > + }) > > # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) \ > - __elide_trylock (&(adapt_count), is_lock_free, write) > + ({ \ > + int ret = 0; \ > + if (__elision_aconf.try_tbegin > 0) \ > + { \ > + if (write) \ > + __builtin_tabort (_ABORT_NESTED_TRYLOCK); \ > + ret = ELIDE_LOCK (adapt_count, is_lock_free); \ > + } \ > + ret; \ > + }) > > > static inline bool >
Torvald Riegel <triegel@redhat.com> writes: > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho > wrote: >> Changes since v2: >> - Moved part of the source code comments to the commit message. >> - Added O'Donnel's suggestion to the source code comments. >> >> Changes since v1: >> - Improved commit message. >> - Added new comments in the source code alerting to the concurrency details >> intrinsic to this code. >> - Removed compiler barriers from this patch, which will be treated in another >> patch and will be synchronized with the GCC implementation. >> >> 8<---------- >> >> The previous code used to evaluate the preprocessor token is_lock_free to >> a variable before starting a transaction. This behavior can cause an >> error if another thread got the lock (without using a transaction) >> between the evaluation of the token and the beginning of the transaction. > > I find the use of the preprocessor to capture an evaluation really ugly. > And it is error-prone, as the bug fixed by this patch shows. > I would appreciate a follow-up patch that replaces is_lock_free with either a > function that is called (relying on the compiler to inline it if the > function pointer is constant, which it should be), or an address to a > value that is compared to some caller-provided constant (which would > limit flexibility for the condition somewhat, but is cleaner). I completely agree with you >> +/* CONCURRENCY NOTES: >> + >> + The value of is_lock_free is read and possibly written to by multiple > > As you write above, is_lock_free is not a value, it is an expression. > The expression will read from a memory location, that is then modified > by other threads. Please clarify that so this is clear to readers (this > misunderstanding seems to be related to the bug you're fixing...). Ack. I'll change it to: "The macro expression is_lock_free is..." > >> + threads, either during lock acquisition, or elision. 
In order to avoid >> + this evaluation from becoming a data race the access of is_lock_free > > It could be a data race because you're not using atomics there, but > that's not the whole point. (We use the term "data race" specifically > to refer to the C11/C++11 memory model concept of the same name.) > You want to ensure the lock elision synchronization scheme, and thus are > moving it inside the txn. Are you complaining about the usage of the term "data race"? If so, what about "race condition"? >> + is placed *inside* the transaction. Within the transaction is_lock_free >> + can be accessed with relaxed ordering. We are assured by the transaction > > Please use "memory order" to be specific, or abbreviate with MO as > mentioned on the Concurrency wiki page. Ack. >> + that the value of is_lock_free is consistent with the state of the lock, >> + otherwise the hardware will abort the transaction. */ > > That's not the really why this lock elisions synchronization scheme > works :) (At least you're not giving sufficient reason why it is.) I do agree with you that comment needs some adjustments but its intention is only to describe the concurrency details of the next macro. It has no intention of describing how lock elision or this macro work. It was added per request [1] [2]. What do you think about the following? Within the transaction we are assured that all memory accesses are atomic and is_lock_free can be evaluated with relaxed memory order. That way, the value of is_lock_free is consistent with the state of the lock until the end of the transaction. [1] https://sourceware.org/ml/libc-alpha/2015-07/msg00987.html [2] https://sourceware.org/ml/libc-alpha/2015-08/msg00870.html >> +/* Returns 0 if the lock defined by is_lock_free was elided. >> + ADAPT_COUNT is a per-lock state variable. 
*/ >> +# define ELIDE_LOCK(adapt_count, is_lock_free) \ >> + ({ \ >> + int ret = 0; \ >> + if (adapt_count > 0) \ >> + (adapt_count)--; \ >> + else \ >> + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ >> + { \ >> + if (__builtin_tbegin (0)) \ >> + { \ >> + if (is_lock_free) \ > > This evaluation should use relaxed MO atomic accesses, IMO. But that > can be changed by a follow-up patch. Ack. >> + { \ >> + ret = 1; \ >> + break; \ >> + } \ >> + __builtin_tabort (_ABORT_LOCK_BUSY); \ >> + } \ >> + else \ >> + if (!__get_new_count(&adapt_count)) \ >> + break; \ >> + } \ >> + ret; \ >> + }) >> >> # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) \ >> - __elide_trylock (&(adapt_count), is_lock_free, write) >> + ({ \ >> + int ret = 0; \ >> + if (__elision_aconf.try_tbegin > 0) \ >> + { \ > > I'd prefer a comment here why you abort if write is true. Or a > reference to another source file / function where this is explained. > Can be in a follow-up patch. Let's leave this to another patch. I want to unblock this bugfix. ;-) Anyway, this code is not new. This patch is just moving it from a function to a macro. AFAICS, x86 and powerpc are the only architectures to implement this macro. They both do the same thing and they don't provide comments. I'll have to investigate why it was implemented this way. Thanks!
On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho wrote: > Torvald Riegel <triegel@redhat.com> writes: > > > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho > >> + threads, either during lock acquisition, or elision. In order to avoid > >> + this evaluation from becoming a data race the access of is_lock_free > > > > It could be a data race because you're not using atomics there, but > > that's not the whole point. (We use the term "data race" specifically > > to refer to the C11/C++11 memory model concept of the same name.) > > You want to ensure the lock elision synchronization scheme, and thus are > > moving it inside the txn. > > Are you complaining about the usage of the term "data race"? Yes. > If so, what about "race condition"? Well, that would be better, but a race condition is not necessarily a bad thing. It's usually better to say which execution or interleaving you need to avoid, than just saying "race condition". > >> + is placed *inside* the transaction. Within the transaction is_lock_free > >> + can be accessed with relaxed ordering. We are assured by the transaction > > > > Please use "memory order" to be specific, or abbreviate with MO as > > mentioned on the Concurrency wiki page. > > Ack. > > >> + that the value of is_lock_free is consistent with the state of the lock, > >> + otherwise the hardware will abort the transaction. */ > > > > That's not the really why this lock elisions synchronization scheme > > works :) (At least you're not giving sufficient reason why it is.) > > I do agree with you that comment needs some adjustments but its intention is > only to describe the concurrency details of the next macro. It has no intention > of describing how lock elision or this macro work. > It was added per request [1] [2]. > > What do you think about the following? > > Within the transaction we are assured that all memory accesses are atomic > and is_lock_free can be evaluated with relaxed memory order. 
That way, the > value of is_lock_free is consistent with the state of the lock until the > end of the transaction. That's good enough for this patch in that it hints at the right thing. I'd still like to see this getting improved eventually, and I'd appreciate if you could participate in that. A shared comment for both the Intel and IBM lock elision implementations would be useful, I believe, as the general principle used in both is the same. I'd like to see this documented properly just so that future glibc developers know how to work with this, and that it get's less likely that bugs are introduced in the future. > >> + { \ > >> + ret = 1; \ > >> + break; \ > >> + } \ > >> + __builtin_tabort (_ABORT_LOCK_BUSY); \ > >> + } \ > >> + else \ > >> + if (!__get_new_count(&adapt_count)) \ > >> + break; \ > >> + } \ > >> + ret; \ > >> + }) > >> > >> # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) \ > >> - __elide_trylock (&(adapt_count), is_lock_free, write) > >> + ({ \ > >> + int ret = 0; \ > >> + if (__elision_aconf.try_tbegin > 0) \ > >> + { \ > > > > I'd prefer a comment here why you abort if write is true. Or a > > reference to another source file / function where this is explained. > > Can be in a follow-up patch. > > Let's leave this to another patch. I want to unblock this bugfix. ;-) OK.
Torvald Riegel <triegel@redhat.com> writes: > On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho > wrote: >> Torvald Riegel <triegel@redhat.com> writes: >> >> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho >> >> + threads, either during lock acquisition, or elision. In order to avoid >> >> + this evaluation from becoming a data race the access of is_lock_free >> > >> > It could be a data race because you're not using atomics there, but >> > that's not the whole point. (We use the term "data race" specifically >> > to refer to the C11/C++11 memory model concept of the same name.) >> > You want to ensure the lock elision synchronization scheme, and thus are >> > moving it inside the txn. >> >> Are you complaining about the usage of the term "data race"? > > Yes. > >> If so, what about "race condition"? > > Well, that would be better, but a race condition is not necessarily a > bad thing. It's usually better to say which execution or interleaving > you need to avoid, than just saying "race condition". What about the following? /* CONCURRENCY NOTES: The macro expression is_lock_free is read and possibly written to by multiple threads, either during lock acquisition, or elision. In order to avoid this evaluation from becoming a race condition with the lock acquirement from the lock primitive, the access of is_lock_free is placed *inside* the transaction. Within the transaction we are assured that all memory accesses are atomic and is_lock_free can be evaluated with relaxed memory order. That way, the value of is_lock_free is consistent with the state of the lock until the end of the transaction. */ > I'd still like to see this getting improved eventually, and I'd > appreciate if you could participate in that. Ack. Count me in.
On 08/25/2015 05:06 PM, Paul E. Murphy wrote: > The discussion this patch launched has been quite informative. Circling > back to the patch; it should fix the bug in question while we determine > how to improve the common code. > > My only trivial nit is the x86 version might benefit from a similar > comment regarding concurrency, otherwise LGTM. Thanks for the review Paul! Yes, the x86 code needs the same treatment and I think we'll do that next by fixing up all callers to use an inline function. c.
Ping, Torvald. "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes: > Torvald Riegel <triegel@redhat.com> writes: > >> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho >> wrote: >>> Torvald Riegel <triegel@redhat.com> writes: >>> >>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho >>> >> + threads, either during lock acquisition, or elision. In order to avoid >>> >> + this evaluation from becoming a data race the access of is_lock_free >>> > >>> > It could be a data race because you're not using atomics there, but >>> > that's not the whole point. (We use the term "data race" specifically >>> > to refer to the C11/C++11 memory model concept of the same name.) >>> > You want to ensure the lock elision synchronization scheme, and thus are >>> > moving it inside the txn. >>> >>> Are you complaining about the usage of the term "data race"? >> >> Yes. >> >>> If so, what about "race condition"? >> >> Well, that would be better, but a race condition is not necessarily a >> bad thing. It's usually better to say which execution or interleaving >> you need to avoid, than just saying "race condition". > > What about the following? > > /* CONCURRENCY NOTES: > > The macro expression is_lock_free is read and possibly written to by > multiple threads, either during lock acquisition, or elision. In order to > avoid this evaluation from becoming a race condition with the lock > acquirement from the lock primitive, the access of is_lock_free is placed > *inside* the transaction. Within the transaction we are assured that all > memory accesses are atomic and is_lock_free can be evaluated with relaxed > memory order. That way, the value of is_lock_free is consistent with the > state of the lock until the end of the transaction. */ Anything else you'd like to change before I push this? Thanks,
On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote: > +/* Returns 0 if the lock defined by is_lock_free was elided. > + ADAPT_COUNT is a per-lock state variable. */ > +# define ELIDE_LOCK(adapt_count, is_lock_free) \ > + ({ \ > + int ret = 0; \ > + if (adapt_count > 0) \ > + (adapt_count)--; \ > + else \ > + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ > + { \ > + if (__builtin_tbegin (0)) \ > + { \ > + if (is_lock_free) \ > + { \ > + ret = 1; \ > + break; \ > + } \ > + __builtin_tabort (_ABORT_LOCK_BUSY); \ > + } \ > + else \ > + if (!__get_new_count(&adapt_count)) \ > + break; \ > + } \ > + ret; \ > + }) Interesting. I was going to suggest following your __builtin_tabort() call here with a call to __builtin_unreachable() so we can give the compiler a little more freedom to optimize things on that path. Then I remembered we had some conditionl tabort insns we might be able to use here, since the tabort above is conditional on is_lock_free == 0. The following code seems to generate much better code, but at the loss of not passing _ABORT_LOCK_BUSY as a failure cause. /* Returns 0 if the lock defined by is_lock_free was elided. ADAPT_COUNT is a per-lock state variable. */ # define ELIDE_LOCK(adapt_count, is_lock_free) \ ({ \ int ret = 0; \ if (adapt_count > 0) \ (adapt_count)--; \ else \ for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ { \ if (__builtin_tbegin (0)) \ { \ __builtin_tabortwci (0x4, is_lock_free, 0) \ ret = 1; \ break; \ } \ else \ if (!__get_new_count(&adapt_count)) \ break; \ } \ ret; \ }) However, if I look at the code that consumes the failure code, I see that we're only looking for the persistent bit being set and not any specific failure cause we passed in: static inline bool __get_new_count (uint8_t *adapt_count) { /* A persistent failure indicates that a retry will probably result in another failure. Use normal locking now and for the next couple of calls. 
*/
  if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
    {
      if (__elision_aconf.skip_lock_internal_abort > 0)
	*adapt_count = __elision_aconf.skip_lock_internal_abort;
      return false;
    }

  /* Same logic as above, but for a number of temporary failures in a
     a row.  */
  else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
	   && __elision_aconf.try_tbegin > 0)
    *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
  return true;
}

...although looking closer, I see that the definitions we use for
passing in failure codes are:

/* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
   because tabort. always sets the first bit.  */
#define _ABORT_LOCK_BUSY	0x3f   /* Lock already used.  */
#define _ABORT_NESTED_TRYLOCK	0x3e   /* Write operation in trylock.  */
#define _ABORT_SYSCALL		0x3d   /* Syscall issued.  */

I'm not sure I understand the "even" part above, since two of these
values are clearly not even.  In fact, the two odd values above, if
passed to __builtin_tabort(), will lead to the persistent bit being
set, since tabort. sets the upper part of the TEXASR by taking the
least significant 8 bits of the value you passed in and concatenating
0x000001 to the end of it, so essentially (using _ABORT_LOCK_BUSY as
a failure cause example):

  TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1;

Given the persistent bit is bit 7, that means any odd value we pass
into __builtin_tabort(XXX) will set the TEXASR persistent bit.
Did we mean to do that?  Maybe we did, I'm just asking. :-)
I can see that if we're about to do a syscall, that should be persistent.
Are we also treating someone having a lock as persistent too?

Peter
On 09/01/2015 02:38 PM, Peter Bergner wrote: > ...although looking closer, I see have the definitions we use for > passing in failure codes is: > > /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even > because tabort. always sets the first bit. */ This is a bad comment. This is a misinterpretation of the ISA, as you've noted, bit 31 is always set if tabort* is used. This makes using the conditional versions impractical within the TLE implementation. I think the implication here is you can't trust bit 7 of texasr to deduce the abort code, as it could be set due to one of bits 8-11. I'm not sure if the cases leading bits 8-11 to get set are mutually exclusive with calling tabort. I.e is failure code 1 the only reserved code? > #define _ABORT_LOCK_BUSY 0x3f /* Lock already used. */ > #define _ABORT_NESTED_TRYLOCK 0x3e /* Write operation in trylock. */ This should definitely be an odd value too. I'll see to fixing its usage. As implemented now, we might retry 3 (try_tbegin) times in __lll_trylock_elision if we hit that one. > #define _ABORT_SYSCALL 0x3d /* Syscall issued. */ > > I'm not sure I understand the "even" part above, since two of these > values are clearly not even. In fact, the two odd values above if > passed to __builtin_tabort() will lead to the persistent bit being > set, since tabort. sets the upper part of the TEXASR by taking the > least significant 8 bits of the value you passed in and concatenating > 0x000001 to the end of it, so essentially (using _ABORT_LOCK_BUSY as > a failure cause example): > > TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1; > > Given the persistent bit is bit 7, that means any odd value we pass > into __builtin_tabort(XXX) will set the TEXASR persistent bit. > Did we mean to do that? Maybe we did, I'm just asking. :-) > I can see that if we're about to do a syscall, that should be persistent. > Are we also treating someone having a lock as persistent too? 
> > If we did, then using __builtin_tabortwci() won't work, since we > wouldn't be setting the persistent bit like we are now... unless > we can pass in is_lock_free to __get_new_count and test that > along with the persistent bit still. > > Peter
On 01-09-2015 16:38, Peter Bergner wrote: > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote: >> +/* Returns 0 if the lock defined by is_lock_free was elided. >> + ADAPT_COUNT is a per-lock state variable. */ >> +# define ELIDE_LOCK(adapt_count, is_lock_free) \ >> + ({ \ >> + int ret = 0; \ >> + if (adapt_count > 0) \ >> + (adapt_count)--; \ >> + else \ >> + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ >> + { \ >> + if (__builtin_tbegin (0)) \ >> + { \ >> + if (is_lock_free) \ >> + { \ >> + ret = 1; \ >> + break; \ >> + } \ >> + __builtin_tabort (_ABORT_LOCK_BUSY); \ >> + } \ >> + else \ >> + if (!__get_new_count(&adapt_count)) \ >> + break; \ >> + } \ >> + ret; \ >> + }) > > Interesting. I was going to suggest following your __builtin_tabort() > call here with a call to __builtin_unreachable() so we can give the > compiler a little more freedom to optimize things on that path. > Then I remembered we had some conditionl tabort insns we might > be able to use here, since the tabort above is conditional on > is_lock_free == 0. > > The following code seems to generate much better code, but at the > loss of not passing _ABORT_LOCK_BUSY as a failure cause. > > /* Returns 0 if the lock defined by is_lock_free was elided. > ADAPT_COUNT is a per-lock state variable. 
*/
> # define ELIDE_LOCK(adapt_count, is_lock_free) \
>   ({ \
>     int ret = 0; \
>     if (adapt_count > 0) \
>       (adapt_count)--; \
>     else \
>       for (int i = __elision_aconf.try_tbegin; i > 0; i--) \
> 	{ \
> 	  if (__builtin_tbegin (0)) \
> 	    { \
> 	      __builtin_tabortwci (0x4, is_lock_free, 0) \
> 	      ret = 1; \
> 	      break; \
> 	    } \
> 	  else \
> 	    if (!__get_new_count(&adapt_count)) \
> 	      break; \
> 	} \
>     ret; \
>   })
>
>
> However, if I look at the code that consumes the failure code, I see
> that we're only looking for the persistent bit being set and not any
> specific failure cause we passed in:

The idea of passing an error code in a transaction abort is to
differentiate it from other possible abort types, for instance newer
kernels that might abort in syscalls or a user program that defines
its own transaction abort code.

>
> static inline bool
> __get_new_count (uint8_t *adapt_count)
> {
>   /* A persistent failure indicates that a retry will probably
>      result in another failure.  Use normal locking now and
>      for the next couple of calls.  */
>   if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
>     {
>       if (__elision_aconf.skip_lock_internal_abort > 0)
> 	*adapt_count = __elision_aconf.skip_lock_internal_abort;
>       return false;
>     }
>
>   /* Same logic as above, but for a number of temporary failures in a
>      a row.  */
>   else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0
>	    && __elision_aconf.try_tbegin > 0)
>     *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries;
>   return true;
> }
>
> ...although looking closer, I see have the definitions we use for
> passing in failure codes is:
>
> /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even
>    because tabort. always sets the first bit.  */
> #define _ABORT_LOCK_BUSY	0x3f   /* Lock already used.  */
> #define _ABORT_NESTED_TRYLOCK	0x3e   /* Write operation in trylock.  */
> #define _ABORT_SYSCALL	0x3d   /* Syscall issued.
*/
>
> I'm not sure I understand the "even" part above, since two of these
> values are clearly not even.  In fact, the two odd values above if
> passed to __builtin_tabort() will lead to the persistent bit being
> set, since tabort. sets the upper part of the TEXASR by taking the
> least significant 8 bits of the value you passed in and concatenating
> 0x000001 to the end of it, so essentially (using _ABORT_LOCK_BUSY as
> a failure cause example):
>
>   TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1;
>
> Given the persistent bit is bit 7, that means any odd value we pass
> into __builtin_tabort(XXX) will set the TEXASR persistent bit.
> Did we mean to do that?  Maybe we did, I'm just asking. :-)
> I can see that if we're about to do a syscall, that should be persistent.
> Are we also treating someone having a lock as persistent too?

Indeed the 'odd' comment does not make sense and we should just remove it
(I misread the texasr definition).  My initial idea was to define some
codes that set the persistent failure bit and some that do not.  I think
the best approach would be:

/* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
   and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
   code means a persistent error that should trigger a default lock
   acquisition.  */
#define _ABORT_SYSCALL		0x1  /* Syscall issued.  */
#define _ABORT_LOCK_BUSY	0x2  /* Lock already used.  */
#define _ABORT_NESTED_TRYLOCK	0x4  /* Write operation in trylock.  */

>
> If we did, then using __builtin_tabortwci() won't work, since we
> wouldn't be setting the persistent bit like we are now... unless
> we can pass in is_lock_free to __get_new_count and test that
> along with the persistent bit still.

I think we can use '__builtin_tabortwci' since the transactional aborts
here are not meant to be persistent (the syscall ones are done in syscall
wrappers).  What do you think?

>
> Peter
On 09/01/2015 03:46 PM, Adhemerval Zanella wrote: > Indeed the 'odd' comment does not make sense and we should just remove it > (I misread texasr definition). My initial idea was define some codes > that set the persistent failures and some that do not. I think I best > approach would be: > > /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1 > and the TEXASR persistent bit is bit 25 (32-7). Only the syscall > code means a persistent error that should trigger a default lock > acquisition. */ > #define _ABORT_SYSCALL 0x1 /* Syscall issued. */ > #define _ABORT_LOCK_BUSY 0x2 /* Lock already used. */ > #define _ABORT_NESTED_TRYLOCK 0x4 /* Write operation in trylock. */ The kernel defines several abort codes, we'll want to work with them, or recycle them as needed. I'm not convinced any of the existing codes should be non-persistent: A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is guaranteed to fail try_tbegin times if there is no contention on the lock. Aborts get increasingly expensive as you increase the amount of speculative execution. A busy lock likely indicates contention in the critical section which does not benefit from elision, I'd err on the side of a persistent failure.
On Tue, 2015-09-01 at 15:25 -0500, Paul E. Murphy wrote: > > On 09/01/2015 02:38 PM, Peter Bergner wrote: > > ...although looking closer, I see have the definitions we use for > > passing in failure codes is: > > > > /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even > > because tabort. always sets the first bit. */ > > This is a bad comment. This is a misinterpretation of the ISA, as you've noted, > bit 31 is always set if tabort* is used. This makes using the conditional > versions impractical within the TLE implementation. I don't understand this. All tabort* instructions, including the unconditional tabort. set bit 31. The only thing that tabort. does that the conditional taborts[wd]c[i] don't do is set bits 0-7 (ie, failure code & persistent bits). > I think the implication here is you can't trust bit 7 of texasr to deduce the > abort code, as it could be set due to one of bits 8-11. I'm not sure if the cases > leading bits 8-11 to get set are mutually exclusive with calling tabort. I.e is > failure code 1 the only reserved code? Bit 7 is supposed to be used to specify whether a transaction failure is persistent (ie, whether we should retry the tbegin.) or not. That bit can be set by either a HW conflict leading to a transaction failure or an explicit use of tabort. It cannot be set with conditional tabort[wd]c[i] instructions, but Adhemerval's other note seems to imply we weren't supposed to be setting the persistent bit in this code, so using the conditional tabort[wd]c[i] should be ok here. Peter
On Tue, 2015-09-01 at 17:46 -0300, Adhemerval Zanella wrote: > On 01-09-2015 16:38, Peter Bergner wrote: > > /* Definitions used for TEXASR Failure code (bits 0:6), they need to be even > > because tabort. always sets the first bit. */ [snip] > Indeed the 'odd' comment does not make sense and we should just remove it > (I misread texasr definition). My initial idea was define some codes > that set the persistent failures and some that do not. I think I best > approach would be: > > /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1 > and the TEXASR persistent bit is bit 25 (32-7). Only the syscall > code means a persistent error that should trigger a default lock > acquisition. */ > #define _ABORT_SYSCALL 0x1 /* Syscall issued. */ > #define _ABORT_LOCK_BUSY 0x2 /* Lock already used. */ > #define _ABORT_NESTED_TRYLOCK 0x4 /* Write operation in trylock. */ So according to this, only the syscall is deemed persistent. > I think we can use '__builtin_tabortwci' since the transactional aborts > here are not meant to be persistent (the syscall one are done in syscall > wrappers). If we decide that these uses of the explicit tabort*s are really to be treated as not persistent, then I agree we could use the conditional tabort[wd]c[i] instructions, but it seems Paul thinks otherwise. I guess we should let Paul describe why. Peter
On 01-09-2015 18:24, Paul E. Murphy wrote:
> 
> 
> On 09/01/2015 03:46 PM, Adhemerval Zanella wrote:
>> Indeed the 'odd' comment does not make sense and we should just remove it
>> (I misread texasr definition). My initial idea was define some codes
>> that set the persistent failures and some that do not. I think I best
>> approach would be:
>>
>> /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1
>>    and the TEXASR persistent bit is bit 25 (32-7).  Only the syscall
>>    code means a persistent error that should trigger a default lock
>>    acquisition.  */
>> #define _ABORT_SYSCALL         0x1  /* Syscall issued.  */
>> #define _ABORT_LOCK_BUSY       0x2  /* Lock already used.  */
>> #define _ABORT_NESTED_TRYLOCK  0x4  /* Write operation in trylock.  */
> 
> The kernel defines several abort codes, we'll want to work with them, or
> recycle them as needed.

Yeah, but these are the ones the kernel itself generates (not glibc), and I
think it would be useful to use different codes for the aborts generated by
glibc itself (so an external tool like perf can detect exactly where the
abort came from).  And I used the numbers above just as an example; checking
the kernel code, it seems it defines:

#define TM_CAUSE_PERSISTENT     0x01
#define TM_CAUSE_KVM_RESCHED    0xe0  /* From PAPR */
#define TM_CAUSE_KVM_FAC_UNAV   0xe2  /* From PAPR */
#define TM_CAUSE_RESCHED        0xde
#define TM_CAUSE_TLBI           0xdc
#define TM_CAUSE_FAC_UNAV       0xda
#define TM_CAUSE_SYSCALL        0xd8
#define TM_CAUSE_MISC           0xd6  /* future use */
#define TM_CAUSE_SIGNAL         0xd4
#define TM_CAUSE_ALIGNMENT      0xd2
#define TM_CAUSE_EMULATE        0xd0

So I think we might use 0xaX or something for the glibc ones.

> 
> I'm not convinced any of the existing codes should be non-persistent:
> 
> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
> guaranteed to fail try_tbegin times if there is no contention on the lock.
> Aborts get increasingly expensive as you increase the amount of speculative
> execution.
> 
> A busy lock likely indicates contention in the critical section which
> does not benefit from elision, I'd err on the side of a persistent
> failure.
> 

I do not have that much information to decide it, although I do see that for
pthread_mutex_t at least _ABORT_LOCK_BUSY is not persistent if the critical
region does not generate enough contention.
On 09/01/2015 04:58 PM, Adhemerval Zanella wrote: >> I'm not convinced any of the existing codes should be non-persistent: >> >> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is >> guaranteed to fail try_tbegin times if there is no contention on the lock. >> Aborts get increasingly expensive as you increase the amount of speculative >> execution. >> >> A busy lock likely indicates contention in the critical section which >> does not benefit from elision, I'd err on the side of a persistent >> failure. >> > > I do not that much information to decide it, although I do see for pthread_mutex_t > at least the _ABORT_LOCK_BUSY is not persistent if the critical region does not > generate enough contention. This seems to violate my understanding of the adaptive algorithm. I agree, there is a possibility another thread might successfully elide while adapt_count != 0, *futex == 0, and it happens to be within the retry loop. Transactions are not free. The question is: Is it cheaper to risk a few more failures in hopes of the transaction succeeding, or just grabbing the lock? This becomes less desirable with increasing values of try_tbegin. I've found 11 to be the value which factors out the "false" failures under SMT8.
On 01-09-2015 19:43, Paul E. Murphy wrote:
> 
> 
> On 09/01/2015 04:58 PM, Adhemerval Zanella wrote:
> 
>>> I'm not convinced any of the existing codes should be non-persistent:
>>>
>>> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is
>>> guaranteed to fail try_tbegin times if there is no contention on the lock.
>>> Aborts get increasingly expensive as you increase the amount of speculative
>>> execution.
>>>
>>> A busy lock likely indicates contention in the critical section which
>>> does not benefit from elision, I'd err on the side of a persistent
>>> failure.
>>>
>>
>> I do not that much information to decide it, although I do see for pthread_mutex_t
>> at least the _ABORT_LOCK_BUSY is not persistent if the critical region does not
>> generate enough contention.
> 
> This seems to violate my understanding of the adaptive algorithm. I agree, there
> is a possibility another thread might successfully elide while adapt_count != 0,
> *futex == 0, and it happens to be within the retry loop.
> 
> Transactions are not free. The question is: Is it cheaper to risk a few more
> failures in hopes of the transaction succeeding, or just grabbing the lock?

Well, that is exactly the question.  I do not know; it will heavily depend on
the kind of algorithm and the contention you have.

> 
> This becomes less desirable with increasing values of try_tbegin. I've found
> 11 to be the value which factors out the "false" failures under SMT8.
> 

If the workloads you are trying are showing that a higher try_tbegin is
better, then indeed we can either increase it or set _ABORT_LOCK_BUSY as a
persistent failure so the algorithm will bail out and use locks instead of
retrying.

Which kind of benchmark or workloads are you using to evaluate it?
On 09/02/2015 07:45 AM, Adhemerval Zanella wrote: > > > On 01-09-2015 19:43, Paul E. Murphy wrote: >> >> >> On 09/01/2015 04:58 PM, Adhemerval Zanella wrote: >> >>>> I'm not convinced any of the existing codes should be non-persistent: >>>> >>>> A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is >>>> guaranteed to fail try_tbegin times if there is no contention on the lock. >>>> Aborts get increasingly expensive as you increase the amount of speculative >>>> execution. >>>> >>>> A busy lock likely indicates contention in the critical section which >>>> does not benefit from elision, I'd err on the side of a persistent >>>> failure. >>>> >>> >>> I do not that much information to decide it, although I do see for pthread_mutex_t >>> at least the _ABORT_LOCK_BUSY is not persistent if the critical region does not >>> generate enough contention. >> >> This seems to violate my understanding of the adaptive algorithm. I agree, there >> is a possibility another thread might successfully elide while adapt_count != 0, >> *futex == 0, and it happens to be within the retry loop. >> >> Transactions are not free. The question is: Is it cheaper to risk a few more >> failures in hopes of the transaction succeeding, or just grabbing the lock? > > Well that is exactly the question I do not know it will heavy depend of the > kind of algorithm and contention you have. > >> >> This becomes less desirable with increasing values of try_tbegin. I've found >> 11 to be the value which factors out the "false" failures under SMT8. >> > > If the workloads you are trying are showing a higher try_tbegin is better, than > indeed we can either increase it or set _ABORT_LOCK_BUSY as a persistent failure > so the algorithm will bail out and use locks instead of retrying. > > Which kind of benchmark or workloads are you using to evaluate it? > I've posted my "workload" examples at git@github.com:pmur/glibc-test-tle.git. 
Take them with a grain of salt, and think of them as sanity checks. I've been using various derivations of them to poke at what I think are edge cases. Any suggestions or additions are appreciated.
Peter Bergner <bergner@vnet.ibm.com> writes: > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho wrote: >> +/* Returns 0 if the lock defined by is_lock_free was elided. >> + ADAPT_COUNT is a per-lock state variable. */ >> +# define ELIDE_LOCK(adapt_count, is_lock_free) \ >> + ({ \ >> + int ret = 0; \ >> + if (adapt_count > 0) \ >> + (adapt_count)--; \ >> + else \ >> + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ >> + { \ >> + if (__builtin_tbegin (0)) \ >> + { \ >> + if (is_lock_free) \ >> + { \ >> + ret = 1; \ >> + break; \ >> + } \ >> + __builtin_tabort (_ABORT_LOCK_BUSY); \ >> + } \ >> + else \ >> + if (!__get_new_count(&adapt_count)) \ >> + break; \ >> + } \ >> + ret; \ >> + }) > > Interesting. I was going to suggest following your __builtin_tabort() > call here with a call to __builtin_unreachable() so we can give the > compiler a little more freedom to optimize things on that path. > Then I remembered we had some conditionl tabort insns we might > be able to use here, since the tabort above is conditional on > is_lock_free == 0. > > The following code seems to generate much better code, but at the > loss of not passing _ABORT_LOCK_BUSY as a failure cause. > > /* Returns 0 if the lock defined by is_lock_free was elided. > ADAPT_COUNT is a per-lock state variable. 
*/ > # define ELIDE_LOCK(adapt_count, is_lock_free) \ > ({ \ > int ret = 0; \ > if (adapt_count > 0) \ > (adapt_count)--; \ > else \ > for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ > { \ > if (__builtin_tbegin (0)) \ > { \ > __builtin_tabortwci (0x4, is_lock_free, 0) \ > ret = 1; \ > break; \ > } \ > else \ > if (!__get_new_count(&adapt_count)) \ > break; \ > } \ > ret; \ > }) > > > However, if I look at the code that consumes the failure code, I see > that we're only looking for the persistent bit being set and not any > specific failure cause we passed in: > > static inline bool > __get_new_count (uint8_t *adapt_count) > { > /* A persistent failure indicates that a retry will probably > result in another failure. Use normal locking now and > for the next couple of calls. */ > if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) > { > if (__elision_aconf.skip_lock_internal_abort > 0) > *adapt_count = __elision_aconf.skip_lock_internal_abort; > return false; > } > > /* Same logic as above, but for a number of temporary failures in a > a row. */ > else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 > && __elision_aconf.try_tbegin > 0) > *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; > return true; > } That seems an interesting suggestion, but out of scope for this patch, IMO. We're just fixing an issue here. ;-)
Ping x2. "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes: > Ping, Torvald. > > > "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes: > >> Torvald Riegel <triegel@redhat.com> writes: >> >>> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho >>> wrote: >>>> Torvald Riegel <triegel@redhat.com> writes: >>>> >>>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho >>>> >> + threads, either during lock acquisition, or elision. In order to avoid >>>> >> + this evaluation from becoming a data race the access of is_lock_free >>>> > >>>> > It could be a data race because you're not using atomics there, but >>>> > that's not the whole point. (We use the term "data race" specifically >>>> > to refer to the C11/C++11 memory model concept of the same name.) >>>> > You want to ensure the lock elision synchronization scheme, and thus are >>>> > moving it inside the txn. >>>> >>>> Are you complaining about the usage of the term "data race"? >>> >>> Yes. >>> >>>> If so, what about "race condition"? >>> >>> Well, that would be better, but a race condition is not necessarily a >>> bad thing. It's usually better to say which execution or interleaving >>> you need to avoid, than just saying "race condition". >> >> What about the following? >> >> /* CONCURRENCY NOTES: >> >> The macro expression is_lock_free is read and possibly written to by >> multiple threads, either during lock acquisition, or elision. In order to >> avoid this evaluation from becoming a race condition with the lock >> acquirement from the lock primitive, the access of is_lock_free is placed >> *inside* the transaction. Within the transaction we are assured that all >> memory accesses are atomic and is_lock_free can be evaluated with relaxed >> memory order. That way, the value of is_lock_free is consistent with the >> state of the lock until the end of the transaction. */ > > Anything else you'd like to change before I push this? > > Thanks,
Ping x3, Torvald. ;-) Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> writes: > Ping x2. > > "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes: > >> Ping, Torvald. >> >> >> "Tulio Magno Quites Machado Filho" <tuliom@linux.vnet.ibm.com> writes: >> >>> Torvald Riegel <triegel@redhat.com> writes: >>> >>>> On Tue, 2015-08-25 at 19:08 -0300, Tulio Magno Quites Machado Filho >>>> wrote: >>>>> Torvald Riegel <triegel@redhat.com> writes: >>>>> >>>>> > On Mon, 2015-08-24 at 15:11 -0300, Tulio Magno Quites Machado Filho >>>>> >> + threads, either during lock acquisition, or elision. In order to avoid >>>>> >> + this evaluation from becoming a data race the access of is_lock_free >>>>> > >>>>> > It could be a data race because you're not using atomics there, but >>>>> > that's not the whole point. (We use the term "data race" specifically >>>>> > to refer to the C11/C++11 memory model concept of the same name.) >>>>> > You want to ensure the lock elision synchronization scheme, and thus are >>>>> > moving it inside the txn. >>>>> >>>>> Are you complaining about the usage of the term "data race"? >>>> >>>> Yes. >>>> >>>>> If so, what about "race condition"? >>>> >>>> Well, that would be better, but a race condition is not necessarily a >>>> bad thing. It's usually better to say which execution or interleaving >>>> you need to avoid, than just saying "race condition". >>> >>> What about the following? >>> >>> /* CONCURRENCY NOTES: >>> >>> The macro expression is_lock_free is read and possibly written to by >>> multiple threads, either during lock acquisition, or elision. In order to >>> avoid this evaluation from becoming a race condition with the lock >>> acquirement from the lock primitive, the access of is_lock_free is placed >>> *inside* the transaction. Within the transaction we are assured that all >>> memory accesses are atomic and is_lock_free can be evaluated with relaxed >>> memory order. 
That way, the value of is_lock_free is consistent with the >>> state of the lock until the end of the transaction. */ >> >> Anything else you'd like to change before I push this? >> >> Thanks,
Sorry for the very long response time.  I've been sick for a while and am
slow in catching up.

On Wed, 2015-08-26 at 13:31 -0300, Tulio Magno Quites Machado Filho wrote:
> What about the following?
> 
> /* CONCURRENCY NOTES:
> 
>    The macro expression is_lock_free is read and possibly written to by
>    multiple threads, either during lock acquisition, or elision.  In order to
>    avoid this evaluation from becoming a race condition with the lock
>    acquirement from the lock primitive, the access of is_lock_free is placed
>    *inside* the transaction.  Within the transaction we are assured that all
>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>    memory order.  That way, the value of is_lock_free is consistent with the
>    state of the lock until the end of the transaction.  */

I suggest instead:

The evaluation of the macro expression is_lock_free encompasses one or more
loads from memory locations that are concurrently modified by other threads.
For lock elision to work, this evaluation and the rest of the critical
section protected by the lock must be atomic because an execution with lock
elision must be equivalent to an execution in which the lock would have been
actually acquired and released.  Therefore, we evaluate is_lock_free inside
of the transaction that represents the critical section for which we want to
use lock elision, which ensures the atomicity that we require.

If you think this is adequate, please commit.
On Tue, 2015-09-01 at 16:24 -0500, Paul E. Murphy wrote: > > On 09/01/2015 03:46 PM, Adhemerval Zanella wrote: > > Indeed the 'odd' comment does not make sense and we should just remove it > > (I misread texasr definition). My initial idea was define some codes > > that set the persistent failures and some that do not. I think I best > > approach would be: > > > > /* tabort will set TEXASR(0:31) = ((_ABORT_LOCK_BUSY & 0xff) << 24) | 0x1 > > and the TEXASR persistent bit is bit 25 (32-7). Only the syscall > > code means a persistent error that should trigger a default lock > > acquisition. */ > > #define _ABORT_SYSCALL 0x1 /* Syscall issued. */ > > #define _ABORT_LOCK_BUSY 0x2 /* Lock already used. */ > > #define _ABORT_NESTED_TRYLOCK 0x4 /* Write operation in trylock. */ > > The kernel defines several abort codes, we'll want to work with them, or > recycle them as needed. Agreed in general. The abort codes should be part of the ABI, or we'll get into trouble interpreting them given that transactions can span across different layers. I've asked Andi Kleen whether he can make this happen for x86 (eg, through appropriate documentation somewhere). It would be good if this can be clarified/specified on PowerPC too. Can kernel-level abort codes actually reach userspace, or do we only need to establish userspace consensus for these codes? For x86, it's userspace only AFAIK. Would it be likely that userspace explicit aborts get interpreted by the kernel? They can't be trusted more than one would trust the program. > I'm not convinced any of the existing codes should be non-persistent: > > A pthread_mutex_trylock attempt within an elided pthread_mutex_lock is > guaranteed to fail try_tbegin times if there is no contention on the lock. > Aborts get increasingly expensive as you increase the amount of speculative > execution. 
Although this depends on the program (e.g., trylock might not be called all
the time), in this case I would also guess that it would be better to
consider it a persistent condition.

> A busy lock likely indicates contention in the critical section which
> does not benefit from elision, I'd err on the side of a persistent
> failure.

I don't think I agree.  An already-acquired lock is something that could be
hit less likely by using elision more often on this lock (think about the
"lemming effect").
[Adding Andi if he has an opinion] On 10/08/2015 10:54 AM, Torvald Riegel wrote: >> A busy lock likely indicates contention in the critical section which >> does not benefit from elision, I'd err on the side of a persistent >> failure. > > I don't think I agree. An already-acquired lock is something that could > be hit less likely by using elision more often on this lock (think about > the "lemming effect"). > My line of thinking is that you are already experiencing the effect if a thread hits a taken lock. It most likely [1] implies adapt_count > 0, so you are guaranteed a time period of serialization. The best outcome is that you manage to run a transaction in between an actual lock. The worst being you extend the serialization period. I think perhaps the best of both worlds might be to check the lock, and set the persistent bit if the lock is busy with waiters. Thoughts? [1] A pthread condition waiter is an odd exception in that they will set the lock to busy with waiters when they wake up using an elidable lock, but not set adapt_count. Also, you could hit the case where adapt_count goes from 1 -> 0.
On Fri, Oct 09, 2015 at 10:30:47AM -0500, Paul E. Murphy wrote:
> 
> [Adding Andi if he has an opinion]
> 
> On 10/08/2015 10:54 AM, Torvald Riegel wrote:
> >> A busy lock likely indicates contention in the critical section which
> >> does not benefit from elision, I'd err on the side of a persistent
> >> failure.
> > 
> > I don't think I agree.  An already-acquired lock is something that could
> > be hit less likely by using elision more often on this lock (think about
> > the "lemming effect").
> > 
> 
> My line of thinking is that you are already experiencing the effect if
> a thread hits a taken lock. It most likely [1] implies adapt_count > 0,
> so you are guaranteed a time period of serialization. The best outcome is
> that you manage to run a transaction in between an actual lock. The worst
> being you extend the serialization period.

The Lemming effect happens after the taken lock is freed.  It prevents
speedy recovery into the parallel eliding state.

The typical way to mitigate the Lemming effect is to wait for the lock to
become free before retrying.  Unfortunately that is difficult due to the way
the glibc mutexes work, as there's no way to do this with the futex
interface.  It would be possible to spin a bit, but the code doesn't do
that.  But to do that you shouldn't do a persistent abort, but keep going
for a bit.

> 
> I think perhaps the best of both worlds might be to check the lock, and
> set the persistent bit if the lock is busy with waiters. Thoughts?

There's unfortunately nothing to check when you're experiencing the Lemming
effect.  The other users just speculate on their own.

-Andi
Torvald Riegel <triegel@redhat.com> writes:

> Sorry for the very long response time.

Please do not worry about it.  You have been helping me a lot here. ;-)

> I've been sick since a while and am slow in catching up.

I hope you're better now.

> On Wed, 2015-08-26 at 13:31 -0300, Tulio Magno Quites Machado Filho
> wrote:
>> What about the following?
>>
>> /* CONCURRENCY NOTES:
>>
>>    The macro expression is_lock_free is read and possibly written to by
>>    multiple threads, either during lock acquisition, or elision.  In order to
>>    avoid this evaluation from becoming a race condition with the lock
>>    acquirement from the lock primitive, the access of is_lock_free is placed
>>    *inside* the transaction.  Within the transaction we are assured that all
>>    memory accesses are atomic and is_lock_free can be evaluated with relaxed
>>    memory order.  That way, the value of is_lock_free is consistent with the
>>    state of the lock until the end of the transaction.  */
>
> I suggest instead:
>
> The evaluation of the macro expression is_lock_free encompasses one or
> more loads from memory locations that are concurrently modified by other
> threads.  For lock elision to work, this evaluation and the rest of the
> critical section protected by the lock must be atomic because an
> execution with lock elision must be equivalent to an execution in which
> the lock would have been actually acquired and released.  Therefore, we
> evaluate is_lock_free inside of the transaction that represents the
> critical section for which we want to use lock elision, which ensures
> the atomicity that we require.
>
> If you think this is adequate, please commit.

LGTM.  I will commit.

Thanks again!
diff --git a/NEWS b/NEWS index 6b36c08..76df3fd 100644 --- a/NEWS +++ b/NEWS @@ -11,7 +11,8 @@ Version 2.23 14341, 16517, 16519, 16520, 16734, 16973, 17787, 17905, 18084, 18086, 18265, 18370, 18421, 18480, 18525, 18618, 18647, 18661, 18674, 18681, - 18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824. + 18778, 18781, 18787, 18789, 18790, 18795, 18796, 18820, 18823, 18824, + 18743. * The obsolete header <regexp.h> has been removed. Programs that require this header must be updated to use <regex.h> instead. diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h index 389f5a5..1cc9890 100644 --- a/sysdeps/powerpc/nptl/elide.h +++ b/sysdeps/powerpc/nptl/elide.h @@ -23,67 +23,76 @@ # include <htm.h> # include <elision-conf.h> -/* Returns true if the lock defined by is_lock_free as elided. - ADAPT_COUNT is a pointer to per-lock state variable. */ - +/* Get the new value of adapt_count according to the elision + configurations. Returns true if the system should retry again or false + otherwise. */ static inline bool -__elide_lock (uint8_t *adapt_count, int is_lock_free) +__get_new_count (uint8_t *adapt_count) { - if (*adapt_count > 0) + /* A persistent failure indicates that a retry will probably + result in another failure. Use normal locking now and + for the next couple of calls. */ + if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) { - (*adapt_count)--; + if (__elision_aconf.skip_lock_internal_abort > 0) + *adapt_count = __elision_aconf.skip_lock_internal_abort; return false; } - - for (int i = __elision_aconf.try_tbegin; i > 0; i--) - { - if (__builtin_tbegin (0)) - { - if (is_lock_free) - return true; - /* Lock was busy. */ - __builtin_tabort (_ABORT_LOCK_BUSY); - } - else - { - /* A persistent failure indicates that a retry will probably - result in another failure. Use normal locking now and - for the next couple of calls. 
*/ - if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) - { - if (__elision_aconf.skip_lock_internal_abort > 0) - *adapt_count = __elision_aconf.skip_lock_internal_abort; - break; - } - /* Same logic as above, but for a number of temporary failures in a - a row. */ - else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 - && __elision_aconf.try_tbegin > 0) - *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; - } - } - - return false; + /* Same logic as above, but for a number of temporary failures in a + a row. */ + else if (__elision_aconf.skip_lock_out_of_tbegin_retries > 0 + && __elision_aconf.try_tbegin > 0) + *adapt_count = __elision_aconf.skip_lock_out_of_tbegin_retries; + return true; } -# define ELIDE_LOCK(adapt_count, is_lock_free) \ - __elide_lock (&(adapt_count), is_lock_free) - - -static inline bool -__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write) -{ - if (__elision_aconf.try_tbegin > 0) - { - if (write) - __builtin_tabort (_ABORT_NESTED_TRYLOCK); - return __elide_lock (adapt_count, is_lock_free); - } - return false; -} +/* CONCURRENCY NOTES: + + The value of is_lock_free is read and possibly written to by multiple + threads, either during lock acquisition, or elision. In order to avoid + this evaluation from becoming a data race the access of is_lock_free + is placed *inside* the transaction. Within the transaction is_lock_free + can be accessed with relaxed ordering. We are assured by the transaction + that the value of is_lock_free is consistent with the state of the lock, + otherwise the hardware will abort the transaction. */ + +/* Returns 0 if the lock defined by is_lock_free was elided. + ADAPT_COUNT is a per-lock state variable. 
*/ +# define ELIDE_LOCK(adapt_count, is_lock_free) \ + ({ \ + int ret = 0; \ + if (adapt_count > 0) \ + (adapt_count)--; \ + else \ + for (int i = __elision_aconf.try_tbegin; i > 0; i--) \ + { \ + if (__builtin_tbegin (0)) \ + { \ + if (is_lock_free) \ + { \ + ret = 1; \ + break; \ + } \ + __builtin_tabort (_ABORT_LOCK_BUSY); \ + } \ + else \ + if (!__get_new_count(&adapt_count)) \ + break; \ + } \ + ret; \ + }) # define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) \ - __elide_trylock (&(adapt_count), is_lock_free, write) + ({ \ + int ret = 0; \ + if (__elision_aconf.try_tbegin > 0) \ + { \ + if (write) \ + __builtin_tabort (_ABORT_NESTED_TRYLOCK); \ + ret = ELIDE_LOCK (adapt_count, is_lock_free); \ + } \ + ret; \ + }) static inline bool