[BZ #18743] PowerPC: Fix a race condition when eliding a lock

Message ID: 55BBD3E9.8040005@linaro.org
State: Dropped
Delegated to: Tulio Magno Quites Machado Filho

Commit Message

Adhemerval Zanella Netto July 31, 2015, 8 p.m. UTC
  On 31-07-2015 11:57, Tulio Magno Quites Machado Filho wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> 
>> On 30-07-2015 23:06, Tulio Magno Quites Machado Filho wrote:
>>> Using pthread_rwlock_rdlock as an example, is_lock_free would be a macro
>>> argument with the following contents:
>>>
>>>  128   if (ELIDE_LOCK (rwlock->__data.__rwelision,
>>>  129                   rwlock->__data.__lock == 0
>>>  130                   && rwlock->__data.__writer == 0
>>>  131                   && rwlock->__data.__nr_readers == 0))
>>>
>>> And this is where the important memory access happens.
>>> If one of these fields is changed by another thread after this point, but
>>> before __builtin_tbegin, we're able to reproduce the first problem I mentioned
>>> previously.
>>
>> My understanding is these fields will be changed only in the lock-taken path.  Let's
>> say a thread t1 grabs the lock (either by a transaction or by using a default lock)
>> and modifies any 'rwlock->__data' fields after the ELIDE_LOCK check and before
>> the transaction begins on a second thread t2.  My understanding of the issue you
>> are seeing is that the transaction will initiate, and since is_lock_free is 1 it
>> won't abort and will continue as if the lock were taken in both threads.
> 
> I agree with you.  But this won't reproduce the bug.  You have to change the
> order of events.
> 
> Let's say t1 is the first one to read rwlock->__data fields and they're all 0.
> t1 will set is_lock_free to 0.

I believe you mean is_lock_free will be set to '1' if all fields are '0' (since
the lock will not be held).

> Then t2 modifies rwlock->__data before t1 starts the transaction (as in: t2 takes
> the lock, instead of starting a transaction).
> After that, t1 starts the transaction.  Here comes the problem: the memory
> barrier is created with rwlock->__data saying that someone took the lock,
> but is_lock_free is already set to 1.  In this scenario, t1 will proceed with
> the transaction.
> 
>> However my understanding is the transaction won't be able to commit, since it
>> will have a conflict on 'rwlock->__data.lock' or on any memory being accessed
>> in the critical region.
> 
> Yes, but with the order of events I mentioned, when t1 calls
> pthread_rwlock_unlock, it will run the following code:
> 
>   38   if (ELIDE_UNLOCK (rwlock->__data.__writer == 0
>   39                     && rwlock->__data.__nr_readers == 0))
>   40     return 0;
> 
> Where ELIDE_UNLOCK is:
> 
>   89 static inline bool
>   90 __elide_unlock (int is_lock_free)
>   91 {
>   92   if (is_lock_free)
>   93     {
>   94       __builtin_tend (0);
>   95       return true;
>   96     }
>   97   return false;
>   98 }
>   99 
>  100 # define ELIDE_UNLOCK(is_lock_free) \
>  101   __elide_unlock (is_lock_free)
> 
> Remember that t1 started the transaction even though rwlock->__data said someone
> was holding the lock.
> So now, in __elide_unlock, is_lock_free will be 0.  __elide_unlock will
> return false and t1 will unlock a lock acquired by t2.
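
To restate the interleaving (a condensed sketch of the events described above):

  t1: reads rwlock->__data.__lock/__writer/__nr_readers, all 0,
      so is_lock_free evaluates to 1
  t2: acquires the lock the conventional way and updates rwlock->__data
  t1: __builtin_tbegin (0) -- the transaction starts, but is_lock_free
      was computed before the tbegin, so the stale value 1 is used
  t1: runs the critical section as a transaction
  t1: pthread_rwlock_unlock: ELIDE_UNLOCK re-evaluates __writer and
      __nr_readers (now nonzero), so is_lock_free is 0, __elide_unlock
      returns false, and t1 falls through to the normal unlock path,
      releasing the lock t2 acquired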

I see now a window where the pthread_rwlock_t internal structures might be
accessed concurrently while the transaction is never finished.  However, I do
not think this is powerpc-specific, and thus I think it should be fixed in the
generic algorithm instead.  Something like this (the full patch is at the end
of this message):

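The core of the change, excerpted from the patch at the end of this message
(__elide_lock now takes pointers to the lock fields, so the loads happen
inside the transaction, behind a compiler barrier):

  static inline bool
  __elide_lock (uint8_t *adapt_count, int *lock, int *writer,
                unsigned int *readers)
  {
    ...
    if (__builtin_tbegin (0))
      {
        /* The compiler barrier is required because some GCC versions might
           reorder the lock read before the transaction init builtin.  */
        asm volatile ("" ::: "memory");
        if ((*lock == 0) && (*writer == 0) && (*readers == 0))
          return true;
        /* Lock was busy.  */
        __builtin_tabort (_ABORT_LOCK_BUSY);
      }
    ...
  }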

I did not fix the x86_64 code, which would require some adjustments.  I checked
on ppc64le and the nptl tests pass.  It would be good if you could create a
testcase to stress this issue (though I do see this could be quite hard, since
it is very timing-dependent).

The problem with this approach is that it couples the elide macros with the
rwlock fields, whereas the idea was to keep this header generic for multiple
elision algorithms.  Suggestions?

> 
>> Are you saying that the transaction is not being aborted
>> in the critical region?
> 
> No.  I'm saying the memory access to rwlock->__data is critical and should
> happen inside the transaction.  Without this, we can't say the whole process
> is atomic.
> 
>> If the compiler is doing memory reordering around the HTM builtins, it is
>> *clearly* a bug, since transaction begin/end acts as a full memory barrier
>> (sync).  You should also add a compiler barrier in the pthread elision code
>> as well.
> 
> Agreed.  But I guess that's out of scope for this particular bug.
> I could prepare a separate patch with these changes if you agree with that.
>
  

Comments

Tulio Magno Quites Machado Filho July 31, 2015, 9:44 p.m. UTC | #1
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> On 31-07-2015 11:57, Tulio Magno Quites Machado Filho wrote:
>> Let's say t1 is the first one to read rwlock->__data fields and they're all 0.
>> t1 will set is_lock_free to 0.
>
> I believe you mean is_lock_free will be set to '1' if all fields are '0' (since
> the lock will not be held).

Yes!  I'm sorry.

> I see now a window where the pthread_rwlock_t internal structures might be
> accessed concurrently while the transaction is never finished.  However, I do
> not think this is powerpc-specific, and thus I think it should be fixed in the
> generic algorithm instead.  Something like this (the full patch is at the end
> of this message):
>
> ...
>
> I did not fix the x86_64 code, which would require some adjustments.  I checked
> on ppc64le and the nptl tests pass.

I didn't test your patch, but after reading it, I'm sure it does fix the
problem too.

> It would be good if you could create a
> testcase to stress this issue (though I do see this could be quite hard, since
> it is very timing-dependent).

Yes.
I tried, but I didn't like the idea of creating a testcase that would only
provide intermittent failures, especially because it's very difficult to
reproduce this failure.
Any suggestions?
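
For reference, the kind of test in question would look roughly like this
(a hypothetical sketch, not a reliable reproducer, since the race window
between the is_lock_free evaluation and the tbegin is tiny):

  #include <pthread.h>
  #include <stdio.h>

  /* Readers and writers race so that one thread's conventional lock
     acquisition can fall between another thread's is_lock_free
     evaluation and its transaction begin.  A failure shows up (rarely)
     as a hang or a corrupted counter.  */

  #define NTHREADS 8
  #define ITERS 100000

  static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
  static long counter; /* written only under the write lock */

  static void *
  worker (void *arg)
  {
    (void) arg;
    for (int i = 0; i < ITERS; i++)
      if (i % 16 == 0)
        {
          pthread_rwlock_wrlock (&rwlock);
          counter++;
          pthread_rwlock_unlock (&rwlock);
        }
      else
        {
          pthread_rwlock_rdlock (&rwlock);
          (void) counter; /* readers only observe */
          pthread_rwlock_unlock (&rwlock);
        }
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
      pthread_create (&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
      pthread_join (t[i], NULL);
    printf ("counter = %ld (expected %d)\n",
            counter, NTHREADS * (ITERS / 16));
    return counter == NTHREADS * (ITERS / 16) ? 0 : 1;
  }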

> The problem with this approach is that it couples the elide macros with the
> rwlock fields, whereas the idea was to keep this header generic for multiple
> elision algorithms.  Suggestions?

And that's why I preferred to modify only the powerpc implementation.
Both proposals are identical, in the sense that they move the memory access
into the transaction.

I'd appreciate it if you could explain why you prefer to change the argument
list of ELIDE_LOCK instead of the powerpc-specific code.
  
Adhemerval Zanella Netto July 31, 2015, 9:59 p.m. UTC | #2
On 31-07-2015 18:44, Tulio Magno Quites Machado Filho wrote:
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> 
>> On 31-07-2015 11:57, Tulio Magno Quites Machado Filho wrote:
>>> Let's say t1 is the first one to read rwlock->__data fields and they're all 0.
>>> t1 will set is_lock_free to 0.
>>
>> I believe you mean is_lock_free will be set to '1' if all fields are '0' (since
>> the lock will not be held).
> 
> Yes!  I'm sorry.
> 
>> I see now a window where the pthread_rwlock_t internal structures might be
>> accessed concurrently while the transaction is never finished.  However, I do
>> not think this is powerpc-specific, and thus I think it should be fixed in the
>> generic algorithm instead.  Something like this (the full patch is at the end
>> of this message):
>>
>> ...
>>
>> I did not fix the x86_64 code, which would require some adjustments.  I checked
>> on ppc64le and the nptl tests pass.
> 
> I didn't test your patch, but after reading it, I'm sure it does fix the
> problem too.
> 
>> It would be good if you could create a
>> testcase to stress this issue (though I do see this could be quite hard, since
>> it is very timing-dependent).
> 
> Yes.
> I tried, but I didn't like the idea of creating a testcase that would only
> provide intermittent failures, especially because it's very difficult to
> reproduce this failure.
> Any suggestions?
> 
>> The problem with this approach is that it couples the elide macros with the
>> rwlock fields, whereas the idea was to keep this header generic for multiple
>> elision algorithms.  Suggestions?
> 
> And that's why I preferred to modify only the powerpc implementation.
> Both proposals are identical, in the sense that they move the memory access
> into the transaction.
> 
> I'd appreciate it if you could explain why you prefer to change the argument
> list of ELIDE_LOCK instead of the powerpc-specific code.
> 

Mainly because I see this issue as not powerpc-specific, and I think we should
also fix it for x86 (which currently uses the same mechanism) and for future
arches that might also implement lock elision this way.
  
Tulio Magno Quites Machado Filho July 31, 2015, 11:05 p.m. UTC | #3
Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:

> Mainly because I see this issue as not powerpc-specific, and I think we should
> also fix it for x86 (which currently uses the same mechanism) and for future
> arches that might also implement lock elision this way.

AFAICS, this is powerpc-specific.  x86 ensures the memory access happens
inside the transaction.

Here is the x86 implementation:

  50 /* is_lock_free must be executed inside the transaction */
  51 
  52 /* Returns true if lock defined by IS_LOCK_FREE was elided.
  53    ADAPT_COUNT is a pointer to per-lock state variable. */
  54 
  55 #define ELIDE_LOCK(adapt_count, is_lock_free)                   \
  56   ({                                                            \
  57     int ret = 0;                                                \
  58                                                                 \
  59     if ((adapt_count) <= 0)                                     \
  60       {                                                         \
  61         for (int i = __elision_aconf.retry_try_xbegin; i > 0; i--) \
  62           {                                                     \
  63             unsigned int status;                                \
  64             if ((status = _xbegin ()) == _XBEGIN_STARTED)       \
  65               {                                                 \
  66                 if (is_lock_free)                               \
  67                   {                                             \
  68                     ret = 1;                                    \
  69                     break;                                      \
  70                   }                                             \
  71                 _xabort (_ABORT_LOCK_BUSY);                     \
  72               }                                                 \
  73             if (!elision_adapt (&(adapt_count), status))        \
  74               break;                                            \
  75           }                                                     \
  76       }                                                         \
  77     else                                                        \
  78       (adapt_count)--; /* missing updates ok */                 \
  79     ret;                                                        \
  80   })
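
The distinction can be seen in miniature (a hypothetical sketch, not glibc
code, assuming <stdbool.h> and the GCC HTM builtins): with a function, the
condition is evaluated at the call site, before the transaction begins; with
a macro, it is expanded textually, so its loads execute inside the
transaction.

  /* Function: cond already holds a pre-transaction value.  */
  static inline bool
  elide_fn (int cond)
  {
    if (__builtin_tbegin (0))
      {
        if (cond) /* stale: the loads happened before the tbegin */
          return true;
        __builtin_tabort (0);
      }
    return false;
  }

  /* Macro: cond is expanded textually, so its loads execute inside the
     transaction and conflict with other threads' writes as intended.  */
  #define ELIDE_SKETCH(cond)                \
    ({                                      \
      bool ret = false;                     \
      if (__builtin_tbegin (0))             \
        {                                   \
          if (cond) /* evaluated in-txn */  \
            ret = true;                     \
          else                              \
            __builtin_tabort (0);           \
        }                                   \
      ret;                                  \
    })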
  
Adhemerval Zanella Netto Aug. 1, 2015, 12:23 p.m. UTC | #4
> On 31/07/2015, at 20:05, Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> wrote:
> 
> Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> 
>> Mainly because I see this issue as not powerpc-specific, and I think we should
>> also fix it for x86 (which currently uses the same mechanism) and for future
>> arches that might also implement lock elision this way.
> 
> AFAICS, this is powerpc-specific.  x86 ensures the memory access happens
> inside the transaction.
> 
> Here is the x86 implementation:
> 

Right, x86 evaluates is_lock_free as a macro argument.  Indeed, your initial approach seems to be better.  Thanks for catching it.  I would suggest adding a comment similar to the x86 one as well.

>  50 /* is_lock_free must be executed inside the transaction */
>  51 
>  52 /* Returns true if lock defined by IS_LOCK_FREE was elided.
>  53    ADAPT_COUNT is a pointer to per-lock state variable. */
>  54 
>  55 #define ELIDE_LOCK(adapt_count, is_lock_free)                   \
>  56   ({                                                            \
>  57     int ret = 0;                                                \
>  58                                                                 \
>  59     if ((adapt_count) <= 0)                                     \
>  60       {                                                         \
>  61         for (int i = __elision_aconf.retry_try_xbegin; i > 0; i--) \
>  62           {                                                     \
>  63             unsigned int status;                                \
>  64             if ((status = _xbegin ()) == _XBEGIN_STARTED)       \
>  65               {                                                 \
>  66                 if (is_lock_free)                               \
>  67                   {                                             \
>  68                     ret = 1;                                    \
>  69                     break;                                      \
>  70                   }                                             \
>  71                 _xabort (_ABORT_LOCK_BUSY);                     \
>  72               }                                                 \
>  73             if (!elision_adapt (&(adapt_count), status))        \
>  74               break;                                            \
>  75           }                                                     \
>  76       }                                                         \
>  77     else                                                        \
>  78       (adapt_count)--; /* missing updates ok */                 \
>  79     ret;                                                        \
>  80   })
> 
> -- 
> Tulio Magno
>
  
Torvald Riegel Aug. 3, 2015, 4:25 p.m. UTC | #5
On Sat, 2015-08-01 at 09:23 -0300, Adhemerval Zanella wrote:
> 
> 
> 
> > On 31/07/2015, at 20:05, Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> wrote:
> > 
> > Adhemerval Zanella <adhemerval.zanella@linaro.org> writes:
> > 
> >> Mainly because I see this issue as not powerpc-specific, and I think we should
> >> also fix it for x86 (which currently uses the same mechanism) and for future
> >> arches that might also implement lock elision this way.
> > 
> > AFAICS, this is powerpc-specific.  x86 ensures the memory access happens
> > inside the transaction.
> > 
> > Here is the x86 implementation:
> > 
> 
> Right, x86 evaluates is_lock_free as a macro argument.  Indeed, your initial approach seems to be better.  Thanks for catching it.  I would suggest adding a comment similar to the x86 one as well.
> 
> >  50 /* is_lock_free must be executed inside the transaction */
> >  51 

I think "evaluated" is better in this case.
  

Patch

diff --git a/nptl/pthread_rwlock_rdlock.c b/nptl/pthread_rwlock_rdlock.c
index eb7ac8d..3756bb6 100644
--- a/nptl/pthread_rwlock_rdlock.c
+++ b/nptl/pthread_rwlock_rdlock.c
@@ -126,9 +126,9 @@  __pthread_rwlock_rdlock (pthread_rwlock_t *rwlock)
   LIBC_PROBE (rdlock_entry, 1, rwlock);
 
   if (ELIDE_LOCK (rwlock->__data.__rwelision,
-		  rwlock->__data.__lock == 0
-		  && rwlock->__data.__writer == 0
-		  && rwlock->__data.__nr_readers == 0))
+		  rwlock->__data.__lock,
+		  rwlock->__data.__writer,
+		  rwlock->__data.__nr_readers))
     return 0;
 
   /* Make sure we are alone.  */
diff --git a/nptl/pthread_rwlock_tryrdlock.c b/nptl/pthread_rwlock_tryrdlock.c
index 256188a..bb199e4 100644
--- a/nptl/pthread_rwlock_tryrdlock.c
+++ b/nptl/pthread_rwlock_tryrdlock.c
@@ -33,9 +33,9 @@  __pthread_rwlock_tryrdlock (pthread_rwlock_t *rwlock)
       rwlock->__data.__shared == LLL_PRIVATE ? FUTEX_PRIVATE : FUTEX_SHARED;
 
   if (ELIDE_TRYLOCK (rwlock->__data.__rwelision,
-		     rwlock->__data.__lock == 0
-		     && rwlock->__data.__nr_readers == 0
-		     && rwlock->__data.__writer, 0))
+		     rwlock->__data.__lock,
+		     rwlock->__data.__writer,
+		     rwlock->__data.__nr_readers, 0))
     return 0;
 
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
diff --git a/nptl/pthread_rwlock_trywrlock.c b/nptl/pthread_rwlock_trywrlock.c
index 918a8f2..49c6757 100644
--- a/nptl/pthread_rwlock_trywrlock.c
+++ b/nptl/pthread_rwlock_trywrlock.c
@@ -28,9 +28,9 @@  __pthread_rwlock_trywrlock (pthread_rwlock_t *rwlock)
   int result = EBUSY;
 
   if (ELIDE_TRYLOCK (rwlock->__data.__rwelision,
-		     rwlock->__data.__lock == 0
-		     && rwlock->__data.__nr_readers == 0
-		     && rwlock->__data.__writer, 1))
+		     rwlock->__data.__lock,
+		     rwlock->__data.__writer,
+		     rwlock->__data.__nr_readers, 1))
     return 0;
 
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
diff --git a/nptl/pthread_rwlock_unlock.c b/nptl/pthread_rwlock_unlock.c
index bdd115d..0946d0c 100644
--- a/nptl/pthread_rwlock_unlock.c
+++ b/nptl/pthread_rwlock_unlock.c
@@ -35,8 +35,7 @@  __pthread_rwlock_unlock (pthread_rwlock_t *rwlock)
 
   LIBC_PROBE (rwlock_unlock, 1, rwlock);
 
-  if (ELIDE_UNLOCK (rwlock->__data.__writer == 0
-		    && rwlock->__data.__nr_readers == 0))
+  if (ELIDE_UNLOCK (rwlock->__data.__writer, rwlock->__data.__nr_readers))
     return 0;
 
   lll_lock (rwlock->__data.__lock, rwlock->__data.__shared);
diff --git a/nptl/pthread_rwlock_wrlock.c b/nptl/pthread_rwlock_wrlock.c
index 60fa909..0161876 100644
--- a/nptl/pthread_rwlock_wrlock.c
+++ b/nptl/pthread_rwlock_wrlock.c
@@ -98,9 +98,9 @@  __pthread_rwlock_wrlock (pthread_rwlock_t *rwlock)
   LIBC_PROBE (wrlock_entry, 1, rwlock);
 
   if (ELIDE_LOCK (rwlock->__data.__rwelision,
-		  rwlock->__data.__lock == 0
-		  && rwlock->__data.__writer == 0
-		  && rwlock->__data.__nr_readers == 0))
+		  rwlock->__data.__lock,
+		  rwlock->__data.__writer,
+		  rwlock->__data.__nr_readers))
     return 0;
 
   /* Make sure we are alone.  */
diff --git a/sysdeps/generic/elide.h b/sysdeps/generic/elide.h
index 80a3a03..64d9cce 100644
--- a/sysdeps/generic/elide.h
+++ b/sysdeps/generic/elide.h
@@ -15,11 +15,12 @@ 
    You should have received a copy of the GNU Lesser General Public
    License along with the GNU C Library; if not, see
    <http://www.gnu.org/licenses/>.  */
+
 #ifndef ELIDE_H
 #define ELIDE_H 1
 
-#define ELIDE_LOCK(adapt_count, is_lock_free) 0
-#define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) 0
-#define ELIDE_UNLOCK(is_lock_free) 0
+#define ELIDE_LOCK(adapt_count, lock, writer, readers) 0
+#define ELIDE_TRYLOCK(adapt_count, lock, writer, readers, write) 0
+#define ELIDE_UNLOCK(writer, readers) 0
 
 #endif
diff --git a/sysdeps/powerpc/nptl/elide.h b/sysdeps/powerpc/nptl/elide.h
index 389f5a5..b3cc50a 100644
--- a/sysdeps/powerpc/nptl/elide.h
+++ b/sysdeps/powerpc/nptl/elide.h
@@ -27,7 +27,7 @@ 
    ADAPT_COUNT is a pointer to per-lock state variable. */
 
 static inline bool
-__elide_lock (uint8_t *adapt_count, int is_lock_free)
+__elide_lock (uint8_t *adapt_count, int *lock, int *writer, unsigned int *readers)
 {
   if (*adapt_count > 0)
     {
@@ -39,7 +39,10 @@  __elide_lock (uint8_t *adapt_count, int is_lock_free)
     {
       if (__builtin_tbegin (0))
 	{
-	  if (is_lock_free)
+	  /* The compiler barrier is required because some GCC versions might
+	     reorder the lock read before the transaction init builtin.  */
+	  asm volatile("" ::: "memory");
+	  if ((*lock == 0) && (*writer == 0) && (*readers == 0))
 	    return true;
 	  /* Lock was busy.  */
 	  __builtin_tabort (_ABORT_LOCK_BUSY);
@@ -66,30 +69,31 @@  __elide_lock (uint8_t *adapt_count, int is_lock_free)
   return false;
 }
 
-# define ELIDE_LOCK(adapt_count, is_lock_free) \
-  __elide_lock (&(adapt_count), is_lock_free)
+# define ELIDE_LOCK(adapt_count, lock, writer, readers) \
+  __elide_lock (&(adapt_count), &(lock), &(writer), &(readers))
 
 
 static inline bool
-__elide_trylock (uint8_t *adapt_count, int is_lock_free, int write)
+__elide_trylock (uint8_t *adapt_count, int *lock, int *writer,
+                 unsigned int *readers, int write)
 {
   if (__elision_aconf.try_tbegin > 0)
     {
       if (write)
 	__builtin_tabort (_ABORT_NESTED_TRYLOCK);
-      return __elide_lock (adapt_count, is_lock_free);
+      return __elide_lock (adapt_count, lock, writer, readers);
     }
   return false;
 }
 
-# define ELIDE_TRYLOCK(adapt_count, is_lock_free, write)	\
-  __elide_trylock (&(adapt_count), is_lock_free, write)
+# define ELIDE_TRYLOCK(adapt_count, lock, writer, readers, write) \
+  __elide_trylock (&(adapt_count), &(lock), &(writer), &(readers), write)
 
 
 static inline bool
-__elide_unlock (int is_lock_free)
+__elide_unlock (int *writer, unsigned int *readers)
 {
-  if (is_lock_free)
+  if ((*writer == 0) && (*readers == 0))
     {
       __builtin_tend (0);
       return true;
@@ -97,14 +101,14 @@  __elide_unlock (int is_lock_free)
   return false;
 }
 
-# define ELIDE_UNLOCK(is_lock_free) \
-  __elide_unlock (is_lock_free)
+# define ELIDE_UNLOCK(writer, readers) \
+  __elide_unlock (&(writer), &(readers))
 
 # else
 
-# define ELIDE_LOCK(adapt_count, is_lock_free) 0
-# define ELIDE_TRYLOCK(adapt_count, is_lock_free, write) 0
-# define ELIDE_UNLOCK(is_lock_free) 0
+# define ELIDE_LOCK(adapt_count, lock, writer, readers) 0
+# define ELIDE_TRYLOCK(adapt_count, lock, writer, readers, write) 0
+# define ELIDE_UNLOCK(writer, readers) 0
 
 #endif /* ENABLE_LOCK_ELISION  */