shared data protection failed in pthread_cond_timedwait

Message ID 535A078F.3050003@huawei.com
State Not applicable
Headers

Commit Message

Yang Yingliang April 25, 2014, 6:58 a.m. UTC
  Hi,


I have 22 threads wait in pthread_cond_timedwait. When they are all woke up, I found 
there are more than one threads can access shared data in pthread_cond_timedwait.

I added print messages as follow code:



I tested on Linux arma15el 3.10.37+ #2 SMP Fri Apr 25 11:23:25 CST 2014 armv7l GNU/Linux.
Here is the result:

start do sub :45, lock:1 0xb6d9a460
end do sub :43, lock:1 0xb6d9a460
start do sub :43, lock:1 0xb6d9e460
end do sub :41, lock:2 0xb6d9e460
start do sub :43, lock:2 0xb6dbe460    //two threads both access the shared data
start do sub :41, lock:1 0xb6daa460
end do sub :39, lock:2 0xb6daa460
start do sub :39, lock:2 0xb6de6460
end do sub :37, lock:2 0xb6de6460
start do sub :37, lock:2 0xb6db6460
end do sub :35, lock:2 0xb6db6460
start do sub :35, lock:2 0xb6dc2460
end do sub :33, lock:2 0xb6dc2460
end do sub :37, lock:2 0xb6dbe460
start do sub :33, lock:2 0xb6dc6460
end do sub :31, lock:0 0xb6dc6460
start do sub :31, lock:2 0xb6dae460
end do sub :29, lock:2 0xb6dae460
start do sub :29, lock:2 0xb6db2460
end do sub :27, lock:2 0xb6db2460
start do sub :27, lock:2 0xb6dba460
end do sub :25, lock:2 0xb6dba460
start do sub :25, lock:2 0xb6da2460
end do sub :23, lock:2 0xb6da2460

Is lll_lock (cond->__data.__lock, pshared) failed?

pshared is LLL_SHARED.
  

Comments

Will Newton April 25, 2014, 9:43 a.m. UTC | #1
On 25 April 2014 07:58, Yang Yingliang <yangyingliang@huawei.com> wrote:
> Hi,
>
>
> I have 22 threads wait in pthread_cond_timedwait. When they are all woke up, I found
> there are more than one threads can access shared data in pthread_cond_timedwait.
>
> I added print messages as follow code:
>
> --- libc/nptl/pthread_cond_timedwait.c
> +++ libc/nptl/pthread_cond_timedwait.c
> @@ -34,6 +34,7 @@
>  #else
>  # include <bits/libc-vdso.h>
>  #endif
> +#include <stdio.h>
>
>  /* Cleanup handler, defined in pthread_cond_wait.c.  */
>  extern void __condvar_cleanup (void *arg)
> @@ -235,7 +239,9 @@
>
>   bc_out:
>
> +printf("start do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>    cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
> +printf("end do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>
>    /* If pthread_cond_destroy was called on this variable already,
>       notify the pthread_cond_destroy caller all waiters have left
>
>
> I tested on Linux arma15el 3.10.37+ #2 SMP Fri Apr 25 11:23:25 CST 2014 armv7l GNU/Linux.
> Here is the result:
>
> start do sub :45, lock:1 0xb6d9a460
> end do sub :43, lock:1 0xb6d9a460
> start do sub :43, lock:1 0xb6d9e460
> end do sub :41, lock:2 0xb6d9e460
> start do sub :43, lock:2 0xb6dbe460    //two threads both access the shared data
> start do sub :41, lock:1 0xb6daa460
> end do sub :39, lock:2 0xb6daa460
> start do sub :39, lock:2 0xb6de6460
> end do sub :37, lock:2 0xb6de6460
> start do sub :37, lock:2 0xb6db6460
> end do sub :35, lock:2 0xb6db6460
> start do sub :35, lock:2 0xb6dc2460
> end do sub :33, lock:2 0xb6dc2460
> end do sub :37, lock:2 0xb6dbe460
> start do sub :33, lock:2 0xb6dc6460
> end do sub :31, lock:0 0xb6dc6460
> start do sub :31, lock:2 0xb6dae460
> end do sub :29, lock:2 0xb6dae460
> start do sub :29, lock:2 0xb6db2460
> end do sub :27, lock:2 0xb6db2460
> start do sub :27, lock:2 0xb6dba460
> end do sub :25, lock:2 0xb6dba460
> start do sub :25, lock:2 0xb6da2460
> end do sub :23, lock:2 0xb6da2460
>
> Is lll_lock (cond->__data.__lock, pshared) failed?
>
> pshared is LLL_SHARED.

I have had a quick look at this and there is no obvious reason I can
see for this behaviour, unless there is some way that IO buffering
could cause the messages to be strangely interleaved. The other
alternative that may be worth investigating is whether or not
ldrex/strex is working correctly in your SMP system.
  
Yang Yingliang April 26, 2014, 6:45 a.m. UTC | #2
On 2014/4/25 17:43, Will Newton wrote:
> On 25 April 2014 07:58, Yang Yingliang <yangyingliang@huawei.com> wrote:
>> Hi,
>>
>>
>> I have 22 threads wait in pthread_cond_timedwait. When they are all woke up, I found
>> there are more than one threads can access shared data in pthread_cond_timedwait.
>>
>> I added print messages as follow code:
>>
>> --- libc/nptl/pthread_cond_timedwait.c
>> +++ libc/nptl/pthread_cond_timedwait.c
>> @@ -34,6 +34,7 @@
>>  #else
>>  # include <bits/libc-vdso.h>
>>  #endif
>> +#include <stdio.h>
>>
>>  /* Cleanup handler, defined in pthread_cond_wait.c.  */
>>  extern void __condvar_cleanup (void *arg)
>> @@ -235,7 +239,9 @@
>>
>>   bc_out:
>>
>> +printf("start do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>>    cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
>> +printf("end do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>>
>>    /* If pthread_cond_destroy was called on this variable already,
>>       notify the pthread_cond_destroy caller all waiters have left
>>
>>
>> I tested on Linux arma15el 3.10.37+ #2 SMP Fri Apr 25 11:23:25 CST 2014 armv7l GNU/Linux.
>> Here is the result:
>>
>> start do sub :45, lock:1 0xb6d9a460
>> end do sub :43, lock:1 0xb6d9a460
>> start do sub :43, lock:1 0xb6d9e460
>> end do sub :41, lock:2 0xb6d9e460
>> start do sub :43, lock:2 0xb6dbe460    //two threads both access the shared data
>> start do sub :41, lock:1 0xb6daa460
>> end do sub :39, lock:2 0xb6daa460
>> start do sub :39, lock:2 0xb6de6460
>> end do sub :37, lock:2 0xb6de6460
>> start do sub :37, lock:2 0xb6db6460
>> end do sub :35, lock:2 0xb6db6460
>> start do sub :35, lock:2 0xb6dc2460
>> end do sub :33, lock:2 0xb6dc2460
>> end do sub :37, lock:2 0xb6dbe460
>> start do sub :33, lock:2 0xb6dc6460
>> end do sub :31, lock:0 0xb6dc6460
>> start do sub :31, lock:2 0xb6dae460
>> end do sub :29, lock:2 0xb6dae460
>> start do sub :29, lock:2 0xb6db2460
>> end do sub :27, lock:2 0xb6db2460
>> start do sub :27, lock:2 0xb6dba460
>> end do sub :25, lock:2 0xb6dba460
>> start do sub :25, lock:2 0xb6da2460
>> end do sub :23, lock:2 0xb6da2460
>>
>> Is lll_lock (cond->__data.__lock, pshared) failed?
>>
>> pshared is LLL_SHARED.
> 
> I have had a quick look at this and there is no obvious reason I can
> see for this behaviour, unless there is some way that IO buffering
> could cause the messages to be strangely interleaved. The other
> alternative that may be worth investigating is whether or not
> ldrex/strex is working correctly in your SMP system.
> 

After doing some investigation, it looks like atomic_compare_and_exchange_val_acq
not doing atomic. So two threads can both acquire lock when futex is 0. Is there
something wrong in atomic_compare_and_exchange_val_acq ?

#define __lll_lock(futex, private)					      \
  ((void) ({								      \
    int *__futex = (futex);						      \
    if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex,       \
								1, 0), 0))    \
      {									      \
	if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)	      \
	  __lll_lock_wait_private (__futex);				      \
	else								      \
	  __lll_lock_wait (__futex, private);				      \
      }									      \
  }))
  
Yang Yingliang April 26, 2014, 9:48 a.m. UTC | #3
On 2014/4/26 14:45, Yang Yingliang wrote:
> On 2014/4/25 17:43, Will Newton wrote:
>> On 25 April 2014 07:58, Yang Yingliang <yangyingliang@huawei.com> wrote:
>>> Hi,
>>>
>>>
>>> I have 22 threads wait in pthread_cond_timedwait. When they are all woke up, I found
>>> there are more than one threads can access shared data in pthread_cond_timedwait.
>>>
>>> I added print messages as follow code:
>>>
>>> --- libc/nptl/pthread_cond_timedwait.c
>>> +++ libc/nptl/pthread_cond_timedwait.c
>>> @@ -34,6 +34,7 @@
>>>  #else
>>>  # include <bits/libc-vdso.h>
>>>  #endif
>>> +#include <stdio.h>
>>>
>>>  /* Cleanup handler, defined in pthread_cond_wait.c.  */
>>>  extern void __condvar_cleanup (void *arg)
>>> @@ -235,7 +239,9 @@
>>>
>>>   bc_out:
>>>
>>> +printf("start do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>>>    cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
>>> +printf("end do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>>>
>>>    /* If pthread_cond_destroy was called on this variable already,
>>>       notify the pthread_cond_destroy caller all waiters have left
>>>
>>>
>>> I tested on Linux arma15el 3.10.37+ #2 SMP Fri Apr 25 11:23:25 CST 2014 armv7l GNU/Linux.
>>> Here is the result:
>>>
>>> start do sub :45, lock:1 0xb6d9a460
>>> end do sub :43, lock:1 0xb6d9a460
>>> start do sub :43, lock:1 0xb6d9e460
>>> end do sub :41, lock:2 0xb6d9e460
>>> start do sub :43, lock:2 0xb6dbe460    //two threads both access the shared data
>>> start do sub :41, lock:1 0xb6daa460
>>> end do sub :39, lock:2 0xb6daa460
>>> start do sub :39, lock:2 0xb6de6460
>>> end do sub :37, lock:2 0xb6de6460
>>> start do sub :37, lock:2 0xb6db6460
>>> end do sub :35, lock:2 0xb6db6460
>>> start do sub :35, lock:2 0xb6dc2460
>>> end do sub :33, lock:2 0xb6dc2460
>>> end do sub :37, lock:2 0xb6dbe460
>>> start do sub :33, lock:2 0xb6dc6460
>>> end do sub :31, lock:0 0xb6dc6460
>>> start do sub :31, lock:2 0xb6dae460
>>> end do sub :29, lock:2 0xb6dae460
>>> start do sub :29, lock:2 0xb6db2460
>>> end do sub :27, lock:2 0xb6db2460
>>> start do sub :27, lock:2 0xb6dba460
>>> end do sub :25, lock:2 0xb6dba460
>>> start do sub :25, lock:2 0xb6da2460
>>> end do sub :23, lock:2 0xb6da2460
>>>
>>> Is lll_lock (cond->__data.__lock, pshared) failed?
>>>
>>> pshared is LLL_SHARED.
>>
>> I have had a quick look at this and there is no obvious reason I can
>> see for this behaviour, unless there is some way that IO buffering
>> could cause the messages to be strangely interleaved. The other
>> alternative that may be worth investigating is whether or not
>> ldrex/strex is working correctly in your SMP system.
>>
> 
> After doing some investigation, it looks like atomic_compare_and_exchange_val_acq
> not doing atomic. So two threads can both acquire lock when futex is 0. Is there
> something wrong in atomic_compare_and_exchange_val_acq ?
> 
> #define __lll_lock(futex, private)					      \
>   ((void) ({								      \
>     int *__futex = (futex);						      \
>     if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex,       \
> 								1, 0), 0))    \
>       {									      \
> 	if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)	      \
> 	  __lll_lock_wait_private (__futex);				      \
> 	else								      \
> 	  __lll_lock_wait (__futex, private);				      \
>       }									      \
>   }))
> 

I noticed that atomic_compare_and_exchange_val_acq are not atomic for ARM in glibc-2.18.
I will try glibc-2.19.
  
Will Newton April 28, 2014, 7:55 a.m. UTC | #4
On 26 April 2014 10:48, Yang Yingliang <yangyingliang@huawei.com> wrote:
> On 2014/4/26 14:45, Yang Yingliang wrote:
>> On 2014/4/25 17:43, Will Newton wrote:
>>> On 25 April 2014 07:58, Yang Yingliang <yangyingliang@huawei.com> wrote:
>>>> Hi,
>>>>
>>>>
>>>> I have 22 threads wait in pthread_cond_timedwait. When they are all woke up, I found
>>>> there are more than one threads can access shared data in pthread_cond_timedwait.
>>>>
>>>> I added print messages as follow code:
>>>>
>>>> --- libc/nptl/pthread_cond_timedwait.c
>>>> +++ libc/nptl/pthread_cond_timedwait.c
>>>> @@ -34,6 +34,7 @@
>>>>  #else
>>>>  # include <bits/libc-vdso.h>
>>>>  #endif
>>>> +#include <stdio.h>
>>>>
>>>>  /* Cleanup handler, defined in pthread_cond_wait.c.  */
>>>>  extern void __condvar_cleanup (void *arg)
>>>> @@ -235,7 +239,9 @@
>>>>
>>>>   bc_out:
>>>>
>>>> +printf("start do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>>>>    cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
>>>> +printf("end do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
>>>>
>>>>    /* If pthread_cond_destroy was called on this variable already,
>>>>       notify the pthread_cond_destroy caller all waiters have left
>>>>
>>>>
>>>> I tested on Linux arma15el 3.10.37+ #2 SMP Fri Apr 25 11:23:25 CST 2014 armv7l GNU/Linux.
>>>> Here is the result:
>>>>
>>>> start do sub :45, lock:1 0xb6d9a460
>>>> end do sub :43, lock:1 0xb6d9a460
>>>> start do sub :43, lock:1 0xb6d9e460
>>>> end do sub :41, lock:2 0xb6d9e460
>>>> start do sub :43, lock:2 0xb6dbe460    //two threads both access the shared data
>>>> start do sub :41, lock:1 0xb6daa460
>>>> end do sub :39, lock:2 0xb6daa460
>>>> start do sub :39, lock:2 0xb6de6460
>>>> end do sub :37, lock:2 0xb6de6460
>>>> start do sub :37, lock:2 0xb6db6460
>>>> end do sub :35, lock:2 0xb6db6460
>>>> start do sub :35, lock:2 0xb6dc2460
>>>> end do sub :33, lock:2 0xb6dc2460
>>>> end do sub :37, lock:2 0xb6dbe460
>>>> start do sub :33, lock:2 0xb6dc6460
>>>> end do sub :31, lock:0 0xb6dc6460
>>>> start do sub :31, lock:2 0xb6dae460
>>>> end do sub :29, lock:2 0xb6dae460
>>>> start do sub :29, lock:2 0xb6db2460
>>>> end do sub :27, lock:2 0xb6db2460
>>>> start do sub :27, lock:2 0xb6dba460
>>>> end do sub :25, lock:2 0xb6dba460
>>>> start do sub :25, lock:2 0xb6da2460
>>>> end do sub :23, lock:2 0xb6da2460
>>>>
>>>> Is lll_lock (cond->__data.__lock, pshared) failed?
>>>>
>>>> pshared is LLL_SHARED.
>>>
>>> I have had a quick look at this and there is no obvious reason I can
>>> see for this behaviour, unless there is some way that IO buffering
>>> could cause the messages to be strangely interleaved. The other
>>> alternative that may be worth investigating is whether or not
>>> ldrex/strex is working correctly in your SMP system.
>>>
>>
>> After doing some investigation, it looks like atomic_compare_and_exchange_val_acq
>> not doing atomic. So two threads can both acquire lock when futex is 0. Is there
>> something wrong in atomic_compare_and_exchange_val_acq ?
>>
>> #define __lll_lock(futex, private)                                          \
>>   ((void) ({                                                                \
>>     int *__futex = (futex);                                                 \
>>     if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex,       \
>>                                                               1, 0), 0))    \
>>       {                                                                             \
>>       if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)       \
>>         __lll_lock_wait_private (__futex);                                  \
>>       else                                                                  \
>>         __lll_lock_wait (__futex, private);                                 \
>>       }                                                                             \
>>   }))
>>
>
> I noticed that atomic_compare_and_exchange_val_acq are not atomic for ARM in glibc-2.18.
> I will try glibc-2.19.

Which version of gcc are you using? Does the disassembly of the code
look reasonably correct i.e. ldrex/strex?
  
Torvald Riegel April 28, 2014, 6:46 p.m. UTC | #5
On Sat, 2014-04-26 at 17:48 +0800, Yang Yingliang wrote:
> On 2014/4/26 14:45, Yang Yingliang wrote:
> > On 2014/4/25 17:43, Will Newton wrote:
> >> On 25 April 2014 07:58, Yang Yingliang <yangyingliang@huawei.com> wrote:
> >>> Hi,
> >>>
> >>>
> >>> I have 22 threads wait in pthread_cond_timedwait. When they are all woke up, I found
> >>> there are more than one threads can access shared data in pthread_cond_timedwait.
> >>>
> >>> I added print messages as follow code:
> >>>
> >>> --- libc/nptl/pthread_cond_timedwait.c
> >>> +++ libc/nptl/pthread_cond_timedwait.c
> >>> @@ -34,6 +34,7 @@
> >>>  #else
> >>>  # include <bits/libc-vdso.h>
> >>>  #endif
> >>> +#include <stdio.h>
> >>>
> >>>  /* Cleanup handler, defined in pthread_cond_wait.c.  */
> >>>  extern void __condvar_cleanup (void *arg)
> >>> @@ -235,7 +239,9 @@
> >>>
> >>>   bc_out:
> >>>
> >>> +printf("start do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
> >>>    cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
> >>> +printf("end do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
> >>>
> >>>    /* If pthread_cond_destroy was called on this variable already,
> >>>       notify the pthread_cond_destroy caller all waiters have left
> >>>
> >>>
> >>> I tested on Linux arma15el 3.10.37+ #2 SMP Fri Apr 25 11:23:25 CST 2014 armv7l GNU/Linux.
> >>> Here is the result:
> >>>
> >>> start do sub :45, lock:1 0xb6d9a460
> >>> end do sub :43, lock:1 0xb6d9a460
> >>> start do sub :43, lock:1 0xb6d9e460
> >>> end do sub :41, lock:2 0xb6d9e460
> >>> start do sub :43, lock:2 0xb6dbe460    //two threads both access the shared data
> >>> start do sub :41, lock:1 0xb6daa460
> >>> end do sub :39, lock:2 0xb6daa460
> >>> start do sub :39, lock:2 0xb6de6460
> >>> end do sub :37, lock:2 0xb6de6460
> >>> start do sub :37, lock:2 0xb6db6460
> >>> end do sub :35, lock:2 0xb6db6460
> >>> start do sub :35, lock:2 0xb6dc2460
> >>> end do sub :33, lock:2 0xb6dc2460
> >>> end do sub :37, lock:2 0xb6dbe460
> >>> start do sub :33, lock:2 0xb6dc6460
> >>> end do sub :31, lock:0 0xb6dc6460
> >>> start do sub :31, lock:2 0xb6dae460
> >>> end do sub :29, lock:2 0xb6dae460
> >>> start do sub :29, lock:2 0xb6db2460
> >>> end do sub :27, lock:2 0xb6db2460
> >>> start do sub :27, lock:2 0xb6dba460
> >>> end do sub :25, lock:2 0xb6dba460
> >>> start do sub :25, lock:2 0xb6da2460
> >>> end do sub :23, lock:2 0xb6da2460
> >>>
> >>> Is lll_lock (cond->__data.__lock, pshared) failed?
> >>>
> >>> pshared is LLL_SHARED.
> >>
> >> I have had a quick look at this and there is no obvious reason I can
> >> see for this behaviour, unless there is some way that IO buffering
> >> could cause the messages to be strangely interleaved. The other
> >> alternative that may be worth investigating is whether or not
> >> ldrex/strex is working correctly in your SMP system.
> >>
> > 
> > After doing some investigation, it looks like atomic_compare_and_exchange_val_acq
> > not doing atomic. So two threads can both acquire lock when futex is 0. Is there
> > something wrong in atomic_compare_and_exchange_val_acq ?
> > 
> > #define __lll_lock(futex, private)					      \
> >   ((void) ({								      \
> >     int *__futex = (futex);						      \
> >     if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex,       \
> > 								1, 0), 0))    \
> >       {									      \
> > 	if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)	      \
> > 	  __lll_lock_wait_private (__futex);				      \
> > 	else								      \
> > 	  __lll_lock_wait (__futex, private);				      \
> >       }									      \
> >   }))
> > 
> 
> I noticed that atomic_compare_and_exchange_val_acq are not atomic for ARM in glibc-2.18.
> I will try glibc-2.19.

Could you share the full test case for this, please?  Did you just test
whether some mutex can be acquired several times?
  

Patch

--- libc/nptl/pthread_cond_timedwait.c
+++ libc/nptl/pthread_cond_timedwait.c
@@ -34,6 +34,7 @@ 
 #else
 # include <bits/libc-vdso.h>
 #endif
+#include <stdio.h>

 /* Cleanup handler, defined in pthread_cond_wait.c.  */
 extern void __condvar_cleanup (void *arg)
@@ -235,7 +239,9 @@ 

  bc_out:

+printf("start do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());
   cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
+printf("end do sub :%d, lock:%d %p\n", cond->__data.__nwaiters, cond->__data.__lock, pthread_self());

   /* If pthread_cond_destroy was called on this variable already,
      notify the pthread_cond_destroy caller all waiters have left