Possible fix for bug #13165

Message ID 53F26769.20801@linux.vnet.ibm.com
State Dropped
Delegated to: Adhemerval Zanella Netto
Headers

Commit Message

Adhemerval Zanella Netto Aug. 18, 2014, 8:51 p.m. UTC
  On 18-08-2014 16:53, Rich Felker wrote:
> A couple days ago I posted ideas for a fix for this issue on the bug
> tracker:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=13165#c38
>
> Anybody who does glibc development/builds/testing up for trying my
> idea and seeing if it works?
>
> Rich
>
If I understood correctly, you are proposing something like:


correct? I saw not NPTL issues in nether powerpc64 or x86_64.
  

Comments

Rich Felker Aug. 18, 2014, 9:27 p.m. UTC | #1
On Mon, Aug 18, 2014 at 05:51:53PM -0300, Adhemerval Zanella wrote:
> On 18-08-2014 16:53, Rich Felker wrote:
> > A couple days ago I posted ideas for a fix for this issue on the bug
> > tracker:
> >
> > https://sourceware.org/bugzilla/show_bug.cgi?id=13165#c38
> >
> > Anybody who does glibc development/builds/testing up for trying my
> > idea and seeing if it works?
> >
> > Rich
> >
> If I understood correctly, you are proposing something like:
> 
> diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
> index fc5eac4..a16c5d5 100644
> --- a/nptl/pthread_cond_wait.c
> +++ b/nptl/pthread_cond_wait.c
> @@ -118,14 +118,6 @@ __pthread_cond_wait (cond, mutex)
>    /* Make sure we are alone.  */
>    lll_lock (cond->__data.__lock, pshared);
>  
> -  /* Now we can release the mutex.  */
> -  err = __pthread_mutex_unlock_usercnt (mutex, 0);
> -  if (__glibc_unlikely (err))
> -    {
> -      lll_unlock (cond->__data.__lock, pshared);
> -      return err;
> -    }
> -
>    /* We have one new user of the condvar.  */
>    ++cond->__data.__total_seq;
>    ++cond->__data.__futex;
> @@ -153,6 +145,14 @@ __pthread_cond_wait (cond, mutex)
>    /* Remember the broadcast counter.  */
>    cbuffer.bc_seq = cond->__data.__broadcast_seq;
>  
> +  /* Now we can release the mutex.  */
> +  err = __pthread_mutex_unlock_usercnt (mutex, 0);
> +  if (__glibc_unlikely (err))
> +    {
> +      lll_unlock (cond->__data.__lock, pshared);
> +      return err;
> +    }
> +
>    do
>      {
>        unsigned int futex_val = cond->__data.__futex;
> 
> correct?

Yes, that looks like what I had in mind. But on second thought I'm not
sure it should be necessary; I missed the internal lock that's being
taken before the mutex is unlocked.

> I saw not NPTL issues in nether powerpc64 or x86_64.

You mean you can't reproduce the issue on these targets? Maybe it's
only present on targets that are using non-default versions of the
code. If so, and if the offending target-specific versions have been
removed, maybe the bug is fixed now? I haven't had a chance to try the
test case on latest glibc, but the bug was present on x86 (32-bit)
last time I checked.

It would be nice if this bug has already been fixed without taking any
specific action to do so...

Rich
  
Torvald Riegel Aug. 18, 2014, 10:18 p.m. UTC | #2
On Mon, 2014-08-18 at 17:27 -0400, Rich Felker wrote:
> On Mon, Aug 18, 2014 at 05:51:53PM -0300, Adhemerval Zanella wrote:
> > On 18-08-2014 16:53, Rich Felker wrote:
> > > A couple days ago I posted ideas for a fix for this issue on the bug
> > > tracker:
> > >
> > > https://sourceware.org/bugzilla/show_bug.cgi?id=13165#c38
> > >
> > > Anybody who does glibc development/builds/testing up for trying my
> > > idea and seeing if it works?
> > >
> > > Rich
> > >
> > If I understood correctly, you are proposing something like:
> > 
> > diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
> > index fc5eac4..a16c5d5 100644
> > --- a/nptl/pthread_cond_wait.c
> > +++ b/nptl/pthread_cond_wait.c
> > @@ -118,14 +118,6 @@ __pthread_cond_wait (cond, mutex)
> >    /* Make sure we are alone.  */
> >    lll_lock (cond->__data.__lock, pshared);
> >  
> > -  /* Now we can release the mutex.  */
> > -  err = __pthread_mutex_unlock_usercnt (mutex, 0);
> > -  if (__glibc_unlikely (err))
> > -    {
> > -      lll_unlock (cond->__data.__lock, pshared);
> > -      return err;
> > -    }
> > -
> >    /* We have one new user of the condvar.  */
> >    ++cond->__data.__total_seq;
> >    ++cond->__data.__futex;
> > @@ -153,6 +145,14 @@ __pthread_cond_wait (cond, mutex)
> >    /* Remember the broadcast counter.  */
> >    cbuffer.bc_seq = cond->__data.__broadcast_seq;
> >  
> > +  /* Now we can release the mutex.  */
> > +  err = __pthread_mutex_unlock_usercnt (mutex, 0);
> > +  if (__glibc_unlikely (err))
> > +    {
> > +      lll_unlock (cond->__data.__lock, pshared);
> > +      return err;
> > +    }
> > +
> >    do
> >      {
> >        unsigned int futex_val = cond->__data.__futex;
> > 
> > correct?
> 
> Yes, that looks like what I had in mind. But on second thought I'm not
> sure it should be necessary; I missed the internal lock that's being
> taken before the mutex is unlocked.
> 
> > I saw not NPTL issues in nether powerpc64 or x86_64.
> 
> You mean you can't reproduce the issue on these targets? Maybe it's
> only present on targets that are using non-default versions of the
> code. If so, and if the offending target-specific versions have been
> removed, maybe the bug is fixed now? I haven't had a chance to try the
> test case on latest glibc, but the bug was present on x86 (32-bit)
> last time I checked.
> 
> It would be nice if this bug has already been fixed without taking any
> specific action to do so...

The bug is not fixed.  It is an algorithmic issue -- though of course it
may or may not trigger in a particular execution depending on the
respective interleavings.
  
Adhemerval Zanella Netto Aug. 19, 2014, 12:20 a.m. UTC | #3
On 18-08-2014 19:18, Torvald Riegel wrote:
> On Mon, 2014-08-18 at 17:27 -0400, Rich Felker wrote:
>> On Mon, Aug 18, 2014 at 05:51:53PM -0300, Adhemerval Zanella wrote:
>>> On 18-08-2014 16:53, Rich Felker wrote:
>>>> A couple days ago I posted ideas for a fix for this issue on the bug
>>>> tracker:
>>>>
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=13165#c38
>>>>
>>>> Anybody who does glibc development/builds/testing up for trying my
>>>> idea and seeing if it works?
>>>>
>>>> Rich
>>>>
>>> If I understood correctly, you are proposing something like:
>>>
>>> diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
>>> index fc5eac4..a16c5d5 100644
>>> --- a/nptl/pthread_cond_wait.c
>>> +++ b/nptl/pthread_cond_wait.c
>>> @@ -118,14 +118,6 @@ __pthread_cond_wait (cond, mutex)
>>>    /* Make sure we are alone.  */
>>>    lll_lock (cond->__data.__lock, pshared);
>>>  
>>> -  /* Now we can release the mutex.  */
>>> -  err = __pthread_mutex_unlock_usercnt (mutex, 0);
>>> -  if (__glibc_unlikely (err))
>>> -    {
>>> -      lll_unlock (cond->__data.__lock, pshared);
>>> -      return err;
>>> -    }
>>> -
>>>    /* We have one new user of the condvar.  */
>>>    ++cond->__data.__total_seq;
>>>    ++cond->__data.__futex;
>>> @@ -153,6 +145,14 @@ __pthread_cond_wait (cond, mutex)
>>>    /* Remember the broadcast counter.  */
>>>    cbuffer.bc_seq = cond->__data.__broadcast_seq;
>>>  
>>> +  /* Now we can release the mutex.  */
>>> +  err = __pthread_mutex_unlock_usercnt (mutex, 0);
>>> +  if (__glibc_unlikely (err))
>>> +    {
>>> +      lll_unlock (cond->__data.__lock, pshared);
>>> +      return err;
>>> +    }
>>> +
>>>    do
>>>      {
>>>        unsigned int futex_val = cond->__data.__futex;
>>>
>>> correct?
>> Yes, that looks like what I had in mind. But on second thought I'm not
>> sure it should be necessary; I missed the internal lock that's being
>> taken before the mutex is unlocked.
>>
>>> I saw not NPTL issues in nether powerpc64 or x86_64.
>> You mean you can't reproduce the issue on these targets? Maybe it's
>> only present on targets that are using non-default versions of the
>> code. If so, and if the offending target-specific versions have been
>> removed, maybe the bug is fixed now? I haven't had a chance to try the
>> test case on latest glibc, but the bug was present on x86 (32-bit)
>> last time I checked.
>>
>> It would be nice if this bug has already been fixed without taking any
>> specific action to do so...
> The bug is not fixed.  It is an algorithmic issue -- though of course it
> may or may not trigger in a particular execution depending on the
> respective interleavings.
>
I just noted I wasn't specific, by 'not issues' I meant NPTL testcase ran fine, 
however the testcase from bug stills fails.
  

Patch

diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
index fc5eac4..a16c5d5 100644
--- a/nptl/pthread_cond_wait.c
+++ b/nptl/pthread_cond_wait.c
@@ -118,14 +118,6 @@  __pthread_cond_wait (cond, mutex)
   /* Make sure we are alone.  */
   lll_lock (cond->__data.__lock, pshared);
 
-  /* Now we can release the mutex.  */
-  err = __pthread_mutex_unlock_usercnt (mutex, 0);
-  if (__glibc_unlikely (err))
-    {
-      lll_unlock (cond->__data.__lock, pshared);
-      return err;
-    }
-
   /* We have one new user of the condvar.  */
   ++cond->__data.__total_seq;
   ++cond->__data.__futex;
@@ -153,6 +145,14 @@  __pthread_cond_wait (cond, mutex)
   /* Remember the broadcast counter.  */
   cbuffer.bc_seq = cond->__data.__broadcast_seq;
 
+  /* Now we can release the mutex.  */
+  err = __pthread_mutex_unlock_usercnt (mutex, 0);
+  if (__glibc_unlikely (err))
+    {
+      lll_unlock (cond->__data.__lock, pshared);
+      return err;
+    }
+
   do
     {
       unsigned int futex_val = cond->__data.__futex;