nptl: Fix join/tryjoin comments.

Message ID 448cd43d-8f24-7d02-e6dc-b6fa8e0dc44e@redhat.com
State Superseded
Headers

Commit Message

Carlos O'Donell Feb. 11, 2019, 9:11 p.m. UTC
  Florian,

I noticed these comment mistakes while reviewing your patch for BZ #24211.

OK for master?

~~~
This commit fixes several inaccurate comments in the implementation of
__pthread_timedjoin_ex(), pthread_tryjoin_np, and lll_wait_tid, all of
which are used to implement POSIX thread joining.

Firstly, in pthread_tryjoin_np() we only attempt the join if a read of
pd->tid == 0, because that means the thread has already been reaped by
the kernel and we can safely join it without blocking.

Secondly, all join implementations call the common
__pthread_timedjoin_ex and only if abstime is NULL (block) do we
actually need to use cancellation (to cancel the potentially infinite
wait).

Lastly, in lll_wait_tid we only call the canellation futex operations if
abstime is non-NULL.

With the comments corrected the implementation notes are consistent and
logical.

Signed-off-by: Carlos O'Donell <carlos@redhat.com>
---
 ChangeLog                   | 7 +++++++
 nptl/lll_timedwait_tid.c    | 3 +--
 nptl/pthread_join_common.c  | 4 ++--
 nptl/pthread_tryjoin.c      | 2 +-
 sysdeps/nptl/lowlevellock.h | 2 +-
 5 files changed, 12 insertions(+), 6 deletions(-)
  

Comments

Florian Weimer Feb. 11, 2019, 9:37 p.m. UTC | #1
* Carlos O'Donell:

> This commit fixes several inaccurate comments in the implementation of
> __pthread_timedjoin_ex(), pthread_tryjoin_np, and lll_wait_tid, all of
> which are used to implement POSIX thread joining.

GNU style is not to use () after function names.

> Firstly, in pthread_tryjoin_np() we only attempt the join if a read of
> pd->tid == 0, because that means the thread has already been reaped by
> the kernel and we can safely join it without blocking.
>
> Secondly, all join implementations call the common
> __pthread_timedjoin_ex and only if abstime is NULL (block) do we
> actually need to use cancellation (to cancel the potentially infinite
> wait).

I think the change regarding abstime == NULL is wrong.  This is about
pthread_join/pthread_timedjoin_np (blocking) on the one hand and
pthread_tryjoin_np on the other (not blocking).

Thanks,
Florian
  
Carlos O'Donell Feb. 12, 2019, 1:18 a.m. UTC | #2
On 2/11/19 4:37 PM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> This commit fixes several inaccurate comments in the implementation of
>> __pthread_timedjoin_ex(), pthread_tryjoin_np, and lll_wait_tid, all of
>> which are used to implement POSIX thread joining.
> 
> GNU style is not to use () after function names.

Commit message fixed (by removing ()).

>> Firstly, in pthread_tryjoin_np() we only attempt the join if a read of
>> pd->tid == 0, because that means the thread has already been reaped by
>> the kernel and we can safely join it without blocking.
>>
>> Secondly, all join implementations call the common
>> __pthread_timedjoin_ex and only if abstime is NULL (block) do we
>> actually need to use cancellation (to cancel the potentially infinite
>> wait).
> 
> I think the change regarding abstime == NULL is wrong.  This is about
> pthread_join/pthread_timedjoin_np (blocking) on the one hand and
> pthread_tryjoin_np on the other (not blocking).

Why is the comment wrong? Or do you mean to say the comment could be
made more accurate?

The same core __pthread_timedjoin_ex is used for everything, so it handles
3 cases:

* join
* tryjoin
* timedjoin

In a join with abstime NULL we switch to asynchronous cancellation, and
this is done behind the scenes by lll_wait_tid calling lll_futex_wait_cancel
when abstime is NULL.

In a tryjoin we only call __pthread_timedjoin_ex when __tid is 0, and we
optimize away the lll_wait_tid by setting block to true. It doesn't change
the fact that if we had called lll_wait_tid, we would not have blocked
anyway, we would have stopped at the first atomic_load_acquire and never
waited.

Only in timedjoin, with abstime non-NULL, do we ever reach the call to
lll_futex_wait_cancel. It's the only time we switch to asychronous ancellation
because it's what we do around the *_cancel variants (until we have a better
method for doing cancellation).

Given this information:

(a) For join, block == true, and abstime == NULL, so we do use async cancel
    for the futex wait. And the old comment was fine, but the new comment
    is clearer. The old comment also was for when we *unconditionally*
    called CANCEL_ASYNC() (which means "turn on async cancel"), but this
    code was removed in 2018 by ce7eb0e9031, making this not something we
    do unconditionally anymore.

(b) For tryjoin, block == false, and abstime == NULL, in theory the comment
    in pthread_tryjoin should be deleted because block == false will mean
    we *never* call lll_wait_tid...

 29   /* If pd->tid != 0 then lll_wait_tid will not block on futex
 30      operation.  */

      ^^^^ With the block == false optimization we never call lll_wait_tid!
      But the comment is wrong, probably just a typo in 2017 by
      4735850f7a4.

 31   return __pthread_timedjoin_ex (threadid, thread_return, NULL, false);

    but I like the comment because it holds regardless of the optimization.

(c) For timedjoin, block == true, and abstime != NULL, and the comment is
    wrong, there is no async cancel involved. We call lll_wait_tid and
    the abstime != NULL case calls __lll_timedwait_tid which is not a
    cancellation point. No cancellation for timedjoin.

So in (c) we show that it's not about block vs non-block, it's about
making only pthread_join a cancellation point.

Is this a bug? Should __lll_timedwait_tid for very long timeouts have used
cancellation? No. Since the timedjoin is a GNU addition it is ours to define
as a cancellation point or not, and we choose not to. While join is mandated
by POSIX to be a cancellation point and it is (the only one of the joins to
be so).
  
Florian Weimer Feb. 12, 2019, 9:55 a.m. UTC | #3
* Carlos O'Donell:

> Why is the comment wrong? Or do you mean to say the comment could be
> made more accurate?

Let's go back a step.  My understanding is that pthread_join and
pthread_timedjoin_np are both cancellation points, but
pthread_tryjoin_np is not.  Are you saying that pthread_timedjoin_np is
not a cancellation point?

I assume that would be a bug.

Thanks,
Florian
  
Adhemerval Zanella Feb. 12, 2019, 11:39 a.m. UTC | #4
On 12/02/2019 07:55, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> Why is the comment wrong? Or do you mean to say the comment could be
>> made more accurate?
> 
> Let's go back a step.  My understanding is that pthread_join and
> pthread_timedjoin_np are both cancellation points, but
> pthread_tryjoin_np is not.  Are you saying that pthread_timedjoin_np is
> not a cancellation point?
> 
> I assume that would be a bug.

This is my mistake done by ce7eb0e90315eb1939ac29f656d39b3db858c092,
I refactored lll_wait_tid and wrongly assumed __lll_timedwait_tid is
a cancellation point. The lll_wait_tid is only used on 
__pthread_timedjoin_ex and replicated on arch lowlevellock.h
implementation, I think it would be simpler to just move it to 
nptl/pthread_join_common.c.
  
Adhemerval Zanella Feb. 12, 2019, 11:49 a.m. UTC | #5
On 12/02/2019 09:39, Adhemerval Zanella wrote:
> 
> 
> On 12/02/2019 07:55, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>> Why is the comment wrong? Or do you mean to say the comment could be
>>> made more accurate?
>>
>> Let's go back a step.  My understanding is that pthread_join and
>> pthread_timedjoin_np are both cancellation points, but
>> pthread_tryjoin_np is not.  Are you saying that pthread_timedjoin_np is
>> not a cancellation point?
>>
>> I assume that would be a bug.
> 
> This is my mistake done by ce7eb0e90315eb1939ac29f656d39b3db858c092,
> I refactored lll_wait_tid and wrongly assumed __lll_timedwait_tid is
> a cancellation point. The lll_wait_tid is only used on 
> __pthread_timedjoin_ex and replicated on arch lowlevellock.h
> implementation, I think it would be simpler to just move it to 
> nptl/pthread_join_common.c.
> 
> 

I am working on fixing it.
  
Florian Weimer Feb. 12, 2019, 12:21 p.m. UTC | #6
* Adhemerval Zanella:

> On 12/02/2019 09:39, Adhemerval Zanella wrote:
>> 
>> 
>> On 12/02/2019 07:55, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> Why is the comment wrong? Or do you mean to say the comment could be
>>>> made more accurate?
>>>
>>> Let's go back a step.  My understanding is that pthread_join and
>>> pthread_timedjoin_np are both cancellation points, but
>>> pthread_tryjoin_np is not.  Are you saying that pthread_timedjoin_np is
>>> not a cancellation point?
>>>
>>> I assume that would be a bug.
>> 
>> This is my mistake done by ce7eb0e90315eb1939ac29f656d39b3db858c092,
>> I refactored lll_wait_tid and wrongly assumed __lll_timedwait_tid is
>> a cancellation point. The lll_wait_tid is only used on 
>> __pthread_timedjoin_ex and replicated on arch lowlevellock.h
>> implementation, I think it would be simpler to just move it to 
>> nptl/pthread_join_common.c.
>> 
>> 
>
> I am working on fixing it.

Thanks.

Meanwhile, I filed <https://sourceware.org/bugzilla/show_bug.cgi?id=24215>
with an out-of-tree test case.

Florian
  

Patch

diff --git a/ChangeLog b/ChangeLog
index 6f1d967f62..37e7babf75 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@ 
+2019-02-11  Carlos O'Donell  <carlos@redhat.com>
+
+	* nptl/lll_timedwait_tid.c: Use GNU-style comment.
+	* nptl/pthread_join_common.c: Fix comment.
+	* nptl/pthread_tryjoin.c: Likewise.
+	* sysdeps/nptl/lowlevellock.h: Likewise.
+
 2019-02-11  Paul A. Clarke  <pc@us.ibm.com>
 
 	* sysdeps/powerpc/fpu/e_sqrt.c (__slow_ieee754_sqrtf):
diff --git a/nptl/lll_timedwait_tid.c b/nptl/lll_timedwait_tid.c
index 59ecbb4446..8f22da9b05 100644
--- a/nptl/lll_timedwait_tid.c
+++ b/nptl/lll_timedwait_tid.c
@@ -60,8 +60,7 @@  __lll_timedwait_tid (int *tidp, const struct timespec *abstime)
 
       /* If *tidp == tid, wait until thread terminates or the wait times out.
          The kernel up to version 3.16.3 does not use the private futex
-         operations for futex wake-up when the clone terminates.
-      */
+         operations for futex wake-up when the clone terminates.  */
       if (lll_futex_timed_wait (tidp, tid, &rt, LLL_SHARED) == -ETIMEDOUT)
         return ETIMEDOUT;
     }
diff --git a/nptl/pthread_join_common.c b/nptl/pthread_join_common.c
index ecb78ffba5..e99854ac39 100644
--- a/nptl/pthread_join_common.c
+++ b/nptl/pthread_join_common.c
@@ -76,8 +76,8 @@  __pthread_timedjoin_ex (pthread_t threadid, void **thread_return,
 
   if (block)
     {
-      /* During the wait we change to asynchronous cancellation.  If we
-	 are cancelled the thread we are waiting for must be marked as
+      /* If abstime is NULL we switch to asynchronous cancellation.  If we
+	 are cancelled then the thread we are waiting for must be marked as
 	 un-wait-ed for again.  */
       pthread_cleanup_push (cleanup, &pd->joinid);
 
diff --git a/nptl/pthread_tryjoin.c b/nptl/pthread_tryjoin.c
index aa4fe07af5..7dbf22868d 100644
--- a/nptl/pthread_tryjoin.c
+++ b/nptl/pthread_tryjoin.c
@@ -26,7 +26,7 @@  pthread_tryjoin_np (pthread_t threadid, void **thread_return)
   if (pd->tid != 0)
     return EBUSY;
 
-  /* If pd->tid != 0 then lll_wait_tid will not block on futex
+  /* If pd->tid == 0 then lll_wait_tid will not block on futex
      operation.  */
   return __pthread_timedjoin_ex (threadid, thread_return, NULL, false);
 }
diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
index 36bbc472e7..2581ed5ca9 100644
--- a/sysdeps/nptl/lowlevellock.h
+++ b/sysdeps/nptl/lowlevellock.h
@@ -185,7 +185,7 @@  extern int __lll_timedwait_tid (int *, const struct timespec *)
    operations for futex wake-up when the clone terminates.
    If ABSTIME is not NULL, is used a timeout for futex call.  If the timeout
    occurs then return ETIMEOUT, if ABSTIME is invalid, return EINVAL.
-   The futex operation are issues with cancellable versions.  */
+   If abstime is NULL we use a cancellable futex wait operation.  */
 #define lll_wait_tid(tid, abstime)					\
   ({									\
     int __res = 0;							\