[2/4] nptl: Handle EPIPE on tst-cancel2

Message ID 20190815211843.22799-2-adhemerval.zanella@linaro.org
State Dropped
Headers

Commit Message

Adhemerval Zanella Netto Aug. 15, 2019, 9:18 p.m. UTC
  The SIGPIPE can be handled before SIGCANCEL, which makes write fail
and the thread return a non expected result.

Checked on x86_64-linux-gnu.

	* nptl/tst-cancel2.c (tf): Do not abort with EPIPE.
---
 nptl/tst-cancel2.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
  

Comments

Adhemerval Zanella Netto Aug. 19, 2019, 8:30 p.m. UTC | #1
I will push shortly if no one opposes it.

On 15/08/2019 18:18, Adhemerval Zanella wrote:
> The SIGPIPE can be handled before SIGCANCEL, which makes write fail
> and the thread return a non expected result.
> 
> Checked on x86_64-linux-gnu.
> 
> 	* nptl/tst-cancel2.c (tf): Do not abort with EPIPE.
> ---
>  nptl/tst-cancel2.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c
> index 1f0429d343..632ea4e0ae 100644
> --- a/nptl/tst-cancel2.c
> +++ b/nptl/tst-cancel2.c
> @@ -20,6 +20,7 @@
>  #include <signal.h>
>  #include <stdio.h>
>  #include <unistd.h>
> +#include <errno.h>
>  
>  
>  static int fd[2];
> @@ -32,7 +33,14 @@ tf (void *arg)
>       write blocks.  */
>    char buf[100000];
>  
> -  while (write (fd[1], buf, sizeof (buf)) > 0);
> +  while (1)
> +    {
> +      /* Ignore EPIPE errors for case the SIGPIPE is handle before
> +	 SIGCANCEL.  */
> +      ssize_t ret = write (fd[1], buf, sizeof (buf));
> +      if (ret == 0 || (ret == -1 && errno != EPIPE))
> +       break;
> +    }
>  
>    return (void *) 42l;
>  }
>
  
Florian Weimer Aug. 20, 2019, 10:23 a.m. UTC | #2
* Adhemerval Zanella:

> The SIGPIPE can be handled before SIGCANCEL, which makes write fail
> and the thread return a non expected result.
>
> Checked on x86_64-linux-gnu.
>
> 	* nptl/tst-cancel2.c (tf): Do not abort with EPIPE.
> ---
>  nptl/tst-cancel2.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c
> index 1f0429d343..632ea4e0ae 100644
> --- a/nptl/tst-cancel2.c
> +++ b/nptl/tst-cancel2.c
> @@ -20,6 +20,7 @@
>  #include <signal.h>
>  #include <stdio.h>
>  #include <unistd.h>
> +#include <errno.h>
>  
>  
>  static int fd[2];
> @@ -32,7 +33,14 @@ tf (void *arg)
>       write blocks.  */
>    char buf[100000];
>  
> -  while (write (fd[1], buf, sizeof (buf)) > 0);
> +  while (1)
> +    {
> +      /* Ignore EPIPE errors for case the SIGPIPE is handle before
> +	 SIGCANCEL.  */
> +      ssize_t ret = write (fd[1], buf, sizeof (buf));
> +      if (ret == 0 || (ret == -1 && errno != EPIPE))
> +       break;
> +    }
>  
>    return (void *) 42l;
>  }

I disagree with this change.  SIGPIPE does not appear to be a valid
error from write in this test because the thread is canceled before the
read end of the pipe is closed.  POSIX says this:

| The cancellation processing in the target thread shall run
| asynchronously with respect to the calling thread returning from
| pthread_cancel().

But that doesn't mean that the thread can observe actions in its
uncanceled state that happen after the cancellation, which is the case
when the write fails with SIGPIPE.

Thanks,
Florian
  
Adhemerval Zanella Netto Aug. 20, 2019, 12:14 p.m. UTC | #3
On 20/08/2019 07:23, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> The SIGPIPE can be handled before SIGCANCEL, which makes write fail
>> and the thread return a non expected result.
>>
>> Checked on x86_64-linux-gnu.
>>
>> 	* nptl/tst-cancel2.c (tf): Do not abort with EPIPE.
>> ---
>>  nptl/tst-cancel2.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c
>> index 1f0429d343..632ea4e0ae 100644
>> --- a/nptl/tst-cancel2.c
>> +++ b/nptl/tst-cancel2.c
>> @@ -20,6 +20,7 @@
>>  #include <signal.h>
>>  #include <stdio.h>
>>  #include <unistd.h>
>> +#include <errno.h>
>>  
>>  
>>  static int fd[2];
>> @@ -32,7 +33,14 @@ tf (void *arg)
>>       write blocks.  */
>>    char buf[100000];
>>  
>> -  while (write (fd[1], buf, sizeof (buf)) > 0);
>> +  while (1)
>> +    {
>> +      /* Ignore EPIPE errors for case the SIGPIPE is handle before
>> +	 SIGCANCEL.  */
>> +      ssize_t ret = write (fd[1], buf, sizeof (buf));
>> +      if (ret == 0 || (ret == -1 && errno != EPIPE))
>> +       break;
>> +    }
>>  
>>    return (void *) 42l;
>>  }
> 
> I disagree with this change.  SIGPIPE does not appear to be a valid
> error from write in this test because the thread is canceled before the
> read end of the pipe is closed.  POSIX says this:
> 
> | The cancellation processing in the target thread shall run
> | asynchronously with respect to the calling thread returning from
> | pthread_cancel().
> 
> But that doesn't mean that the thread can observe actions in its
> uncanceled state that happen after the cancellation, which is the case
> when the write fails with SIGPIPE.

My understanding is since thread cancellation is implemented with an 
asynchronous signal and there is no *extra* synchronization between the
pthread_cancel from main thread and write call on the create one, there
is no happens before order between them. 

It might the case where, due scheduling, the cancellation signal is 
triggered on the test during the write call and it is a side effect that
should be handled. This is exactly the kind of race conditions BZ#12683
make more explicit.
  
Florian Weimer Aug. 20, 2019, 12:59 p.m. UTC | #4
* Adhemerval Zanella:

> On 20/08/2019 07:23, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> The SIGPIPE can be handled before SIGCANCEL, which makes write fail
>>> and the thread return a non expected result.
>>>
>>> Checked on x86_64-linux-gnu.
>>>
>>> 	* nptl/tst-cancel2.c (tf): Do not abort with EPIPE.
>>> ---
>>>  nptl/tst-cancel2.c | 10 +++++++++-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c
>>> index 1f0429d343..632ea4e0ae 100644
>>> --- a/nptl/tst-cancel2.c
>>> +++ b/nptl/tst-cancel2.c
>>> @@ -20,6 +20,7 @@
>>>  #include <signal.h>
>>>  #include <stdio.h>
>>>  #include <unistd.h>
>>> +#include <errno.h>
>>>  
>>>  
>>>  static int fd[2];
>>> @@ -32,7 +33,14 @@ tf (void *arg)
>>>       write blocks.  */
>>>    char buf[100000];
>>>  
>>> -  while (write (fd[1], buf, sizeof (buf)) > 0);
>>> +  while (1)
>>> +    {
>>> +      /* Ignore EPIPE errors for case the SIGPIPE is handle before
>>> +	 SIGCANCEL.  */
>>> +      ssize_t ret = write (fd[1], buf, sizeof (buf));
>>> +      if (ret == 0 || (ret == -1 && errno != EPIPE))
>>> +       break;
>>> +    }
>>>  
>>>    return (void *) 42l;
>>>  }
>> 
>> I disagree with this change.  SIGPIPE does not appear to be a valid
>> error from write in this test because the thread is canceled before the
>> read end of the pipe is closed.  POSIX says this:
>> 
>> | The cancellation processing in the target thread shall run
>> | asynchronously with respect to the calling thread returning from
>> | pthread_cancel().
>> 
>> But that doesn't mean that the thread can observe actions in its
>> uncanceled state that happen after the cancellation, which is the case
>> when the write fails with SIGPIPE.
>
> My understanding is since thread cancellation is implemented with an 
> asynchronous signal and there is no *extra* synchronization between the
> pthread_cancel from main thread and write call on the create one, there
> is no happens before order between them. 
>
> It might the case where, due scheduling, the cancellation signal is 
> triggered on the test during the write call and it is a side effect that
> should be handled. This is exactly the kind of race conditions BZ#12683
> make more explicit.

Hmm.  What are the write call sequences for this test?

I can think of the following scenarios:

(A) The write call is canceled immediately and never returns.

(B) The write call returns a positive value.  The second write call is
    canceled.

(C) The first write call returns a positive value.  The second write
    call fails with with EPIPE.  The third write call is canceled.

I have serious doubts that that (C) is a valid sequence.  Practically
speaking, I think it will lead to regressions because threads may fail
to act upon cancellation compared to what we have today.  This works out
for the test because there is an infinite number of write calls.

Looking at the actual cancellation change, I see this:

+  /* Call the arch-specific entry points that contains the globals markers
+     to be checked by SIGCANCEL handler.  */
+  result = __syscall_cancel_arch (&pd->cancelhandling, nr, a1, a2, a3, a4, a5,
+			          a6);
+
+  if ((result == -EINTR)
+      && (pd->cancelhandling & CANCELED_BITMASK)
+      && !(pd->cancelhandling & CANCELSTATE_BITMASK))
+    __do_cancel ();

Why the restriction to EINTR?  Any signal can result in EINTR.  And with
SA_RESTART, the SIGCANCEL handler will not even result in EINTR.

Thanks,
Florian
  
Adhemerval Zanella Netto Aug. 20, 2019, 2:10 p.m. UTC | #5
On 20/08/2019 09:59, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 20/08/2019 07:23, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> The SIGPIPE can be handled before SIGCANCEL, which makes write fail
>>>> and the thread return a non expected result.
>>>>
>>>> Checked on x86_64-linux-gnu.
>>>>
>>>> 	* nptl/tst-cancel2.c (tf): Do not abort with EPIPE.
>>>> ---
>>>>  nptl/tst-cancel2.c | 10 +++++++++-
>>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c
>>>> index 1f0429d343..632ea4e0ae 100644
>>>> --- a/nptl/tst-cancel2.c
>>>> +++ b/nptl/tst-cancel2.c
>>>> @@ -20,6 +20,7 @@
>>>>  #include <signal.h>
>>>>  #include <stdio.h>
>>>>  #include <unistd.h>
>>>> +#include <errno.h>
>>>>  
>>>>  
>>>>  static int fd[2];
>>>> @@ -32,7 +33,14 @@ tf (void *arg)
>>>>       write blocks.  */
>>>>    char buf[100000];
>>>>  
>>>> -  while (write (fd[1], buf, sizeof (buf)) > 0);
>>>> +  while (1)
>>>> +    {
>>>> +      /* Ignore EPIPE errors for case the SIGPIPE is handle before
>>>> +	 SIGCANCEL.  */
>>>> +      ssize_t ret = write (fd[1], buf, sizeof (buf));
>>>> +      if (ret == 0 || (ret == -1 && errno != EPIPE))
>>>> +       break;
>>>> +    }
>>>>  
>>>>    return (void *) 42l;
>>>>  }
>>>
>>> I disagree with this change.  SIGPIPE does not appear to be a valid
>>> error from write in this test because the thread is canceled before the
>>> read end of the pipe is closed.  POSIX says this:
>>>
>>> | The cancellation processing in the target thread shall run
>>> | asynchronously with respect to the calling thread returning from
>>> | pthread_cancel().
>>>
>>> But that doesn't mean that the thread can observe actions in its
>>> uncanceled state that happen after the cancellation, which is the case
>>> when the write fails with SIGPIPE.
>>
>> My understanding is since thread cancellation is implemented with an 
>> asynchronous signal and there is no *extra* synchronization between the
>> pthread_cancel from main thread and write call on the create one, there
>> is no happens before order between them. 
>>
>> It might the case where, due scheduling, the cancellation signal is 
>> triggered on the test during the write call and it is a side effect that
>> should be handled. This is exactly the kind of race conditions BZ#12683
>> make more explicit.
> 
> Hmm.  What are the write call sequences for this test?
> 
> I can think of the following scenarios:
> 
> (A) The write call is canceled immediately and never returns.
> 
> (B) The write call returns a positive value.  The second write call is
>     canceled.
> 
> (C) The first write call returns a positive value.  The second write
>     call fails with with EPIPE.  The third write call is canceled.
> 
> I have serious doubts that that (C) is a valid sequence.  Practically
> speaking, I think it will lead to regressions because threads may fail
> to act upon cancellation compared to what we have today.  This works out
> for the test because there is an infinite number of write calls.

The idea of using a large buffer is exactly to force the write to block
indefinitely.  Ideally the test should use F_SETPIPE_SZ to make it sure
it would block on first call, but current scheme might issue multiple
calls until buffer is full (which usually does not trigger because
SIGCANCEL is usually triggered beforehand).

What happens for current asynchronous handling is if the syscall is
indeed interrupted, it will call __pthread_disable_asynccancel and
then it will wait on futex_wait_simple (line 97) until the cancellation
handling is done. It waits on the futex because the pthread_cancel sets the
CANCELING_BITMASK | CANCELED_BITMASK. The sigcancel_handler then acts upon
cancellation and finishes the thread.

That's the *very* racy issue BZ#12683 aims to fix, since current cancellation
does not give the information back to program that a side-effect of the sycall
has happens (in this case EPIPE). 

For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
pthread_cancel you can see this issue more clearly (dump with strace):

[pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
[pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
[pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0

########### Cancellation start to act here, by loading the libgcc to unwinding
[pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
[pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
[pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
[pid  2586] close(5)                    = 0
[pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
[pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
[pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
[pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
[pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
[pid  2586] close(5)                    = 0
[pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
[pid  2586] munmap(0x7fffabf00000, 63776) = 0
[pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
[pid  2586] close(3)                    = 0
[pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>

########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
[pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
[pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
[pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
[pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0

########### No side-effects reported back to program
[pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
[pid  2587] exit(0)                     = ?

With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
depending whether the write is interrupted or not by the cancellation signal.

> 
> Looking at the actual cancellation change, I see this:
> 
> +  /* Call the arch-specific entry points that contains the globals markers
> +     to be checked by SIGCANCEL handler.  */
> +  result = __syscall_cancel_arch (&pd->cancelhandling, nr, a1, a2, a3, a4, a5,
> +			          a6);
> +
> +  if ((result == -EINTR)
> +      && (pd->cancelhandling & CANCELED_BITMASK)
> +      && !(pd->cancelhandling & CANCELSTATE_BITMASK))
> +    __do_cancel ();
> 
> Why the restriction to EINTR?  Any signal can result in EINTR.  And with
> SA_RESTART, the SIGCANCEL handler will not even result in EINTR.

The EINTR was a special case for __NR_close from original Rich suggestion,
which I think we will need to discuss further it is indeed make sense for
glibc (it is the long-standing issue regarding of close failure and how to
handle it).  I removed it in the current version.
  
Florian Weimer Aug. 20, 2019, 3:30 p.m. UTC | #6
* Adhemerval Zanella:

> For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
> pthread_cancel you can see this issue more clearly (dump with strace):
>
> [pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
> [pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0
>
> ########### Cancellation start to act here, by loading the libgcc to unwinding
> [pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
> [pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
> [pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
> [pid  2586] close(5)                    = 0
> [pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
> [pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
> [pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
> [pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
> [pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
> [pid  2586] close(5)                    = 0
> [pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
> [pid  2586] munmap(0x7fffabf00000, 63776) = 0
> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
> [pid  2586] close(3)                    = 0
> [pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>
>
> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
> [pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0
>
> ########### No side-effects reported back to program
> [pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
> [pid  2587] exit(0)                     = ?
>
> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
> depending whether the write is interrupted or not by the cancellation signal.

Hmm.  Which cancellation implementation is this?  At which point in the
trace do we start unwinding?  I'm surprised that strace reports the
EPIPE before the SIGPIPE, but maybe that's just a kernel race.  My
expectation is that the current code unwinds after the system call
returns with the EPIPE error, never returning it to the application.  I
think this is the right behavior for the write system call.

Thanks,
Florian
  
Adhemerval Zanella Netto Aug. 20, 2019, 4:46 p.m. UTC | #7
On 20/08/2019 12:30, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
>> pthread_cancel you can see this issue more clearly (dump with strace):
>>
>> [pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
>> [pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0
>>
>> ########### Cancellation start to act here, by loading the libgcc to unwinding
>> [pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
>> [pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
>> [pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
>> [pid  2586] close(5)                    = 0
>> [pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
>> [pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
>> [pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
>> [pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
>> [pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
>> [pid  2586] close(5)                    = 0
>> [pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
>> [pid  2586] munmap(0x7fffabf00000, 63776) = 0
>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>> [pid  2586] close(3)                    = 0
>> [pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>
>>
>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>> [pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0
>>
>> ########### No side-effects reported back to program
>> [pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
>> [pid  2587] exit(0)                     = ?
>>
>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
>> depending whether the write is interrupted or not by the cancellation signal.
> 
> Hmm.  Which cancellation implementation is this?  At which point in the
> trace do we start unwinding?  I'm surprised that strace reports the
> EPIPE before the SIGPIPE, but maybe that's just a kernel race.  My
> expectation is that the current code unwinds after the system call
> returns with the EPIPE error, never returning it to the application.  I
> think this is the right behavior for the write system call.

This is current implement, more specifically glibc 2.17, CentOS 7.6 on
powerpc64le.  From the trace :

>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0

This where pthread_cancel sends the SIGCANCEL signal to thread.

>> [pid  2586] close(3)                    = 0

This is the

    /* This will cause the write in the child to return.  */
    close (fd[0]);

In tst-cancel2.c.

And finally:

>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---

SIGPIPE is receives, making the write fail with EPIPE and then

>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---

sigcancel_handler is issued.  And the implementation *does* unwind after the
syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683).
  
Adhemerval Zanella Netto Aug. 21, 2019, 7:41 p.m. UTC | #8
On 20/08/2019 13:46, Adhemerval Zanella wrote:
> 
> 
> On 20/08/2019 12:30, Florian Weimer wrote:
>> * Adhemerval Zanella:
>>
>>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
>>> pthread_cancel you can see this issue more clearly (dump with strace):
>>>
>>> [pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
>>> [pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0
>>>
>>> ########### Cancellation start to act here, by loading the libgcc to unwinding
>>> [pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
>>> [pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
>>> [pid  2586] close(5)                    = 0
>>> [pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
>>> [pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
>>> [pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
>>> [pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
>>> [pid  2586] close(5)                    = 0
>>> [pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
>>> [pid  2586] munmap(0x7fffabf00000, 63776) = 0
>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>>> [pid  2586] close(3)                    = 0
>>> [pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>
>>>
>>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>>> [pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0
>>>
>>> ########### No side-effects reported back to program
>>> [pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
>>> [pid  2587] exit(0)                     = ?
>>>
>>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
>>> depending whether the write is interrupted or not by the cancellation signal.
>>
>> Hmm.  Which cancellation implementation is this?  At which point in the
>> trace do we start unwinding?  I'm surprised that strace reports the
>> EPIPE before the SIGPIPE, but maybe that's just a kernel race.  My
>> expectation is that the current code unwinds after the system call
>> returns with the EPIPE error, never returning it to the application.  I
>> think this is the right behavior for the write system call.
> 
> This is current implement, more specifically glibc 2.17, CentOS 7.6 on
> powerpc64le.  From the trace :
> 
>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
> 
> This where pthread_cancel sends the SIGCANCEL signal to thread.
> 
>>> [pid  2586] close(3)                    = 0
> 
> This is the
> 
>     /* This will cause the write in the child to return.  */
>     close (fd[0]);
> 
> In tst-cancel2.c.
> 
> And finally:
> 
>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
> 
> SIGPIPE is receives, making the write fail with EPIPE and then
> 
>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
> 
> sigcancel_handler is issued.  And the implementation *does* unwind after the
> syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683).
> 

Florian, do still disagree with this change?
  
Florian Weimer Aug. 22, 2019, 6:45 a.m. UTC | #9
* Adhemerval Zanella:

> Florian, do still disagree with this change?

Sorry, I need to run more experiments, hopefully today.

Thanks,
Florian
  
Florian Weimer Aug. 23, 2019, 12:47 p.m. UTC | #10
* Adhemerval Zanella:

> On 20/08/2019 12:30, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
>>> pthread_cancel you can see this issue more clearly (dump with strace):
>>>
>>> [pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
>>> [pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0
>>>
>>> ########### Cancellation start to act here, by loading the libgcc to unwinding
>>> [pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
>>> [pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
>>> [pid  2586] close(5)                    = 0
>>> [pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
>>> [pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
>>> [pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
>>> [pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
>>> [pid  2586] close(5)                    = 0
>>> [pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
>>> [pid  2586] munmap(0x7fffabf00000, 63776) = 0
>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>>> [pid  2586] close(3)                    = 0
>>> [pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>
>>>
>>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>>> [pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0
>>>
>>> ########### No side-effects reported back to program
>>> [pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
>>> [pid  2587] exit(0)                     = ?
>>>
>>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
>>> depending whether the write is interrupted or not by the cancellation signal.
>> 
>> Hmm.  Which cancellation implementation is this?  At which point in the
>> trace do we start unwinding?  I'm surprised that strace reports the
>> EPIPE before the SIGPIPE, but maybe that's just a kernel race.  My
>> expectation is that the current code unwinds after the system call
>> returns with the EPIPE error, never returning it to the application.  I
>> think this is the right behavior for the write system call.
>
> This is current implement, more specifically glibc 2.17, CentOS 7.6 on
> powerpc64le.  From the trace :
>
>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>
> This where pthread_cancel sends the SIGCANCEL signal to thread.
>
>>> [pid  2586] close(3)                    = 0
>
> This is the
>
>     /* This will cause the write in the child to return.  */
>     close (fd[0]);
>
> In tst-cancel2.c.
>
> And finally:
>
>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>
> SIGPIPE is receives, making the write fail with EPIPE and then
>
>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>
> sigcancel_handler is issued.  And the implementation *does* unwind after the
> syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683).

I looked at this some more.  It seems that in theory, we could also see
EPIPE in the old implementation or at least observe execution of a
signal handler, and the asynchronous cancel hides that.

In the new implementation, if there is a signal handler and a signal for
it is delivered before SIGCANCEL (even if the signal was sent *after*
pthread_cancel, from the same thread), I do not think there is any
chance whatsoever that we can hide the behavior difference.  After all,
SIGCANCEL may not trigger an asynchronous cancellation from the signal
handler.

System calls that are cancellation points appear to fall into these
categories:

(A) Those that do not perform any resource allocation (write, read).

(B) Those that perform resource allocation, but the allocation can be
    easily reverted (openat, accept4).

(C) Those that perform resource allocation, but the allocation is
    difficult to undo (recvmsg with descriptor passing).

(D) close.

For (A), I think POSIX allows sufficient wiggle room so that we can
exhibit a partial side effect and still act on the cancellation (without
reporting a partial write first to the application).

For (B), maybe we should undo the resource allocation and then proceed
to act on the cancellation.

For (C), the complexity may not be worth it.

For (A), (B), (C), we can act on the cancellation in the error case,
after we observe the cancellation flag in the signal handler trampoline.
(I think dropping the EINTR restriction from there achieves that.)  If
we do that, we do not need to change the test case.

(D) is very special.  Ideally, we would specify what happens with the
descriptor if the close call is canceled.  POSIX does not even specify
what the state of file descriptor is after an EINTR error, so it doesn't
say what happens with cancellation, either.  Maybe we have to leave that
undefined.

Thanks,
Florian
  
Adhemerval Zanella Netto Aug. 23, 2019, 2:25 p.m. UTC | #11
On 23/08/2019 09:47, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 20/08/2019 12:30, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and 
>>>> pthread_cancel you can see this issue more clearly (dump with strace):
>>>>
>>>> [pid  2587] set_robust_list(0x7fffabccf290, 24) = 0
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000
>>>> [pid  2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...>
>>>> [pid  2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0
>>>>
>>>> ########### Cancellation start to act here, by loading the libgcc to unwinding
>>>> [pid  2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5
>>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0
>>>> [pid  2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000
>>>> [pid  2586] close(5)                    = 0
>>>> [pid  2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5
>>>> [pid  2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832
>>>> [pid  2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0
>>>> [pid  2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000
>>>> [pid  2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000
>>>> [pid  2586] close(5)                    = 0
>>>> [pid  2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0
>>>> [pid  2586] munmap(0x7fffabf00000, 63776) = 0
>>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>>>> [pid  2586] close(3)                    = 0
>>>> [pid  2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...>
>>>>
>>>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called
>>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>>>> [pid  2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0
>>>>
>>>> ########### No side-effects reported back to program
>>>> [pid  2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0
>>>> [pid  2587] exit(0)                     = ?
>>>>
>>>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails
>>>> depending whether the write is interrupted or not by the cancellation signal.
>>>
>>> Hmm.  Which cancellation implementation is this?  At which point in the
>>> trace do we start unwinding?  I'm surprised that strace reports the
>>> EPIPE before the SIGPIPE, but maybe that's just a kernel race.  My
>>> expectation is that the current code unwinds after the system call
>>> returns with the EPIPE error, never returning it to the application.  I
>>> think this is the right behavior for the write system call.
>>
>> This is current implement, more specifically glibc 2.17, CentOS 7.6 on
>> powerpc64le.  From the trace :
>>
>>>> [pid  2586] tgkill(2586, 2587, SIGRTMIN) = 0
>>
>> This where pthread_cancel sends the SIGCANCEL signal to thread.
>>
>>>> [pid  2586] close(3)                    = 0
>>
>> This is the
>>
>>     /* This will cause the write in the child to return.  */
>>     close (fd[0]);
>>
>> In tst-cancel2.c.
>>
>> And finally:
>>
>>>> [pid  2587] <... write resumed>)        = -1 EPIPE (Broken pipe)
>>>> [pid  2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} ---
>>
>> SIGPIPE is receives, making the write fail with EPIPE and then
>>
>>>> [pid  2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} ---
>>
>> sigcancel_handler is issued.  And the implementation *does* unwind after the
>> syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683).
> 
> I looked at this some more.  It seems that in theory, we could also see
> EPIPE in the old implementation or at least observe execution of a
> signal handler, and the asynchronous cancel hides that.
> 
> In the new implementation, if there is a signal handler and a signal for
> it is delivered before SIGCANCEL (even if the signal was sent *after*
> pthread_cancel, from the same thread), I do not think there is any
> chance whatsoever that we can hide the behavior difference.  After all,
> SIGCANCEL may not trigger an asynchronous cancellation from the signal
> handler.

Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled
by the signal handle by removing it from ucontext_t signal mask.
> 
> System calls that are cancellation points appear to fall into these
> categories:
> 
> (A) Those that do not perform any resource allocation (write, read).
> 
> (B) Those that perform resource allocation, but the allocation can be
>     easily reverted (openat, accept4).
> 
> (C) Those that perform resource allocation, but the allocation is
>     difficult to undo (recvmsg with descriptor passing).
> 
> (D) close.
> 
> For (A), I think POSIX allows sufficient wiggle room so that we can
> exhibit a partial side effect and still act on the cancellation (without
> reporting a partial write first to the application).

I agree, the side-effects for such syscalls should not lead to possible
resources leakage.

> 
> For (B), maybe we should undo the resource allocation and then proceed
> to act on the cancellation.

Besides this requires a lot of more complexity by mapping what kind of
resources each syscall would require to free and when to free based on
returned codes, it would hide it from the application. My view we just
need to return that the syscall has visible side-effects that may result
in system resources leaks and let the application deal with.

> 
> For (C), the complexity may not be worth it.
> 
> For (A), (B), (C), we can act on the cancellation in the error case,
> after we observe the cancellation flag in the signal handler trampoline.
> (I think dropping the EINTR restriction from there achieves that.)  If
> we do that, we do not need to change the test case.

I don't see a real net gain in adding such complexity.  It is not on the 
error  case, but rather when visible side-effects has happened and we should
alert the program to act upon that. The tst-cancel2 case is an error case
because of the EPIPE semantic, but the new tst-cancel28 (based on the leak
example from BZ#12683) is a case where the open call returns a valid file
descriptor that should be freed.

If glibc starts to add extra semantics for cancellation, we will have a standard
extensions since now 'open', for instance, would close the file descriptor if 
the cancellation signal is delivered after the OS allocates the file descriptor 
resource.  IMHO this is clearly what the program itself should be handling, 
because it might the case where it is prepared to close itself the file descriptor
in a different thread.  This would lead to possible another incompatibilities 
regarding libc system implementation, besides adding more complexity.

I would like to make cancellation racy-free with minimal semantic change as 
possible.

> 
> (D) is very special.  Ideally, we would specify what happens with the
> descriptor if the close call is canceled.  POSIX does not even specify
> what the state of file descriptor is after an EINTR error, so it doesn't
> say what happens with cancellation, either.  Maybe we have to leave that
> undefined.

Indeed even after the POSIX close language clarification [1], the specification 
is messy (it really does not help that some system does have interruptible
close implementation that might return -1).

However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux
since it shouldn't return -1/EINTR in any case (and I don't know if/when
there are kernels that actually does it) [2].  So assuming the nptl is Linux 
specific, we can assume that, although close is specified as a cancellable
entrypoint, it won't happen in practice.

[1] http://austingroupbugs.net/view.php?id=529
[2] https://lwn.net/Articles/576478/
  
Florian Weimer Aug. 26, 2019, 10:06 a.m. UTC | #12
* Adhemerval Zanella:

>> In the new implementation, if there is a signal handler and a signal for
>> it is delivered before SIGCANCEL (even if the signal was sent *after*
>> pthread_cancel, from the same thread), I do not think there is any
>> chance whatsoever that we can hide the behavior difference.  After all,
>> SIGCANCEL may not trigger an asynchronous cancellation from the signal
>> handler.
>
> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled
> by the signal handle by removing it from ucontext_t signal mask.

Does this mean a thread would lose its ability to be canceled in
blocking system calls if it setjmps out of a signal handler?  (Where as
sigsetjmp would keep cancellation working.) 

>> System calls that are cancellation points appear to fall into these
>> categories:
>> 
>> (A) Those that do not perform any resource allocation (write, read).
>> 
>> (B) Those that perform resource allocation, but the allocation can be
>>     easily reverted (openat, accept4).
>> 
>> (C) Those that perform resource allocation, but the allocation is
>>     difficult to undo (recvmsg with descriptor passing).
>> 
>> (D) close.

>> For (B), maybe we should undo the resource allocation and then proceed
>> to act on the cancellation.
>
> Besides this requires a lot of more complexity by mapping what kind of
> resources each syscall would require to free and when to free based on
> returned codes, it would hide it from the application. My view we just
> need to return that the syscall has visible side-effects that may result
> in system resources leaks and let the application deal with.

My concern is that there might be no further cancellation point, and if
we do not cancel in the case of (B), we might fail to act on the
cancellation even though the canceled thread was blocked at a
cancellation point when pthread_cancel was called.

>> For (C), the complexity may not be worth it.
>> 
>> For (A), (B), (C), we can act on the cancellation in the error case,
>> after we observe the cancellation flag in the signal handler trampoline.
>> (I think dropping the EINTR restriction from there achieves that.)  If
>> we do that, we do not need to change the test case.
>
> I don't see a real net gain in adding such complexity.  It is not on the 
> error  case, but rather when visible side-effects has happened and we should
> alert the program to act upon that. The tst-cancel2 case is an error case
> because of the EPIPE semantic, but the new tst-cancel28 (based on the leak
> example from BZ#12683) is a case where the open call returns a valid file
> descriptor that should be freed.

I totally agree that the descriptor must not leak.  However,
tst-cancel28 is written in such a way that it does not expect an error
from open:

static void *
leaker (void *arg)
{
  int fd = open (arg, O_RDONLY);
  pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0);
  close (fd);
  return NULL;
}

It would be clearer if it actually used error checking, but we call
close (-1) on error, which is invalid.

If this test represents how application code is actually written, I
think we really need to check for cancellation before returning the file
descriptor.

> If glibc starts to add extra semantics for cancellation, we will have
> a standard extensions since now 'open', for instance, would close the
> file descriptor if the cancellation signal is delivered after the OS
> allocates the file descriptor resource.  IMHO this is clearly what the
> program itself should be handling, because it might the case where it
> is prepared to close itself the file descriptor in a different thread.

open is a cancellation point, so according to POSIX, the thread is
canceled if the open call completes after pthread_cancel returns.  (Due
to integration with the memory model, thread ordering issues are
difficult to reason about in POSIX, though.)  I do not think we should
return a descriptor to the application that can only exist because of
something that happened *after* the pthread_cancel call.

> This would lead to possible another incompatibilities regarding libc
> system implementation, besides adding more complexity.

I think we would just implement POSIX behavior more closely because open
is required to be a cancellation point.  It's unfortunate that we do not
have kernel support for this.  Other systems can likely implement
cancellation semantics correctly more easily.

>> (D) is very special.  Ideally, we would specify what happens with the
>> descriptor if the close call is canceled.  POSIX does not even specify
>> what the state of file descriptor is after an EINTR error, so it doesn't
>> say what happens with cancellation, either.  Maybe we have to leave that
>> undefined.
>
> Indeed even after the POSIX close language clarification [1], the specification 
> is messy (it really does not help that some system does have interruptible
> close implementation that might return -1).
>
> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux
> since it shouldn't return -1/EINTR in any case (and I don't know if/when
> there are kernels that actually does it) [2].  So assuming the nptl is Linux 
> specific, we can assume that, although close is specified as a cancellable
> entrypoint, it won't happen in practice.
>
> [1] http://austingroupbugs.net/view.php?id=529
> [2] https://lwn.net/Articles/576478/

I think this is slightly different from the EINTR case.  With
SA_RESTART, perhaps the signal handler can run *before* the descriptor
is closed, so if we never return from the handler, the descriptor leaks?

Thanks,
Florian
  
Adhemerval Zanella Netto Aug. 26, 2019, 2:35 p.m. UTC | #13
On 26/08/2019 07:06, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>>> In the new implementation, if there is a signal handler and a signal for
>>> it is delivered before SIGCANCEL (even if the signal was sent *after*
>>> pthread_cancel, from the same thread), I do not think there is any
>>> chance whatsoever that we can hide the behavior difference.  After all,
>>> SIGCANCEL may not trigger an asynchronous cancellation from the signal
>>> handler.
>>
>> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled
>> by the signal handle by removing it from ucontext_t signal mask.
> 
> Does this mean a thread would lose its ability to be canceled in
> blocking system calls if it setjmps out of a signal handler?  (Where as
> sigsetjmp would keep cancellation working.) 
The setjmp does not really matter here, once the async-cancellation fails
to act due side-effects SIGCANCEL is removed from thread signal mask. The
idea of making it sticky is always allow the thread to act on such cases
regardless whether it calls another cancellation entrypoint.

The thread will still be cancelled in the next cancellation entrypoint
due it is marked as CANCELED_BIT.  The __syscall_cancel_arch will check
for the bit and call __do_cancel before issuing the syscall.

> 
>>> System calls that are cancellation points appear to fall into these
>>> categories:
>>>
>>> (A) Those that do not perform any resource allocation (write, read).
>>>
>>> (B) Those that perform resource allocation, but the allocation can be
>>>     easily reverted (openat, accept4).
>>>
>>> (C) Those that perform resource allocation, but the allocation is
>>>     difficult to undo (recvmsg with descriptor passing).
>>>
>>> (D) close.
> 
>>> For (B), maybe we should undo the resource allocation and then proceed
>>> to act on the cancellation.
>>
>> Besides this requires a lot of more complexity by mapping what kind of
>> resources each syscall would require to free and when to free based on
>> returned codes, it would hide it from the application. My view we just
>> need to return that the syscall has visible side-effects that may result
>> in system resources leaks and let the application deal with.
> 
> My concern is that there might be no further cancellation point, and if
> we do not cancel in the case of (B), we might fail to act on the
> cancellation even though the canceled thread was blocked at a
> cancellation point when pthread_cancel was called.

I shared your concern, but my understanding of the whole BZ#12683 issue is
we can't really cancel the thread for the (B) case.

> 
>>> For (C), the complexity may not be worth it.
>>>
>>> For (A), (B), (C), we can act on the cancellation in the error case,
>>> after we observe the cancellation flag in the signal handler trampoline.
>>> (I think dropping the EINTR restriction from there achieves that.)  If
>>> we do that, we do not need to change the test case.
>>
>> I don't see a real net gain in adding such complexity.  It is not on the 
>> error  case, but rather when visible side-effects has happened and we should
>> alert the program to act upon that. The tst-cancel2 case is an error case
>> because of the EPIPE semantic, but the new tst-cancel28 (based on the leak
>> example from BZ#12683) is a case where the open call returns a valid file
>> descriptor that should be freed.
> 
> I totally agree that the descriptor must not leak.  However,
> tst-cancel28 is written in such a way that it does not expect an error
> from open:
> 
> static void *
> leaker (void *arg)
> {
>   int fd = open (arg, O_RDONLY);
>   pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0);
>   close (fd);
>   return NULL;
> }
> 
> It would be clearer if it actually used error checking, but we call
> close (-1) on error, which is invalid.

We can hardness the test to check for open failures (due file descriptor
exhaustion or other arcane issue), but the point of the test is fd is
always for the *close* call.  If the open syscall is interrupted before the
file descriptor is actually opened, the thread will be cancelled before
close call.  Otherwise, the kernel will have to return a valid value
and indicate on the PC from the ucontext_t that the syscall has side-effect.

> 
> If this test represents how application code is actually written, I
> think we really need to check for cancellation before returning the file
> descriptor.
> 
>> If glibc starts to add extra semantics for cancellation, we will have
>> a standard extensions since now 'open', for instance, would close the
>> file descriptor if the cancellation signal is delivered after the OS
>> allocates the file descriptor resource.  IMHO this is clearly what the
>> program itself should be handling, because it might the case where it
>> is prepared to close itself the file descriptor in a different thread.
> 
> open is a cancellation point, so according to POSIX, the thread is
> canceled if the open call completes after pthread_cancel returns.  (Due
> to integration with the memory model, thread ordering issues are
> difficult to reason about in POSIX, though.)  I do not think we should
> return a descriptor to the application that can only exist because of
> something that happened *after* the pthread_cancel call.

That's not the interpretation I see, the cancellation description for
such cases is afaiu:

"The side-effects of acting upon a cancellation request while suspended
during a call of a function are the same as the side-effects that may be
seen in a single-threaded program when a call to a function is interrupted
by a signal and the given function returns [EINTR]"

And on signal concepts:

"[EINTR]
Interrupted function call. An asynchronous signal was caught by the process 
during the execution of an interruptible function. If the signal handler 
performs a normal return, the interrupted function call may return this 
condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)."

And my understanding is, since glibc implemented cancellation using signals
and it requires to use SA_RESTART we can actually return to user if the
syscall has side-effects visible to caller.

It follows the same interpretation Rich used in explaining the semantic
required for the a possible kernel helper for cancellation [1]

> 
>> This would lead to possible another incompatibilities regarding libc
>> system implementation, besides adding more complexity.
> 
> I think we would just implement POSIX behavior more closely because open
> is required to be a cancellation point.  It's unfortunate that we do not
> have kernel support for this.  Other systems can likely implement
> cancellation semantics correctly more easily.

I don't think there is an easy solution that kernel can provide to help
cancellation to be standard conformant *and* race free.  On the same
kernel thread [2], Rich discuss some option with kernel developers and
although the conclusion is open the ideas are either no conformant (the
'stick signal') or feasible (a new set of syscalls with an extra sigmak
mask).

> 
>>> (D) is very special.  Ideally, we would specify what happens with the
>>> descriptor if the close call is canceled.  POSIX does not even specify
>>> what the state of file descriptor is after an EINTR error, so it doesn't
>>> say what happens with cancellation, either.  Maybe we have to leave that
>>> undefined.
>>
>> Indeed even after the POSIX close language clarification [1], the specification 
>> is messy (it really does not help that some system does have interruptible
>> close implementation that might return -1).
>>
>> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux
>> since it shouldn't return -1/EINTR in any case (and I don't know if/when
>> there are kernels that actually does it) [2].  So assuming the nptl is Linux 
>> specific, we can assume that, although close is specified as a cancellable
>> entrypoint, it won't happen in practice.
>>
>> [1] http://austingroupbugs.net/view.php?id=529
>> [2] https://lwn.net/Articles/576478/
> 
> I think this is slightly different from the EINTR case.  With
> SA_RESTART, perhaps the signal handler can run *before* the descriptor
> is closed, so if we never return from the handler, the descriptor leaks?

My understanding is the cancellation handler for close won't interrupt the
syscall, regardless SA_RESTART. But indeed it might be the case where the
signal is acted just before the syscall is issue and after the test for
CANCELED_BITMASK is done.  In this case the thread will be cancelled and
the descriptor leaked.  The musl handler SYS_close differently by not using
the syscall bridge, so cancellation can't be acted upon and so close always
succeed. On the tst-cancel28.c, pthread cancellation is explicit disable
before calling close.

[1] https://lkml.org/lkml/2016/3/9/1065
[2] https://lkml.org/lkml/2016/3/10/424
  
Florian Weimer Aug. 30, 2019, 11:51 a.m. UTC | #14
* Adhemerval Zanella:

> On 26/08/2019 07:06, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>>> In the new implementation, if there is a signal handler and a signal for
>>>> it is delivered before SIGCANCEL (even if the signal was sent *after*
>>>> pthread_cancel, from the same thread), I do not think there is any
>>>> chance whatsoever that we can hide the behavior difference.  After all,
>>>> SIGCANCEL may not trigger an asynchronous cancellation from the signal
>>>> handler.
>>>
>>> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled
>>> by the signal handle by removing it from ucontext_t signal mask.
>> 
>> Does this mean a thread would lose its ability to be canceled in
>> blocking system calls if it setjmps out of a signal handler?  (Where as
>> sigsetjmp would keep cancellation working.) 

> The setjmp does not really matter here, once the async-cancellation fails
> to act due side-effects SIGCANCEL is removed from thread signal mask. The
> idea of making it sticky is always allow the thread to act on such cases
> regardless whether it calls another cancellation entrypoint.

I see.  But why are we doing this?  Can't we suppress the cancel signal
inside pthread_cancel.  Then we don't have to worry about the location
of the signal mask in the signal context.

> The thread will still be cancelled in the next cancellation entrypoint
> due it is marked as CANCELED_BIT.  The __syscall_cancel_arch will check
> for the bit and call __do_cancel before issuing the syscall.

Sure, but we can exit the SIGCANCEL handler if we detect that
cancellation is already in progress.

>>>> System calls that are cancellation points appear to fall into these
>>>> categories:
>>>>
>>>> (A) Those that do not perform any resource allocation (write, read).
>>>>
>>>> (B) Those that perform resource allocation, but the allocation can be
>>>>     easily reverted (openat, accept4).
>>>>
>>>> (C) Those that perform resource allocation, but the allocation is
>>>>     difficult to undo (recvmsg with descriptor passing).
>>>>
>>>> (D) close.
>> 
>>>> For (B), maybe we should undo the resource allocation and then proceed
>>>> to act on the cancellation.
>>>
>>> Besides this requires a lot of more complexity by mapping what kind of
>>> resources each syscall would require to free and when to free based on
>>> returned codes, it would hide it from the application. My view we just
>>> need to return that the syscall has visible side-effects that may result
>>> in system resources leaks and let the application deal with.
>> 
>> My concern is that there might be no further cancellation point, and if
>> we do not cancel in the case of (B), we might fail to act on the
>> cancellation even though the canceled thread was blocked at a
>> cancellation point when pthread_cancel was called.
>
> I shared your concern, but my understanding of the whole BZ#12683
> issue is we can't really cancel the thread for the (B) case.

Even if we undo the resource allocation before that?  That is, check
that a cancellation has occurred on the system call return path, and if
it has, close the descriptor and start unwinding.

>> I totally agree that the descriptor must not leak.  However,
>> tst-cancel28 is written in such a way that it does not expect an error
>> from open:
>> 
>> static void *
>> leaker (void *arg)
>> {
>>   int fd = open (arg, O_RDONLY);
>>   pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0);
>>   close (fd);
>>   return NULL;
>> }
>> 
>> It would be clearer if it actually used error checking, but we call
>> close (-1) on error, which is invalid.
>
> We can hardness the test to check for open failures (due file descriptor
> exhaustion or other arcane issue), but the point of the test is fd is
> always for the *close* call.  If the open syscall is interrupted before the
> file descriptor is actually opened, the thread will be cancelled before
> close call.  Otherwise, the kernel will have to return a valid value
> and indicate on the PC from the ucontext_t that the syscall has side-effect.

My point is that the test should still be valid without the close call
because the open is always canceled according to POSIX.

>> If this test represents how application code is actually written, I
>> think we really need to check for cancellation before returning the file
>> descriptor.
>> 
>>> If glibc starts to add extra semantics for cancellation, we will have
>>> a standard extensions since now 'open', for instance, would close the
>>> file descriptor if the cancellation signal is delivered after the OS
>>> allocates the file descriptor resource.  IMHO this is clearly what the
>>> program itself should be handling, because it might the case where it
>>> is prepared to close itself the file descriptor in a different thread.
>> 
>> open is a cancellation point, so according to POSIX, the thread is
>> canceled if the open call completes after pthread_cancel returns.  (Due
>> to integration with the memory model, thread ordering issues are
>> difficult to reason about in POSIX, though.)  I do not think we should
>> return a descriptor to the application that can only exist because of
>> something that happened *after* the pthread_cancel call.
>
> That's not the interpretation I see, the cancellation description for
> such cases is afaiu:
>
> "The side-effects of acting upon a cancellation request while suspended
> during a call of a function are the same as the side-effects that may be
> seen in a single-threaded program when a call to a function is interrupted
> by a signal and the given function returns [EINTR]"
>
> And on signal concepts:
>
> "[EINTR]
> Interrupted function call. An asynchronous signal was caught by the process 
> during the execution of an interruptible function. If the signal handler 
> performs a normal return, the interrupted function call may return this 
> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)."
>
> And my understanding is, since glibc implemented cancellation using signals
> and it requires to use SA_RESTART we can actually return to user if the
> syscall has side-effects visible to caller.

The open function cannot allocate a new file descriptor and fail with
EINTR.  So I don't see how this case applies here.  Sorry.

>>>> (D) is very special.  Ideally, we would specify what happens with the
>>>> descriptor if the close call is canceled.  POSIX does not even specify
>>>> what the state of file descriptor is after an EINTR error, so it doesn't
>>>> say what happens with cancellation, either.  Maybe we have to leave that
>>>> undefined.
>>>
>>> Indeed even after the POSIX close language clarification [1], the specification 
>>> is messy (it really does not help that some system does have interruptible
>>> close implementation that might return -1).
>>>
>>> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux
>>> since it shouldn't return -1/EINTR in any case (and I don't know if/when
>>> there are kernels that actually does it) [2].  So assuming the nptl is Linux 
>>> specific, we can assume that, although close is specified as a cancellable
>>> entrypoint, it won't happen in practice.
>>>
>>> [1] http://austingroupbugs.net/view.php?id=529
>>> [2] https://lwn.net/Articles/576478/
>> 
>> I think this is slightly different from the EINTR case.  With
>> SA_RESTART, perhaps the signal handler can run *before* the descriptor
>> is closed, so if we never return from the handler, the descriptor leaks?
>
> My understanding is the cancellation handler for close won't interrupt the
> syscall, regardless SA_RESTART.

What do you mean by “interrupt” here?  That close will not return with
EINTR?  Or that the signal handler does not run?

> But indeed it might be the case where the
> signal is acted just before the syscall is issue and after the test for
> CANCELED_BITMASK is done.  In this case the thread will be cancelled and
> the descriptor leaked.  The musl handler SYS_close differently by not using
> the syscall bridge, so cancellation can't be acted upon and so close always
> succeed.

That seems reasonable.

If we check for cancellation after the close system call (successful or
not), then I think we have a compliant, leak-free close implementation.

Thanks,
Florian
  
Rich Felker Aug. 30, 2019, 1:51 p.m. UTC | #15
On Fri, Aug 23, 2019 at 02:47:39PM +0200, Florian Weimer wrote:
> I looked at this some more.  It seems that in theory, we could also see
> EPIPE in the old implementation or at least observe execution of a
> signal handler, and the asynchronous cancel hides that.
> 
> In the new implementation, if there is a signal handler and a signal for
> it is delivered before SIGCANCEL (even if the signal was sent *after*
> pthread_cancel, from the same thread), I do not think there is any
> chance whatsoever that we can hide the behavior difference.  After all,
> SIGCANCEL may not trigger an asynchronous cancellation from the signal
> handler.
> 
> System calls that are cancellation points appear to fall into these
> categories:
> 
> (A) Those that do not perform any resource allocation (write, read).
> 
> (B) Those that perform resource allocation, but the allocation can be
>     easily reverted (openat, accept4).
> 
> (C) Those that perform resource allocation, but the allocation is
>     difficult to undo (recvmsg with descriptor passing).
> 
> (D) close.
> 
> For (A), I think POSIX allows sufficient wiggle room so that we can
> exhibit a partial side effect and still act on the cancellation (without
> reporting a partial write first to the application).

No. POSIX is very clear that cancellation is only permitted to be
acted on when the side effects are the same as EINTR, which can only
happen for read/write when no data has been transferred yet. If data
has already been transferred, the only option not ruled out by one or
more constraints is to return successfully with a short read or write;
pending cancellation will then be handled on a subsequent call to a
cancellation point. (If you're doing a retry loop until the total
length n is transferred, then in the next iteration of the loop.)

Of course this case of non-conformance would be less severe than UAF
or resource leak errors in most cases, but it's still not good.

> For (B), maybe we should undo the resource allocation and then proceed
> to act on the cancellation.

No, why would you do that? In general it's not possible (think of
things like O_TRUNC), and even if it were it's more work. You just
don't act on the cancellation if success has already happened.

I don't see how/why you're characterizing "accept4" as "can easily be
reverted". Closing the accepted socket would *drop a connection*
rather than leaving the connection pending for another thread calling
accept to receive.

> For (C), the complexity may not be worth it.
> 
> For (A), (B), (C), we can act on the cancellation in the error case,
> after we observe the cancellation flag in the signal handler trampoline.
> (I think dropping the EINTR restriction from there achieves that.)  If
> we do that, we do not need to change the test case.
> 
> (D) is very special.  Ideally, we would specify what happens with the
> descriptor if the close call is canceled.  POSIX does not even specify
> what the state of file descriptor is after an EINTR error, so it doesn't
> say what happens with cancellation, either.  Maybe we have to leave that
> undefined.

Failure to specify EINTR was an error in POSIX that's been fixed for
the next issue. Leaving it undefined is atrociously bad and results in
UAF type errors. Even if glibc doesn't want to apply the fix for
EINTR, cancellation of close absolutely needs to behave *as if* by the
newly-specified behavior for close with EINTR.

Rich
  
Adhemerval Zanella Netto Aug. 30, 2019, 2:05 p.m. UTC | #16
On 30/08/2019 08:51, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 26/08/2019 07:06, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>>> In the new implementation, if there is a signal handler and a signal for
>>>>> it is delivered before SIGCANCEL (even if the signal was sent *after*
>>>>> pthread_cancel, from the same thread), I do not think there is any
>>>>> chance whatsoever that we can hide the behavior difference.  After all,
>>>>> SIGCANCEL may not trigger an asynchronous cancellation from the signal
>>>>> handler.
>>>>
>>>> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled
>>>> by the signal handle by removing it from ucontext_t signal mask.
>>>
>>> Does this mean a thread would lose its ability to be canceled in
>>> blocking system calls if it setjmps out of a signal handler?  (Where as
>>> sigsetjmp would keep cancellation working.) 
> 
>> The setjmp does not really matter here, once the async-cancellation fails
>> to act due side-effects SIGCANCEL is removed from thread signal mask. The
>> idea of making it sticky is always allow the thread to act on such cases
>> regardless whether it calls another cancellation entrypoint.
> 
> I see.  But why are we doing this?  Can't we suppress the cancel signal
> inside pthread_cancel.  Then we don't have to worry about the location
> of the signal mask in the signal context.

I guess we can, it would require a sigprocmask to obtain the current
mask, and another one to remove the cancellation signal.  I still think
it cleaner to just sets on the signal handler, since the kernel will
restore the signal handler after the userspace returns.

> 
>> The thread will still be cancelled in the next cancellation entrypoint
>> due it is marked as CANCELED_BIT.  The __syscall_cancel_arch will check
>> for the bit and call __do_cancel before issuing the syscall.
> 
> Sure, but we can exit the SIGCANCEL handler if we detect that
> cancellation is already in progress.

I think it is possible, but I see that enforcing it using the kernel facilities
to avoid extra code in userspace is cleaner.

> 
>>>>> System calls that are cancellation points appear to fall into these
>>>>> categories:
>>>>>
>>>>> (A) Those that do not perform any resource allocation (write, read).
>>>>>
>>>>> (B) Those that perform resource allocation, but the allocation can be
>>>>>     easily reverted (openat, accept4).
>>>>>
>>>>> (C) Those that perform resource allocation, but the allocation is
>>>>>     difficult to undo (recvmsg with descriptor passing).
>>>>>
>>>>> (D) close.
>>>
>>>>> For (B), maybe we should undo the resource allocation and then proceed
>>>>> to act on the cancellation.
>>>>
>>>> Besides this requires a lot of more complexity by mapping what kind of
>>>> resources each syscall would require to free and when to free based on
>>>> returned codes, it would hide it from the application. My view we just
>>>> need to return that the syscall has visible side-effects that may result
>>>> in system resources leaks and let the application deal with.
>>>
>>> My concern is that there might be no further cancellation point, and if
>>> we do not cancel in the case of (B), we might fail to act on the
>>> cancellation even though the canceled thread was blocked at a
>>> cancellation point when pthread_cancel was called.
>>
>> I shared your concern, but my understanding of the whole BZ#12683
>> issue is we can't really cancel the thread for the (B) case.
> 
> Even if we undo the resource allocation before that?  That is, check
> that a cancellation has occurred on the system call return path, and if
> it has, close the descriptor and start unwinding.

What if the application requires extra steps for the case the syscall
returns a side-effects? What about some specific syscalls, such as
passing file descriptor using recvmsg? It would need the libc to map
*all* possible side-effects for each syscall, determine which might
leak side-effect and acts uppon then. What if the kernel, at some version,
added another feature to share resources that might create side-effects?
We will need to constantly map such cases, adding even more arch-specific
glue code to handle it.

I really think we should not go on this path.

> 
>>> I totally agree that the descriptor must not leak.  However,
>>> tst-cancel28 is written in such a way that it does not expect an error
>>> from open:
>>>
>>> static void *
>>> leaker (void *arg)
>>> {
>>>   int fd = open (arg, O_RDONLY);
>>>   pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0);
>>>   close (fd);
>>>   return NULL;
>>> }
>>>
>>> It would be clearer if it actually used error checking, but we call
>>> close (-1) on error, which is invalid.
>>
>> We can hardness the test to check for open failures (due file descriptor
>> exhaustion or other arcane issue), but the point of the test is fd is
>> always for the *close* call.  If the open syscall is interrupted before the
>> file descriptor is actually opened, the thread will be cancelled before
>> close call.  Otherwise, the kernel will have to return a valid value
>> and indicate on the PC from the ucontext_t that the syscall has side-effect.
> 
> My point is that the test should still be valid without the close call
> because the open is always canceled according to POSIX.
> 
>>> If this test represents how application code is actually written, I
>>> think we really need to check for cancellation before returning the file
>>> descriptor.
>>>
>>>> If glibc starts to add extra semantics for cancellation, we will have
>>>> a standard extensions since now 'open', for instance, would close the
>>>> file descriptor if the cancellation signal is delivered after the OS
>>>> allocates the file descriptor resource.  IMHO this is clearly what the
>>>> program itself should be handling, because it might the case where it
>>>> is prepared to close itself the file descriptor in a different thread.
>>>
>>> open is a cancellation point, so according to POSIX, the thread is
>>> canceled if the open call completes after pthread_cancel returns.  (Due
>>> to integration with the memory model, thread ordering issues are
>>> difficult to reason about in POSIX, though.)  I do not think we should
>>> return a descriptor to the application that can only exist because of
>>> something that happened *after* the pthread_cancel call.
>>
>> That's not the interpretation I see, the cancellation description for
>> such cases is afaiu:
>>
>> "The side-effects of acting upon a cancellation request while suspended
>> during a call of a function are the same as the side-effects that may be
>> seen in a single-threaded program when a call to a function is interrupted
>> by a signal and the given function returns [EINTR]"
>>
>> And on signal concepts:
>>
>> "[EINTR]
>> Interrupted function call. An asynchronous signal was caught by the process 
>> during the execution of an interruptible function. If the signal handler 
>> performs a normal return, the interrupted function call may return this 
>> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)."
>>
>> And my understanding is, since glibc implemented cancellation using signals
>> and it requires to use SA_RESTART we can actually return to user if the
>> syscall has side-effects visible to caller.
> 
> The open function cannot allocate a new file descriptor and fail with
> EINTR.  So I don't see how this case applies here.  Sorry.

But that is *exactly* the point. The open *does not* fail, but cancellation acts
*before* it can return to the callee.  We can't really make it fail with EINTR
and acts on the cancellation. The idea is "the interrupted function call may 
return this condition".

> 
>>>>> (D) is very special.  Ideally, we would specify what happens with the
>>>>> descriptor if the close call is canceled.  POSIX does not even specify
>>>>> what the state of file descriptor is after an EINTR error, so it doesn't
>>>>> say what happens with cancellation, either.  Maybe we have to leave that
>>>>> undefined.
>>>>
>>>> Indeed even after the POSIX close language clarification [1], the specification 
>>>> is messy (it really does not help that some system does have interruptible
>>>> close implementation that might return -1).
>>>>
>>>> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux
>>>> since it shouldn't return -1/EINTR in any case (and I don't know if/when
>>>> there are kernels that actually does it) [2].  So assuming the nptl is Linux 
>>>> specific, we can assume that, although close is specified as a cancellable
>>>> entrypoint, it won't happen in practice.
>>>>
>>>> [1] http://austingroupbugs.net/view.php?id=529
>>>> [2] https://lwn.net/Articles/576478/
>>>
>>> I think this is slightly different from the EINTR case.  With
>>> SA_RESTART, perhaps the signal handler can run *before* the descriptor
>>> is closed, so if we never return from the handler, the descriptor leaks?
>>
>> My understanding is the cancellation handler for close won't interrupt the
>> syscall, regardless SA_RESTART.
> 
> What do you mean by “interrupt” here?  That close will not return with
> EINTR?  Or that the signal handler does not run?

My understanding on the discussions is Linux close implementation does not
return EINTR/-1 in any case (and if it did in the past it was a bug).

> 
>> But indeed it might be the case where the
>> signal is acted just before the syscall is issue and after the test for
>> CANCELED_BITMASK is done.  In this case the thread will be cancelled and
>> the descriptor leaked.  The musl handler SYS_close differently by not using
>> the syscall bridge, so cancellation can't be acted upon and so close always
>> succeed.
> 
> That seems reasonable.
> 
> If we check for cancellation after the close system call (successful or
> not), then I think we have a compliant, leak-free close implementation.

In my understanding musl does this to prevent possible problematic cancel
syscall implementation that might return -1/EINTR (since for this case
syscall will be restart due SA_RESTART and cancellation won't acted upon).
  
Florian Weimer Sept. 2, 2019, 12:37 p.m. UTC | #17
* Rich Felker:

> On Fri, Aug 23, 2019 at 02:47:39PM +0200, Florian Weimer wrote:
>> I looked at this some more.  It seems that in theory, we could also see
>> EPIPE in the old implementation or at least observe execution of a
>> signal handler, and the asynchronous cancel hides that.
>> 
>> In the new implementation, if there is a signal handler and a signal for
>> it is delivered before SIGCANCEL (even if the signal was sent *after*
>> pthread_cancel, from the same thread), I do not think there is any
>> chance whatsoever that we can hide the behavior difference.  After all,
>> SIGCANCEL may not trigger an asynchronous cancellation from the signal
>> handler.
>> 
>> System calls that are cancellation points appear to fall into these
>> categories:
>> 
>> (A) Those that do not perform any resource allocation (write, read).
>> 
>> (B) Those that perform resource allocation, but the allocation can be
>>     easily reverted (openat, accept4).
>> 
>> (C) Those that perform resource allocation, but the allocation is
>>     difficult to undo (recvmsg with descriptor passing).
>> 
>> (D) close.
>> 
>> For (A), I think POSIX allows sufficient wiggle room so that we can
>> exhibit a partial side effect and still act on the cancellation (without
>> reporting a partial write first to the application).
>
> No. POSIX is very clear that cancellation is only permitted to be
> acted on when the side effects are the same as EINTR, which can only
> happen for read/write when no data has been transferred yet. If data
> has already been transferred, the only option not ruled out by one or
> more constraints is to return successfully with a short read or write;
> pending cancellation will then be handled on a subsequent call to a
> cancellation point. (If you're doing a retry loop until the total
> length n is transferred, then in the next iteration of the loop.)

This seems to be a backwards interpretation to me.  I think POSIX wants
to describe what can happen during cancellation, and not say that any
cancellation point can be skipped if some side effect happens that
cannot happen before EINTR.

> Of course this case of non-conformance would be less severe than UAF
> or resource leak errors in most cases, but it's still not good.

Agreed.  And I think not acting on the cancellation and returning
success is also not conforming.

The main problem is that POSIX would need to specify a memory model in
order to make the current language (“side-effects occur before any
cancellation cleanup handlers are called”, “a thread is canceled while
executing the function”, “a cancellation request has been made”, etc.).

>> For (B), maybe we should undo the resource allocation and then proceed
>> to act on the cancellation.
>
> No, why would you do that? In general it's not possible (think of
> things like O_TRUNC), and even if it were it's more work. You just
> don't act on the cancellation if success has already happened.

The problem is that the implementation reorders cancellation requests
fairly arbitrarily.  I don't think that's what the authors of POSIX had
in mind when they said that the cancellation itself progresses
asynchronously.  I don't think that can be fixed without kernel support,
though.  We'll see eventually how this impacts existing applications.

> I don't see how/why you're characterizing "accept4" as "can easily be
> reverted". Closing the accepted socket would *drop a connection*
> rather than leaving the connection pending for another thread calling
> accept to receive.

And this is true for open as well.  For FIFOs, there is an observable
side effect, likewise for rewinding tape devices, and even for regular
files, the close call affects POSIX advisory locks on the file.  Oh
well.  So it's just not possible to unwind directly after file
descriptor allocation.

>> For (C), the complexity may not be worth it.
>> 
>> For (A), (B), (C), we can act on the cancellation in the error case,
>> after we observe the cancellation flag in the signal handler trampoline.
>> (I think dropping the EINTR restriction from there achieves that.)  If
>> we do that, we do not need to change the test case.
>> 
>> (D) is very special.  Ideally, we would specify what happens with the
>> descriptor if the close call is canceled.  POSIX does not even specify
>> what the state of file descriptor is after an EINTR error, so it doesn't
>> say what happens with cancellation, either.  Maybe we have to leave that
>> undefined.
>
> Failure to specify EINTR was an error in POSIX that's been fixed for
> the next issue. Leaving it undefined is atrociously bad and results in
> UAF type errors. Even if glibc doesn't want to apply the fix for
> EINTR, cancellation of close absolutely needs to behave *as if* by the
> newly-specified behavior for close with EINTR.

Can we even make sure we act on the cancellation only if the descriptor
has been closed?

Thanks,
Florian
  
Florian Weimer Sept. 2, 2019, 12:51 p.m. UTC | #18
* Adhemerval Zanella:

> I think it is possible, but I see that enforcing it using the kernel
> facilities to avoid extra code in userspace is cleaner.

Okay.

>> Even if we undo the resource allocation before that?  That is, check
>> that a cancellation has occurred on the system call return path, and if
>> it has, close the descriptor and start unwinding.
>
> What if the application requires extra steps for the case the syscall
> returns a side-effects? What about some specific syscalls, such as
> passing file descriptor using recvmsg? It would need the libc to map
> *all* possible side-effects for each syscall, determine which might
> leak side-effect and acts uppon then. What if the kernel, at some version,
> added another feature to share resources that might create side-effects?
> We will need to constantly map such cases, adding even more arch-specific
> glue code to handle it.
>
> I really think we should not go on this path.

I agree now, given what we've got from the kernel.  See my response to
Rich.

>>> That's not the interpretation I see, the cancellation description for
>>> such cases is afaiu:
>>>
>>> "The side-effects of acting upon a cancellation request while suspended
>>> during a call of a function are the same as the side-effects that may be
>>> seen in a single-threaded program when a call to a function is interrupted
>>> by a signal and the given function returns [EINTR]"
>>>
>>> And on signal concepts:
>>>
>>> "[EINTR]
>>> Interrupted function call. An asynchronous signal was caught by the process 
>>> during the execution of an interruptible function. If the signal handler 
>>> performs a normal return, the interrupted function call may return this 
>>> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)."
>>>
>>> And my understanding is, since glibc implemented cancellation using signals
>>> and it requires to use SA_RESTART we can actually return to user if the
>>> syscall has side-effects visible to caller.
>> 
>> The open function cannot allocate a new file descriptor and fail with
>> EINTR.  So I don't see how this case applies here.  Sorry.
>
> But that is *exactly* the point. The open *does not* fail, but
> cancellation acts *before* it can return to the callee.  We can't
> really make it fail with EINTR and acts on the cancellation. The idea
> is "the interrupted function call may return this condition".

My uneasiness stems from the fact that we're reading POSIX backwards,
i.e., our interpretation (necessarily, given the kernel interfaces)
behaves in a way that is obviously to complying with a direct reading of
POSIX (a to-be-canceled thread synchronizing with the canceling thread,
when the pthread_cancel call happened before the synchronization in
program order in the canceling thread), and we then proceed to find
reasons why this not totally broken.

So, back to the start of the thread: Can we avoid the EPIPE error in an
other way?

Can we remove this line?

  /* This will cause the write in the child to return.  */
  close (fd[0]);

Cancellation should still happen if the write is blocked, right?  It's
not really clear to me what the intent of the test was.

Thanks,
Florian
  
Szabolcs Nagy Sept. 2, 2019, 6:35 p.m. UTC | #19
On 02/09/2019 13:37, Florian Weimer wrote:
> * Rich Felker:

>> No. POSIX is very clear that cancellation is only permitted to be

>> acted on when the side effects are the same as EINTR, which can only

>> happen for read/write when no data has been transferred yet. If data

>> has already been transferred, the only option not ruled out by one or

>> more constraints is to return successfully with a short read or write;

>> pending cancellation will then be handled on a subsequent call to a

>> cancellation point. (If you're doing a retry loop until the total

>> length n is transferred, then in the next iteration of the loop.)

> 

> This seems to be a backwards interpretation to me.  I think POSIX wants

> to describe what can happen during cancellation, and not say that any

> cancellation point can be skipped if some side effect happens that

> cannot happen before EINTR.

> 

>> Of course this case of non-conformance would be less severe than UAF

>> or resource leak errors in most cases, but it's still not good.

> 

> Agreed.  And I think not acting on the cancellation and returning

> success is also not conforming.


how is that non-conformance observable?

the user cannot know if the cancellation
request arrived in time or not if io
error (short write) can happen for other
reasons than the cancellation signal.
  
Rich Felker Sept. 2, 2019, 11:26 p.m. UTC | #20
On Mon, Sep 02, 2019 at 02:37:17PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Fri, Aug 23, 2019 at 02:47:39PM +0200, Florian Weimer wrote:
> >> I looked at this some more.  It seems that in theory, we could also see
> >> EPIPE in the old implementation or at least observe execution of a
> >> signal handler, and the asynchronous cancel hides that.
> >> 
> >> In the new implementation, if there is a signal handler and a signal for
> >> it is delivered before SIGCANCEL (even if the signal was sent *after*
> >> pthread_cancel, from the same thread), I do not think there is any
> >> chance whatsoever that we can hide the behavior difference.  After all,
> >> SIGCANCEL may not trigger an asynchronous cancellation from the signal
> >> handler.
> >> 
> >> System calls that are cancellation points appear to fall into these
> >> categories:
> >> 
> >> (A) Those that do not perform any resource allocation (write, read).
> >> 
> >> (B) Those that perform resource allocation, but the allocation can be
> >>     easily reverted (openat, accept4).
> >> 
> >> (C) Those that perform resource allocation, but the allocation is
> >>     difficult to undo (recvmsg with descriptor passing).
> >> 
> >> (D) close.
> >> 
> >> For (A), I think POSIX allows sufficient wiggle room so that we can
> >> exhibit a partial side effect and still act on the cancellation (without
> >> reporting a partial write first to the application).
> >
> > No. POSIX is very clear that cancellation is only permitted to be
> > acted on when the side effects are the same as EINTR, which can only
> > happen for read/write when no data has been transferred yet. If data
> > has already been transferred, the only option not ruled out by one or
> > more constraints is to return successfully with a short read or write;
> > pending cancellation will then be handled on a subsequent call to a
> > cancellation point. (If you're doing a retry loop until the total
> > length n is transferred, then in the next iteration of the loop.)
> 
> This seems to be a backwards interpretation to me.  I think POSIX wants
> to describe what can happen during cancellation, and not say that any
> cancellation point can be skipped if some side effect happens that
> cannot happen before EINTR.

The exact text is:

    "The side-effects of acting upon a cancellation request while
    suspended during a call of a function are the same as the
    side-effects that may be seen in a single-threaded program when a
    call to a function is interrupted by a signal and the given
    function returns [EINTR]."

Admittedly it took me a while to understand this back when I first
started investigating the problem in 2010, but based on both careful
reading of this text, and on reasoning about what conditions are
*absolutely necessary* to make the cancellation functionality usable
in any meaningful way (without race conditions), yes it means that if
you act on cancellation, the [only] side effects the cancelled
function can have are side effects that would be allowed if the
function had failed with EINTR.

In hindsight, the main obstacle to my understanding this was bias
based on how badly wrong all existing implementations I'd seen were,
and resistance to believing they were all so badly wrong. But they
were/are.

Moreover, the specification gives the implementation liberty *not* to
act on cancellation in certain conditions (very end of XSH 2.9.5
Thread Cancellation, subheading Cancellation points):

    "It is unspecified whether the cancellation request is acted upon
    or whether the cancellation request remains pending and the thread
    resumes normal execution if:

    - The thread is suspended at a cancellation point and the event for
    which it is waiting occurs

    - A specified timeout expired

    before the cancellation request is acted upon."

The way I read this is that, if you start processing a cancellation
request, but see that the timeout has already expired or that whatever
the function was blocked waiting for has already happened, you can
abort cancellation and leave it pending and return the result.

In an implementation where libc/libpthread cancellation and the kernel
are tightly coupled, this is really a choice -- for example the kernel
could check at the last point before success during SYS_read to see
that cancellation is pending, and *not actually transfer the data*,
but instead initiate cancellation with no side effects on the open
file. But on an implementation like Linux where libc/libpthread and
the cancellation implementation is completely decoupled from the
kernel and implemented via signals (which is a hack!), you don't
actually have this freedom. By the time you see the signal, it's
possible that the syscall *already succeeded* with (potentially
serious) side effects, and in that case you have to use the option
POSIX gives you here to defer cancellation until the next cancellation
point. (If POSIX did not give you this option, then it may not even be
possible to do a conforming implementation based on the signal hack).

> > Of course this case of non-conformance would be less severe than UAF
> > or resource leak errors in most cases, but it's still not good.
> 
> Agreed.  And I think not acting on the cancellation and returning
> success is also not conforming.

It is conforming, by the above-quoted passage.

> >> For (C), the complexity may not be worth it.
> >> 
> >> For (A), (B), (C), we can act on the cancellation in the error case,
> >> after we observe the cancellation flag in the signal handler trampoline.
> >> (I think dropping the EINTR restriction from there achieves that.)  If
> >> we do that, we do not need to change the test case.
> >> 
> >> (D) is very special.  Ideally, we would specify what happens with the
> >> descriptor if the close call is canceled.  POSIX does not even specify
> >> what the state of file descriptor is after an EINTR error, so it doesn't
> >> say what happens with cancellation, either.  Maybe we have to leave that
> >> undefined.
> >
> > Failure to specify EINTR was an error in POSIX that's been fixed for
> > the next issue. Leaving it undefined is atrociously bad and results in
> > UAF type errors. Even if glibc doesn't want to apply the fix for
> > EINTR, cancellation of close absolutely needs to behave *as if* by the
> > newly-specified behavior for close with EINTR.
> 
> Can we even make sure we act on the cancellation only if the descriptor
> has been closed?

If the descriptor has been closed, you *can't* act on cancellation. On
Linux, SYS_close *never* returns to userspace without the fd having
been closed. It can return EINTR, indicating that the fd has been
closed but the underlying open file description (if this was the last
reference to it) has not, but this EINTR is not POSIX conforming with
the new changes to POSIX, and it was always inconsistent with the
semantics implied by EINTR ("no nontrivial side effects have happened
yet"). See https://sourceware.org/bugzilla/show_bug.cgi?id=14627

In any case, this is easy to deal with for cancellation, even if
fixing the EINTR issue is controversial. If PC is before or at the
syscall instruction, nothing has happened yet (fd is still open) and
you can act on cancellation. If PC is after the syscall instruction,
you can't act on it.

This is actually the same as all other syscalls, in some sense, except
that for (at least a few) other syscalls you want to also act on
cancellation if EINTR is returned, because a few of them EINTR on
receipt of any signal, including SA_RESTART ones like SIGCANCEL.
Historically there were lots like this, but most of them were
conformance bugs and fixed. I think select and/or poll is still like
that intentionally, but I may be misremembering.

Rich
  
Adhemerval Zanella Netto Sept. 3, 2019, 12:15 p.m. UTC | #21
On 02/09/2019 20:26, Rich Felker wrote:
> If the descriptor has been closed, you *can't* act on cancellation. On
> Linux, SYS_close *never* returns to userspace without the fd having
> been closed. It can return EINTR, indicating that the fd has been
> closed but the underlying open file description (if this was the last
> reference to it) has not, but this EINTR is not POSIX conforming with
> the new changes to POSIX, and it was always inconsistent with the
> semantics implied by EINTR ("no nontrivial side effects have happened
> yet"). See https://sourceware.org/bugzilla/show_bug.cgi?id=14627

So is it true even for newer version? I had the impression from the lwn
article [1] that this is driver dependent and this behaviour will eventually
be removed from kernel to not allow close to return EINTR.

About BZ#14627, does it still make sense for *Linux* to map EINTR to
EINPROGRESS? 

[1] https://lwn.net/Articles/576478/
  
Rich Felker Sept. 3, 2019, 12:24 p.m. UTC | #22
On Tue, Sep 03, 2019 at 09:15:58AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 02/09/2019 20:26, Rich Felker wrote:
> > If the descriptor has been closed, you *can't* act on cancellation. On
> > Linux, SYS_close *never* returns to userspace without the fd having
> > been closed. It can return EINTR, indicating that the fd has been
> > closed but the underlying open file description (if this was the last
> > reference to it) has not, but this EINTR is not POSIX conforming with
> > the new changes to POSIX, and it was always inconsistent with the
> > semantics implied by EINTR ("no nontrivial side effects have happened
> > yet"). See https://sourceware.org/bugzilla/show_bug.cgi?id=14627
> 
> So is it true even for newer version? I had the impression from the lwn
> article [1] that this is driver dependent and this behaviour will eventually
> be removed from kernel to not allow close to return EINTR.
> 
> About BZ#14627, does it still make sense for *Linux* to map EINTR to
> EINPROGRESS? 
> 
> [1] https://lwn.net/Articles/576478/

The kernel folks are adamantly opposed to changing SYS_close to match
the POSIX requirement here, which is the right thing to do but for the
wrong reasons -- if it were changed, libc could not safely handle the
situation because it couldn't know what meaning the kernel was
assigning to EINTR.

Hopefully they will just stop returning EINTR entirely. That would not
break any of the behavior discussed above. Returning EINPROGRESS may
be a problem though because applications may not know to expect it and
may treat it as a hard error rather than a reasonable condition
(discussed in https://sourceware.org/bugzilla/show_bug.cgi?id=14627).
The ideal behavior is probably just returning 0.

Rich
  
Adhemerval Zanella Netto Sept. 3, 2019, 6:17 p.m. UTC | #23
On 02/09/2019 09:51, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> I think it is possible, but I see that enforcing it using the kernel
>> facilities to avoid extra code in userspace is cleaner.
> 
> Okay.
> 
>>> Even if we undo the resource allocation before that?  That is, check
>>> that a cancellation has occurred on the system call return path, and if
>>> it has, close the descriptor and start unwinding.
>>
>> What if the application requires extra steps for the case the syscall
>> returns a side-effects? What about some specific syscalls, such as
>> passing file descriptor using recvmsg? It would need the libc to map
>> *all* possible side-effects for each syscall, determine which might
>> leak side-effect and acts uppon then. What if the kernel, at some version,
>> added another feature to share resources that might create side-effects?
>> We will need to constantly map such cases, adding even more arch-specific
>> glue code to handle it.
>>
>> I really think we should not go on this path.
> 
> I agree now, given what we've got from the kernel.  See my response to
> Rich.
> 
>>>> That's not the interpretation I see, the cancellation description for
>>>> such cases is afaiu:
>>>>
>>>> "The side-effects of acting upon a cancellation request while suspended
>>>> during a call of a function are the same as the side-effects that may be
>>>> seen in a single-threaded program when a call to a function is interrupted
>>>> by a signal and the given function returns [EINTR]"
>>>>
>>>> And on signal concepts:
>>>>
>>>> "[EINTR]
>>>> Interrupted function call. An asynchronous signal was caught by the process 
>>>> during the execution of an interruptible function. If the signal handler 
>>>> performs a normal return, the interrupted function call may return this 
>>>> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)."
>>>>
>>>> And my understanding is, since glibc implemented cancellation using signals
>>>> and it requires to use SA_RESTART we can actually return to user if the
>>>> syscall has side-effects visible to caller.
>>>
>>> The open function cannot allocate a new file descriptor and fail with
>>> EINTR.  So I don't see how this case applies here.  Sorry.
>>
>> But that is *exactly* the point. The open *does not* fail, but
>> cancellation acts *before* it can return to the callee.  We can't
>> really make it fail with EINTR and acts on the cancellation. The idea
>> is "the interrupted function call may return this condition".
> 
> My uneasiness stems from the fact that we're reading POSIX backwards,
> i.e., our interpretation (necessarily, given the kernel interfaces)
> behaves in a way that is obviously to complying with a direct reading of
> POSIX (a to-be-canceled thread synchronizing with the canceling thread,
> when the pthread_cancel call happened before the synchronization in
> program order in the canceling thread), and we then proceed to find
> reasons why this not totally broken.
> 
> So, back to the start of the thread: Can we avoid the EPIPE error in an
> other way?
> 
> Can we remove this line?
> 
>   /* This will cause the write in the child to return.  */
>   close (fd[0]);
> 
> Cancellation should still happen if the write is blocked, right?  It's
> not really clear to me what the intent of the test was.

I think Rich has answered it better than me that using signals to implement
thread cancellation is an implementation detail and POSIX allows us for this
case to not act on some cancellation signals if it leads to visible 
side-effects that might lead to resource leakage.

Said that, it would be possible to just remove the close call to avoid the
EPIPE error on write.  The cancellation should still happen because the
write is called on a loop, so since it might return with a visible side-effect
(partial write), it would be cancelled in the next write call due the thread
being marked as cancelled.

I really don't have a strong opinion which change is the better, my point is
the current code is racy and it will start to fail more often once we fix
BZ#12683.
  

Patch

diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c
index 1f0429d343..632ea4e0ae 100644
--- a/nptl/tst-cancel2.c
+++ b/nptl/tst-cancel2.c
@@ -20,6 +20,7 @@ 
 #include <signal.h>
 #include <stdio.h>
 #include <unistd.h>
+#include <errno.h>
 
 
 static int fd[2];
@@ -32,7 +33,14 @@  tf (void *arg)
      write blocks.  */
   char buf[100000];
 
-  while (write (fd[1], buf, sizeof (buf)) > 0);
+  while (1)
+    {
+      /* Ignore EPIPE errors for case the SIGPIPE is handle before
+	 SIGCANCEL.  */
+      ssize_t ret = write (fd[1], buf, sizeof (buf));
+      if (ret == 0 || (ret == -1 && errno != EPIPE))
+       break;
+    }
 
   return (void *) 42l;
 }