diff mbox series

[2/2] manual: Document __libc_single_threaded

Message ID	2c218c9ed9586ed5491f6fa08045d1e883b126c3.1589998207.git.fweimer@redhat.com
State	Superseded
Headers	DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 52017385DC32 To: libc-alpha@sourceware.org Subject: [PATCH 2/2] manual: Document __libc_single_threaded In-Reply-To: <cover.1589998207.git.fweimer@redhat.com> References: <cover.1589998207.git.fweimer@redhat.com> Message-Id: <2c218c9ed9586ed5491f6fa08045d1e883b126c3.1589998207.git.fweimer@redhat.com> Date: Wed, 20 May 2020 20:12:59 +0200 User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Precedence: list From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Florian Weimer <fweimer@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>
Series	Add __libc_single_threaded \| [0/2] Add __libc_single_threaded [1/2] Add the __libc_single_threaded variable [2/2] manual: Document __libc_single_threaded

Commit Message

Florian Weimer May 20, 2020, 6:12 p.m. UTC

  ---
 manual/threads.texi | 89 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

Comments

Michael Kerrisk \(man-pages\) May 21, 2020, 7:52 a.m. UTC | #1

On 5/20/20 8:12 PM, Florian Weimer wrote:
> ---
>  manual/threads.texi | 89 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 89 insertions(+)
> 
> diff --git a/manual/threads.texi b/manual/threads.texi
> index a425635179..d4c261a0e9 100644
> --- a/manual/threads.texi
> +++ b/manual/threads.texi
> @@ -627,6 +627,7 @@ the standard.
>  					  threads in a process.
>  * Waiting with Explicit Clocks::          Functions for waiting with an
>                                            explicit clock specification.
> +* Single-Threaded::     Detecting single-threaded execution.
>  @end menu
>  
>  @node Default Thread Attributes
> @@ -771,6 +772,94 @@ Behaves like @code{pthread_timedjoin_np} except that the absolute time in
>  @var{abstime} is measured against the clock specified by @var{clockid}.
>  @end deftypefun
>  
> +@node Single-Threaded
> +@subsubsection Detecting Single-Threaded Execution
> +
> +Multi-threaded programs require synchronization among threads.  This
> +synchronization can be costly even if there is just a single thread
> +and no data is shared between multiple processors.  @Theglibc{} offers
> +an interface to detect whether the process is in single-threaded mode.
> +Applications can use this information to avoid synchronization, for
> +example by using regular instructions to load and store memory instead
> +of atomic instructions, or using relaxed memory ordering instead of
> +stronger memory ordering.
> +
> +@deftypevar char __libc_single_threaded
> +@standards{GNU, sys/single_threaded.h}
> +This variable is non-zero if the current process is definitely
> +single-threaded.  If it is zero, the process can be multi-threaded,

s/can be/may be/

> +or @theglibc{} cannot determine at this point of the program execution
> +whether the process is single-threaded or not.
> +
> +Applications must never write to this variable.
> +@end deftypevar
> +
> +Most applications should perform the same actions whether or not
> +@code{__libc_single_threaded} is true, except with less
> +synchronization.  If this rule is followed, a process that
> +subsequently becomes multi-threaded is already in a consistent state.
> +For example, in order to increment a reference count, the following
> +code can be used:
> +
> +@smallexample
> +if (__libc_single_threaded)
> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
> +else
> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
> +@end smallexample
> +
> +This still requires some form of synchronization on the
> +single-threaded branch, so it can be beneficial not to declare the
> +reference count as @code{_Atomic}, and use the GCC @code{__atomic}
> +built-ins.  @xref{__atomic Builtins,, Built-in Functions for Memory
> +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection
> +(GCC)}.  Then the code to increment a reference count looks like this:
> +
> +@smallexample
> +if (__libc_single_threaded)
> +  ++refeference_count;inf
> +else
> +  __atomic_fetch_add (&reference_count, 1, __ATOMIC_ACQ_REL);
> +@end smallexample
> +
> +(Depending on the data associated with the reference count, it may be
> +possible to use the weaker @code{__ATOMIC_RELAXED} memory ordering on
> +the multi-threaded branch.)
> +
> +Several functions in @theglibc{} can change the value of the
> +@code{__libc_single_threaded} variable.  For example, creating new
> +threads using the @code{pthread_create} or @code{thrd_create} function
> +sets the variable to false.  This can also happen directly, say via a

s/directly/indirectly/ ?

> +call to @code{dlopen}.  Therefore, applications need to make a copy of
> +the value of @code{__libc_single_threaded} if after such a function
> +call, behavior must match the value as it was before the call, like
> +this:
> +
> +@smallexample
> +bool single_threaded = __libc_single_threaded;
> +if (single_threaded)
> +  prepare_single_threaded ();
> +else
> +  prepare_multi_thread ();
> +
> +void *handle = dlopen (shared_library_name, RTLD_NOW);
> +lookup_symbols (handle);
> +
> +if (single_threaded)
> +  cleanup_single_threaded ();
> +else
> +  cleanup_multi_thread ();
> +@end smallexample
> +
> +Since the value of @code{__libc_single_threaded} can change from true
> +to false during the execution of the program, it is not useful for
> +selecting optimized function implementations in IFUNC resolvers.
> +
> +Atomic operations can also be used on mappings shared among
> +single-threaded processes.  This means that a compiler cannot use
> +@code{__libc_single_threaded} to optimize atomic operations, unless it
> +is able to prove that the memory is not shared.
> +
>  @c FIXME these are undocumented:
>  @c pthread_atfork
>  @c pthread_attr_destroy

Good info!

Thanks,

Michael

Szabolcs Nagy May 21, 2020, 11:18 a.m. UTC | #2

The 05/20/2020 20:12, Florian Weimer via Libc-alpha wrote:
> +@smallexample
> +if (__libc_single_threaded)
> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
> +else
> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
> +@end smallexample
> +
> +This still requires some form of synchronization on the
> +single-threaded branch, so it can be beneficial not to declare the
> +reference count as @code{_Atomic}, and use the GCC @code{__atomic}
> +built-ins.  @xref{__atomic Builtins,, Built-in Functions for Memory
> +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection
> +(GCC)}.  Then the code to increment a reference count looks like this:
> +
> +@smallexample
> +if (__libc_single_threaded)
> +  ++refeference_count;inf

inf typo

Florian Weimer May 21, 2020, 12:16 p.m. UTC | #3

* Szabolcs Nagy:

> The 05/20/2020 20:12, Florian Weimer via Libc-alpha wrote:
>> +@smallexample
>> +if (__libc_single_threaded)
>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>> +else
>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>> +@end smallexample
>> +
>> +This still requires some form of synchronization on the
>> +single-threaded branch, so it can be beneficial not to declare the
>> +reference count as @code{_Atomic}, and use the GCC @code{__atomic}
>> +built-ins.  @xref{__atomic Builtins,, Built-in Functions for Memory
>> +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection
>> +(GCC)}.  Then the code to increment a reference count looks like this:
>> +
>> +@smallexample
>> +if (__libc_single_threaded)
>> +  ++refeference_count;inf
>
> inf typo

Thanks, queued for V2.

Florian

Florian Weimer May 21, 2020, 12:17 p.m. UTC | #4

* Michael Kerrisk:

>> +@deftypevar char __libc_single_threaded
>> +@standards{GNU, sys/single_threaded.h}
>> +This variable is non-zero if the current process is definitely
>> +single-threaded.  If it is zero, the process can be multi-threaded,
>
> s/can be/may be/

Fixed.

>> +Several functions in @theglibc{} can change the value of the
>> +@code{__libc_single_threaded} variable.  For example, creating new
>> +threads using the @code{pthread_create} or @code{thrd_create} function
>> +sets the variable to false.  This can also happen directly, say via a
>
> s/directly/indirectly/ ?

Indeed, fixed.

I'll repost a V2 once there are comments on the actual code.

Thanks,
Florian

Adhemerval Zanella Netto May 21, 2020, 12:50 p.m. UTC | #5

On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:

> +@smallexample
> +if (__libc_single_threaded)
> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
> +else
> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
> +@end smallexample

Shouldn't the access to __libc_single_threaded be atomic itself
(at least with relaxed semantic)?

Szabolcs Nagy May 21, 2020, 1:09 p.m. UTC | #6

The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
> 
> > +@smallexample
> > +if (__libc_single_threaded)
> > +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
> > +else
> > +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
> > +@end smallexample
> 
> Shouldn't the access to __libc_single_threaded be atomic itself
> (at least with relaxed semantic)?

not if we guarantee that this object can only be
written while the process is single threaded.

(e.g. an exiting detached thread cannot update it
even if only one thread left.. because that may
concurrently read it)

Florian Weimer May 21, 2020, 1:14 p.m. UTC | #7

* Adhemerval Zanella via Libc-alpha:

> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>
>> +@smallexample
>> +if (__libc_single_threaded)
>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>> +else
>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>> +@end smallexample
>
> Shouldn't the access to __libc_single_threaded be atomic itself
> (at least with relaxed semantic)?

Good question.  In the current implementation, it is not needed because
the variable is never written again once the process is multi-threaded.

We must retain relaxed MO access as a valid use of this variable.  A
future implementation may set __libc_single_threaded to true after
detecting that the process has become single-threaded again.  But I
think this requires that the last remaining thread synchronizes in some
way with the exit of the other, second-to-last remaining thread.  And
that in turn means that no explicit MO is needed for the variable read.

I'm going to add this to the manual as an implementation note, after the
first example:

@c Note: No memory order on __libc_single_threaded.  The
@c implementation must ensure that exit of the critical
@c (second-to-last) thread happens-before setting
@c __libc_single_threaded to true.  Otherwise, acquire MO might be
@c needed for reading the variable in some scenarios, and that would
@c completely defeat its purpose.

For detached thread exits, this kind of synchronization may not be
easily obtainable in all cases.  I don't think we can do it on the
on-thread exit path because the kernel will perform certain actions
afterwards (like robust mutex updates), no matter how late we do it.  I
guess we could perhaps piggy-back on the stack reclamation mechanism.

Thanks,
Florian

Adhemerval Zanella Netto May 21, 2020, 1:15 p.m. UTC | #8

On 21/05/2020 10:09, Szabolcs Nagy wrote:
> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>>
>>> +@smallexample
>>> +if (__libc_single_threaded)
>>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>>> +else
>>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>>> +@end smallexample
>>
>> Shouldn't the access to __libc_single_threaded be atomic itself
>> (at least with relaxed semantic)?
> 
> not if we guarantee that this object can only be
> written while the process is single threaded.
> 
> (e.g. an exiting detached thread cannot update it
> even if only one thread left.. because that may
> concurrently read it)
> 

OK, so I think we should outline that atomic operations are not required 
to acess this object and that once __libc_single_threaded is set 0 it will 
continue to indicate non-single thread even thread are jointed or detached 
thread finishes.

Szabolcs Nagy May 21, 2020, 1:30 p.m. UTC | #9

The 05/21/2020 10:15, Adhemerval Zanella wrote:
> On 21/05/2020 10:09, Szabolcs Nagy wrote:
> > The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
> >> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
> >>
> >>> +@smallexample
> >>> +if (__libc_single_threaded)
> >>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
> >>> +else
> >>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
> >>> +@end smallexample
> >>
> >> Shouldn't the access to __libc_single_threaded be atomic itself
> >> (at least with relaxed semantic)?
> > 
> > not if we guarantee that this object can only be
> > written while the process is single threaded.
> > 
> > (e.g. an exiting detached thread cannot update it
> > even if only one thread left.. because that may
> > concurrently read it)
> > 
> 
> OK, so I think we should outline that atomic operations are not required 
> to acess this object and that once __libc_single_threaded is set 0 it will 
> continue to indicate non-single thread even thread are jointed or detached 
> thread finishes.

what's wrong with pthread_join updating it?
(other than the already mentioned dlmopened
libc case)

if only one thread left that is doing the join
then there cannot be concurent access or any
memory ordering problem, the process is single
threaded.

Florian Weimer May 21, 2020, 1:44 p.m. UTC | #10

* Szabolcs Nagy:

> The 05/21/2020 10:15, Adhemerval Zanella wrote:
>> On 21/05/2020 10:09, Szabolcs Nagy wrote:
>> > The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
>> >> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>> >>
>> >>> +@smallexample
>> >>> +if (__libc_single_threaded)
>> >>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>> >>> +else
>> >>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>> >>> +@end smallexample
>> >>
>> >> Shouldn't the access to __libc_single_threaded be atomic itself
>> >> (at least with relaxed semantic)?
>> > 
>> > not if we guarantee that this object can only be
>> > written while the process is single threaded.
>> > 
>> > (e.g. an exiting detached thread cannot update it
>> > even if only one thread left.. because that may
>> > concurrently read it)
>> > 
>> 
>> OK, so I think we should outline that atomic operations are not required 
>> to acess this object and that once __libc_single_threaded is set 0 it will 
>> continue to indicate non-single thread even thread are jointed or detached 
>> thread finishes.
>
> what's wrong with pthread_join updating it?

It's tricky do it correctly if there are two remaining threads, one of
them the one being joined, the other one a detached thread.  A
straightforward implementation merely looking at __nptl_nthreads before
returning from pthread_join would not perform the required
synchronization on the detached thread exit.

Thanks,
Florian

Adhemerval Zanella Netto May 21, 2020, 1:56 p.m. UTC | #11

On 21/05/2020 10:30, Szabolcs Nagy wrote:
> The 05/21/2020 10:15, Adhemerval Zanella wrote:
>> On 21/05/2020 10:09, Szabolcs Nagy wrote:
>>> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
>>>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>>>>
>>>>> +@smallexample
>>>>> +if (__libc_single_threaded)
>>>>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>>>>> +else
>>>>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>>>>> +@end smallexample
>>>>
>>>> Shouldn't the access to __libc_single_threaded be atomic itself
>>>> (at least with relaxed semantic)?
>>>
>>> not if we guarantee that this object can only be
>>> written while the process is single threaded.
>>>
>>> (e.g. an exiting detached thread cannot update it
>>> even if only one thread left.. because that may
>>> concurrently read it)
>>>
>>
>> OK, so I think we should outline that atomic operations are not required 
>> to acess this object and that once __libc_single_threaded is set 0 it will 
>> continue to indicate non-single thread even thread are jointed or detached 
>> thread finishes.
> 
> what's wrong with pthread_join updating it?
> (other than the already mentioned dlmopened
> libc case)
> 

I don't think it is a problem, but we should document the current semantic
(which does not update the value on pthread_join or detached thread exit).

> if only one thread left that is doing the join
> then there cannot be concurent access or any
> memory ordering problem, the process is single
> threaded.
> 

I am not sure it is valid for C11 atomic semantics to access an object
that is concurrently updated without proper atomic semantic, however 
afaiu for this specific case it should result only in missing optimization
(to use the mt patch instead of the single threaded one).

Adhemerval Zanella Netto May 21, 2020, 1:58 p.m. UTC | #12

On 21/05/2020 10:44, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
>> The 05/21/2020 10:15, Adhemerval Zanella wrote:
>>> On 21/05/2020 10:09, Szabolcs Nagy wrote:
>>>> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
>>>>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>>>>>
>>>>>> +@smallexample
>>>>>> +if (__libc_single_threaded)
>>>>>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>>>>>> +else
>>>>>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>>>>>> +@end smallexample
>>>>>
>>>>> Shouldn't the access to __libc_single_threaded be atomic itself
>>>>> (at least with relaxed semantic)?
>>>>
>>>> not if we guarantee that this object can only be
>>>> written while the process is single threaded.
>>>>
>>>> (e.g. an exiting detached thread cannot update it
>>>> even if only one thread left.. because that may
>>>> concurrently read it)
>>>>
>>>
>>> OK, so I think we should outline that atomic operations are not required 
>>> to acess this object and that once __libc_single_threaded is set 0 it will 
>>> continue to indicate non-single thread even thread are jointed or detached 
>>> thread finishes.
>>
>> what's wrong with pthread_join updating it?
> 
> It's tricky do it correctly if there are two remaining threads, one of
> them the one being joined, the other one a detached thread.  A
> straightforward implementation merely looking at __nptl_nthreads before
> returning from pthread_join would not perform the required
> synchronization on the detached thread exit.

Couldn't we accomplish by making __libc_single_threaded count the total
number of threads and making pthread_create/pthread_exit/detach exit
atomically updating it?

Florian Weimer May 21, 2020, 2:03 p.m. UTC | #13

* Adhemerval Zanella:

> On 21/05/2020 10:44, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> 
>>> The 05/21/2020 10:15, Adhemerval Zanella wrote:
>>>> On 21/05/2020 10:09, Szabolcs Nagy wrote:
>>>>> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote:
>>>>>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>>>>>>
>>>>>>> +@smallexample
>>>>>>> +if (__libc_single_threaded)
>>>>>>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>>>>>>> +else
>>>>>>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>>>>>>> +@end smallexample
>>>>>>
>>>>>> Shouldn't the access to __libc_single_threaded be atomic itself
>>>>>> (at least with relaxed semantic)?
>>>>>
>>>>> not if we guarantee that this object can only be
>>>>> written while the process is single threaded.
>>>>>
>>>>> (e.g. an exiting detached thread cannot update it
>>>>> even if only one thread left.. because that may
>>>>> concurrently read it)
>>>>>
>>>>
>>>> OK, so I think we should outline that atomic operations are not required 
>>>> to acess this object and that once __libc_single_threaded is set 0 it will 
>>>> continue to indicate non-single thread even thread are jointed or detached 
>>>> thread finishes.
>>>
>>> what's wrong with pthread_join updating it?
>> 
>> It's tricky do it correctly if there are two remaining threads, one of
>> them the one being joined, the other one a detached thread.  A
>> straightforward implementation merely looking at __nptl_nthreads before
>> returning from pthread_join would not perform the required
>> synchronization on the detached thread exit.
>
> Couldn't we accomplish by making __libc_single_threaded count the total
> number of threads and making pthread_create/pthread_exit/detach exit
> atomically updating it?

We already have __nptl_nthreads as a global thread count, but it is
currently decremented too early (on the exiting thread).  As I tried to
explain, we cannot decrement it on the exiting thread itself because it
would not give us the desired synchronization, particularly not with any
kernel actions that happen afterwards.

Thanks,
Florian

Adhemerval Zanella Netto May 21, 2020, 2:32 p.m. UTC | #14

On 21/05/2020 10:14, Florian Weimer wrote:
> * Adhemerval Zanella via Libc-alpha:
> 
>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote:
>>
>>> +@smallexample
>>> +if (__libc_single_threaded)
>>> +  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
>>> +else
>>> +  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
>>> +@end smallexample
>>
>> Shouldn't the access to __libc_single_threaded be atomic itself
>> (at least with relaxed semantic)?
> 
> Good question.  In the current implementation, it is not needed because
> the variable is never written again once the process is multi-threaded.
> 
> We must retain relaxed MO access as a valid use of this variable.  A
> future implementation may set __libc_single_threaded to true after
> detecting that the process has become single-threaded again.  But I
> think this requires that the last remaining thread synchronizes in some
> way with the exit of the other, second-to-last remaining thread.  And
> that in turn means that no explicit MO is needed for the variable read.
> 
> I'm going to add this to the manual as an implementation note, after the
> first example:
> 
> @c Note: No memory order on __libc_single_threaded.  The
> @c implementation must ensure that exit of the critical
> @c (second-to-last) thread happens-before setting
> @c __libc_single_threaded to true.  Otherwise, acquire MO might be
> @c needed for reading the variable in some scenarios, and that would
> @c completely defeat its purpose.

The comments is sound, but I still think we should properly document 
that this initial version does not attempt to update 
__libc_single_threaded on pthread_join or detach exit and maybe also
the brief explanation you added on why this semantic was chose (to
avoid the requirement of more strict MO). 

> 
> For detached thread exits, this kind of synchronization may not be
> easily obtainable in all cases.  I don't think we can do it on the
> on-thread exit path because the kernel will perform certain actions
> afterwards (like robust mutex updates), no matter how late we do it.  I
> guess we could perhaps piggy-back on the stack reclamation mechanism.

It seems that robust mutexes updates are indeed a problem, but I am not
sure if CLONE_CHILD_CLEARTID clear helps here.  It signals the thread
is done with the memory synchronization, but the stack cache is not
really updated.  Maybe an extra clone3 flag ?

Szabolcs Nagy May 22, 2020, 10:01 a.m. UTC | #15

The 05/21/2020 15:44, Florian Weimer wrote:
> * Szabolcs Nagy:
> > what's wrong with pthread_join updating it?
> 
> It's tricky do it correctly if there are two remaining threads, one of
> them the one being joined, the other one a detached thread.  A
> straightforward implementation merely looking at __nptl_nthreads before
> returning from pthread_join would not perform the required
> synchronization on the detached thread exit.

i'm trying to understand this, but don't see
what's wrong if the last thread is detached.

do you mean user code in atexit handlers?
or synchronization in libc? what is libc
synchronizing with in a single detached thread?

so far i don't see why __libc_single_thread
cannot go back to true once it was false
(there may be usability issues but i need
to look at some example usage to see)

Florian Weimer May 22, 2020, 10:05 a.m. UTC | #16

* Szabolcs Nagy:

> The 05/21/2020 15:44, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> > what's wrong with pthread_join updating it?
>> 
>> It's tricky do it correctly if there are two remaining threads, one of
>> them the one being joined, the other one a detached thread.  A
>> straightforward implementation merely looking at __nptl_nthreads before
>> returning from pthread_join would not perform the required
>> synchronization on the detached thread exit.
>
> i'm trying to understand this, but don't see
> what's wrong if the last thread is detached.

Sorry, I meant three reamining threads in total, i.e., two more threads
in addition to the one thread that keeps going after the other two
exited, and may use __libc_single_threaded in the future.

Clearer now?

Thanks,
Florian

Szabolcs Nagy May 22, 2020, 10:54 a.m. UTC | #17

The 05/22/2020 12:05, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
> > The 05/21/2020 15:44, Florian Weimer wrote:
> >> * Szabolcs Nagy:
> >> > what's wrong with pthread_join updating it?
> >> 
> >> It's tricky do it correctly if there are two remaining threads, one of
> >> them the one being joined, the other one a detached thread.  A
> >> straightforward implementation merely looking at __nptl_nthreads before
> >> returning from pthread_join would not perform the required
> >> synchronization on the detached thread exit.
> >
> > i'm trying to understand this, but don't see
> > what's wrong if the last thread is detached.
> 
> Sorry, I meant three reamining threads in total, i.e., two more threads
> in addition to the one thread that keeps going after the other two
> exited, and may use __libc_single_threaded in the future.
> 
> Clearer now?

hm so a detached thread is concurrently exiting with
a pthread_join which sees a decremented __nptl_nthreads
but the detached thread has not actually exited yet.

i think glibc can issue a memory barrier syscall before
decrementing __nptl_nthreads in a detached thread, this
means if pthread_join observes __nptl_nthreads==1
then user memory accesses in the detached thread are
synchronized with non-atomic memory accesses after
pthread_join returns. (i.e. __nptl_nthreads==1 should
mean at all times that as far as user code is concerned
the process is single threaded even if some detached
thread is still hanging around)

i think __libc_single_threaded should be possible to
update in pthread_join with the above change, in
which case we need not document that it stays false
forever, so we can change this in the future.
(unless somebody finds usecases where a false->true
transition would cause problems)

Florian Weimer May 22, 2020, 11:08 a.m. UTC | #18

* Szabolcs Nagy:

> The 05/22/2020 12:05, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> 
>> > The 05/21/2020 15:44, Florian Weimer wrote:
>> >> * Szabolcs Nagy:
>> >> > what's wrong with pthread_join updating it?
>> >> 
>> >> It's tricky do it correctly if there are two remaining threads, one of
>> >> them the one being joined, the other one a detached thread.  A
>> >> straightforward implementation merely looking at __nptl_nthreads before
>> >> returning from pthread_join would not perform the required
>> >> synchronization on the detached thread exit.
>> >
>> > i'm trying to understand this, but don't see
>> > what's wrong if the last thread is detached.
>> 
>> Sorry, I meant three reamining threads in total, i.e., two more threads
>> in addition to the one thread that keeps going after the other two
>> exited, and may use __libc_single_threaded in the future.
>> 
>> Clearer now?
>
> hm so a detached thread is concurrently exiting with
> a pthread_join which sees a decremented __nptl_nthreads
> but the detached thread has not actually exited yet.

Correct.

> i think glibc can issue a memory barrier syscall before
> decrementing __nptl_nthreads in a detached thread, this
> means if pthread_join observes __nptl_nthreads==1
> then user memory accesses in the detached thread are
> synchronized with non-atomic memory accesses after
> pthread_join returns. (i.e. __nptl_nthreads==1 should
> mean at all times that as far as user code is concerned
> the process is single threaded even if some detached
> thread is still hanging around)

This depends on the extent to which kernel actions after thread exit are
visible to the last remaining threads.  I think it would be safer to
require that the stack must have been reclaimed.  From a high-level
perspective, we have a similar synchronization issue with stack
reclamation, and it should be possible to reuse the same mechanism for
that.

> i think __libc_single_threaded should be possible to
> update in pthread_join with the above change, in
> which case we need not document that it stays false
> forever, so we can change this in the future.

Yes, I expect future such changes, which is why I mentioned saving the
value of __libc_single_threaded for consistent execution.  The other
obvious case for optimization is fork, which is easier to implement: it
is only necessary to detect that the libc in question is an outer libc.

Thanks,
Florian

Rich Felker May 22, 2020, 3:07 p.m. UTC | #19

On Fri, May 22, 2020 at 11:54:58AM +0100, Szabolcs Nagy wrote:
> The 05/22/2020 12:05, Florian Weimer wrote:
> > * Szabolcs Nagy:
> > 
> > > The 05/21/2020 15:44, Florian Weimer wrote:
> > >> * Szabolcs Nagy:
> > >> > what's wrong with pthread_join updating it?
> > >> 
> > >> It's tricky do it correctly if there are two remaining threads, one of
> > >> them the one being joined, the other one a detached thread.  A
> > >> straightforward implementation merely looking at __nptl_nthreads before
> > >> returning from pthread_join would not perform the required
> > >> synchronization on the detached thread exit.
> > >
> > > i'm trying to understand this, but don't see
> > > what's wrong if the last thread is detached.
> > 
> > Sorry, I meant three reamining threads in total, i.e., two more threads
> > in addition to the one thread that keeps going after the other two
> > exited, and may use __libc_single_threaded in the future.
> > 
> > Clearer now?
> 
> hm so a detached thread is concurrently exiting with
> a pthread_join which sees a decremented __nptl_nthreads
> but the detached thread has not actually exited yet.

In principle this is no big deal as long as the exiting thread cannot
make any further actions where its existence causes an observable
effect on users of __libc_single_threaded. (For this purpose, I think
you actually need to define what uses are valid, though; see setxid
remarks below.) If it makes a problem for pthread_join that's an
implementation detail that should be fixable. The bigger issue is
memory synchronization.

> i think glibc can issue a memory barrier syscall before
> decrementing __nptl_nthreads in a detached thread, this
> means if pthread_join observes __nptl_nthreads==1
> then user memory accesses in the detached thread are
> synchronized with non-atomic memory accesses after
> pthread_join returns. (i.e. __nptl_nthreads==1 should
> mean at all times that as far as user code is concerned
> the process is single threaded even if some detached
> thread is still hanging around)

This still has consequences for setxid safety which is why musl now
fully synchronizes the existing threads list. But if you're not using
the thread count for that, it's not an issue. Indeed I think
SYS_membarrier is a solution here, but if it's not supported or
blocked by seccomp then __libc_single_threaded must not be made true
again at this time.

> i think __libc_single_threaded should be possible to
> update in pthread_join with the above change, in
> which case we need not document that it stays false
> forever, so we can change this in the future.
> (unless somebody finds usecases where a false->true
> transition would cause problems)

pthread_join is not unique here. In principle __libc_single_threaded
can be made true again any time a non-AS-safe function observes a
thread count of 1 after performing a memory barrier. There may be
other internal-locking situations where it's beneficial to update it
so that future locks can be elided.

Rich

Rich Felker May 22, 2020, 4:14 p.m. UTC | #20

On Fri, May 22, 2020 at 11:07:20AM -0400, Rich Felker wrote:
> On Fri, May 22, 2020 at 11:54:58AM +0100, Szabolcs Nagy wrote:
> > The 05/22/2020 12:05, Florian Weimer wrote:
> > > * Szabolcs Nagy:
> > > 
> > > > The 05/21/2020 15:44, Florian Weimer wrote:
> > > >> * Szabolcs Nagy:
> > > >> > what's wrong with pthread_join updating it?
> > > >> 
> > > >> It's tricky do it correctly if there are two remaining threads, one of
> > > >> them the one being joined, the other one a detached thread.  A
> > > >> straightforward implementation merely looking at __nptl_nthreads before
> > > >> returning from pthread_join would not perform the required
> > > >> synchronization on the detached thread exit.
> > > >
> > > > i'm trying to understand this, but don't see
> > > > what's wrong if the last thread is detached.
> > > 
> > > Sorry, I meant three reamining threads in total, i.e., two more threads
> > > in addition to the one thread that keeps going after the other two
> > > exited, and may use __libc_single_threaded in the future.
> > > 
> > > Clearer now?
> > 
> > hm so a detached thread is concurrently exiting with
> > a pthread_join which sees a decremented __nptl_nthreads
> > but the detached thread has not actually exited yet.
> 
> In principle this is no big deal as long as the exiting thread cannot
> make any further actions where its existence causes an observable
> effect on users of __libc_single_threaded. (For this purpose, I think
> you actually need to define what uses are valid, though; see setxid
> remarks below.) If it makes a problem for pthread_join that's an
> implementation detail that should be fixable. The bigger issue is
> memory synchronization.
> 
> > i think glibc can issue a memory barrier syscall before
> > decrementing __nptl_nthreads in a detached thread, this
> > means if pthread_join observes __nptl_nthreads==1
> > then user memory accesses in the detached thread are
> > synchronized with non-atomic memory accesses after
> > pthread_join returns. (i.e. __nptl_nthreads==1 should
> > mean at all times that as far as user code is concerned
> > the process is single threaded even if some detached
> > thread is still hanging around)
> 
> This still has consequences for setxid safety which is why musl now
> fully synchronizes the existing threads list. But if you're not using
> the thread count for that, it's not an issue. Indeed I think
> SYS_membarrier is a solution here, but if it's not supported or
> blocked by seccomp then __libc_single_threaded must not be made true
> again at this time.

Uhg, SYS_membarrier is *not* a solution here. The problem is far
worse, because the user of __libc_single_threaded potentially lacks
*compiler barriers* too.

Consider something like:

	if (!__libc_single_threaded) { lock(); need_unlock=1; }
	x = *p;
	if (need_unlock) unlock();
	/* ... */
	if (!__libc_single_threaded) { lock(); need_unlock=1; }
	x = *p;
	if (need_unlock) unlock();

Here, in the case where __libc_single_threaded is true the second time
around, there is no (memory or compiler) acquire barrier between the
first access to *p and the second. Thus the compiler can (and actually
does! I don't have a minimal PoC but musl actually just hit a bug very
close to this) omit the second load from memory, and uses the cached
value, which may be incorrect because the exiting thread modified it.

This could potentially be avoided with complex contracts about
barriers needed to use __libc_single_threaded, but it seems highly
error-prone.

Note that the issue becomes much easier to hit with a sort of "pretest
not under lock, re-check with lock" idiom of the form:

	x = *p;
	if (predicate(x)) {
		if (!__libc_single_threaded) { lock(); need_unlock=1; }
		x = *p;
		/* ... */
		if (need_unlock) unlock();
	}

Unlike the above, this one does not depend on the release barrier in
unlock() not also being a compiler acquire barrier.

Rich

Adhemerval Zanella Netto May 22, 2020, 4:36 p.m. UTC | #21

On 22/05/2020 13:14, Rich Felker wrote:
> On Fri, May 22, 2020 at 11:07:20AM -0400, Rich Felker wrote:
>> On Fri, May 22, 2020 at 11:54:58AM +0100, Szabolcs Nagy wrote:
>>> The 05/22/2020 12:05, Florian Weimer wrote:
>>>> * Szabolcs Nagy:
>>>>
>>>>> The 05/21/2020 15:44, Florian Weimer wrote:
>>>>>> * Szabolcs Nagy:
>>>>>>> what's wrong with pthread_join updating it?
>>>>>>
>>>>>> It's tricky do it correctly if there are two remaining threads, one of
>>>>>> them the one being joined, the other one a detached thread.  A
>>>>>> straightforward implementation merely looking at __nptl_nthreads before
>>>>>> returning from pthread_join would not perform the required
>>>>>> synchronization on the detached thread exit.
>>>>>
>>>>> i'm trying to understand this, but don't see
>>>>> what's wrong if the last thread is detached.
>>>>
>>>> Sorry, I meant three reamining threads in total, i.e., two more threads
>>>> in addition to the one thread that keeps going after the other two
>>>> exited, and may use __libc_single_threaded in the future.
>>>>
>>>> Clearer now?
>>>
>>> hm so a detached thread is concurrently exiting with
>>> a pthread_join which sees a decremented __nptl_nthreads
>>> but the detached thread has not actually exited yet.
>>
>> In principle this is no big deal as long as the exiting thread cannot
>> make any further actions where its existence causes an observable
>> effect on users of __libc_single_threaded. (For this purpose, I think
>> you actually need to define what uses are valid, though; see setxid
>> remarks below.) If it makes a problem for pthread_join that's an
>> implementation detail that should be fixable. The bigger issue is
>> memory synchronization.
>>
>>> i think glibc can issue a memory barrier syscall before
>>> decrementing __nptl_nthreads in a detached thread, this
>>> means if pthread_join observes __nptl_nthreads==1
>>> then user memory accesses in the detached thread are
>>> synchronized with non-atomic memory accesses after
>>> pthread_join returns. (i.e. __nptl_nthreads==1 should
>>> mean at all times that as far as user code is concerned
>>> the process is single threaded even if some detached
>>> thread is still hanging around)
>>
>> This still has consequences for setxid safety which is why musl now
>> fully synchronizes the existing threads list. But if you're not using
>> the thread count for that, it's not an issue. Indeed I think
>> SYS_membarrier is a solution here, but if it's not supported or
>> blocked by seccomp then __libc_single_threaded must not be made true
>> again at this time.
> 
> Uhg, SYS_membarrier is *not* a solution here. The problem is far
> worse, because the user of __libc_single_threaded potentially lacks
> *compiler barriers* too.
> 
> Consider something like:
> 
> 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
> 	x = *p;
> 	if (need_unlock) unlock();
> 	/* ... */
> 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
> 	x = *p;
> 	if (need_unlock) unlock();
> 
> Here, in the case where __libc_single_threaded is true the second time
> around, there is no (memory or compiler) acquire barrier between the
> first access to *p and the second. Thus the compiler can (and actually
> does! I don't have a minimal PoC but musl actually just hit a bug very
> close to this) omit the second load from memory, and uses the cached
> value, which may be incorrect because the exiting thread modified it.

Does it help to enforce a relaxed atomic MO on __libc_single_threaded
access in this example?

> 
> This could potentially be avoided with complex contracts about
> barriers needed to use __libc_single_threaded, but it seems highly
> error-prone.
> 
> Note that the issue becomes much easier to hit with a sort of "pretest
> not under lock, re-check with lock" idiom of the form:
> 
> 	x = *p;
> 	if (predicate(x)) {
> 		if (!__libc_single_threaded) { lock(); need_unlock=1; }
> 		x = *p;
> 		/* ... */
> 		if (need_unlock) unlock();
> 	}
> 
> Unlike the above, this one does not depend on the release barrier in
> unlock() not also being a compiler acquire barrier.
> 
> Rich
>

Florian Weimer May 22, 2020, 5:02 p.m. UTC | #22

* Rich Felker:

>> This still has consequences for setxid safety which is why musl now
>> fully synchronizes the existing threads list. But if you're not using
>> the thread count for that, it's not an issue. Indeed I think
>> SYS_membarrier is a solution here, but if it's not supported or
>> blocked by seccomp then __libc_single_threaded must not be made true
>> again at this time.
>
> Uhg, SYS_membarrier is *not* a solution here. The problem is far
> worse, because the user of __libc_single_threaded potentially lacks
> *compiler barriers* too.
>
> Consider something like:
>
> 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
> 	x = *p;
> 	if (need_unlock) unlock();
> 	/* ... */
> 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
> 	x = *p;
> 	if (need_unlock) unlock();
>
> Here, in the case where __libc_single_threaded is true the second time
> around, there is no (memory or compiler) acquire barrier between the
> first access to *p and the second. Thus the compiler can (and actually
> does! I don't have a minimal PoC but musl actually just hit a bug very
> close to this) omit the second load from memory, and uses the cached
> value, which may be incorrect because the exiting thread modified it.
>
> This could potentially be avoided with complex contracts about
> barriers needed to use __libc_single_threaded, but it seems highly
> error-prone.

Well, yes.  It's clearly a data race if the implementation sets
__libc_single_threaded directly from an exiting thread.  I don't see a
way around that.

Our discussion focused on the problem that observing a thread count of 1
in pthread_join does not necessarily mean that it is safe to assume at
this point that the process is single-threaded, in glibc's
implementation that uses a simple __nptl_nthreads counter decremented on
the thread itself.  This does not cause a low-level data race directly,
but is potentially still incorrect (I'm not quite sure yet).

In glibc, we annotate many functions with __attribute__ ((leaf)),
implicitly via __THROW.  None of these functions may reset
__libc_single_threaded.  I expect that many compilers have a built-in
list of standard functions they treat as leaf functions.  This means
that these functions cannot write in practice to __libc_single_threaded
(or any other global variable apart from errno).  Not following this
rule would result in undefined behavior, similar to an actual data race
in the memory model.

A compiler cannot treat pthread_create as a leaf function, so the simple
implementation of __libc_single_threaded I posted should be fine in this
regard.

Thanjs,
Florian

Florian Weimer May 22, 2020, 5:18 p.m. UTC | #23

* Florian Weimer:

> * Rich Felker:
>
>>> This still has consequences for setxid safety which is why musl now
>>> fully synchronizes the existing threads list. But if you're not using
>>> the thread count for that, it's not an issue. Indeed I think
>>> SYS_membarrier is a solution here, but if it's not supported or
>>> blocked by seccomp then __libc_single_threaded must not be made true
>>> again at this time.
>>
>> Uhg, SYS_membarrier is *not* a solution here. The problem is far
>> worse, because the user of __libc_single_threaded potentially lacks
>> *compiler barriers* too.
>>
>> Consider something like:
>>
>> 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
>> 	x = *p;
>> 	if (need_unlock) unlock();
>> 	/* ... */
>> 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
>> 	x = *p;
>> 	if (need_unlock) unlock();
>>
>> Here, in the case where __libc_single_threaded is true the second time
>> around, there is no (memory or compiler) acquire barrier between the
>> first access to *p and the second. Thus the compiler can (and actually
>> does! I don't have a minimal PoC but musl actually just hit a bug very
>> close to this) omit the second load from memory, and uses the cached
>> value, which may be incorrect because the exiting thread modified it.
>>
>> This could potentially be avoided with complex contracts about
>> barriers needed to use __libc_single_threaded, but it seems highly
>> error-prone.
>
> Well, yes.  It's clearly a data race if the implementation sets
> __libc_single_threaded directly from an exiting thread.  I don't see a
> way around that.
>
> Our discussion focused on the problem that observing a thread count of 1
> in pthread_join does not necessarily mean that it is safe to assume at
> this point that the process is single-threaded, in glibc's
> implementation that uses a simple __nptl_nthreads counter decremented on
> the thread itself.  This does not cause a low-level data race directly,
> but is potentially still incorrect (I'm not quite sure yet).
>
> In glibc, we annotate many functions with __attribute__ ((leaf)),
> implicitly via __THROW.  None of these functions may reset
> __libc_single_threaded.  I expect that many compilers have a built-in
> list of standard functions they treat as leaf functions.  This means
> that these functions cannot write in practice to __libc_single_threaded
> (or any other global variable apart from errno).  Not following this
> rule would result in undefined behavior, similar to an actual data race
> in the memory model.
>
> A compiler cannot treat pthread_create as a leaf function, so the simple
> implementation of __libc_single_threaded I posted should be fine in this
> regard.

Sorry, it's not the leaf attribute what I'm looking here.  It has this
affect only for static variables in translation units that contains
functions whose address is taken.  What I'm after is more like the pure
attribute.

Clang does it for malloc in C++ mode, but not in C mode:

#include <stdlib.h>

extern int a;
extern void *p;

int
f (void)
{
  int a1 = a;
  p = malloc (1);
  return a1 - a;
}

turns into:

_Z1fv:                                  # @_Z1fv
# %bb.0:
	pushq	%rax
	movl	$1, %edi
	callq	malloc
	movq	%rax, p(%rip)
	xorl	%eax, %eax
	popq	%rcx
	retq

So the code is compiled as if it were written like this, consistent with
eliding the relaod:

int
f (void)
{
  p = malloc (1);
  return 0;
}

Thanks,
Florian

Rich Felker May 22, 2020, 5:28 p.m. UTC | #24

On Fri, May 22, 2020 at 07:02:01PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> >> This still has consequences for setxid safety which is why musl now
> >> fully synchronizes the existing threads list. But if you're not using
> >> the thread count for that, it's not an issue. Indeed I think
> >> SYS_membarrier is a solution here, but if it's not supported or
> >> blocked by seccomp then __libc_single_threaded must not be made true
> >> again at this time.
> >
> > Uhg, SYS_membarrier is *not* a solution here. The problem is far
> > worse, because the user of __libc_single_threaded potentially lacks
> > *compiler barriers* too.
> >
> > Consider something like:
> >
> > 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
> > 	x = *p;
> > 	if (need_unlock) unlock();
> > 	/* ... */
> > 	if (!__libc_single_threaded) { lock(); need_unlock=1; }
> > 	x = *p;
> > 	if (need_unlock) unlock();
> >
> > Here, in the case where __libc_single_threaded is true the second time
> > around, there is no (memory or compiler) acquire barrier between the
> > first access to *p and the second. Thus the compiler can (and actually
> > does! I don't have a minimal PoC but musl actually just hit a bug very
> > close to this) omit the second load from memory, and uses the cached
> > value, which may be incorrect because the exiting thread modified it.
> >
> > This could potentially be avoided with complex contracts about
> > barriers needed to use __libc_single_threaded, but it seems highly
> > error-prone.
> 
> Well, yes.  It's clearly a data race if the implementation sets
> __libc_single_threaded directly from an exiting thread.  I don't see a
> way around that.
> 
> Our discussion focused on the problem that observing a thread count of 1
> in pthread_join does not necessarily mean that it is safe to assume at
> this point that the process is single-threaded, in glibc's
> implementation that uses a simple __nptl_nthreads counter decremented on
> the thread itself.  This does not cause a low-level data race directly,
> but is potentially still incorrect (I'm not quite sure yet).

pthread_join necessarily has an acquire barrier (this is a fundamental
requirement of the interface contract; join is acquiring the results
of the thread) so under some weak assumptions on unsynchronized memory
access (e.g. non-tearing, not seeing a value that wasn't stored
sometime between the last and next acquire barriers on the observer's
side) I think observing it from pthread_join is safe.

On the other hand I'm skeptical of the utility. In a program that
only makes small use of threads, the join may happen long after the
thread exits, during which time many operations may have been slowed
down by inability to skip locks.

> In glibc, we annotate many functions with __attribute__ ((leaf)),
> implicitly via __THROW.  None of these functions may reset
> __libc_single_threaded.

I don't think leaf (at least the gcc attribute leaf) is the actual
issue here; it's more complicated. Nothing about leaf forbids stores
to global variables. It just means the compiler can assume it has a
fuller picture for escape analysis. But any update to
__libc_single_threaded would require acquire barriers (e.g., after
acquiring a lock, using the value of __nptl_threads to infer that
future lock cycles can be skipped), and these barriers would in turn
preclude any invalid transformation by the compiler. (Leaf does not
negate barriers.)

Rich

Florian Weimer May 22, 2020, 5:40 p.m. UTC | #25

* Rich Felker:

>> Our discussion focused on the problem that observing a thread count of 1
>> in pthread_join does not necessarily mean that it is safe to assume at
>> this point that the process is single-threaded, in glibc's
>> implementation that uses a simple __nptl_nthreads counter decremented on
>> the thread itself.  This does not cause a low-level data race directly,
>> but is potentially still incorrect (I'm not quite sure yet).
>
> pthread_join necessarily has an acquire barrier (this is a fundamental
> requirement of the interface contract; join is acquiring the results
> of the thread) so under some weak assumptions on unsynchronized memory
> access (e.g. non-tearing, not seeing a value that wasn't stored
> sometime between the last and next acquire barriers on the observer's
> side) I think observing it from pthread_join is safe.

Because of the meaning of the variable, it is *completely* safe if there
are no detached threads, without any further assumptions.

With detached threads an pthread_join observing a thread count of 1 (as
decreased during thread exit), the validity of setting
__libc_single_threaded depends on whether the kernel offers something
that causes a memory write on thread exit.  I know of at least two such
facilities: the TID variable and robust mutexes.  Therefore, I'm
inclined that further such facilities could be added by the kernel, in
ways not observable to glibc, and we better make sure that we have full
synchronization with the thread exit if we implemented the pthread_join
optimization for __libc_single_threaded *today*, although it might not
actually be needed.

> I don't think leaf (at least the gcc attribute leaf) is the actual
> issue here

Yes, please see the follow-up.

Thanks,
Florian

Rich Felker May 22, 2020, 5:49 p.m. UTC | #26

On Fri, May 22, 2020 at 07:40:19PM +0200, Florian Weimer via Libc-alpha wrote:
> * Rich Felker:
> 
> >> Our discussion focused on the problem that observing a thread count of 1
> >> in pthread_join does not necessarily mean that it is safe to assume at
> >> this point that the process is single-threaded, in glibc's
> >> implementation that uses a simple __nptl_nthreads counter decremented on
> >> the thread itself.  This does not cause a low-level data race directly,
> >> but is potentially still incorrect (I'm not quite sure yet).
> >
> > pthread_join necessarily has an acquire barrier (this is a fundamental
> > requirement of the interface contract; join is acquiring the results
> > of the thread) so under some weak assumptions on unsynchronized memory
> > access (e.g. non-tearing, not seeing a value that wasn't stored
> > sometime between the last and next acquire barriers on the observer's
> > side) I think observing it from pthread_join is safe.
> 
> Because of the meaning of the variable, it is *completely* safe if there
> are no detached threads, without any further assumptions.
> 
> With detached threads an pthread_join observing a thread count of 1 (as
> decreased during thread exit), the validity of setting
> __libc_single_threaded depends on whether the kernel offers something
> that causes a memory write on thread exit.  I know of at least two such

I don't follow. Why do you care about the kernel entity exiting here?
You should only care about having a release barrier before the update
to the count, so that seeing the updated count guarantees seeing any
changes to memory made by the exiting detached thread.

Rich

Szabolcs Nagy May 22, 2020, 7:22 p.m. UTC | #27

The 05/22/2020 13:49, Rich Felker wrote:
> On Fri, May 22, 2020 at 07:40:19PM +0200, Florian Weimer via Libc-alpha wrote:
> > * Rich Felker:
> > 
> > >> Our discussion focused on the problem that observing a thread count of 1
> > >> in pthread_join does not necessarily mean that it is safe to assume at
> > >> this point that the process is single-threaded, in glibc's
> > >> implementation that uses a simple __nptl_nthreads counter decremented on
> > >> the thread itself.  This does not cause a low-level data race directly,
> > >> but is potentially still incorrect (I'm not quite sure yet).
> > >
> > > pthread_join necessarily has an acquire barrier (this is a fundamental
> > > requirement of the interface contract; join is acquiring the results
> > > of the thread) so under some weak assumptions on unsynchronized memory
> > > access (e.g. non-tearing, not seeing a value that wasn't stored
> > > sometime between the last and next acquire barriers on the observer's
> > > side) I think observing it from pthread_join is safe.
> > 
> > Because of the meaning of the variable, it is *completely* safe if there
> > are no detached threads, without any further assumptions.
> > 
> > With detached threads an pthread_join observing a thread count of 1 (as
> > decreased during thread exit), the validity of setting
> > __libc_single_threaded depends on whether the kernel offers something
> > that causes a memory write on thread exit.  I know of at least two such
> 
> I don't follow. Why do you care about the kernel entity exiting here?
> You should only care about having a release barrier before the update
> to the count, so that seeing the updated count guarantees seeing any
> changes to memory made by the exiting detached thread.

kernel entity matters if it writes user memory after
the release barrier such that user code may observe it.
(although that would likely break conformance or other
properties too, not just single thread checks).

another example is observing the detached thread via
kernel apis like /proc/task: user code may expect to
see a single os task when __libc_single_threaded is set.

so i think the safest implementation never sets
__libc_single_threaded back to true and second safest
is one that only sets it back to true in pthread_join
when there were no detached threads (or if using some
os api it can verify that there really is only one
kernel thread).

if we want to allow kernel entities to be around
but still tell the user when "as far as the libc
is concerned there is only one thread", then i
think __libc_single_threaded needs to be an extern
call (that acts as a compiler barrier).

Rich Felker May 22, 2020, 7:53 p.m. UTC | #28

On Fri, May 22, 2020 at 08:22:50PM +0100, Szabolcs Nagy wrote:
> The 05/22/2020 13:49, Rich Felker wrote:
> > On Fri, May 22, 2020 at 07:40:19PM +0200, Florian Weimer via Libc-alpha wrote:
> > > * Rich Felker:
> > > 
> > > >> Our discussion focused on the problem that observing a thread count of 1
> > > >> in pthread_join does not necessarily mean that it is safe to assume at
> > > >> this point that the process is single-threaded, in glibc's
> > > >> implementation that uses a simple __nptl_nthreads counter decremented on
> > > >> the thread itself.  This does not cause a low-level data race directly,
> > > >> but is potentially still incorrect (I'm not quite sure yet).
> > > >
> > > > pthread_join necessarily has an acquire barrier (this is a fundamental
> > > > requirement of the interface contract; join is acquiring the results
> > > > of the thread) so under some weak assumptions on unsynchronized memory
> > > > access (e.g. non-tearing, not seeing a value that wasn't stored
> > > > sometime between the last and next acquire barriers on the observer's
> > > > side) I think observing it from pthread_join is safe.
> > > 
> > > Because of the meaning of the variable, it is *completely* safe if there
> > > are no detached threads, without any further assumptions.
> > > 
> > > With detached threads an pthread_join observing a thread count of 1 (as
> > > decreased during thread exit), the validity of setting
> > > __libc_single_threaded depends on whether the kernel offers something
> > > that causes a memory write on thread exit.  I know of at least two such
> > 
> > I don't follow. Why do you care about the kernel entity exiting here?
> > You should only care about having a release barrier before the update
> > to the count, so that seeing the updated count guarantees seeing any
> > changes to memory made by the exiting detached thread.
> 
> kernel entity matters if it writes user memory after
> the release barrier such that user code may observe it.
> (although that would likely break conformance or other
> properties too, not just single thread checks).
> 
> another example is observing the detached thread via
> kernel apis like /proc/task: user code may expect to
> see a single os task when __libc_single_threaded is set.

Indeed, this is why I said the precise meaning/contract for what the
consumer of __libc_single_threaded is allowed to do needs to be clear
-- see also the new thread "How would we make linux man pages
authoritative?" and Michael Kerrisk's comments on how hidden behaviors
become de facto contracts if you don't clearly specify otherwise.

> so i think the safest implementation never sets
> __libc_single_threaded back to true and second safest
> is one that only sets it back to true in pthread_join
> when there were no detached threads (or if using some
> os api it can verify that there really is only one
> kernel thread).
> 
> if we want to allow kernel entities to be around
> but still tell the user when "as far as the libc
> is concerned there is only one thread", then i
> think __libc_single_threaded needs to be an extern
> call (that acts as a compiler barrier).

Do relaxed order atomics provide a compiler barrier? If so, I think
SYS_membarrier combined with one could be sufficient However using
atomic type for a public interface like this is a bit of a
compatibility thorn (requires compiler and std level supporting it).

Of course the same could be achieved with requirement to use a manual
compiler barrier (dummy asm) but I think it's error-prone to assume
the application would do it correctly.

Rich

Szabolcs Nagy May 23, 2020, 6:49 a.m. UTC | #29

* Rich Felker <dalias@libc.org> [2020-05-22 15:53:26 -0400]:
> On Fri, May 22, 2020 at 08:22:50PM +0100, Szabolcs Nagy wrote:
> > so i think the safest implementation never sets
> > __libc_single_threaded back to true and second safest
> > is one that only sets it back to true in pthread_join
> > when there were no detached threads (or if using some
> > os api it can verify that there really is only one
> > kernel thread).
> > 
> > if we want to allow kernel entities to be around
> > but still tell the user when "as far as the libc
> > is concerned there is only one thread", then i
> > think __libc_single_threaded needs to be an extern
> > call (that acts as a compiler barrier).
> 
> Do relaxed order atomics provide a compiler barrier? If so, I think
> SYS_membarrier combined with one could be sufficient However using
> atomic type for a public interface like this is a bit of a
> compatibility thorn (requires compiler and std level supporting it).
> 
> Of course the same could be achieved with requirement to use a manual
> compiler barrier (dummy asm) but I think it's error-prone to assume
> the application would do it correctly.

no, relaxed atomics do not order unrelated memory accesses.

but a trick like you proposed for the musl internal need_locks
may work: __libc_single_threaded can go back to 1 when there
is only one thread left executing user code *and* that thread
called a libc function that has an acquire barrier.
(if thread exit has a release barrier then this should work)

this can allow earlier single threaded detection than only
considering pthread_join: e.g. stdio, malloc etc may do a
check and update the global after an acquire barrier, however
the compiler must not cache globals across libc calls for this
to work.

Rich Felker May 23, 2020, 4:02 p.m. UTC | #30

On Sat, May 23, 2020 at 08:49:41AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2020-05-22 15:53:26 -0400]:
> > On Fri, May 22, 2020 at 08:22:50PM +0100, Szabolcs Nagy wrote:
> > > so i think the safest implementation never sets
> > > __libc_single_threaded back to true and second safest
> > > is one that only sets it back to true in pthread_join
> > > when there were no detached threads (or if using some
> > > os api it can verify that there really is only one
> > > kernel thread).
> > > 
> > > if we want to allow kernel entities to be around
> > > but still tell the user when "as far as the libc
> > > is concerned there is only one thread", then i
> > > think __libc_single_threaded needs to be an extern
> > > call (that acts as a compiler barrier).
> > 
> > Do relaxed order atomics provide a compiler barrier? If so, I think
> > SYS_membarrier combined with one could be sufficient However using
> > atomic type for a public interface like this is a bit of a
> > compatibility thorn (requires compiler and std level supporting it).
> > 
> > Of course the same could be achieved with requirement to use a manual
> > compiler barrier (dummy asm) but I think it's error-prone to assume
> > the application would do it correctly.
> 
> no, relaxed atomics do not order unrelated memory accesses.
> 
> but a trick like you proposed for the musl internal need_locks
> may work: __libc_single_threaded can go back to 1 when there
> is only one thread left executing user code *and* that thread
> called a libc function that has an acquire barrier.
> (if thread exit has a release barrier then this should work)

Thread exit is necessarily a release barrier if the thread is
joinable, has robust mutexes, or any of a number of other conditions
(including anything where the child exit futex address is non-null).

In addition any barrier in userspace before the kernel task exit is
fine too.

> this can allow earlier single threaded detection than only
> considering pthread_join: e.g. stdio, malloc etc may do a
> check and update the global after an acquire barrier, however
> the compiler must not cache globals across libc calls for this
> to work.

It can't cache globals across non-pure functions whose definitions it
cant't see (and if it saw the definition it would know the global is
modified). malloc is something of a special case where clang treats it
not as a function but having "pure malloc semantics", but even then I
don't think it matters if it caches it; at worst you see the old value
of __libc_single_threaded (false) rather than the new one (true) and
that direction is safe.

Rich

Florian Weimer May 25, 2020, 8:08 a.m. UTC | #31

* Rich Felker:

>> this can allow earlier single threaded detection than only
>> considering pthread_join: e.g. stdio, malloc etc may do a
>> check and update the global after an acquire barrier, however
>> the compiler must not cache globals across libc calls for this
>> to work.
>
> It can't cache globals across non-pure functions whose definitions it
> cant't see (and if it saw the definition it would know the global is
> modified).

All 

> malloc is something of a special case where clang treats it
> not as a function but having "pure malloc semantics", but even then I
> don't think it matters if it caches it; at worst you see the old value
> of __libc_single_threaded (false) rather than the new one (true) and
> that direction is safe.
>
> Rich

Florian Weimer May 25, 2020, 8:08 a.m. UTC | #32

* Rich Felker:

>> this can allow earlier single threaded detection than only
>> considering pthread_join: e.g. stdio, malloc etc may do a
>> check and update the global after an acquire barrier, however
>> the compiler must not cache globals across libc calls for this
>> to work.
>
> It can't cache globals across non-pure functions whose definitions it
> cant't see (and if it saw the definition it would know the global is
> modified).

Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/

For most standard C functions, it is well-known to which global
variables (if any) they write.  Of course, compilers exploit this fact.

> malloc is something of a special case where clang treats it
> not as a function but having "pure malloc semantics", but even then I
> don't think it matters if it caches it;

And of course malloc is the most common example of a standard function
that has observable side effects beyond those specified in the standard:
most implementations have a statistics interface.

> at worst you see the old value of __libc_single_threaded (false)
> rather than the new one (true) and that direction is safe.

It's still a data race.  The compiler can easily generate invalid code
if it incorrectly assumes that __libc_single_threaded remains stable.  I
don't know if Clang will do this.  But I think the C library
implementation should be rather conservative here.

Thanks,
Florian

Rich Felker May 25, 2020, 5:21 p.m. UTC | #33

On Mon, May 25, 2020 at 10:08:37AM +0200, Florian Weimer via Libc-alpha wrote:
> * Rich Felker:
> 
> >> this can allow earlier single threaded detection than only
> >> considering pthread_join: e.g. stdio, malloc etc may do a
> >> check and update the global after an acquire barrier, however
> >> the compiler must not cache globals across libc calls for this
> >> to work.
> >
> > It can't cache globals across non-pure functions whose definitions it
> > cant't see (and if it saw the definition it would know the global is
> > modified).
> 
> Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/
> 
> For most standard C functions, it is well-known to which global
> variables (if any) they write.  Of course, compilers exploit this fact.
> 
> > malloc is something of a special case where clang treats it
> > not as a function but having "pure malloc semantics", but even then I
> > don't think it matters if it caches it;
> 
> And of course malloc is the most common example of a standard function
> that has observable side effects beyond those specified in the standard:
> most implementations have a statistics interface.
> 
> > at worst you see the old value of __libc_single_threaded (false)
> > rather than the new one (true) and that direction is safe.
> 
> It's still a data race.  The compiler can easily generate invalid code
> if it incorrectly assumes that __libc_single_threaded remains stable.  I
> don't know if Clang will do this.  But I think the C library
> implementation should be rather conservative here.

If this is an issue, and even regardless of whether it is, I think the
type of __libc_single_threaded should be volatile qualified. This
ensures that it cannot be cached in any way that might be invalid.
That's not just a hack; it's the correct way to model that the value
is able to change asynchronously (such as by an operation that the
compiler would otherwise assume can't have side effects).

Rich

Florian Weimer May 27, 2020, 11:54 a.m. UTC | #34

* Rich Felker:

> On Mon, May 25, 2020 at 10:08:37AM +0200, Florian Weimer via Libc-alpha wrote:
>> * Rich Felker:
>> 
>> >> this can allow earlier single threaded detection than only
>> >> considering pthread_join: e.g. stdio, malloc etc may do a
>> >> check and update the global after an acquire barrier, however
>> >> the compiler must not cache globals across libc calls for this
>> >> to work.
>> >
>> > It can't cache globals across non-pure functions whose definitions it
>> > cant't see (and if it saw the definition it would know the global is
>> > modified).
>> 
>> Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/
>> 
>> For most standard C functions, it is well-known to which global
>> variables (if any) they write.  Of course, compilers exploit this fact.
>> 
>> > malloc is something of a special case where clang treats it
>> > not as a function but having "pure malloc semantics", but even then I
>> > don't think it matters if it caches it;
>> 
>> And of course malloc is the most common example of a standard function
>> that has observable side effects beyond those specified in the standard:
>> most implementations have a statistics interface.
>> 
>> > at worst you see the old value of __libc_single_threaded (false)
>> > rather than the new one (true) and that direction is safe.
>> 
>> It's still a data race.  The compiler can easily generate invalid code
>> if it incorrectly assumes that __libc_single_threaded remains stable.  I
>> don't know if Clang will do this.  But I think the C library
>> implementation should be rather conservative here.
>
> If this is an issue, and even regardless of whether it is, I think the
> type of __libc_single_threaded should be volatile qualified. This
> ensures that it cannot be cached in any way that might be invalid.
> That's not just a hack; it's the correct way to model that the value
> is able to change asynchronously (such as by an operation that the
> compiler would otherwise assume can't have side effects).

I think it makes more sense not to declare the object as volatile and
make sure that only libc functions which imply the required barrier
write to __libc_single_threaded.  For instance, I expect that this will
allow compilers to generate tighter code around multiple (implied)
reference count updates.

Thanks,
Florian

Rich Felker May 27, 2020, 3:36 p.m. UTC | #35

On Wed, May 27, 2020 at 01:54:01PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Mon, May 25, 2020 at 10:08:37AM +0200, Florian Weimer via Libc-alpha wrote:
> >> * Rich Felker:
> >> 
> >> >> this can allow earlier single threaded detection than only
> >> >> considering pthread_join: e.g. stdio, malloc etc may do a
> >> >> check and update the global after an acquire barrier, however
> >> >> the compiler must not cache globals across libc calls for this
> >> >> to work.
> >> >
> >> > It can't cache globals across non-pure functions whose definitions it
> >> > cant't see (and if it saw the definition it would know the global is
> >> > modified).
> >> 
> >> Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/
> >> 
> >> For most standard C functions, it is well-known to which global
> >> variables (if any) they write.  Of course, compilers exploit this fact.
> >> 
> >> > malloc is something of a special case where clang treats it
> >> > not as a function but having "pure malloc semantics", but even then I
> >> > don't think it matters if it caches it;
> >> 
> >> And of course malloc is the most common example of a standard function
> >> that has observable side effects beyond those specified in the standard:
> >> most implementations have a statistics interface.
> >> 
> >> > at worst you see the old value of __libc_single_threaded (false)
> >> > rather than the new one (true) and that direction is safe.
> >> 
> >> It's still a data race.  The compiler can easily generate invalid code
> >> if it incorrectly assumes that __libc_single_threaded remains stable.  I
> >> don't know if Clang will do this.  But I think the C library
> >> implementation should be rather conservative here.
> >
> > If this is an issue, and even regardless of whether it is, I think the
> > type of __libc_single_threaded should be volatile qualified. This
> > ensures that it cannot be cached in any way that might be invalid.
> > That's not just a hack; it's the correct way to model that the value
> > is able to change asynchronously (such as by an operation that the
> > compiler would otherwise assume can't have side effects).
> 
> I think it makes more sense not to declare the object as volatile and
> make sure that only libc functions which imply the required barrier
> write to __libc_single_threaded.  For instance, I expect that this will
> allow compilers to generate tighter code around multiple (implied)
> reference count updates.

It really should be volatile just because whatever you make it is ABI,
and there might be good reasons to want to make it updated
"asynchronously" with respect to the compiler's model in the future.
If you don't make it volatile now you can't do that in the future.

The cost of volatile is rather trivial; compiler can still cache
address of it, and loading from a cache line you recently loaded from,
and to which no stores are happening, is virtually free.

Rich

Florian Weimer June 3, 2020, 3 p.m. UTC | #36

* Rich Felker:

>> I think it makes more sense not to declare the object as volatile and
>> make sure that only libc functions which imply the required barrier
>> write to __libc_single_threaded.  For instance, I expect that this will
>> allow compilers to generate tighter code around multiple (implied)
>> reference count updates.
>
> It really should be volatile just because whatever you make it is ABI,
> and there might be good reasons to want to make it updated
> "asynchronously" with respect to the compiler's model in the future.
> If you don't make it volatile now you can't do that in the future.

volatile does not mean that it is safe to update it asynchronously.
Some barriers are still needed.  And if we need acquire MO for the
access, then the purpose of the variable is completely defeated.

Is it really a significant restriction on implementations if they can
only set __libc_single_threaded to true from functions which clearly
imply an acquire barrier?  I don't think so.  pthread_join qualifies,
and that seems the most important case/

> The cost of volatile is rather trivial; compiler can still cache
> address of it, and loading from a cache line you recently loaded from,
> and to which no stores are happening, is virtually free.

There is a code size impact because the compiler can no longer merge
several atomic/non-atomic alternatives.  Consider this example, which I
think it is not too unrealistic (hopefully, std::shared_ptr will
eventually compile down to something similar):

#include <stddef.h>

struct refcounted
{
  size_t count;
};

extern char __libc_single_threaded;

static inline
void ref (struct refcounted *p)
{
  if (__libc_single_threaded)
    ++p->count;
  else
    __atomic_fetch_add (&p->count, 1, __ATOMIC_RELAXED);
}


void f1 (struct refcounted *a, struct refcounted *b);

void
f2 (struct refcounted *a, struct refcounted *b)
{
  /* f1 takes ownership of both objects, but the caller of f2 is
     expected to retain (shared) ownership as well. */
  ref (a);
  ref (b);
  f1 (a, b);
}

As written, this turns into:

f2:
	cmpb	$0, __libc_single_threaded(%rip)
	je	.L2
	addq	$1, (%rdi)
.L3:
	addq	$1, (%rsi)
	jmp	f1
.L2:
	lock addq	$1, (%rdi)
	cmpb	$0, __libc_single_threaded(%rip)
	jne	.L3
	lock addq	$1, (%rsi)
	jmp	f1

The jump back to .L3 is due to the current reluctance to optimize
relaxed MO accesses, due to the brokenness of the C++ memory model
(although it would be quite safe here, I assume).

With volatile, it's this instead:

f2:
	movzbl	__libc_single_threaded(%rip), %eax
	testb	%al, %al
	je	.L2
	addq	$1, (%rdi)
	movzbl	__libc_single_threaded(%rip), %eax
	testb	%al, %al
	je	.L4
.L7:
	addq	$1, (%rsi)
	jmp	f1
.L2:
	lock addq	$1, (%rdi)
	movzbl	__libc_single_threaded(%rip), %eax
	testb	%al, %al
	jne	.L7
.L4:
	lock addq	$1, (%rsi)
	jmp	f1

So two checks in the non-atomic case, plus a load for the volatile
access.

Thanks,
Florian

Florian Weimer June 3, 2020, 3:48 p.m. UTC | #37

* Adhemerval Zanella:

>> I'm going to add this to the manual as an implementation note, after the
>> first example:
>> 
>> @c Note: No memory order on __libc_single_threaded.  The
>> @c implementation must ensure that exit of the critical
>> @c (second-to-last) thread happens-before setting
>> @c __libc_single_threaded to true.  Otherwise, acquire MO might be
>> @c needed for reading the variable in some scenarios, and that would
>> @c completely defeat its purpose.
>
> The comments is sound, but I still think we should properly document 
> that this initial version does not attempt to update 
> __libc_single_threaded on pthread_join or detach exit and maybe also
> the brief explanation you added on why this semantic was chose (to
> avoid the requirement of more strict MO).

I'm concerned that if we make the implementation too transparent,
programmers will read the explanation and say, “gosh, I better assign
true to __libc_single_threaded after I joined the last thread”.  That's
not something we want to encourage.

>> For detached thread exits, this kind of synchronization may not be
>> easily obtainable in all cases.  I don't think we can do it on the
>> on-thread exit path because the kernel will perform certain actions
>> afterwards (like robust mutex updates), no matter how late we do it.  I
>> guess we could perhaps piggy-back on the stack reclamation mechanism.
>
> It seems that robust mutexes updates are indeed a problem, but I am not
> sure if CLONE_CHILD_CLEARTID clear helps here.  It signals the thread
> is done with the memory synchronization, but the stack cache is not
> really updated.  Maybe an extra clone3 flag ?

I thought we might piggy-back on the work that free_stacks does.  But
the code is sufficiently convoluted that I'm no longer sure.

Thanks,
Florian

Rich Felker June 3, 2020, 5:11 p.m. UTC | #38

On Wed, Jun 03, 2020 at 05:00:01PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> >> I think it makes more sense not to declare the object as volatile and
> >> make sure that only libc functions which imply the required barrier
> >> write to __libc_single_threaded.  For instance, I expect that this will
> >> allow compilers to generate tighter code around multiple (implied)
> >> reference count updates.
> >
> > It really should be volatile just because whatever you make it is ABI,
> > and there might be good reasons to want to make it updated
> > "asynchronously" with respect to the compiler's model in the future.
> > If you don't make it volatile now you can't do that in the future.
> 
> volatile does not mean that it is safe to update it asynchronously.
> Some barriers are still needed.

If by "asynchronously" we just mean "some code running in the same
task context which the compiler isn't aware of, like a signal handler
or internals of malloc or of a pure function" then yes volatile is
sufficient. This is basically the definition of volatile.

If you mean modification from other threads, then you need additional
synchronization to ensure that the change is seen, but if you're ok
with a change not being seen (roughly the same as relaxed order) then
volatile suffices as a valid way to model it. (The potential change is
an asynchronous change by the hardware, where the hardware in question
happens to be the cache/memory coherency implementation.)

> And if we need acquire MO for the
> access, then the purpose of the variable is completely defeated.

No. Reading a stale value (false instead of true) is perfectly fine.
The other direction cannot happen because the change from true to
false can fundamentally happen only in a single-threaded context. And
if you don't want to read a stale value, SYS_membarrier performed by
the thread changing the value ensures that you won't. But that's not
needed.

> Is it really a significant restriction on implementations if they can
> only set __libc_single_threaded to true from functions which clearly
> imply an acquire barrier?  I don't think so.  pthread_join qualifies,
> and that seems the most important case/

Again doing it only from pthread_join will completely omit the
optimization in any program that waits to join until is has no further
work to do, which is a very reasonable thing to do.

> > The cost of volatile is rather trivial; compiler can still cache
> > address of it, and loading from a cache line you recently loaded from,
> > and to which no stores are happening, is virtually free.
> 
> There is a code size impact because the compiler can no longer merge
> several atomic/non-atomic alternatives.  Consider this example, which I
> think it is not too unrealistic (hopefully, std::shared_ptr will
> eventually compile down to something similar):
> 
> #include <stddef.h>
> 
> struct refcounted
> {
>   size_t count;
> };
> 
> extern char __libc_single_threaded;
> 
> static inline
> void ref (struct refcounted *p)
> {
>   if (__libc_single_threaded)
>     ++p->count;
>   else
>     __atomic_fetch_add (&p->count, 1, __ATOMIC_RELAXED);
> }
> 
> 
> void f1 (struct refcounted *a, struct refcounted *b);
> 
> void
> f2 (struct refcounted *a, struct refcounted *b)
> {
>   /* f1 takes ownership of both objects, but the caller of f2 is
>      expected to retain (shared) ownership as well. */
>   ref (a);
>   ref (b);
>   f1 (a, b);
> }
> 
> As written, this turns into:
> 
> f2:
> 	cmpb	$0, __libc_single_threaded(%rip)
> 	je	.L2
> 	addq	$1, (%rdi)
> ..L3:
> 	addq	$1, (%rsi)
> 	jmp	f1
> ..L2:
> 	lock addq	$1, (%rdi)
> 	cmpb	$0, __libc_single_threaded(%rip)
> 	jne	.L3
> 	lock addq	$1, (%rsi)
> 	jmp	f1
> 
> The jump back to .L3 is due to the current reluctance to optimize
> relaxed MO accesses, due to the brokenness of the C++ memory model
> (although it would be quite safe here, I assume).
> 
> With volatile, it's this instead:
> 
> f2:
> 	movzbl	__libc_single_threaded(%rip), %eax
> 	testb	%al, %al
> 	je	.L2
> 	addq	$1, (%rdi)
> 	movzbl	__libc_single_threaded(%rip), %eax
> 	testb	%al, %al
> 	je	.L4
> ..L7:
> 	addq	$1, (%rsi)
> 	jmp	f1
> ..L2:
> 	lock addq	$1, (%rdi)
> 	movzbl	__libc_single_threaded(%rip), %eax
> 	testb	%al, %al
> 	jne	.L7
> ..L4:
> 	lock addq	$1, (%rsi)
> 	jmp	f1
> 
> So two checks in the non-atomic case, plus a load for the volatile
> access.

I see; indeed for something on this small a scale, it *might* matter.
Has anyone done any measurement to determine whether it does?

Rich

Adhemerval Zanella Netto June 3, 2020, 5:52 p.m. UTC | #39

On 03/06/2020 12:48, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>>> I'm going to add this to the manual as an implementation note, after the
>>> first example:
>>>
>>> @c Note: No memory order on __libc_single_threaded.  The
>>> @c implementation must ensure that exit of the critical
>>> @c (second-to-last) thread happens-before setting
>>> @c __libc_single_threaded to true.  Otherwise, acquire MO might be
>>> @c needed for reading the variable in some scenarios, and that would
>>> @c completely defeat its purpose.
>>
>> The comments is sound, but I still think we should properly document 
>> that this initial version does not attempt to update 
>> __libc_single_threaded on pthread_join or detach exit and maybe also
>> the brief explanation you added on why this semantic was chose (to
>> avoid the requirement of more strict MO).
> 
> I'm concerned that if we make the implementation too transparent,
> programmers will read the explanation and say, “gosh, I better assign
> true to __libc_single_threaded after I joined the last thread”.  That's
> not something we want to encourage.

In fact from your discussion with Rich I think we really should be
transparent about its semantic, why he have chose the qualifiers 
(whether we use volatile or not), how glibc updates internally, and 
the expected usage patterns (for instance program are not expected
to change its value).

> 
>>> For detached thread exits, this kind of synchronization may not be
>>> easily obtainable in all cases.  I don't think we can do it on the
>>> on-thread exit path because the kernel will perform certain actions
>>> afterwards (like robust mutex updates), no matter how late we do it.  I
>>> guess we could perhaps piggy-back on the stack reclamation mechanism.
>>
>> It seems that robust mutexes updates are indeed a problem, but I am not
>> sure if CLONE_CHILD_CLEARTID clear helps here.  It signals the thread
>> is done with the memory synchronization, but the stack cache is not
>> really updated.  Maybe an extra clone3 flag ?
> 
> I thought we might piggy-back on the work that free_stacks does.  But
> the code is sufficiently convoluted that I'm no longer sure.

I am not sure that the free_stack pattern is really helpful without some
memory synchronization, such as SYS_membarrier. My idea of the clone3
flag would be similar to CLONE_CHILD_{CLEAR,SET}TID where kernel does
the heavy lifting of the memory synchronization to update the total
number of threads.

diff mbox series

Patch

diff --git a/manual/threads.texi b/manual/threads.texi
index a425635179..d4c261a0e9 100644
--- a/manual/threads.texi
+++ b/manual/threads.texi
@@ -627,6 +627,7 @@  the standard.
 					  threads in a process.
 * Waiting with Explicit Clocks::          Functions for waiting with an
                                           explicit clock specification.
+* Single-Threaded::     Detecting single-threaded execution.
 @end menu
 
 @node Default Thread Attributes
@@ -771,6 +772,94 @@  Behaves like @code{pthread_timedjoin_np} except that the absolute time in
 @var{abstime} is measured against the clock specified by @var{clockid}.
 @end deftypefun
 
+@node Single-Threaded
+@subsubsection Detecting Single-Threaded Execution
+
+Multi-threaded programs require synchronization among threads.  This
+synchronization can be costly even if there is just a single thread
+and no data is shared between multiple processors.  @Theglibc{} offers
+an interface to detect whether the process is in single-threaded mode.
+Applications can use this information to avoid synchronization, for
+example by using regular instructions to load and store memory instead
+of atomic instructions, or using relaxed memory ordering instead of
+stronger memory ordering.
+
+@deftypevar char __libc_single_threaded
+@standards{GNU, sys/single_threaded.h}
+This variable is non-zero if the current process is definitely
+single-threaded.  If it is zero, the process can be multi-threaded,
+or @theglibc{} cannot determine at this point of the program execution
+whether the process is single-threaded or not.
+
+Applications must never write to this variable.
+@end deftypevar
+
+Most applications should perform the same actions whether or not
+@code{__libc_single_threaded} is true, except with less
+synchronization.  If this rule is followed, a process that
+subsequently becomes multi-threaded is already in a consistent state.
+For example, in order to increment a reference count, the following
+code can be used:
+
+@smallexample
+if (__libc_single_threaded)
+  atomic_fetch_add (&reference_count, 1, memory_order_relaxed);
+else
+  atomic_fetch_add (&reference_count, 1, memory_order_acq_rel);
+@end smallexample
+
+This still requires some form of synchronization on the
+single-threaded branch, so it can be beneficial not to declare the
+reference count as @code{_Atomic}, and use the GCC @code{__atomic}
+built-ins.  @xref{__atomic Builtins,, Built-in Functions for Memory
+Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection
+(GCC)}.  Then the code to increment a reference count looks like this:
+
+@smallexample
+if (__libc_single_threaded)
+  ++refeference_count;inf
+else
+  __atomic_fetch_add (&reference_count, 1, __ATOMIC_ACQ_REL);
+@end smallexample
+
+(Depending on the data associated with the reference count, it may be
+possible to use the weaker @code{__ATOMIC_RELAXED} memory ordering on
+the multi-threaded branch.)
+
+Several functions in @theglibc{} can change the value of the
+@code{__libc_single_threaded} variable.  For example, creating new
+threads using the @code{pthread_create} or @code{thrd_create} function
+sets the variable to false.  This can also happen directly, say via a
+call to @code{dlopen}.  Therefore, applications need to make a copy of
+the value of @code{__libc_single_threaded} if after such a function
+call, behavior must match the value as it was before the call, like
+this:
+
+@smallexample
+bool single_threaded = __libc_single_threaded;
+if (single_threaded)
+  prepare_single_threaded ();
+else
+  prepare_multi_thread ();
+
+void *handle = dlopen (shared_library_name, RTLD_NOW);
+lookup_symbols (handle);
+
+if (single_threaded)
+  cleanup_single_threaded ();
+else
+  cleanup_multi_thread ();
+@end smallexample
+
+Since the value of @code{__libc_single_threaded} can change from true
+to false during the execution of the program, it is not useful for
+selecting optimized function implementations in IFUNC resolvers.
+
+Atomic operations can also be used on mappings shared among
+single-threaded processes.  This means that a compiler cannot use
+@code{__libc_single_threaded} to optimize atomic operations, unless it
+is able to prove that the memory is not shared.
+
 @c FIXME these are undocumented:
 @c pthread_atfork
 @c pthread_attr_destroy