Message ID | 2c218c9ed9586ed5491f6fa08045d1e883b126c3.1589998207.git.fweimer@redhat.com |
---|---|
State | Superseded |
Headers |
Return-Path: <libc-alpha-bounces@sourceware.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 68435395B80B; Wed, 20 May 2020 18:13:07 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 68435395B80B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1589998387; bh=f4bfBoRsTH3zCBeXyKh5u46lTeatYrFAhtsjfxEKzT8=; h=To:Subject:In-Reply-To:References:Date:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=HHuXCMczYrOvvsxK/A5eNQ89p2tAjaLMU3St9xvnjj1QYJLBWuu0HJ653qRxBOG4Q x+bS4bicJN9C/KeIWq23yy95/2ATkQl67sgZb90D8O0Xvxexb02ByZcPY9Qe4xbAIS VxLzBxYYbudbiKR286pV0EL6xuuNLJKsbLPJe7q4= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from us-smtp-delivery-1.mimecast.com (us-smtp-1.mimecast.com [205.139.110.61]) by sourceware.org (Postfix) with ESMTP id 52017385DC32 for <libc-alpha@sourceware.org>; Wed, 20 May 2020 18:13:04 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 52017385DC32 Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-204-XK-E0QyvM2Ch5gnxCfR7Hw-1; Wed, 20 May 2020 14:13:02 -0400 X-MC-Unique: XK-E0QyvM2Ch5gnxCfR7Hw-1 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 4D651835B40; Wed, 20 May 2020 18:13:01 +0000 (UTC) Received: from oldenburg2.str.redhat.com (ovpn-113-191.ams2.redhat.com [10.36.113.191]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 979E010640E1; Wed, 20 May 2020 18:13:00 +0000 (UTC) To: libc-alpha@sourceware.org Subject: [PATCH 2/2] manual: Document __libc_single_threaded In-Reply-To: <cover.1589998207.git.fweimer@redhat.com> References: <cover.1589998207.git.fweimer@redhat.com> Message-Id: <2c218c9ed9586ed5491f6fa08045d1e883b126c3.1589998207.git.fweimer@redhat.com> Date: Wed, 20 May 2020 20:12:59 +0200 User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain X-Spam-Status: No, score=-14.0 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org> List-Unsubscribe: <http://sourceware.org/mailman/options/libc-alpha>, <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe> List-Archive: <https://sourceware.org/pipermail/libc-alpha/> List-Post: <mailto:libc-alpha@sourceware.org> List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help> List-Subscribe: <http://sourceware.org/mailman/listinfo/libc-alpha>, <mailto:libc-alpha-request@sourceware.org?subject=subscribe> From: Florian Weimer via Libc-alpha <libc-alpha@sourceware.org> Reply-To: Florian Weimer <fweimer@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org> |
Series |
Add __libc_single_threaded
|
|
Commit Message
Florian Weimer
May 20, 2020, 6:12 p.m. UTC
--- manual/threads.texi | 89 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 89 insertions(+)
Comments
On 5/20/20 8:12 PM, Florian Weimer wrote: > --- > manual/threads.texi | 89 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 89 insertions(+) > > diff --git a/manual/threads.texi b/manual/threads.texi > index a425635179..d4c261a0e9 100644 > --- a/manual/threads.texi > +++ b/manual/threads.texi > @@ -627,6 +627,7 @@ the standard. > threads in a process. > * Waiting with Explicit Clocks:: Functions for waiting with an > explicit clock specification. > +* Single-Threaded:: Detecting single-threaded execution. > @end menu > > @node Default Thread Attributes > @@ -771,6 +772,94 @@ Behaves like @code{pthread_timedjoin_np} except that the absolute time in > @var{abstime} is measured against the clock specified by @var{clockid}. > @end deftypefun > > +@node Single-Threaded > +@subsubsection Detecting Single-Threaded Execution > + > +Multi-threaded programs require synchronization among threads. This > +synchronization can be costly even if there is just a single thread > +and no data is shared between multiple processors. @Theglibc{} offers > +an interface to detect whether the process is in single-threaded mode. > +Applications can use this information to avoid synchronization, for > +example by using regular instructions to load and store memory instead > +of atomic instructions, or using relaxed memory ordering instead of > +stronger memory ordering. > + > +@deftypevar char __libc_single_threaded > +@standards{GNU, sys/single_threaded.h} > +This variable is non-zero if the current process is definitely > +single-threaded. If it is zero, the process can be multi-threaded, s/can be/may be/ > +or @theglibc{} cannot determine at this point of the program execution > +whether the process is single-threaded or not. > + > +Applications must never write to this variable. > +@end deftypevar > + > +Most applications should perform the same actions whether or not > +@code{__libc_single_threaded} is true, except with less > +synchronization. If this rule is followed, a process that > +subsequently becomes multi-threaded is already in a consistent state. > +For example, in order to increment a reference count, the following > +code can be used: > + > +@smallexample > +if (__libc_single_threaded) > + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); > +else > + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); > +@end smallexample > + > +This still requires some form of synchronization on the > +single-threaded branch, so it can be beneficial not to declare the > +reference count as @code{_Atomic}, and use the GCC @code{__atomic} > +built-ins. @xref{__atomic Builtins,, Built-in Functions for Memory > +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection > +(GCC)}. Then the code to increment a reference count looks like this: > + > +@smallexample > +if (__libc_single_threaded) > + ++refeference_count;inf > +else > + __atomic_fetch_add (&reference_count, 1, __ATOMIC_ACQ_REL); > +@end smallexample > + > +(Depending on the data associated with the reference count, it may be > +possible to use the weaker @code{__ATOMIC_RELAXED} memory ordering on > +the multi-threaded branch.) > + > +Several functions in @theglibc{} can change the value of the > +@code{__libc_single_threaded} variable. For example, creating new > +threads using the @code{pthread_create} or @code{thrd_create} function > +sets the variable to false. This can also happen directly, say via a s/directly/indirectly/ ? > +call to @code{dlopen}. Therefore, applications need to make a copy of > +the value of @code{__libc_single_threaded} if after such a function > +call, behavior must match the value as it was before the call, like > +this: > + > +@smallexample > +bool single_threaded = __libc_single_threaded; > +if (single_threaded) > + prepare_single_threaded (); > +else > + prepare_multi_thread (); > + > +void *handle = dlopen (shared_library_name, RTLD_NOW); > +lookup_symbols (handle); > + > +if (single_threaded) > + cleanup_single_threaded (); > +else > + cleanup_multi_thread (); > +@end smallexample > + > +Since the value of @code{__libc_single_threaded} can change from true > +to false during the execution of the program, it is not useful for > +selecting optimized function implementations in IFUNC resolvers. > + > +Atomic operations can also be used on mappings shared among > +single-threaded processes. This means that a compiler cannot use > +@code{__libc_single_threaded} to optimize atomic operations, unless it > +is able to prove that the memory is not shared. > + > @c FIXME these are undocumented: > @c pthread_atfork > @c pthread_attr_destroy Good info! Thanks, Michael
The 05/20/2020 20:12, Florian Weimer via Libc-alpha wrote: > +@smallexample > +if (__libc_single_threaded) > + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); > +else > + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); > +@end smallexample > + > +This still requires some form of synchronization on the > +single-threaded branch, so it can be beneficial not to declare the > +reference count as @code{_Atomic}, and use the GCC @code{__atomic} > +built-ins. @xref{__atomic Builtins,, Built-in Functions for Memory > +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection > +(GCC)}. Then the code to increment a reference count looks like this: > + > +@smallexample > +if (__libc_single_threaded) > + ++refeference_count;inf inf typo
* Szabolcs Nagy: > The 05/20/2020 20:12, Florian Weimer via Libc-alpha wrote: >> +@smallexample >> +if (__libc_single_threaded) >> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >> +else >> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >> +@end smallexample >> + >> +This still requires some form of synchronization on the >> +single-threaded branch, so it can be beneficial not to declare the >> +reference count as @code{_Atomic}, and use the GCC @code{__atomic} >> +built-ins. @xref{__atomic Builtins,, Built-in Functions for Memory >> +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection >> +(GCC)}. Then the code to increment a reference count looks like this: >> + >> +@smallexample >> +if (__libc_single_threaded) >> + ++refeference_count;inf > > inf typo Thanks, queued for V2. Florian
* Michael Kerrisk: >> +@deftypevar char __libc_single_threaded >> +@standards{GNU, sys/single_threaded.h} >> +This variable is non-zero if the current process is definitely >> +single-threaded. If it is zero, the process can be multi-threaded, > > s/can be/may be/ Fixed. >> +Several functions in @theglibc{} can change the value of the >> +@code{__libc_single_threaded} variable. For example, creating new >> +threads using the @code{pthread_create} or @code{thrd_create} function >> +sets the variable to false. This can also happen directly, say via a > > s/directly/indirectly/ ? Indeed, fixed. I'll repost a V2 once there are comments on the actual code. Thanks, Florian
On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: > +@smallexample > +if (__libc_single_threaded) > + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); > +else > + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); > +@end smallexample Shouldn't the access to __libc_single_threaded be atomic itself (at least with relaxed semantic)?
The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: > On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: > > > +@smallexample > > +if (__libc_single_threaded) > > + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); > > +else > > + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); > > +@end smallexample > > Shouldn't the access to __libc_single_threaded be atomic itself > (at least with relaxed semantic)? not if we guarantee that this object can only be written while the process is single threaded. (e.g. an exiting detached thread cannot update it even if only one thread left.. because that may concurrently read it)
* Adhemerval Zanella via Libc-alpha: > On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: > >> +@smallexample >> +if (__libc_single_threaded) >> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >> +else >> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >> +@end smallexample > > Shouldn't the access to __libc_single_threaded be atomic itself > (at least with relaxed semantic)? Good question. In the current implementation, it is not needed because the variable is never written again once the process is multi-threaded. We must retain relaxed MO access as a valid use of this variable. A future implementation may set __libc_single_threaded to true after detecting that the process has become single-threaded again. But I think this requires that the last remaining thread synchronizes in some way with the exit of the other, second-to-last remaining thread. And that in turn means that no explicit MO is needed for the variable read. I'm going to add this to the manual as an implementation note, after the first example: @c Note: No memory order on __libc_single_threaded. The @c implementation must ensure that exit of the critical @c (second-to-last) thread happens-before setting @c __libc_single_threaded to true. Otherwise, acquire MO might be @c needed for reading the variable in some scenarios, and that would @c completely defeat its purpose. For detached thread exits, this kind of synchronization may not be easily obtainable in all cases. I don't think we can do it on the on-thread exit path because the kernel will perform certain actions afterwards (like robust mutex updates), no matter how late we do it. I guess we could perhaps piggy-back on the stack reclamation mechanism. Thanks, Florian
On 21/05/2020 10:09, Szabolcs Nagy wrote: > The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: >> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: >> >>> +@smallexample >>> +if (__libc_single_threaded) >>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >>> +else >>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >>> +@end smallexample >> >> Shouldn't the access to __libc_single_threaded be atomic itself >> (at least with relaxed semantic)? > > not if we guarantee that this object can only be > written while the process is single threaded. > > (e.g. an exiting detached thread cannot update it > even if only one thread left.. because that may > concurrently read it) > OK, so I think we should outline that atomic operations are not required to acess this object and that once __libc_single_threaded is set 0 it will continue to indicate non-single thread even thread are jointed or detached thread finishes.
The 05/21/2020 10:15, Adhemerval Zanella wrote: > On 21/05/2020 10:09, Szabolcs Nagy wrote: > > The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: > >> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: > >> > >>> +@smallexample > >>> +if (__libc_single_threaded) > >>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); > >>> +else > >>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); > >>> +@end smallexample > >> > >> Shouldn't the access to __libc_single_threaded be atomic itself > >> (at least with relaxed semantic)? > > > > not if we guarantee that this object can only be > > written while the process is single threaded. > > > > (e.g. an exiting detached thread cannot update it > > even if only one thread left.. because that may > > concurrently read it) > > > > OK, so I think we should outline that atomic operations are not required > to acess this object and that once __libc_single_threaded is set 0 it will > continue to indicate non-single thread even thread are jointed or detached > thread finishes. what's wrong with pthread_join updating it? (other than the already mentioned dlmopened libc case) if only one thread left that is doing the join then there cannot be concurent access or any memory ordering problem, the process is single threaded.
* Szabolcs Nagy: > The 05/21/2020 10:15, Adhemerval Zanella wrote: >> On 21/05/2020 10:09, Szabolcs Nagy wrote: >> > The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: >> >> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: >> >> >> >>> +@smallexample >> >>> +if (__libc_single_threaded) >> >>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >> >>> +else >> >>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >> >>> +@end smallexample >> >> >> >> Shouldn't the access to __libc_single_threaded be atomic itself >> >> (at least with relaxed semantic)? >> > >> > not if we guarantee that this object can only be >> > written while the process is single threaded. >> > >> > (e.g. an exiting detached thread cannot update it >> > even if only one thread left.. because that may >> > concurrently read it) >> > >> >> OK, so I think we should outline that atomic operations are not required >> to acess this object and that once __libc_single_threaded is set 0 it will >> continue to indicate non-single thread even thread are jointed or detached >> thread finishes. > > what's wrong with pthread_join updating it? It's tricky do it correctly if there are two remaining threads, one of them the one being joined, the other one a detached thread. A straightforward implementation merely looking at __nptl_nthreads before returning from pthread_join would not perform the required synchronization on the detached thread exit. Thanks, Florian
On 21/05/2020 10:30, Szabolcs Nagy wrote: > The 05/21/2020 10:15, Adhemerval Zanella wrote: >> On 21/05/2020 10:09, Szabolcs Nagy wrote: >>> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: >>>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: >>>> >>>>> +@smallexample >>>>> +if (__libc_single_threaded) >>>>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >>>>> +else >>>>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >>>>> +@end smallexample >>>> >>>> Shouldn't the access to __libc_single_threaded be atomic itself >>>> (at least with relaxed semantic)? >>> >>> not if we guarantee that this object can only be >>> written while the process is single threaded. >>> >>> (e.g. an exiting detached thread cannot update it >>> even if only one thread left.. because that may >>> concurrently read it) >>> >> >> OK, so I think we should outline that atomic operations are not required >> to acess this object and that once __libc_single_threaded is set 0 it will >> continue to indicate non-single thread even thread are jointed or detached >> thread finishes. > > what's wrong with pthread_join updating it? > (other than the already mentioned dlmopened > libc case) > I don't think it is a problem, but we should document the current semantic (which does not update the value on pthread_join or detached thread exit). > if only one thread left that is doing the join > then there cannot be concurent access or any > memory ordering problem, the process is single > threaded. > I am not sure it is valid for C11 atomic semantics to access an object that is concurrently updated without proper atomic semantic, however afaiu for this specific case it should result only in missing optimization (to use the mt patch instead of the single threaded one).
On 21/05/2020 10:44, Florian Weimer wrote: > * Szabolcs Nagy: > >> The 05/21/2020 10:15, Adhemerval Zanella wrote: >>> On 21/05/2020 10:09, Szabolcs Nagy wrote: >>>> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: >>>>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: >>>>> >>>>>> +@smallexample >>>>>> +if (__libc_single_threaded) >>>>>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >>>>>> +else >>>>>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >>>>>> +@end smallexample >>>>> >>>>> Shouldn't the access to __libc_single_threaded be atomic itself >>>>> (at least with relaxed semantic)? >>>> >>>> not if we guarantee that this object can only be >>>> written while the process is single threaded. >>>> >>>> (e.g. an exiting detached thread cannot update it >>>> even if only one thread left.. because that may >>>> concurrently read it) >>>> >>> >>> OK, so I think we should outline that atomic operations are not required >>> to acess this object and that once __libc_single_threaded is set 0 it will >>> continue to indicate non-single thread even thread are jointed or detached >>> thread finishes. >> >> what's wrong with pthread_join updating it? > > It's tricky do it correctly if there are two remaining threads, one of > them the one being joined, the other one a detached thread. A > straightforward implementation merely looking at __nptl_nthreads before > returning from pthread_join would not perform the required > synchronization on the detached thread exit. Couldn't we accomplish by making __libc_single_threaded count the total number of threads and making pthread_create/pthread_exit/detach exit atomically updating it?
* Adhemerval Zanella: > On 21/05/2020 10:44, Florian Weimer wrote: >> * Szabolcs Nagy: >> >>> The 05/21/2020 10:15, Adhemerval Zanella wrote: >>>> On 21/05/2020 10:09, Szabolcs Nagy wrote: >>>>> The 05/21/2020 09:50, Adhemerval Zanella via Libc-alpha wrote: >>>>>> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: >>>>>> >>>>>>> +@smallexample >>>>>>> +if (__libc_single_threaded) >>>>>>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >>>>>>> +else >>>>>>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >>>>>>> +@end smallexample >>>>>> >>>>>> Shouldn't the access to __libc_single_threaded be atomic itself >>>>>> (at least with relaxed semantic)? >>>>> >>>>> not if we guarantee that this object can only be >>>>> written while the process is single threaded. >>>>> >>>>> (e.g. an exiting detached thread cannot update it >>>>> even if only one thread left.. because that may >>>>> concurrently read it) >>>>> >>>> >>>> OK, so I think we should outline that atomic operations are not required >>>> to acess this object and that once __libc_single_threaded is set 0 it will >>>> continue to indicate non-single thread even thread are jointed or detached >>>> thread finishes. >>> >>> what's wrong with pthread_join updating it? >> >> It's tricky do it correctly if there are two remaining threads, one of >> them the one being joined, the other one a detached thread. A >> straightforward implementation merely looking at __nptl_nthreads before >> returning from pthread_join would not perform the required >> synchronization on the detached thread exit. > > Couldn't we accomplish by making __libc_single_threaded count the total > number of threads and making pthread_create/pthread_exit/detach exit > atomically updating it? We already have __nptl_nthreads as a global thread count, but it is currently decremented too early (on the exiting thread). As I tried to explain, we cannot decrement it on the exiting thread itself because it would not give us the desired synchronization, particularly not with any kernel actions that happen afterwards. Thanks, Florian
On 21/05/2020 10:14, Florian Weimer wrote: > * Adhemerval Zanella via Libc-alpha: > >> On 20/05/2020 15:12, Florian Weimer via Libc-alpha wrote: >> >>> +@smallexample >>> +if (__libc_single_threaded) >>> + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); >>> +else >>> + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); >>> +@end smallexample >> >> Shouldn't the access to __libc_single_threaded be atomic itself >> (at least with relaxed semantic)? > > Good question. In the current implementation, it is not needed because > the variable is never written again once the process is multi-threaded. > > We must retain relaxed MO access as a valid use of this variable. A > future implementation may set __libc_single_threaded to true after > detecting that the process has become single-threaded again. But I > think this requires that the last remaining thread synchronizes in some > way with the exit of the other, second-to-last remaining thread. And > that in turn means that no explicit MO is needed for the variable read. > > I'm going to add this to the manual as an implementation note, after the > first example: > > @c Note: No memory order on __libc_single_threaded. The > @c implementation must ensure that exit of the critical > @c (second-to-last) thread happens-before setting > @c __libc_single_threaded to true. Otherwise, acquire MO might be > @c needed for reading the variable in some scenarios, and that would > @c completely defeat its purpose. The comments is sound, but I still think we should properly document that this initial version does not attempt to update __libc_single_threaded on pthread_join or detach exit and maybe also the brief explanation you added on why this semantic was chose (to avoid the requirement of more strict MO). > > For detached thread exits, this kind of synchronization may not be > easily obtainable in all cases. I don't think we can do it on the > on-thread exit path because the kernel will perform certain actions > afterwards (like robust mutex updates), no matter how late we do it. I > guess we could perhaps piggy-back on the stack reclamation mechanism. It seems that robust mutexes updates are indeed a problem, but I am not sure if CLONE_CHILD_CLEARTID clear helps here. It signals the thread is done with the memory synchronization, but the stack cache is not really updated. Maybe an extra clone3 flag ?
The 05/21/2020 15:44, Florian Weimer wrote: > * Szabolcs Nagy: > > what's wrong with pthread_join updating it? > > It's tricky do it correctly if there are two remaining threads, one of > them the one being joined, the other one a detached thread. A > straightforward implementation merely looking at __nptl_nthreads before > returning from pthread_join would not perform the required > synchronization on the detached thread exit. i'm trying to understand this, but don't see what's wrong if the last thread is detached. do you mean user code in atexit handlers? or synchronization in libc? what is libc synchronizing with in a single detached thread? so far i don't see why __libc_single_thread cannot go back to true once it was false (there may be usability issues but i need to look at some example usage to see)
* Szabolcs Nagy: > The 05/21/2020 15:44, Florian Weimer wrote: >> * Szabolcs Nagy: >> > what's wrong with pthread_join updating it? >> >> It's tricky do it correctly if there are two remaining threads, one of >> them the one being joined, the other one a detached thread. A >> straightforward implementation merely looking at __nptl_nthreads before >> returning from pthread_join would not perform the required >> synchronization on the detached thread exit. > > i'm trying to understand this, but don't see > what's wrong if the last thread is detached. Sorry, I meant three reamining threads in total, i.e., two more threads in addition to the one thread that keeps going after the other two exited, and may use __libc_single_threaded in the future. Clearer now? Thanks, Florian
The 05/22/2020 12:05, Florian Weimer wrote: > * Szabolcs Nagy: > > > The 05/21/2020 15:44, Florian Weimer wrote: > >> * Szabolcs Nagy: > >> > what's wrong with pthread_join updating it? > >> > >> It's tricky do it correctly if there are two remaining threads, one of > >> them the one being joined, the other one a detached thread. A > >> straightforward implementation merely looking at __nptl_nthreads before > >> returning from pthread_join would not perform the required > >> synchronization on the detached thread exit. > > > > i'm trying to understand this, but don't see > > what's wrong if the last thread is detached. > > Sorry, I meant three reamining threads in total, i.e., two more threads > in addition to the one thread that keeps going after the other two > exited, and may use __libc_single_threaded in the future. > > Clearer now? hm so a detached thread is concurrently exiting with a pthread_join which sees a decremented __nptl_nthreads but the detached thread has not actually exited yet. i think glibc can issue a memory barrier syscall before decrementing __nptl_nthreads in a detached thread, this means if pthread_join observes __nptl_nthreads==1 then user memory accesses in the detached thread are synchronized with non-atomic memory accesses after pthread_join returns. (i.e. __nptl_nthreads==1 should mean at all times that as far as user code is concerned the process is single threaded even if some detached thread is still hanging around) i think __libc_single_threaded should be possible to update in pthread_join with the above change, in which case we need not document that it stays false forever, so we can change this in the future. (unless somebody finds usecases where a false->true transition would cause problems)
* Szabolcs Nagy: > The 05/22/2020 12:05, Florian Weimer wrote: >> * Szabolcs Nagy: >> >> > The 05/21/2020 15:44, Florian Weimer wrote: >> >> * Szabolcs Nagy: >> >> > what's wrong with pthread_join updating it? >> >> >> >> It's tricky do it correctly if there are two remaining threads, one of >> >> them the one being joined, the other one a detached thread. A >> >> straightforward implementation merely looking at __nptl_nthreads before >> >> returning from pthread_join would not perform the required >> >> synchronization on the detached thread exit. >> > >> > i'm trying to understand this, but don't see >> > what's wrong if the last thread is detached. >> >> Sorry, I meant three reamining threads in total, i.e., two more threads >> in addition to the one thread that keeps going after the other two >> exited, and may use __libc_single_threaded in the future. >> >> Clearer now? > > hm so a detached thread is concurrently exiting with > a pthread_join which sees a decremented __nptl_nthreads > but the detached thread has not actually exited yet. Correct. > i think glibc can issue a memory barrier syscall before > decrementing __nptl_nthreads in a detached thread, this > means if pthread_join observes __nptl_nthreads==1 > then user memory accesses in the detached thread are > synchronized with non-atomic memory accesses after > pthread_join returns. (i.e. __nptl_nthreads==1 should > mean at all times that as far as user code is concerned > the process is single threaded even if some detached > thread is still hanging around) This depends on the extent to which kernel actions after thread exit are visible to the last remaining threads. I think it would be safer to require that the stack must have been reclaimed. From a high-level perspective, we have a similar synchronization issue with stack reclamation, and it should be possible to reuse the same mechanism for that. > i think __libc_single_threaded should be possible to > update in pthread_join with the above change, in > which case we need not document that it stays false > forever, so we can change this in the future. Yes, I expect future such changes, which is why I mentioned saving the value of __libc_single_threaded for consistent execution. The other obvious case for optimization is fork, which is easier to implement: it is only necessary to detect that the libc in question is an outer libc. Thanks, Florian
On Fri, May 22, 2020 at 11:54:58AM +0100, Szabolcs Nagy wrote: > The 05/22/2020 12:05, Florian Weimer wrote: > > * Szabolcs Nagy: > > > > > The 05/21/2020 15:44, Florian Weimer wrote: > > >> * Szabolcs Nagy: > > >> > what's wrong with pthread_join updating it? > > >> > > >> It's tricky do it correctly if there are two remaining threads, one of > > >> them the one being joined, the other one a detached thread. A > > >> straightforward implementation merely looking at __nptl_nthreads before > > >> returning from pthread_join would not perform the required > > >> synchronization on the detached thread exit. > > > > > > i'm trying to understand this, but don't see > > > what's wrong if the last thread is detached. > > > > Sorry, I meant three reamining threads in total, i.e., two more threads > > in addition to the one thread that keeps going after the other two > > exited, and may use __libc_single_threaded in the future. > > > > Clearer now? > > hm so a detached thread is concurrently exiting with > a pthread_join which sees a decremented __nptl_nthreads > but the detached thread has not actually exited yet. In principle this is no big deal as long as the exiting thread cannot make any further actions where its existence causes an observable effect on users of __libc_single_threaded. (For this purpose, I think you actually need to define what uses are valid, though; see setxid remarks below.) If it makes a problem for pthread_join that's an implementation detail that should be fixable. The bigger issue is memory synchronization. > i think glibc can issue a memory barrier syscall before > decrementing __nptl_nthreads in a detached thread, this > means if pthread_join observes __nptl_nthreads==1 > then user memory accesses in the detached thread are > synchronized with non-atomic memory accesses after > pthread_join returns. (i.e. __nptl_nthreads==1 should > mean at all times that as far as user code is concerned > the process is single threaded even if some detached > thread is still hanging around) This still has consequences for setxid safety which is why musl now fully synchronizes the existing threads list. But if you're not using the thread count for that, it's not an issue. Indeed I think SYS_membarrier is a solution here, but if it's not supported or blocked by seccomp then __libc_single_threaded must not be made true again at this time. > i think __libc_single_threaded should be possible to > update in pthread_join with the above change, in > which case we need not document that it stays false > forever, so we can change this in the future. > (unless somebody finds usecases where a false->true > transition would cause problems) pthread_join is not unique here. In principle __libc_single_threaded can be made true again any time a non-AS-safe function observes a thread count of 1 after performing a memory barrier. There may be other internal-locking situations where it's beneficial to update it so that future locks can be elided. Rich
On Fri, May 22, 2020 at 11:07:20AM -0400, Rich Felker wrote: > On Fri, May 22, 2020 at 11:54:58AM +0100, Szabolcs Nagy wrote: > > The 05/22/2020 12:05, Florian Weimer wrote: > > > * Szabolcs Nagy: > > > > > > > The 05/21/2020 15:44, Florian Weimer wrote: > > > >> * Szabolcs Nagy: > > > >> > what's wrong with pthread_join updating it? > > > >> > > > >> It's tricky do it correctly if there are two remaining threads, one of > > > >> them the one being joined, the other one a detached thread. A > > > >> straightforward implementation merely looking at __nptl_nthreads before > > > >> returning from pthread_join would not perform the required > > > >> synchronization on the detached thread exit. > > > > > > > > i'm trying to understand this, but don't see > > > > what's wrong if the last thread is detached. > > > > > > Sorry, I meant three reamining threads in total, i.e., two more threads > > > in addition to the one thread that keeps going after the other two > > > exited, and may use __libc_single_threaded in the future. > > > > > > Clearer now? > > > > hm so a detached thread is concurrently exiting with > > a pthread_join which sees a decremented __nptl_nthreads > > but the detached thread has not actually exited yet. > > In principle this is no big deal as long as the exiting thread cannot > make any further actions where its existence causes an observable > effect on users of __libc_single_threaded. (For this purpose, I think > you actually need to define what uses are valid, though; see setxid > remarks below.) If it makes a problem for pthread_join that's an > implementation detail that should be fixable. The bigger issue is > memory synchronization. > > > i think glibc can issue a memory barrier syscall before > > decrementing __nptl_nthreads in a detached thread, this > > means if pthread_join observes __nptl_nthreads==1 > > then user memory accesses in the detached thread are > > synchronized with non-atomic memory accesses after > > pthread_join returns. (i.e. __nptl_nthreads==1 should > > mean at all times that as far as user code is concerned > > the process is single threaded even if some detached > > thread is still hanging around) > > This still has consequences for setxid safety which is why musl now > fully synchronizes the existing threads list. But if you're not using > the thread count for that, it's not an issue. Indeed I think > SYS_membarrier is a solution here, but if it's not supported or > blocked by seccomp then __libc_single_threaded must not be made true > again at this time. Uhg, SYS_membarrier is *not* a solution here. The problem is far worse, because the user of __libc_single_threaded potentially lacks *compiler barriers* too. Consider something like: if (!__libc_single_threaded) { lock(); need_unlock=1; } x = *p; if (need_unlock) unlock(); /* ... */ if (!__libc_single_threaded) { lock(); need_unlock=1; } x = *p; if (need_unlock) unlock(); Here, in the case where __libc_single_threaded is true the second time around, there is no (memory or compiler) acquire barrier between the first access to *p and the second. Thus the compiler can (and actually does! I don't have a minimal PoC but musl actually just hit a bug very close to this) omit the second load from memory, and uses the cached value, which may be incorrect because the exiting thread modified it. This could potentially be avoided with complex contracts about barriers needed to use __libc_single_threaded, but it seems highly error-prone. Note that the issue becomes much easier to hit with a sort of "pretest not under lock, re-check with lock" idiom of the form: x = *p; if (predicate(x)) { if (!__libc_single_threaded) { lock(); need_unlock=1; } x = *p; /* ... */ if (need_unlock) unlock(); } Unlike the above, this one does not depend on the release barrier in unlock() not also being a compiler acquire barrier. Rich
On 22/05/2020 13:14, Rich Felker wrote: > On Fri, May 22, 2020 at 11:07:20AM -0400, Rich Felker wrote: >> On Fri, May 22, 2020 at 11:54:58AM +0100, Szabolcs Nagy wrote: >>> The 05/22/2020 12:05, Florian Weimer wrote: >>>> * Szabolcs Nagy: >>>> >>>>> The 05/21/2020 15:44, Florian Weimer wrote: >>>>>> * Szabolcs Nagy: >>>>>>> what's wrong with pthread_join updating it? >>>>>> >>>>>> It's tricky do it correctly if there are two remaining threads, one of >>>>>> them the one being joined, the other one a detached thread. A >>>>>> straightforward implementation merely looking at __nptl_nthreads before >>>>>> returning from pthread_join would not perform the required >>>>>> synchronization on the detached thread exit. >>>>> >>>>> i'm trying to understand this, but don't see >>>>> what's wrong if the last thread is detached. >>>> >>>> Sorry, I meant three reamining threads in total, i.e., two more threads >>>> in addition to the one thread that keeps going after the other two >>>> exited, and may use __libc_single_threaded in the future. >>>> >>>> Clearer now? >>> >>> hm so a detached thread is concurrently exiting with >>> a pthread_join which sees a decremented __nptl_nthreads >>> but the detached thread has not actually exited yet. >> >> In principle this is no big deal as long as the exiting thread cannot >> make any further actions where its existence causes an observable >> effect on users of __libc_single_threaded. (For this purpose, I think >> you actually need to define what uses are valid, though; see setxid >> remarks below.) If it makes a problem for pthread_join that's an >> implementation detail that should be fixable. The bigger issue is >> memory synchronization. >> >>> i think glibc can issue a memory barrier syscall before >>> decrementing __nptl_nthreads in a detached thread, this >>> means if pthread_join observes __nptl_nthreads==1 >>> then user memory accesses in the detached thread are >>> synchronized with non-atomic memory accesses after >>> pthread_join returns. (i.e. __nptl_nthreads==1 should >>> mean at all times that as far as user code is concerned >>> the process is single threaded even if some detached >>> thread is still hanging around) >> >> This still has consequences for setxid safety which is why musl now >> fully synchronizes the existing threads list. But if you're not using >> the thread count for that, it's not an issue. Indeed I think >> SYS_membarrier is a solution here, but if it's not supported or >> blocked by seccomp then __libc_single_threaded must not be made true >> again at this time. > > Uhg, SYS_membarrier is *not* a solution here. The problem is far > worse, because the user of __libc_single_threaded potentially lacks > *compiler barriers* too. > > Consider something like: > > if (!__libc_single_threaded) { lock(); need_unlock=1; } > x = *p; > if (need_unlock) unlock(); > /* ... */ > if (!__libc_single_threaded) { lock(); need_unlock=1; } > x = *p; > if (need_unlock) unlock(); > > Here, in the case where __libc_single_threaded is true the second time > around, there is no (memory or compiler) acquire barrier between the > first access to *p and the second. Thus the compiler can (and actually > does! I don't have a minimal PoC but musl actually just hit a bug very > close to this) omit the second load from memory, and uses the cached > value, which may be incorrect because the exiting thread modified it. Does it help to enforce a relaxed atomic MO on __libc_single_threaded access in this example? > > This could potentially be avoided with complex contracts about > barriers needed to use __libc_single_threaded, but it seems highly > error-prone. > > Note that the issue becomes much easier to hit with a sort of "pretest > not under lock, re-check with lock" idiom of the form: > > x = *p; > if (predicate(x)) { > if (!__libc_single_threaded) { lock(); need_unlock=1; } > x = *p; > /* ... */ > if (need_unlock) unlock(); > } > > Unlike the above, this one does not depend on the release barrier in > unlock() not also being a compiler acquire barrier. > > Rich >
* Rich Felker: >> This still has consequences for setxid safety which is why musl now >> fully synchronizes the existing threads list. But if you're not using >> the thread count for that, it's not an issue. Indeed I think >> SYS_membarrier is a solution here, but if it's not supported or >> blocked by seccomp then __libc_single_threaded must not be made true >> again at this time. > > Uhg, SYS_membarrier is *not* a solution here. The problem is far > worse, because the user of __libc_single_threaded potentially lacks > *compiler barriers* too. > > Consider something like: > > if (!__libc_single_threaded) { lock(); need_unlock=1; } > x = *p; > if (need_unlock) unlock(); > /* ... */ > if (!__libc_single_threaded) { lock(); need_unlock=1; } > x = *p; > if (need_unlock) unlock(); > > Here, in the case where __libc_single_threaded is true the second time > around, there is no (memory or compiler) acquire barrier between the > first access to *p and the second. Thus the compiler can (and actually > does! I don't have a minimal PoC but musl actually just hit a bug very > close to this) omit the second load from memory, and uses the cached > value, which may be incorrect because the exiting thread modified it. > > This could potentially be avoided with complex contracts about > barriers needed to use __libc_single_threaded, but it seems highly > error-prone. Well, yes. It's clearly a data race if the implementation sets __libc_single_threaded directly from an exiting thread. I don't see a way around that. Our discussion focused on the problem that observing a thread count of 1 in pthread_join does not necessarily mean that it is safe to assume at this point that the process is single-threaded, in glibc's implementation that uses a simple __nptl_nthreads counter decremented on the thread itself. This does not cause a low-level data race directly, but is potentially still incorrect (I'm not quite sure yet). In glibc, we annotate many functions with __attribute__ ((leaf)), implicitly via __THROW. None of these functions may reset __libc_single_threaded. I expect that many compilers have a built-in list of standard functions they treat as leaf functions. This means that these functions cannot write in practice to __libc_single_threaded (or any other global variable apart from errno). Not following this rule would result in undefined behavior, similar to an actual data race in the memory model. A compiler cannot treat pthread_create as a leaf function, so the simple implementation of __libc_single_threaded I posted should be fine in this regard. Thanjs, Florian
* Florian Weimer: > * Rich Felker: > >>> This still has consequences for setxid safety which is why musl now >>> fully synchronizes the existing threads list. But if you're not using >>> the thread count for that, it's not an issue. Indeed I think >>> SYS_membarrier is a solution here, but if it's not supported or >>> blocked by seccomp then __libc_single_threaded must not be made true >>> again at this time. >> >> Uhg, SYS_membarrier is *not* a solution here. The problem is far >> worse, because the user of __libc_single_threaded potentially lacks >> *compiler barriers* too. >> >> Consider something like: >> >> if (!__libc_single_threaded) { lock(); need_unlock=1; } >> x = *p; >> if (need_unlock) unlock(); >> /* ... */ >> if (!__libc_single_threaded) { lock(); need_unlock=1; } >> x = *p; >> if (need_unlock) unlock(); >> >> Here, in the case where __libc_single_threaded is true the second time >> around, there is no (memory or compiler) acquire barrier between the >> first access to *p and the second. Thus the compiler can (and actually >> does! I don't have a minimal PoC but musl actually just hit a bug very >> close to this) omit the second load from memory, and uses the cached >> value, which may be incorrect because the exiting thread modified it. >> >> This could potentially be avoided with complex contracts about >> barriers needed to use __libc_single_threaded, but it seems highly >> error-prone. > > Well, yes. It's clearly a data race if the implementation sets > __libc_single_threaded directly from an exiting thread. I don't see a > way around that. > > Our discussion focused on the problem that observing a thread count of 1 > in pthread_join does not necessarily mean that it is safe to assume at > this point that the process is single-threaded, in glibc's > implementation that uses a simple __nptl_nthreads counter decremented on > the thread itself. This does not cause a low-level data race directly, > but is potentially still incorrect (I'm not quite sure yet). > > In glibc, we annotate many functions with __attribute__ ((leaf)), > implicitly via __THROW. None of these functions may reset > __libc_single_threaded. I expect that many compilers have a built-in > list of standard functions they treat as leaf functions. This means > that these functions cannot write in practice to __libc_single_threaded > (or any other global variable apart from errno). Not following this > rule would result in undefined behavior, similar to an actual data race > in the memory model. > > A compiler cannot treat pthread_create as a leaf function, so the simple > implementation of __libc_single_threaded I posted should be fine in this > regard. Sorry, it's not the leaf attribute what I'm looking here. It has this affect only for static variables in translation units that contains functions whose address is taken. What I'm after is more like the pure attribute. Clang does it for malloc in C++ mode, but not in C mode: #include <stdlib.h> extern int a; extern void *p; int f (void) { int a1 = a; p = malloc (1); return a1 - a; } turns into: _Z1fv: # @_Z1fv # %bb.0: pushq %rax movl $1, %edi callq malloc movq %rax, p(%rip) xorl %eax, %eax popq %rcx retq So the code is compiled as if it were written like this, consistent with eliding the relaod: int f (void) { p = malloc (1); return 0; } Thanks, Florian
On Fri, May 22, 2020 at 07:02:01PM +0200, Florian Weimer wrote: > * Rich Felker: > > >> This still has consequences for setxid safety which is why musl now > >> fully synchronizes the existing threads list. But if you're not using > >> the thread count for that, it's not an issue. Indeed I think > >> SYS_membarrier is a solution here, but if it's not supported or > >> blocked by seccomp then __libc_single_threaded must not be made true > >> again at this time. > > > > Uhg, SYS_membarrier is *not* a solution here. The problem is far > > worse, because the user of __libc_single_threaded potentially lacks > > *compiler barriers* too. > > > > Consider something like: > > > > if (!__libc_single_threaded) { lock(); need_unlock=1; } > > x = *p; > > if (need_unlock) unlock(); > > /* ... */ > > if (!__libc_single_threaded) { lock(); need_unlock=1; } > > x = *p; > > if (need_unlock) unlock(); > > > > Here, in the case where __libc_single_threaded is true the second time > > around, there is no (memory or compiler) acquire barrier between the > > first access to *p and the second. Thus the compiler can (and actually > > does! I don't have a minimal PoC but musl actually just hit a bug very > > close to this) omit the second load from memory, and uses the cached > > value, which may be incorrect because the exiting thread modified it. > > > > This could potentially be avoided with complex contracts about > > barriers needed to use __libc_single_threaded, but it seems highly > > error-prone. > > Well, yes. It's clearly a data race if the implementation sets > __libc_single_threaded directly from an exiting thread. I don't see a > way around that. > > Our discussion focused on the problem that observing a thread count of 1 > in pthread_join does not necessarily mean that it is safe to assume at > this point that the process is single-threaded, in glibc's > implementation that uses a simple __nptl_nthreads counter decremented on > the thread itself. This does not cause a low-level data race directly, > but is potentially still incorrect (I'm not quite sure yet). pthread_join necessarily has an acquire barrier (this is a fundamental requirement of the interface contract; join is acquiring the results of the thread) so under some weak assumptions on unsynchronized memory access (e.g. non-tearing, not seeing a value that wasn't stored sometime between the last and next acquire barriers on the observer's side) I think observing it from pthread_join is safe. On the other hand I'm skeptical of the utility. In a program that only makes small use of threads, the join may happen long after the thread exits, during which time many operations may have been slowed down by inability to skip locks. > In glibc, we annotate many functions with __attribute__ ((leaf)), > implicitly via __THROW. None of these functions may reset > __libc_single_threaded. I don't think leaf (at least the gcc attribute leaf) is the actual issue here; it's more complicated. Nothing about leaf forbids stores to global variables. It just means the compiler can assume it has a fuller picture for escape analysis. But any update to __libc_single_threaded would require acquire barriers (e.g., after acquiring a lock, using the value of __nptl_threads to infer that future lock cycles can be skipped), and these barriers would in turn preclude any invalid transformation by the compiler. (Leaf does not negate barriers.) Rich
* Rich Felker: >> Our discussion focused on the problem that observing a thread count of 1 >> in pthread_join does not necessarily mean that it is safe to assume at >> this point that the process is single-threaded, in glibc's >> implementation that uses a simple __nptl_nthreads counter decremented on >> the thread itself. This does not cause a low-level data race directly, >> but is potentially still incorrect (I'm not quite sure yet). > > pthread_join necessarily has an acquire barrier (this is a fundamental > requirement of the interface contract; join is acquiring the results > of the thread) so under some weak assumptions on unsynchronized memory > access (e.g. non-tearing, not seeing a value that wasn't stored > sometime between the last and next acquire barriers on the observer's > side) I think observing it from pthread_join is safe. Because of the meaning of the variable, it is *completely* safe if there are no detached threads, without any further assumptions. With detached threads an pthread_join observing a thread count of 1 (as decreased during thread exit), the validity of setting __libc_single_threaded depends on whether the kernel offers something that causes a memory write on thread exit. I know of at least two such facilities: the TID variable and robust mutexes. Therefore, I'm inclined that further such facilities could be added by the kernel, in ways not observable to glibc, and we better make sure that we have full synchronization with the thread exit if we implemented the pthread_join optimization for __libc_single_threaded *today*, although it might not actually be needed. > I don't think leaf (at least the gcc attribute leaf) is the actual > issue here Yes, please see the follow-up. Thanks, Florian
On Fri, May 22, 2020 at 07:40:19PM +0200, Florian Weimer via Libc-alpha wrote: > * Rich Felker: > > >> Our discussion focused on the problem that observing a thread count of 1 > >> in pthread_join does not necessarily mean that it is safe to assume at > >> this point that the process is single-threaded, in glibc's > >> implementation that uses a simple __nptl_nthreads counter decremented on > >> the thread itself. This does not cause a low-level data race directly, > >> but is potentially still incorrect (I'm not quite sure yet). > > > > pthread_join necessarily has an acquire barrier (this is a fundamental > > requirement of the interface contract; join is acquiring the results > > of the thread) so under some weak assumptions on unsynchronized memory > > access (e.g. non-tearing, not seeing a value that wasn't stored > > sometime between the last and next acquire barriers on the observer's > > side) I think observing it from pthread_join is safe. > > Because of the meaning of the variable, it is *completely* safe if there > are no detached threads, without any further assumptions. > > With detached threads an pthread_join observing a thread count of 1 (as > decreased during thread exit), the validity of setting > __libc_single_threaded depends on whether the kernel offers something > that causes a memory write on thread exit. I know of at least two such I don't follow. Why do you care about the kernel entity exiting here? You should only care about having a release barrier before the update to the count, so that seeing the updated count guarantees seeing any changes to memory made by the exiting detached thread. Rich
The 05/22/2020 13:49, Rich Felker wrote: > On Fri, May 22, 2020 at 07:40:19PM +0200, Florian Weimer via Libc-alpha wrote: > > * Rich Felker: > > > > >> Our discussion focused on the problem that observing a thread count of 1 > > >> in pthread_join does not necessarily mean that it is safe to assume at > > >> this point that the process is single-threaded, in glibc's > > >> implementation that uses a simple __nptl_nthreads counter decremented on > > >> the thread itself. This does not cause a low-level data race directly, > > >> but is potentially still incorrect (I'm not quite sure yet). > > > > > > pthread_join necessarily has an acquire barrier (this is a fundamental > > > requirement of the interface contract; join is acquiring the results > > > of the thread) so under some weak assumptions on unsynchronized memory > > > access (e.g. non-tearing, not seeing a value that wasn't stored > > > sometime between the last and next acquire barriers on the observer's > > > side) I think observing it from pthread_join is safe. > > > > Because of the meaning of the variable, it is *completely* safe if there > > are no detached threads, without any further assumptions. > > > > With detached threads an pthread_join observing a thread count of 1 (as > > decreased during thread exit), the validity of setting > > __libc_single_threaded depends on whether the kernel offers something > > that causes a memory write on thread exit. I know of at least two such > > I don't follow. Why do you care about the kernel entity exiting here? > You should only care about having a release barrier before the update > to the count, so that seeing the updated count guarantees seeing any > changes to memory made by the exiting detached thread. kernel entity matters if it writes user memory after the release barrier such that user code may observe it. (although that would likely break conformance or other properties too, not just single thread checks). another example is observing the detached thread via kernel apis like /proc/task: user code may expect to see a single os task when __libc_single_threaded is set. so i think the safest implementation never sets __libc_single_threaded back to true and second safest is one that only sets it back to true in pthread_join when there were no detached threads (or if using some os api it can verify that there really is only one kernel thread). if we want to allow kernel entities to be around but still tell the user when "as far as the libc is concerned there is only one thread", then i think __libc_single_threaded needs to be an extern call (that acts as a compiler barrier).
On Fri, May 22, 2020 at 08:22:50PM +0100, Szabolcs Nagy wrote: > The 05/22/2020 13:49, Rich Felker wrote: > > On Fri, May 22, 2020 at 07:40:19PM +0200, Florian Weimer via Libc-alpha wrote: > > > * Rich Felker: > > > > > > >> Our discussion focused on the problem that observing a thread count of 1 > > > >> in pthread_join does not necessarily mean that it is safe to assume at > > > >> this point that the process is single-threaded, in glibc's > > > >> implementation that uses a simple __nptl_nthreads counter decremented on > > > >> the thread itself. This does not cause a low-level data race directly, > > > >> but is potentially still incorrect (I'm not quite sure yet). > > > > > > > > pthread_join necessarily has an acquire barrier (this is a fundamental > > > > requirement of the interface contract; join is acquiring the results > > > > of the thread) so under some weak assumptions on unsynchronized memory > > > > access (e.g. non-tearing, not seeing a value that wasn't stored > > > > sometime between the last and next acquire barriers on the observer's > > > > side) I think observing it from pthread_join is safe. > > > > > > Because of the meaning of the variable, it is *completely* safe if there > > > are no detached threads, without any further assumptions. > > > > > > With detached threads an pthread_join observing a thread count of 1 (as > > > decreased during thread exit), the validity of setting > > > __libc_single_threaded depends on whether the kernel offers something > > > that causes a memory write on thread exit. I know of at least two such > > > > I don't follow. Why do you care about the kernel entity exiting here? > > You should only care about having a release barrier before the update > > to the count, so that seeing the updated count guarantees seeing any > > changes to memory made by the exiting detached thread. > > kernel entity matters if it writes user memory after > the release barrier such that user code may observe it. > (although that would likely break conformance or other > properties too, not just single thread checks). > > another example is observing the detached thread via > kernel apis like /proc/task: user code may expect to > see a single os task when __libc_single_threaded is set. Indeed, this is why I said the precise meaning/contract for what the consumer of __libc_single_threaded is allowed to do needs to be clear -- see also the new thread "How would we make linux man pages authoritative?" and Michael Kerrisk's comments on how hidden behaviors become de facto contracts if you don't clearly specify otherwise. > so i think the safest implementation never sets > __libc_single_threaded back to true and second safest > is one that only sets it back to true in pthread_join > when there were no detached threads (or if using some > os api it can verify that there really is only one > kernel thread). > > if we want to allow kernel entities to be around > but still tell the user when "as far as the libc > is concerned there is only one thread", then i > think __libc_single_threaded needs to be an extern > call (that acts as a compiler barrier). Do relaxed order atomics provide a compiler barrier? If so, I think SYS_membarrier combined with one could be sufficient However using atomic type for a public interface like this is a bit of a compatibility thorn (requires compiler and std level supporting it). Of course the same could be achieved with requirement to use a manual compiler barrier (dummy asm) but I think it's error-prone to assume the application would do it correctly. Rich
* Rich Felker <dalias@libc.org> [2020-05-22 15:53:26 -0400]: > On Fri, May 22, 2020 at 08:22:50PM +0100, Szabolcs Nagy wrote: > > so i think the safest implementation never sets > > __libc_single_threaded back to true and second safest > > is one that only sets it back to true in pthread_join > > when there were no detached threads (or if using some > > os api it can verify that there really is only one > > kernel thread). > > > > if we want to allow kernel entities to be around > > but still tell the user when "as far as the libc > > is concerned there is only one thread", then i > > think __libc_single_threaded needs to be an extern > > call (that acts as a compiler barrier). > > Do relaxed order atomics provide a compiler barrier? If so, I think > SYS_membarrier combined with one could be sufficient However using > atomic type for a public interface like this is a bit of a > compatibility thorn (requires compiler and std level supporting it). > > Of course the same could be achieved with requirement to use a manual > compiler barrier (dummy asm) but I think it's error-prone to assume > the application would do it correctly. no, relaxed atomics do not order unrelated memory accesses. but a trick like you proposed for the musl internal need_locks may work: __libc_single_threaded can go back to 1 when there is only one thread left executing user code *and* that thread called a libc function that has an acquire barrier. (if thread exit has a release barrier then this should work) this can allow earlier single threaded detection than only considering pthread_join: e.g. stdio, malloc etc may do a check and update the global after an acquire barrier, however the compiler must not cache globals across libc calls for this to work.
On Sat, May 23, 2020 at 08:49:41AM +0200, Szabolcs Nagy wrote: > * Rich Felker <dalias@libc.org> [2020-05-22 15:53:26 -0400]: > > On Fri, May 22, 2020 at 08:22:50PM +0100, Szabolcs Nagy wrote: > > > so i think the safest implementation never sets > > > __libc_single_threaded back to true and second safest > > > is one that only sets it back to true in pthread_join > > > when there were no detached threads (or if using some > > > os api it can verify that there really is only one > > > kernel thread). > > > > > > if we want to allow kernel entities to be around > > > but still tell the user when "as far as the libc > > > is concerned there is only one thread", then i > > > think __libc_single_threaded needs to be an extern > > > call (that acts as a compiler barrier). > > > > Do relaxed order atomics provide a compiler barrier? If so, I think > > SYS_membarrier combined with one could be sufficient However using > > atomic type for a public interface like this is a bit of a > > compatibility thorn (requires compiler and std level supporting it). > > > > Of course the same could be achieved with requirement to use a manual > > compiler barrier (dummy asm) but I think it's error-prone to assume > > the application would do it correctly. > > no, relaxed atomics do not order unrelated memory accesses. > > but a trick like you proposed for the musl internal need_locks > may work: __libc_single_threaded can go back to 1 when there > is only one thread left executing user code *and* that thread > called a libc function that has an acquire barrier. > (if thread exit has a release barrier then this should work) Thread exit is necessarily a release barrier if the thread is joinable, has robust mutexes, or any of a number of other conditions (including anything where the child exit futex address is non-null). In addition any barrier in userspace before the kernel task exit is fine too. > this can allow earlier single threaded detection than only > considering pthread_join: e.g. stdio, malloc etc may do a > check and update the global after an acquire barrier, however > the compiler must not cache globals across libc calls for this > to work. It can't cache globals across non-pure functions whose definitions it cant't see (and if it saw the definition it would know the global is modified). malloc is something of a special case where clang treats it not as a function but having "pure malloc semantics", but even then I don't think it matters if it caches it; at worst you see the old value of __libc_single_threaded (false) rather than the new one (true) and that direction is safe. Rich
* Rich Felker: >> this can allow earlier single threaded detection than only >> considering pthread_join: e.g. stdio, malloc etc may do a >> check and update the global after an acquire barrier, however >> the compiler must not cache globals across libc calls for this >> to work. > > It can't cache globals across non-pure functions whose definitions it > cant't see (and if it saw the definition it would know the global is > modified). All > malloc is something of a special case where clang treats it > not as a function but having "pure malloc semantics", but even then I > don't think it matters if it caches it; at worst you see the old value > of __libc_single_threaded (false) rather than the new one (true) and > that direction is safe. > > Rich
* Rich Felker: >> this can allow earlier single threaded detection than only >> considering pthread_join: e.g. stdio, malloc etc may do a >> check and update the global after an acquire barrier, however >> the compiler must not cache globals across libc calls for this >> to work. > > It can't cache globals across non-pure functions whose definitions it > cant't see (and if it saw the definition it would know the global is > modified). Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/ For most standard C functions, it is well-known to which global variables (if any) they write. Of course, compilers exploit this fact. > malloc is something of a special case where clang treats it > not as a function but having "pure malloc semantics", but even then I > don't think it matters if it caches it; And of course malloc is the most common example of a standard function that has observable side effects beyond those specified in the standard: most implementations have a statistics interface. > at worst you see the old value of __libc_single_threaded (false) > rather than the new one (true) and that direction is safe. It's still a data race. The compiler can easily generate invalid code if it incorrectly assumes that __libc_single_threaded remains stable. I don't know if Clang will do this. But I think the C library implementation should be rather conservative here. Thanks, Florian
On Mon, May 25, 2020 at 10:08:37AM +0200, Florian Weimer via Libc-alpha wrote: > * Rich Felker: > > >> this can allow earlier single threaded detection than only > >> considering pthread_join: e.g. stdio, malloc etc may do a > >> check and update the global after an acquire barrier, however > >> the compiler must not cache globals across libc calls for this > >> to work. > > > > It can't cache globals across non-pure functions whose definitions it > > cant't see (and if it saw the definition it would know the global is > > modified). > > Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/ > > For most standard C functions, it is well-known to which global > variables (if any) they write. Of course, compilers exploit this fact. > > > malloc is something of a special case where clang treats it > > not as a function but having "pure malloc semantics", but even then I > > don't think it matters if it caches it; > > And of course malloc is the most common example of a standard function > that has observable side effects beyond those specified in the standard: > most implementations have a statistics interface. > > > at worst you see the old value of __libc_single_threaded (false) > > rather than the new one (true) and that direction is safe. > > It's still a data race. The compiler can easily generate invalid code > if it incorrectly assumes that __libc_single_threaded remains stable. I > don't know if Clang will do this. But I think the C library > implementation should be rather conservative here. If this is an issue, and even regardless of whether it is, I think the type of __libc_single_threaded should be volatile qualified. This ensures that it cannot be cached in any way that might be invalid. That's not just a hack; it's the correct way to model that the value is able to change asynchronously (such as by an operation that the compiler would otherwise assume can't have side effects). Rich
* Rich Felker: > On Mon, May 25, 2020 at 10:08:37AM +0200, Florian Weimer via Libc-alpha wrote: >> * Rich Felker: >> >> >> this can allow earlier single threaded detection than only >> >> considering pthread_join: e.g. stdio, malloc etc may do a >> >> check and update the global after an acquire barrier, however >> >> the compiler must not cache globals across libc calls for this >> >> to work. >> > >> > It can't cache globals across non-pure functions whose definitions it >> > cant't see (and if it saw the definition it would know the global is >> > modified). >> >> Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/ >> >> For most standard C functions, it is well-known to which global >> variables (if any) they write. Of course, compilers exploit this fact. >> >> > malloc is something of a special case where clang treats it >> > not as a function but having "pure malloc semantics", but even then I >> > don't think it matters if it caches it; >> >> And of course malloc is the most common example of a standard function >> that has observable side effects beyond those specified in the standard: >> most implementations have a statistics interface. >> >> > at worst you see the old value of __libc_single_threaded (false) >> > rather than the new one (true) and that direction is safe. >> >> It's still a data race. The compiler can easily generate invalid code >> if it incorrectly assumes that __libc_single_threaded remains stable. I >> don't know if Clang will do this. But I think the C library >> implementation should be rather conservative here. > > If this is an issue, and even regardless of whether it is, I think the > type of __libc_single_threaded should be volatile qualified. This > ensures that it cannot be cached in any way that might be invalid. > That's not just a hack; it's the correct way to model that the value > is able to change asynchronously (such as by an operation that the > compiler would otherwise assume can't have side effects). I think it makes more sense not to declare the object as volatile and make sure that only libc functions which imply the required barrier write to __libc_single_threaded. For instance, I expect that this will allow compilers to generate tighter code around multiple (implied) reference count updates. Thanks, Florian
On Wed, May 27, 2020 at 01:54:01PM +0200, Florian Weimer wrote: > * Rich Felker: > > > On Mon, May 25, 2020 at 10:08:37AM +0200, Florian Weimer via Libc-alpha wrote: > >> * Rich Felker: > >> > >> >> this can allow earlier single threaded detection than only > >> >> considering pthread_join: e.g. stdio, malloc etc may do a > >> >> check and update the global after an acquire barrier, however > >> >> the compiler must not cache globals across libc calls for this > >> >> to work. > >> > > >> > It can't cache globals across non-pure functions whose definitions it > >> > cant't see (and if it saw the definition it would know the global is > >> > modified). > >> > >> Sorry about that, hit C-c C-c while I thought I was in a terminal. 8-/ > >> > >> For most standard C functions, it is well-known to which global > >> variables (if any) they write. Of course, compilers exploit this fact. > >> > >> > malloc is something of a special case where clang treats it > >> > not as a function but having "pure malloc semantics", but even then I > >> > don't think it matters if it caches it; > >> > >> And of course malloc is the most common example of a standard function > >> that has observable side effects beyond those specified in the standard: > >> most implementations have a statistics interface. > >> > >> > at worst you see the old value of __libc_single_threaded (false) > >> > rather than the new one (true) and that direction is safe. > >> > >> It's still a data race. The compiler can easily generate invalid code > >> if it incorrectly assumes that __libc_single_threaded remains stable. I > >> don't know if Clang will do this. But I think the C library > >> implementation should be rather conservative here. > > > > If this is an issue, and even regardless of whether it is, I think the > > type of __libc_single_threaded should be volatile qualified. This > > ensures that it cannot be cached in any way that might be invalid. > > That's not just a hack; it's the correct way to model that the value > > is able to change asynchronously (such as by an operation that the > > compiler would otherwise assume can't have side effects). > > I think it makes more sense not to declare the object as volatile and > make sure that only libc functions which imply the required barrier > write to __libc_single_threaded. For instance, I expect that this will > allow compilers to generate tighter code around multiple (implied) > reference count updates. It really should be volatile just because whatever you make it is ABI, and there might be good reasons to want to make it updated "asynchronously" with respect to the compiler's model in the future. If you don't make it volatile now you can't do that in the future. The cost of volatile is rather trivial; compiler can still cache address of it, and loading from a cache line you recently loaded from, and to which no stores are happening, is virtually free. Rich
* Rich Felker: >> I think it makes more sense not to declare the object as volatile and >> make sure that only libc functions which imply the required barrier >> write to __libc_single_threaded. For instance, I expect that this will >> allow compilers to generate tighter code around multiple (implied) >> reference count updates. > > It really should be volatile just because whatever you make it is ABI, > and there might be good reasons to want to make it updated > "asynchronously" with respect to the compiler's model in the future. > If you don't make it volatile now you can't do that in the future. volatile does not mean that it is safe to update it asynchronously. Some barriers are still needed. And if we need acquire MO for the access, then the purpose of the variable is completely defeated. Is it really a significant restriction on implementations if they can only set __libc_single_threaded to true from functions which clearly imply an acquire barrier? I don't think so. pthread_join qualifies, and that seems the most important case/ > The cost of volatile is rather trivial; compiler can still cache > address of it, and loading from a cache line you recently loaded from, > and to which no stores are happening, is virtually free. There is a code size impact because the compiler can no longer merge several atomic/non-atomic alternatives. Consider this example, which I think it is not too unrealistic (hopefully, std::shared_ptr will eventually compile down to something similar): #include <stddef.h> struct refcounted { size_t count; }; extern char __libc_single_threaded; static inline void ref (struct refcounted *p) { if (__libc_single_threaded) ++p->count; else __atomic_fetch_add (&p->count, 1, __ATOMIC_RELAXED); } void f1 (struct refcounted *a, struct refcounted *b); void f2 (struct refcounted *a, struct refcounted *b) { /* f1 takes ownership of both objects, but the caller of f2 is expected to retain (shared) ownership as well. */ ref (a); ref (b); f1 (a, b); } As written, this turns into: f2: cmpb $0, __libc_single_threaded(%rip) je .L2 addq $1, (%rdi) .L3: addq $1, (%rsi) jmp f1 .L2: lock addq $1, (%rdi) cmpb $0, __libc_single_threaded(%rip) jne .L3 lock addq $1, (%rsi) jmp f1 The jump back to .L3 is due to the current reluctance to optimize relaxed MO accesses, due to the brokenness of the C++ memory model (although it would be quite safe here, I assume). With volatile, it's this instead: f2: movzbl __libc_single_threaded(%rip), %eax testb %al, %al je .L2 addq $1, (%rdi) movzbl __libc_single_threaded(%rip), %eax testb %al, %al je .L4 .L7: addq $1, (%rsi) jmp f1 .L2: lock addq $1, (%rdi) movzbl __libc_single_threaded(%rip), %eax testb %al, %al jne .L7 .L4: lock addq $1, (%rsi) jmp f1 So two checks in the non-atomic case, plus a load for the volatile access. Thanks, Florian
* Adhemerval Zanella: >> I'm going to add this to the manual as an implementation note, after the >> first example: >> >> @c Note: No memory order on __libc_single_threaded. The >> @c implementation must ensure that exit of the critical >> @c (second-to-last) thread happens-before setting >> @c __libc_single_threaded to true. Otherwise, acquire MO might be >> @c needed for reading the variable in some scenarios, and that would >> @c completely defeat its purpose. > > The comments is sound, but I still think we should properly document > that this initial version does not attempt to update > __libc_single_threaded on pthread_join or detach exit and maybe also > the brief explanation you added on why this semantic was chose (to > avoid the requirement of more strict MO). I'm concerned that if we make the implementation too transparent, programmers will read the explanation and say, “gosh, I better assign true to __libc_single_threaded after I joined the last thread”. That's not something we want to encourage. >> For detached thread exits, this kind of synchronization may not be >> easily obtainable in all cases. I don't think we can do it on the >> on-thread exit path because the kernel will perform certain actions >> afterwards (like robust mutex updates), no matter how late we do it. I >> guess we could perhaps piggy-back on the stack reclamation mechanism. > > It seems that robust mutexes updates are indeed a problem, but I am not > sure if CLONE_CHILD_CLEARTID clear helps here. It signals the thread > is done with the memory synchronization, but the stack cache is not > really updated. Maybe an extra clone3 flag ? I thought we might piggy-back on the work that free_stacks does. But the code is sufficiently convoluted that I'm no longer sure. Thanks, Florian
On Wed, Jun 03, 2020 at 05:00:01PM +0200, Florian Weimer wrote: > * Rich Felker: > > >> I think it makes more sense not to declare the object as volatile and > >> make sure that only libc functions which imply the required barrier > >> write to __libc_single_threaded. For instance, I expect that this will > >> allow compilers to generate tighter code around multiple (implied) > >> reference count updates. > > > > It really should be volatile just because whatever you make it is ABI, > > and there might be good reasons to want to make it updated > > "asynchronously" with respect to the compiler's model in the future. > > If you don't make it volatile now you can't do that in the future. > > volatile does not mean that it is safe to update it asynchronously. > Some barriers are still needed. If by "asynchronously" we just mean "some code running in the same task context which the compiler isn't aware of, like a signal handler or internals of malloc or of a pure function" then yes volatile is sufficient. This is basically the definition of volatile. If you mean modification from other threads, then you need additional synchronization to ensure that the change is seen, but if you're ok with a change not being seen (roughly the same as relaxed order) then volatile suffices as a valid way to model it. (The potential change is an asynchronous change by the hardware, where the hardware in question happens to be the cache/memory coherency implementation.) > And if we need acquire MO for the > access, then the purpose of the variable is completely defeated. No. Reading a stale value (false instead of true) is perfectly fine. The other direction cannot happen because the change from true to false can fundamentally happen only in a single-threaded context. And if you don't want to read a stale value, SYS_membarrier performed by the thread changing the value ensures that you won't. But that's not needed. > Is it really a significant restriction on implementations if they can > only set __libc_single_threaded to true from functions which clearly > imply an acquire barrier? I don't think so. pthread_join qualifies, > and that seems the most important case/ Again doing it only from pthread_join will completely omit the optimization in any program that waits to join until is has no further work to do, which is a very reasonable thing to do. > > The cost of volatile is rather trivial; compiler can still cache > > address of it, and loading from a cache line you recently loaded from, > > and to which no stores are happening, is virtually free. > > There is a code size impact because the compiler can no longer merge > several atomic/non-atomic alternatives. Consider this example, which I > think it is not too unrealistic (hopefully, std::shared_ptr will > eventually compile down to something similar): > > #include <stddef.h> > > struct refcounted > { > size_t count; > }; > > extern char __libc_single_threaded; > > static inline > void ref (struct refcounted *p) > { > if (__libc_single_threaded) > ++p->count; > else > __atomic_fetch_add (&p->count, 1, __ATOMIC_RELAXED); > } > > > void f1 (struct refcounted *a, struct refcounted *b); > > void > f2 (struct refcounted *a, struct refcounted *b) > { > /* f1 takes ownership of both objects, but the caller of f2 is > expected to retain (shared) ownership as well. */ > ref (a); > ref (b); > f1 (a, b); > } > > As written, this turns into: > > f2: > cmpb $0, __libc_single_threaded(%rip) > je .L2 > addq $1, (%rdi) > ..L3: > addq $1, (%rsi) > jmp f1 > ..L2: > lock addq $1, (%rdi) > cmpb $0, __libc_single_threaded(%rip) > jne .L3 > lock addq $1, (%rsi) > jmp f1 > > The jump back to .L3 is due to the current reluctance to optimize > relaxed MO accesses, due to the brokenness of the C++ memory model > (although it would be quite safe here, I assume). > > With volatile, it's this instead: > > f2: > movzbl __libc_single_threaded(%rip), %eax > testb %al, %al > je .L2 > addq $1, (%rdi) > movzbl __libc_single_threaded(%rip), %eax > testb %al, %al > je .L4 > ..L7: > addq $1, (%rsi) > jmp f1 > ..L2: > lock addq $1, (%rdi) > movzbl __libc_single_threaded(%rip), %eax > testb %al, %al > jne .L7 > ..L4: > lock addq $1, (%rsi) > jmp f1 > > So two checks in the non-atomic case, plus a load for the volatile > access. I see; indeed for something on this small a scale, it *might* matter. Has anyone done any measurement to determine whether it does? Rich
On 03/06/2020 12:48, Florian Weimer wrote: > * Adhemerval Zanella: > >>> I'm going to add this to the manual as an implementation note, after the >>> first example: >>> >>> @c Note: No memory order on __libc_single_threaded. The >>> @c implementation must ensure that exit of the critical >>> @c (second-to-last) thread happens-before setting >>> @c __libc_single_threaded to true. Otherwise, acquire MO might be >>> @c needed for reading the variable in some scenarios, and that would >>> @c completely defeat its purpose. >> >> The comments is sound, but I still think we should properly document >> that this initial version does not attempt to update >> __libc_single_threaded on pthread_join or detach exit and maybe also >> the brief explanation you added on why this semantic was chose (to >> avoid the requirement of more strict MO). > > I'm concerned that if we make the implementation too transparent, > programmers will read the explanation and say, “gosh, I better assign > true to __libc_single_threaded after I joined the last thread”. That's > not something we want to encourage. In fact from your discussion with Rich I think we really should be transparent about its semantic, why he have chose the qualifiers (whether we use volatile or not), how glibc updates internally, and the expected usage patterns (for instance program are not expected to change its value). > >>> For detached thread exits, this kind of synchronization may not be >>> easily obtainable in all cases. I don't think we can do it on the >>> on-thread exit path because the kernel will perform certain actions >>> afterwards (like robust mutex updates), no matter how late we do it. I >>> guess we could perhaps piggy-back on the stack reclamation mechanism. >> >> It seems that robust mutexes updates are indeed a problem, but I am not >> sure if CLONE_CHILD_CLEARTID clear helps here. It signals the thread >> is done with the memory synchronization, but the stack cache is not >> really updated. Maybe an extra clone3 flag ? > > I thought we might piggy-back on the work that free_stacks does. But > the code is sufficiently convoluted that I'm no longer sure. I am not sure that the free_stack pattern is really helpful without some memory synchronization, such as SYS_membarrier. My idea of the clone3 flag would be similar to CLONE_CHILD_{CLEAR,SET}TID where kernel does the heavy lifting of the memory synchronization to update the total number of threads.
diff --git a/manual/threads.texi b/manual/threads.texi index a425635179..d4c261a0e9 100644 --- a/manual/threads.texi +++ b/manual/threads.texi @@ -627,6 +627,7 @@ the standard. threads in a process. * Waiting with Explicit Clocks:: Functions for waiting with an explicit clock specification. +* Single-Threaded:: Detecting single-threaded execution. @end menu @node Default Thread Attributes @@ -771,6 +772,94 @@ Behaves like @code{pthread_timedjoin_np} except that the absolute time in @var{abstime} is measured against the clock specified by @var{clockid}. @end deftypefun +@node Single-Threaded +@subsubsection Detecting Single-Threaded Execution + +Multi-threaded programs require synchronization among threads. This +synchronization can be costly even if there is just a single thread +and no data is shared between multiple processors. @Theglibc{} offers +an interface to detect whether the process is in single-threaded mode. +Applications can use this information to avoid synchronization, for +example by using regular instructions to load and store memory instead +of atomic instructions, or using relaxed memory ordering instead of +stronger memory ordering. + +@deftypevar char __libc_single_threaded +@standards{GNU, sys/single_threaded.h} +This variable is non-zero if the current process is definitely +single-threaded. If it is zero, the process can be multi-threaded, +or @theglibc{} cannot determine at this point of the program execution +whether the process is single-threaded or not. + +Applications must never write to this variable. +@end deftypevar + +Most applications should perform the same actions whether or not +@code{__libc_single_threaded} is true, except with less +synchronization. If this rule is followed, a process that +subsequently becomes multi-threaded is already in a consistent state. +For example, in order to increment a reference count, the following +code can be used: + +@smallexample +if (__libc_single_threaded) + atomic_fetch_add (&reference_count, 1, memory_order_relaxed); +else + atomic_fetch_add (&reference_count, 1, memory_order_acq_rel); +@end smallexample + +This still requires some form of synchronization on the +single-threaded branch, so it can be beneficial not to declare the +reference count as @code{_Atomic}, and use the GCC @code{__atomic} +built-ins. @xref{__atomic Builtins,, Built-in Functions for Memory +Model Aware Atomic Operations, gcc, Using the GNU Compiler Collection +(GCC)}. Then the code to increment a reference count looks like this: + +@smallexample +if (__libc_single_threaded) + ++refeference_count;inf +else + __atomic_fetch_add (&reference_count, 1, __ATOMIC_ACQ_REL); +@end smallexample + +(Depending on the data associated with the reference count, it may be +possible to use the weaker @code{__ATOMIC_RELAXED} memory ordering on +the multi-threaded branch.) + +Several functions in @theglibc{} can change the value of the +@code{__libc_single_threaded} variable. For example, creating new +threads using the @code{pthread_create} or @code{thrd_create} function +sets the variable to false. This can also happen directly, say via a +call to @code{dlopen}. Therefore, applications need to make a copy of +the value of @code{__libc_single_threaded} if after such a function +call, behavior must match the value as it was before the call, like +this: + +@smallexample +bool single_threaded = __libc_single_threaded; +if (single_threaded) + prepare_single_threaded (); +else + prepare_multi_thread (); + +void *handle = dlopen (shared_library_name, RTLD_NOW); +lookup_symbols (handle); + +if (single_threaded) + cleanup_single_threaded (); +else + cleanup_multi_thread (); +@end smallexample + +Since the value of @code{__libc_single_threaded} can change from true +to false during the execution of the program, it is not useful for +selecting optimized function implementations in IFUNC resolvers. + +Atomic operations can also be used on mappings shared among +single-threaded processes. This means that a compiler cannot use +@code{__libc_single_threaded} to optimize atomic operations, unless it +is able to prove that the memory is not shared. + @c FIXME these are undocumented: @c pthread_atfork @c pthread_attr_destroy