Unify pthread_once (bug 15215)

Message ID 1396874251.10643.8736.camel@triegel.csb
State Committed

Commit Message

Torvald Riegel April 7, 2014, 12:37 p.m. UTC
  On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
> 
> We would like to unify all C-based pthread_once implementations
> per the plan in bug 15215 for glibc 2.20.
> 
> Your machines are on the list of C-based pthread_once implementations.
> 
> See this for the initial discussions on the unified pthread_once:
> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
> 
> The goal is to provide a single and correct C implementation of 
> pthread_once. Architectures can then build on that if they need more 
> optimal implementations, but I don't encourage that and I'd rather
> see deep discussions on how to make one unified solution where
> possible.
> 
> I've also just reviewed Torvald's new pthread_once microbenchmark which
> you can use to compare your previous C implementation with the new
> standard C implementation (measures pthread_once latency). The primary
> use of this test is to help provide objective proof for or against the
> i386 and x86_64 assembly implementations.
> 
> We are not presently converting any of the machines with custom
> implementations, but that will be a next step after testing with the
> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
> 
> If we don't hear any objections we will go forward with this change
> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
> and aarch64 on a single pthread_once implementation based on sparc's C
> implementation.

So far, I've seen an okay for tile, and a question about ARM.  Will, are
you okay with the change for ARM?

Any other objections to the updated patch that's attached?

> > +   When forking the process, some threads can be interrupted during the second
> > +   state; they won't be present in the forked child, so we need to restart
> > +   initialization in the child.  To distinguish an in-progress initialization
> > +   from an interrupted initialization (in which case we need to reclaim the
> > +   lock), we look at the fork generation that's part of the second state: We
> > +   can reclaim iff it differs from the current fork generation.
> > +   XXX: This algorithm has an ABA issue on the fork generation: If an
> > +   initialization is interrupted, we then fork 2^30 times (30b of once_control
> 
> What's "30b?" 30 bits? Please spell it out.
> 
> > +   are used for the fork generation), and try to initialize again, we can
> > +   deadlock because we can't distinguish the in-progress and interrupted cases
> > +   anymore.  */
> 
> Would you mind filing a bug for this in the upstream bugzilla?

https://sourceware.org/bugzilla/show_bug.cgi?id=16816

> It's a distinct bug from this unification work, but a valid problem.
> 
> Can this be fixed by detecting generation counter overflow in fork
> and failing the function call?

Yes, but this would prevent us from doing more than 2^30 fork calls.
That may not be a problem in practice -- but if so, then we won't hit
the ABA either :)
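
To make the wraparound concrete, here is a minimal sketch (not taken from
the patch).  It assumes a 32-bit once_control word whose low two bits hold
the state and whose upper 30 bits hold the fork generation, bumped in steps
of 4 per fork; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ONCE_IN_PROGRESS 1u	/* Bit 0: initialization in progress.  */
#define FORK_GEN_INCR 4u	/* Assumed step, keeps the low two bits free.  */

/* Value stored in once_control while an initializer runs.  */
uint32_t
once_in_progress_state (uint32_t fork_gen)
{
  assert ((fork_gen & 3u) == 0);
  return fork_gen | ONCE_IN_PROGRESS;
}

/* An in-progress value can be reclaimed iff its generation differs from
   the current one, i.e. the initializer was cut off by a fork.  */
int
once_can_reclaim (uint32_t once_val, uint32_t fork_gen)
{
  return (once_val & ONCE_IN_PROGRESS) != 0
	 && once_val != once_in_progress_state (fork_gen);
}

/* Generation after FORKS further forks; wraps modulo 2^32.  */
uint32_t
fork_generation_after (uint32_t fork_gen, uint64_t forks)
{
  return (uint32_t) (fork_gen + forks * FORK_GEN_INCR);
}
```

After one fork the stale value is reclaimable; after exactly 2^30 forks the
generation is back where it started, so the stale value looks like a live
in-progress initialization and a waiter would block forever.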

> > +      do
> > +	{
> > +	  /* Check if the initialization has already been done.  */
> > +	  if (__builtin_expect ((val & 2) != 0, 1))
> 
> Use __glibc_likely.
> 
> e.g. if (__glibc_likely ((val & 2) != 0))
> 
> This is the fast path that we are testing for in the microbenchmark?

Yes.

> > +	    return 0;
> > +
> > +	  oldval = val;
> > +	  /* We try to set the state to in-progress and having the current
> > +	     fork generation.  We don't need atomic accesses for the fork
> > +	     generation because it's immutable in a particular process, and
> > +	     forked child processes start with a single thread that modified
> > +	     the generation.  */
> > +	  newval = __fork_generation | 1;
> 
> OT: I wonder if Valgrind will report a benign race in accessing __fork_generation.

Perhaps.  I believe that eventually, lots of this and similar variables
should be atomic-typed and/or accessed with relaxed-memory-order atomic
loads.  This would clarify that we expect concurrent accesses and that
they don't constitute a data race.
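
As a rough illustration of what that could look like with C11 atomics (a
sketch only; the variable and function names are hypothetical stand-ins and
not glibc's, and the step of 4 per fork is an assumption):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for __fork_generation, declared atomic so that
   concurrent readers are not a data race in the C11 sense.  */
static atomic_ulong fork_generation_sketch;

/* Relaxed is enough here: readers only need some valid value of the
   counter, not ordering with respect to other memory.  */
unsigned long
read_fork_generation (void)
{
  return atomic_load_explicit (&fork_generation_sketch,
			       memory_order_relaxed);
}

/* What the single thread in a freshly forked child would do.  */
void
child_after_fork (void)
{
  unsigned long old
    = atomic_fetch_add_explicit (&fork_generation_sketch, 4,
				 memory_order_relaxed);
  assert ((old & 3ul) == 0);	/* Low two bits stay free for the state.  */
}

/* In-progress value as computed in the pthread_once loop.  */
unsigned long
in_progress_value (void)
{
  return read_fork_generation () | 1;
}
```

On common targets a relaxed load of a naturally aligned word should compile
to a plain load, so the intent is documentation for readers and tools rather
than a change in generated code.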


[BZ #15215]
* nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c: Moved to ...
* nptl/sysdeps/unix/sysv/linux/pthread_once.c: ... here.  Add missing
memory barriers.  Add comments.
* sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c: Remove file.
* sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c: Remove file.
* sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c: Remove file.
* sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c: Remove file.
* sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c: Remove file.
* sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c: Remove file.

Changelog.hppa:
	[BZ #15215]
	* sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c: Remove file.
  

Comments

Will Newton April 7, 2014, 12:46 p.m. UTC | #1
On 7 April 2014 13:37, Torvald Riegel <triegel@redhat.com> wrote:
> On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
>> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
>>
>> We would like to unify all C-based pthread_once implementations
>> per the plan in bug 15215 for glibc 2.20.
>>
>> Your machines are on the list of C-based pthread_once implementations.
>>
>> See this for the initial discussions on the unified pthread_once:
>> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
>>
>> The goal is to provide a single and correct C implementation of
>> pthread_once. Architectures can then build on that if they need more
>> optimal implementations, but I don't encourage that and I'd rather
>> see deep discussions on how to make one unified solution where
>> possible.
>>
>> I've also just reviewed Torvald's new pthread_once microbenchmark which
>> you can use to compare your previous C implementation with the new
>> standard C implementation (measures pthread_once latency). The primary
>> use of this test is to help provide objective proof for or against the
>> i386 and x86_64 assembly implementations.
>>
>> We are not presently converting any of the machines with custom
>> implementations, but that will be a next step after testing with the
>> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
>>
>> If we don't hear any objections we will go forward with this change
>> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
>> and aarch64 on a single pthread_once implementation based on sparc's C
>> implementation.
>
> So far, I've seen an okay for tile, and a question about ARM.  Will, are
> you okay with the change for ARM?

From a correctness and maintainability standpoint it looks good. I
have concerns about the performance but I will leave that call to the
respective ARM and AArch64 maintainers.

In your original post you speculate it may be possible to improve
performance on ARM:

"I'm currently also using the existing atomic_{read/write}_barrier
functions instead of not-yet-existing load_acq or store_rel functions.
I'm not sure whether the latter can have somewhat more efficient
implementations on Power and ARM; if so, and if you're concerned about
the overhead, we can add load_acq and store_rel to atomic.h and start
using it"

It would be interesting to know how much work that would be and what
the performance improvements might be like.

> Any other objections to the updated patch that's attached?
>
>> > +   When forking the process, some threads can be interrupted during the second
>> > +   state; they won't be present in the forked child, so we need to restart
>> > +   initialization in the child.  To distinguish an in-progress initialization
>> > +   from an interrupted initialization (in which case we need to reclaim the
>> > +   lock), we look at the fork generation that's part of the second state: We
>> > +   can reclaim iff it differs from the current fork generation.
>> > +   XXX: This algorithm has an ABA issue on the fork generation: If an
>> > +   initialization is interrupted, we then fork 2^30 times (30b of once_control
>>
>> What's "30b?" 30 bits? Please spell it out.
>>
>> > +   are used for the fork generation), and try to initialize again, we can
>> > +   deadlock because we can't distinguish the in-progress and interrupted cases
>> > +   anymore.  */
>>
>> Would you mind filing a bug for this in the upstream bugzilla?
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=16816
>
>> It's a distinct bug from this unification work, but a valid problem.
>>
>> Can this be fixed by detecting generation counter overflow in fork
>> and failing the function call?
>
> Yes, but this would prevent us from doing more than 2^30 fork calls.
> That may not be a problem in practice -- but if so, then we won't hit
> the ABA either :)
>
>> > +      do
>> > +   {
>> > +     /* Check if the initialization has already been done.  */
>> > +     if (__builtin_expect ((val & 2) != 0, 1))
>>
>> Use __glibc_likely.
>>
>> e.g. if (__glibc_likely ((val & 2) != 0))
>>
>> This is the fast path that we are testing for in the microbenchmark?
>
> Yes.
>
>> > +       return 0;
>> > +
>> > +     oldval = val;
>> > +     /* We try to set the state to in-progress and having the current
>> > +        fork generation.  We don't need atomic accesses for the fork
>> > +        generation because it's immutable in a particular process, and
>> > +        forked child processes start with a single thread that modified
>> > +        the generation.  */
>> > +     newval = __fork_generation | 1;
>>
>> OT: I wonder if Valgrind will report a benign race in accessing __fork_generation.
>
> Perhaps.  I believe that eventually, lots of this and similar variables
> should be atomic-typed and/or accessed with relaxed-memory-order atomic
> loads.  This would clarify that we expect concurrent accesses and that
> they don't constitute a data race.
>
>
> [BZ #15215]
> * nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c: Moved to ...
> * nptl/sysdeps/unix/sysv/linux/pthread_once.c: ... here.  Add missing
> memory barriers.  Add comments.
> * sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c: Remove file.
> * sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c: Remove file.
> * sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c: Remove file.
> * sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c: Remove file.
> * sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c: Remove file.
> * sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c: Remove file.
>
> Changelog.hppa:
>         [BZ #15215]
>         * sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c: Remove file.
>
  
Torvald Riegel April 7, 2014, 1:16 p.m. UTC | #2
On Mon, 2014-04-07 at 13:46 +0100, Will Newton wrote:
> On 7 April 2014 13:37, Torvald Riegel <triegel@redhat.com> wrote:
> > On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
> >> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
> >>
> >> We would like to unify all C-based pthread_once implementations
> >> per the plan in bug 15215 for glibc 2.20.
> >>
> >> Your machines are on the list of C-based pthread_once implementations.
> >>
> >> See this for the initial discussions on the unified pthread_once:
> >> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
> >>
> >> The goal is to provide a single and correct C implementation of
> >> pthread_once. Architectures can then build on that if they need more
> >> optimal implementations, but I don't encourage that and I'd rather
> >> see deep discussions on how to make one unified solution where
> >> possible.
> >>
> >> I've also just reviewed Torvald's new pthread_once microbenchmark which
> >> you can use to compare your previous C implementation with the new
> >> standard C implementation (measures pthread_once latency). The primary
> >> use of this test is to help provide objective proof for or against the
> >> i386 and x86_64 assembly implementations.
> >>
> >> We are not presently converting any of the machines with custom
> >> implementations, but that will be a next step after testing with the
> >> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
> >>
> >> If we don't hear any objections we will go forward with this change
> >> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
> >> and aarch64 on a single pthread_once implementation based on sparc's C
> >> implementation.
> >
> > So far, I've seen an okay for tile, and a question about ARM.  Will, are
> > you okay with the change for ARM?
> 
> From a correctness and maintainability standpoint it looks good. I
> have concerns about the performance but I will leave that call to the
> respective ARM and AArch64 maintainers.
> 
> In your original post you speculate it may be possible to improve
> performance on ARM:
> 
> "I'm currently also using the existing atomic_{read/write}_barrier
> functions instead of not-yet-existing load_acq or store_rel functions.
> I'm not sure whether the latter can have somewhat more efficient
> implementations on Power and ARM; if so, and if you're concerned about
> the overhead, we can add load_acq and store_rel to atomic.h and start
> using it"
> 
> It would be interesting to know how much work that would be and what
> the performance improvements might be like.

I had a quick look at the arm and aarch64 barrier definitions, and they
only define a full barrier, but not separate read / write barriers.
That is part of the performance problem I believe, since a full barrier
should be significantly more costly than an acquire barrier.

I guess read/write barriers as used in glibc are semantically equivalent
to acquire / release as in C11, but I'm not quite sure given that some
architectures use stronger barriers for read/write than acquire/release.
Cleaning that up would require review of plenty of code.  But one could
start incrementally as well by not changing existing barrier definitions
and reviewing uses one by one.  In the long term, I think we would
benefit from using C11 atomics throughout glibc; in some cases, existing
custom assembly might be faster (e.g., that has been one comment
regarding, IIRC, powerpc low-level locks) -- but maybe we can achieve
this with custom memory orders for atomics as well, or something
similar.
In any case, cleaning this up is not specific to pthread_once.

Second, suggested mappings from C11 acquire/release to arm
(http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) show differences
for acquire loads and acquire barriers, but I don't know whether these
would result in a performance difference.

I'd appreciate input from architecture maintainers, especially from
those maintaining archs with weaker memory models such as arm.
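
For comparison, here is a sketch of the two styles in C11 terms.  It assumes
the read/write barriers map to acquire/release fences, which is the guess
made above and not something verified per architecture; the names are
illustrative and none of this is from the patch.

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;		/* Data written by the initializer.  */
static atomic_int once_state;	/* 0 = not initialized, 2 = finished.  */

/* Fence style, shaped like the patch: barrier, then a plain store.  */
void
publish_with_fence (int value)
{
  payload = value;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&once_state, 2, memory_order_relaxed);
}

/* Operation style: a single store-release.  */
void
publish_with_store_release (int value)
{
  payload = value;
  atomic_store_explicit (&once_state, 2, memory_order_release);
}

/* Fence style fast path: plain load, then an acquire fence.  */
int
consume_with_fence (int *out)
{
  assert (out != 0);
  int val = atomic_load_explicit (&once_state, memory_order_relaxed);
  atomic_thread_fence (memory_order_acquire);
  if ((val & 2) == 0)
    return 0;
  *out = payload;
  return 1;
}

/* Operation style fast path: a single load-acquire.  */
int
consume_with_load_acquire (int *out)
{
  assert (out != 0);
  int val = atomic_load_explicit (&once_state, memory_order_acquire);
  if ((val & 2) == 0)
    return 0;
  *out = payload;
  return 1;
}
```

Both pairs give the same guarantee to a reader that observes the finished
state; whether the operation style is cheaper on a given machine is exactly
the open question, so this should be measured rather than assumed.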
  
Will Newton April 7, 2014, 1:50 p.m. UTC | #3
On 7 April 2014 14:16, Torvald Riegel <triegel@redhat.com> wrote:
> On Mon, 2014-04-07 at 13:46 +0100, Will Newton wrote:
>> On 7 April 2014 13:37, Torvald Riegel <triegel@redhat.com> wrote:
>> > On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
>> >> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
>> >>
>> >> We would like to unify all C-based pthread_once implementations
>> >> per the plan in bug 15215 for glibc 2.20.
>> >>
>> >> Your machines are on the list of C-based pthread_once implementations.
>> >>
>> >> See this for the initial discussions on the unified pthread_once:
>> >> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
>> >>
>> >> The goal is to provide a single and correct C implementation of
>> >> pthread_once. Architectures can then build on that if they need more
>> >> optimal implementations, but I don't encourage that and I'd rather
>> >> see deep discussions on how to make one unified solution where
>> >> possible.
>> >>
>> >> I've also just reviewed Torvald's new pthread_once microbenchmark which
>> >> you can use to compare your previous C implementation with the new
>> >> standard C implementation (measures pthread_once latency). The primary
>> >> use of this test is to help provide objective proof for or against the
>> >> i386 and x86_64 assembly implementations.
>> >>
>> >> We are not presently converting any of the machines with custom
>> >> implementations, but that will be a next step after testing with the
>> >> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
>> >>
>> >> If we don't hear any objections we will go forward with this change
>> >> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
>> >> and aarch64 on a single pthread_once implementation based on sparc's C
>> >> implementation.
>> >
>> > So far, I've seen an okay for tile, and a question about ARM.  Will, are
>> > you okay with the change for ARM?
>>
>> From a correctness and maintainability standpoint it looks good. I
>> have concerns about the performance but I will leave that call to the
>> respective ARM and AArch64 maintainers.
>>
>> In your original post you speculate it may be possible to improve
>> performance on ARM:
>>
>> "I'm currently also using the existing atomic_{read/write}_barrier
>> functions instead of not-yet-existing load_acq or store_rel functions.
>> I'm not sure whether the latter can have somewhat more efficient
>> implementations on Power and ARM; if so, and if you're concerned about
>> the overhead, we can add load_acq and store_rel to atomic.h and start
>> using it"
>>
>> It would be interesting to know how much work that would be and what
>> the performance improvements might be like.
>
> I had a quick look at the arm and aarch64 barrier definitions, and they
> only define a full barrier, but not separate read / write barriers.
> That is part of the performance problem I believe, since a full barrier
> should be significantly more costly than an acquire barrier.
>
> I guess read/write barriers as used in glibc are semantically equivalent
> to acquire / release as in C11, but I'm not quite sure given that some
> architectures use stronger barriers for read/write than acquire/release.
> Cleaning that up would require review of plenty of code.  But one could
> start incrementally as well by not changing existing barrier definitions
> and reviewing uses one by one.  In the long term, I think we would
> benefit from using C11 atomics throughout glibc; in some cases, existing
> custom assembly might be faster (e.g., that has been one comment
> regarding, IIRC, powerpc low-level locks) -- but maybe we can achieve
> this with custom memory orders for atomics as well, or something
> similar.
> In any case, cleaning this up is not specific to pthread_once.
>
> Second, suggested mappings from C11 acquire/release to arm
> (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) show differences
> for acquire loads and acquire barriers, but I don't know whether these
> would result in a performance difference.

ARMv8 (ARM and AArch64) defines load-acquire and store-release
instructions, so for these systems we can do better than dmb. Hopefully
we can just use the C11 API to access them but I haven't tested to see
if gcc can actually do the right thing...
  
Mike Frysinger April 8, 2014, 11:13 p.m. UTC | #4
ia64 parts look reasonable.  i haven't tested them, but i can just wait until
the next time i build/run tests ;).
-mike
  
Carlos O'Donell April 10, 2014, 9:38 p.m. UTC | #5
On 04/07/2014 08:37 AM, Torvald Riegel wrote:
> On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
>> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
>>
>> We would like to unify all C-based pthread_once implementations
>> per the plan in bug 15215 for glibc 2.20.
>>
>> Your machines are on the list of C-based pthread_once implementations.
>>
>> See this for the initial discussions on the unified pthread_once:
>> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
>>
>> The goal is to provide a single and correct C implementation of 
>> pthread_once. Architectures can then build on that if they need more 
>> optimal implementations, but I don't encourage that and I'd rather
>> see deep discussions on how to make one unified solution where
>> possible.
>>
>> I've also just reviewed Torvald's new pthread_once microbenchmark which
>> you can use to compare your previous C implementation with the new
>> standard C implementation (measures pthread_once latency). The primary
>> use of this test is to help provide objective proof for or against the
>> i386 and x86_64 assembly implementations.
>>
>> We are not presently converting any of the machines with custom
>> implementations, but that will be a next step after testing with the
>> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
>>
>> If we don't hear any objections we will go forward with this change
>> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
>> and aarch64 on a single pthread_once implementation based on sparc's C
>> implementation.

This version looks good to me.

Please check it in after fixing the one nit where you needed double 
space after a period.

Follow this if you're rusty:
https://sourceware.org/glibc/wiki/Committer%20checklist

> So far, I've seen an okay for tile, and a question about ARM.  Will, are
> you okay with the change for ARM?
> 
> Any other objections to the updated patch that's attached?

As I mentioned in my other email, this cleanup is only for targets using
generic C code implementations which we have shown to be missing barriers.
Converting these C-only targets is the right thing to do. Converting the
assembly implementations is going to be more work.

Next steps:
* Check this in.
* Send another notification to the maintainers about the change.
  - This gives them another chance to look at the benchmark numbers.
* Work with any arch maintainers to look at performance losses.

>>> +   When forking the process, some threads can be interrupted during the second
>>> +   state; they won't be present in the forked child, so we need to restart
>>> +   initialization in the child.  To distinguish an in-progress initialization
>>> +   from an interrupted initialization (in which case we need to reclaim the
>>> +   lock), we look at the fork generation that's part of the second state: We
>>> +   can reclaim iff it differs from the current fork generation.
>>> +   XXX: This algorithm has an ABA issue on the fork generation: If an
>>> +   initialization is interrupted, we then fork 2^30 times (30b of once_control
>>
>> What's "30b?" 30 bits? Please spell it out.
>>
>>> +   are used for the fork generation), and try to initialize again, we can
>>> +   deadlock because we can't distinguish the in-progress and interrupted cases
>>> +   anymore.  */
>>
>> Would you mind filing a bug for this in the upstream bugzilla?
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=16816
> 
>> It's a distinct bug from this unification work, but a valid problem.
>>
>> Can this be fixed by detecting generation counter overflow in fork
>> and failing the function call?
> 
> Yes, but this would prevent us from doing more than 2^30 fork calls.
> That may not be a problem in practice -- but if so, then we won't hit
> the ABA either :)

It's probably not a problem, because 2^30 forks of even a 1MB process
is going to need 1 Petabyte or more of memory/swap, but still...

A security issue is introduced here in that early corruption of the fork
generation counter could lead to deadlock. We close that window slightly
by doing a sanity check on the generation counter to detect overflow.
It doesn't fix all cases, but it means you can't easily corrupt the gen
counter early and then wait for the fork to deadlock. You now need to
corrupt the fork generation counter after the check which is a smaller
window.

Either way I think an assert on overflow in fork.c is needed, but that's
another fix that I expect you to submit after this one. Note that the
implementation is in: nptl/sysdeps/unix/sysv/linux/fork.c, and the
limit of 2^30 forks only applies to applications linked against libpthread
which provides a strong definition of fork that overrides libc's weak
definition (which does a lot less). In the dynamic case libpthread's
version of fork is used because it is loaded first since it depends on libc
(remember that weak/strong are not applied to dynamic libraries per ELF
rules).
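
A sketch of the kind of overflow check I have in mind follows.  The names
are hypothetical, and it assumes a 32-bit generation bumped in steps of 4;
the real counter and the real code in fork.c may differ.

```c
#include <assert.h>
#include <stdint.h>

#define FORK_GEN_INCR 4u	/* Assumed step per fork.  */

/* Nonzero if one more bump would wrap the generation counter.  */
int
fork_generation_would_wrap (uint32_t gen)
{
  return gen > UINT32_MAX - FORK_GEN_INCR;
}

/* Bump the generation for a new child.  Returns 0 on success, or -1
   and leaves *GEN untouched if the counter is exhausted.  */
int
bump_fork_generation (uint32_t *gen)
{
  assert (gen != 0);
  if (fork_generation_would_wrap (*gen))
    return -1;
  *gen += FORK_GEN_INCR;
  return 0;
}
```

The caller in fork could then either assert on the -1 case or fail the fork
call; either way the counter never silently returns to an old value.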

>>> +      do
>>> +	{
>>> +	  /* Check if the initialization has already been done.  */
>>> +	  if (__builtin_expect ((val & 2) != 0, 1))
>>
>> Use __glibc_likely.
>>
>> e.g. if (__glibc_likely ((val & 2) != 0))
>>
>> This is the fast path that we are testing for in the microbenchmark?
> 
> Yes.

Good.

>>> +	    return 0;
>>> +
>>> +	  oldval = val;
>>> +	  /* We try to set the state to in-progress and having the current
>>> +	     fork generation.  We don't need atomic accesses for the fork
>>> +	     generation because it's immutable in a particular process, and
>>> +	     forked child processes start with a single thread that modified
>>> +	     the generation.  */
>>> +	  newval = __fork_generation | 1;
>>
>> OT: I wonder if Valgrind will report a benign race in accessing __fork_generation.
> 
> Perhaps.  I believe that eventually, lots of this and similar variables
> should be atomic-typed and/or accessed with relaxed-memory-order atomic
> loads.  This would clarify that we expect concurrent accesses and that
> they don't constitute a data race.

I don't see how Valgrind would know this from the binary itself, but
I guess this will just need to have per-glibc-version exceptions for
Valgrind.

Your ChangeLog still needs to follow the normal format, including
header line with date and name, blank line, and tab before text on
lines thereafter.

e.g.

2014-04-07  Torvald Riegel  <triegel@redhat.com>

	[BZ #15215]
	* nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c: Moved to ...
	* nptl/sysdeps/unix/sysv/linux/pthread_once.c: ... here.  Add missing
	memory barriers.  Add comments.
	* sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c: Remove file.
	* sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c: Remove file.

Changelog.hppa:

2014-04-07  Torvald Riegel  <triegel@redhat.com>

	[BZ #15215]
	* sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c: Remove file.

Don't forget to update NEWS

> diff --git a/nptl/sysdeps/unix/sysv/linux/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
> new file mode 100644
> index 0000000..8453d2d
> --- /dev/null
> +++ b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
> @@ -0,0 +1,131 @@
> +/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "pthreadP.h"
> +#include <lowlevellock.h>
> +#include <atomic.h>
> +
> +
> +unsigned long int __fork_generation attribute_hidden;
> +
> +
> +static void
> +clear_once_control (void *arg)
> +{
> +  pthread_once_t *once_control = (pthread_once_t *) arg;
> +
> +  /* Reset to the uninitialized state here.  We don't need a stronger memory
> +     order because we do not need to make any other of our writes visible to
> +     other threads that see this value: This function will be called if we
> +     get interrupted (see __pthread_once), so all we need to relay to other
> +     threads is the state being reset again.  */

OK.

> +  *once_control = 0;
> +  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
> +}
> +
> +
> +/* This is similar to a lock implementation, but we distinguish between three
> +   states: not yet initialized (0), initialization finished (2), and
> +   initialization in progress (__fork_generation | 1).  If in the first state,
> +   threads will try to run the initialization by moving to the second state;
> +   the first thread to do so via a CAS on once_control runs init_routine,
> +   other threads block.
> +   When forking the process, some threads can be interrupted during the second
> +   state; they won't be present in the forked child, so we need to restart
> +   initialization in the child.  To distinguish an in-progress initialization
> +   from an interrupted initialization (in which case we need to reclaim the
> +   lock), we look at the fork generation that's part of the second state: We
> +   can reclaim iff it differs from the current fork generation.
> +   XXX: This algorithm has an ABA issue on the fork generation: If an
> +   initialization is interrupted, we then fork 2^30 times (30 bits of
> +   once_control are used for the fork generation), and try to initialize
> +   again, we can deadlock because we can't distinguish the in-progress and
> +   interrupted cases anymore.  */

OK.

> +int
> +__pthread_once (once_control, init_routine)
> +     pthread_once_t *once_control;
> +     void (*init_routine) (void);
> +{
> +  while (1)
> +    {
> +      int oldval, val, newval;
> +
> +      /* We need acquire memory order for this load because if the value
> +         signals that initialization has finished, we need to see any
> +         data modifications done during initialization.  */
> +      val = *once_control;
> +      atomic_read_barrier();
> +      do
> +	{
> +	  /* Check if the initialization has already been done.  */
> +	  if (__glibc_likely ((val & 2) != 0))

OK.

> +	    return 0;
> +
> +	  oldval = val;
> +	  /* We try to set the state to in-progress and having the current
> +	     fork generation.  We don't need atomic accesses for the fork
> +	     generation because it's immutable in a particular process, and
> +	     forked child processes start with a single thread that modified
> +	     the generation.  */
> +	  newval = __fork_generation | 1;
> +	  /* We need acquire memory order here for the same reason as for the
> +	     load from once_control above.  */
> +	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
> +						     oldval);
> +	}
> +      while (__glibc_unlikely (val != oldval));

OK.

> +
> +      /* Check if another thread already runs the initializer.	*/
> +      if ((oldval & 1) != 0)
> +	{
> +	  /* Check whether the initializer execution was interrupted by a
> +	     fork. We know that for both values, bit 0 is set and bit 1 is

s/. We/.  We/g

> +	     not.  */
> +	  if (oldval == newval)
> +	    {
> +	      /* Same generation, some other thread was faster. Wait.  */
> +	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
> +	      continue;
> +	    }
> +	}
> +
> +      /* This thread is the first here.  Do the initialization.
> +	 Register a cleanup handler so that in case the thread gets
> +	 interrupted the initialization can be restarted.  */
> +      pthread_cleanup_push (clear_once_control, once_control);

OK.

> +
> +      init_routine ();
> +
> +      pthread_cleanup_pop (0);
> +
> +
> +      /* Mark *once_control as having finished the initialization.  We need
> +         release memory order here because we need to synchronize with other
> +         threads that want to use the initialized data.  */
> +      atomic_write_barrier();
> +      *once_control = 2;

OK.

> +
> +      /* Wake up all other threads.  */
> +      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);

OK.

> +      break;
> +    }
> +
> +  return 0;
> +}
> +weak_alias (__pthread_once, pthread_once)
> +hidden_def (__pthread_once)

[snip removal of other pthread_once implementations]

Cheers,
Carlos.
  
Torvald Riegel April 11, 2014, 2:19 p.m. UTC | #6
On Thu, 2014-04-10 at 17:38 -0400, Carlos O'Donell wrote:
> On 04/07/2014 08:37 AM, Torvald Riegel wrote:
> > On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
> >> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
> >>
> >> We would like to unify all C-based pthread_once implementations
> >> per the plan in bug 15215 for glibc 2.20.
> >>
> >> Your machines are on the list of C-based pthread_once implementations.
> >>
> >> See this for the initial discussions on the unified pthread_once:
> >> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
> >>
> >> The goal is to provide a single and correct C implementation of 
> >> pthread_once. Architectures can then build on that if they need more 
> >> optimal implementations, but I don't encourage that and I'd rather
> >> see deep discussions on how to make one unified solution where
> >> possible.
> >>
> >> I've also just reviewed Torvald's new pthread_once microbenchmark which
> >> you can use to compare your previous C implementation with the new
> >> standard C implementation (measures pthread_once latency). The primary
> >> use of this test is to help provide objective proof for or against the
> >> i386 and x86_64 assembly implementations.
> >>
> >> We are not presently converting any of the machines with custom
> >> implementations, but that will be a next step after testing with the
> >> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
> >>
> >> If we don't hear any objections we will go forward with this change
> >> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
> >> and aarch64 on a single pthread_once implementation based on sparc's C
> >> implementation.
> 
> This version looks good to me.
> 
> Please check it in after fixing the one nit where you needed double 
> space after a period.

Committed.

[...]

> * Send another notification to the maintainers about the change.
>   - This gives them another chance to look at the benchmark numbers.
> * Work with any arch maintainers to look at performance losses.

Please get in touch with me if you feel your arch has been affected in a
negative way.

> >>> +   When forking the process, some threads can be interrupted during the second
> >>> +   state; they won't be present in the forked child, so we need to restart
> >>> +   initialization in the child.  To distinguish an in-progress initialization
> >>> +   from an interrupted initialization (in which case we need to reclaim the
> >>> +   lock), we look at the fork generation that's part of the second state: We
> >>> +   can reclaim iff it differs from the current fork generation.
> >>> +   XXX: This algorithm has an ABA issue on the fork generation: If an
> >>> +   initialization is interrupted, we then fork 2^30 times (30b of once_control
> >>
> >> What's "30b?" 30 bits? Please spell it out.
> >>
> >>> +   are used for the fork generation), and try to initialize again, we can
> >>> +   deadlock because we can't distinguish the in-progress and interrupted cases
> >>> +   anymore.  */
> >>
> >> Would you mind filing a bug for this in the upstream bugzilla?
> > 
> > https://sourceware.org/bugzilla/show_bug.cgi?id=16816
> > 
> >> It's a distinct bug from this unification work, but a valid problem.
> >>
> >> Can this be fixed by detecting generation counter overflow in fork
> >> and failing the function call?
> > 
> > Yes, but this would prevent us from doing more than 2^30 fork calls.
> > That may not be a problem in practice -- but if so, then we won't hit
> > the ABA either :)
> 
> It's probably not a problem, because 2^30 forks of even a 1MB process
> is going to need 1 Petabyte or more of memory/swap, but still...

The forked processes don't need to be still running, right?  So the
overall time necessary to do the forks might be a more useful bound.
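
To make the 2^30 bound concrete, here is a minimal C sketch of the
wraparound and of the time-based bound.  It assumes that the generation
advances by 4 per fork (so that the two low bits of once_control stay free
for the state flags) and that once_control is 32 bits wide.  GEN_INCR,
in_progress_state, generation_after_forks and seconds_until_wrap are
hypothetical names used for illustration only, not glibc identifiers, and
the fork rate is an input, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each fork advances the generation by 4, leaving bits 0 and 1
   of once_control for the in-progress and done flags.  */
#define GEN_INCR 4u

/* In-progress state as it would be stored in a 32-bit once_control.  */
static uint32_t
in_progress_state (uint32_t fork_generation)
{
  return fork_generation | 1u;
}

/* Generation seen after FORKS further forks, with 32-bit wraparound.  */
static uint32_t
generation_after_forks (uint32_t fork_generation, uint64_t forks)
{
  return (uint32_t) (fork_generation + forks * GEN_INCR);
}

/* Seconds needed for 2^30 forks at a sustained rate of FORKS_PER_SECOND.  */
static double
seconds_until_wrap (double forks_per_second)
{
  assert (forks_per_second > 0.0);
  return (double) (UINT64_C (1) << 30) / forks_per_second;
}
```

With these definitions, generation_after_forks (8, UINT64_C (1) << 30)
yields 8 again, so a stale in-progress value cannot be told apart from a
current one.  At an assumed 1024 forks per second, seconds_until_wrap
gives 1048576 seconds, which is roughly 12 days.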

> A security issue is introduced here in that early corruption of the fork
> generation counter could lead to deadlock. We close that window slightly
> by doing a sanity check on the generation counter to detect overflow.
> It doesn't fix all cases, but it means you can't easily corrupt the gen
> counter early and then wait for the fork to deadlock. You now need to
> corrupt the fork generation counter after the check which is a smaller
> window.

I'm not sure this would really help.  The assert would terminate the
program, whereas the deadlock would prevent forward progress.  Neither
option is always better than the other.  The deadlock on pthread_once
might actually be better.
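
For reference, the overflow check under discussion could look roughly like
the following.  This is only a sketch under assumptions: it treats the
generation as a 32-bit value that advances by 4 per fork, and
next_fork_generation is a hypothetical helper, not code from fork.c.  It
reports the overflow to the caller and leaves the policy (assert, fail the
fork, or carry on) open, since that is exactly what is undecided here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the generation a forked child would use.  Returns false and
   leaves *NEXT untouched if advancing by 4 would wrap around.  */
static bool
next_fork_generation (uint32_t current, uint32_t *next)
{
  assert (next != NULL);
  if (current > UINT32_MAX - 4u)
    return false;
  *next = current + 4u;
  return true;
}
```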

There are other options to avoid the ABA issue, but all come with a bit
of complexity:
* Don't allow concurrent fork with pthread_once.  This is equivalent to
a reader/writer lock scheme with pthread_once being the readers, and
writers having priority.  The readers can either increment one global
var, or have per-thread flags that a fork scans.
* Block fork before overflow until no pthread_once is active. A
variation of the previous option.
* Rely on getting 64-bit mutexes eventually.  Those would be useful for the
condvars too.
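
The first option in the list above can be sketched with C11 atomics as
follows.  This is a simplified illustration, not a proposal for the actual
implementation: it assumes at most one thread forks at a time, it spins
with sched_yield instead of blocking on a futex, it uses the default
sequentially consistent atomics throughout, and it ignores the case of
init_routine itself calling fork.  The names once_active, fork_pending,
once_enter, once_exit, fork_begin and fork_end are all hypothetical.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int once_active;    /* Readers: initializations in flight.  */
static atomic_bool fork_pending;  /* Writer: a fork wants exclusivity.  */

/* Called before running init_routine.  A pending fork has priority.  */
static void
once_enter (void)
{
  for (;;)
    {
      while (atomic_load (&fork_pending))
        sched_yield ();
      atomic_fetch_add (&once_active, 1);
      /* Re-check after announcing ourselves; back off if a fork raced.  */
      if (!atomic_load (&fork_pending))
        return;
      atomic_fetch_sub (&once_active, 1);
    }
}

/* Called after init_routine has finished or was cancelled.  */
static void
once_exit (void)
{
  int previous = atomic_fetch_sub (&once_active, 1);
  assert (previous > 0);
  (void) previous;
}

/* Called by fork before the actual fork: wait for active readers.  */
static void
fork_begin (void)
{
  atomic_store (&fork_pending, true);
  while (atomic_load (&once_active) != 0)
    sched_yield ();
}

/* Called by fork in the parent once the child has been created.  */
static void
fork_end (void)
{
  atomic_store (&fork_pending, false);
}
```

Because the reader increments the counter before re-checking the flag, and
the writer sets the flag before checking the counter, at least one side
always notices the other.  The cost is extra shared-memory traffic on
every pthread_once call that has to run the initializer.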

> Either way I think an assert on overflow in fork.c is needed, but that's
> another fix that I expect you to submit after this one. Note that the
> implementation is in: nptl/sysdeps/unix/sysv/linux/fork.c, and the
> limit of 2^30 forks only applies to applications linked against libpthread
> which provides a strong definition of fork that overrides libc's weak
> definition (which does a lot less). In the dynamic case libpthread's
> version of fork is used because it is loaded first since it depends on libc
> (remember that weak/strong are not applied to dynamic libraries per ELF
> rules).

I'd hold off on this until we have consensus on what precisely to do (see
above).

> >>> +	    return 0;
> >>> +
> >>> +	  oldval = val;
> >>> +	  /* We try to set the state to in-progress and having the current
> >>> +	     fork generation.  We don't need atomic accesses for the fork
> >>> +	     generation because it's immutable in a particular process, and
> >>> +	     forked child processes start with a single thread that modified
> >>> +	     the generation.  */
> >>> +	  newval = __fork_generation | 1;
> >>
> >> OT: I wonder if Valgrind will report a benign race in accessing __fork_generation.
> > 
> > Perhaps.  I believe that eventually, lots of this and similar variables
> > should be atomic-typed and/or accessed with relaxed-memory-order atomic
> > loads.  This would clarify that we expect concurrent accesses and that
> > they don't constitute a data race.
> 
> I don't see how Valgrind would know this from the binary itself, but
> I guess this will just need to have per-glibc-version exceptions for
> Valgrind.

If Valgrind wants to be useful for any program that uses atomics to
synchronize (in contrast to just using pthread mutexes, for example), it
will have to somehow become aware of language-level synchronization
(e.g., atomics).  IOW, I don't think this is a glibc-specific problem
(which is what would justify per-glibc-version exceptions).
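
As an illustration of the atomic-typed, relaxed-memory-order accesses
mentioned in the quoted text, a C11 version might look like the sketch
below.  The variable and function names are hypothetical, and the
increment of 4 per fork is an assumption; glibc's own atomic.h macros
would be used in the real code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical atomic-typed counterpart of __fork_generation.  */
static _Atomic unsigned long sketch_fork_generation;

/* Relaxed load: documents that concurrent accesses are expected, without
   asking for any ordering with respect to other memory operations.  */
static unsigned long
load_fork_generation (void)
{
  return atomic_load_explicit (&sketch_fork_generation,
                               memory_order_relaxed);
}

/* Meant to run in the forked child while it is still single-threaded.  */
static void
advance_fork_generation (void)
{
  unsigned long generation = load_fork_generation ();
  assert ((generation & 3ul) == 0);
  atomic_store_explicit (&sketch_fork_generation, generation + 4ul,
                         memory_order_relaxed);
}
```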

> Your ChangeLog still needs to follow the normal format, including
> header line with date and name, blank line, and tab before text on
> lines thereafter.

Yeah, that was just a formatting issue when pasting to email.

> Don't forget to update NEWS

I don't think #15215 is fixed yet.  (Hint: parts c) and d) need
review :)
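
For readers who want to experiment with the algorithm outside of glibc,
the three-state scheme in the patch below can be approximated with C11
atomics.  This is a sketch under stated assumptions, not the committed
code: waiters spin with sched_yield instead of sleeping on a futex, there
is no cancellation cleanup handler, and sketch_once_t,
once_fork_generation, sketch_once, sketch_init and sketch_init_calls are
hypothetical stand-ins for the glibc internals.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

typedef atomic_int sketch_once_t;

/* Assumed to be a multiple of 4 and constant within one process.  */
static unsigned int once_fork_generation;

static int
sketch_once (sketch_once_t *once_control, void (*init_routine) (void))
{
  assert ((once_fork_generation & 3u) == 0);
  for (;;)
    {
      int newval = (int) (once_fork_generation | 1u);
      /* Acquire: if we observe "done", we also see the initialized data.  */
      int oldval = atomic_load_explicit (once_control, memory_order_acquire);

      /* State "initialization finished".  */
      if ((oldval & 2) != 0)
        return 0;

      /* In progress in the current generation: another thread is running
         the initializer, so wait and look again.  */
      if (oldval == newval)
        {
          sched_yield ();
          continue;
        }

      /* Either uninitialized, or in progress in an older generation (the
         initializing thread was lost in a fork).  Try to claim it.  */
      if (!atomic_compare_exchange_strong_explicit (once_control, &oldval,
                                                    newval,
                                                    memory_order_acquire,
                                                    memory_order_acquire))
        continue;

      init_routine ();

      /* Release: publish the initialized data together with "done".  */
      atomic_store_explicit (once_control, 2, memory_order_release);
      return 0;
    }
}

/* Example initializer that counts how often it ran.  */
static int sketch_init_calls;

static void
sketch_init (void)
{
  ++sketch_init_calls;
}
```

Calling sketch_once twice on a zero-initialized sketch_once_t runs
sketch_init once.  A control word holding an in-progress value from an
older generation (for example 5 while once_fork_generation is 8) is
reclaimed and initialized again, which mirrors the fork handling described
in the patch's comment.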
  

Patch

diff --git a/nptl/sysdeps/unix/sysv/linux/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
new file mode 100644
index 0000000..8453d2d
--- /dev/null
+++ b/nptl/sysdeps/unix/sysv/linux/pthread_once.c
@@ -0,0 +1,131 @@ 
+/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "pthreadP.h"
+#include <lowlevellock.h>
+#include <atomic.h>
+
+
+unsigned long int __fork_generation attribute_hidden;
+
+
+static void
+clear_once_control (void *arg)
+{
+  pthread_once_t *once_control = (pthread_once_t *) arg;
+
+  /* Reset to the uninitialized state here.  We don't need a stronger memory
+     order because we do not need to make any other of our writes visible to
+     other threads that see this value: This function will be called if we
+     get interrupted (see __pthread_once), so all we need to relay to other
+     threads is the state being reset again.  */
+  *once_control = 0;
+  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
+}
+
+
+/* This is similar to a lock implementation, but we distinguish between three
+   states: not yet initialized (0), initialization finished (2), and
+   initialization in progress (__fork_generation | 1).  If in the first state,
+   threads will try to run the initialization by moving to the second state;
+   the first thread to do so via a CAS on once_control runs init_routine,
+   other threads block.
+   When forking the process, some threads can be interrupted during the second
+   state; they won't be present in the forked child, so we need to restart
+   initialization in the child.  To distinguish an in-progress initialization
+   from an interrupted initialization (in which case we need to reclaim the
+   lock), we look at the fork generation that's part of the second state: We
+   can reclaim iff it differs from the current fork generation.
+   XXX: This algorithm has an ABA issue on the fork generation: If an
+   initialization is interrupted, we then fork 2^30 times (30 bits of
+   once_control are used for the fork generation), and try to initialize
+   again, we can deadlock because we can't distinguish the in-progress and
+   interrupted cases anymore.  */
+int
+__pthread_once (once_control, init_routine)
+     pthread_once_t *once_control;
+     void (*init_routine) (void);
+{
+  while (1)
+    {
+      int oldval, val, newval;
+
+      /* We need acquire memory order for this load because if the value
+         signals that initialization has finished, we need to see any
+         data modifications done during initialization.  */
+      val = *once_control;
+      atomic_read_barrier();
+      do
+	{
+	  /* Check if the initialization has already been done.  */
+	  if (__glibc_likely ((val & 2) != 0))
+	    return 0;
+
+	  oldval = val;
+	  /* We try to set the state to in-progress and having the current
+	     fork generation.  We don't need atomic accesses for the fork
+	     generation because it's immutable in a particular process, and
+	     forked child processes start with a single thread that modified
+	     the generation.  */
+	  newval = __fork_generation | 1;
+	  /* We need acquire memory order here for the same reason as for the
+	     load from once_control above.  */
+	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
+						     oldval);
+	}
+      while (__glibc_unlikely (val != oldval));
+
+      /* Check if another thread already runs the initializer.	*/
+      if ((oldval & 1) != 0)
+	{
+	  /* Check whether the initializer execution was interrupted by a
+	     fork. We know that for both values, bit 0 is set and bit 1 is
+	     not.  */
+	  if (oldval == newval)
+	    {
+	      /* Same generation, some other thread was faster. Wait.  */
+	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
+	      continue;
+	    }
+	}
+
+      /* This thread is the first here.  Do the initialization.
+	 Register a cleanup handler so that in case the thread gets
+	 interrupted the initialization can be restarted.  */
+      pthread_cleanup_push (clear_once_control, once_control);
+
+      init_routine ();
+
+      pthread_cleanup_pop (0);
+
+
+      /* Mark *once_control as having finished the initialization.  We need
+         release memory order here because we need to synchronize with other
+         threads that want to use the initialized data.  */
+      atomic_write_barrier();
+      *once_control = 2;
+
+      /* Wake up all other threads.  */
+      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
+      break;
+    }
+
+  return 0;
+}
+weak_alias (__pthread_once, pthread_once)
+hidden_def (__pthread_once)
diff --git a/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c b/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
deleted file mode 100644
index a231e55..0000000
--- a/nptl/sysdeps/unix/sysv/linux/sparc/pthread_once.c
+++ /dev/null
@@ -1,93 +0,0 @@ 
-/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-
-unsigned long int __fork_generation attribute_hidden;
-
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-
-int
-__pthread_once (once_control, init_routine)
-     pthread_once_t *once_control;
-     void (*init_routine) (void);
-{
-  while (1)
-    {
-      int oldval, val, newval;
-
-      val = *once_control;
-      do
-	{
-	  /* Check if the initialized has already been done.  */
-	  if ((val & 2) != 0)
-	    return 0;
-
-	  oldval = val;
-	  newval = (oldval & 3) | __fork_generation | 1;
-	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
-						     oldval);
-	}
-      while (__builtin_expect (val != oldval, 0));
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) != 0)
-	{
-	  /* Check whether the initializer execution was interrupted
-	     by a fork.	 */
-	  if (((oldval ^ newval) & -4) == 0)
-	    {
-	      /* Same generation, some other thread was faster. Wait.  */
-	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
-	      continue;
-	    }
-	}
-
-      /* This thread is the first here.  Do the initialization.
-	 Register a cleanup handler so that in case the thread gets
-	 interrupted the initialization can be restarted.  */
-      pthread_cleanup_push (clear_once_control, once_control);
-
-      init_routine ();
-
-      pthread_cleanup_pop (0);
-
-
-      /* Add one to *once_control.  */
-      atomic_increment (once_control);
-
-      /* Wake up all other threads.  */
-      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-      break;
-    }
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/ports/sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c b/ports/sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c
deleted file mode 100644
index ee6b496..0000000
--- a/ports/sysdeps/unix/sysv/linux/hppa/nptl/pthread_once.c
+++ /dev/null
@@ -1,93 +0,0 @@ 
-/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-
-unsigned long int __fork_generation attribute_hidden;
-
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_private_futex_wake (once_control, INT_MAX);
-}
-
-
-int
-__pthread_once (once_control, init_routine)
-     pthread_once_t *once_control;
-     void (*init_routine) (void);
-{
-  while (1)
-    {
-      int oldval, val, newval;
-
-      val = *once_control;
-      do
-	{
-	  /* Check if the initialized has already been done.  */
-	  if ((val & 2) != 0)
-	    return 0;
-
-	  oldval = val;
-	  newval = (oldval & 3) | __fork_generation | 1;
-	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
-						     oldval);
-	}
-      while (__builtin_expect (val != oldval, 0));
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) != 0)
-	{
-	  /* Check whether the initializer execution was interrupted
-	     by a fork.	 */
-	  if (((oldval ^ newval) & -4) == 0)
-	    {
-	      /* Same generation, some other thread was faster. Wait.  */
-	      lll_private_futex_wait (once_control, newval);
-	      continue;
-	    }
-	}
-
-      /* This thread is the first here.  Do the initialization.
-	 Register a cleanup handler so that in case the thread gets
-	 interrupted the initialization can be restarted.  */
-      pthread_cleanup_push (clear_once_control, once_control);
-
-      init_routine ();
-
-      pthread_cleanup_pop (0);
-
-
-      /* Add one to *once_control.  */
-      atomic_increment (once_control);
-
-      /* Wake up all other threads.  */
-      lll_private_futex_wake (once_control, INT_MAX);
-      break;
-    }
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c b/sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c
deleted file mode 100644
index d1b28ff..0000000
--- a/sysdeps/unix/sysv/linux/aarch64/nptl/pthread_once.c
+++ /dev/null
@@ -1,90 +0,0 @@ 
-/* Copyright (C) 2004-2014 Free Software Foundation, Inc.
-
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public License as
-   published by the Free Software Foundation; either version 2.1 of the
-   License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-unsigned long int __fork_generation attribute_hidden;
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-int
-__pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
-{
-  for (;;)
-    {
-      int oldval;
-      int newval;
-
-      /* Pseudo code:
-	 newval = __fork_generation | 1;
-	 oldval = *once_control;
-	 if ((oldval & 2) == 0)
-	   *once_control = newval;
-	 Do this atomically.
-      */
-      do
-	{
-	  newval = __fork_generation | 1;
-	  oldval = *once_control;
-	  if (oldval & 2)
-	    break;
-	} while (atomic_compare_and_exchange_val_acq (once_control, newval, oldval) != oldval);
-
-      /* Check if the initializer has already been done.  */
-      if ((oldval & 2) != 0)
-	return 0;
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) == 0)
-	break;
-
-      /* Check whether the initializer execution was interrupted by a fork.  */
-      if (oldval != newval)
-	break;
-
-      /* Same generation, some other thread was faster. Wait.  */
-      lll_futex_wait (once_control, oldval, LLL_PRIVATE);
-    }
-
-  /* This thread is the first here.  Do the initialization.
-     Register a cleanup handler so that in case the thread gets
-     interrupted the initialization can be restarted.  */
-  pthread_cleanup_push (clear_once_control, once_control);
-
-  init_routine ();
-
-  pthread_cleanup_pop (0);
-
-  /* Say that the initialisation is done.  */
-  *once_control = __fork_generation | 2;
-
-  /* Wake up all other threads.  */
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c b/sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c
deleted file mode 100644
index a063149..0000000
--- a/sysdeps/unix/sysv/linux/arm/nptl/pthread_once.c
+++ /dev/null
@@ -1,89 +0,0 @@ 
-/* Copyright (C) 2004-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-unsigned long int __fork_generation attribute_hidden;
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-int
-__pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
-{
-  for (;;)
-    {
-      int oldval;
-      int newval;
-
-      /* Pseudo code:
-	 newval = __fork_generation | 1;
-	 oldval = *once_control;
-	 if ((oldval & 2) == 0)
-	   *once_control = newval;
-	 Do this atomically.
-      */
-      do
-	{
-	  newval = __fork_generation | 1;
-	  oldval = *once_control;
-	  if (oldval & 2)
-	    break;
-	} while (atomic_compare_and_exchange_val_acq (once_control, newval, oldval) != oldval);
-
-      /* Check if the initializer has already been done.  */
-      if ((oldval & 2) != 0)
-	return 0;
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) == 0)
-	break;
-
-      /* Check whether the initializer execution was interrupted by a fork.  */
-      if (oldval != newval)
-	break;
-
-      /* Same generation, some other thread was faster. Wait.  */
-      lll_futex_wait (once_control, oldval, LLL_PRIVATE);
-    }
-
-  /* This thread is the first here.  Do the initialization.
-     Register a cleanup handler so that in case the thread gets
-     interrupted the initialization can be restarted.  */
-  pthread_cleanup_push (clear_once_control, once_control);
-
-  init_routine ();
-
-  pthread_cleanup_pop (0);
-
-  /* Say that the initialisation is done.  */
-  *once_control = __fork_generation | 2;
-
-  /* Wake up all other threads.  */
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c b/sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c
deleted file mode 100644
index a231e55..0000000
--- a/sysdeps/unix/sysv/linux/ia64/nptl/pthread_once.c
+++ /dev/null
@@ -1,93 +0,0 @@ 
-/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-
-unsigned long int __fork_generation attribute_hidden;
-
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-
-int
-__pthread_once (once_control, init_routine)
-     pthread_once_t *once_control;
-     void (*init_routine) (void);
-{
-  while (1)
-    {
-      int oldval, val, newval;
-
-      val = *once_control;
-      do
-	{
-	  /* Check if the initialized has already been done.  */
-	  if ((val & 2) != 0)
-	    return 0;
-
-	  oldval = val;
-	  newval = (oldval & 3) | __fork_generation | 1;
-	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
-						     oldval);
-	}
-      while (__builtin_expect (val != oldval, 0));
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) != 0)
-	{
-	  /* Check whether the initializer execution was interrupted
-	     by a fork.	 */
-	  if (((oldval ^ newval) & -4) == 0)
-	    {
-	      /* Same generation, some other thread was faster. Wait.  */
-	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
-	      continue;
-	    }
-	}
-
-      /* This thread is the first here.  Do the initialization.
-	 Register a cleanup handler so that in case the thread gets
-	 interrupted the initialization can be restarted.  */
-      pthread_cleanup_push (clear_once_control, once_control);
-
-      init_routine ();
-
-      pthread_cleanup_pop (0);
-
-
-      /* Add one to *once_control.  */
-      atomic_increment (once_control);
-
-      /* Wake up all other threads.  */
-      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-      break;
-    }
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c b/sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c
deleted file mode 100644
index 01542e9..0000000
--- a/sysdeps/unix/sysv/linux/m68k/nptl/pthread_once.c
+++ /dev/null
@@ -1,90 +0,0 @@ 
-/* Copyright (C) 2010-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Maxim Kuvyrkov <maxim@codesourcery.com>, 2010.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-unsigned long int __fork_generation attribute_hidden;
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-int
-__pthread_once (pthread_once_t *once_control, void (*init_routine) (void))
-{
-  for (;;)
-    {
-      int oldval;
-      int newval;
-
-      /* Pseudo code:
-	 newval = __fork_generation | 1;
-	 oldval = *once_control;
-	 if ((oldval & 2) == 0)
-	   *once_control = newval;
-	 Do this atomically.
-      */
-      do
-	{
-	  newval = __fork_generation | 1;
-	  oldval = *once_control;
-	  if (oldval & 2)
-	    break;
-	} while (atomic_compare_and_exchange_val_acq (once_control, newval, oldval) != oldval);
-
-      /* Check if the initializer has already been done.  */
-      if ((oldval & 2) != 0)
-	return 0;
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) == 0)
-	break;
-
-      /* Check whether the initializer execution was interrupted by a fork.  */
-      if (oldval != newval)
-	break;
-
-      /* Same generation, some other thread was faster. Wait.  */
-      lll_futex_wait (once_control, oldval, LLL_PRIVATE);
-    }
-
-  /* This thread is the first here.  Do the initialization.
-     Register a cleanup handler so that in case the thread gets
-     interrupted the initialization can be restarted.  */
-  pthread_cleanup_push (clear_once_control, once_control);
-
-  init_routine ();
-
-  pthread_cleanup_pop (0);
-
-  /* Say that the initialisation is done.  */
-  *once_control = __fork_generation | 2;
-
-  /* Wake up all other threads.  */
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c b/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c
deleted file mode 100644
index 3e3430d..0000000
--- a/sysdeps/unix/sysv/linux/mips/nptl/pthread_once.c
+++ /dev/null
@@ -1,93 +0,0 @@ 
-/* Copyright (C) 2003-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include "pthreadP.h"
-#include <lowlevellock.h>
-
-
-unsigned long int __fork_generation attribute_hidden;
-
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-
-int
-__pthread_once (once_control, init_routine)
-     pthread_once_t *once_control;
-     void (*init_routine) (void);
-{
-  while (1)
-    {
-      int oldval, val, newval;
-
-      val = *once_control;
-      do
-	{
-	  /* Check if the initialized has already been done.  */
-	  if ((val & 2) != 0)
-	    return 0;
-
-	  oldval = val;
-	  newval = (oldval & 3) | __fork_generation | 1;
-	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
-						     oldval);
-	}
-      while (__builtin_expect (val != oldval, 0));
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) != 0)
-	{
-	  /* Check whether the initializer execution was interrupted
-	     by a fork.	 */
-	  if (((oldval ^ newval) & -4) == 0)
-	    {
-	      /* Same generation, some other thread was faster. Wait.  */
-	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
-	      continue;
-	    }
-	}
-
-      /* This thread is the first here.  Do the initialization.
-	 Register a cleanup handler so that in case the thread gets
-	 interrupted the initialization can be restarted.  */
-      pthread_cleanup_push (clear_once_control, once_control);
-
-      init_routine ();
-
-      pthread_cleanup_pop (0);
-
-
-      /* Add one to *once_control.  */
-      atomic_increment (once_control);
-
-      /* Wake up all other threads.  */
-      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-      break;
-    }
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
diff --git a/sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c b/sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c
deleted file mode 100644
index 1b38999..0000000
--- a/sysdeps/unix/sysv/linux/tile/nptl/pthread_once.c
+++ /dev/null
@@ -1,94 +0,0 @@ 
-/* Copyright (C) 2011-2014 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Chris Metcalf <cmetcalf@tilera.com>, 2011.
-   Based on work contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library.  If not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <nptl/pthreadP.h>
-#include <lowlevellock.h>
-
-
-unsigned long int __fork_generation attribute_hidden;
-
-
-static void
-clear_once_control (void *arg)
-{
-  pthread_once_t *once_control = (pthread_once_t *) arg;
-
-  *once_control = 0;
-  lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-}
-
-
-int
-__pthread_once (once_control, init_routine)
-     pthread_once_t *once_control;
-     void (*init_routine) (void);
-{
-  while (1)
-    {
-      int oldval, val, newval;
-
-      val = *once_control;
-      do
-	{
-	  /* Check if the initialized has already been done.  */
-	  if ((val & 2) != 0)
-	    return 0;
-
-	  oldval = val;
-	  newval = (oldval & 3) | __fork_generation | 1;
-	  val = atomic_compare_and_exchange_val_acq (once_control, newval,
-						     oldval);
-	}
-      while (__builtin_expect (val != oldval, 0));
-
-      /* Check if another thread already runs the initializer.	*/
-      if ((oldval & 1) != 0)
-	{
-	  /* Check whether the initializer execution was interrupted
-	     by a fork.	 */
-	  if (((oldval ^ newval) & -4) == 0)
-	    {
-	      /* Same generation, some other thread was faster. Wait.  */
-	      lll_futex_wait (once_control, newval, LLL_PRIVATE);
-	      continue;
-	    }
-	}
-
-      /* This thread is the first here.  Do the initialization.
-	 Register a cleanup handler so that in case the thread gets
-	 interrupted the initialization can be restarted.  */
-      pthread_cleanup_push (clear_once_control, once_control);
-
-      init_routine ();
-
-      pthread_cleanup_pop (0);
-
-
-      /* Add one to *once_control.  */
-      atomic_increment (once_control);
-
-      /* Wake up all other threads.  */
-      lll_futex_wake (once_control, INT_MAX, LLL_PRIVATE);
-      break;
-    }
-
-  return 0;
-}
-weak_alias (__pthread_once, pthread_once)
-hidden_def (__pthread_once)
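
Reviewer note (not part of the patch): the mips and tile files removed above share one state machine. Bit 0 of *once_control means "initialization in progress", bit 1 means "done", and the remaining bits hold the fork generation. The following is a minimal sketch of that logic, assuming C11 atomics and a sched_yield() polling loop in place of lll_futex_wait/lll_futex_wake, and leaving out the cancellation cleanup handler. The names sketch_once_t, sketch_fork_generation, sketch_once, sketch_calls and sketch_count_init are hypothetical and do not exist in glibc.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for pthread_once_t and __fork_generation.
   The generation is assumed to advance in steps of 4 so that it never
   overlaps the two low state bits.  */
typedef atomic_int sketch_once_t;
static unsigned int sketch_fork_generation = 0;

/* Illustration only: a counter and an init routine that bumps it.  */
static int sketch_calls = 0;
static void sketch_count_init (void) { sketch_calls++; }

static int
sketch_once (sketch_once_t *once_control, void (*init_routine) (void))
{
  for (;;)
    {
      int oldval, newval;
      int val = atomic_load_explicit (once_control, memory_order_acquire);

      do
        {
          /* Bit 1 set: initialization has already completed.  */
          if ((val & 2) != 0)
            return 0;

          oldval = val;
          newval = (oldval & 3) | (int) sketch_fork_generation | 1;
          /* On failure, val is reloaded with the current value.  */
        }
      while (!atomic_compare_exchange_strong_explicit
               (once_control, &val, newval,
                memory_order_acquire, memory_order_acquire));

      /* Bit 0 was already set by a thread of the same generation:
         wait for the value to change, then re-examine it.  The removed
         code blocks in lll_futex_wait here; this sketch just polls.  */
      if ((oldval & 1) != 0 && ((oldval ^ newval) & -4) == 0)
        {
          while (atomic_load_explicit (once_control, memory_order_acquire)
                 == newval)
            sched_yield ();
          continue;
        }

      /* Either nobody had started, or the in-progress marker belongs to
         an older fork generation whose initializer thread is gone.  */
      init_routine ();

      /* Adding one turns "in progress" (bit 0) into "done" (bit 1).  */
      atomic_fetch_add_explicit (once_control, 1, memory_order_release);
      return 0;
    }
}
```

In this sketch, a control word that still reads 1 (in progress at generation 0) after sketch_fork_generation has moved to 4 is treated as abandoned, so the caller reruns the initializer and leaves the word at 6. That is the fork-restart case the generation bits exist for.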