Fwd: [PATCH] NUMA spinlock [BZ #23962]

Message ID

CAMe9rOrU1niqVofiFvpgvMNrUohu0yW--OBTHh3TDC-3fnG51Q@mail.gmail.com

State

Dropped

Headers

Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-alpha-owner@sourceware.org
MIME-Version: 1.0
References: <20181226025019.38752-1-ling.ma@MacBook-Pro-8.local>
	<7D8A82D6-6F0A-4860-856A-EB0C8CD13E9C@antfin.com>
	<0a474516-b8c8-48cf-aeea-e57c77b78cbd.ling.ml@antfin.com>
	<c7f11fea-371e-4453-b5d0-9b142632aecc.ling.ml@antfin.com>
	<8c67f319-31bf-818b-4a89-66d25328026e@arm.com>
In-Reply-To: <8c67f319-31bf-818b-4a89-66d25328026e@arm.com>
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Thu, 3 Jan 2019 11:58:47 -0800
Message-ID: <CAMe9rOrU1niqVofiFvpgvMNrUohu0yW--OBTHh3TDC-3fnG51Q@mail.gmail.com>
Subject: =?UTF-8?B?UmU6IOi9rOWPke+8mltQQVRDSF0gTlVNQSBzcGlubG9jayBbQlogIzIzOTYyXQ==?=
To: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Cc: =?UTF-8?B?6ams5YeMKOW9puWGmyk=?= <ling.ml@antfin.com>, 
	libc-alpha <libc-alpha@sourceware.org>, "Xiao,
	Wei3" <wei3.xiao@intel.com>, nd <nd@arm.com>, 
	"ling.ma.program" <ling.ma.program@gmail.com>
Content-Type: multipart/mixed; boundary="000000000000f47577057e933254"

Commit Message

H.J. Lu Jan. 3, 2019, 7:58 p.m. UTC

  On Thu, Jan 3, 2019 at 6:52 AM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
>
> On 03/01/2019 05:35, 马凌(彦军) wrote:
> >      create mode 100644 manual/examples/numa-spinlock.c
> >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
> >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
> >      create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
> >      create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
> >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-numa-variable-overhead.c
> >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
> >      create mode 100644 sysdeps/unix/sysv/linux/x86/tst-variable-overhead.c
>
> as far as i can tell the new code is generic
> (other than the presence of efficient getcpu),
> so i think the test should be generic too.
>
> >     --- /dev/null
> >     +++ b/sysdeps/unix/sysv/linux/x86/tst-variable-overhead-skeleton.c
> >     @@ -0,0 +1,384 @@
> ...
> >     +/* Check spinlock overhead with a large number of threads.  Critical
> >     +   region is very small.  Critical region + spinlock overhead aren't
> >     +   noticeable when the number of threads is small.  When the thread
> >     +   number increases, spinlock overhead becomes the bottleneck.  It
> >     +   shows up in wall time of thread execution.  */
>
> yeah, this is not easy to do in a generic way, i think
> even on x86 such measurement is problematic, you don't
> know what goes on a system (or vm) when the glibc test
> is running.
>
> but doing precise timing is not that important for
> checking the correctness of the locks, so i think a
> simplified version can be generic test code.

Here is the updated patch to make tests generic.

Comments

Carlos O'Donell Jan. 5, 2019, 12:34 p.m. UTC | #1

On 1/3/19 2:58 PM, H.J. Lu wrote:
> +libpthread {
> +  GLIBC_2.29 {
> +    numa_spinlock_alloc;
> +    numa_spinlock_free;
> +    numa_spinlock_init;
> +    numa_spinlock_apply;
> +  }
> +}

Why are we adding these non-standard interfaces to glibc?

The API implementation doesn't rely on any special glibc internals.

It could be implemented as a distinct library, allowed to evolve quickly
in response to customer need, and eventually integrated into glibc if the
API proves stable. A similar model has been setup by Boost and C++ just to
draw some parallels.

I'm not happy to see new APIs like this go directly into glibc without
much more discussion about *why* they have to be in glibc initially.

Just to be clear I have a sustained objection to this new set of APIs
being added to glibc until I can be convinced that they have to go in
glibc.

H.J. Lu Jan. 5, 2019, 4:35 p.m. UTC | #2

On Sat, Jan 5, 2019 at 4:34 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 1/3/19 2:58 PM, H.J. Lu wrote:
> > +libpthread {
> > +  GLIBC_2.29 {
> > +    numa_spinlock_alloc;
> > +    numa_spinlock_free;
> > +    numa_spinlock_init;
> > +    numa_spinlock_apply;
> > +  }
> > +}
>
> Why are we adding these non-standard interfaces to glibc?
>
> The API implementation doesn't rely on any special glibc internals.
>
> It could be implemented as a distinct library, allowed to evolve quickly
> in response to customer need, and eventually integrated into glibc if the
> API proves stable. A similar model has been setup by Boost and C++ just to
> draw some parallels.
>
> I'm not happy to see new APIs like this go directly into glibc without
> much more discussion about *why* they have to be in glibc initially.
>
> Just to be clear I have a sustained objection to this new set of APIs
> being added to glibc until I can be convinced that they have to go in
> glibc.
>

Should glibc have scalable spinlock, in libc.so or a separate shared object?
Or should we tell people that if they want scalable spinlock, they look
elsewhere?

Florian Weimer Jan. 7, 2019, 7:12 p.m. UTC | #3

* H. J. Lu:

> Should glibc have scalable spinlock, in libc.so or a separate shared object?
> Or should we tell people that if they want scalable spinlock, they look
> elsewhere?

I think non-polymorphic, small lock types with scoped locking could make
sense for glibc.

A lock specific to a certain machine architecture seems strange.  We
currently lack any of the kernel NUMA interfaces in glibc, which makes
this stand out even more.

Thanks,
Florian

H.J. Lu Jan. 7, 2019, 7:48 p.m. UTC | #4

On Mon, Jan 7, 2019 at 11:12 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > Should glibc have scalable spinlock, in libc.so or a separate shared object?
> > Or should we tell people that if they want scalable spinlock, they look
> > elsewhere?
>
> I think non-polymorphic, small lock types with scoped locking could make
> sense for glibc.
>
> A lock specific to a certain machine architecture seems strange.  We
> currently lack any of the kernel NUMA interfaces in glibc, which makes
> this stand out even more.
>

And this lack of support doesn't make the problems go away.

Carlos O'Donell Jan. 10, 2019, 4:31 p.m. UTC | #5

On 1/7/19 2:48 PM, H.J. Lu wrote:
> On Mon, Jan 7, 2019 at 11:12 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * H. J. Lu:
>>
>>> Should glibc have scalable spinlock, in libc.so or a separate shared object?
>>> Or should we tell people that if they want scalable spinlock, they look
>>> elsewhere?
>>
>> I think non-polymorphic, small lock types with scoped locking could make
>> sense for glibc.
>>
>> A lock specific to a certain machine architecture seems strange.  We
>> currently lack any of the kernel NUMA interfaces in glibc, which makes
>> this stand out even more.
>>
> 
> And this lack of support doesn't make the problems go away.

No. But it is a strong indicator that the solution space hasn't been
explored thoroughly enough for us to provide a long-term stable interface
that will remain useful.

My opinion is that for the health and evolution of a NUMA-aware spinlock
and MCS lock, that we should create a distinct project and library that
should have those locks, and then work to put them into downstream
distributions. This will support key users being able to use supported
versions of those libraries, and give the needed feedback about the API
and the performance. It may take 1-2 years to get that feedback and every
piece of feedback will improve the final API/ABI we put into glibc or
even into the next ISO C standard as part of the C thread interface.

My objection to the NUMA-aware spinlock API is because I feel we are doing
a disservice to the work by formalizing it and freezing it as part of the
ABI/API that glibc is using.

In fact this NUMA-aware discussion touches on a deeply complex issue,
which is: How do we create/design, and evolve interfaces that we want
to one-day have in stable glibc? But this is a discussion for another
thread. Roland once said he wished we had put every function into its
own library ;-)

Does this explain in more detail why I don't think it's a good idea to
put these interfaces into glibc?

Florian Weimer Jan. 10, 2019, 4:32 p.m. UTC | #6

* Carlos O'Donell:

> My opinion is that for the health and evolution of a NUMA-aware spinlock
> and MCS lock, that we should create a distinct project and library that
> should have those locks, and then work to put them into downstream
> distributions. This will support key users being able to use supported
> versions of those libraries, and give the needed feedback about the API
> and the performance. It may take 1-2 years to get that feedback and every
> piece of feedback will improve the final API/ABI we put into glibc or
> even into the next ISO C standard as part of the C thread interface.

I think it's something that could land in tbb, for which many
distributions already have mechanisms to ship updated versions after a
release.

Thanks,
Florian

Carlos O'Donell Jan. 10, 2019, 4:41 p.m. UTC | #7

On 1/10/19 11:32 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>> and MCS lock, that we should create a distinct project and library that
>> should have those locks, and then work to put them into downstream
>> distributions. This will support key users being able to use supported
>> versions of those libraries, and give the needed feedback about the API
>> and the performance. It may take 1-2 years to get that feedback and every
>> piece of feedback will improve the final API/ABI we put into glibc or
>> even into the next ISO C standard as part of the C thread interface.
> 
> I think it's something that could land in tbb, for which many
> distributions already have mechanisms to ship updated versions after a
> release.

Absolutely. That's a great idea.

Szabolcs Nagy Jan. 10, 2019, 5:52 p.m. UTC | #8

On 10/01/2019 16:41, Carlos O'Donell wrote:
> On 1/10/19 11:32 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>>
>>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>>> and MCS lock, that we should create a distinct project and library that
>>> should have those locks, and then work to put them into downstream
>>> distributions. This will support key users being able to use supported
>>> versions of those libraries, and give the needed feedback about the API
>>> and the performance. It may take 1-2 years to get that feedback and every
>>> piece of feedback will improve the final API/ABI we put into glibc or
>>> even into the next ISO C standard as part of the C thread interface.
>>
>> I think it's something that could land in tbb, for which many
>> distributions already have mechanisms to ship updated versions after a
>> release.
>
> Absolutely. That's a great idea.
>

in principle the pthread_spin_lock api can use this algorithm
assuming we can keep the pthread_spinlock_t abi and keep the
POSIX semantics. (presumably users ran into issues with the
existing posix api.. or how did this come up in the first place?)

Carlos O'Donell Jan. 10, 2019, 7:24 p.m. UTC | #9

On 1/10/19 12:52 PM, Szabolcs Nagy wrote:
> On 10/01/2019 16:41, Carlos O'Donell wrote:
>> On 1/10/19 11:32 AM, Florian Weimer wrote:
>>> * Carlos O'Donell:
>>>
>>>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>>>> and MCS lock, that we should create a distinct project and library that
>>>> should have those locks, and then work to put them into downstream
>>>> distributions. This will support key users being able to use supported
>>>> versions of those libraries, and give the needed feedback about the API
>>>> and the performance. It may take 1-2 years to get that feedback and every
>>>> piece of feedback will improve the final API/ABI we put into glibc or
>>>> even into the next ISO C standard as part of the C thread interface.
>>>
>>> I think it's something that could land in tbb, for which many
>>> distributions already have mechanisms to ship updated versions after a
>>> release.
>>
>> Absolutely. That's a great idea.
>>
> 
> in principle the pthread_spin_lock api can use this algorithm
> assuming we can keep the pthread_spinlock_t abi and keep the
> POSIX semantics. (presumably users ran into issues with the
> existing posix api.. or how did this come up in the first place?)
 
Correct, but meeting the ABI contract of the pthread_spinlock_t turns
out to be hard, there isn't much space. I've spoken with Kemi Wang 
(Intel) about this specific issue, and he has some ideas to share,
but I'll leave it for him to describe.
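
For reference, a minimal sketch of the existing POSIX interface under discussion. It only uses the standard pthread_spin_* calls; the helper name spin_demo is made up for this example. On glibc for Linux, pthread_spinlock_t is an int-sized type, which is the space constraint mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: exercises the POSIX spin lock from one thread.
   Returns 0 if the lock behaves as POSIX describes, -1 otherwise.  */
static int spin_demo(void)
{
    pthread_spinlock_t lock;

    if (pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE) != 0)
        return -1;

    pthread_spin_lock(&lock);
    /* The lock is already held, so trylock must fail with EBUSY.  */
    int busy = pthread_spin_trylock(&lock);
    pthread_spin_unlock(&lock);

    pthread_spin_destroy(&lock);
    return busy == EBUSY ? 0 : -1;
}
```

Any queue-based algorithm placed behind this interface would have to keep both the size of pthread_spinlock_t and these call semantics unchanged.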

Kemi Wang Jan. 11, 2019, 11:56 a.m. UTC | #10

On 2019/1/11 3:24 AM, Carlos O'Donell wrote:
> On 1/10/19 12:52 PM, Szabolcs Nagy wrote:
>> On 10/01/2019 16:41, Carlos O'Donell wrote:
>>> On 1/10/19 11:32 AM, Florian Weimer wrote:
>>>> * Carlos O'Donell:
>>>>
>>>>> My opinion is that for the health and evolution of a NUMA-aware spinlock
>>>>> and MCS lock, that we should create a distinct project and library that
>>>>> should have those locks, and then work to put them into downstream
>>>>> distributions. This will support key users being able to use supported
>>>>> versions of those libraries, and give the needed feedback about the API
>>>>> and the performance. It may take 1-2 years to get that feedback and every
>>>>> piece of feedback will improve the final API/ABI we put into glibc or
>>>>> even into the next ISO C standard as part of the C thread interface.
>>>>
>>>> I think it's something that could land in tbb, for which many
>>>> distributions already have mechanisms to ship updated versions after a
>>>> release.
>>>
>>> Absolutely. That's a great idea.
>>>
>>
>> in principle the pthread_spin_lock api can use this algorithm
>> assuming we can keep the pthread_spinlock_t abi and keep the
>> POSIX semantics. (presumably users ran into issues with the
>> existing posix api.. or how did this come up in the first place?)
>  
> Correct, but meeting the ABI contract of the pthread_spinlock_t turns
> out to be hard, there isn't much space. I've spoken with Kemi Wang 
> (Intel) about this specific issue, and he has some ideas to share,
> but I'll leave it for him to describe.
> 

It may be possible, because we can make better use of the space in pthread_spinlock_t.

The MCS lock is a well-known method to reduce spinlock overhead by queuing spinners: the spinlock
cache line is contended only between the lock holder and one active spinner, while the other spinners
spin on a locally accessible flag until the previous spinner passes the MCS lock down to them.

Usually, a classical MCS implementation requires an extra pointer *mcs_lock* to track the tail of the queue.
When a new spinner is added to the queue, we first get the current tail of the queue, and then move the mcs_lock
pointer to point to this new spinner (the new tail of the queue).
If we can squeeze some space out of pthread_spinlock_t to store this tail info, and update it
when a new spinner is added to the queue, then the MCS algorithm can be reimplemented without breaking the ABI.
That's possible because the *lock* itself doesn't have to occupy 32 bits (8 bits or even one bit should be enough).
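
For readers who have not seen it, the classical MCS lock described above can be sketched with C11 atomics as follows. This is a textbook-style illustration with a full tail pointer, not the code from the patch and not something that fits into pthread_spinlock_t; the names mcs_node, mcs_lock, mcs_acquire and mcs_release are chosen for this example.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Per-waiter queue node; each thread spins only on its own flag.  */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;
};

/* The lock itself is just the tail pointer of the queue.  */
struct mcs_lock {
    _Atomic(struct mcs_node *) tail;
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, 1, memory_order_relaxed);

    /* Become the new tail and learn who was the tail before us.  */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                 /* queue was empty: lock acquired */

    /* Link behind the previous tail, then spin on our own flag.  */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;                       /* local spinning */
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *succ =
        atomic_load_explicit(&me->next, memory_order_acquire);

    if (succ == NULL) {
        /* No visible successor: try to mark the queue empty.  */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_release, memory_order_relaxed))
            return;
        /* A new waiter swapped the tail but has not linked itself yet.  */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Pass the lock down to the next spinner.  */
    atomic_store_explicit(&succ->locked, 0, memory_order_release);
}
```

Each thread passes its own mcs_node (for example one on its stack) to mcs_acquire and the same node to mcs_release. The pointer-sized tail is exactly the part that has to be squeezed into the existing lock word.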

Then the pthread_spinlock_t structure may look like this (similar to qspinlock in the kernel):
struct pthread_spinlock_t
{
   union {
      struct {
         u8 locked;  // lock byte
         u8 reserve;
         u16 cpuid;  // CPU id used by the last spinner; per-cpu infrastructure converts it to
                     // a pointer to the tail of the queue, e.g. per_cpu_var(qnode, cpuid)
      };
      int lock;
   };
};

// one instance per CPU
struct qnode {
    struct qnode *next; // points to the next spinner
    int flag;           // local spinning flag
};
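
To make the space argument concrete, here is a small sketch of how such a 32-bit word could be packed and unpacked with shifts instead of a union. The bit positions, the helper names, and the "CPU id plus one, zero means empty queue" encoding are assumptions of this example, not something defined by glibc.

```c
#include <stdint.h>

/* Hypothetical layout of the 32-bit lock word:
   bits  0-7   lock byte
   bits  8-15  reserved
   bits 16-31  tail: CPU id of the last spinner plus one, 0 = empty queue  */
#define LOCKED_MASK 0x000000ffu
#define TAIL_SHIFT  16

_Static_assert(sizeof(uint32_t) == sizeof(int),
               "the packed word must not be larger than the int lock");

/* Encode the CPU id of the last spinner into the tail field.  */
static inline uint32_t encode_tail(unsigned cpu)
{
    return (uint32_t)(cpu + 1) << TAIL_SHIFT;
}

/* Returns the CPU id of the tail, or -1 if nobody is queued.  */
static inline int decode_tail(uint32_t word)
{
    return (int)(word >> TAIL_SHIFT) - 1;
}

/* Nonzero if the lock byte is set.  */
static inline int is_locked(uint32_t word)
{
    return (word & LOCKED_MASK) != 0;
}
```

With this encoding a lock word of 0 still means "unlocked, nobody queued", and the lock byte and the tail can be changed independently of each other.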

But there are two problems here.
a) Glibc lacks per-cpu infrastructure, so we can't do this cpuid -> per-cpu-variable translation.
b) We can't disable preemption in userland.
   When a new spinner is added to the queue, we need to update the cpuid in pthread_spinlock_t to the new one.
   Pseudo-code:
	newid = get_current_cpuid();
	prev = atomic_exchange_acquire(&cpuid, newid); // update cpuid to the new cpuid and
						       // return the previous one
	tail_node = per_cpu_var(qnode, prev);          // get the last tail node of the queue

The problem arises when preemption happens in the window between get_current_cpuid() and atomic_exchange_acquire().
When the thread is scheduled back in, it may be on another cpu with a different cpuid.
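
The window can be illustrated with a single-threaded replay of one unlucky schedule. This is only a simulation under assumptions: the CPU ids are passed in by hand instead of being read from the system, and per_cpu_qnode, publish_and_claim and simulate_migration_race are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS   4
#define NO_TAIL 0xffffu             /* hypothetical "queue empty" marker */

/* Stand-in for the per-CPU queue node; owner is 0 while the node is free.  */
struct qnode { int owner; };

static struct qnode per_cpu_qnode[NCPUS];
static _Atomic uint16_t tail_cpuid = NO_TAIL;

/* Second half of the enqueue: publish the CPU id sampled earlier and claim
   that CPU's node.  Returns false if another thread already uses the node.  */
static bool publish_and_claim(int thread, uint16_t sampled_cpu)
{
    uint16_t prev = atomic_exchange_explicit(&tail_cpuid, sampled_cpu,
                                             memory_order_acquire);
    (void) prev;                    /* would locate the old tail node */
    struct qnode *mine = &per_cpu_qnode[sampled_cpu];
    if (mine->owner != 0)
        return false;
    mine->owner = thread;
    return true;
}

/* Replays one unlucky schedule by hand.  Returns true if thread A ends up
   on a node that thread B has already claimed.  */
static bool simulate_migration_race(void)
{
    for (int i = 0; i < NCPUS; i++)
        per_cpu_qnode[i].owner = 0;
    atomic_store(&tail_cpuid, NO_TAIL);

    uint16_t a_sample = 1;          /* A samples its CPU id on CPU 1 ...      */
                                    /* ... and is preempted before publishing */
    uint16_t b_sample = 1;          /* B is scheduled on CPU 1, samples 1     */
    bool b_ok = publish_and_claim(2, b_sample);
    /* A resumes on CPU 2 but still publishes the stale id 1.  */
    bool a_ok = publish_and_claim(1, a_sample);
    return b_ok && !a_ok;
}
```

In the kernel this cannot happen because preemption is disabled around the two steps; in userland the two threads end up sharing one per-CPU node.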

===============================CUT HERE==================================
Another way is to store thread-specific info (e.g. the tid) in pthread_spinlock_t instead of the cpuid; that avoids
issue b), but it seems that we would break the semantics of TLS? Comments?

H.J. Lu Jan. 11, 2019, 4:23 p.m. UTC | #11

On Thu, Jan 10, 2019 at 8:32 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Carlos O'Donell:
>
> > My opinion is that for the health and evolution of a NUMA-aware spinlock
> > and MCS lock, that we should create a distinct project and library that
> > should have those locks, and then work to put them into downstream
> > distributions. This will support key users being able to use supported
> > versions of those libraries, and give the needed feedback about the API
> > and the performance. It may take 1-2 years to get that feedback and every
> > piece of feedback will improve the final API/ABI we put into glibc or
> > even into the next ISO C standard as part of the C thread interface.
>
> I think it's something that could land in tbb, for which many
> distributions already have mechanisms to ship updated versions after a
> release.

We will work on a standalone library:

https://gitlab.com/numa-spinlock/numa-spinlock

The implementation is done.  But many pieces are missing:

1. Documentation.
2. Make tests portable for non-x86 platforms.
3. Do we need symbol versioning?
4. ...

We are looking for contributors.

Thanks.

Torvald Riegel Jan. 14, 2019, 10:45 p.m. UTC | #12

On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote:
> On 1/10/19 11:32 AM, Florian Weimer wrote:
> > * Carlos O'Donell:
> > 
> > > My opinion is that for the health and evolution of a NUMA-aware spinlock
> > > and MCS lock, that we should create a distinct project and library that
> > > should have those locks, and then work to put them into downstream
> > > distributions. This will support key users being able to use supported
> > > versions of those libraries, and give the needed feedback about the API
> > > and the performance. It may take 1-2 years to get that feedback and every
> > > piece of feedback will improve the final API/ABI we put into glibc or
> > > even into the next ISO C standard as part of the C thread interface.
> > 
> > I think it's something that could land in tbb, for which many
> > distributions already have mechanisms to ship updated versions after a
> > release.
> 
> Absolutely. That's a great idea.
> 

I don't think tbb is a useful vehicle.  It would require that many
applications use the tbb mutexes, which I doubt is the case.

Torvald Riegel Jan. 14, 2019, 11:03 p.m. UTC | #13

On Sat, 2019-01-05 at 07:34 -0500, Carlos O'Donell wrote:
> On 1/3/19 2:58 PM, H.J. Lu wrote:
> > +libpthread {
> > +  GLIBC_2.29 {
> > +    numa_spinlock_alloc;
> > +    numa_spinlock_free;
> > +    numa_spinlock_init;
> > +    numa_spinlock_apply;
> > +  }
> > +}
> 
> Why are we adding these non-standard interfaces to glibc?

I also think that there shouldn't be a new API added for these.  There
would be the option of adding a new pthread_mutex_t type, and thus even get
some more space for the implementation, but I wouldn't like that either.

What I think we should be doing instead:

(1) Finally add proper spinning and back-off throughout our synchronization
abstractions (spinlocks, mutexes, etc.; see my other comment on this
thread).  This should improve performance significantly, either through
less contention during spinning, or through doing more spinning and thus
fewer trips through futexes and the kernel.

(2) Change the synchronization abstractions (especially mutexes) so that
they can efficiently run different code paths when they are of non-process-
shared type, and then start with doing things like MCS in the non-process-
shared mutexes.

For both of these, add proper benchmarks so that tuning decisions can be
checked automatically and tested for regressions.  IOW, we want tests for
whether the tuning decisions are still the correct ones (in the future, or
on future HW).
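
As a rough illustration of what "spinning and back-off" in (1) can mean, here is a test-and-test-and-set lock with exponential back-off in C11. It is a sketch, not glibc's implementation; the type and function names and the BACKOFF_MIN and BACKOFF_MAX values are placeholders, and picking such values well is exactly what the benchmarks mentioned above would have to check.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_MIN 4u      /* placeholder tuning values */
#define BACKOFF_MAX 1024u

typedef struct { atomic_bool locked; } tas_lock;

/* Hint to the CPU that we are in a spin-wait loop (x86-64 only here).  */
static inline void cpu_relax(void)
{
#if defined(__x86_64__)
    __builtin_ia32_pause();
#endif
}

/* Double the delay up to a fixed cap.  */
static unsigned next_backoff(unsigned cur)
{
    return cur * 2 > BACKOFF_MAX ? BACKOFF_MAX : cur * 2;
}

static bool tas_trylock(tas_lock *l)
{
    /* Read first, so that waiters mostly share the cache line read-only
       and only attempt the exchange when the lock looks free.  */
    return !atomic_load_explicit(&l->locked, memory_order_relaxed)
        && !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void tas_lock_acquire(tas_lock *l)
{
    unsigned delay = BACKOFF_MIN;

    while (!tas_trylock(l)) {
        /* Stay away from the lock for a while, longer after each failure.  */
        for (unsigned i = 0; i < delay; i++)
            cpu_relax();
        delay = next_backoff(delay);
    }
}

static void tas_unlock(tas_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A mutex would additionally fall back to a futex wait once spinning has gone on for too long; that part is left out here.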

The main reason why I think that's a better approach is that the biggest
return on the investments by the glibc community is improving performance
for as many users as possible, even if it may not be ideal performance for
those unchanged programs.  Adding a special API+implementation instead
might enable somewhat larger performance improvements, but it requires
programs to change and programmers to be aware of it, so will likely remain
a niche case in practice.

> It could be implemented as a distinct library, allowed to evolve quickly
> in response to customer need, and eventually integrated into glibc if the
> API proves stable. A similar model has been setup by Boost and C++ just to
> draw some parallels.

I think first of all, we should try hard to get as much performance out
of the interfaces, POSIX semantics constraints, and ABI constraints we have
today.  Then maybe change the ABI if it really unlocks further benefits.

If that's not sufficient to get decent performance, then I think the next
venue to look at is C++.  That is, if implementing these ideas to the full
extent is not possible in glibc, go to libstdc++ instead and see what's
possible there.  C++'s synchronization constructs have saner semantics than
POSIX's, and ABI breaks in the future are more likely for C++ than for
glibc.
If the current C++ synchronization constructs have semantics that inhibit
performance, then ISO C++ Study Group 1 will likely want to hear about it. 
And they are a much better group to discuss this than the glibc community
is, simply because they are focused on just parallelism and concurrency.

Florian Weimer Jan. 15, 2019, 9:32 a.m. UTC | #14

* Torvald Riegel:

> On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote:
>> On 1/10/19 11:32 AM, Florian Weimer wrote:
>> > * Carlos O'Donell:
>> > 
>> > > My opinion is that for the health and evolution of a NUMA-aware spinlock
>> > > and MCS lock, that we should create a distinct project and library that
>> > > should have those locks, and then work to put them into downstream
>> > > distributions. This will support key users being able to use supported
>> > > versions of those libraries, and give the needed feedback about the API
>> > > and the performance. It may take 1-2 years to get that feedback and every
>> > > piece of feedback will improve the final API/ABI we put into glibc or
>> > > even into the next ISO C standard as part of the C thread interface.
>> > 
>> > I think it's something that could land in tbb, for which many
>> > distributions already have mechanisms to ship updated versions after a
>> > release.
>> 
>> Absolutely. That's a great idea.
>> 
>
> I don't think tbb is a useful vehicle.  It would require that many
> applications use the tbb mutexes, which I doubt is the case.

That doesn't really matter because it's a new API anyway.

Thanks,
Florian

Torvald Riegel Jan. 15, 2019, 12:01 p.m. UTC | #15

On Tue, 2019-01-15 at 10:32 +0100, Florian Weimer wrote:
> * Torvald Riegel:
> 
> > On Thu, 2019-01-10 at 11:41 -0500, Carlos O'Donell wrote:
> > > On 1/10/19 11:32 AM, Florian Weimer wrote:
> > > > * Carlos O'Donell:
> > > > 
> > > > > My opinion is that for the health and evolution of a NUMA-aware spinlock
> > > > > and MCS lock, that we should create a distinct project and library that
> > > > > should have those locks, and then work to put them into downstream
> > > > > distributions. This will support key users being able to use supported
> > > > > versions of those libraries, and give the needed feedback about the API
> > > > > and the performance. It may take 1-2 years to get that feedback and every
> > > > > piece of feedback will improve the final API/ABI we put into glibc or
> > > > > > even into the next ISO C standard as part of the C thread interface.
> > > > 
> > > > > I think it's something that could land in tbb, for which many
> > > > distributions already have mechanisms to ship updated versions after a
> > > > release.
> > > 
> > > Absolutely. That's a great idea.
> > > 
> > 
> > I don't think tbb is a useful vehicle.  It would require that many
> > applications use the tbb mutexes, which I doubt is the case.
> 
> That doesn't really matter because it's a new API anyway.

What I mean is that applications would have to want to use locks provided
by tbb, whether those are locks/mutexes that exist in tbb today or a new
API that would be added.

Put differently, I'm not optimistic about tbb being a good way to get
feedback.

Florian Weimer Jan. 15, 2019, 12:17 p.m. UTC | #16

* Torvald Riegel:

> What I mean is that applications would have to want to use locks provided
> by tbb, whether those are locks/mutexes that exist in tbb today or a new
> API that would be added.
>
> Put differently, I'm not optimistic about tbb being a good way to get
> feedback.

Do you want to run existing workloads with a new mutex implementation?

Then we can't add new flags or change ABI in any way and would have to
use a tunable.  And to get feedback, we would have to make the new
implementation the default, with a tunable to get back the old
implementation.

Thanks,
Florian

Torvald Riegel Jan. 15, 2019, 12:30 p.m. UTC | #17

On Tue, 2019-01-15 at 13:17 +0100, Florian Weimer wrote:
> * Torvald Riegel:
> 
> > What I mean is that applications would have to want to use locks provided
> > by tbb, whether those are locks/mutexes that exist in tbb today or a new
> > API that would be added.
> > 
> > Put differently, I'm not optimistic about tbb being a good way to get
> > feedback.
> 
> Do you want to run existing workloads with a new mutex implementation?

We need to get there.

> Then we can't add new flags or change ABI in any way

Yes.

> and would have to
> use a tunable.  And to get feedback, we would have to make the new
> implementation the default, with a tunable to get back the old
> implementation.

I wouldn't be too concerned about getting back the old implementation, so
maybe we don't even need a tunable right now.  The old implementation is
just no spinning, so the cases where I can imagine the tunable being useful
are (1) experimentation to compare performance without using
different glibc's and (2) going back to old behavior in cases where we
really screwed up.  But how many users will have the time to investigate
(2), really?

Nobody should have to tune their spinlocks, or the back-off in mutexes. 
It's our duty to make sure this has good average-case performance.

Patch

From 747e940a4f3c59ce8bba68c3334b619f1807727a Mon Sep 17 00:00:00 2001
From: "ling.ma" <ling.ml@antfin.com>
Date: Mon, 26 Nov 2018 21:31:51 +0800
Subject: [PATCH] NUMA spinlock [BZ #23962]

On multi-socket systems, memory is shared across the entire system.
Data access to the local socket is much faster than the remote socket
and data access to the local core is faster than sibling cores on the
same socket.  For serialized workloads with conventional spinlock,
when there is high spinlock contention between threads, lock ping-pong
among sockets becomes the bottleneck and threads spend the majority
of their time in spinlock overhead.

On multi-socket systems, the keys to our NUMA spinlock performance
are to minimize cross-socket traffic and to localize the serialized
workload to one core for execution.  NUMA spinlock is built on the
following approaches, which reduce data movement and accelerate the
critical section, and eventually give us a significant performance
improvement.

1. MCS spinlock
MCS spinlock helps us to reduce useless lock movement in the
spinning state.  This paper provides a good description of this
kind of lock:
<http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>

2. Critical Section Integration (CSI)
Essentially, a spinlock behaves as if one core completed the critical
sections one by one.  So when contention happens, the serialized work
is sent to the core that owns the lock, which is then responsible for
executing it.  That saves much time and power, because all shared data
are located in the private cache of the lock owner.

We implemented this mechanism based on the queued spinlock in the
kernel; it speeds up the critical section and reduces the probability
of contention.  This paper provides a good description of this kind
of lock:
<https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>

3. NUMA Aware Spinlock (NAS)
Currently multi-socket systems give us better performance per watt,
but they also involve more complex synchronization requirements,
because off-chip data movement is much slower.  We use a distributed
synchronization mechanism to reduce the movement of the lock cache
line between nodes.  This paper provides a good description of this
kind of lock:
<https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>

4. Yield Schedule
When threads apply for Critical Section Integration (CSI) under
known contention, they delegate their work to the thread that owns
the lock and wait for the work to be completed.  The resources they
are using should be handed over to other threads in the meantime.  To
speed up this scenario, we introduce a yield_sched function in the
spinning stage.

5. Optimization when NUMA is ON or OFF.
Although programs can access memory with lower latency when NUMA is
enabled, some programs may need more memory bandwidth for computation
with NUMA disabled.  We also optimize multi-socket systems with NUMA
disabled.
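
The yield step in item 4 can be pictured as a bounded spin followed by a call into the scheduler. The sketch below assumes that the waiter watches a completion flag and that yielding is done with POSIX sched_yield(); the patch's own yield_sched function may work differently, and SPINS_BEFORE_YIELD is a made-up tuning value.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPINS_BEFORE_YIELD 64   /* hypothetical tuning value */

/* Wait until *done becomes nonzero, i.e. until the lock owner has run
   our delegated work.  Returns how many times we yielded the CPU.  */
static unsigned wait_for_completion(atomic_int *done)
{
    unsigned yields = 0;

    for (;;) {
        for (int i = 0; i < SPINS_BEFORE_YIELD; i++)
            if (atomic_load_explicit(done, memory_order_acquire))
                return yields;
        sched_yield();          /* let another runnable thread use the CPU */
        yields++;
    }
}
```

Yielding matters most when there are more runnable threads than cores, because the lock owner may otherwise be waiting for CPU time that the spinners are consuming.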

NUMA spinlock flow chart (assuming there are 2 CPU nodes):

1. Threads from node_0/node_1 try to acquire the local lock for
node_0/1 respectively.  If a thread succeeds, it goes to step 2;
otherwise it pushes its critical function onto the current local
work queue and enters the spinning stage in MCS mode.

2. Threads from node_0/node_1 try to acquire the global lock.  If a
thread succeeds and becomes the lock owner, it goes to step 3;
otherwise it waits until the lock owner releases the global lock.

3. The lock owner thread from node_0/1 enters the critical section,
drains the work queue by performing all local critical functions
pushed at step 1 with CSI on behalf of the other threads, and informs
those spinning threads that their work has been done.  It then
releases the local lock.

4. The lock owner thread frees the global lock.  If another thread is
waiting at step 2, the lock owner thread passes the global lock to
the waiting thread and returns.  The new lock owner thread enters
step 3.  If no threads are waiting, the lock owner thread releases
the global lock and returns.  The whole critical section process is
then complete.

Steps 1 and 2 mitigate global lock contention.  Only one thread
from each node competes for the global lock in step 2.  Step 3
reduces movement of the global lock and shared data because they
stay on the same node and the same core.  Our data shows that
Critical Section Integration (CSI) improves data locality and
NUMA-aware spinlock (NAS) helps CSI balance the workload.
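
The local-then-global ordering of steps 1 and 2 can be sketched as
follows.  This is a deliberately simplified illustration, not the
patch's implementation: it uses plain test-and-set flags instead of MCS
queues, it omits the CSI work delegation of step 3, and it assigns
threads to nodes by index rather than by querying the CPU.  The names
struct two_level_lock, two_level_acquire, two_level_release and
two_level_demo are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

enum { NODES = 2, THREADS_PER_NODE = 2, ITERATIONS = 10000 };

/* Hypothetical two-level lock: one flag per node plus one global flag.
   This is not the data structure used by the patch.  */
struct two_level_lock
{
  atomic_flag local[NODES];
  atomic_flag global;
};

static struct two_level_lock demo_lock;
static long shared_counter;

static void
spin_acquire (atomic_flag *flag)
{
  while (atomic_flag_test_and_set_explicit (flag, memory_order_acquire))
    sched_yield ();
}

static void
two_level_acquire (struct two_level_lock *l, unsigned int node)
{
  spin_acquire (&l->local[node]);  /* Step 1: compete within the node.  */
  spin_acquire (&l->global);       /* Step 2: one competitor per node.  */
}

static void
two_level_release (struct two_level_lock *l, unsigned int node)
{
  /* Same order as steps 3 and 4: local lock first, then global lock.  */
  atomic_flag_clear_explicit (&l->local[node], memory_order_release);
  atomic_flag_clear_explicit (&l->global, memory_order_release);
}

static void *
worker (void *arg)
{
  unsigned int node = *(unsigned int *) arg;
  for (int i = 0; i < ITERATIONS; i++)
    {
      two_level_acquire (&demo_lock, node);
      shared_counter++;  /* The critical section.  */
      two_level_release (&demo_lock, node);
    }
  return NULL;
}

/* Runs all workers and returns the final counter value.  */
long
two_level_demo (void)
{
  pthread_t thr[NODES * THREADS_PER_NODE];
  unsigned int node_of[NODES * THREADS_PER_NODE];

  for (int n = 0; n < NODES; n++)
    atomic_flag_clear (&demo_lock.local[n]);
  atomic_flag_clear (&demo_lock.global);
  shared_counter = 0;

  for (int i = 0; i < NODES * THREADS_PER_NODE; i++)
    {
      node_of[i] = i % NODES;  /* Pretend node assignment.  */
      int err = pthread_create (&thr[i], NULL, worker, &node_of[i]);
      assert (err == 0);
    }
  for (int i = 0; i < NODES * THREADS_PER_NODE; i++)
    pthread_join (thr[i], NULL);
  return shared_counter;
}
```

Because a thread must hold its node's local flag before it touches the
global flag, at most one thread per node spins on the global cache line
at any time, which is the effect described for steps 1 and 2.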

NUMA spinlock can greatly speed up critical sections on multi-socket
systems, and it should improve spinlock performance on all such
systems.

2019-01-03  Ling Ma  <ling.ml@antfin.com>
	    H.J. Lu  <hongjiu.lu@intel.com>
	    Wei Xiao  <wei3.xiao@intel.com>

	[BZ #23962]
	* NEWS: Mention NUMA spinlock.
	* manual/examples/numa-spinlock.c: New file.
	* sysdeps/unix/sysv/linux/numa-spinlock-private.h: Likewise.
	* sysdeps/unix/sysv/linux/numa-spinlock.c: Likewise.
	* sysdeps/unix/sysv/linux/numa-spinlock.h: Likewise.
	* sysdeps/unix/sysv/linux/numa_spinlock_alloc.c: Likewise.
	* sysdeps/unix/sysv/linux/tst-numa-variable-overhead.c: Likewise.
	* sysdeps/unix/sysv/linux/tst-variable-overhead-skeleton.c:
	Likewise.
	* sysdeps/unix/sysv/linux/tst-variable-overhead.c: Likewise.
	* manual/threads.texi: Document NUMA spinlock.
	* sysdeps/unix/sysv/linux/Makefile (libpthread-sysdep_routines):
	Add numa_spinlock_alloc and numa-spinlock.
	(sysdep_headers): Add numa-spinlock.h.
	(xtests): Add tst-variable-overhead and tst-numa-variable-overhead.
	* sysdeps/unix/sysv/linux/Versions (libpthread): Add
	numa_spinlock_alloc, numa_spinlock_free, numa_spinlock_init
	and numa_spinlock_apply to GLIBC_2.29.
	* sysdeps/unix/sysv/linux/aarch64/libpthread.abilist: Add
	numa_spinlock_alloc, numa_spinlock_apply, numa_spinlock_free
	and numa_spinlock_init.
	* sysdeps/unix/sysv/linux/alpha/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/arm/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/csky/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/hppa/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/i386/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/ia64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/nios2/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/sh/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist:
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist: Likewise.
---
 NEWS                                          |   3 +
 manual/examples/numa-spinlock.c               |  99 +++++
 manual/threads.texi                           | 105 +++++
 sysdeps/unix/sysv/linux/Makefile              |   3 +
 sysdeps/unix/sysv/linux/Versions              |   9 +
 .../sysv/linux/aarch64/libpthread.abilist     |   4 +
 .../unix/sysv/linux/alpha/libpthread.abilist  |   4 +
 .../unix/sysv/linux/arm/libpthread.abilist    |   4 +
 .../unix/sysv/linux/csky/libpthread.abilist   |   4 +
 .../unix/sysv/linux/hppa/libpthread.abilist   |   4 +
 .../unix/sysv/linux/i386/libpthread.abilist   |   4 +
 .../unix/sysv/linux/ia64/libpthread.abilist   |   4 +
 .../linux/m68k/coldfire/libpthread.abilist    |   4 +
 .../sysv/linux/m68k/m680x0/libpthread.abilist |   4 +
 .../sysv/linux/microblaze/libpthread.abilist  |   4 +
 .../sysv/linux/mips/mips32/libpthread.abilist |   4 +
 .../sysv/linux/mips/mips64/libpthread.abilist |   4 +
 .../unix/sysv/linux/nios2/libpthread.abilist  |   4 +
 .../unix/sysv/linux/numa-spinlock-private.h   |  38 ++
 sysdeps/unix/sysv/linux/numa-spinlock.c       | 327 +++++++++++++++
 sysdeps/unix/sysv/linux/numa-spinlock.h       |  64 +++
 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c | 304 ++++++++++++++
 .../powerpc/powerpc32/libpthread.abilist      |   4 +
 .../powerpc/powerpc64/be/libpthread.abilist   |   4 +
 .../powerpc/powerpc64/le/libpthread.abilist   |   4 +
 .../sysv/linux/riscv/rv64/libpthread.abilist  |   4 +
 .../linux/s390/s390-32/libpthread.abilist     |   4 +
 .../linux/s390/s390-64/libpthread.abilist     |   4 +
 sysdeps/unix/sysv/linux/sh/libpthread.abilist |   4 +
 .../linux/sparc/sparc32/libpthread.abilist    |   4 +
 .../linux/sparc/sparc64/libpthread.abilist    |   4 +
 .../sysv/linux/tst-numa-variable-overhead.c   |  53 +++
 .../linux/tst-variable-overhead-skeleton.c    | 397 ++++++++++++++++++
 .../unix/sysv/linux/tst-variable-overhead.c   |  47 +++
 .../sysv/linux/x86_64/64/libpthread.abilist   |   4 +
 .../sysv/linux/x86_64/x32/libpthread.abilist  |   4 +
 36 files changed, 1545 insertions(+)
 create mode 100644 manual/examples/numa-spinlock.c
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock-private.h
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.c
 create mode 100644 sysdeps/unix/sysv/linux/numa-spinlock.h
 create mode 100644 sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-numa-variable-overhead.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-variable-overhead-skeleton.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-variable-overhead.c

diff --git a/NEWS b/NEWS
index cc20102fda..bcaa932d4e 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,9 @@  Version 2.29
 
 Major new features:
 
+* NUMA spinlock is added to provide a spinlock implementation optimized
+  for multi-socket NUMA systems.
+
 * The getcpu wrapper function has been added, which returns the currently
   used CPU and NUMA node.  This function is Linux-specific.
 
diff --git a/manual/examples/numa-spinlock.c b/manual/examples/numa-spinlock.c
new file mode 100644
index 0000000000..ca98443f69
--- /dev/null
+++ b/manual/examples/numa-spinlock.c
@@ -0,0 +1,99 @@ 
+/* NUMA spinlock example.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or
+   modify it under the terms of the GNU General Public License
+   as published by the Free Software Foundation; either version 2
+   of the License, or (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, see
+   <http://www.gnu.org/licenses/>.
+*/
+
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <numa-spinlock.h>
+
+#define NUM_THREADS	20
+
+struct numa_spinlock *lock;
+
+struct work_todo_argument
+{
+  void *arg;
+};
+
+static void *
+work_todo (void *v)
+{
+  /* Do the real work with p->arg. */
+  struct work_todo_argument *p = v;
+  /* Return value is set to lock_info.result. */
+  return NULL;
+}
+
+void *
+work_thread (void *arg)
+{
+  struct work_todo_argument work_todo_arg;
+  struct numa_spinlock_info lock_info;
+
+  if (numa_spinlock_init (lock, &lock_info))
+    {
+      printf ("numa_spinlock_init failure: %m\n");
+      exit (1);
+    }
+
+  work_todo_arg.arg = arg;
+  lock_info.argument = &work_todo_arg;
+  lock_info.workload = work_todo;
+
+  numa_spinlock_apply (&lock_info);
+
+  return lock_info.result;
+}
+
+int
+main (int argc, char **argv)
+{
+  lock = numa_spinlock_alloc ();
+  pthread_t thr[NUM_THREADS];
+  void *res[NUM_THREADS];
+  int numthreads = NUM_THREADS;
+  int i;
+
+  for (i = 0; i < NUM_THREADS; i++)
+    {
+      int err_ret = pthread_create (&thr[i], NULL, work_thread,
+				    (void *) (intptr_t) i);
+      if (err_ret != 0)
+	{
+	  printf ("pthread_create failed: %d, %s\n",
+		  i, strerror (err_ret));
+	  numthreads = i;
+	  break;
+	}
+    }
+
+  for (i = 0; i < numthreads; i++)
+    {
+      if ((errno = pthread_join (thr[i], (void *) &res[i])) == 0)
+	free (res[i]);
+      else
+	printf ("pthread_join failure: %m\n");
+    }
+
+  numa_spinlock_free (lock);
+
+  return 0;
+}
diff --git a/manual/threads.texi b/manual/threads.texi
index 87fda7d8e7..e82ae0d51b 100644
--- a/manual/threads.texi
+++ b/manual/threads.texi
@@ -625,6 +625,9 @@  the standard.
 @menu
 * Default Thread Attributes::             Setting default attributes for
 					  threads in a process.
+* NUMA Spinlock::                         Spinlock optimized for
+					  multi-socket NUMA platforms.
+* NUMA Spinlock Example::                 A NUMA spinlock example.
 @end menu
 
 @node Default Thread Attributes
@@ -669,6 +672,108 @@  The system does not have sufficient memory.
 @end table
 @end deftypefun
 
+@node NUMA Spinlock
+@subsubsection Spinlock optimized for multi-node NUMA systems
+
+To improve performance on multi-socket NUMA platforms for serialized
+regions protected by a spinlock, @theglibc{} implements a NUMA spinlock
+object, which minimizes cross-socket traffic and sends the protected
+serialized region to one core for execution to reduce spinlock contention
+overhead.
+
+The fundamental data types for a NUMA spinlock are
+@code{numa_spinlock} and @code{numa_spinlock_info}:
+
+@deftp {Data Type} {struct numa_spinlock}
+@standards{Linux, numa-spinlock.h}
+This data type is an opaque structure.  A @code{numa_spinlock} pointer
+uniquely identifies a NUMA spinlock object.
+@end deftp
+
+@deftp {Data Type} {struct numa_spinlock_info}
+@standards{Linux, numa-spinlock.h}
+
+This data type uniquely identifies a NUMA spinlock information object for
+a thread.  It has the following members, and others internal to the NUMA
+spinlock implementation:
+
+@table @code
+@item void *(*workload) (void *)
+A function pointer to the workload function serialized by the spinlock.
+@item void *argument
+A pointer to the argument passed to the @var{workload} function.
+@item void *result
+The return value from the @var{workload} function.
+@end table
+
+@end deftp
+
+The following functions are provided for NUMA spinlock objects:
+
+@deftypefun struct numa_spinlock *numa_spinlock_alloc (void)
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
+
+This function returns a pointer to a newly allocated NUMA spinlock or a
+null pointer if the NUMA spinlock could not be allocated, setting
+@code{errno} to @code{ENOMEM}.  The caller should call
+@code{numa_spinlock_free} on the NUMA spinlock pointer to free the
+memory space when it is no longer needed.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun void numa_spinlock_free (struct numa_spinlock *@var{lock})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@asunsafe{@asulock{}}@acunsafe{@aculock{} @acsfd{} @acsmem{}}}
+
+Free the memory space pointed to by @var{lock}, which must have been
+returned by a previous call to @code{numa_spinlock_alloc}.  Otherwise,
+or if @code{numa_spinlock_free (@var{lock})} has already been called
+before, undefined behavior occurs.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun int numa_spinlock_init (struct numa_spinlock *@var{lock},
+struct numa_spinlock_info *@var{info})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+Initialize the NUMA spinlock information block pointed to by @var{info}
+with a NUMA spinlock pointer @var{lock}.  The return value is @code{0} on
+success and @code{-1} on failure.  The following @code{errno} error
+codes are defined for this function:
+
+@table @code
+@item ENOSYS
+The operating system does not support the @code{getcpu} function.
+@end table
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@deftypefun void numa_spinlock_apply (struct numa_spinlock_info *@var{info})
+@standards{Linux, numa-spinlock.h}
+@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
+
+Apply for the spinlock with the NUMA spinlock information block pointed
+to by @var{info}.  When @code{numa_spinlock_apply} returns, the spinlock
+is released and the @var{result} member of @var{info} contains the return
+value of the @var{workload} function.
+
+This function is Linux-specific and is declared in @file{numa-spinlock.h}.
+@end deftypefun
+
+@node NUMA Spinlock Example
+@subsubsection NUMA Spinlock Example
+
+A NUMA spinlock example:
+
+@smallexample
+@include numa-spinlock.c.texi
+@end smallexample
+
 @c FIXME these are undocumented:
 @c pthread_atfork
 @c pthread_attr_destroy
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index 5f8c2c7c7d..36b12e8c92 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -227,8 +227,11 @@  CFLAGS-gai.c += -DNEED_NETLINK
 endif
 
 ifeq ($(subdir),nptl)
+libpthread-sysdep_routines += numa_spinlock_alloc numa-spinlock
+sysdep_headers += numa-spinlock.h
 tests += tst-align-clone tst-getpid1 \
 	tst-thread-affinity-pthread tst-thread-affinity-pthread2 \
 	tst-thread-affinity-sched
 tests-internal += tst-setgetname
+xtests += tst-variable-overhead tst-numa-variable-overhead
 endif
diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
index f1e12d9c69..7ce7e2b276 100644
--- a/sysdeps/unix/sysv/linux/Versions
+++ b/sysdeps/unix/sysv/linux/Versions
@@ -185,3 +185,12 @@  libc {
     __netlink_assert_response;
   }
 }
+
+libpthread {
+  GLIBC_2.29 {
+    numa_spinlock_alloc;
+    numa_spinlock_free;
+    numa_spinlock_init;
+    numa_spinlock_apply;
+  }
+}
diff --git a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
index 9a9e4cee85..eb54a8363d 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
index b413007ccb..dd08796242 100644
--- a/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/arm/libpthread.abilist b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
index af82a4c632..45a5c5a8fd 100644
--- a/sysdeps/unix/sysv/linux/arm/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/arm/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.4 _IO_flockfile F
 GLIBC_2.4 _IO_ftrylockfile F
 GLIBC_2.4 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/csky/libpthread.abilist b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
index ea4b79a518..cf65f72ae1 100644
--- a/sysdeps/unix/sysv/linux/csky/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/csky/libpthread.abilist
@@ -73,6 +73,10 @@  GLIBC_2.29 mtx_timedlock F
 GLIBC_2.29 mtx_trylock F
 GLIBC_2.29 mtx_unlock F
 GLIBC_2.29 nanosleep F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.29 open F
 GLIBC_2.29 open64 F
 GLIBC_2.29 pause F
diff --git a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
index bcba07f575..a80475fd04 100644
--- a/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/i386/libpthread.abilist b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
index bece86d246..40ac05a471 100644
--- a/sysdeps/unix/sysv/linux/i386/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
index ccc9449826..5b190f69af 100644
--- a/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
index af82a4c632..45a5c5a8fd 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.4 _IO_flockfile F
 GLIBC_2.4 _IO_ftrylockfile F
 GLIBC_2.4 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
index bece86d246..40ac05a471 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
index 5067375d23..e6539bf9a8 100644
--- a/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
index 02144967c6..76edcb8d54 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
index 02144967c6..76edcb8d54 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
index 78cac2ae27..3141d08d00 100644
--- a/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libpthread.abilist
@@ -241,3 +241,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock-private.h b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
new file mode 100644
index 0000000000..0ddffef44c
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock-private.h
@@ -0,0 +1,38 @@ 
+/* Internal definitions and declarations for NUMA spinlock.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "numa-spinlock.h"
+
+/* The global NUMA spinlock.  */
+struct numa_spinlock
+{
+  /* List of threads that own the global NUMA spinlock.  */
+  struct numa_spinlock_info *owner;
+  /* The maximum NUMA node number.  */
+  unsigned int max_node;
+  /* Non-zero for a single-node system.  */
+  unsigned int single_node;
+  /* The maximum CPU number.  Used only when NUMA is disabled.  */
+  unsigned int max_cpu;
+  /* Array of physical_package_id of each core if it isn't NULL.  Used
+     only when NUMA is disabled.  */
+  unsigned int *physical_package_id_p;
+  /* Arrays of lists of threads who are spinning for the local NUMA lock
+     on NUMA nodes indexed by NUMA node number.  */
+  struct numa_spinlock_info *lists[];
+};
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.c b/sysdeps/unix/sysv/linux/numa-spinlock.c
new file mode 100644
index 0000000000..c226bbb22c
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock.c
@@ -0,0 +1,327 @@ 
+/* NUMA spinlock
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+#include <string.h>
+#include <stdlib.h>
+#include <sched.h>
+#ifndef HAVE_GETCPU
+#include <unistd.h>
+#include <syscall.h>
+#endif
+#include <errno.h>
+#include <atomic.h>
+#include "numa-spinlock-private.h"
+
+#if !defined HAVE_GETCPU && defined _LIBC
+# define HAVE_GETCPU
+#endif
+
+/* On multi-socket systems, memory is shared across the entire system.
+   Data access to the local socket is much faster than the remote socket
+   and data access to the local core is faster than sibling cores on the
+   same socket.  For serialized workloads with a conventional spinlock,
+   when there is high spinlock contention between threads, lock ping-pong
+   among sockets becomes the bottleneck and threads spend the majority
+   of their time in spinlock overhead.
+
+   On multi-socket systems, the keys to NUMA spinlock performance are
+   to minimize cross-socket traffic and to localize the serialized
+   workload to one core for execution.  NUMA spinlock is built mainly
+   on the following approaches, which reduce data movement and
+   accelerate the critical section, and eventually give a significant
+   performance improvement.
+
+   1. MCS spinlock
+   MCS spinlock helps us to reduce useless lock movement in the
+   spinning state.  This paper provides a good description of this
+   kind of lock:
+   <http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf>
+
+   2. Critical Section Integration (CSI)
+   A spinlock is essentially equivalent to one core completing critical
+   sections one by one.  So when contention happens, the serialized work
+   is sent to the core that owns the lock, which is responsible for
+   executing it.  That saves much time and power, because all shared
+   data are located in the private cache of the lock owner.
+
+   We implemented this mechanism based on the kernel's queued spinlock,
+   which speeds up critical sections and reduces the probability of
+   contention.  This paper provides a good description of this lock:
+   <https://users.ece.cmu.edu/~omutlu/pub/acs_asplos09.pdf>
+
+   3. NUMA Aware Spinlock (NAS)
+   Multi-socket systems currently give us better performance per watt,
+   but they also impose more complex synchronization requirements
+   because off-chip data movement is much slower.  We use a distributed
+   synchronization mechanism to reduce movement of the lock cache line
+   between nodes.  This paper provides a good description of this kind
+   of lock:
+   <https://www.usenix.org/system/files/conference/atc17/atc17-kashyap.pdf>
+
+   4. Yield Schedule
+   When threads apply for Critical Section Integration (CSI) under
+   known contention, they delegate their work to the thread that owns
+   the lock and wait for the work to be completed.  The resources they
+   are using should be released to other threads.  To accelerate this
+   scenario, we introduce the yield_sched function during the spinning
+   stage.
+
+   5. Optimization when NUMA is ON or OFF.
+   Although programs can access memory with lower latency when NUMA is
+   enabled, some programs may need more memory bandwidth for computation
+   with NUMA disabled.  We also optimize multi-socket systems with NUMA
+   disabled.
+
+   NUMA spinlock flow chart (assuming there are 2 CPU nodes):
+
+   1. Threads from node_0/node_1 try to acquire the local lock for
+   node_0/1 respectively.  If a thread succeeds, it goes to step 2;
+   otherwise it pushes its critical function onto the current local
+   work queue and enters the spinning stage in MCS mode.
+
+   2. Threads from node_0/node_1 try to acquire the global lock.  If a
+   thread succeeds and becomes the lock owner, it goes to step 3;
+   otherwise it waits until the lock owner releases the global lock.
+
+   3. The lock owner thread from node_0/1 enters the critical section,
+   drains the work queue by performing all local critical functions
+   pushed at step 1 with CSI on behalf of the other threads, and informs
+   those spinning threads that their work has been done.  It then
+   releases the local lock.
+
+   4. The lock owner thread frees the global lock.  If another thread is
+   waiting at step 2, the lock owner thread passes the global lock to
+   the waiting thread and returns.  The new lock owner thread enters
+   step 3.  If no threads are waiting, the lock owner thread releases
+   the global lock and returns.  The whole critical section process is
+   then complete.
+
+   Steps 1 and 2 mitigate global lock contention.  Only one thread
+   from each node competes for the global lock in step 2.  Step 3
+   reduces movement of the global lock and shared data because they
+   stay on the same node and the same core.  Our data shows that
+   Critical Section Integration (CSI) improves data locality and
+   NUMA-aware spinlock (NAS) helps CSI balance the workload.
+
+   NUMA spinlock can greatly speed up critical sections on multi-socket
+   systems, and it should improve spinlock performance on all such
+   systems.
+
+   NOTE: LiTL <https://github.com/multicore-locks/litl> is an open-source
+   project that provides implementations of dozens of locks, including
+   several state-of-the-art NUMA-aware spinlocks.  Among them:
+
+   1. Hierarchical MCS (HMCS) spinlock.  Milind Chabbi, Michael Fagan,
+   and John Mellor-Crummey.  High Performance Locks for Multi-level NUMA
+   Systems.  In Proceedings of the ACM SIGPLAN Symposium on Principles
+   and Practice of Parallel Programming (PPoPP), pages 215–226, 2015.
+
+   2. Cohort-MCS (C-MCS) spinlock.  Dave Dice, Virendra J. Marathe, and
+   Nir Shavit.  Lock Cohorting: A General Technique for Designing NUMA
+   Locks.  ACM Trans. Parallel Comput., 1(2):13:1–13:42, 2015.
+ */
+
+/* Get the next thread pointed to by *NEXT_P.  NB: We must use a while
+   spin loop to load NEXT_P since there is a small window before *NEXT_P
+   is updated.  */
+
+static inline struct numa_spinlock_info *
+get_numa_spinlock_info_next (struct numa_spinlock_info **next_p)
+{
+  struct numa_spinlock_info *next;
+  while (!(next = atomic_load_relaxed (next_p)))
+    atomic_spin_nop ();
+  return next;
+}
+
+/* While holding the global NUMA spinlock, run the workload of the
+   thread pointed to by SELF first, then run the workload for each
+   thread on the thread list pointed to by HEAD_P and wake up the
+   thread so that all workloads run on a single processor.  */
+
+static inline void
+run_numa_spinlock (struct numa_spinlock_info *self,
+		   struct numa_spinlock_info **head_p)
+{
+  struct numa_spinlock_info *next, *current;
+
+  /* Run the SELF's workload. */
+  self->result = self->workload (self->argument);
+
+  /* Process workloads for the rest of threads on the thread list.
+     NB: The thread list may be prepended by other threads at the
+     same time.  */
+
+retry:
+   /* If SELF is the first thread of the thread list pointed to by
+      HEAD_P, clear the thread list.  */
+  current = atomic_compare_and_exchange_val_acq (head_p, NULL, self);
+  if (current == self)
+    {
+      /* Since SELF is the only thread on the list, clear SELF's pending
+         field and return.  */
+      atomic_store_release (&current->pending, 0);
+      return;
+    }
+
+  /* CURRENT will have the previous first thread of the thread list
+     pointed to by HEAD_P and *HEAD_P will point to SELF.  */
+  current = atomic_exchange_acquire (head_p, self);
+
+  /* NB: No need to check if CURRENT == SELF here since SELF can never
+     be CURRENT.  */
+
+repeat:
+  /* Get the next thread.  */
+  next = get_numa_spinlock_info_next (&current->next);
+
+  /* Run CURRENT's workload and clear CURRENT's pending field.  */
+  current->result = current->workload (current->argument);
+  atomic_store_release (&current->pending, 0);
+
+  /* Process the workload for each thread from CURRENT to SELF on the
+     thread list.  Don't pass beyond SELF since SELF is the last thread
+     on the list.  */
+  if (next == self)
+    goto retry;
+  current = next;
+  goto repeat;
+}
+
+/* Apply for the NUMA spinlock with the NUMA spinlock info data pointed
+   to by SELF.  */
+
+void
+numa_spinlock_apply (struct numa_spinlock_info *self)
+{
+  struct numa_spinlock *lock = self->lock;
+  struct numa_spinlock_info *first, *next;
+  struct numa_spinlock_info **head_p;
+
+  self->next = NULL;
+  /* We want the global NUMA spinlock.  */
+  self->pending = 1;
+  /* Select the local NUMA spinlock list by the NUMA node number.  */
+  head_p = &lock->lists[self->node];
+  /* FIRST will have the previous first thread of the local NUMA spinlock
+     list and *HEAD_P will point to SELF.  */
+  first = atomic_exchange_acquire (head_p, self);
+  if (first)
+    {
+      /* SELF has been prepended to the thread list pointed to by
+	 HEAD_P.  NB: There is a small window between updating
+	 *HEAD_P and self->next.  */
+      atomic_store_release (&self->next, first);
+      /* Let other threads run first since another thread will run our
+	 workload for us.  */
+      sched_yield ();
+      /* Spin until our PENDING is cleared.  */
+      while (atomic_load_relaxed (&self->pending))
+	atomic_spin_nop ();
+      return;
+    }
+
+  /* NB: Now SELF must be the only thread on the thread list pointed
+     to by HEAD_P.  Since a thread is always prepended to HEAD_P, we
+     can use *HEAD_P == SELF to check if SELF is the only thread on
+     the thread list.  */
+
+  if (__glibc_unlikely (lock->single_node))
+    {
+      /* If there is only one node, there is no need for the global
+         NUMA spinlock.  */
+      run_numa_spinlock (self, head_p);
+      return;
+    }
+
+  /* FIRST will have the previous value of lock->owner, i.e. the first
+     thread of the local NUMA spinlock list which holds or waits for
+     the global NUMA spinlock, and lock->owner will point to SELF.  */
+  first = atomic_exchange_acquire (&lock->owner, self);
+  if (first)
+    {
+      /* SELF has been added to the thread list pointed to by
+	 lock->owner.  NB: There is a small window between updating
+	 lock->owner and first->next.  */
+      atomic_store_release (&first->next, self);
+      /* Spin until the list of threads which holds the global NUMA
+	 spinlock clears our PENDING.  */
+      while (atomic_load_relaxed (&self->pending))
+	atomic_spin_nop ();
+    }
+
+  /* We get the global NUMA spinlock now.  Run our workload.  */
+  run_numa_spinlock (self, head_p);
+
+  /* SELF is the only thread on the list if SELF is the first thread
+     of the thread list pointed to by lock->owner.  In this case, we
+     simply return.  */
+  if (!atomic_compare_and_exchange_bool_acq (&lock->owner, NULL, self))
+    return;
+
+  /* Wake up the next thread.  */
+  next = get_numa_spinlock_info_next (&self->next);
+  atomic_store_release (&next->pending, 0);
+}
+
+/* Initialize the NUMA spinlock info data pointed to by INFO from a
+   pointer to the NUMA spinlock, LOCK.  */
+
+int
+numa_spinlock_init (struct numa_spinlock *lock,
+		    struct numa_spinlock_info *info)
+{
+  memset (info, 0, sizeof (*info));
+  info->lock = lock;
+  /* For a single-node system, use 0 as the NUMA node number.  */
+  if (lock->single_node)
+    return 0;
+  /* NB: Use the NUMA node number from getcpu to select the local NUMA
+     spinlock list.  */
+  unsigned int cpu;
+  unsigned int node;
+#ifdef HAVE_GETCPU
+  int err_ret = getcpu (&cpu, &node);
+#else
+  int err_ret = syscall (SYS_getcpu, &cpu, &node, NULL);
+#endif
+  if (err_ret)
+    return err_ret;
+  if (lock->physical_package_id_p)
+    {
+      /* Can it ever happen?  */
+      if (cpu > lock->max_cpu)
+	cpu = lock->max_cpu;
+      /* NB: If NUMA is disabled, use physical_package_id.  */
+      node = lock->physical_package_id_p[cpu];
+    }
+  /* Can it ever happen?  */
+  if (node > lock->max_node)
+    node = lock->max_node;
+  info->node = node;
+  return err_ret;
+}
+
+void
+numa_spinlock_free (struct numa_spinlock *lock)
+{
+  if (lock->physical_package_id_p)
+    free (lock->physical_package_id_p);
+  free (lock);
+}
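
The list handling in numa_spinlock_apply and run_numa_spinlock above can be
sketched in isolation.  The following is a minimal, single-file illustration
and not the code from this patch: it uses C11 <stdatomic.h> instead of
glibc's internal atomic.h macros, and the names struct node, list_prepend,
load_next and sum_until are hypothetical.  It shows the two-step prepend
(exchange the head, then publish the next pointer) and why a walker has to
spin until the next pointer becomes visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct numa_spinlock_info.  */
struct node
{
  _Atomic (struct node *) next;
  int value;
};

/* Prepend N to the list at *HEAD_P.  Return the previous head, which is
   NULL if N is the first node (the "owner" in the patch's terms).  */
static struct node *
list_prepend (_Atomic (struct node *) *head_p, struct node *n)
{
  atomic_store_explicit (&n->next, NULL, memory_order_relaxed);
  struct node *first
    = atomic_exchange_explicit (head_p, n, memory_order_acq_rel);
  /* Window: *HEAD_P already points to N, but N->next is still NULL.  */
  if (first != NULL)
    atomic_store_explicit (&n->next, first, memory_order_release);
  return first;
}

/* Spin until *NEXT_P has been published by list_prepend.  */
static struct node *
load_next (_Atomic (struct node *) *next_p)
{
  struct node *next;
  while ((next = atomic_load_explicit (next_p, memory_order_acquire))
	 == NULL)
    ;
  return next;
}

/* Walk from FROM towards OWNER, summing the values of the nodes before
   OWNER.  OWNER itself is not counted; in the patch the owner runs its
   own workload before it walks the list.  */
static int
sum_until (struct node *from, struct node *owner)
{
  int sum = 0;
  while (from != owner)
    {
      /* Fetch the next node before consuming FROM.  */
      struct node *next = load_next (&from->next);
      sum += from->value;
      from = next;
    }
  return sum;
}
```

The next pointer is loaded before the current node is consumed for the same
reason as in run_numa_spinlock: once a waiter sees its pending flag
cleared, it may return and reuse its node.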
diff --git a/sysdeps/unix/sysv/linux/numa-spinlock.h b/sysdeps/unix/sysv/linux/numa-spinlock.h
new file mode 100644
index 0000000000..1c14e4a8af
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa-spinlock.h
@@ -0,0 +1,64 @@ 
+/* Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _NUMA_SPINLOCK_H
+#define _NUMA_SPINLOCK_H
+
+#include <features.h>
+
+__BEGIN_DECLS
+
+/* The NUMA spinlock.  */
+struct numa_spinlock;
+
+/* The NUMA spinlock information for each thread.  */
+struct numa_spinlock_info
+{
+  /* The workload function of this thread.  */
+  void *(*workload) (void *);
+  /* The argument pointer passed to the workload function.  */
+  void *argument;
+  /* The return value of the workload function.  */
+  void *result;
+  /* The pointer to the NUMA spinlock.  */
+  struct numa_spinlock *lock;
+  /* The next thread on the local NUMA spinlock thread list.  */
+  struct numa_spinlock_info *next;
+  /* The NUMA node number.  */
+  unsigned int node;
+  /* Non-zero to indicate that the thread wants the NUMA spinlock.  */
+  int pending;
+  /* Reserved for future use.  */
+  void *__reserved[4];
+};
+
+/* Return a pointer to a newly allocated NUMA spinlock.  */
+extern struct numa_spinlock *numa_spinlock_alloc (void);
+
+/* Free the memory space of the NUMA spinlock.  */
+extern void numa_spinlock_free (struct numa_spinlock *);
+
+/* Initialize the NUMA spinlock information block.  */
+extern int numa_spinlock_init (struct numa_spinlock *,
+			       struct numa_spinlock_info *);
+
+/* Apply for spinlock with a NUMA spinlock information block.  */
+extern void numa_spinlock_apply (struct numa_spinlock_info *);
+
+__END_DECLS
+
+#endif /* numa-spinlock.h */
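
The interface above asks the caller to package the critical section as a
workload function plus an argument, so that another thread can run it on
the caller's behalf.  A minimal sketch of that packaging follows; it is
single-threaded and does not use the proposed library.  The names struct
request, increment and serve are hypothetical and only mirror the
workload, argument, result and pending fields of struct
numa_spinlock_info.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced version of struct numa_spinlock_info.  */
struct request
{
  void *(*workload) (void *);	/* The critical section.  */
  void *argument;		/* Passed to WORKLOAD.  */
  void *result;			/* Return value of WORKLOAD.  */
  int pending;			/* Non-zero until the request is served.  */
};

/* Example critical section: bump the counter pointed to by ARG and
   return the same pointer.  */
static void *
increment (void *arg)
{
  unsigned long *counter = arg;
  *counter += 1;
  return counter;
}

/* What the lock holder does for each queued request: run the workload,
   record its result, then mark the request as served.  */
static void
serve (struct request *r)
{
  r->result = r->workload (r->argument);
  r->pending = 0;
}
```

With the real interface, a caller would presumably fill in workload and
argument after numa_spinlock_init and read result after
numa_spinlock_apply returns, as tst-numa-variable-overhead.c does.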
diff --git a/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
new file mode 100644
index 0000000000..85b0917cd9
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/numa_spinlock_alloc.c
@@ -0,0 +1,304 @@ 
+/* Initialization of NUMA spinlock.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <ctype.h>
+#include <string.h>
+#include <dirent.h>
+#include <stdio.h>
+#include <limits.h>
+#ifdef _LIBC
+# include <not-cancel.h>
+#else
+# include <stdlib.h>
+# include <unistd.h>
+# include <fcntl.h>
+# define __open_nocancel		open
+# define __close_nocancel_nostatus	close
+# define __read_nocancel		read
+#endif
+
+#include "numa-spinlock-private.h"
+
+static char *
+next_line (int fd, char *const buffer, char **cp, char **re,
+	   char *const buffer_end)
+{
+  char *res = *cp;
+  char *nl = memchr (*cp, '\n', *re - *cp);
+  if (nl == NULL)
+    {
+      if (*cp != buffer)
+	{
+	  if (*re == buffer_end)
+	    {
+	      memmove (buffer, *cp, *re - *cp);
+	      *re = buffer + (*re - *cp);
+	      *cp = buffer;
+
+	      ssize_t n = __read_nocancel (fd, *re, buffer_end - *re);
+	      if (n < 0)
+		return NULL;
+
+	      *re += n;
+
+	      nl = memchr (*cp, '\n', *re - *cp);
+	      while (nl == NULL && *re == buffer_end)
+		{
+		  /* Truncate too long lines.  */
+		  *re = buffer + 3 * (buffer_end - buffer) / 4;
+		  n = __read_nocancel (fd, *re, buffer_end - *re);
+		  if (n < 0)
+		    return NULL;
+
+		  nl = memchr (*re, '\n', n);
+		  **re = '\n';
+		  *re += n;
+		}
+	    }
+	  else
+	    nl = memchr (*cp, '\n', *re - *cp);
+
+	  res = *cp;
+	}
+
+      if (nl == NULL)
+	nl = *re - 1;
+    }
+
+  *cp = nl + 1;
+  assert (*cp <= *re);
+
+  return res == *re ? NULL : res;
+}
+
+static int
+select_cpu (const struct dirent *d)
+{
+  /* Return 1 for "cpuXXX" where XXX are digits.  */
+  if (strncmp (d->d_name, "cpu", sizeof ("cpu") - 1) == 0)
+    {
+      const char *p = d->d_name + 3;
+
+      if (*p == '\0')
+	return 0;
+
+      do
+	{
+	  if (!isdigit (*p))
+	    return 0;
+	  p++;
+	}
+      while (*p != '\0');
+
+      return 1;
+    }
+  return 0;
+}
+
+/* Allocate a NUMA spinlock and return a pointer to it.  Caller should
+   call numa_spinlock_free on the NUMA spinlock pointer to free the
+   memory when it is no longer needed.  */
+
+struct numa_spinlock *
+numa_spinlock_alloc (void)
+{
+  const size_t buffer_size = 1024;
+  char buffer[buffer_size];
+  char *buffer_end = buffer + buffer_size;
+  char *cp = buffer_end;
+  char *re = buffer_end;
+
+  const int flags = O_RDONLY | O_CLOEXEC;
+  int fd = __open_nocancel ("/sys/devices/system/node/online", flags);
+  char *l;
+  unsigned int max_node = 0;
+  unsigned int node_count = 0;
+  if (fd != -1)
+    {
+      l = next_line (fd, buffer, &cp, &re, buffer_end);
+      if (l != NULL)
+	do
+	  {
+	    char *endp;
+	    unsigned long int n = strtoul (l, &endp, 10);
+	    if (l == endp)
+	      {
+		node_count = 1;
+		break;
+	      }
+
+	    unsigned long int m = n;
+	    if (*endp == '-')
+	      {
+		l = endp + 1;
+		m = strtoul (l, &endp, 10);
+		if (l == endp)
+		  {
+		    node_count = 1;
+		    break;
+		  }
+	      }
+
+	    node_count += m - n + 1;
+
+	    if (m >= max_node)
+	      max_node = m;
+
+	    l = endp;
+	    while (l < re && (isspace (*l) || *l == ','))
+	      ++l;
+	  }
+	while (l < re);
+
+      __close_nocancel_nostatus (fd);
+    }
+
+  /* NB: Some NUMA nodes may not be available.  If the number of NUMA
+     nodes is 1, set the maximum NUMA node number to 0.  */
+  if (node_count == 1)
+    max_node = 0;
+
+  unsigned int max_cpu = 0;
+  unsigned int *physical_package_id_p = NULL;
+
+  if (max_node == 0)
+    {
+      /* There is at least 1 node.  */
+      node_count = 1;
+
+      /* If NUMA is disabled, use physical_package_id instead.  */
+      struct dirent **cpu_list;
+      int nprocs = scandir ("/sys/devices/system/cpu", &cpu_list,
+			    select_cpu, NULL);
+      if (nprocs > 0)
+	{
+	  int i;
+	  unsigned int *cpu_id_p = NULL;
+
+	  /* Find the maximum CPU number.  */
+	  if (posix_memalign ((void **) &cpu_id_p,
+			      __alignof__ (void *),
+			      nprocs * sizeof (unsigned int)) == 0)
+	    {
+	      for (i = 0; i < nprocs; i++)
+		{
+		  unsigned int cpu_id
+		    = strtoul (cpu_list[i]->d_name + 3, NULL, 10);
+		  cpu_id_p[i] = cpu_id;
+		  if (cpu_id > max_cpu)
+		    max_cpu = cpu_id;
+		}
+
+	      if (posix_memalign ((void **) &physical_package_id_p,
+				  __alignof__ (void *),
+				  ((max_cpu + 1)
+				   * sizeof (unsigned int))) == 0)
+		{
+		  memset (physical_package_id_p, 0,
+			  ((max_cpu + 1) * sizeof (unsigned int)));
+
+		  max_node = UINT_MAX;
+
+		  /* Get physical_package_id.  */
+		  char path[(sizeof ("/sys/devices/system/cpu")
+			     + 3 * sizeof (unsigned long int)
+			     + sizeof ("/topology/physical_package_id"))];
+		  for (i = 0; i < nprocs; i++)
+		    {
+		      struct dirent *d = cpu_list[i];
+		      if (snprintf (path, sizeof (path),
+				    "/sys/devices/system/cpu/%s/topology/physical_package_id",
+				    d->d_name) > 0)
+			{
+			  fd = __open_nocancel (path, flags);
+			  if (fd != -1)
+			    {
+			      if (__read_nocancel (fd, buffer,
+						   buffer_size) > 0)
+				{
+				  char *endp;
+				  unsigned long int package_id
+				    = strtoul (buffer, &endp, 10);
+				  if (package_id != ULONG_MAX
+				      && *buffer != '\0'
+				      && (*endp == '\0' || *endp == '\n'))
+				    {
+				      physical_package_id_p[cpu_id_p[i]]
+					= package_id;
+				      if (max_node == UINT_MAX)
+					{
+					  /* This is the first node.  */
+					  max_node = package_id;
+					}
+				      else if (package_id != max_node)
+					{
+					  /* NB: We only need to know if
+					     NODE_COUNT > 1.  */
+					  node_count = 2;
+					  if (package_id > max_node)
+					    max_node = package_id;
+					}
+				    }
+				}
+			      __close_nocancel_nostatus (fd);
+			    }
+			}
+
+		      free (d);
+		    }
+		}
+
+	      free (cpu_id_p);
+	    }
+	  else
+	    {
+	      for (i = 0; i < nprocs; i++)
+		free (cpu_list[i]);
+	    }
+
+	  free (cpu_list);
+	}
+    }
+
+  if (physical_package_id_p != NULL && node_count == 1)
+    {
+      /* There is only one node.  No need for physical_package_id_p.  */
+      free (physical_package_id_p);
+      physical_package_id_p = NULL;
+      max_cpu = 0;
+    }
+
+  /* Allocate an array of struct numa_spinlock_info pointers to hold info
+     for all NUMA nodes with NUMA node number from getcpu () as index.  */
+  size_t size = (sizeof (struct numa_spinlock)
+		 + ((max_node + 1)
+		    * sizeof (struct numa_spinlock_info *)));
+  struct numa_spinlock *lock;
+  if (posix_memalign ((void **) &lock,
+		      __alignof__ (struct numa_spinlock_info *), size))
+    return NULL;
+  memset (lock, 0, size);
+
+  lock->max_node = max_node;
+  lock->single_node = node_count == 1;
+  lock->max_cpu = max_cpu;
+  lock->physical_package_id_p = physical_package_id_p;
+
+  return lock;
+}
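
The node-list parsing in numa_spinlock_alloc can be looked at separately
from the file I/O.  Below is a minimal sketch, assuming the file content
is a comma-separated list of ids and id ranges such as "0-3,8" followed by
a newline.  parse_id_list is a hypothetical helper, not part of this
patch; unlike the patch it reports malformed input by returning 0 rather
than falling back to a single node.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse a list such as "0-3,8\n".  Store the highest id in *MAX_ID and
   return the number of ids listed, or 0 if the input is empty or
   malformed.  */
static unsigned int
parse_id_list (const char *s, unsigned int *max_id)
{
  unsigned int count = 0;
  *max_id = 0;
  while (*s != '\0')
    {
      char *endp;
      unsigned long first = strtoul (s, &endp, 10);
      if (endp == s)
	return 0;

      unsigned long last = first;
      if (*endp == '-')
	{
	  /* A range "first-last".  */
	  s = endp + 1;
	  last = strtoul (s, &endp, 10);
	  if (endp == s || last < first)
	    return 0;
	}

      count += last - first + 1;
      if (last > *max_id)
	*max_id = last;

      /* Skip separators between entries and the trailing newline.  */
      s = endp;
      while (*s == ',' || isspace ((unsigned char) *s))
	++s;
    }
  return count;
}
```

For example, "0-3,8\n" yields a count of 5 with a maximum id of 8, so an
array indexed by node number would need 9 entries even though only 5
nodes are online.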
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
index 09e8447b06..dba7df62aa 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
index 8300958d47..a763c0a819 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libpthread.abilist
@@ -27,6 +27,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3 _IO_flockfile F
 GLIBC_2.3 _IO_ftrylockfile F
 GLIBC_2.3 _IO_funlockfile F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
index 9a9e4cee85..eb54a8363d 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
index c370fda73d..366fcaca7e 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libpthread.abilist
@@ -235,3 +235,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
index d05468f3b2..786d8e1b8d 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libpthread.abilist
@@ -229,6 +229,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
index e8161aa747..dd7c52fe9a 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libpthread.abilist
@@ -221,6 +221,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sh/libpthread.abilist b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
index bcba07f575..a80475fd04 100644
--- a/sysdeps/unix/sysv/linux/sh/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sh/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
index b413007ccb..dd08796242 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libpthread.abilist
@@ -227,6 +227,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
index ccc9449826..5b190f69af 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/tst-numa-variable-overhead.c b/sysdeps/unix/sysv/linux/tst-numa-variable-overhead.c
new file mode 100644
index 0000000000..d43d0305ee
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/tst-numa-variable-overhead.c
@@ -0,0 +1,53 @@ 
+/* Test case for NUMA spinlock overhead.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include "numa-spinlock.h"
+
+struct numa_spinlock *lock;
+
+struct work_todo_argument
+{
+  unsigned long *v1;
+  unsigned long *v2;
+  unsigned long *v3;
+  unsigned long *v4;
+};
+
+static void *
+work_todo (void *v)
+{
+  struct work_todo_argument *p = v;
+  unsigned long ret;
+  *p->v1 = *p->v1 + 1;
+  *p->v2 = *p->v2 + 1;
+  ret = __sync_val_compare_and_swap (p->v4, 0, 1);
+  *p->v3 = *p->v3 + ret;
+  return (void *) 2;
+}
+
+static inline void
+do_work (struct numa_spinlock_info *lock_info)
+{
+  numa_spinlock_apply (lock_info);
+}
+
+#define USE_NUMA_SPINLOCK
+#include "tst-variable-overhead-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/tst-variable-overhead-skeleton.c b/sysdeps/unix/sysv/linux/tst-variable-overhead-skeleton.c
new file mode 100644
index 0000000000..f0af13f302
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/tst-variable-overhead-skeleton.c
@@ -0,0 +1,397 @@ 
+/* Test case skeleton for spinlock overhead.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Check spinlock overhead with a large number of threads.  The critical
+   region is very small.  Critical region + spinlock overhead isn't
+   noticeable when the number of threads is small.  As the number of
+   threads increases, spinlock overhead becomes the bottleneck.  It shows
+   up in the wall time of thread execution.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include <config.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <sys/time.h>
+#include <sys/param.h>
+#include <errno.h>
+#ifdef MODULE_NAME
+# include <cpu-features.h>
+# include <support/test-driver.h>
+
+# undef attribute_hidden
+# define attribute_hidden
+#endif
+#include <hp-timing.h>
+#include <atomic.h>
+
+#ifndef USE_PTHREAD_ATTR_SETAFFINITY_NP
+# define USE_PTHREAD_ATTR_SETAFFINITY_NP 1
+#endif
+
+#define CACHELINE_SIZE	64
+#define CACHE_ALIGNED	__attribute__((aligned(CACHELINE_SIZE)))
+
+#define constant_time 5
+unsigned long g_val CACHE_ALIGNED;
+unsigned long g_val2 CACHE_ALIGNED;
+unsigned long g_val3 CACHE_ALIGNED;
+unsigned long cmplock CACHE_ALIGNED;
+struct count
+{
+  unsigned long long total;
+  unsigned long long spinlock;
+  unsigned long long wall;
+} __attribute__((aligned(128)));
+
+struct count *gcount;
+
+/* The time consumed by one update is about 200 TSCs.  */
+static int delay_time_unlocked = 400;
+
+struct ops
+{
+  void *(*test) (void *arg);
+  void (*print_thread) (void *res, int);
+} *ops;
+
+struct stats_result
+{
+  unsigned long num;
+};
+
+void *work_thread (void *arg);
+
+#define iterations (10000 * 5)
+
+static volatile int start_thread;
+
+/* Delay for at least N hp-timing ticks.  */
+static void
+delay_tsc (unsigned n)
+{
+#if HP_TIMING_AVAIL
+  hp_timing_t start, current, diff;
+  HP_TIMING_NOW (start);
+
+  while (1)
+    {
+      HP_TIMING_NOW (current);
+      HP_TIMING_DIFF (diff, start, current);
+      if (diff < n)
+	atomic_spin_nop ();
+      else
+	break;
+    }
+#endif
+}
+
+static void
+wait_a_bit (int delay_time)
+{
+  if (delay_time > 0)
+    delay_tsc (delay_time);
+}
+
+#ifndef USE_NUMA_SPINLOCK
+static inline void
+work_todo (void)
+{
+  unsigned long ret;
+  g_val = g_val + 1;
+  g_val2 = g_val2 + 1;
+  ret = __sync_val_compare_and_swap (&cmplock, 0, 1);
+  g_val3 = g_val3 + 1 + ret;
+}
+#endif
+
+void *
+work_thread (void *arg)
+{
+  long i;
+  unsigned long pid = (unsigned long) arg;
+  struct stats_result *res;
+  int err_ret = posix_memalign ((void **)&res, CACHELINE_SIZE,
+				roundup (sizeof (*res), CACHELINE_SIZE));
+  if (err_ret)
+    {
+      printf ("posix_memalign failure: %s\n", strerror (err_ret));
+      exit (err_ret);
+    }
+  long num = 0;
+
+#ifdef USE_NUMA_SPINLOCK
+  struct work_todo_argument work_todo_arg;
+  struct numa_spinlock_info lock_info;
+
+  if (numa_spinlock_init (lock, &lock_info))
+    {
+      printf ("numa_spinlock_init failure: %m\n");
+      exit (1);
+    }
+
+  work_todo_arg.v1 = &g_val;
+  work_todo_arg.v2 = &g_val2;
+  work_todo_arg.v3 = &g_val3;
+  work_todo_arg.v4 = &cmplock;
+  lock_info.argument = &work_todo_arg;
+  lock_info.workload = work_todo;
+#endif
+
+  while (!start_thread)
+    atomic_spin_nop ();
+
+#if HP_TIMING_AVAIL
+  hp_timing_t start, end;
+  HP_TIMING_NOW (start);
+#endif
+
+  for (i = 0; i < iterations; i++)
+    {
+#ifdef USE_NUMA_SPINLOCK
+      do_work (&lock_info);
+#else
+      do_work ();
+#endif
+      wait_a_bit (delay_time_unlocked);
+      num++;
+    }
+#if HP_TIMING_AVAIL
+  HP_TIMING_NOW (end);
+  HP_TIMING_DIFF (gcount[pid].total, start, end);
+#endif
+  res->num = num;
+
+  return res;
+}
+
+void
+init_global_data(void)
+{
+  g_val = 0;
+  g_val2 = 0;
+  g_val3 = 0;
+  cmplock = 0;
+}
+
+void
+test_threads (int numthreads, int numprocs, unsigned long time)
+{
+  start_thread = 0;
+
+#ifdef USE_NUMA_SPINLOCK
+  lock = numa_spinlock_alloc ();
+#endif
+
+  atomic_full_barrier ();
+
+  pthread_t thr[numthreads];
+  void *res[numthreads];
+  int i;
+
+  init_global_data ();
+  for (i = 0; i < numthreads; i++)
+    {
+      pthread_attr_t attr;
+      const pthread_attr_t *attrp = NULL;
+      if (USE_PTHREAD_ATTR_SETAFFINITY_NP)
+	{
+	  attrp = &attr;
+	  pthread_attr_init (&attr);
+	  cpu_set_t set;
+	  CPU_ZERO (&set);
+	  int cpu = i % numprocs;
+	  (void) CPU_SET (cpu, &set);
+	  pthread_attr_setaffinity_np (&attr, sizeof (cpu_set_t), &set);
+	}
+      int err_ret = pthread_create (&thr[i], attrp, ops->test,
+				    (void *)(uintptr_t) i);
+      if (err_ret != 0)
+	{
+	  printf ("pthread_create failed: %d, %s\n",
+		  i, strerror (err_ret));
+	  numthreads = i;
+	  break;
+	}
+    }
+
+  atomic_full_barrier ();
+  start_thread = 1;
+  atomic_full_barrier ();
+  sched_yield ();
+
+  if (time)
+    {
+      struct timespec ts =
+	{
+	  .tv_sec = time,
+	  .tv_nsec = 0
+	};
+      clock_nanosleep (CLOCK_MONOTONIC, 0, &ts, NULL);
+      atomic_full_barrier ();
+    }
+
+  for (i = 0; i < numthreads; i++)
+    {
+      if (pthread_join (thr[i], (void *) &res[i]) == 0)
+	free (res[i]);
+      else
+	printf ("pthread_join failure: %m\n");
+    }
+
+#ifdef USE_NUMA_SPINLOCK
+  numa_spinlock_free (lock);
+#endif
+}
+
+struct ops hashwork_ops =
+{
+  .test = work_thread,
+};
+
+struct ops *ops;
+
+static struct count
+total_cost (int numthreads, int numprocs)
+{
+  int i;
+  unsigned long long total = 0;
+  unsigned long long spinlock = 0;
+
+  memset (gcount, 0, sizeof(gcount[0]) * numthreads);
+
+#if HP_TIMING_AVAIL
+  hp_timing_t start, end, diff;
+  HP_TIMING_NOW (start);
+#endif
+
+  test_threads (numthreads, numprocs, constant_time);
+
+#if HP_TIMING_AVAIL
+  HP_TIMING_NOW (end);
+  HP_TIMING_DIFF (diff, start, end);
+#endif
+
+  for (i = 0; i < numthreads; i++)
+    {
+      total += gcount[i].total;
+      spinlock += gcount[i].spinlock;
+    }
+
+  struct count cost = { total, spinlock, diff };
+  return cost;
+}
+
+#ifdef MODULE_NAME
+static int
+do_test (void)
+{
+# if !HP_TIMING_AVAIL
+  return EXIT_UNSUPPORTED;
+# endif
+#else
+int
+main (void)
+{
+#endif
+  int numprocs = sysconf (_SC_NPROCESSORS_ONLN);
+
+  /* Oversubscribe CPU.  */
+  int numthreads = 4 * numprocs;
+
+  ops = &hashwork_ops;
+
+  int err_ret = posix_memalign ((void **)&gcount, 4096,
+				sizeof(gcount[0]) * numthreads);
+  if (err_ret)
+    {
+      printf ("posix_memalign failure: %s\n", strerror (err_ret));
+      exit (err_ret);
+    }
+
+  struct count cost, cost1;
+  double overhead;
+  int i, last;
+  int last_increment = numprocs < 16 ? 16 : numprocs;
+  int numprocs_done = 0;
+  int numprocs_reset = 0;
+  cost1 = total_cost (1, numprocs);
+
+  printf ("Number of processors: %d, Single thread time %lld\n\n",
+	  numprocs, cost1.total);
+
+  for (last = i = 2; i <= numthreads;)
+    {
+      last = i;
+      cost = total_cost (i, numprocs);
+      overhead = cost.total;
+      overhead /= i;
+      overhead /= cost1.total;
+      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
+	      i, cost.total, overhead);
+      if ((i * 2) < numprocs)
+	i = i * 2;
+      else if (numprocs_done)
+	{
+	  if (numprocs_reset)
+	    {
+	      i = numprocs_reset;
+	      numprocs_reset = 0;
+	    }
+	  else
+	    {
+	      if ((i * 2) < numthreads)
+		i = i * 2;
+	      else
+		i = i + last_increment;
+	    }
+	}
+      else
+	{
+	  if (numprocs != 2 * i)
+	    numprocs_reset = 2 * i;
+	  i = numprocs;
+	  numprocs_done = 1;
+	}
+    }
+
+  if (last != numthreads)
+    {
+      i = numthreads;
+      cost = total_cost (i, numprocs);
+      overhead = cost.total;
+      overhead /= i;
+      overhead /= cost1.total;
+      printf ("Number of threads: %4d, Total time %14lld, Overhead: %.2f\n",
+	      i, cost.total, overhead);
+    }
+
+  free (gcount);
+  return 0;
+}
+
+#ifdef MODULE_NAME
+# define TIMEOUT 900
+# include <support/test-driver.c>
+#endif
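
The "Overhead" column printed by the skeleton above is the summed
per-thread time, divided by the number of threads and then by the
single-thread time.  A minimal sketch of that arithmetic follows;
relative_overhead is a hypothetical helper name, and the computation is
copied from the three statements in main rather than taken from any
existing function.

```c
#include <assert.h>

/* TOTAL is the sum of all threads' elapsed ticks, NUMTHREADS the number
   of threads, SINGLE_THREAD_TOTAL the elapsed ticks of the 1-thread
   run.  A result of 1.0 means each thread took as long as the single
   thread did; larger values mean contention slowed each thread down.  */
static double
relative_overhead (unsigned long long total, int numthreads,
		   unsigned long long single_thread_total)
{
  double overhead = total;
  overhead /= numthreads;
  overhead /= single_thread_total;
  return overhead;
}
```

For instance, three threads with a summed time of 600 ticks against a
single-thread time of 100 ticks give an overhead of 2.0: each thread
needed twice as long as the uncontended run.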
diff --git a/sysdeps/unix/sysv/linux/tst-variable-overhead.c b/sysdeps/unix/sysv/linux/tst-variable-overhead.c
new file mode 100644
index 0000000000..1cb62cbc4f
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/tst-variable-overhead.c
@@ -0,0 +1,47 @@ 
+/* Test case for spinlock overhead.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GNU_SOURCE
+# define _GNU_SOURCE
+#endif
+#include <pthread.h>
+
+struct
+{
+  pthread_spinlock_t testlock;
+  char pad[64 - sizeof (pthread_spinlock_t)];
+} test __attribute__ ((aligned (64)));
+
+static void
+__attribute__((constructor))
+init_spin (void)
+{
+  pthread_spin_init (&test.testlock, PTHREAD_PROCESS_PRIVATE);
+}
+
+static void work_todo (void);
+
+static inline void
+do_work (void)
+{
+  pthread_spin_lock (&test.testlock);
+  work_todo ();
+  pthread_spin_unlock (&test.testlock);
+}
+
+#include "tst-variable-overhead-skeleton.c"
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
index 931c8277a8..e90532ef36 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libpthread.abilist
@@ -219,6 +219,10 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
 GLIBC_2.3.2 pthread_cond_broadcast F
 GLIBC_2.3.2 pthread_cond_destroy F
 GLIBC_2.3.2 pthread_cond_init F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
index c09c9b015a..c74febbda1 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libpthread.abilist
@@ -243,3 +243,7 @@  GLIBC_2.28 tss_create F
 GLIBC_2.28 tss_delete F
 GLIBC_2.28 tss_get F
 GLIBC_2.28 tss_set F
+GLIBC_2.29 numa_spinlock_alloc F
+GLIBC_2.29 numa_spinlock_apply F
+GLIBC_2.29 numa_spinlock_free F
+GLIBC_2.29 numa_spinlock_init F
-- 
2.20.1