Message ID | 20211110001614.2087610-1-hjl.tools@gmail.com |
---|---|
Headers |
From: "H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org>
Reply-To: "H.J. Lu" <hjl.tools@gmail.com>
To: libc-alpha@sourceware.org
Cc: Florian Weimer <fweimer@redhat.com>, Arjan van de Ven <arjan@linux.intel.com>, Andreas Schwab <schwab@linux-m68k.org>
Subject: [PATCH v4 0/3] Optimize CAS [BZ #28537]
Date: Tue, 9 Nov 2021 16:16:11 -0800
Message-Id: <20211110001614.2087610-1-hjl.tools@gmail.com> |
Series | Optimize CAS [BZ #28537] |
Message
H.J. Lu
Nov. 10, 2021, 12:16 a.m. UTC
CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare and swap will grab the cache line exclusive and cause
excessive cache line bouncing.

Optimize CAS in low level locks and pthread_mutex_lock.c:

1. Do an atomic load and skip CAS if compare may fail to reduce cache
   line bouncing on contended locks.
2. Replace atomic_compare_and_exchange_bool_acq with
   atomic_compare_and_exchange_val_acq to avoid the extra load.
3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
   don't know if it's actually rare; in the contended case it is clearly
   not rare.

This is the first patch set to optimize CAS.  I will investigate the rest
of the CAS usages in glibc after this patch set has been accepted.

H.J. Lu (3):
  Reduce CAS in low level locks [BZ #28537]
  Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
  Optimize CAS in __pthread_mutex_lock_full [BZ #28537]

 nptl/lowlevellock.c         | 12 ++++-----
 nptl/pthread_mutex_lock.c   | 53 ++++++++++++++++++++++++++++---------
 sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
 3 files changed, 67 insertions(+), 27 deletions(-)
Comments
On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> [...]

Are you able to share benchmarks of this change?  I am curious what effects
this might have on other platforms.
On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.
> [...]

I see build errors:

In file included from pthread_mutex_cond_lock.c:23:
../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
  int private = (robust
  ^~~
../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
  int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed */,
  ^~~
../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
  if (e == ESRCH || e == EDEADLK)
     ^
../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
In file included from ../include/assert.h:1,
                 from ../nptl/pthread_mutex_lock.c:18,
                 from pthread_mutex_cond_lock.c:23:
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
   ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
                                  ^
../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
   assert (e != EDEADLK
   ^~~~~~
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
   ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
                                  ^
../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
   assert (e != ESRCH || !robust);
   ^~~~~~

PC
On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> [...]
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
>   int private = (robust
>   ^~~
> [...]
>
> PC

On PPC?
On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > [...]
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > [...]
>
> On PPC?

Yes, powerpc64le.
Failing command: gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -Wstrict-prototypes -Wold-style-definition -fmath-errno -mabi=ieeelongdouble -Wno-psabi -mno-gnu-attribute -mlong-double-128 -ftls-model=initial-exec -I../include -I/home/pc/locks/build-patched/nptl -I/home/pc/locks/build-patched -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le -I../sysdeps/unix/sysv/linux/powerpc/powerpc64 -I../sysdeps/unix/sysv/linux/wordsize-64 -I../sysdeps/unix/sysv/linux/powerpc -I../sysdeps/powerpc/nptl -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux -I../sysdeps/nptl -I../sysdeps/pthread -I../sysdeps/gnu -I../sysdeps/unix/inet -I../sysdeps/unix/sysv -I../sysdeps/unix/powerpc -I../sysdeps/unix -I../sysdeps/posix -I../sysdeps/powerpc/powerpc64/le/power8/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power8/fpu -I../sysdeps/powerpc/powerpc64/le/power7/fpu -I../sysdeps/powerpc/powerpc64/le/fpu -I../sysdeps/powerpc/powerpc64/fpu -I../sysdeps/powerpc/powerpc64/le/power8/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/multiarch -I../sysdeps/powerpc/powerpc64/le/multiarch -I../sysdeps/powerpc/powerpc64/multiarch -I../sysdeps/powerpc/powerpc64/le/power8 -I../sysdeps/powerpc/powerpc64/power8 -I../sysdeps/powerpc/powerpc64/le/power7 -I../sysdeps/powerpc/powerpc64/power7 -I../sysdeps/powerpc/powerpc64/power6 -I../sysdeps/powerpc/powerpc64/power4 -I../sysdeps/powerpc/power4 -I../sysdeps/powerpc/powerpc64/le -I../sysdeps/powerpc/powerpc64 -I../sysdeps/wordsize-64 -I../sysdeps/powerpc/fpu -I../sysdeps/powerpc -I../sysdeps/ieee754/ldbl-128ibm-compat -I../sysdeps/ieee754/ldbl-128ibm/include -I../sysdeps/ieee754/ldbl-128ibm 
-I../sysdeps/ieee754/ldbl-opt -I../sysdeps/ieee754/dbl-64 -I../sysdeps/ieee754/flt-32 -I../sysdeps/ieee754/float128 -I../sysdeps/ieee754 -I../sysdeps/generic -I.. -I../libio -I. -D_LIBC_REENTRANT -include /home/pc/locks/build-patched/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h -DTOP_NAMESPACE=glibc -o /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o -MD -MP -MF /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o.dt -MT /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o PC
On Wed, Nov 10, 2021 at 7:50 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> > [...]
> >
> > On PPC?
>
> Yes, powerpc64le.  Failing command:
> gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall [...]
>
> PC

I am testing it now.
* Paul A. Clarke:

> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> [...]
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
>   int private = (robust
>   ^~~

The patch has:

+    locked_mutex:
      /* The mutex is locked.  The kernel will now take care of
         everything.  */
      int private = (robust

This is only supported in recent C versions.  I think the workaround is
to add an empty statement with a semicolon, like this:

+    locked_mutex:;
      /* The mutex is locked.  The kernel will now take care of
         everything.  */
      int private = (robust

Thanks,
Florian
On Wed, Nov 10, 2021 at 7:52 AM Florian Weimer <fweimer@redhat.com> wrote:
> [...]
>
> This is only supported in recent C versions.  I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> +    locked_mutex:;
>       /* The mutex is locked.  The kernel will now take care of
>          everything.  */
>       int private = (robust

Can you try the users/hjl/x86/atomic-nptl branch:

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/x86/atomic-nptl

I fixed a couple of issues and added more CAS optimizations.
On Wed, Nov 10, 2021 at 04:52:28PM +0100, Florian Weimer wrote:
> [...]
>
> This is only supported in recent C versions.  I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> +    locked_mutex:;
>       /* The mutex is locked.  The kernel will now take care of
>          everything.  */
>       int private = (robust

This change resolved the issue.

PC
On Nov 10 2021, Florian Weimer wrote:
> This is only supported in recent C versions
Only in C++.
Andreas.
On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > [...]
>
> Are you able to share benchmarks of this change?  I am curious what
> effects this might have on other platforms.

I'd like to see the expected performance results, too.

For me, the results are not uniformly positive (Power10).
From bench-pthread-locks:

                           bench      bench-patched
 mutex-empty               4.73371    4.54792     3.9%
 mutex-filler              18.5395    18.3419     1.1%
 mutex_trylock-empty       10.46      2.46364    76.4%
 mutex_trylock-filler      16.2188    16.1758     0.3%
 rwlock_read-empty         16.5118    16.4681     0.3%
 rwlock_read-filler        20.68      20.4416     1.2%
 rwlock_tryread-empty      2.06572    2.17284    -5.2%
 rwlock_tryread-filler     16.082     16.1215    -0.2%
 rwlock_write-empty        31.3723    31.259      0.4%
 rwlock_write-filler       41.6492    69.313    -66.4%
 rwlock_trywrite-empty     2.20584    2.32178    -5.3%
 rwlock_trywrite-filler    15.7044    15.9088    -1.3%
 spin_lock-empty           16.7964    16.7731     0.1%
 spin_lock-filler          20.6118    20.4175     0.9%
 spin_trylock-empty        8.99989    8.98879     0.1%
 spin_trylock-filler       16.4732    15.9957     2.9%
 sem_wait-empty            15.805     15.7391     0.4%
 sem_wait-filler           19.2346    19.5098    -1.4%
 sem_trywait-empty         2.06405    2.03782     1.3%
 sem_trywait-filler        15.921     15.8408     0.5%
 condvar-empty             1385.84    1387.29    -0.1%
 condvar-filler            1419.82    1424.01    -0.3%
 consumer_producer-empty   2550.01    2395.29     6.1%
 consumer_producer-filler  2709.4     2558.28     5.6%

PC
On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote: > > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote: > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote: > > > CAS instruction is expensive. From the x86 CPU's point of view, getting > > > a cache line for writing is more expensive than reading. See Appendix > > > A.2 Spinlock in: > > > > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf > > > > > > The full compare and swap will grab the cache line exclusive and cause > > > excessive cache line bouncing. > > > > > > Optimize CAS in low level locks and pthread_mutex_lock.c: > > > > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache > > > line bouncing on contended locks. > > > 2. Replace atomic_compare_and_exchange_bool_acq with > > > atomic_compare_and_exchange_val_acq to avoid the extra load. > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we > > > don't know if it's actually rare; in the contended case it is clearly not > > > rare. > > > > Are you able to share benchmarks of this change? I am curious what effects > > this might have on other platforms. > > I'd like to see the expected performance results, too. > > For me, the results are not uniformly positive (Power10). 
> From bench-pthread-locks: > > bench bench-patched > mutex-empty 4.73371 4.54792 3.9% > mutex-filler 18.5395 18.3419 1.1% > mutex_trylock-empty 10.46 2.46364 76.4% > mutex_trylock-filler 16.2188 16.1758 0.3% > rwlock_read-empty 16.5118 16.4681 0.3% > rwlock_read-filler 20.68 20.4416 1.2% > rwlock_tryread-empty 2.06572 2.17284 -5.2% > rwlock_tryread-filler 16.082 16.1215 -0.2% > rwlock_write-empty 31.3723 31.259 0.4% > rwlock_write-filler 41.6492 69.313 -66.4% > rwlock_trywrite-empty 2.20584 2.32178 -5.3% > rwlock_trywrite-filler 15.7044 15.9088 -1.3% > spin_lock-empty 16.7964 16.7731 0.1% > spin_lock-filler 20.6118 20.4175 0.9% > spin_trylock-empty 8.99989 8.98879 0.1% > spin_trylock-filler 16.4732 15.9957 2.9% > sem_wait-empty 15.805 15.7391 0.4% > sem_wait-filler 19.2346 19.5098 -1.4% > sem_trywait-empty 2.06405 2.03782 1.3% > sem_trywait-filler 15.921 15.8408 0.5% > condvar-empty 1385.84 1387.29 -0.1% > condvar-filler 1419.82 1424.01 -0.3% > consumer_producer-empty 2550.01 2395.29 6.1% > consumer_producer-filler 2709.4 2558.28 5.6% Small regressions on uncontended locks are expected due to extra check. What do you get with my current branch https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl BTW, how did you compare the 2 results? 
I tried compare_bench.py and got:

Traceback (most recent call last):
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 196, in <module>
    main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 165, in main
    bench1 = bench.parse_bench(bench1, schema)
  File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py", line 137, in parse_bench
    bench = json.load(benchfile)
  File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)
On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change?  I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                             bench      bench-patched
> mutex-empty               4.73371    4.54792     3.9%
> mutex-filler              18.5395    18.3419     1.1%
> mutex_trylock-empty       10.46      2.46364    76.4%
> mutex_trylock-filler      16.2188    16.1758     0.3%
> rwlock_read-empty         16.5118    16.4681     0.3%
> rwlock_read-filler        20.68      20.4416     1.2%
> rwlock_tryread-empty      2.06572    2.17284    -5.2%
> rwlock_tryread-filler     16.082     16.1215    -0.2%
> rwlock_write-empty        31.3723    31.259      0.4%
> rwlock_write-filler       41.6492    69.313    -66.4%
> rwlock_trywrite-empty     2.20584    2.32178    -5.3%
> rwlock_trywrite-filler    15.7044    15.9088    -1.3%
> spin_lock-empty           16.7964    16.7731     0.1%
> spin_lock-filler          20.6118    20.4175     0.9%
> spin_trylock-empty        8.99989    8.98879     0.1%
> spin_trylock-filler       16.4732    15.9957     2.9%
> sem_wait-empty            15.805     15.7391     0.4%
> sem_wait-filler           19.2346    19.5098    -1.4%
> sem_trywait-empty         2.06405    2.03782     1.3%
> sem_trywait-filler        15.921     15.8408     0.5%
> condvar-empty             1385.84    1387.29    -0.1%
> condvar-filler            1419.82    1424.01    -0.3%
> consumer_producer-empty   2550.01    2395.29     6.1%
> consumer_producer-filler  2709.4     2558.28     5.6%
>
> PC

Here are the results on a machine with 112 cores:

mutex-empty               16.0112    16.5728    -3.5%
mutex-filler              49.4354    48.7608     1.4%
mutex_trylock-empty       19.2854    8.56795    56%
mutex_trylock-filler      54.9643    41.5418    24%
rwlock_read-empty         39.8855    39.7448     0.35%
rwlock_read-filler        75.1334    75.1218     0.015%
rwlock_tryread-empty      5.29094    5.2917     -0.014%
rwlock_tryread-filler     39.6653    40.209     -1.4%
rwlock_write-empty        60.6445    60.6236     0.034%
rwlock_write-filler       91.431     92.9016    -1.6%
rwlock_trywrite-empty     5.28404    5.94623   -13%
rwlock_trywrite-filler    40.7044    40.7709    -0.16%
spin_lock-empty           19.1067    19.1068    -0.00052%
spin_lock-filler          51.643     51.2963     0.67%
spin_trylock-empty        16.4705    16.4707    -0.0012%
spin_trylock-filler       45.4647    50.5047   -11%
sem_wait-empty            42.169     42.1889    -0.047%
sem_wait-filler           74.4302    74.4577    -0.037%
sem_trywait-empty         5.27318    5.27172     0.028%
sem_trywait-filler        40.191     40.8506    -1.6%
condvar-empty             5404.27    5406.39    -0.039%
condvar-filler            5022.93    1566.82    69%
consumer_producer-empty   15899.2    16755.8    -5.4%
consumer_producer-filler  16076.9    16065.8     0.069%

rwlock_trywrite-empty has a 13% regression and spin_trylock-filler an
11% regression, but there are also 69%, 56% and 24% improvements.

-- 
H.J.
On Wed, Nov 10, 2021 at 01:33:26PM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > Are you able to share benchmarks of this change?  I am curious what effects
> > > this might have on other platforms.
> >
> > I'd like to see the expected performance results, too.
> >
> > For me, the results are not uniformly positive (Power10).
> > From bench-pthread-locks:
> >
> >                             bench      bench-patched
> > mutex-empty               4.73371    4.54792     3.9%
> > mutex-filler              18.5395    18.3419     1.1%
> > mutex_trylock-empty       10.46      2.46364    76.4%
> > mutex_trylock-filler      16.2188    16.1758     0.3%
> > rwlock_read-empty         16.5118    16.4681     0.3%
> > rwlock_read-filler        20.68      20.4416     1.2%
> > rwlock_tryread-empty      2.06572    2.17284    -5.2%
> > rwlock_tryread-filler     16.082     16.1215    -0.2%
> > rwlock_write-empty        31.3723    31.259      0.4%
> > rwlock_write-filler       41.6492    69.313    -66.4%
> > rwlock_trywrite-empty     2.20584    2.32178    -5.3%
> > rwlock_trywrite-filler    15.7044    15.9088    -1.3%
> > spin_lock-empty           16.7964    16.7731     0.1%
> > spin_lock-filler          20.6118    20.4175     0.9%
> > spin_trylock-empty        8.99989    8.98879     0.1%
> > spin_trylock-filler       16.4732    15.9957     2.9%
> > sem_wait-empty            15.805     15.7391     0.4%
> > sem_wait-filler           19.2346    19.5098    -1.4%
> > sem_trywait-empty         2.06405    2.03782     1.3%
> > sem_trywait-filler        15.921     15.8408     0.5%
> > condvar-empty             1385.84    1387.29    -0.1%
> > condvar-filler            1419.82    1424.01    -0.3%
> > consumer_producer-empty   2550.01    2395.29     6.1%
> > consumer_producer-filler  2709.4     2558.28     5.6%
>
> Small regressions on uncontended locks are expected due to extra
> check.
> What do you get with my current branch
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl

                            bench      bench-hjl
mutex-empty               4.73371    4.65279     1.7%
mutex-filler              18.5395    18.3971     0.8%
mutex_trylock-empty       10.46      10.1671     2.8%
mutex_trylock-filler      16.2188    16.7105    -3.0%
rwlock_read-empty         16.5118    16.4697     0.3%
rwlock_read-filler        20.68      20.0416     3.1%
rwlock_tryread-empty      2.06572    2.038       1.3%
rwlock_tryread-filler     16.082     15.7182     2.3%
rwlock_write-empty        31.3723    31.1147     0.8%
rwlock_write-filler       41.6492    69.8115   -67.6%
rwlock_trywrite-empty     2.20584    2.32175    -5.3%
rwlock_trywrite-filler    15.7044    15.86      -1.0%
spin_lock-empty           16.7964    16.4342     2.2%
spin_lock-filler          20.6118    20.3916     1.1%
spin_trylock-empty        8.99989    8.98884     0.1%
spin_trylock-filler       16.4732    16.1979     1.7%
sem_wait-empty            15.805     15.7558     0.3%
sem_wait-filler           19.2346    19.2554    -0.1%
sem_trywait-empty         2.06405    2.03789     1.3%
sem_trywait-filler        15.921     15.7884     0.8%
condvar-empty             1385.84    1341.96     3.2%
condvar-filler            1419.82    1343.06     5.4%
consumer_producer-empty   2550.01    2446.33     4.1%
consumer_producer-filler  2709.4     2659.59     1.8%

...still one very bad outlier, and a few of concern.

> BTW, how did you compare the 2 results?
> I tried compare_bench.py and got:
>
> Traceback (most recent call last):
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 196, in <module>
>     main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 165, in main
>     bench1 = bench.parse_bench(bench1, schema)
>   File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py", line 137, in parse_bench
>     bench = json.load(benchfile)
>   File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
>     return loads(fp.read(),
>   File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
>     raise JSONDecodeError("Extra data", s, end)
> json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)

I did it the old-fashioned way, in a spreadsheet. :-)

I see the same errors you see with compare_bench.py.  Upon further
investigation, compare_bench.py expects input in the form produced by
"make bench".  The output from running the benchtest directly is
insufficient.
Using the respective outputs from "make BENCHSET=bench-pthread bench":
--
$ ./benchtests/scripts/compare_bench.py --threshold 2 --stats mean A.out B.out
[snip]
+++ thread_create(stack=1024,guard=2)[mean]: (2.15%) from 372674 to 364660
+++ thread_create(stack=2048,guard=1)[mean]: (4.88%) from 377835 to 359396
+++ thread_create(stack=2048,guard=2)[mean]: (3.58%) from 377306 to 363798
+++ pthread_locks(mutex-empty)[mean]: (4.27%) from 4.85936 to 4.65185
--- pthread_locks(mutex_trylock-filler)[mean]: (3.09%) from 16.0579 to 16.5533
--- pthread_locks(rwlock_write-filler)[mean]: (56.90%) from 44.4255 to 69.7047
--- pthread_locks(rwlock_trywrite-empty)[mean]: (6.73%) from 2.17594 to 2.32244
+++ pthread_locks(spin_lock-empty)[mean]: (2.17%) from 16.8086 to 16.4436
--- pthread_locks(spin_trylock-filler)[mean]: (2.34%) from 16.1119 to 16.4896
+++ pthread_locks(consumer_producer-empty)[mean]: (2.94%) from 2531.95 to 2457.48
--
PC