Message ID | 20211110001614.2087610-1-hjl.tools@gmail.com |
---|---|
Headers |
From: "H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org>
Reply-To: "H.J. Lu" <hjl.tools@gmail.com>
To: libc-alpha@sourceware.org
Cc: Florian Weimer <fweimer@redhat.com>, Arjan van de Ven <arjan@linux.intel.com>, Andreas Schwab <schwab@linux-m68k.org>
Subject: [PATCH v4 0/3] Optimize CAS [BZ #28537]
Date: Tue, 9 Nov 2021 16:16:11 -0800
Message-Id: <20211110001614.2087610-1-hjl.tools@gmail.com> |
Series | Optimize CAS [BZ #28537] |
Message
H.J. Lu
Nov. 10, 2021, 12:16 a.m. UTC
CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare and swap will grab the cache line exclusive and cause
excessive cache line bouncing.

Optimize CAS in low level locks and pthread_mutex_lock.c:

1. Do an atomic load and skip CAS if compare may fail to reduce cache
   line bouncing on contended locks.
2. Replace atomic_compare_and_exchange_bool_acq with
   atomic_compare_and_exchange_val_acq to avoid the extra load.
3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
   don't know if it's actually rare; in the contended case it is clearly
   not rare.

This is the first patch set to optimize CAS.  I will investigate the rest
of the CAS usages in glibc after this patch set has been accepted.

H.J. Lu (3):
  Reduce CAS in low level locks [BZ #28537]
  Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
  Optimize CAS in __pthread_mutex_lock_full [BZ #28537]

 nptl/lowlevellock.c         | 12 ++++-----
 nptl/pthread_mutex_lock.c   | 53 ++++++++++++++++++++++++++++---------
 sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
 3 files changed, 67 insertions(+), 27 deletions(-)
Comments
On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> [...]

Are you able to share benchmarks of this change?  I am curious what effects
this might have on other platforms.
On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.
> [...]

I see build errors:

In file included from pthread_mutex_cond_lock.c:23:
../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
  int private = (robust
  ^~~
../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
  int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed */,
  ^~~
../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
  if (e == ESRCH || e == EDEADLK)
     ^
../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
In file included from ../include/assert.h:1,
                 from ../nptl/pthread_mutex_lock.c:18,
                 from pthread_mutex_cond_lock.c:23:
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
   ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
                                  ^
../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
   assert (e != EDEADLK
   ^~~~~~
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
   ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
                                  ^
../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
   assert (e != ESRCH || !robust);
   ^~~~~~

PC
On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> [...]
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
>   int private = (robust
>   ^~~
> [...]
>
> PC

On PPC?
On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > [...]
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > [...]
>
> On PPC?

Yes, powerpc64le.
Failing command: gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -Wstrict-prototypes -Wold-style-definition -fmath-errno -mabi=ieeelongdouble -Wno-psabi -mno-gnu-attribute -mlong-double-128 -ftls-model=initial-exec -I../include -I/home/pc/locks/build-patched/nptl -I/home/pc/locks/build-patched -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le -I../sysdeps/unix/sysv/linux/powerpc/powerpc64 -I../sysdeps/unix/sysv/linux/wordsize-64 -I../sysdeps/unix/sysv/linux/powerpc -I../sysdeps/powerpc/nptl -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux -I../sysdeps/nptl -I../sysdeps/pthread -I../sysdeps/gnu -I../sysdeps/unix/inet -I../sysdeps/unix/sysv -I../sysdeps/unix/powerpc -I../sysdeps/unix -I../sysdeps/posix -I../sysdeps/powerpc/powerpc64/le/power8/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power8/fpu -I../sysdeps/powerpc/powerpc64/le/power7/fpu -I../sysdeps/powerpc/powerpc64/le/fpu -I../sysdeps/powerpc/powerpc64/fpu -I../sysdeps/powerpc/powerpc64/le/power8/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/multiarch -I../sysdeps/powerpc/powerpc64/le/multiarch -I../sysdeps/powerpc/powerpc64/multiarch -I../sysdeps/powerpc/powerpc64/le/power8 -I../sysdeps/powerpc/powerpc64/power8 -I../sysdeps/powerpc/powerpc64/le/power7 -I../sysdeps/powerpc/powerpc64/power7 -I../sysdeps/powerpc/powerpc64/power6 -I../sysdeps/powerpc/powerpc64/power4 -I../sysdeps/powerpc/power4 -I../sysdeps/powerpc/powerpc64/le -I../sysdeps/powerpc/powerpc64 -I../sysdeps/wordsize-64 -I../sysdeps/powerpc/fpu -I../sysdeps/powerpc -I../sysdeps/ieee754/ldbl-128ibm-compat -I../sysdeps/ieee754/ldbl-128ibm/include -I../sysdeps/ieee754/ldbl-128ibm 
-I../sysdeps/ieee754/ldbl-opt -I../sysdeps/ieee754/dbl-64 -I../sysdeps/ieee754/flt-32 -I../sysdeps/ieee754/float128 -I../sysdeps/ieee754 -I../sysdeps/generic -I.. -I../libio -I. -D_LIBC_REENTRANT -include /home/pc/locks/build-patched/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h -DTOP_NAMESPACE=glibc -o /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o -MD -MP -MF /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o.dt -MT /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o PC
On Wed, Nov 10, 2021 at 7:50 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> > [...]
> >
> > On PPC?
>
> Yes, powerpc64le.  Failing command:
> gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall [...]
>
> PC

I am testing it now.
* Paul A. Clarke:

> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> [...]
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
>   int private = (robust
>   ^~~

The patch has:

+    locked_mutex:
      /* The mutex is locked.  The kernel will now take care of
         everything.  */
      int private = (robust

This is only supported in recent C versions.  I think the workaround is
to add an empty statement with a semicolon, like this:

+    locked_mutex:;
      /* The mutex is locked.  The kernel will now take care of
         everything.  */
      int private = (robust

Thanks,
Florian
On Wed, Nov 10, 2021 at 7:52 AM Florian Weimer <fweimer@redhat.com> wrote:
> [...]
>
> This is only supported in recent C versions.  I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> +    locked_mutex:;
>       /* The mutex is locked.  The kernel will now take care of
>          everything.  */
>       int private = (robust

Can you try the users/hjl/x86/atomic-nptl branch:

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/x86/atomic-nptl

I fixed a couple of issues and added more CAS optimizations.
On Wed, Nov 10, 2021 at 04:52:28PM +0100, Florian Weimer wrote:
> [...]
>
> This is only supported in recent C versions.  I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> +    locked_mutex:;
>       /* The mutex is locked.  The kernel will now take care of
>          everything.  */
>       int private = (robust

This change resolved the issue.

PC
On Nov 10 2021, Florian Weimer wrote:
> This is only supported in recent C versions
Only in C++.
Andreas.
On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > [...]
>
> Are you able to share benchmarks of this change?  I am curious what
> effects this might have on other platforms.

I'd like to see the expected performance results, too.

For me, the results are not uniformly positive (Power10).
From bench-pthread-locks:

                           bench      bench-patched
 mutex-empty               4.73371    4.54792     3.9%
 mutex-filler              18.5395    18.3419     1.1%
 mutex_trylock-empty       10.46      2.46364    76.4%
 mutex_trylock-filler      16.2188    16.1758     0.3%
 rwlock_read-empty         16.5118    16.4681     0.3%
 rwlock_read-filler        20.68      20.4416     1.2%
 rwlock_tryread-empty      2.06572    2.17284    -5.2%
 rwlock_tryread-filler     16.082     16.1215    -0.2%
 rwlock_write-empty        31.3723    31.259      0.4%
 rwlock_write-filler       41.6492    69.313    -66.4%
 rwlock_trywrite-empty     2.20584    2.32178    -5.3%
 rwlock_trywrite-filler    15.7044    15.9088    -1.3%
 spin_lock-empty           16.7964    16.7731     0.1%
 spin_lock-filler          20.6118    20.4175     0.9%
 spin_trylock-empty        8.99989    8.98879     0.1%
 spin_trylock-filler       16.4732    15.9957     2.9%
 sem_wait-empty            15.805     15.7391     0.4%
 sem_wait-filler           19.2346    19.5098    -1.4%
 sem_trywait-empty         2.06405    2.03782     1.3%
 sem_trywait-filler        15.921     15.8408     0.5%
 condvar-empty             1385.84    1387.29    -0.1%
 condvar-filler            1419.82    1424.01    -0.3%
 consumer_producer-empty   2550.01    2395.29     6.1%
 consumer_producer-filler  2709.4     2558.28     5.6%

PC
On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote: > > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote: > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote: > > > CAS instruction is expensive. From the x86 CPU's point of view, getting > > > a cache line for writing is more expensive than reading. See Appendix > > > A.2 Spinlock in: > > > > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf > > > > > > The full compare and swap will grab the cache line exclusive and cause > > > excessive cache line bouncing. > > > > > > Optimize CAS in low level locks and pthread_mutex_lock.c: > > > > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache > > > line bouncing on contended locks. > > > 2. Replace atomic_compare_and_exchange_bool_acq with > > > atomic_compare_and_exchange_val_acq to avoid the extra load. > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we > > > don't know if it's actually rare; in the contended case it is clearly not > > > rare. > > > > Are you able to share benchmarks of this change? I am curious what effects > > this might have on other platforms. > > I'd like to see the expected performance results, too. > > For me, the results are not uniformly positive (Power10). 
> From bench-pthread-locks: > > bench bench-patched > mutex-empty 4.73371 4.54792 3.9% > mutex-filler 18.5395 18.3419 1.1% > mutex_trylock-empty 10.46 2.46364 76.4% > mutex_trylock-filler 16.2188 16.1758 0.3% > rwlock_read-empty 16.5118 16.4681 0.3% > rwlock_read-filler 20.68 20.4416 1.2% > rwlock_tryread-empty 2.06572 2.17284 -5.2% > rwlock_tryread-filler 16.082 16.1215 -0.2% > rwlock_write-empty 31.3723 31.259 0.4% > rwlock_write-filler 41.6492 69.313 -66.4% > rwlock_trywrite-empty 2.20584 2.32178 -5.3% > rwlock_trywrite-filler 15.7044 15.9088 -1.3% > spin_lock-empty 16.7964 16.7731 0.1% > spin_lock-filler 20.6118 20.4175 0.9% > spin_trylock-empty 8.99989 8.98879 0.1% > spin_trylock-filler 16.4732 15.9957 2.9% > sem_wait-empty 15.805 15.7391 0.4% > sem_wait-filler 19.2346 19.5098 -1.4% > sem_trywait-empty 2.06405 2.03782 1.3% > sem_trywait-filler 15.921 15.8408 0.5% > condvar-empty 1385.84 1387.29 -0.1% > condvar-filler 1419.82 1424.01 -0.3% > consumer_producer-empty 2550.01 2395.29 6.1% > consumer_producer-filler 2709.4 2558.28 5.6% Small regressions on uncontended locks are expected due to extra check. What do you get with my current branch https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl BTW, how did you compare the 2 results? 
I tried compare_bench.py and got:

Traceback (most recent call last):
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 196, in <module>
    main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 165, in main
    bench1 = bench.parse_bench(bench1, schema)
  File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py", line 137, in parse_bench
    bench = json.load(benchfile)
  File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)
On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change?  I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                             bench      bench-patched
> mutex-empty               4.73371    4.54792     3.9%
> mutex-filler              18.5395    18.3419     1.1%
> mutex_trylock-empty       10.46      2.46364    76.4%
> mutex_trylock-filler      16.2188    16.1758     0.3%
> rwlock_read-empty         16.5118    16.4681     0.3%
> rwlock_read-filler        20.68      20.4416     1.2%
> rwlock_tryread-empty      2.06572    2.17284    -5.2%
> rwlock_tryread-filler     16.082     16.1215    -0.2%
> rwlock_write-empty        31.3723    31.259      0.4%
> rwlock_write-filler       41.6492    69.313    -66.4%
> rwlock_trywrite-empty     2.20584    2.32178    -5.3%
> rwlock_trywrite-filler    15.7044    15.9088    -1.3%
> spin_lock-empty           16.7964    16.7731     0.1%
> spin_lock-filler          20.6118    20.4175     0.9%
> spin_trylock-empty        8.99989    8.98879     0.1%
> spin_trylock-filler       16.4732    15.9957     2.9%
> sem_wait-empty            15.805     15.7391     0.4%
> sem_wait-filler           19.2346    19.5098    -1.4%
> sem_trywait-empty         2.06405    2.03782     1.3%
> sem_trywait-filler        15.921     15.8408     0.5%
> condvar-empty             1385.84    1387.29    -0.1%
> condvar-filler            1419.82    1424.01    -0.3%
> consumer_producer-empty   2550.01    2395.29     6.1%
> consumer_producer-filler  2709.4     2558.28     5.6%
>
> PC

Here are the results on a machine with 112 cores:

mutex-empty               16.0112    16.5728    -3.5%
mutex-filler              49.4354    48.7608     1.4%
mutex_trylock-empty       19.2854    8.56795    56%
mutex_trylock-filler      54.9643    41.5418    24%
rwlock_read-empty         39.8855    39.7448     0.35%
rwlock_read-filler        75.1334    75.1218     0.015%
rwlock_tryread-empty      5.29094    5.2917     -0.014%
rwlock_tryread-filler     39.6653    40.209     -1.4%
rwlock_write-empty        60.6445    60.6236     0.034%
rwlock_write-filler       91.431     92.9016    -1.6%
rwlock_trywrite-empty     5.28404    5.94623   -13%
rwlock_trywrite-filler    40.7044    40.7709    -0.16%
spin_lock-empty           19.1067    19.1068    -0.00052%
spin_lock-filler          51.643     51.2963     0.67%
spin_trylock-empty        16.4705    16.4707    -0.0012%
spin_trylock-filler       45.4647    50.5047   -11%
sem_wait-empty            42.169     42.1889    -0.047%
sem_wait-filler           74.4302    74.4577    -0.037%
sem_trywait-empty         5.27318    5.27172     0.028%
sem_trywait-filler        40.191     40.8506    -1.6%
condvar-empty             5404.27    5406.39    -0.039%
condvar-filler            5022.93    1566.82    69%
consumer_producer-empty   15899.2    16755.8    -5.4%
consumer_producer-filler  16076.9    16065.8     0.069%

rwlock_trywrite-empty has a 13% regression and spin_trylock-filler an
11% regression, but there are also 69%, 56% and 24% improvements.

-- 
H.J.
On Wed, Nov 10, 2021 at 01:33:26PM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > Are you able to share benchmarks of this change?  I am curious what effects
> > > this might have on other platforms.
> >
> > I'd like to see the expected performance results, too.
> >
> > For me, the results are not uniformly positive (Power10).
> > From bench-pthread-locks:
> >
> >                             bench      bench-patched
> > mutex-empty               4.73371    4.54792     3.9%
> > mutex-filler              18.5395    18.3419     1.1%
> > mutex_trylock-empty       10.46      2.46364    76.4%
> > mutex_trylock-filler      16.2188    16.1758     0.3%
> > rwlock_read-empty         16.5118    16.4681     0.3%
> > rwlock_read-filler        20.68      20.4416     1.2%
> > rwlock_tryread-empty      2.06572    2.17284    -5.2%
> > rwlock_tryread-filler     16.082     16.1215    -0.2%
> > rwlock_write-empty        31.3723    31.259      0.4%
> > rwlock_write-filler       41.6492    69.313    -66.4%
> > rwlock_trywrite-empty     2.20584    2.32178    -5.3%
> > rwlock_trywrite-filler    15.7044    15.9088    -1.3%
> > spin_lock-empty           16.7964    16.7731     0.1%
> > spin_lock-filler          20.6118    20.4175     0.9%
> > spin_trylock-empty        8.99989    8.98879     0.1%
> > spin_trylock-filler       16.4732    15.9957     2.9%
> > sem_wait-empty            15.805     15.7391     0.4%
> > sem_wait-filler           19.2346    19.5098    -1.4%
> > sem_trywait-empty         2.06405    2.03782     1.3%
> > sem_trywait-filler        15.921     15.8408     0.5%
> > condvar-empty             1385.84    1387.29    -0.1%
> > condvar-filler            1419.82    1424.01    -0.3%
> > consumer_producer-empty   2550.01    2395.29     6.1%
> > consumer_producer-filler  2709.4     2558.28     5.6%
>
> Small regressions on uncontended locks are expected due to extra
> check.
> What do you get with my current branch
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl

                            bench      bench-hjl
mutex-empty               4.73371    4.65279     1.7%
mutex-filler              18.5395    18.3971     0.8%
mutex_trylock-empty       10.46      10.1671     2.8%
mutex_trylock-filler      16.2188    16.7105    -3.0%
rwlock_read-empty         16.5118    16.4697     0.3%
rwlock_read-filler        20.68      20.0416     3.1%
rwlock_tryread-empty      2.06572    2.038       1.3%
rwlock_tryread-filler     16.082     15.7182     2.3%
rwlock_write-empty        31.3723    31.1147     0.8%
rwlock_write-filler       41.6492    69.8115   -67.6%
rwlock_trywrite-empty     2.20584    2.32175    -5.3%
rwlock_trywrite-filler    15.7044    15.86      -1.0%
spin_lock-empty           16.7964    16.4342     2.2%
spin_lock-filler          20.6118    20.3916     1.1%
spin_trylock-empty        8.99989    8.98884     0.1%
spin_trylock-filler       16.4732    16.1979     1.7%
sem_wait-empty            15.805     15.7558     0.3%
sem_wait-filler           19.2346    19.2554    -0.1%
sem_trywait-empty         2.06405    2.03789     1.3%
sem_trywait-filler        15.921     15.7884     0.8%
condvar-empty             1385.84    1341.96     3.2%
condvar-filler            1419.82    1343.06     5.4%
consumer_producer-empty   2550.01    2446.33     4.1%
consumer_producer-filler  2709.4     2659.59     1.8%

...still one very bad outlier, and a few of concern.

> BTW, how did you compare the 2 results?
> I tried compare_bench.py and got:
>
> Traceback (most recent call last):
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 196, in <module>
>     main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py", line 165, in main
>     bench1 = bench.parse_bench(bench1, schema)
>   File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py", line 137, in parse_bench
>     bench = json.load(benchfile)
>   File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
>     return loads(fp.read(),
>   File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
>     raise JSONDecodeError("Extra data", s, end)
> json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)

I did it the old-fashioned way, in a spreadsheet. :-)

I see the same errors you see with compare_bench.py.  Upon further
investigation, compare_bench.py expects input in the form produced by
"make bench".  The output from running the benchtest directly is
insufficient.
Using the respective outputs from "make BENCHSET=bench-pthread bench":
--
$ ./benchtests/scripts/compare_bench.py --threshold 2 --stats mean A.out B.out
[snip]
+++ thread_create(stack=1024,guard=2)[mean]: (2.15%) from 372674 to 364660
+++ thread_create(stack=2048,guard=1)[mean]: (4.88%) from 377835 to 359396
+++ thread_create(stack=2048,guard=2)[mean]: (3.58%) from 377306 to 363798
+++ pthread_locks(mutex-empty)[mean]: (4.27%) from 4.85936 to 4.65185
--- pthread_locks(mutex_trylock-filler)[mean]: (3.09%) from 16.0579 to 16.5533
--- pthread_locks(rwlock_write-filler)[mean]: (56.90%) from 44.4255 to 69.7047
--- pthread_locks(rwlock_trywrite-empty)[mean]: (6.73%) from 2.17594 to 2.32244
+++ pthread_locks(spin_lock-empty)[mean]: (2.17%) from 16.8086 to 16.4436
--- pthread_locks(spin_trylock-filler)[mean]: (2.34%) from 16.1119 to 16.4896
+++ pthread_locks(consumer_producer-empty)[mean]: (2.94%) from 2531.95 to 2457.48
--
PC