From patchwork Thu Oct  8 05:34:13 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: DJ Delorie <dj@redhat.com>
X-Patchwork-Id: 40680
Return-Path: <libc-alpha-bounces@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 47E4A388EC1D;
	Thu,  8 Oct 2020 05:34:23 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 47E4A388EC1D
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1602135263;
	bh=1VDBsa9sWDbmLEUSktSOtocLH+dolaEmX+0gFcbf49s=;
	h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post:
	 List-Help:List-Subscribe:From:Reply-To:From;
	b=jqvmCoq7Tp759CzRKi3Vnnr3j8eLdGQuZQchc6VHROY5wLj+fkn6+OflylwMs2Y6A
	 jwbRwK8IO6HpRmiSjf98wYJ2VF4/SXxoey5Yf7aVykdaZkGTv/zs+lHgkvy9gSr8is
	 w0yMUswEi7Jrr9qR5tCO57NjIzjGEWx+IvfXxG64=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [216.205.24.124])
 by sourceware.org (Postfix) with ESMTP id DE4F3385782F
 for <libc-alpha@sourceware.org>; Thu,  8 Oct 2020 05:34:19 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org DE4F3385782F
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-335-J-OReCt8NS2tVVnerYkSew-1; Thu, 08 Oct 2020 01:34:16 -0400
X-MC-Unique: J-OReCt8NS2tVVnerYkSew-1
Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com
 [10.5.11.13])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 3925E1018F7E
 for <libc-alpha@sourceware.org>; Thu,  8 Oct 2020 05:34:15 +0000 (UTC)
Received: from greed.delorie.com (ovpn-118-74.rdu2.redhat.com [10.10.118.74])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id E124D6EF4A
 for <libc-alpha@sourceware.org>; Thu,  8 Oct 2020 05:34:14 +0000 (UTC)
Received: from greed.delorie.com.redhat.com (localhost [127.0.0.1])
 by greed.delorie.com (8.14.7/8.14.7) with ESMTP id 0985YDei013780
 for <libc-alpha@sourceware.org>; Thu, 8 Oct 2020 01:34:13 -0400
Date: Thu, 08 Oct 2020 01:34:13 -0400
Message-Id: <xnimbl5lii.fsf@greed.delorie.com>
To: libc-alpha@sourceware.org
Subject: New benchtest: pthread locks
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-11.8 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 KAM_SHORT,
 RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: DJ Delorie via Libc-alpha
 <libc-alpha@sourceware.org>
From: DJ Delorie <dj@redhat.com>
Reply-To: DJ Delorie <dj@redhat.com>
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha" <libc-alpha-bounces@sourceware.org>

Performance benchmarks for various posix locks: mutex, rwlock,
spinlock, condvar, and semaphore.  Each test is performed with
an empty loop body or with a computationally "interesting" (i.e.
difficult to optimize away, and used just to allow lock code to
be "hidden" in the filler's CPU cycles).

diff --git a/benchtests/Makefile b/benchtests/Makefile
index 922e2a94b1..36ea224d9d 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -31,7 +31,7 @@ ifneq (,$(filter yes,$(float128-fcts) $(float128-alias-fcts)))
 bench-math += expf128 powf128 sinf128
 endif
 
-bench-pthread := pthread_once thread_create
+bench-pthread := pthread_once thread_create pthread-locks
 
 bench-string := ffs ffsll
 
diff --git a/benchtests/bench-pthread-locks.c b/benchtests/bench-pthread-locks.c
new file mode 100644
index 0000000000..cc866dd5ad
--- /dev/null
+++ b/benchtests/bench-pthread-locks.c
@@ -0,0 +1,502 @@
+/* Measure various lock acquisition times for empty critical sections.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#define TEST_MAIN
+#define TEST_NAME "pthread-locks"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <semaphore.h>
+#include <stdatomic.h>
+#include <sys/time.h>
+#include "bench-timing.h"
+#include "json-lib.h"
+
+/* The point of this benchmark is to measure the overhead of an empty
+   critical section or a small critical section.  This is never going
+   to be indicative of real application performance.  Instead we are
+   trying to benchmark the effects of the compiler and the runtime
+   coupled with a particular set of hardware atomic operations.
+   The numbers from this benchmark should be taken with a massive gain
+   of salt and viewed through the eyes of expert reviewers.  */
+
+static pthread_mutex_t m;
+static pthread_rwlock_t rw;
+static pthread_cond_t cv;
+static pthread_cond_t consumer_c, producer_c;
+static int cv_done;
+static pthread_spinlock_t sp;
+static sem_t sem;
+
+typedef timing_t (*test_t)(long, int);
+
+#define START_ITERS 1000
+
+#define FILLER_GOES_HERE \
+  if (filler) \
+    do_filler ();
+
+/* Everyone loves a good fibonacci series.  This isn't quite one of
+   them because we need larger values in fewer steps, in a way that
+   won't be optimized away.  We're looking to approximately double the
+   total time each test iteration takes, so as to not swamp the useful
+   timings.  */
+
+#pragma GCC push_options
+#pragma GCC optimize(1)
+
+static int __attribute__((noinline))
+fibonacci (int i)
+{
+  asm("");
+  if (i > 2)
+    return fibonacci (i-1) + fibonacci (i-2);
+  return 10+i;
+}
+
+static void
+do_filler (void)
+{
+  static char buf1[512], buf2[512];
+  int f = fibonacci (5);
+  memcpy (buf1, buf2, f);
+}
+
+#pragma GCC pop_options
+
+static timing_t
+test_mutex (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_mutex_init (&m, NULL);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_mutex_lock (&m);
+      FILLER_GOES_HERE;
+      pthread_mutex_unlock (&m);
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  return cur;
+}
+
+static timing_t
+test_mutex_trylock (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_mutex_init (&m, NULL);
+  pthread_mutex_lock (&m);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_mutex_trylock (&m);
+      FILLER_GOES_HERE;
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  pthread_mutex_unlock (&m);
+  return cur;
+}
+
+static timing_t
+test_rwlock_read (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_rwlock_init (&rw, NULL);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_rwlock_rdlock (&rw);
+      FILLER_GOES_HERE;
+      pthread_rwlock_unlock (&rw);
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  return cur;
+}
+
+static timing_t
+test_rwlock_tryread (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_rwlock_init (&rw, NULL);
+  pthread_rwlock_wrlock (&rw);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_rwlock_tryrdlock (&rw);
+      FILLER_GOES_HERE;
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  pthread_rwlock_unlock (&rw);
+  return cur;
+}
+
+static timing_t
+test_rwlock_write (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_rwlock_init (&rw, NULL);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_rwlock_wrlock (&rw);
+      FILLER_GOES_HERE;
+      pthread_rwlock_unlock (&rw);
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  return cur;
+}
+
+static timing_t
+test_rwlock_trywrite (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_rwlock_init (&rw, NULL);
+  pthread_rwlock_rdlock (&rw);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_rwlock_trywrlock (&rw);
+      FILLER_GOES_HERE;
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  pthread_rwlock_unlock (&rw);
+  return cur;
+}
+
+static timing_t
+test_spin_lock (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_spin_init (&sp, PTHREAD_PROCESS_PRIVATE);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_spin_lock (&sp);
+      FILLER_GOES_HERE;
+      pthread_spin_unlock (&sp);
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  return cur;
+}
+
+static timing_t
+test_spin_trylock (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  pthread_spin_init (&sp, PTHREAD_PROCESS_PRIVATE);
+  pthread_spin_lock (&sp);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_spin_trylock (&sp);
+      FILLER_GOES_HERE;
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  pthread_spin_unlock (&sp);
+  return cur;
+}
+
+static timing_t
+test_sem_wait (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  sem_init (&sem, 0, 1);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      sem_post (&sem);
+      FILLER_GOES_HERE;
+      sem_wait (&sem);
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  return cur;
+}
+
+static timing_t
+test_sem_trywait (long iters, int filler)
+{
+  timing_t start, stop, cur;
+
+  sem_init (&sem, 0, 0);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      sem_trywait (&sem);
+      FILLER_GOES_HERE;
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  return cur;
+}
+
+static void *
+test_condvar_helper (void *v)
+{
+  /* This is wasteful, but the alternative is to add the overhead of a
+     mutex lock/unlock to the overall iteration (both threads) and we
+     don't want that.  Ideally, this thread would run on an
+     independent processing core anyway.  The ONLY goal here is to
+     minimize the time the other thread spends waiting for us.  */
+  while (__atomic_load_n (&cv_done, __ATOMIC_RELAXED) == 0)
+    pthread_cond_signal (&cv);
+
+  return NULL;
+}
+
+static timing_t
+test_condvar (long iters, int filler)
+{
+  timing_t start, stop, cur;
+  pthread_t helper_id;
+
+  pthread_mutex_init (&m, NULL);
+  pthread_cond_init (&cv, NULL);
+  pthread_mutex_lock (&m);
+
+  __atomic_store_n (&cv_done, 0, __ATOMIC_RELAXED);
+  pthread_create (&helper_id, NULL, test_condvar_helper, &iters);
+
+  TIMING_NOW (start);
+  for (long j = iters; j >= 0; --j)
+    {
+      pthread_cond_wait (&cv, &m);
+      FILLER_GOES_HERE;
+    }
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+  pthread_mutex_unlock (&m);
+  __atomic_store_n (&cv_done, 1, __ATOMIC_RELAXED);
+
+  pthread_join (helper_id, NULL);
+  return cur;
+}
+
+/* How many items are "queued" in our pretend queue.  */
+static int queued = 0;
+
+typedef struct Producer_Params {
+  long iters;
+  int filler;
+} Producer_Params;
+
+/* We only benchmark the consumer thread, but both threads are doing
+   essentially the same thing, and never run in parallel due to the
+   locks.  Thus, even if they run on separate processing cores, we
+   count the time for both threads.  */
+static void *
+test_producer_thread (void *v)
+{
+  Producer_Params *p = (Producer_Params *) v;
+  long iters = p->iters;
+  int filler = p->filler;
+  long j;
+
+  for (j = iters; j >= 0; --j)
+    {
+      /* Aquire lock on the queue.  */
+      pthread_mutex_lock (&m);
+      /* if something's already there, wait.  */
+      while (queued > 0)
+	pthread_cond_wait (&consumer_c, &m);
+
+      /* Put something on the queue */
+      FILLER_GOES_HERE;
+      ++ queued;
+      pthread_cond_signal (&producer_c);
+
+      /* Give the other thread a chance to run.  */
+      pthread_mutex_unlock (&m);
+    }
+
+  return NULL;
+}
+
+static timing_t
+test_consumer_producer (long iters, int filler)
+{
+  timing_t start, stop, cur;
+  pthread_t helper_id;
+  Producer_Params p;
+
+  p.iters = iters;
+  p.filler = filler;
+
+  pthread_mutex_init (&m, NULL);
+  pthread_cond_init (&cv, NULL);
+
+  pthread_create (&helper_id, NULL, test_producer_thread, &p);
+
+  TIMING_NOW (start);
+
+  for (long j = iters; j >= 0; --j)
+    {
+      /* Aquire lock on the queue.  */
+      pthread_mutex_lock (&m);
+      /* Wait for something to be on the queue.  */
+      while (queued == 0)
+	pthread_cond_wait (&producer_c, &m);
+
+      /* Take if off. */
+      FILLER_GOES_HERE;
+      -- queued;
+      pthread_cond_signal (&consumer_c);
+
+      /* Give the other thread a chance to run.  */
+      pthread_mutex_unlock (&m);
+    }
+
+  TIMING_NOW (stop);
+  TIMING_DIFF (cur, start, stop);
+
+
+  pthread_join (helper_id, NULL);
+  return cur;
+}
+
+static int
+do_bench_2 (const char *name, test_t func, int filler, json_ctx_t *j)
+{
+  timing_t cur;
+  struct timeval ts, te;
+  double tsd, ted, td;
+  long iters, iters_limit;
+
+  iters = START_ITERS;
+  iters_limit = LONG_MAX / 10;
+
+  while (1) {
+    gettimeofday (&ts, NULL);
+    cur = func(iters, filler);
+    gettimeofday (&te, NULL);
+
+    /* We want a test to take at least 0.1 seconds, and try
+       increasingly larger iteration counts until it does.  This
+       allows for approximately constant-time tests regardless of
+       hardware speed, without the overhead of checking the time
+       inside the test loop itself. */
+
+    /* Note that this also primes the CPU cache and triggers faster
+       MHz, we hope.  */
+    tsd = ts.tv_sec + ts.tv_usec / 1000000.0;
+    ted = te.tv_sec + te.tv_usec / 1000000.0;
+    td = ted - tsd;
+    if (td >= 0.1 || iters >= iters_limit)
+      break;
+
+    iters *= 10;
+  }
+
+  json_attr_object_begin (j, filler ? "filler" : "empty");
+
+  json_attr_double (j, "duration", (double) iters);
+  json_attr_double (j, "iterations", (double) iters);
+  json_attr_double (j, "mean", (double) cur / (double) iters);
+
+  json_attr_object_end (j);
+
+  return 0;
+}
+
+static int
+do_bench_1 (const char *name, test_t func, json_ctx_t *j)
+{
+  int rv = 0;
+
+  json_attr_object_begin (j, name);
+
+  rv += do_bench_2 (name, func, 0, j);
+  rv += do_bench_2 (name, func, 1, j);
+
+  json_attr_object_end (j);
+
+  return rv;
+}
+
+int
+do_bench (void)
+{
+  int rv = 0;
+  json_ctx_t json_ctx;
+
+  json_init (&json_ctx, 2, stdout);
+  json_attr_object_begin (&json_ctx, "pthread_locks");
+
+#define BENCH(n) rv += do_bench_1 (#n, test_##n, &json_ctx)
+
+  BENCH (mutex);
+  BENCH (mutex_trylock);
+  BENCH (rwlock_read);
+  BENCH (rwlock_tryread);
+  BENCH (rwlock_write);
+  BENCH (rwlock_trywrite);
+  BENCH (spin_lock);
+  BENCH (spin_trylock);
+  BENCH (sem_wait);
+  BENCH (sem_trywait);
+  BENCH (condvar);
+  BENCH (consumer_producer);
+
+  json_attr_object_end (&json_ctx);
+
+  return rv;
+}
+	    
+
+#define TEST_FUNCTION do_bench ()
+
+#include "../test-skeleton.c"