From patchwork Sat Oct 10 03:55:30 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: DJ Delorie X-Patchwork-Id: 40694 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 321103857C57; Sat, 10 Oct 2020 03:55:39 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 321103857C57 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1602302139; bh=PPF1PPuBA5cBuL9qKXBP5vovqdxarFCgnmCIy5DDmNM=; h=To:Subject:In-Reply-To:Date:List-Id:List-Unsubscribe:List-Archive: List-Post:List-Help:List-Subscribe:From:Reply-To:From; b=VjlApbmUd5xdKfvhHfIjYl2s/UZS5AxuQePP0S72tBtSrL2nkXNIAgJT5Wk8SMI/3 NyD2iNxnuGE29mA4Yr9gwFQSYt8uQiBmieIFOsgghhM5zwgDJv1Xp3xK9oxk5hPj0g +xv2NEsCJUpEesDNncKVaph4XkprNAGIJh9IdIW0= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [63.128.21.124]) by sourceware.org (Postfix) with ESMTP id CD4F93857C52 for ; Sat, 10 Oct 2020 03:55:35 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org CD4F93857C52 Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-320-VfvNfzkGO_-_Ozh-ag2ItA-1; Fri, 09 Oct 2020 23:55:33 -0400 X-MC-Unique: VfvNfzkGO_-_Ozh-ag2ItA-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 208A71DDE0 for ; Sat, 10 Oct 2020 03:55:32 +0000 (UTC) Received: from greed.delorie.com (ovpn-118-74.rdu2.redhat.com [10.10.118.74]) by smtp.corp.redhat.com (Postfix) with ESMTPS id C95AC277DF for ; Sat, 10 Oct 2020 03:55:31 +0000 (UTC) Received: from greed.delorie.com.redhat.com (localhost [127.0.0.1]) by greed.delorie.com (8.14.7/8.14.7) with ESMTP id 09A3tUX8011076 for ; Fri, 9 Oct 2020 23:55:30 -0400 To: libc-alpha@sourceware.org Subject: [v2] New benchtest: pthread locks In-Reply-To: (message from DJ Delorie via Libc-alpha on Thu, 08 Oct 2020 01:34:13 -0400) Date: Fri, 09 Oct 2020 23:55:30 -0400 Message-ID: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com X-Spam-Status: No, score=-11.5 required=5.0 tests=BAYES_00, DKIM_INVALID, DKIM_SIGNED, GIT_PATCH_0, KAM_DMARC_STATUS, KAM_SHORT, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H5, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: DJ Delorie via Libc-alpha From: DJ Delorie Reply-To: DJ Delorie Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" v2 changes do_bench_2 to run multiple passes and provide a mean and standard deviation as well as some other statistics, since I got tired of running the whole benchmark multiple times and computing them manually ;-) From 94653e93eec9403ebf61b00179799751ce12088d Mon Sep 17 00:00:00 2001 From: DJ Delorie Date: Wed, 7 Oct 2020 17:04:12 -0400 Subject: New benchtest: pthread locks Performance benchmarks for various posix locks: mutex, rwlock, spinlock, condvar, and semaphore. Each test is performed with an empty loop body or with a computationally "interesting" (i.e. difficult to optimize away, and used just to allow lock code to be "hidden" in the filler's CPU cycles). diff --git a/benchtests/Makefile b/benchtests/Makefile index 922e2a94b1..5cd211ee9a 100644 --- a/benchtests/Makefile +++ b/benchtests/Makefile @@ -31,7 +31,7 @@ ifneq (,$(filter yes,$(float128-fcts) $(float128-alias-fcts))) bench-math += expf128 powf128 sinf128 endif -bench-pthread := pthread_once thread_create +bench-pthread := pthread_once thread_create pthread-locks bench-string := ffs ffsll @@ -109,6 +109,7 @@ $(addprefix $(objpfx)bench-,$(bench-math)): $(libm) $(addprefix $(objpfx)bench-,$(math-benchset)): $(libm) $(addprefix $(objpfx)bench-,$(bench-pthread)): $(shared-thread-library) $(addprefix $(objpfx)bench-,$(bench-malloc)): $(shared-thread-library) +$(addprefix $(objpfx)bench-,pthread-locks): $(libm) diff --git a/benchtests/bench-pthread-locks.c b/benchtests/bench-pthread-locks.c new file mode 100644 index 0000000000..2bd49d8762 --- /dev/null +++ b/benchtests/bench-pthread-locks.c @@ -0,0 +1,554 @@ +/* Measure various lock acquisition times for empty critical sections. + Copyright (C) 2020 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#define TEST_MAIN +#define TEST_NAME "pthread-locks" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "bench-timing.h" +#include "json-lib.h" + +/* The point of this benchmark is to measure the overhead of an empty + critical section or a small critical section. This is never going + to be indicative of real application performance. Instead we are + trying to benchmark the effects of the compiler and the runtime + coupled with a particular set of hardware atomic operations. + The numbers from this benchmark should be taken with a massive gain + of salt and viewed through the eyes of expert reviewers. */ + +static pthread_mutex_t m; +static pthread_rwlock_t rw; +static pthread_cond_t cv; +static pthread_cond_t consumer_c, producer_c; +static int cv_done; +static pthread_spinlock_t sp; +static sem_t sem; + +typedef timing_t (*test_t)(long, int); + +#define START_ITERS 1000 + +#define FILLER_GOES_HERE \ + if (filler) \ + do_filler (); + +/* Everyone loves a good fibonacci series. This isn't quite one of + them because we need larger values in fewer steps, in a way that + won't be optimized away. We're looking to approximately double the + total time each test iteration takes, so as to not swamp the useful + timings. */ + +#pragma GCC push_options +#pragma GCC optimize(1) + +static int __attribute__((noinline)) +fibonacci (int i) +{ + asm(""); + if (i > 2) + return fibonacci (i-1) + fibonacci (i-2); + return 10+i; +} + +static void +do_filler (void) +{ + static char buf1[512], buf2[512]; + int f = fibonacci (5); + memcpy (buf1, buf2, f); +} + +#pragma GCC pop_options + +static timing_t +test_mutex (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_mutex_init (&m, NULL); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_mutex_lock (&m); + FILLER_GOES_HERE; + pthread_mutex_unlock (&m); + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + return cur; +} + +static timing_t +test_mutex_trylock (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_mutex_init (&m, NULL); + pthread_mutex_lock (&m); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_mutex_trylock (&m); + FILLER_GOES_HERE; + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + pthread_mutex_unlock (&m); + return cur; +} + +static timing_t +test_rwlock_read (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_rwlock_init (&rw, NULL); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_rwlock_rdlock (&rw); + FILLER_GOES_HERE; + pthread_rwlock_unlock (&rw); + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + return cur; +} + +static timing_t +test_rwlock_tryread (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_rwlock_init (&rw, NULL); + pthread_rwlock_wrlock (&rw); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_rwlock_tryrdlock (&rw); + FILLER_GOES_HERE; + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + pthread_rwlock_unlock (&rw); + return cur; +} + +static timing_t +test_rwlock_write (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_rwlock_init (&rw, NULL); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_rwlock_wrlock (&rw); + FILLER_GOES_HERE; + pthread_rwlock_unlock (&rw); + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + return cur; +} + +static timing_t +test_rwlock_trywrite (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_rwlock_init (&rw, NULL); + pthread_rwlock_rdlock (&rw); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_rwlock_trywrlock (&rw); + FILLER_GOES_HERE; + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + pthread_rwlock_unlock (&rw); + return cur; +} + +static timing_t +test_spin_lock (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_spin_init (&sp, PTHREAD_PROCESS_PRIVATE); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_spin_lock (&sp); + FILLER_GOES_HERE; + pthread_spin_unlock (&sp); + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + return cur; +} + +static timing_t +test_spin_trylock (long iters, int filler) +{ + timing_t start, stop, cur; + + pthread_spin_init (&sp, PTHREAD_PROCESS_PRIVATE); + pthread_spin_lock (&sp); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_spin_trylock (&sp); + FILLER_GOES_HERE; + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + pthread_spin_unlock (&sp); + return cur; +} + +static timing_t +test_sem_wait (long iters, int filler) +{ + timing_t start, stop, cur; + + sem_init (&sem, 0, 1); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + sem_post (&sem); + FILLER_GOES_HERE; + sem_wait (&sem); + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + return cur; +} + +static timing_t +test_sem_trywait (long iters, int filler) +{ + timing_t start, stop, cur; + + sem_init (&sem, 0, 0); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + sem_trywait (&sem); + FILLER_GOES_HERE; + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + return cur; +} + +static void * +test_condvar_helper (void *v) +{ + /* This is wasteful, but the alternative is to add the overhead of a + mutex lock/unlock to the overall iteration (both threads) and we + don't want that. Ideally, this thread would run on an + independent processing core anyway. The ONLY goal here is to + minimize the time the other thread spends waiting for us. */ + while (__atomic_load_n (&cv_done, __ATOMIC_RELAXED) == 0) + pthread_cond_signal (&cv); + + return NULL; +} + +static timing_t +test_condvar (long iters, int filler) +{ + timing_t start, stop, cur; + pthread_t helper_id; + + pthread_mutex_init (&m, NULL); + pthread_cond_init (&cv, NULL); + pthread_mutex_lock (&m); + + __atomic_store_n (&cv_done, 0, __ATOMIC_RELAXED); + pthread_create (&helper_id, NULL, test_condvar_helper, &iters); + + TIMING_NOW (start); + for (long j = iters; j >= 0; --j) + { + pthread_cond_wait (&cv, &m); + FILLER_GOES_HERE; + } + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + pthread_mutex_unlock (&m); + __atomic_store_n (&cv_done, 1, __ATOMIC_RELAXED); + + pthread_join (helper_id, NULL); + return cur; +} + +/* How many items are "queued" in our pretend queue. */ +static int queued = 0; + +typedef struct Producer_Params { + long iters; + int filler; +} Producer_Params; + +/* We only benchmark the consumer thread, but both threads are doing + essentially the same thing, and never run in parallel due to the + locks. Thus, even if they run on separate processing cores, we + count the time for both threads. */ +static void * +test_producer_thread (void *v) +{ + Producer_Params *p = (Producer_Params *) v; + long iters = p->iters; + int filler = p->filler; + long j; + + for (j = iters; j >= 0; --j) + { + /* Aquire lock on the queue. */ + pthread_mutex_lock (&m); + /* if something's already there, wait. */ + while (queued > 0) + pthread_cond_wait (&consumer_c, &m); + + /* Put something on the queue */ + FILLER_GOES_HERE; + ++ queued; + pthread_cond_signal (&producer_c); + + /* Give the other thread a chance to run. */ + pthread_mutex_unlock (&m); + } + + return NULL; +} + +static timing_t +test_consumer_producer (long iters, int filler) +{ + timing_t start, stop, cur; + pthread_t helper_id; + Producer_Params p; + + p.iters = iters; + p.filler = filler; + + pthread_mutex_init (&m, NULL); + pthread_cond_init (&cv, NULL); + + pthread_create (&helper_id, NULL, test_producer_thread, &p); + + TIMING_NOW (start); + + for (long j = iters; j >= 0; --j) + { + /* Aquire lock on the queue. */ + pthread_mutex_lock (&m); + /* Wait for something to be on the queue. */ + while (queued == 0) + pthread_cond_wait (&producer_c, &m); + + /* Take if off. */ + FILLER_GOES_HERE; + -- queued; + pthread_cond_signal (&consumer_c); + + /* Give the other thread a chance to run. */ + pthread_mutex_unlock (&m); + } + + TIMING_NOW (stop); + TIMING_DIFF (cur, start, stop); + + + pthread_join (helper_id, NULL); + return cur; +} + +/* Number of runs we use for computing mean and standard deviation. + We actually do two additional runs and discard the outliers. */ +#define RUN_COUNT 10 + +static int +do_bench_2 (const char *name, test_t func, int filler, json_ctx_t *js) +{ + timing_t cur; + struct timeval ts, te; + double tsd, ted, td; + long iters, iters_limit; + timing_t curs[RUN_COUNT + 2]; + int i, j; + double mean, stdev; + + iters = START_ITERS; + iters_limit = LONG_MAX / 100; + + while (1) { + gettimeofday (&ts, NULL); + cur = func(iters, filler); + gettimeofday (&te, NULL); + + /* We want a test to take at least 0.01 seconds, and try + increasingly larger iteration counts until it does. This + allows for approximately constant-time tests regardless of + hardware speed, without the overhead of checking the time + inside the test loop itself. We stop at a million iterations + as that should be precise enough. Once we determine a suitable + iteration count, we run the test multiple times to calculate + mean and standard deviation. */ + + /* Note that this also primes the CPU cache and triggers faster + MHz, we hope. */ + tsd = ts.tv_sec + ts.tv_usec / 1000000.0; + ted = te.tv_sec + te.tv_usec / 1000000.0; + td = ted - tsd; + if (td >= 0.01 + || iters >= iters_limit + || iters >= 1000000) + break; + + iters *= 10; + } + + curs[0] = cur; + for (i = 1; i < RUN_COUNT + 2; i ++) + curs[i] = func(iters, filler); + + /* We sort the results so we can discard the fastest and slowest + times as outliers. In theory we should keep the fastest time, + but IMHO this is more fair. A simple bubble sort suffices. */ + + for (i = 0; i < RUN_COUNT + 1; i ++) + for (j = i + 1; j < RUN_COUNT + 2; j ++) + if (curs[i] > curs[j]) + { + timing_t temp = curs[i]; + curs[i] = curs[j]; + curs[j] = temp; + } + + /* Now calculate mean and standard deviation, skipping the outliers. */ + mean = 0.0; + for (i = 1; i