From patchwork Sat Nov 6 18:33:22 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 47173
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v4 5/5] x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h
Date: Sat, 6 Nov 2021 13:33:22 -0500
Message-Id: <20211106183322.3129442-5-goldstein.w.n@gmail.com>
In-Reply-To: <20211106183322.3129442-1-goldstein.w.n@gmail.com>
References: <20211101054952.2349590-1-goldstein.w.n@gmail.com> <20211106183322.3129442-1-goldstein.w.n@gmail.com>

No bug.

This patch doubles the rep_movsb_threshold when using ERMS.
Based on benchmarks the vector copy loop, especially now that it
handles 4k aliasing, is better for these medium-range sizes.

On Skylake with ERMS:

Size,   Align1, Align2, dst>src,(rep movsb) / (vec copy)
4096,   0,      0,      0,      0.975
4096,   0,      0,      1,      0.953
4096,   12,     0,      0,      0.969
4096,   12,     0,      1,      0.872
4096,   44,     0,      0,      0.979
4096,   44,     0,      1,      0.83
4096,   0,      12,     0,      1.006
4096,   0,      12,     1,      0.989
4096,   0,      44,     0,      0.739
4096,   0,      44,     1,      0.942
4096,   12,     12,     0,      1.009
4096,   12,     12,     1,      0.973
4096,   44,     44,     0,      0.791
4096,   44,     44,     1,      0.961
4096,   2048,   0,      0,      0.978
4096,   2048,   0,      1,      0.951
4096,   2060,   0,      0,      0.986
4096,   2060,   0,      1,      0.963
4096,   2048,   12,     0,      0.971
4096,   2048,   12,     1,      0.941
4096,   2060,   12,     0,      0.977
4096,   2060,   12,     1,      0.949
8192,   0,      0,      0,      0.85
8192,   0,      0,      1,      0.845
8192,   13,     0,      0,      0.937
8192,   13,     0,      1,      0.939
8192,   45,     0,      0,      0.932
8192,   45,     0,      1,      0.927
8192,   0,      13,     0,      0.621
8192,   0,      13,     1,      0.62
8192,   0,      45,     0,      0.53
8192,   0,      45,     1,      0.516
8192,   13,     13,     0,      0.664
8192,   13,     13,     1,      0.659
8192,   45,     45,     0,      0.593
8192,   45,     45,     1,      0.575
8192,   2048,   0,      0,      0.854
8192,   2048,   0,      1,      0.834
8192,   2061,   0,      0,      0.863
8192,   2061,   0,      1,      0.857
8192,   2048,   13,     0,      0.63
8192,   2048,   13,     1,      0.629
8192,   2061,   13,     0,      0.627
8192,   2061,   13,     1,      0.62

Reviewed-by: H.J. Lu
---
 sysdeps/x86/dl-cacheinfo.h   |  8 +++++---
 sysdeps/x86/dl-tunables.list | 26 +++++++++++++++-----------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index e6c94dfd02..2e43e67e4f 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -866,12 +866,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8. */
   unsigned int minimum_rep_movsb_threshold;
 #endif
-  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16). */
+  /* NB: The default REP MOVSB threshold is 4096 * (VEC_SIZE / 16) for
+     VEC_SIZE == 64 or 32. For VEC_SIZE == 16, the default REP MOVSB
+     threshold is 2048 * (VEC_SIZE / 16). */
   unsigned int rep_movsb_threshold;
   if (CPU_FEATURE_USABLE_P (cpu_features, AVX512F)
       && !CPU_FEATURE_PREFERRED_P (cpu_features, Prefer_No_AVX512))
     {
-      rep_movsb_threshold = 2048 * (64 / 16);
+      rep_movsb_threshold = 4096 * (64 / 16);
 #if HAVE_TUNABLES
       minimum_rep_movsb_threshold = 64 * 8;
 #endif
@@ -879,7 +881,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   else if (CPU_FEATURE_PREFERRED_P (cpu_features,
                                     AVX_Fast_Unaligned_Load))
     {
-      rep_movsb_threshold = 2048 * (32 / 16);
+      rep_movsb_threshold = 4096 * (32 / 16);
 #if HAVE_TUNABLES
       minimum_rep_movsb_threshold = 32 * 8;
 #endif
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index dd6e1d65c9..419313804d 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -32,17 +32,21 @@ glibc {
     }
     x86_rep_movsb_threshold {
       type: SIZE_T
-      # Since there is overhead to set up REP MOVSB operation, REP MOVSB
-      # isn't faster on short data. The memcpy micro benchmark in glibc
-      # shows that 2KB is the approximate value above which REP MOVSB
-      # becomes faster than SSE2 optimization on processors with Enhanced
-      # REP MOVSB. Since larger register size can move more data with a
-      # single load and store, the threshold is higher with larger register
-      # size. Note: Since the REP MOVSB threshold must be greater than 8
-      # times of vector size and the default value is 2048 * (vector size
-      # / 16), the default value and the minimum value must be updated at
-      # run-time. NB: Don't set the default value since we can't tell if
-      # the tunable value is set by user or not [BZ #27069].
+      # Since there is overhead to set up REP MOVSB operation, REP
+      # MOVSB isn't faster on short data. The memcpy micro benchmark
+      # in glibc shows that 2KB is the approximate value above which
+      # REP MOVSB becomes faster than SSE2 optimization on processors
+      # with Enhanced REP MOVSB. Since larger register size can move
+      # more data with a single load and store, the threshold is
+      # higher with larger register size. Micro benchmarks show AVX
+      # REP MOVSB becomes faster approximately at 8KB. The AVX512
+      # threshold is extrapolated to 16KB. For machines with FSRM the
+      # threshold is universally set at 2112 bytes. Note: Since the
+      # REP MOVSB threshold must be greater than 8 times of vector
+      # size and the default value is 4096 * (vector size / 16), the
+      # default value and the minimum value must be updated at
+      # run-time. NB: Don't set the default value since we can't tell
+      # if the tunable value is set by user or not [BZ #27069].
       minval: 1
     }
     x86_rep_stosb_threshold {
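
For reference, the defaults described in the comments above can be worked
out with a small standalone sketch. This is not glibc code: the helper names
default_rep_movsb_threshold and minimum_rep_movsb_threshold_for, and the
has_fsrm flag, are hypothetical and exist only for illustration. The sketch
assumes the FSRM value of 2112 bytes overrides the per-vector-size default,
as the tunables comment states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc: returns the default REP MOVSB
   threshold in bytes for a vector size of 16 (SSE2), 32 (AVX) or
   64 (AVX512), following the defaults described in this patch.  */
static size_t
default_rep_movsb_threshold (size_t vec_size, int has_fsrm)
{
  assert (vec_size == 16 || vec_size == 32 || vec_size == 64);

  /* Assumption taken from the tunables comment: machines with FSRM
     use 2112 bytes regardless of vector size.  */
  if (has_fsrm)
    return 2112;

  /* VEC_SIZE == 16 keeps the old 2048 scale; 32 and 64 use 4096.  */
  size_t scale = (vec_size == 16) ? 2048 : 4096;
  return scale * (vec_size / 16);
}

/* Hypothetical helper: the value the surrounding code assigns to
   minimum_rep_movsb_threshold, i.e. VEC_SIZE * 8.  */
static size_t
minimum_rep_movsb_threshold_for (size_t vec_size)
{
  return vec_size * 8;
}
```

With these assumptions the defaults come out to 2048 bytes for SSE2, 8192
bytes for AVX and 16384 bytes for AVX512, which lines up with the 2KB, 8KB
and 16KB figures quoted in the dl-tunables.list comment.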