Message ID | 20230413232157.1487389-1-philipp.tomsich@vrull.eu |
---|---|
State | New |
Headers |
From: Philipp Tomsich <philipp.tomsich@vrull.eu> To: gcc-patches@gcc.gnu.org Cc: Kyrylo Tkachov <kyrylo.tkachov@arm.com>, Philipp Tomsich <philipp.tomsich@vrull.eu>, Di Zhao <di.zhao@amperecomputing.com> Subject: [PATCH] aarch64: disable LDP via tuning structure for -mcpu=ampere1 Date: Fri, 14 Apr 2023 01:21:57 +0200 Message-Id: <20230413232157.1487389-1-philipp.tomsich@vrull.eu> X-Mailer: git-send-email 2.34.1 List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> |
Series |
aarch64: disable LDP via tuning structure for -mcpu=ampere1
|
|
Commit Message
Philipp Tomsich
April 13, 2023, 11:21 p.m. UTC
AmpereOne (-mcpu=ampere1) breaks LDP instructions into two uops.
Given the chance that this causes instructions to slip into the next
decoding cycle and the additional overheads when handling
cacheline-crossing LDP instructions, we disable the generation of LDP
instructions through the tuning structure from instruction combining
(such as in peephole2).
Given the code-density benefits in builtins and prologue/epilogue
expansion, we allow LDPs there.
This commit:
* adds a new tuning option AARCH64_EXTRA_TUNE_NO_LDP_COMBINE
* allows -moverride=tune=... to override this
Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
Co-Authored-By: Di Zhao <di.zhao@amperecomputing.com>
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
Add AARCH64_EXTRA_TUNE_NO_LDP_COMBINE.
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp):
Check for the above tuning option when processing loads.
---
gcc/config/aarch64/aarch64-tuning-flags.def | 3 +++
gcc/config/aarch64/aarch64.cc | 8 +++++++-
2 files changed, 10 insertions(+), 1 deletion(-)
Comments
Hi Philipp,

> -----Original Message-----
> From: Philipp Tomsich <philipp.tomsich@vrull.eu>
> Sent: Friday, April 14, 2023 12:22 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>; Philipp Tomsich
> <philipp.tomsich@vrull.eu>; Di Zhao <di.zhao@amperecomputing.com>
> Subject: [PATCH] aarch64: disable LDP via tuning structure for -mcpu=ampere1
>
> AmpereOne (-mcpu=ampere1) breaks LDP instructions into two uops.
> Given the chance that this causes instructions to slip into the next
> decoding cycle and the additional overheads when handling
> cacheline-crossing LDP instructions, we disable the generation of LDP
> instructions through the tuning structure from instruction combining
> (such as in peephole2).
>
> Given the code-density benefits in builtins and prologue/epilogue
> expansion, we allow LDPs there.

LDPs are indeed quite an important part of the ISA for code density and there are, in principle, second-order benefits from using them, like keeping the instruction cache footprint low (which can be important for large workloads).
Did you gather some benchmarks showing a benefit of disabling them in this manner?

> @@ -26053,6 +26053,12 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
> enum reg_class rclass_1, rclass_2;
> rtx mem_1, mem_2, reg_1, reg_2;
>
> + /* Allow the tuning structure to disable LDP instruction formation
> + from combining instructions (e.g., in peephole2). */
> + if (load && (aarch64_tune_params.extra_tuning_flags
> + & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
> + return false;

If we do decide to do this, I think this is not a complete approach. See the similar tuning flag AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.
There's various other places in the backend that would need to be adjusted to avoid bringing loads together for the peephole2s to merge (the sched_fusion stuff).
Plus there's the cpymem expansions that would generate load pairs too...
We'd want some testcases added to check that LDPs are blocked too...

Thanks,
Kyrill
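A testcase of the kind Kyrill asks for might look roughly like the sketch below. It assumes the usual DejaGnu directives used in the GCC testsuite (`dg-do`, `dg-options`, `dg-final` with `scan-assembler-not`); the function name `sum_pair`, the option set and the regular expression are illustrative and are not taken from the thread.

```cpp
/* { dg-do compile } */
/* { dg-options "-O2 -mcpu=ampere1" } */

/* Two adjacent 64-bit loads from the same base pointer: a typical
   candidate for the load-pair peephole2 when pairing is allowed.  */
long
sum_pair (const long *p)
{
  return p[0] + p[1];
}

/* Assumption: with the proposed tuning flag in effect the two loads
   stay separate, so no ldp should appear in the assembly output.  */
/* { dg-final { scan-assembler-not {\tldp\t} } } */
```

A complementary test would presumably check that STP is still formed for adjacent stores, since the patch only filters the load case.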
Kyrylo,

On Fri, 14 Apr 2023 at 11:21, Kyrylo Tkachov <Kyrylo.Tkachov@arm.com> wrote:
> LDPs are indeed quite an important part of the ISA for code density and there are, in principle, second-order benefits from using them, like keeping the instruction cache footprint low (which can be important for large workloads).
> Did you gather some benchmarks showing a benefit of disabling them in this manner?

This has been benchmark-driven, but I need to follow up separately (as
the final numbers are with the folks that have access to the
benchmark machines).

> If we do decide to do this, I think this is not a complete approach. See the similar tuning flag AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS.
> There's various other places in the backend that would need to be adjusted to avoid bringing loads together for the peephole2s to merge (the sched_fusion stuff).
> Plus there's the cpymem expansions that would generate load pairs too...

I have add-on patches for these, but given that I don't have direct
access to the benchmarking machine and the benchmarks have been run
with this functionality only, I didn't submit them for the time being.
Do you see a path to get this in during the current cycle and defer
only the add-on patches (happy to resubmit as a series)?

> We'd want some testcases added to check that LDPs are blocked too...
>
> Thanks,
> Kyrill
For phase 1, we plan to replace this with a feature to allow finer-grained control over when to use LDP or STP (i.e., control these independently) with the following scopes and policies:

- scopes are: { sched-fusion, mem, pro/epilogue, peephole }
- policies are: { default (from tuning), always, never, aligned (to 2x element size) }

Happy to get this fuller solution already onto the list, if it helps with forward-progress on the localised change.

The current patch tries to be minimally invasive (i.e., it doesn't touch STP). It intentionally avoids modifying the sched-fusion logic (which requires refactoring, as it doesn't differentiate between the load and store cases), pro/epilogue creation and mem* function expansion.

Philipp.
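The policy set proposed in this message could be modelled roughly as follows. This is only a sketch of the decision logic under stated assumptions: `pair_policy`, `pair_allowed` and their parameters are hypothetical names, not GCC identifiers, and "aligned" is interpreted here as "the first access is known to be aligned to twice the element size".

```cpp
#include <cstdint>

// Hypothetical policy names mirroring the proposal: default (from tuning),
// always, never, aligned (to 2x element size).
enum class pair_policy { from_tuning, always, never, aligned };

// Decide whether two adjacent accesses may be fused into one pair
// instruction within a given scope.
//   tuning_allows  - what the CPU tuning structure would choose
//   addr_alignment - known alignment of the first access, in bytes
//   elem_size      - size of one element, in bytes
constexpr bool
pair_allowed (pair_policy policy, bool tuning_allows,
              std::uint64_t addr_alignment, std::uint64_t elem_size)
{
  switch (policy)
    {
    case pair_policy::always:
      return true;
    case pair_policy::never:
      return false;
    case pair_policy::aligned:
      // Allow the pair only when the address is a multiple of twice
      // the element size, i.e. the pair is naturally aligned.
      return addr_alignment >= 2 * elem_size
             && addr_alignment % (2 * elem_size) == 0;
    case pair_policy::from_tuning:
      return tuning_allows;
    }
  return tuning_allows;
}
```

Under this model each scope (sched-fusion, mem, pro/epilogue, peephole) would carry its own policy value for loads and another for stores, which is what makes independent control of LDP and STP possible.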
On Fri, 14 Apr 2023 at 11:31, Philipp Tomsich <philipp.tomsich@vrull.eu> wrote:
> This has been benchmark-driven, but I need to follow up separately (as
> the final numbers are with the folks that have access to the
> benchmark machines).

Here are the numbers for the submitted change for AmpereOne:

Benchmark | Change |
---|---|
503.bwaves_r | -0.88% |
507.cactuBSSN_r | 0.35% |
508.namd_r | 3.09% |
510.parest_r | -2.99% |
511.povray_r | 5.54% |
519.lbm_r | 15.83% |
521.wrf_r | 0.56% |
526.blender_r | 2.47% |
527.cam4_r | 0.70% |
538.imagick_r | 0.00% |
544.nab_r | -0.33% |
549.fotonik3d_r | -0.42% |
554.roms_r | 0.00% |
total | 1.79% |
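The message does not say how the total was aggregated. For readers who want to compare the per-benchmark figures against it, the sketch below computes two common aggregates, an arithmetic mean of the percentages and a geometric mean of the corresponding ratios. It assumes that each figure is a relative change of the score and that a positive value is an improvement; the function and variable names are illustrative.

```cpp
#include <cmath>
#include <vector>

// Per-benchmark changes in percent, copied from the message above.
const std::vector<double> changes_pct = {
  -0.88, 0.35, 3.09, -2.99, 5.54, 15.83, 0.56,
  2.47, 0.70, 0.00, -0.33, -0.42, 0.00
};

// Plain average of the percentage values.
double
arithmetic_mean_pct (const std::vector<double> &pct)
{
  double sum = 0.0;
  for (double p : pct)
    sum += p;
  return sum / static_cast<double> (pct.size ());
}

// Geometric mean of the ratios (1 + p/100), reported again as a
// percentage; computed in log space for numerical stability.
double
geometric_mean_pct (const std::vector<double> &pct)
{
  double log_sum = 0.0;
  for (double p : pct)
    log_sum += std::log1p (p / 100.0);
  return std::expm1 (log_sum / static_cast<double> (pct.size ())) * 100.0;
}
```

Because the inputs are rounded to two decimals, neither aggregate has to reproduce the reported total exactly.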
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 712895a5263..52112ba7c48 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -44,6 +44,9 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 /* Disallow load/store pair instructions on Q-registers. */
 AARCH64_EXTRA_TUNING_OPTION ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)

+/* Disallow load-pair instructions to be formed in combine/peephole. */
+AARCH64_EXTRA_TUNING_OPTION ("no_ldp_combine", NO_LDP_COMBINE)
+
 AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)

 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index f4ef22ce02f..8dc1a9ceb17 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -1971,7 +1971,7 @@ static const struct tune_params ampere1a_tunings =
   2, /* min_div_recip_mul_df. */
   0, /* max_case_values. */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
-  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */
+  (AARCH64_EXTRA_TUNE_NO_LDP_COMBINE), /* tune_flags. */
   &ampere1_prefetch_tune
 };

@@ -26053,6 +26053,12 @@ aarch64_operands_ok_for_ldpstp (rtx *operands, bool load,
   enum reg_class rclass_1, rclass_2;
   rtx mem_1, mem_2, reg_1, reg_2;

+  /* Allow the tuning structure to disable LDP instruction formation
+     from combining instructions (e.g., in peephole2). */
+  if (load && (aarch64_tune_params.extra_tuning_flags
+               & AARCH64_EXTRA_TUNE_NO_LDP_COMBINE))
+    return false;
+
   if (load)
     {
       mem_1 = operands[1];
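For readers unfamiliar with how a one-line entry in a `.def` file becomes a testable bit, here is a simplified, self-contained model of the pattern. It is a sketch only: the real backend includes `aarch64-tuning-flags.def` rather than using an inline list, and every name below (`EXTRA_TUNING_OPTIONS`, `EXTRA_TUNE_*`, `tune_params_sketch`, `load_pairing_blocked`) is a stand-in chosen for illustration.

```cpp
#include <cstdint>

// Stand-in for the .def file: one entry per tuning option.
#define EXTRA_TUNING_OPTIONS(X)             \
  X ("no_ldp_stp_qregs", NO_LDP_STP_QREGS)  \
  X ("no_ldp_combine", NO_LDP_COMBINE)      \
  X ("rename_load_regs", RENAME_LOAD_REGS)

// First expansion: give every option a consecutive bit index.
enum extra_tune_index : unsigned
{
#define X(str, name) EXTRA_TUNE_##name##_index,
  EXTRA_TUNING_OPTIONS (X)
#undef X
  EXTRA_TUNE_index_END
};

// Second expansion: turn each index into a single-bit flag value.
enum extra_tune_flags : std::uint32_t
{
  EXTRA_TUNE_NONE = 0,
#define X(str, name) EXTRA_TUNE_##name = 1u << EXTRA_TUNE_##name##_index,
  EXTRA_TUNING_OPTIONS (X)
#undef X
};

// Minimal stand-in for the per-CPU tuning structure.
struct tune_params_sketch
{
  std::uint32_t extra_tuning_flags;
};

// Models the early exit added by the patch: pairing is rejected for
// loads when the flag is set, while stores are not affected.
constexpr bool
load_pairing_blocked (const tune_params_sketch &tune, bool load)
{
  return load && (tune.extra_tuning_flags & EXTRA_TUNE_NO_LDP_COMBINE) != 0;
}
```

The string in each entry is what lets an option be named on the command line via `-moverride=tune=...`, as the commit message notes; the sketch ignores that lookup and only shows the bit test.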