From patchwork Wed Jan 26 02:28:08 2022
X-Patchwork-Id: 50446
Subject: [PATCH v3] [AARCH64] Fix PR target/103100 -mstrict-align and memset on not aligned buffers
Date: Tue, 25 Jan 2022 18:28:08 -0800
Message-ID: <1643164088-18048-1-git-send-email-apinski@marvell.com>
Reply-To: apinski@marvell.com
Cc: Andrew Pinski
From: Andrew Pinski

The problem here is that aarch64_expand_setmem does not take the buffer's
alignment into account in the strict-alignment case.

This is version 3 of this patch.  It is based on version 2, but it moves
the check on the number of instructions out of the optimize-for-size case
so that it is always done, and changes the libcall cost for the !size case
to max_set_size/16 + 1 (or 17), which is the same as before when only
max_set_size was handled.  The main change is the handling of the
strict-alignment case, where we now inline at most 17 instructions; past
that point a call to memset is faster and can handle the dynamic alignment
instead of just the static alignment.  The +1 is there to account for
setting up the SIMD duplicate.
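To make the bound concrete, here is a rough sketch of the arithmetic
(assuming max_set_size is 256, which is what the "or 17" above implies;
this is not the GCC code itself):

  /* !size_p libcall bound used above (sketch only).  */
  unsigned max_set_size = 256;                    /* non-MOPS inline limit */
  unsigned libcall_cost = max_set_size / 16 + 1;  /* 256/16 + 1 == 17 */
  /* The largest inline expansion is 16 16-byte stores plus the vector dup,
     i.e. 17 ops, so the inline sequence is kept only while
     inlined_ops <= libcall_cost and anything longer becomes a call to
     memset.  */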
OK? Bootstrapped and tested on aarch64-linux-gnu with no regressions.

	PR target/103100

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_expand_setmem): Constrain
	copy_limit to the alignment of the mem if STRICT_ALIGNMENT is true.
	Also constrain the number of instructions for the !size case to
	max_set_size/16 + 1.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/memset-strict-align-1.c: Update test.
	Reduce the size down to 207 and make s1 global and aligned
	to 16 bytes.
	* gcc.target/aarch64/memset-strict-align-2.c: New test.
---
 gcc/config/aarch64/aarch64.cc                 | 55 ++++++++++---------
 .../aarch64/memset-strict-align-1.c           | 20 +++----
 .../aarch64/memset-strict-align-2.c           | 14 +++++
 3 files changed, 53 insertions(+), 36 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 296145e6008..02ecb2154ea 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23831,8 +23831,11 @@ aarch64_expand_setmem (rtx *operands)
      (zero constants can use XZR directly).  */
   unsigned mops_cost = 3 + 1 + cst_val;
   /* A libcall to memset in the worst case takes 3 instructions to prepare
-     the arguments + 1 for the call.  */
-  unsigned libcall_cost = 4;
+     the arguments + 1 for the call.
+     In the case of not optimizing for size the cost of doing a libcall
+     is the max_set_size / 16 + 1 or 17 instructions.  The one instruction
+     is for the vector dup which may or may not be used.  */
+  unsigned libcall_cost = size_p ? 4 : (max_set_size / 16 + 1);
 
   /* Upper bound check.  For large constant-sized setmem use the MOPS sequence
      when available.  */
@@ -23842,12 +23845,12 @@ aarch64_expand_setmem (rtx *operands)
 
   /* Attempt a sequence with a vector broadcast followed by stores.
      Count the number of operations involved to see if it's worth it
-     against the alternatives.  A simple counter simd_ops on the
+     against the alternatives.  A simple counter inlined_ops on the
      algorithmically-relevant operations is used rather than an rtx_insn count
      as all the pointer adjusmtents and mode reinterprets will be optimized
      away later.  */
   start_sequence ();
-  unsigned simd_ops = 0;
+  unsigned inlined_ops = 0;
 
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23855,15 +23858,22 @@ aarch64_expand_setmem (rtx *operands)
   /* Prepare the val using a DUP/MOVI v0.16B, val.  */
   src = expand_vector_broadcast (V16QImode, val);
   src = force_reg (V16QImode, src);
-  simd_ops++;
+  inlined_ops++;
 
   /* Convert len to bits to make the rest of the code simpler.  */
   n = len * BITS_PER_UNIT;
 
   /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  */
-  const int copy_limit = (aarch64_tune_params.extra_tuning_flags
-			  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
-			  ? GET_MODE_BITSIZE (TImode) : 256;
+  int copy_limit;
+
+  if (aarch64_tune_params.extra_tuning_flags
+      & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
+    copy_limit = GET_MODE_BITSIZE (TImode);
+  else
+    copy_limit = 256;
+
+  if (STRICT_ALIGNMENT)
+    copy_limit = MIN (copy_limit, (int)MEM_ALIGN (dst));
 
   while (n > 0)
     {
@@ -23878,7 +23888,7 @@ aarch64_expand_setmem (rtx *operands)
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
       aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
-      simd_ops++;
+      inlined_ops++;
       n -= mode_bits;
 
       /* Do certain trailing copies as overlapping if it's going to be
@@ -23897,24 +23907,17 @@ aarch64_expand_setmem (rtx *operands)
   rtx_insn *seq = get_insns ();
   end_sequence ();
 
-  if (size_p)
-    {
-      /* When optimizing for size we have 3 options: the SIMD broadcast sequence,
-	 call to memset or the MOPS expansion.  */
-      if (TARGET_MOPS
-	  && mops_cost <= libcall_cost
-	  && mops_cost <= simd_ops)
-	return aarch64_expand_setmem_mops (operands);
-      /* If MOPS is not available or not shorter pick a libcall if the SIMD
-	 sequence is too long.  */
-      else if (libcall_cost < simd_ops)
-	return false;
-      emit_insn (seq);
-      return true;
-    }
+  /* When optimizing for size we have 3 options: the inlined sequence,
+     call to memset or the MOPS expansion.  */
+  if (size_p
+      && TARGET_MOPS
+      && mops_cost <= libcall_cost
+      && mops_cost <= inlined_ops)
+    return aarch64_expand_setmem_mops (operands);
+  /* Pick a libcall if the inlined sequence is too long.  */
+  else if (libcall_cost < inlined_ops)
+    return false;
 
-  /* At this point the SIMD broadcast sequence is the best choice when
-     optimizing for speed.  */
   emit_insn (seq);
   return true;
 }
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
index 664d43aee13..5a7acbf2fa9 100644
--- a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-1.c
@@ -1,28 +1,28 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -mstrict-align" } */
 
-struct s { char x[255]; };
+struct s { char x[207]; };
+struct s s1 __attribute__((aligned(16)));
 void foo (struct s *);
-void bar (void) { struct s s1 = {}; foo (&s1); }
+void bar (void) { s1 = (struct s){}; foo (&s1); }
 
-/* memset (s1 = {}, sizeof = 255) should be expanded out
+/* memset (s1 = {}, sizeof = 207) should be expanded out
    such that there are no overlap stores when -mstrict-align
    is in use.
-   so 7 pairs of 16 bytes stores (224 bytes).
-   1 16 byte stores
+   so 5 pairs of 16 bytes stores (160 bytes).
+   2 16 byte stores (FIXME: should be 0)
    1 8 byte store
    1 4 byte store
    1 2 byte store
    1 1 byte store
  */
 
-/* { dg-final { scan-assembler-times "stp\tq" 7 } } */
-/* { dg-final { scan-assembler-times "str\tq" 1 } } */
+/* { dg-final { scan-assembler-times "stp\tq" 5 } } */
+/* { dg-final { scan-assembler-times "str\tq" 2 } } */
 /* { dg-final { scan-assembler-times "str\txzr" 1 } } */
 /* { dg-final { scan-assembler-times "str\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strh\twzr" 1 } } */
 /* { dg-final { scan-assembler-times "strb\twzr" 1 } } */
 
-/* Also one store pair for the frame-pointer and the LR. */
-/* { dg-final { scan-assembler-times "stp\tx" 1 } } */
-
+/* No store pair with 8 byte words is needed as foo is called with a sibcall.  */
+/* { dg-final { scan-assembler-times "stp\tx" 0 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
new file mode 100644
index 00000000000..67d994fe39e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-strict-align-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstrict-align" } */
+
+struct s { char x[15]; };
+void foo (struct s *);
+void bar (struct s *s1) { *s1 = (struct s){}; foo (s1); }
+
+/* memset (s1 = {}, sizeof = 15) should be expanded out
+   such that there are no overlap stores when -mstrict-align
+   is in use.  As the alignment of s1 is unknown, byte stores are needed.
+   so 15 byte stores
+ */
+
+/* { dg-final { scan-assembler-times "strb\twzr" 15 } } */
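For reference, the expected stores in the two tests add up like this (my
arithmetic based on the comments above, not part of the patch):

  /* memset-strict-align-1.c: 207 bytes, s1 aligned to 16 bytes:
     5*32 (stp q) + 2*16 (str q) + 8 + 4 + 2 + 1 == 207 bytes.
     memset-strict-align-2.c: 15 bytes, alignment unknown, so under
     -mstrict-align only byte stores can be used; 15 strb stores plus
     the dup is 16 ops, which is still under the 17-op libcall bound,
     so it stays inline.  */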