Message ID | 3cad2a5e-dd68-2fbe-d52b-e077a7405623@linux.ibm.com |
---|---|
State | New |
Headers |
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 007653858025 for <patchwork@sourceware.org>; Mon, 27 Feb 2023 15:12:13 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 007653858025 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1677510734; bh=hkBCvPn8JLTH+fjMLNXhYbBrmT+yUclWVgw7lZ7BWjc=; h=Date:To:Cc:Subject:List-Id:List-Unsubscribe:List-Archive: List-Post:List-Help:List-Subscribe:From:Reply-To:From; b=rSIo14E1ReCdDJelEsOUzF6I8L+vX/z8D55ZIZHEwkuN261LpArkeO88sZ5ZiHh+n N50Y6Gpkzmckne+GjjFOCaEKUc9CsPJ8kpKiuf9awfhZtqKNJ3A4COaTYA/ozw51Ka npV5c6YQvoqZRIWBio7qomBZkZWgyKbAkXIl26mA= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 04D653858D32 for <gcc-patches@gcc.gnu.org>; Mon, 27 Feb 2023 15:11:43 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 04D653858D32 Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 31REjPCT026800; Mon, 27 Feb 2023 15:11:42 GMT Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3p0u1r788a-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 27 Feb 2023 15:11:42 +0000 Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 31REhNgU028637; Mon, 27 Feb 2023 15:11:41 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3p0u1r787x-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 27 Feb 2023 15:11:41 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 31RD7clF016017; Mon, 27 Feb 2023 15:11:40 GMT Received: from smtprelay01.wdc07v.mail.ibm.com ([9.208.129.119]) by ppma02dal.us.ibm.com (PPS) with ESMTPS id 3nybdkeqq8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 27 Feb 2023 15:11:40 +0000 Received: from smtpav03.dal12v.mail.ibm.com (smtpav03.dal12v.mail.ibm.com [10.241.53.102]) by smtprelay01.wdc07v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 31RFBckE19988890 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 27 Feb 2023 15:11:39 GMT Received: from smtpav03.dal12v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id BC2905805A; Mon, 27 Feb 2023 15:11:38 +0000 (GMT) Received: from smtpav03.dal12v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 4215F58063; Mon, 27 Feb 2023 15:11:38 +0000 (GMT) Received: from [9.77.148.130] (unknown [9.77.148.130]) by smtpav03.dal12v.mail.ibm.com (Postfix) with ESMTPS; Mon, 27 Feb 2023 15:11:38 +0000 (GMT) Message-ID: <3cad2a5e-dd68-2fbe-d52b-e077a7405623@linux.ibm.com> Date: Mon, 27 Feb 2023 09:11:37 -0600 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Thunderbird/102.8.0 Content-Language: en-US To: GCC Patches <gcc-patches@gcc.gnu.org> Cc: Segher Boessenkool <segher@kernel.crashing.org>, "Kewen.Lin" <linkw@linux.ibm.com>, David Edelsohn <dje.gcc@gmail.com>, Peter Bergner <bergner@linux.ibm.com> Subject: [PATCH, rs6000] Tweak modulo define_insns to eliminate register copy Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-TM-AS-GCONF: 00 X-Proofpoint-GUID: -BqxolcPyJfSSMYLUhL5ebB6TZRX0QKc X-Proofpoint-ORIG-GUID: pdXcmI_4sYRvW0aIW64yIKEhi7PBpXYU X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.219,Aquarius:18.0.930,Hydra:6.0.562,FMLib:17.11.170.22 definitions=2023-02-27_10,2023-02-27_01,2023-02-09_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 lowpriorityscore=0 clxscore=1015 suspectscore=0 bulkscore=0 mlxlogscore=999 malwarescore=0 spamscore=0 impostorscore=0 priorityscore=1501 mlxscore=0 adultscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2212070000 definitions=main-2302270115 X-Spam-Status: No, score=-11.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> From: Pat Haugen via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Pat Haugen <pthaugen@linux.ibm.com> Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> |
Series |
[rs6000] Tweak modulo define_insns to eliminate register copy
|
|
Commit Message
Pat Haugen
Feb. 27, 2023, 3:11 p.m. UTC
Don't force target of modulo into a distinct register. The define_insns for the modulo operation currently force the target register to a distinct reg in preparation for a possible future peephole combining div/mod. But this can lead to cases of a needless copy being inserted. Fixed with the following patch. Bootstrapped and regression tested on powerpc64le. Ok for master? -Pat 2023-02-27 Pat Haugen <pthaugen@linux.ibm.com> gcc/ * config/rs6000/rs6000.md (*mod<mode>3, umod<mode>3): Add non-earlyclobber alternative. gcc/testsuite/ * gcc.target/powerpc/mod-no_copy.c: New.
Comments
Hi! On Mon, Feb 27, 2023 at 09:11:37AM -0600, Pat Haugen wrote: > The define_insns for the modulo operation currently force the target > register > to a distinct reg in preparation for a possible future peephole combining > div/mod. But this can lead to cases of a needless copy being inserted. Fixed > with the following patch. Have you verified those peepholes still match? Do those peepholes actually improve performance? On new CPUs? The code here says ;; On machines with modulo support, do a combined div/mod the old fashioned ;; method, since the multiply/subtract is faster than doing the mod instruction ;; after a divide. but that really should not be true: we can do the div and mod in parallel (except in SMT4 perhaps, which we never schedule for anyway), so that should always be strictly faster. > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/mod-no_copy.c > @@ -0,0 +1,17 @@ > +/* { dg-do compile { target { powerpc*-*-* } } } */ All files in gcc.target/powerpc/ test for this already. Just leave off the target clause here? > +/* { dg-require-effective-target powerpc_p9modulo_ok } */ Leave out this line, because ... > +/* { dg-options "-mdejagnu-cpu=power9 -O2" } */ ... the -mcpu= forces it to true always. > +/* Verify r3 is used as source and target, no copy inserted. */ > +/* { dg-final { scan-assembler-not {\mmr\M} } } */ That is probably good enough, yeah, since the test results in only a handful of insns. Segher
On 2/27/23 11:08 AM, Segher Boessenkool wrote: > Hi! > > On Mon, Feb 27, 2023 at 09:11:37AM -0600, Pat Haugen wrote: >> The define_insns for the modulo operation currently force the target >> register >> to a distinct reg in preparation for a possible future peephole combining >> div/mod. But this can lead to cases of a needless copy being inserted. Fixed >> with the following patch. > > Have you verified those peepholes still match? Yes, I verified the peepholes still match and transform the sequence. > > Do those peepholes actually improve performance? On new CPUs? The code > here says > ;; On machines with modulo support, do a combined div/mod the old fashioned > ;; method, since the multiply/subtract is faster than doing the mod instruction > ;; after a divide. > but that really should not be true: we can do the div and mod in > parallel (except in SMT4 perhaps, which we never schedule for anyway), > so that should always be strictly faster. > Since the modulo insns were introduced in Power9, we're just talking Power9/Power10. On paper, I would agree that separate div/mod could be slightly faster to get the mod result, but if you throw in another independent div or mod in the insn stream then doing the peephole should be a clear win since that 3rd insn can execute in parallel with the initial divide as opposed to waiting for the one of the first div/mod to clear the exclusive stage of the pipe. >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/powerpc/mod-no_copy.c >> @@ -0,0 +1,17 @@ >> +/* { dg-do compile { target { powerpc*-*-* } } } */ > > All files in gcc.target/powerpc/ test for this already. Just leave off > the target clause here? > >> +/* { dg-require-effective-target powerpc_p9modulo_ok } */ > > Leave out this line, because ... > >> +/* { dg-options "-mdejagnu-cpu=power9 -O2" } */ > > ... the -mcpu= forces it to true always. Will update. -Pat > >> +/* Verify r3 is used as source and target, no copy inserted. */ > >> +/* { dg-final { scan-assembler-not {\mmr\M} } } */ > > That is probably good enough, yeah, since the test results in only a > handful of insns. > > > Segher
Hi! On Mon, Feb 27, 2023 at 02:12:23PM -0600, Pat Haugen wrote: > On 2/27/23 11:08 AM, Segher Boessenkool wrote: > >On Mon, Feb 27, 2023 at 09:11:37AM -0600, Pat Haugen wrote: > >>The define_insns for the modulo operation currently force the target > >>register > >>to a distinct reg in preparation for a possible future peephole combining > >>div/mod. But this can lead to cases of a needless copy being inserted. > >>Fixed > >>with the following patch. > > > >Have you verified those peepholes still match? > > Yes, I verified the peepholes still match and transform the sequence. Please add the testcases for that then? Or do we have tests for it already :-) > >Do those peepholes actually improve performance? On new CPUs? The code > >here says > >;; On machines with modulo support, do a combined div/mod the old fashioned > >;; method, since the multiply/subtract is faster than doing the mod > >instruction > >;; after a divide. > >but that really should not be true: we can do the div and mod in > >parallel (except in SMT4 perhaps, which we never schedule for anyway), > >so that should always be strictly faster. > > > Since the modulo insns were introduced in Power9, we're just talking > Power9/Power10. On paper, I would agree that separate div/mod could be > slightly faster to get the mod result, "Slightly". It takes 12 cycles for the two in parallel (64-bit, p9), but 17 cycles for the "cheaper" sequence (divd+mulld+subf, 12+5+2). It is all worse if the units are busy of course, or if there are other problems. > but if you throw in another > independent div or mod in the insn stream then doing the peephole should > be a clear win since that 3rd insn can execute in parallel with the > initial divide as opposed to waiting for the one of the first div/mod to > clear the exclusive stage of the pipe. That is the SMT4 case, the one we do not optimise for. SMT2 and ST can do four in parallel. This means you can start a div or mod every 2nd cycle on average, so it is very unlikely you will ever be limited by this on real code. Segher
On 2/27/23 2:53 PM, Segher Boessenkool wrote: > Hi! > > On Mon, Feb 27, 2023 at 02:12:23PM -0600, Pat Haugen wrote: >> On 2/27/23 11:08 AM, Segher Boessenkool wrote: >>> On Mon, Feb 27, 2023 at 09:11:37AM -0600, Pat Haugen wrote: >>>> The define_insns for the modulo operation currently force the target >>>> register >>>> to a distinct reg in preparation for a possible future peephole combining >>>> div/mod. But this can lead to cases of a needless copy being inserted. >>>> Fixed >>>> with the following patch. >>> >>> Have you verified those peepholes still match? >> >> Yes, I verified the peepholes still match and transform the sequence. > > Please add the testcases for that then? Or do we have tests for it > already :-) I don't see one, but can add one. >>> Do those peepholes actually improve performance? On new CPUs? The code >>> here says >>> ;; On machines with modulo support, do a combined div/mod the old fashioned >>> ;; method, since the multiply/subtract is faster than doing the mod >>> instruction >>> ;; after a divide. >>> but that really should not be true: we can do the div and mod in >>> parallel (except in SMT4 perhaps, which we never schedule for anyway), >>> so that should always be strictly faster. >>> >> Since the modulo insns were introduced in Power9, we're just talking >> Power9/Power10. On paper, I would agree that separate div/mod could be >> slightly faster to get the mod result, > > "Slightly". It takes 12 cycles for the two in parallel (64-bit, p9), > but 17 cycles for the "cheaper" sequence (divd+mulld+subf, 12+5+2). It > is all worse if the units are busy of course, or if there are other > problems. > >> but if you throw in another >> independent div or mod in the insn stream then doing the peephole should >> be a clear win since that 3rd insn can execute in parallel with the >> initial divide as opposed to waiting for the one of the first div/mod to >> clear the exclusive stage of the pipe. > > That is the SMT4 case, the one we do not optimise for. SMT2 and ST can > do four in parallel. This means you can start a div or mod every 2nd > cycle on average, so it is very unlikely you will ever be limited by > this on real code. Power9/Power10 only have 2 fixed-point divide units, and are able to issue 2 divides every 9/11 cycles (they aren't fully pipelined), with latencies of 12-24/12-25. Not saying that changes the "best case" scenario, just pointing out a lot of variables in play. -Pat
On Mon, Feb 27, 2023 at 04:03:56PM -0600, Pat Haugen wrote: > On 2/27/23 2:53 PM, Segher Boessenkool wrote: > >"Slightly". It takes 12 cycles for the two in parallel (64-bit, p9), > >but 17 cycles for the "cheaper" sequence (divd+mulld+subf, 12+5+2). It > >is all worse if the units are busy of course, or if there are other > >problems. > > > >>but if you throw in another > >>independent div or mod in the insn stream then doing the peephole should > >>be a clear win since that 3rd insn can execute in parallel with the > >>initial divide as opposed to waiting for the one of the first div/mod to > >>clear the exclusive stage of the pipe. > > > >That is the SMT4 case, the one we do not optimise for. SMT2 and ST can > >do four in parallel. This means you can start a div or mod every 2nd > >cycle on average, so it is very unlikely you will ever be limited by > >this on real code. > > Power9/Power10 only have 2 fixed-point divide units, and are able to > issue 2 divides every 9/11 cycles (they aren't fully pipelined), with > latencies of 12-24/12-25. Not saying that changes the "best case" > scenario, just pointing out a lot of variables in play. The p9 UM says in no uncertain terms there are four integer dividers (four fixed-point execution pipelines, all four capable of divides). Is that wrong then? Let's do actual tests on actual hardware :-) Segher
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index 81bffb04ceb..44f7dd509cb 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -3437,9 +3437,9 @@ (define_expand "mod<mode>3" ;; In order to enable using a peephole2 for combining div/mod to eliminate the ;; mod, prefer putting the result of mod into a different register (define_insn "*mod<mode>3" - [(set (match_operand:GPR 0 "gpc_reg_operand" "=&r") - (mod:GPR (match_operand:GPR 1 "gpc_reg_operand" "r") - (match_operand:GPR 2 "gpc_reg_operand" "r")))] + [(set (match_operand:GPR 0 "gpc_reg_operand" "=&r,r") + (mod:GPR (match_operand:GPR 1 "gpc_reg_operand" "r,r") + (match_operand:GPR 2 "gpc_reg_operand" "r,r")))] "TARGET_MODULO" "mods<wd> %0,%1,%2" [(set_attr "type" "div") @@ -3447,9 +3447,9 @@ (define_insn "*mod<mode>3" (define_insn "umod<mode>3" - [(set (match_operand:GPR 0 "gpc_reg_operand" "=&r") - (umod:GPR (match_operand:GPR 1 "gpc_reg_operand" "r") - (match_operand:GPR 2 "gpc_reg_operand" "r")))] + [(set (match_operand:GPR 0 "gpc_reg_operand" "=&r,r") + (umod:GPR (match_operand:GPR 1 "gpc_reg_operand" "r,r") + (match_operand:GPR 2 "gpc_reg_operand" "r,r")))] "TARGET_MODULO" "modu<wd> %0,%1,%2" [(set_attr "type" "div") diff --git a/gcc/testsuite/gcc.target/powerpc/mod-no_copy.c b/gcc/testsuite/gcc.target/powerpc/mod-no_copy.c new file mode 100644 index 00000000000..91e3003b3fc --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/mod-no_copy.c @@ -0,0 +1,17 @@ +/* { dg-do compile { target { powerpc*-*-* } } } */ +/* { dg-require-effective-target powerpc_p9modulo_ok } */ +/* { dg-options "-mdejagnu-cpu=power9 -O2" } */ + +/* Verify r3 is used as source and target, no copy inserted. */ + +long foo (long a, long b) +{ + return (a % b); +} + +unsigned long foo2 (unsigned long a, unsigned long b) +{ + return (a % b); +} + +/* { dg-final { scan-assembler-not {\mmr\M} } } */