Message ID | 20220401112305.GA19166@delia |
---|---|
State | Committed |
Commit | 065e25f6331c130bc3cd2ce78036f2328adb3d71 |
Headers |
Date: Fri, 1 Apr 2022 13:23:06 +0200
From: Tom de Vries via Gcc-patches <gcc-patches@gcc.gnu.org>
Reply-To: Tom de Vries <tdevries@suse.de>
To: gcc-patches@gcc.gnu.org
Cc: Thomas Schwinge <thomas@schwinge.name>
Subject: [committed][libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
Message-ID: <20220401112305.GA19166@delia> |
Series |
[committed,libgomp,testsuite,nvptx] Fix dg-output test in vector-length-128-7.c
|
|
Commit Message
Tom de Vries
April 1, 2022, 11:23 a.m. UTC
Hi,

When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
RTX A2000 (sm_86) with driver 510.60.02 I run into:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
  -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
  output pattern test
...

The failing check verifies the launch dimensions:
...
/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
     launch gangs=1, workers=8, vectors=128" } */
...
which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
is 6:
...
  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
...

This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
'a launch configuration with reasonable occupancy') printed just before:
...
cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
...
[ Note: 6 * 128 == 768. ]

Fix this by updating the check to allow num_workers in the range 1 to 8.

Tested on x86_64 with nvptx accelerator.

Committed to trunk.

Thanks,
- Tom

[libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c

libgomp/ChangeLog:

2022-04-01  Tom de Vries  <tdevries@suse.de>

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
	num_workers check.

---
 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
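The note "6 * 128 == 768" can be spelled out as a small calculation. The following is only a sketch, not libgomp code: the helper name workers_for_block is hypothetical, and it assumes that the worker count is simply the suggested block size divided by the vector length, with at least one worker.

```c
/* Hypothetical helper (not part of libgomp): number of workers that
   fit into a thread block of BLOCK_SIZE threads when each worker is
   VECTOR_LENGTH threads wide.  Assumes plain integer division and a
   lower bound of one worker.  */
static int
workers_for_block (int block_size, int vector_length)
{
  int workers = block_size / vector_length;
  return workers < 1 ? 1 : workers;
}
```

Under that assumption, a suggested block size of 768 with vector length 128 gives 6 workers, and a block size of 1024 gives the 8 workers that the original check expected.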
Comments
Hi Tom!

On 2022-04-01T13:23:06+0200, Tom de Vries <tdevries@suse.de> wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
>   -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
>   output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
>      launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
>   nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]

I had a while ago observed, and now finally looked into a similar case
with Nvidia TITAN V, Driver 455.23.05, GCC/nvptx default multilib.

Looking at 'GOMP_DEBUG=1' output:

'-O2'; all good:

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

... vs. '-O0'; similar to your report:

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

..., so I would've suggested:

> Fix this by updating the check to allow num_workers in the range 1 to 8.

... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time. Are you OK if I make
the obvious change?

But that said...

We might also generally classify this as a regression, because when using
the GCC/nvptx '-mptx=3.1' multilib instead of the default multilib
('-foffload-options=nvptx-none=-mptx=3.1'), I see:

'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

..., but also for '-O0'; all good:

    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

Are you able to reproduce that?

Follows '-O0' word-diff between GCC/nvptx default vs. '-mptx=3.1'
multilib:

    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

Notice that the GCC/nvptx default multilib uses 33 registers vs. the
'-mptx=3.1' multilib uses 30 registers! (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}').

If that's useful, 'diff' of the PTX code that gets loaded to the GPU:

     // BEGIN PREAMBLE
    -.version 6.0
    +.version 3.1
     .target sm_30
     .address_size 64
     // END PREAMBLE
    @@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
     add.u64 %r109,%r109,8;
     @ %r111 bra.uni $L11;
     $L19:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L18:
    -barrier.sync.aligned 0;
    +bar.sync 0;
     // forked 2;
     @ %r113 bra $L12;
     cvta.shared.u64 %r101,__oacc_bcast;
    @@ -179,7 +187,15 @@ mov.u32 %r22,0;
     mov.u32 %r29,1;
     mov.u32 %r30,%ntid.y;
     $L12:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L7:
     @ %r113 bra $L13;
     mov.u32 %r23,%tid.y;
    @@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
     selp.u32 %r114,1,0,%r62;
     st.u32 [%r93],%r114;
     $L13:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r115,[%r93];
     setp.ne.u32 %r62,%r115,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r62 bra.uni $L2;
     $L6:
     @ %r113 bra $L14;
    @@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
     st.u32 [%r95+40],%r31;
     st.u32 [%r95+44],%r34;
     $L14:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     // forked 4;
     mov.u64 %r87,%r93;
     mov.u64 %r89,%frame;
    @@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
     mov.u32 %r56,%r37;
     st.u32 [%frame+8],%r56;
     // joining 4;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     // join 4;
     @ %r113 bra $L15;
     add.u32 %r23,%r23,%r30;
    @@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
     selp.u32 %r116,1,0,%r84;
     st.u32 [%r93],%r116;
     $L15:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r117,[%r93];
     setp.ne.u32 %r84,%r117,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r84 bra.uni $L6;
     $L2:
     @ %r113 bra $L16;
    @@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
     selp.u32 %r118,1,0,%r85;
     st.u32 [%r93],%r118;
     $L16:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r119,[%r93];
     setp.ne.u32 %r85,%r119,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r85 bra.uni $L7;
     @ %r113 bra $L17;
     mov.u32 %r86,4;
     st.u32 [%frame+4],%r86;
     // joining 2;
     $L17:
    -bar.warp.sync 0xffffffff;
    -barrier.sync.aligned 0;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync 0;
     // join 2;
     ret;
     }

Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation? Does it follow that we should be doing something different
in the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?


Regards
 Thomas


> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01  Tom de Vries  <tdevries@suse.de>
>
> 	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
> 	num_workers check.
>
> ---
>  libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
index 4a8c1bf549e..92b3de03636 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
@@ -37,4 +37,4 @@ main (void)
 }
 
 /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
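
To illustrate which launch lines the relaxed pattern accepts, here is a sketch that uses POSIX extended regular expressions as a stand-in for the Tcl regexps that DejaGnu applies to dg-output. The names launch_pattern and launch_line_matches are hypothetical and exist only for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Stand-in for the relaxed dg-output pattern, written as a POSIX
   extended regular expression.  This is an assumption for the sake of
   the example; the testsuite itself uses Tcl regexps.  */
static const char launch_pattern[] =
  "nvptx_exec: kernel main\\$_omp_fn\\$0: "
  "launch gangs=1, workers=[1-8], vectors=128";

/* Return 1 if LINE contains a match for launch_pattern, 0 otherwise
   (including when the pattern fails to compile).  */
static int
launch_line_matches (const char *line)
{
  regex_t re;
  if (regcomp (&re, launch_pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return 0;
  int matched = regexec (&re, line, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}
```

With this stand-in, both the 'workers=6' and the 'workers=8' launch lines from the logs above match, while a line reporting 'workers=9' or 'workers=16' does not.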