[committed,libgomp,testsuite,nvptx] Fix dg-output test in vector-length-128-7.c

Message ID 20220401112305.GA19166@delia
State Committed
Commit 065e25f6331c130bc3cd2ce78036f2328adb3d71

Commit Message

Tom de Vries April 1, 2022, 11:23 a.m. UTC
  Hi,

When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
RTX A2000 (sm_86) with driver 510.60.02 I run into:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
  -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  \
  output pattern test
...

The failing check verifies the launch dimensions:
...
/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
                launch gangs=1, workers=8, vectors=128" } */
...
which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
is 6:
...
  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
...

This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
'a launch configuration with reasonable occupancy') printed just before:
...
cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
...
[ Note: 6 * 128 == 768. ]
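
The note above can be read as a simple division: the number of workers that fit is the suggested block size divided by the vector length. The following is a minimal Python sketch of that arithmetic only. The helper name `workers_for_block` is hypothetical, and this is not the libgomp plugin's actual code.

```python
def workers_for_block(block_size: int, vector_length: int) -> int:
    """Number of whole vectors (workers) that fit in one thread block.

    Hypothetical helper illustrating the arithmetic in the note above;
    it is not taken from libgomp.
    """
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    return block_size // vector_length


# Block size 768 as suggested by cuOccupancyMaxPotentialBlockSize, vectors=128.
print(workers_for_block(768, 128))   # 6
# A block size of 1024 would give the 8 workers the old check expected.
print(workers_for_block(1024, 128))  # 8
```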

Fix this by updating the check to allow num_workers in the range 1 to 8.
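
As a rough illustration of what the relaxed check accepts, here is a small Python sketch. It assumes that, after Tcl unescaping, the dg-output string reduces to the regular expression shown in `PATTERN`. DejaGnu matches with Tcl regular expressions, so Python's `re` module is only a stand-in here, and the helper name `launch_line_ok` is hypothetical.

```python
import re

# Assumed form of the dg-output pattern after Tcl unescaping: "\\\$" in the
# directive becomes "\$" (a literal dollar sign) and "\[1-8\]" becomes the
# bracket expression "[1-8]".
PATTERN = re.compile(
    r"nvptx_exec: kernel main\$_omp_fn\$0: "
    r"launch gangs=1, workers=[1-8], vectors=128"
)


def launch_line_ok(line: str) -> bool:
    """Return True if a GOMP_DEBUG launch line satisfies the relaxed check."""
    return PATTERN.search(line) is not None


line = "  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128"
print(launch_line_ok(line))  # True
```

Under these assumptions, any single-digit worker count from 1 to 8 is accepted, while values such as 9 or 12 are still rejected because the comma must directly follow the digit.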

Tested on x86_64 with nvptx accelerator.

Committed to trunk.

Thanks,
- Tom

[libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c

libgomp/ChangeLog:

2022-04-01  Tom de Vries  <tdevries@suse.de>

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
	num_workers check.

---
 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Thomas Schwinge May 4, 2022, 8:40 a.m. UTC | #1
Hi Tom!

On 2022-04-01T13:23:06+0200, Tom de Vries <tdevries@suse.de> wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
>   -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  \
>   output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
>                 launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
>   nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]

A while ago I observed a similar case, and have now finally looked into it:
Nvidia TITAN V, driver 455.23.05, GCC/nvptx default multilib.
Looking at 'GOMP_DEBUG=1' output:

'-O2'; all good:

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

... vs. '-O0'; similar to your report:

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

..., so I would've suggested:

> Fix this by updating the check to allow num_workers in the range 1 to 8.

... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time.  Are you OK if I make that
obvious change?


But that said...  We might also generally classify this as a regression,
because when using the GCC/nvptx '-mptx=3.1' multilib instead of the default
multilib ('-foffload-options=nvptx-none=-mptx=3.1'), I see:

'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

..., but also for '-O0'; all good:

    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

Are you able to reproduce that?

Here is the '-O0' word-diff between the GCC/nvptx default and '-mptx=3.1'
multilibs:

    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

Notice that the GCC/nvptx default multilib uses 33 registers, whereas the
'-mptx=3.1' multilib uses 30 registers!  (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}').
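
One possible way to see why three registers could matter is a simplified occupancy model, sketched below in Python. Everything hardware-specific here is an assumption that does not come from this thread: 65536 registers and 2048 resident threads per multiprocessor, four register-file sub-partitions, and per-warp register allocation in units of 256. The search loop (largest block size first, ties kept at the larger size) is likewise only a rough reading of what an occupancy heuristic might do, not the driver's implementation of cuOccupancyMaxPotentialBlockSize. The function name `max_potential_block_size` is hypothetical.

```python
def max_potential_block_size(
    regs_per_thread: int,
    *,
    warp_size: int = 32,
    max_block_size: int = 1024,
    max_threads_per_sm: int = 2048,  # assumed device limit
    regs_per_sm: int = 65536,        # assumed register file size per SM
    sub_partitions: int = 4,         # assumed number of SM sub-partitions
    reg_alloc_unit: int = 256,       # assumed per-warp allocation granularity
) -> int:
    """Block size with the most resident threads per SM in this toy model."""
    # Registers reserved per warp, rounded up to the allocation unit.
    raw = regs_per_thread * warp_size
    regs_per_warp = (raw + reg_alloc_unit - 1) // reg_alloc_unit * reg_alloc_unit

    # Resident warps per SM as limited by registers and by the thread limit.
    regs_per_partition = regs_per_sm // sub_partitions
    warps_by_regs = (regs_per_partition // regs_per_warp) * sub_partitions
    warps_by_threads = max_threads_per_sm // warp_size
    warps_per_sm = min(warps_by_regs, warps_by_threads)

    best_block, best_threads = 0, 0
    # Try the largest block size first; only a strictly better result wins.
    for block in range(max_block_size, 0, -warp_size):
        resident_blocks = warps_per_sm // (block // warp_size)
        threads = resident_blocks * block
        if threads > best_threads:
            best_block, best_threads = block, threads
    return best_block


for regs in (33, 30):
    block = max_potential_block_size(regs)
    print(regs, block, block // 128)  # registers, block size, workers
```

Under these assumptions, 33 registers round up to 40 per thread, so two blocks of 768 threads keep more threads resident than one block of 1024, whereas 30 registers round up to 32 and two blocks of 1024 fit. That agrees with the 'block = 768' vs. 'block = 1024' values logged above, but it remains a model rather than a confirmed explanation.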

In case that's useful, here is the 'diff' of the PTX code that gets loaded to the GPU:

     // BEGIN PREAMBLE
    -.version 6.0
    +.version 3.1
     .target sm_30
     .address_size 64
     // END PREAMBLE
    @@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
     add.u64 %r109,%r109,8;
     @ %r111 bra.uni $L11;
     $L19:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L18:
    -barrier.sync.aligned 0;
    +bar.sync 0;
     // forked 2;
     @ %r113 bra $L12;
     cvta.shared.u64 %r101,__oacc_bcast;
    @@ -179,7 +187,15 @@ mov.u32 %r22,0;
     mov.u32 %r29,1;
     mov.u32 %r30,%ntid.y;
     $L12:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L7:
     @ %r113 bra $L13;
     mov.u32 %r23,%tid.y;
    @@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
     selp.u32 %r114,1,0,%r62;
     st.u32 [%r93],%r114;
     $L13:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r115,[%r93];
     setp.ne.u32 %r62,%r115,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r62 bra.uni $L2;
     $L6:
     @ %r113 bra $L14;
    @@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
     st.u32 [%r95+40],%r31;
     st.u32 [%r95+44],%r34;
     $L14:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     // forked 4;
     mov.u64 %r87,%r93;
     mov.u64 %r89,%frame;
    @@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
     mov.u32 %r56,%r37;
     st.u32 [%frame+8],%r56;
     // joining 4;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     // join 4;
     @ %r113 bra $L15;
     add.u32 %r23,%r23,%r30;
    @@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
     selp.u32 %r116,1,0,%r84;
     st.u32 [%r93],%r116;
     $L15:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r117,[%r93];
     setp.ne.u32 %r84,%r117,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r84 bra.uni $L6;
     $L2:
     @ %r113 bra $L16;
    @@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
     selp.u32 %r118,1,0,%r85;
     st.u32 [%r93],%r118;
     $L16:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r119,[%r93];
     setp.ne.u32 %r85,%r119,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r85 bra.uni $L7;
     @ %r113 bra $L17;
     mov.u32 %r86,4;
     st.u32 [%frame+4],%r86;
     // joining 2;
     $L17:
    -bar.warp.sync 0xffffffff;
    -barrier.sync.aligned 0;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync 0;
     // join 2;
     ret;
     }

Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation?  Does it follow that we should be doing something different
in the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?


Regards
 Thomas


> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01  Tom de Vries  <tdevries@suse.de>
>
>       * testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
>       num_workers check.
>
> ---
>  libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
  

Patch

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
index 4a8c1bf549e..92b3de03636 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
@@ -37,4 +37,4 @@  main (void)
 }
 
 /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */