[committed,nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt

Message ID 20220219190716.GA1470@delia.home
State Committed
Commit 8e5c34ab45f34aadea65c5ba33ec685264b6ec66
Series [committed,nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt

Commit Message

Tom de Vries Feb. 19, 2022, 7:07 p.m. UTC
  Hi,

With the default ptx isa 6.0, we have for uniform-simt-1.c:
...
        @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
                shfl.sync.idx.b32       %r26, %r26, %r32, 31, 0xffffffff;
...

The atomic insn is predicated by -muniform-simt, and the subsequent insn does
a warp sync, at which point the warp is uniform again.

But with -mptx=3.1, we have instead:
...
        @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
                shfl.idx.b32    %r26, %r26, %r32, 31;
...

The shfl does not sync the warp, and we want the warp to go back to executing
uniformly as soon as possible.  We cannot enforce this, but we can at least
check it using nvptx_uniform_warp_check, similar to how that is done for
OpenACC.

Likewise, detect the case where no shfl insn is emitted, and add an
nvptx_uniform_warp_check or nvptx_warpsync.

Committed to trunk.

Thanks,
- Tom

[nvptx] Use nvptx_warpsync / nvptx_uniform_warp_check for -muniform-simt

gcc/ChangeLog:

2022-02-19  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.cc (nvptx_unisimt_handle_set): Change return
	type to bool.
	(nvptx_reorg_uniform_simt): Insert nvptx_uniform_warp_check or
	nvptx_warpsync, if necessary.

gcc/testsuite/ChangeLog:

2022-02-19  Tom de Vries  <tdevries@suse.de>

	* gcc.target/nvptx/uniform-simt-1.c: Add scan-assembler test.
	* gcc.target/nvptx/uniform-simt-2.c: New test.

---
 gcc/config/nvptx/nvptx.cc                       | 34 ++++++++++++++++++++++---
 gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c |  1 +
 gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c | 20 +++++++++++++++
 3 files changed, 52 insertions(+), 3 deletions(-)
  

Comments

Thomas Schwinge Feb. 23, 2022, 9:06 a.m. UTC | #1
Hi Tom!

This is me again, following along GCC/nvptx development, and asking
questions.  ;-)

On 2022-02-19T20:07:18+0100, Tom de Vries via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> With the default ptx isa 6.0, we have for uniform-simt-1.c:
> ...
>         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>                 shfl.sync.idx.b32       %r26, %r26, %r32, 31, 0xffffffff;
> ...
>
> The atomic insn is predicated by -muniform-simt, and the subsequent insn does
> a warp sync, at which point the warp is uniform again.

I understand the concern here is Independent Thread Scheduling, where the
execution of predicated-off threads of a warp ('@ ! %r33') may proceed
with the next instruction, 'shfl', without implicitly waiting for the
other threads of a warp still working on the 'atom'?  Hence, the 'sync'
aspect of 'shfl.sync', as a means that PTX provides at the ISA level such
that we're getting the desired semantics: as its first step, "wait for
all threads in membermask to arrive".

> But with -mptx=3.1, we have instead:
> ...
>         @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>                 shfl.idx.b32    %r26, %r26, %r32, 31;
> ...
>
> The shfl does not sync the warp, and we want the warp to go back to executing
> uniformly asap.  We cannot enforce this

Is it really the case that such code may cause "permanent" warp-divergent
execution (until re-converging "somewhere")?  My understanding has been
that predicated-off threads of a warp ('@ ! %r33') would simply idle,
implicitly waiting for the other threads of a warp still working on the
'atom' -- due to the nature of a shared program counter per warp, and the
desire to re-converge as soon as possible.

For example, PTX ISA 7.2, 3.1. "A Set of SIMT Multiprocessors":

| [...]
| At every instruction issue time, the SIMT unit selects a warp that is ready to execute and
| issues the next instruction to the active threads of the warp. A warp executes one common
| instruction at a time, so full efficiency is realized when all threads of a warp agree on their
| execution path. If threads of a warp diverge via a data-dependent conditional branch, the
| warp serially executes each branch path taken, disabling threads that are not on that path,
| and when all paths complete, the threads converge back to the same execution path. [...]

So I'd have assumed that after the potentially-diverging
'@%r33'-predicated 'atom' instruction, we're implicitly re-converging for
the unpredicated 'shfl' (as long as Independent Thread Scheduling isn't
involved, which it isn't for '-mptx=3.1')?

As I'm understanding you, my understanding is not correct, and we may
thus be getting "permanent" warp-divergent execution as soon as there's
any predication/conditional involved that may evaluate differently for
individual threads of a warp, and we thus need such *explicit*
synchronization after all such instances?

> but at least check this using
> nvptx_uniform_warp_check, similar to how that is done for openacc.
>
> Likewise, detect the case that no shfl insn is emitted, and add a
> nvptx_uniform_warp_check or nvptx_warpsync.

For example, 'nvptx-none/mgomp/libatomic/cas_1_.o':

    [...]
     @ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
    +{
    +.reg .b32 act;
    +vote.ballot.b32 act,1;
    +.reg .pred uni;
    +setp.eq.b32 uni,act,0xffffffff;
    +@ ! uni trap;
    +@ ! uni exit;
    +}
     mov.b64 {%r69,%r70},%r62;
     shfl.idx.b32 %r69,%r69,%r68,31;
     shfl.idx.b32 %r70,%r70,%r68,31;
    [...]

So that's basically an 'assert' that all threads of a warp are converged.
(Is the JIT maybe even able to optimize that out?)  I guess I just wonder
if that's not satisfied implicitly.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Tom de Vries Feb. 23, 2022, 10:45 a.m. UTC | #2
On 2/23/22 10:06, Thomas Schwinge wrote:
> Hi Tom!
> 
> This is me again, following along GCC/nvptx development, and asking
> questions.  ;-)
> 

Yes, thanks for that, that's useful :)

> On 2022-02-19T20:07:18+0100, Tom de Vries via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>> With the default ptx isa 6.0, we have for uniform-simt-1.c:
>> ...
>>          @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>>                  shfl.sync.idx.b32       %r26, %r26, %r32, 31, 0xffffffff;
>> ...
>>
>> The atomic insn is predicated by -muniform-simt, and the subsequent insn does
>> a warp sync, at which point the warp is uniform again.
> 
> I understand the concern here is Independent Thread Scheduling, where the
> execution of predicated-off threads of a warp ('@ ! %r33') may proceed
> with the next instruction, 'shfl', without implicitly waiting for the
> other threads of a warp still working on the 'atom'?  Hence, the 'sync'
> aspect of 'shfl.sync', as a means that PTX provides at the ISA level such
> that we're getting the desired semantics: as its first step, "wait for
> all threads in membermask to arrive".
> 

Indeed.

>> But with -mptx=3.1, we have instead:
>> ...
>>          @%r33   atom.global.cas.b32     %r26, [a], %r28, %r29;
>>                  shfl.idx.b32    %r26, %r26, %r32, 31;
>> ...
>>
>> The shfl does not sync the warp, and we want the warp to go back to executing
>> uniformly asap.  We cannot enforce this
> 
> Is it really the case that such code may cause "permanent" warp-divergent
> execution (until re-converging "somewhere")?  My understanding has been
> that predicated-off threads of a warp ('@ ! %r33') would simply idle,
> implicitly waiting for the other threads of a warp still working on the
> 'atom' -- due to the nature of a shared program counter per warp, and the
> desire to re-converge as soon as possible.
> 
> For example, PTX ISA 7.2, 3.1. "A Set of SIMT Multiprocessors":
> 
> | [...]
> | At every instruction issue time, the SIMT unit selects a warp that is ready to execute and
> | issues the next instruction to the active threads of the warp. A warp executes one common
> | instruction at a time, so full efficiency is realized when all threads of a warp agree on their
> | execution path. If threads of a warp diverge via a data-dependent conditional branch, the
> | warp serially executes each branch path taken, disabling threads that are not on that path,
> | and when all paths complete, the threads converge back to the same execution path. [...]
> 
> So I'd have assumed that after the potentially-diverging
> '@%r33'-predicated 'atom' instruction, we're implicitly re-converging for
> the unpredicated 'shfl' (as long as Independent Thread Scheduling isn't
> involved, which it isn't for '-mptx=3.1')?
> 
> As I'm understanding you, my understanding is not correct, and we may
> thus be getting "permanent" warp-divergent execution as soon as there's
> any predication/conditional involved that may evaluate differently for
> individual threads of a warp, and we thus need such *explicit*
> synchronization after all such instances?
> 

Reading the ptx manual, I think your interpretation of what _should_ 
happen is right.

Regardless, the JIT is still free to translate, say, a block of equally 
predicated insns using a branch as long as it inserts a warp sync right 
after.  And then there might be a JIT bug that optimizes that sync away, 
or shifts it further out, past the shfl.

So perhaps the rationale should have been formulated more in terms of 
the shfl.  Note btw that it's possible that there's a compiler bug that 
does a diverging branch earlier, which would give problems for the shfl, 
and which the check would catch.

Note that the uniform-warp-check insn doesn't enforce convergence.  It 
only checks that the warp is convergent.

So, if the warp is not convergent, the check will abort.

If the warp is convergent, the JIT optimizer is free to optimize the 
check away.

And sometimes we have seen that adding the check makes the warp 
convergent (as in: preventing some JIT bug from triggering).

Anyway, unfortunately at this point I don't remember whether I found a 
smoking gun specifically for openmp.

Thanks,
- Tom

>> but at least check this using
>> nvptx_uniform_warp_check, similar to how that is done for openacc.
>>
>> Likewise, detect the case that no shfl insn is emitted, and add a
>> nvptx_uniform_warp_check or nvptx_warpsync.
> 
> For example, 'nvptx-none/mgomp/libatomic/cas_1_.o':
> 
>      [...]
>       @ %r71 atom.cas.b64 %r62,[%r35],%r29,%r61;
>      +{
>      +.reg .b32 act;
>      +vote.ballot.b32 act,1;
>      +.reg .pred uni;
>      +setp.eq.b32 uni,act,0xffffffff;
>      +@ ! uni trap;
>      +@ ! uni exit;
>      +}
>       mov.b64 {%r69,%r70},%r62;
>       shfl.idx.b32 %r69,%r69,%r68,31;
>       shfl.idx.b32 %r70,%r70,%r68,31;
>      [...]
> 
> So that's basically an 'assert' that all threads of a warp are converged.
> (Is the JIT maybe even able to optimize that out?)  I guess I just wonder
> if that's not satisfied implicitly.
> 
> 
> Grüße
>   Thomas
> 
> 
  

Patch

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index afbad5bdde6..4942f1100da 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -3248,12 +3248,18 @@  nvptx_call_insn_is_syscall_p (rtx_insn *insn)
 /* If SET subexpression of INSN sets a register, emit a shuffle instruction to
    propagate its value from lane MASTER to current lane.  */
 
-static void
+static bool
 nvptx_unisimt_handle_set (rtx set, rtx_insn *insn, rtx master)
 {
   rtx reg;
   if (GET_CODE (set) == SET && REG_P (reg = SET_DEST (set)))
-    emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX), insn);
+    {
+      emit_insn_after (nvptx_gen_shuffle (reg, reg, master, SHUFFLE_IDX),
+		       insn);
+      return true;
+    }
+
+  return false;
 }
 
 /* Adjust code for uniform-simt code generation variant by making atomics and
@@ -3275,8 +3281,30 @@  nvptx_reorg_uniform_simt ()
 	continue;
       rtx pat = PATTERN (insn);
       rtx master = nvptx_get_unisimt_master ();
+      bool shuffle_p = false;
       for (int i = 0; i < XVECLEN (pat, 0); i++)
-	nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+	shuffle_p
+	  |= nvptx_unisimt_handle_set (XVECEXP (pat, 0, i), insn, master);
+      if (shuffle_p && TARGET_PTX_6_0)
+	{
+	  /* The shuffle is a sync, so uniformity is guaranteed.  */
+	}
+      else
+	{
+	  if (TARGET_PTX_6_0)
+	    {
+	      gcc_assert (!shuffle_p);
+	      /* Emit after the insn, to guarantee uniformity.  */
+	      emit_insn_after (gen_nvptx_warpsync (), insn);
+	    }
+	  else
+	    {
+	      /* Emit after the insn (and before the shuffle, if there are any)
+		 to check uniformity.  */
+	      emit_insn_after (gen_nvptx_uniform_warp_check (), insn);
+	    }
+	}
+
       rtx pred = nvptx_get_unisimt_predicate ();
       pred = gen_rtx_NE (BImode, pred, const0_rtx);
       pat = gen_rtx_COND_EXEC (VOIDmode, pred, pat);
diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
index 1bc0adae014..77cffc40a66 100644
--- a/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
+++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-1.c
@@ -16,3 +16,4 @@  f (void)
 }
 
 /* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
+/* { dg-final { scan-assembler-times "shfl.sync.idx.b32" 1 } } */
diff --git a/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
new file mode 100644
index 00000000000..0f1e4e780fe
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/uniform-simt-2.c
@@ -0,0 +1,20 @@ 
+/* { dg-options "-O2 -muniform-simt -mptx=3.1" } */
+
+enum memmodel
+{
+  MEMMODEL_RELAXED = 0,
+};
+
+int a = 0;
+
+int
+f (void)
+{
+  int expected = 1;
+  return __atomic_compare_exchange_n (&a, &expected, 0, 0, MEMMODEL_RELAXED,
+				      MEMMODEL_RELAXED);
+}
+
+/* { dg-final { scan-assembler-times "@%r\[0-9\]*\tatom.global.cas" 1 } } */
+/* { dg-final { scan-assembler-times "shfl.idx.b32" 1 } } */
+/* { dg-final { scan-assembler-times "vote.ballot.b32" 1 } } */