[RFC,nvptx] Initialize ptx regs

Message ID 20220220224650.GA10763@delia.home
State Committed
Commit 02aedc6f269b5e3c1f354edcf5b84d27b0a15946
Series [RFC,nvptx] Initialize ptx regs

Commit Message

Tom de Vries Feb. 20, 2022, 10:46 p.m. UTC
  Hi,

With the nvptx target, driver version 510.47.03 and a GT 1030 board, we run into:
...
FAIL: gcc.c-torture/execute/pr53465.c -O1 execution test
FAIL: gcc.c-torture/execute/pr53465.c -O2 execution test
FAIL: gcc.c-torture/execute/pr53465.c -O3 -g execution test
...
while the test-cases pass with nvptx-none-run -O0.

The problem is that the generated ptx contains a read from an uninitialized
ptx register, and the driver JIT doesn't handle this well.

For -O2 and -O3, we can get rid of the FAIL using --param
logical-op-non-short-circuit=0.  But not for -O1.

At -O1, the test-case minimizes to:
...
void __attribute__((noinline, noclone))
foo (int y) {
  int c;
  for (int i = 0; i < y; i++)
    {
      int d = i + 1;
      if (i && d <= c)
        __builtin_abort ();
      c = d;
    }
}

int main () {
  foo (2); return 0;
}
...

Note that the test-case does not contain an uninitialized use.  In the first
iteration, i is 0 and consequently c is not read.  In the second iteration, c
is read, but by that time it's already initialized by 'c = d' from the first
iteration.

AFAICT the problem is introduced as follows: the conditional use of c in the
loop body is translated into an unconditional use of c in the loop header:
...
  # c_1 = PHI <c_4(D)(2), c_9(6)>
...
into which forwprop1 propagates the 'c_9 = d_7' assignment, giving:
...
  # c_1 = PHI <c_4(D)(2), d_7(6)>
...
which ends up being translated by expand into an unconditional:
...
(insn 13 12 0 (set (reg/v:SI 22 [ c ])
        (reg/v:SI 23 [ d ])) -1
     (nil))
...
at the start of the loop body, creating an uninitialized read of d on the
path from loop entry.

By disabling coalesce_ssa_name, we get the more usual copies on the incoming
edges.  The copy on the loop entry path still does an uninitialized read, but
that one's now initialized by init-regs.  The test-case passes, also when
disabling init-regs, so it's possible that the JIT driver doesn't object to
this type of uninitialized read.

Now that we characterized the problem to some degree, we need to fix this,
because either:
- we're violating an undocumented ptx invariant, and this is a compiler bug,
  or
- this is a driver JIT bug and we need to work around it.

There are essentially two strategies to address this:
- stop the compiler from creating uninitialized reads
- patch up uninitialized reads using additional initialization

The former will probably involve:
- making some optimizations more conservative in the presence of
  uninitialized reads, and
- disabling some other optimizations (where making them more conservative is
  not possible, or cannot easily be achieved).
This will probably have a cost penalty for code that does not suffer from
the original problem.

The latter has the problem that it may paper over uninitialized reads
in the source code, or indeed over ones that were incorrectly introduced
by the compiler.  But it has the advantage that it allows for the problem to
be addressed at a single location.

There's an existing pass, init-regs, which implements a form of the latter,
but it doesn't work for this example because it only inserts additional
initialization for uses that have no reaching definition at all, whereas here
the use is reached by a definition along the loop back edge.

Fix this by adding initialization of uninitialized ptx regs in reorg.

Control the new functionality using -minit-regs=<0|1|2|3>, meaning:
- 0: disabled.
- 1: add initialization of all regs at the entry bb
- 2: add initialization of uninitialized regs at the entry bb
- 3: add initialization of uninitialized regs close to the use
and defaulting to 3.

Tested on nvptx.

Any comments?

Thanks,
- Tom

[nvptx] Initialize ptx regs

gcc/ChangeLog:

2022-02-17  Tom de Vries  <tdevries@suse.de>

	PR target/104440
	* config/nvptx/nvptx.cc (workaround_uninit_method_1)
	(workaround_uninit_method_2, workaround_uninit_method_3)
	(workaround_uninit): New function.
	(nvptx_reorg): Use workaround_uninit.
	* config/nvptx/nvptx.opt (minit-regs): New option.

---
 gcc/config/nvptx/nvptx.cc  | 188 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/config/nvptx/nvptx.opt |   4 +
 2 files changed, 192 insertions(+)
  

Comments

Richard Biener Feb. 21, 2022, 7:54 a.m. UTC | #1
On Sun, Feb 20, 2022 at 11:50 PM Tom de Vries via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> With the nvptx target, driver version 510.47.03 and a GT 1030 board, we run into:
> ...
> FAIL: gcc.c-torture/execute/pr53465.c -O1 execution test
> FAIL: gcc.c-torture/execute/pr53465.c -O2 execution test
> FAIL: gcc.c-torture/execute/pr53465.c -O3 -g execution test
> ...
> while the test-cases pass with nvptx-none-run -O0.
>
> The problem is that the generated ptx contains a read from an uninitialized
> ptx register, and the driver JIT doesn't handle this well.
>
> For -O2 and -O3, we can get rid of the FAIL using --param
> logical-op-non-short-circuit=0.  But not for -O1.
>
> At -O1, the test-case minimizes to:
> ...
> void __attribute__((noinline, noclone))
> foo (int y) {
>   int c;
>   for (int i = 0; i < y; i++)
>     {
>       int d = i + 1;
>       if (i && d <= c)
>         __builtin_abort ();
>       c = d;
>     }
> }
>
> int main () {
>   foo (2); return 0;
> }
> ...
>
> Note that the test-case does not contain an uninitialized use.  In the first
> iteration, i is 0 and consequently c is not read.  In the second iteration, c
> is read, but by that time it's already initialized by 'c = d' from the first
> iteration.
>
> AFAICT the problem is introduced as follows: the conditional use of c in the
> loop body is translated into an unconditional use of c in the loop header:
> ...
>   # c_1 = PHI <c_4(D)(2), c_9(6)>
> ...
> which forwprop1 propagates the 'c_9 = d_7' assignment into:
> ...
>   # c_1 = PHI <c_4(D)(2), d_7(6)>
> ...
> which ends up being translated by expand into an unconditional:
> ...
> (insn 13 12 0 (set (reg/v:SI 22 [ c ])
>         (reg/v:SI 23 [ d ])) -1
>      (nil))
> ...
> at the start of the loop body, creating an uninitialized read of d on the
> path from loop entry.

Ah, interesting case.  Note that some fixup pass inserted a copy in
the loop header
before coalescing:

;;   basic block 3, loop depth 1
;;    pred:       6
;;                2
  # c_10 = PHI <d_6(6), c_3(D)(2)>
  # i_11 = PHI <d_6(6), 0(2)>
  c_2 = c_10;               <----------------------- this one
  i_8 = i_11;
  d_6 = i_11 + 1;
  if (i_8 != 0)
    goto <bb 4>; [64.00%]
  else
    goto <bb 6>; [36.00%]
;;    succ:       4
;;                6

;;   basic block 4, loop depth 1
;;    pred:       3
  if (d_6 <= c_2)
    goto <bb 5>; [0.00%]
  else
    goto <bb 6>; [100.00%]

We try to coalesce both c_10 with d_6 and i_11 with d_6; both have the same
cost, I think, and we succeed with the first, which happens to be the one with
the default def arg.

I also think whether we coalesce or not doesn't really matter for the issue at
hand: the copy on entry should be elided anyway.  But the odd inserted copy
should be investigated (it looks unnecessary and it should be placed before
the single use, not in the header).

> By disabling coalesce_ssa_name, we get the more usual copies on the incoming
> edges.  The copy on the loop entry path still does an uninitialized read, but
> that one's now initialized by init-regs.  The test-case passes, also when
> disabling init-regs, so it's possible that the JIT driver doesn't object to
> this type of uninitialized read.
>
> Now that we characterized the problem to some degree, we need to fix this,
> because either:
> - we're violating an undocumented ptx invariant, and this is a compiler bug,
>   or
> - this is a driver JIT bug and we need to work around it.

So what does the JIT do that ends up breaking things?  Does the
actual hardware ISA have NaTs and trap?

> There are essentially two strategies to address this:
> - stop the compiler from creating uninitialized reads
> - patch up uninitialized reads using additional initialization
>
> The former will probably involve:
> - making some optimizations more conservative in the presence of
>   uninitialized reads, and
> - disabling some other optimizations (where making them more conservative is
>   not possible, or cannot easily be achieved).
> This will probably have a cost penalty for code that does not suffer from
> the original problem.
>
> The latter has the problem that it may paper over uninitialized reads
> in the source code, or indeed over ones that were incorrectly introduced
> by the compiler.  But it has the advantage that it allows for the problem to
> be addressed at a single location.

There are some long-standing bugs in bugzilla regarding uninit uses
and how we treat them as invoking undefined behavior but also introduce
those ourselves in some places.  We of course can't do both so I think
we do need to get our hands on a way to fix things without introducing
too many optimization regressions.

You've identified the most obvious candidate already - logical-op
short-circuiting.

I guess that the PTX JIT is fine with uninitialized memory, so one issue
is that we can end up turning uninitialized memory into uninitialized
registers (not changing the point of execution though).  If the JIT
breaks here, you will need to fix up in reorg like you do.

> There's an existing pass, init-regs, which implements a form of the latter,
> but it doesn't work for this example because it only inserts additional
> initialization for uses that have not a single reaching definition.

The init-regs pass has a motivation that isn't backed by facts so I'd like
it to go away since it also stands in the way of some optimizations.

> Fix this by adding initialization of uninitialized ptx regs in reorg.
>
> Control the new functionality using -minit-regs=<0|1|2|3>, meaning:
> - 0: disabled.
> - 1: add initialization of all regs at the entry bb
> - 2: add initialization of uninitialized regs at the entry bb
> - 3: add initialization of uninitialized regs close to the use
> and defaulting to 3.
>
> Tested on nvptx.
>
> Any comments?
>
> Thanks,
> - Tom
>
> [nvptx] Initialize ptx regs
>
> gcc/ChangeLog:
>
> 2022-02-17  Tom de Vries  <tdevries@suse.de>
>
>         PR target/104440
>         * config/nvptx/nvptx.cc (workaround_uninit_method_1)
>         (workaround_uninit_method_2, workaround_uninit_method_3)
>         (workaround_uninit): New function.
>         (nvptx_reorg): Use workaround_uninit.
>         * config/nvptx/nvptx.opt (minit-regs): New option.
>
> ---
>  gcc/config/nvptx/nvptx.cc  | 188 +++++++++++++++++++++++++++++++++++++++++++++
>  gcc/config/nvptx/nvptx.opt |   4 +
>  2 files changed, 192 insertions(+)
>
> diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
> index ed347cab70e..a37a6c78b41 100644
> --- a/gcc/config/nvptx/nvptx.cc
> +++ b/gcc/config/nvptx/nvptx.cc
> @@ -5372,6 +5372,190 @@ workaround_barsyncs (void)
>  }
>  #endif
>
> +/* Initialize all declared regs at function entry.
> +   Advantage   : Fool-proof.
> +   Disadvantage: Potentially creates a lot of long live ranges and adds a lot
> +                of insns.  */
> +
> +static void
> +workaround_uninit_method_1 (void)
> +{
> +  rtx_insn *first = get_insns ();
> +  rtx_insn *insert_here = NULL;
> +
> +  for (int ix = LAST_VIRTUAL_REGISTER + 1; ix < max_reg_num (); ix++)
> +    {
> +      rtx reg = regno_reg_rtx[ix];
> +
> +      /* Skip undeclared registers.  */
> +      if (reg == const0_rtx)
> +       continue;
> +
> +      gcc_assert (CONST0_RTX (GET_MODE (reg)));
> +
> +      start_sequence ();
> +      emit_move_insn (reg, CONST0_RTX (GET_MODE (reg)));
> +      rtx_insn *inits = get_insns ();
> +      end_sequence ();
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       for (rtx_insn *init = inits; init != NULL; init = NEXT_INSN (init))
> +         fprintf (dump_file, "Default init of reg %u inserted: insn %u\n",
> +                  ix, INSN_UID (init));
> +
> +      if (first != NULL)
> +       {
> +         insert_here = emit_insn_before (inits, first);
> +         first = NULL;
> +       }
> +      else
> +       insert_here = emit_insn_after (inits, insert_here);
> +    }
> +}
> +
> +/* Find uses of regs that are not defined on all incoming paths, and insert a
> +   corresponding def at function entry.
> +   Advantage   : Simple.
> +   Disadvantage: Potentially creates long live ranges.
> +                May not catch all cases.  F.i. a clobber cuts a live range in
> +                the compiler and may prevent entry_lr_in from being set for a
> +                reg, but the clobber does not translate to a ptx insn, so in
> +                ptx there still may be an uninitialized ptx reg.  See f.i.
> +                gcc.c-torture/compile/20020926-1.c.  */
> +
> +static void
> +workaround_uninit_method_2 (void)
> +{
> +  auto_bitmap entry_pseudo_uninit;
> +  {
> +    auto_bitmap not_pseudo;
> +    bitmap_set_range (not_pseudo, 0, LAST_VIRTUAL_REGISTER);
> +
> +    bitmap entry_lr_in = DF_LR_IN (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> +    bitmap_and_compl (entry_pseudo_uninit, entry_lr_in, not_pseudo);
> +  }
> +
> +  rtx_insn *first = get_insns ();
> +  rtx_insn *insert_here = NULL;
> +
> +  bitmap_iterator iterator;
> +  unsigned ix;
> +  EXECUTE_IF_SET_IN_BITMAP (entry_pseudo_uninit, 0, ix, iterator)
> +    {
> +      rtx reg = regno_reg_rtx[ix];
> +      gcc_assert (CONST0_RTX (GET_MODE (reg)));
> +
> +      start_sequence ();
> +      emit_move_insn (reg, CONST0_RTX (GET_MODE (reg)));
> +      rtx_insn *inits = get_insns ();
> +      end_sequence ();
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       for (rtx_insn *init = inits; init != NULL; init = NEXT_INSN (init))
> +         fprintf (dump_file, "Missing init of reg %u inserted: insn %u\n",
> +                  ix, INSN_UID (init));
> +
> +      if (first != NULL)
> +       {
> +         insert_here = emit_insn_before (inits, first);
> +         first = NULL;
> +       }
> +      else
> +       insert_here = emit_insn_after (inits, insert_here);
> +    }
> +}
> +
> +/* Find uses of regs that are not defined on all incoming paths, and insert a
> +   corresponding def on those.
> +   Advantage   : Doesn't create long live ranges.
> +   Disadvantage: More complex, and potentially also more defs.  */
> +
> +static void
> +workaround_uninit_method_3 (void)
> +{
> +  auto_bitmap not_pseudo;
> +  bitmap_set_range (not_pseudo, 0, LAST_VIRTUAL_REGISTER);
> +
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      if (single_pred_p (bb))
> +       continue;
> +
> +      auto_bitmap bb_pseudo_uninit;
> +      bitmap_and_compl (bb_pseudo_uninit, DF_LIVE_IN (bb), DF_MIR_IN (bb));
> +      bitmap_and_compl_into (bb_pseudo_uninit, not_pseudo);
> +
> +      bitmap_iterator iterator;
> +      unsigned ix;
> +      EXECUTE_IF_SET_IN_BITMAP (bb_pseudo_uninit, 0, ix, iterator)
> +       {
> +         bool have_false = false;
> +         bool have_true = false;
> +
> +         edge e;
> +         edge_iterator ei;
> +         FOR_EACH_EDGE (e, ei, bb->preds)
> +           {
> +             if (bitmap_bit_p (DF_LIVE_OUT (e->src), ix))
> +               have_true = true;
> +             else
> +               have_false = true;
> +           }
> +         if (have_false ^ have_true)
> +           continue;
> +
> +         FOR_EACH_EDGE (e, ei, bb->preds)
> +           {
> +             if (bitmap_bit_p (DF_LIVE_OUT (e->src), ix))
> +               continue;
> +
> +             rtx reg = regno_reg_rtx[ix];
> +             gcc_assert (CONST0_RTX (GET_MODE (reg)));
> +
> +             start_sequence ();
> +             emit_move_insn (reg, CONST0_RTX (GET_MODE (reg)));
> +             rtx_insn *inits = get_insns ();
> +             end_sequence ();
> +
> +             if (dump_file && (dump_flags & TDF_DETAILS))
> +               for (rtx_insn *init = inits; init != NULL;
> +                    init = NEXT_INSN (init))
> +                 fprintf (dump_file,
> +                          "Missing init of reg %u inserted on edge: %d -> %d:"
> +                          " insn %u\n", ix, e->src->index, e->dest->index,
> +                          INSN_UID (init));
> +
> +             insert_insn_on_edge (inits, e);
> +           }
> +       }
> +    }
> +
> +  commit_edge_insertions ();
> +}
> +
> +static void
> +workaround_uninit (void)
> +{
> +  switch (nvptx_init_regs)
> +    {
> +    case 0:
> +      /* Skip.  */
> +      break;
> +    case 1:
> +      workaround_uninit_method_1 ();
> +      break;
> +    case 2:
> +      workaround_uninit_method_2 ();
> +      break;
> +    case 3:
> +      workaround_uninit_method_3 ();
> +      break;
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +
>  /* PTX-specific reorganization
>     - Split blocks at fork and join instructions
>     - Compute live registers
> @@ -5401,6 +5585,8 @@ nvptx_reorg (void)
>    df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
>    df_live_add_problem ();
>    df_live_set_all_dirty ();
> +  if (nvptx_init_regs == 3)
> +    df_mir_add_problem ();
>    df_analyze ();
>    regstat_init_n_sets_and_refs ();
>
> @@ -5413,6 +5599,8 @@ nvptx_reorg (void)
>      if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
>        regno_reg_rtx[i] = const0_rtx;
>
> +  workaround_uninit ();
> +
>    /* Determine launch dimensions of the function.  If it is not an
>       offloaded function  (i.e. this is a regular compiler), the
>       function has no neutering.  */
> diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
> index e3f65b2d0b1..08580071731 100644
> --- a/gcc/config/nvptx/nvptx.opt
> +++ b/gcc/config/nvptx/nvptx.opt
> @@ -91,3 +91,7 @@ Enum(ptx_version) String(7.0) Value(PTX_VERSION_7_0)
>  mptx=
>  Target RejectNegative ToLower Joined Enum(ptx_version) Var(ptx_version_option)
>  Specify the version of the ptx version to use.
> +
> +minit-regs=
> +Target Var(nvptx_init_regs) IntegerRange(0, 3) Joined UInteger Init(3)
> +Initialize ptx registers.
  
Tom de Vries Feb. 21, 2022, 1:47 p.m. UTC | #2
On 2/21/22 08:54, Richard Biener wrote:
> On Sun, Feb 20, 2022 at 11:50 PM Tom de Vries via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> With the nvptx target, driver version 510.47.03 and a GT 1030 board, we run into:
>> ...
>> FAIL: gcc.c-torture/execute/pr53465.c -O1 execution test
>> FAIL: gcc.c-torture/execute/pr53465.c -O2 execution test
>> FAIL: gcc.c-torture/execute/pr53465.c -O3 -g execution test
>> ...
>> while the test-cases pass with nvptx-none-run -O0.
>>
>> The problem is that the generated ptx contains a read from an uninitialized
>> ptx register, and the driver JIT doesn't handle this well.
>>
>> For -O2 and -O3, we can get rid of the FAIL using --param
>> logical-op-non-short-circuit=0.  But not for -O1.
>>
>> At -O1, the test-case minimizes to:
>> ...
>> void __attribute__((noinline, noclone))
>> foo (int y) {
>>    int c;
>>    for (int i = 0; i < y; i++)
>>      {
>>        int d = i + 1;
>>        if (i && d <= c)
>>          __builtin_abort ();
>>        c = d;
>>      }
>> }
>>
>> int main () {
>>    foo (2); return 0;
>> }
>> ...
>>
>> Note that the test-case does not contain an uninitialized use.  In the first
>> iteration, i is 0 and consequently c is not read.  In the second iteration, c
>> is read, but by that time it's already initialized by 'c = d' from the first
>> iteration.
>>
>> AFAICT the problem is introduced as follows: the conditional use of c in the
>> loop body is translated into an unconditional use of c in the loop header:
>> ...
>>    # c_1 = PHI <c_4(D)(2), c_9(6)>
>> ...
>> which forwprop1 propagates the 'c_9 = d_7' assignment into:
>> ...
>>    # c_1 = PHI <c_4(D)(2), d_7(6)>
>> ...
>> which ends up being translated by expand into an unconditional:
>> ...
>> (insn 13 12 0 (set (reg/v:SI 22 [ c ])
>>          (reg/v:SI 23 [ d ])) -1
>>       (nil))
>> ...
>> at the start of the loop body, creating an uninitialized read of d on the
>> path from loop entry.
> 
> Ah, interesting case.  Note that some fixup pass inserted a copy in
> the loop header
> before coalescing:
> 
> ;;   basic block 3, loop depth 1
> ;;    pred:       6
> ;;                2
>    # c_10 = PHI <d_6(6), c_3(D)(2)>
>    # i_11 = PHI <d_6(6), 0(2)>
>    c_2 = c_10;               <----------------------- this one
>    i_8 = i_11;
>    d_6 = i_11 + 1;
>    if (i_8 != 0)
>      goto <bb 4>; [64.00%]
>    else
>      goto <bb 6>; [36.00%]
> ;;    succ:       4
> ;;                6
> 
> ;;   basic block 4, loop depth 1
> ;;    pred:       3
>    if (d_6 <= c_2)
>      goto <bb 5>; [0.00%]
>    else
>      goto <bb 6>; [100.00%]
> 
> we try to coalesce both c_10 to d_6 and i_11 to d_6, both have the same
> cost I think and we succeed with the first which happens to be the one with
> the default def arg.
> 
> I also think whether we coalesce or not doesn't really matter for the issue at
> hand, the copy on entry should be elided anyway but the odd inserted copy
> should be investigated (it looks unnecessary and it should be placed before
> the single-use, not in the header).
> 

Thanks for looking into this in detail.  Unfortunately, I'm not familiar
with this part of the compiler, so I can't really comment on your findings.

>> By disabling coalesce_ssa_name, we get the more usual copies on the incoming
>> edges.  The copy on the loop entry path still does an uninitialized read, but
>> that one's now initialized by init-regs.  The test-case passes, also when
>> disabling init-regs, so it's possible that the JIT driver doesn't object to
>> this type of uninitialized read.
>>
>> Now that we characterized the problem to some degree, we need to fix this,
>> because either:
>> - we're violating an undocumented ptx invariant, and this is a compiler bug,
>>    or
>> - this is a driver JIT bug and we need to work around it.
> 
> So what does the JIT do that ends up breaking things?  Does the
> actual hardware ISA have NaTs and trap?
> 

That's a good question.  I haven't studied the actual hardware in 
detail, so I can't answer that. [ And the driver being closed source and 
the hardware undocumented by nvidia doesn't help in quickly finding a 
reliable answer there. ]

However, my theory is the following: there are one or more
optimizations in the JIT that take a read from an uninitialized register
as sufficient proof to delete (some) dependent insns.

We've seen this in action before.  The following is documented at 
WORKAROUND_PTXJIT_BUG in nvptx.cc.

Consider this code:
...
                 {
                     .reg .u32 %x;
                     mov.u32 %x,%tid.x;
                     setp.ne.u32 %rnotvzero,%x,0;
                 }

                 @%rnotvzero bra Lskip;
                 setp.<op>.<type> %rcond,op1,op2;
                 Lskip:
                 selp.u32 %rcondu32,1,0,%rcond;
                 shfl.idx.b32 %rcondu32,%rcondu32,0,31;
                 setp.ne.u32 %rcond,%rcondu32,0;
...

My interpretation of what is supposed to happen, as per the ptx
specification, is as follows.

The conditional setp.<op> sets register %rcond, but just for thread 0 in 
the warp (of 32 threads).  The register remains uninitialized for 
threads 1..31.

The selp sets register %rcondu32. For thread 0, it'll contain a defined 
value,  but for threads 1..31 it'll have undefined values.

The shfl propagates the value of thread 0 in register %rcondu32 to 
threads 0..31.  The register %rcondu32 now has a defined value in all 32 
threads.

However, the driver JIT removes the shfl insn.  We've filed this as a 
bug at nvidia, but got the answer that the ptx is illegal because 
uninitialized registers are read.  We've gotten no clarification in the 
bug report, and AFAIK the ptx isa spec hasn't been updated to clarify 
this either.

The policy since then is: if we minimize a failing test-case, and find a 
speculative read of an uninitialized register, and can make the fail go 
away by initializing the register (as well as by turning off JIT 
optimization), then add a workaround in the compiler that adds the 
initialization.

[ In a way, this points to a deficiency in the ptx format: it has a bit 
bucket operand '_' to act as a sink for operations where you don't care 
about the result, like implementing an atomic store using atomic exchange:
...
     atom.exch.b32 _,[%r22],%r23;
...

The same bit bucket operand should be able to function as a source 
operand, to write:
...
mov.u32 %rx, _;
...
to indicate that for purposes of definedness analysis in the JIT, the 
register should be considered defined, but there's no need to allocate a 
resource to store a value. ]

>> There are essentially two strategies to address this:
>> - stop the compiler from creating uninitialized reads
>> - patch up uninitialized reads using additional initialization
>>
>> The former will probably involve:
>> - making some optimizations more conservative in the presence of
>>    uninitialized reads, and
>> - disabling some other optimizations (where making them more conservative is
>>    not possible, or cannot easily be achieved).
>> This will probably have a cost penalty for code that does not suffer from
>> the original problem.
>>
>> The latter has the problem that it may paper over uninitialized reads
>> in the source code, or indeed over ones that were incorrectly introduced
>> by the compiler.  But it has the advantage that it allows for the problem to
>> be addressed at a single location.
> 
> There are some long-standing bugs in bugzilla regarding uninit uses
> and how we treat them as invoking undefined behavior but also introduce
> those ourselves in some places.  We of course can't do both

[ FWIW, I wonder if we can do both, if we can differentiate between 
uninit uses introduced by the user and by the compiler. ]

> so I think
> we do need to get our hands on a way to fix things without introducing
> too many optimization regressions.
> 

Agreed.

> You've identified the most obvious candidate already - logical-op
> short-circuiting.
> 

FWIW, I tried hard coding --param logical-op-non-short-circuit=0 in the 
port, and immediately ran into regressions in 
gcc.target/nvptx/bool-{1,2,3}.c where we try to optimize:
...
int
foo (int x, int y)
{
   return (x == 21) && (y == 69);
}
...

[ So I guess we need to reimplement this optimization on tree-ssa, where 
we could reasonably implement a test on whether y is initialized on the 
x != 21 path. ]

> I guess that the PTX JIT is fine with uninitialized memory so one issue
> is that we can end up turning uninitialized memory into uninitialized
> registers (not changing the point of execution though), if the JIT will
> break here you will need to fixup in reorg like you do.
> 

Hmm, interesting, I hadn't thought of that possibility.

>> There's an existing pass, init-regs, which implements a form of the latter,
>> but it doesn't work for this example because it only inserts additional
>> initialization for uses that have not a single reaching definition.
> 

And to be complete here, if we use -ftrivial-auto-var-init=zero, the 
test-case does pass.  But that seems to take effect during 
gimplification, which means it won't catch any cases introduced later-on.

> The init-regs pass has a motivation that isn't backed by facts so I'd like
> it to go away since it also stands in the way of some optimizations.
> 

Yeah, I read the related PR (PR61810), and that makes sense.

I also saw the proposed patch to remove init-regs, and it does so for 
lra ports.  The nvptx port is not a targetm.lra_p port, but neither is 
it a reload port, instead it's a targetm.no_register_allocation port. 
So I guess it would be interesting to see what the fall-out is of 
disabling init-regs for nvptx.

Thanks for the comments.

I'll try to summarize my understanding here:
- init-regs needs to go, because:
   - it doesn't clearly state the problem it's solving, and
   - it papers over issues elsewhere.
- speculative use of uninitialized registers is a source of problems for
   the nvidia driver JIT.
- we need a way in the compiler to stop introducing speculative use of
   uninitialized registers, without unnecessarily losing performance.
- in absence of such a solution, the nvptx port needs a stop-gap
   solution
- such a stop-gap solution is similar to init-regs in the sense that
   both insert inits of regs, but:
   - the stop-gap solution runs later than init-regs.  The problems it
     attempts to fix are post-emit, in the driver JIT, so
     it needs to run as late as possible to catch all cases introduced
     by earlier passes.
   - the stop-gap solution has a clear description of the problem it's
     solving (though the scope of the problem remains guesswork)
- we need to test (at some point) whether disabling init-regs requires
   expanding the stop-gap solution (because atm, the two passes have no
   overlap in terms of problem scope and consequently, generated inits).
- it being a stop-gap solution, once it's made unnecessary, it can be
   removed (or, disabled by default, or, assert by default rather than
   inserting an init)

I'll commit unless there are further comments.

Thanks,
- Tom

>> Fix this by adding initialization of uninitialized ptx regs in reorg.
>>
>> Control the new functionality using -minit-regs=<0|1|2|3>, meaning:
>> - 0: disabled.
>> - 1: add initialization of all regs at the entry bb
>> - 2: add initialization of uninitialized regs at the entry bb
>> - 3: add initialization of uninitialized regs close to the use
>> and defaulting to 3.
>>
>> Tested on nvptx.
>>
>> Any comments?
>>
>> Thanks,
>> - Tom
>>
>> [nvptx] Initialize ptx regs
>>
>> gcc/ChangeLog:
>>
>> 2022-02-17  Tom de Vries  <tdevries@suse.de>
>>
>>          PR target/104440
>>          * config/nvptx/nvptx.cc (workaround_uninit_method_1)
>>          (workaround_uninit_method_2, workaround_uninit_method_3)
>>          (workaround_uninit): New function.
>>          (nvptx_reorg): Use workaround_uninit.
>>          * config/nvptx/nvptx.opt (minit-regs): New option.
>>
  
Richard Biener Feb. 21, 2022, 2:38 p.m. UTC | #3
On Mon, Feb 21, 2022 at 2:47 PM Tom de Vries <tdevries@suse.de> wrote:
>
> On 2/21/22 08:54, Richard Biener wrote:
> > On Sun, Feb 20, 2022 at 11:50 PM Tom de Vries via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Hi,
> >>
> >> With nvptx target, driver version 510.47.03 and board GT 1030, we run into:
> >> ...
> >> FAIL: gcc.c-torture/execute/pr53465.c -O1 execution test
> >> FAIL: gcc.c-torture/execute/pr53465.c -O2 execution test
> >> FAIL: gcc.c-torture/execute/pr53465.c -O3 -g execution test
> >> ...
> >> while the test-cases pass with nvptx-none-run -O0.
> >>
> >> The problem is that the generated ptx contains a read from an uninitialized
> >> ptx register, and the driver JIT doesn't handle this well.
> >>
> >> For -O2 and -O3, we can get rid of the FAIL using --param
> >> logical-op-non-short-circuit=0.  But not for -O1.
> >>
> >> At -O1, the test-case minimizes to:
> >> ...
> >> void __attribute__((noinline, noclone))
> >> foo (int y) {
> >>    int c;
> >>    for (int i = 0; i < y; i++)
> >>      {
> >>        int d = i + 1;
> >>        if (i && d <= c)
> >>          __builtin_abort ();
> >>        c = d;
> >>      }
> >> }
> >>
> >> int main () {
> >>    foo (2); return 0;
> >> }
> >> ...
> >>
> >> Note that the test-case does not contain an uninitialized use.  In the first
> >> iteration, i is 0 and consequently c is not read.  In the second iteration, c
> >> is read, but by that time it's already initialized by 'c = d' from the first
> >> iteration.
> >>
> >> AFAICT the problem is introduced as follows: the conditional use of c in the
> >> loop body is translated into an unconditional use of c in the loop header:
> >> ...
> >>    # c_1 = PHI <c_4(D)(2), c_9(6)>
> >> ...
> >> into which forwprop1 propagates the 'c_9 = d_7' assignment, giving:
> >> ...
> >>    # c_1 = PHI <c_4(D)(2), d_7(6)>
> >> ...
> >> which ends up being translated by expand into an unconditional:
> >> ...
> >> (insn 13 12 0 (set (reg/v:SI 22 [ c ])
> >>          (reg/v:SI 23 [ d ])) -1
> >>       (nil))
> >> ...
> >> at the start of the loop body, creating an uninitialized read of d on the
> >> path from loop entry.
> >
> > Ah, interesting case.  Note that some fixup pass inserted a copy in
> > the loop header
> > before coalescing:
> >
> > ;;   basic block 3, loop depth 1
> > ;;    pred:       6
> > ;;                2
> >    # c_10 = PHI <d_6(6), c_3(D)(2)>
> >    # i_11 = PHI <d_6(6), 0(2)>
> >    c_2 = c_10;               <----------------------- this one
> >    i_8 = i_11;
> >    d_6 = i_11 + 1;
> >    if (i_8 != 0)
> >      goto <bb 4>; [64.00%]
> >    else
> >      goto <bb 6>; [36.00%]
> > ;;    succ:       4
> > ;;                6
> >
> > ;;   basic block 4, loop depth 1
> > ;;    pred:       3
> >    if (d_6 <= c_2)
> >      goto <bb 5>; [0.00%]
> >    else
> >      goto <bb 6>; [100.00%]
> >
> > we try to coalesce both c_10 to d_6 and i_11 to d_6, both have the same
> > cost I think and we succeed with the first which happens to be the one with
> > the default def arg.
> >
> > I also think whether we coalesce or not doesn't really matter for the issue at
> > hand, the copy on entry should be elided anyway but the odd inserted copy
> > should be investigated (it looks unnecessary and it should be placed before
> > the single-use, not in the header).
> >
>
> Thanks for looking into this in detail, unfortunately I'm not familiar
> with this part of the compiler so I can't really comment on your findings.
>
> >> By disabling coalesce_ssa_name, we get the more usual copies on the incoming
> >> edges.  The copy on the loop entry path still does an uninitialized read, but
> >> that one's now initialized by init-regs.  The test-case passes, also when
> >> disabling init-regs, so it's possible that the JIT driver doesn't object to
> >> this type of uninitialized read.
> >>
> >> Now that we characterized the problem to some degree, we need to fix this,
> >> because either:
> >> - we're violating an undocumented ptx invariant, and this is a compiler bug,
> >>    or
> >> - this is a driver JIT bug and we need to work around it.
> >
> > So what does the JIT do that ends up breaking things?  Does the
> > actual hardware ISA have NaTs and trap?
> >
>
> That's a good question.  I haven't studied the actual hardware in
> detail, so I can't answer that. [ And the driver being closed source and
> the hardware undocumented by nvidia doesn't help in quickly finding a
> reliable answer there. ]
>
> However, my theory is the following: there are one or more
> optimizations in the JIT that take a read from an uninitialized register
> as sufficient proof to delete (some) dependent insns.
>
> We've seen this in action before.  The following is documented at
> WORKAROUND_PTXJIT_BUG in nvptx.cc.
>
> Consider this code:
> ...
>                  {
>                      .reg .u32 %x;
>
>                      mov.u32 %x,%tid.x;
>
>                      setp.ne.u32 %rnotvzero,%x,0;
>
>                  }
>
>                   @%rnotvzero bra Lskip;
>                   setp.<op>.<type> %rcond,op1,op2;
>                   Lskip:
>
>                   selp.u32 %rcondu32,1,0,%rcond;
>                   shfl.idx.b32 %rcondu32,%rcondu32,0,31;
>                   setp.ne.u32 %rcond,%rcondu32,0;
> ...
>
> My interpretation of what is supposed to happen, as per the ptx
> specification is as follows.
>
> The conditional setp.<op> sets register %rcond, but just for thread 0 in
> the warp (of 32 threads).  The register remains uninitialized for
> threads 1..31.
>
> The selp sets register %rcondu32. For thread 0, it'll contain a defined
> value,  but for threads 1..31 it'll have undefined values.
>
> The shfl propagates the value of thread 0 in register %rcondu32 to
> threads 0..31.  The register %rcondu32 now has a defined value in all 32
> threads.
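
As a minimal sketch of the reading described above (a model of the
interpretation only, not of what the driver JIT actually does), the
selp and shfl steps can be written in plain C with an explicit
"defined" flag per lane.  The type and function names are hypothetical
and only serve the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32

/* A per-lane register value plus a flag saying whether it was written.  */
struct lane_reg
{
  uint32_t value;
  bool defined;
};

/* Model of "selp.u32 dst,1,0,pred": the result is only defined in the
   lanes where the predicate itself is defined.  */
void
model_selp (struct lane_reg dst[WARP_SIZE],
	    const struct lane_reg pred[WARP_SIZE])
{
  for (int lane = 0; lane < WARP_SIZE; lane++)
    {
      dst[lane].defined = pred[lane].defined;
      dst[lane].value = (pred[lane].defined && pred[lane].value) ? 1 : 0;
    }
}

/* Model of "shfl.idx.b32 reg,reg,0,31": every lane receives the value
   held by lane 0, so afterwards all lanes are as defined as lane 0.  */
void
model_shfl_idx_lane0 (struct lane_reg reg[WARP_SIZE])
{
  for (int lane = 1; lane < WARP_SIZE; lane++)
    reg[lane] = reg[0];
}
```

Under this model, removing the shfl leaves lanes 1..31 undefined, which
is why dropping it changes the result.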
>
> However, the driver JIT removes the shfl insn.  We've filed this as a
> bug at nvidia, but got the answer that the ptx is illegal because
> uninitialized registers are read.  We've gotten no clarification in the
> bug report, and AFAIK the ptx isa spec hasn't been updated to clarify
> this either.
>
> The policy since then is: if we minimize a failing test-case, and find a
> speculative read of an uninitialized register, and can make the fail go
> away by initializing the register (as well as by turning off JIT
> optimization), then add a workaround in the compiler that adds the
> initialization.
>
> [ In a way, this points to a deficiency in the ptx format: it has a bit
> bucket operand '_' to act as a sink for operations where you don't care
> about the result, like implementing an atomic store using atomic exchange:
> ...
>      atom.exch.b32 _,[%r22],%r23;
> ...
>
> The same bit bucket operand should be able to function as a source
> operand, to write:
> ...
> mov.u32 %rx, _;
> ...
> to indicate that for purposes of definedness analysis in the JIT, the
> register should be considered defined, but there's no need to allocate a
> resource to store a value. ]
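
For readers less familiar with the idiom in the aside, the "atomic
store via atomic exchange" pattern has a direct C11 analogue: the
exchange returns the old value and the caller simply discards it, which
is the role the '_' sink operand plays in the ptx line.  This is only
an illustration in C; the function name is hypothetical.

```c
#include <stdatomic.h>

/* Store VALUE to *ADDR using an atomic exchange whose result (the old
   value) is deliberately discarded, like the '_' sink operand in ptx.  */
void
store_via_exchange (atomic_int *addr, int value)
{
  (void) atomic_exchange (addr, value);
}
```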
>
> >> There are essentially two strategies to address this:
> >> - stop the compiler from creating uninitialized reads
> >> - patch up uninitialized reads using additional initialization
> >>
> >> The former will probably involve:
> >> - making some optimizations more conservative in the presence of
> >>    uninitialized reads, and
> >> - disabling some other optimizations (where making them more conservative is
> >>    not possible, or cannot easily be achieved).
> >> This will probably have a cost penalty for code that does not suffer from
> >> the original problem.
> >>
> >> The latter has the problem that it may paper over uninitialized reads
> >> in the source code, or indeed over ones that were incorrectly introduced
> >> by the compiler.  But it has the advantage that it allows for the problem to
> >> be addressed at a single location.
> >
> > There are some long-standing bugs in bugzilla regarding uninit uses
> > and how we treat them as invoking undefined behavior but also introduce
> > those ourselves in some places.  We of course can't do both
>
> [ FWIW, I wonder if we can do both, if we can differentiate between
> uninit uses introduced by the user and by the compiler. ]
>
> > so I think
> > we do need to get our hands on a way to fix things without introducing
> > too many optimization regressions.
> >
>
> Agreed.
>
> > You've identified the most obvious candidate already - logical-op
> > short-circuiting.
> >
>
> FWIW, I tried hard coding --param logical-op-non-short-circuit=0 in the
> port, and immediately ran into regressions in
> gcc.target/nvptx/bool-{1,2,3}.c where we try to optimize:
> ...
> int
> foo (int x, int y)
> {
>    return (x == 21) && (y == 69);
> }
> ...
>
> [ So I guess we need to reimplement this optimization on tree-ssa, where
> we could reasonably implement a test on whether y is initialized on the
> x != 21 path. ]

We do have this optimization on GIMPLE - that's ifcombine.  But
it doesn't do a good job (or any at all?) of avoiding uses of uninits, and it
is guarded by the very same --param logical-op-non-short-circuit as
the early fold-const.cc code, which just looks at side-effects and cannot
reasonably do any dataflow analysis.
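
To make the trade-off concrete, here is a minimal C sketch of the two
shapes the bool-{1,2,3}.c example can take.  The function names are
hypothetical, and the second function is a hand-written approximation
of the non-short-circuit form, not output copied from the compiler.

```c
/* Short-circuit form: y is only read on the path where x == 21.  */
int
foo_short_circuit (int x, int y)
{
  if (x == 21)
    return y == 69;
  return 0;
}

/* Non-short-circuit form: both comparisons are evaluated
   unconditionally and combined with a bitwise and.  This avoids a
   branch, but y is now read even when x != 21.  */
int
foo_non_short_circuit (int x, int y)
{
  int a = x == 21;
  int b = y == 69;
  return a & b;
}
```

Both forms return the same value whenever y is initialized.  They only
differ when y is not initialized on the x != 21 path: there the second
form turns a read that never happened into a speculative one.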

> > I guess that the PTX JIT is fine with uninitialized memory so one issue
> > is that we can end up turning uninitialized memory into uninitialized
> > registers (not changing the point of execution though), if the JIT will
> > break here you will need to fixup in reorg like you do.
> >
>
> Hmm, interesting, I hadn't thought of that possibility.

Yep, consider

void foo(int i)
{
   int j;
   if (i_1)
    {
     _2 = j;
     if (_2)
        ...
    }
   if (complex condition) // optimized to 0 "late"
     bar (&j);
}

When the address-taking of 'j' vanishes after if-combine has already
concluded that '_2' is initialized (because it is set by a load from 'j'),
we'll re-write 'j' into SSA and expose an unconditional use of a register.
That's why if-combine needs some magic to compute "must initialized",
and that "must initialized" will be kind-of a live problem, as a
user-written uninit use can make us consider further uses "initialized".

Now - for memory that's not entirely true since in the abstract machine
memory is just uninitialized and it's not undefined to read from it.  And
for the PTX JIT uninit memory is apparently different from an uninitialized
register as well.

So it also makes a difference for incoming parameters dependent on
whether they are passed on the stack or via registers ...

> >> There's an existing pass, init-regs, which implements a form of the latter,
> >> but it doesn't work for this example because it only inserts additional
> >> initialization for uses that have not a single reaching definition.
> >
>
> And to be complete here, if we use -ftrivial-auto-var-init=zero, the
> test-case does pass.  But that seems to take effect during
> gimplification, which means it won't catch any cases introduced later-on.
>
> > The init-regs pass has a motivation that isn't backed by facts so I'd like
> > it to go away since it also stands in the way of some optimizations.
> >
>
> Yeah, I read the related PR (PR61810), and that makes sense.
>
> I also saw the proposed patch to remove init-regs, and it does so for
> lra ports.  The nvptx port is not a targetm.lra_p port, but neither is
> it a reload port, instead it's a targetm.no_register_allocation port.
> So I guess it would be interesting to see what the fall-out is of
> disabling init-regs for nvptx.
>
> Thanks for the comments.
>
> I'll try to summarize my understanding here:
> - init-regs needs to go, because:
>    - it doesn't clearly state the problem it's solving, and
>    - it papers over issues elsewhere.
> - speculative use of uninitialized registers is a source of problems for
>    the nvidia driver JIT.
> - we need a way in the compiler to stop introducing speculative use of
>    uninitialized registers, without unnecessarily losing performance.
> - in absence of such a solution, the nvptx port needs a stop-gap
>    solution
> - such a stop-gap solution is similar to init-regs in the sense that
>    both insert inits of regs, but:
>    - the stop-gap solution runs later than init-regs.  The problems it
>      attempts to fix are post-emit, in the driver JIT, so
>      it needs to run ALAP to catch all cases introduced by earlier
>      passes.
>    - the stop-gap solution has a clear description of the problem it's
>      solving (though the scope of the problem remains guesswork)
> - we need to test (at some point) whether disabling init-regs requires
>    expanding the stop-gap solution (because atm, the two passes have no
>    overlap in terms of problem scope and consequently, generated inits).
> - it being a stop-gap solution, once it's made unnecessary, it can be
>    removed (or, disabled by default, or, assert by default rather than
>    inserting an init)

Yes, I think all of the above is correct.

> I'll commit unless there are further comments.
>
> Thanks,
> - Tom
>
> >> Fix this by adding initialization of uninitialized ptx regs in reorg.
> >>
> >> Control the new functionality using -minit-regs=<0|1|2|3>, meaning:
> >> - 0: disabled.
> >> - 1: add initialization of all regs at the entry bb
> >> - 2: add initialization of uninitialized regs at the entry bb
> >> - 3: add initialization of uninitialized regs close to the use
> >> and defaulting to 3.
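
The per-edge decision behind -minit-regs=3 can be sketched in isolation
as follows.  This is a simplified model, assuming the liveness of one
register at the end of each predecessor block is already available as
an array of flags; the patch itself gets that information from the
DF_LIVE_OUT bitmaps.  The function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* For one register that is live into a block, decide on which incoming
   edges an init has to be inserted.  LIVE_OUT_OF_PRED[i] says whether
   the register is live out of predecessor i.  NEEDS_INIT[i] is set for
   every edge that gets an init.  Returns the number of such edges.  */
size_t
select_init_edges (const bool *live_out_of_pred, size_t n_preds,
		   bool *needs_init)
{
  bool have_true = false;
  bool have_false = false;

  for (size_t i = 0; i < n_preds; i++)
    {
      needs_init[i] = false;
      if (live_out_of_pred[i])
	have_true = true;
      else
	have_false = true;
    }

  /* Only a mix of defining and non-defining predecessors is handled
     here; if all edges agree, nothing is inserted at this block.  */
  if (!(have_true && have_false))
    return 0;

  size_t count = 0;
  for (size_t i = 0; i < n_preds; i++)
    if (!live_out_of_pred[i])
      {
	needs_init[i] = true;
	count++;
      }
  return count;
}
```

For the loop in the minimized test-case, the header has two
predecessors: the register is live out of the latch but not out of the
entry path, so only the entry edge gets an init.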
> >>
> >> Tested on nvptx.
> >>
> >> Any comments?
> >>
> >> Thanks,
> >> - Tom
> >>
> >> [nvptx] Initialize ptx regs
> >>
> >> gcc/ChangeLog:
> >>
> >> 2022-02-17  Tom de Vries  <tdevries@suse.de>
> >>
> >>          PR target/104440
> >>          * config/nvptx/nvptx.cc (workaround_uninit_method_1)
> >>          (workaround_uninit_method_2, workaround_uninit_method_3)
> >>          (workaround_uninit): New function.
> >>          (nvptx_reorg): Use workaround_uninit.
> >>          * config/nvptx/nvptx.opt (minit-regs): New option.
> >>
  

Patch

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index ed347cab70e..a37a6c78b41 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -5372,6 +5372,190 @@  workaround_barsyncs (void)
 }
 #endif
 
+/* Initialize all declared regs at function entry.
+   Advantage   : Fool-proof.
+   Disadvantage: Potentially creates a lot of long live ranges and adds a lot
+		 of insns.  */
+
+static void
+workaround_uninit_method_1 (void)
+{
+  rtx_insn *first = get_insns ();
+  rtx_insn *insert_here = NULL;
+
+  for (int ix = LAST_VIRTUAL_REGISTER + 1; ix < max_reg_num (); ix++)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      /* Skip undeclared registers.  */
+      if (reg == const0_rtx)
+	continue;
+
+      gcc_assert (CONST0_RTX (GET_MODE (reg)));
+
+      start_sequence ();
+      emit_move_insn (reg, CONST0_RTX (GET_MODE (reg)));
+      rtx_insn *inits = get_insns ();
+      end_sequence ();
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	for (rtx_insn *init = inits; init != NULL; init = NEXT_INSN (init))
+	  fprintf (dump_file, "Default init of reg %u inserted: insn %u\n",
+		   ix, INSN_UID (init));
+
+      if (first != NULL)
+	{
+	  insert_here = emit_insn_before (inits, first);
+	  first = NULL;
+	}
+      else
+	insert_here = emit_insn_after (inits, insert_here);
+    }
+}
+
+/* Find uses of regs that are not defined on all incoming paths, and insert a
+   corresponding def at function entry.
+   Advantage   : Simple.
+   Disadvantage: Potentially creates long live ranges.
+		 May not catch all cases.  F.i. a clobber cuts a live range in
+		 the compiler and may prevent entry_lr_in from being set for a
+		 reg, but the clobber does not translate to a ptx insn, so in
+		 ptx there still may be an uninitialized ptx reg.  See f.i.
+		 gcc.c-torture/compile/20020926-1.c.  */
+
+static void
+workaround_uninit_method_2 (void)
+{
+  auto_bitmap entry_pseudo_uninit;
+  {
+    auto_bitmap not_pseudo;
+    bitmap_set_range (not_pseudo, 0, LAST_VIRTUAL_REGISTER);
+
+    bitmap entry_lr_in = DF_LR_IN (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+    bitmap_and_compl (entry_pseudo_uninit, entry_lr_in, not_pseudo);
+  }
+
+  rtx_insn *first = get_insns ();
+  rtx_insn *insert_here = NULL;
+
+  bitmap_iterator iterator;
+  unsigned ix;
+  EXECUTE_IF_SET_IN_BITMAP (entry_pseudo_uninit, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+      gcc_assert (CONST0_RTX (GET_MODE (reg)));
+
+      start_sequence ();
+      emit_move_insn (reg, CONST0_RTX (GET_MODE (reg)));
+      rtx_insn *inits = get_insns ();
+      end_sequence ();
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	for (rtx_insn *init = inits; init != NULL; init = NEXT_INSN (init))
+	  fprintf (dump_file, "Missing init of reg %u inserted: insn %u\n",
+		   ix, INSN_UID (init));
+
+      if (first != NULL)
+	{
+	  insert_here = emit_insn_before (inits, first);
+	  first = NULL;
+	}
+      else
+	insert_here = emit_insn_after (inits, insert_here);
+    }
+}
+
+/* Find uses of regs that are not defined on all incoming paths, and insert a
+   corresponding def on those.
+   Advantage   : Doesn't create long live ranges.
+   Disadvantage: More complex, and potentially also more defs.  */
+
+static void
+workaround_uninit_method_3 (void)
+{
+  auto_bitmap not_pseudo;
+  bitmap_set_range (not_pseudo, 0, LAST_VIRTUAL_REGISTER);
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      if (single_pred_p (bb))
+	continue;
+
+      auto_bitmap bb_pseudo_uninit;
+      bitmap_and_compl (bb_pseudo_uninit, DF_LIVE_IN (bb), DF_MIR_IN (bb));
+      bitmap_and_compl_into (bb_pseudo_uninit, not_pseudo);
+
+      bitmap_iterator iterator;
+      unsigned ix;
+      EXECUTE_IF_SET_IN_BITMAP (bb_pseudo_uninit, 0, ix, iterator)
+	{
+	  bool have_false = false;
+	  bool have_true = false;
+
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    {
+	      if (bitmap_bit_p (DF_LIVE_OUT (e->src), ix))
+		have_true = true;
+	      else
+		have_false = true;
+	    }
+	  if (have_false ^ have_true)
+	    continue;
+
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    {
+	      if (bitmap_bit_p (DF_LIVE_OUT (e->src), ix))
+		continue;
+
+	      rtx reg = regno_reg_rtx[ix];
+	      gcc_assert (CONST0_RTX (GET_MODE (reg)));
+
+	      start_sequence ();
+	      emit_move_insn (reg, CONST0_RTX (GET_MODE (reg)));
+	      rtx_insn *inits = get_insns ();
+	      end_sequence ();
+
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		for (rtx_insn *init = inits; init != NULL;
+		     init = NEXT_INSN (init))
+		  fprintf (dump_file,
+			   "Missing init of reg %u inserted on edge: %d -> %d:"
+			   " insn %u\n", ix, e->src->index, e->dest->index,
+			   INSN_UID (init));
+
+	      insert_insn_on_edge (inits, e);
+	    }
+	}
+    }
+
+  commit_edge_insertions ();
+}
+
+static void
+workaround_uninit (void)
+{
+  switch (nvptx_init_regs)
+    {
+    case 0:
+      /* Skip.  */
+      break;
+    case 1:
+      workaround_uninit_method_1 ();
+      break;
+    case 2:
+      workaround_uninit_method_2 ();
+      break;
+    case 3:
+      workaround_uninit_method_3 ();
+      break;
+    default:
+      gcc_unreachable ();
+    }
+}
+
 /* PTX-specific reorganization
    - Split blocks at fork and join instructions
    - Compute live registers
@@ -5401,6 +5585,8 @@  nvptx_reorg (void)
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
   df_live_add_problem ();
   df_live_set_all_dirty ();
+  if (nvptx_init_regs == 3)
+    df_mir_add_problem ();
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
@@ -5413,6 +5599,8 @@  nvptx_reorg (void)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
+  workaround_uninit ();
+
   /* Determine launch dimensions of the function.  If it is not an
      offloaded function  (i.e. this is a regular compiler), the
      function has no neutering.  */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index e3f65b2d0b1..08580071731 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -91,3 +91,7 @@  Enum(ptx_version) String(7.0) Value(PTX_VERSION_7_0)
 mptx=
 Target RejectNegative ToLower Joined Enum(ptx_version) Var(ptx_version_option)
 Specify the version of the ptx version to use.
+
+minit-regs=
+Target Var(nvptx_init_regs) IntegerRange(0, 3) Joined UInteger Init(3)
+Initialize ptx registers.
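
For readers who want to follow the per-edge decision in
workaround_uninit_method_3 without the df machinery, here is a toy model in
plain C.  It is a sketch only and not GCC code.  Registers are bits in a
32-bit mask, and regset, toy_edge, entry_inits and edge_inits are all
hypothetical names.  The masks passed in stand in for DF_LR_IN of the entry
block (method 2) and for DF_LIVE_IN, DF_MIR_IN and DF_LIVE_OUT (method 3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: one bit per register.  */
typedef uint32_t regset;

/* Stand-in for a predecessor edge; only the live-out set of the source
   block matters for the decision.  */
struct toy_edge
{
  regset src_live_out;
};

/* Method 2 analogue: pseudos that are live on entry to the function are
   read before being defined on some path, so all of them get an init at
   function entry.  */
regset
entry_inits (regset entry_live_in, regset not_pseudo)
{
  return entry_live_in & ~not_pseudo;
}

/* Method 3 analogue for one block with N_PREDS incoming edges.  On return,
   NEEDS_INIT[i] holds the regs to initialize on edge i.  The return value
   is the union over all edges.  */
regset
edge_inits (regset live_in, regset mir_in, regset not_pseudo,
	    const struct toy_edge *preds, size_t n_preds,
	    regset *needs_init)
{
  assert (needs_init != NULL);

  for (size_t i = 0; i < n_preds; i++)
    needs_init[i] = 0;

  /* Like the single_pred_p check: nothing to merge, nothing to do.  */
  if (n_preds == 1)
    return 0;

  /* Live at block entry, but not initialized on every incoming path.  */
  regset candidates = live_in & ~mir_in & ~not_pseudo;
  regset all = 0;

  for (unsigned r = 0; r < 32; r++)
    {
      regset bit = (regset) 1 << r;
      if (!(candidates & bit))
	continue;

      int have_true = 0;
      int have_false = 0;
      for (size_t i = 0; i < n_preds; i++)
	{
	  if (preds[i].src_live_out & bit)
	    have_true = 1;
	  else
	    have_false = 1;
	}

      /* All predecessors agree, so there is no edge to single out.  */
      if (have_true ^ have_false)
	continue;

      /* Mixed case: init only on the edges where the reg is not live.  */
      for (size_t i = 0; i < n_preds; i++)
	if (!(preds[i].src_live_out & bit))
	  {
	    needs_init[i] |= bit;
	    all |= bit;
	  }
    }

  return all;
}
```

For the loop from the commit message, take c to be bit 2, with two edges into
the loop header: the entry edge, where c is not live out of the source block,
and the back edge, where it is.  In this model edge_inits then asks for an
init on the entry edge only, which is where method 3 would insert the
'c = 0' move.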