[2/3] bpf: md: fix "*movsi" to generate wN regs [PR124688]

Message ID 20260329231754.2325557-3-vineet.gupta@linux.dev
State New
Headers
Series bpf: Enable wN reg codegen for bug-fix and fun |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm fail Patch failed to apply
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 fail Patch failed to apply

Commit Message

Vineet Gupta March 29, 2026, 11:17 p.m. UTC
  From: Vineet Gupta <gvineet@fb.com>

movsi currently only generates DImode rN regs, despite RTL being SImode.

| (insn 14 5 11 (set (reg:SI 1 %r1 [23])
|        (reg:SI 0 %r0))  {*movsi}
|     (expr_list:REG_DEAD (reg:SI 0 %r0)
|        (nil)))

generates

|  r1 = r0

as opposed to

| w1 = w0

This is not just issue of taste or getting more wN regs. As
illustrated by test, this can be a correctness issue where mov
needs to zero out the upper bits.

Again the issue is asm teplmate of pattern missing 'w' specifier,
leading bpf_print_register() to only generate 64-bit rN regs.
Using 'w' allows either wN/rN reg depending on the mode.

	PR target/124688

gcc/ChangeLog:

	* config/bpf/bpf.md (*movsi): Add 'w' to asm template.

gcc/testsuite/ChangeLog:

	* gcc.target/bpf/ret-reuse-arg-1.c: New test.

Signed-off-by: Vineet Gupta <vineet.gupta@linux.dev>
---
 gcc/config/bpf/bpf.md                          | 10 +++++-----
 gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c | 14 ++++++++++++++
 2 files changed, 19 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
  

Comments

Vineet Gupta March 29, 2026, 11:35 p.m. UTC | #1
On 3/29/26 4:17 PM, Vineet Gupta wrote:
> From: Vineet Gupta <gvineet@fb.com>
>
> movsi currently only generates DImode rN regs, despite RTL being SImode.
>
> | (insn 14 5 11 (set (reg:SI 1 %r1 [23])
> |        (reg:SI 0 %r0))  {*movsi}
> |     (expr_list:REG_DEAD (reg:SI 0 %r0)
> |        (nil)))
>
> generates
>
> |  r1 = r0
>
> as opposed to
>
> | w1 = w0
>
> This is not just issue of taste or getting more wN regs. As
> illustrated by test, this can be a correctness issue where mov
> needs to zero out the upper bits.
>
> Again the issue is asm teplmate of pattern missing 'w' specifier,
> leading bpf_print_register() to only generate 64-bit rN regs.
> Using 'w' allows either wN/rN reg depending on the mode.
>
> 	PR target/124688
>
> gcc/ChangeLog:
>
> 	* config/bpf/bpf.md (*movsi): Add 'w' to asm template.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/bpf/ret-reuse-arg-1.c: New test.
>
> Signed-off-by: Vineet Gupta <vineet.gupta@linux.dev>

Oh just realized that this needs assembler update. So need to post that 
first and also gate this change based on configure time detection of 
updated gas.

-Vineet
  
Vineet Gupta March 30, 2026, 4:48 p.m. UTC | #2
On 3/29/26 4:35 PM, Vineet Gupta wrote:
> On 3/29/26 4:17 PM, Vineet Gupta wrote:
>> From: Vineet Gupta <gvineet@fb.com>
>>
>> movsi currently only generates DImode rN regs, despite RTL being SImode.
>>
>> | (insn 14 5 11 (set (reg:SI 1 %r1 [23])
>> |        (reg:SI 0 %r0))  {*movsi}
>> |     (expr_list:REG_DEAD (reg:SI 0 %r0)
>> |        (nil)))
>>
>> generates
>>
>> |  r1 = r0
>>
>> as opposed to
>>
>> | w1 = w0
>>
>> This is not just issue of taste or getting more wN regs. As
>> illustrated by test, this can be a correctness issue where mov
>> needs to zero out the upper bits.
>>
>> Again the issue is asm teplmate of pattern missing 'w' specifier,
>> leading bpf_print_register() to only generate 64-bit rN regs.
>> Using 'w' allows either wN/rN reg depending on the mode.
>>
>>     PR target/124688
>>
>> gcc/ChangeLog:
>>
>>     * config/bpf/bpf.md (*movsi): Add 'w' to asm template.
>>
>> gcc/testsuite/ChangeLog:
>>
>>     * gcc.target/bpf/ret-reuse-arg-1.c: New test.
>>
>> Signed-off-by: Vineet Gupta <vineet.gupta@linux.dev>
>
> Oh just realized that this needs assembler update. So need to post 
> that first and also gate this change based on configure time detection 
> of updated gas.

Essentially gcc would now generate following 'w' + 'r' reg variants in 
addition to the existing 'r' only form.

wN = *(u32 *)(rM + imm)
*(u32 *) (rM + imm) = wN

Added support to gas, however when asking Claude to generate gas 
updates, it reports the encodins for both forms to be same. Further 
disassembler will only generate the 'r' output. So to avoid confusion, 
it would make sense for gcc mem alternate to not emit the 'w' form 
despite SImode.

Thx,
-Vineet
  
David Faust March 30, 2026, 6:45 p.m. UTC | #3
Hi Vineet,

On 3/29/26 16:17, Vineet Gupta wrote:
> From: Vineet Gupta <gvineet@fb.com>
> 
> movsi currently only generates DImode rN regs, despite RTL being SImode.
> 
> | (insn 14 5 11 (set (reg:SI 1 %r1 [23])
> |        (reg:SI 0 %r0))  {*movsi}
> |     (expr_list:REG_DEAD (reg:SI 0 %r0)
> |        (nil)))
> 
> generates
> 
> |  r1 = r0
> 
> as opposed to
> 
> | w1 = w0
> 
> This is not just issue of taste or getting more wN regs. As
> illustrated by test, this can be a correctness issue where mov
> needs to zero out the upper bits.
> 
> Again the issue is asm teplmate of pattern missing 'w' specifier,
> leading bpf_print_register() to only generate 64-bit rN regs.
> Using 'w' allows either wN/rN reg depending on the mode.
> 
> 	PR target/124688
> 
> gcc/ChangeLog:
> 
> 	* config/bpf/bpf.md (*movsi): Add 'w' to asm template.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/bpf/ret-reuse-arg-1.c: New test.
> 
> Signed-off-by: Vineet Gupta <vineet.gupta@linux.dev>
> ---
>  gcc/config/bpf/bpf.md                          | 10 +++++-----
>  gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c | 14 ++++++++++++++
>  2 files changed, 19 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
> 
> diff --git a/gcc/config/bpf/bpf.md b/gcc/config/bpf/bpf.md
> index a2bceb8998d7..e6776a94e7e4 100644
> --- a/gcc/config/bpf/bpf.md
> +++ b/gcc/config/bpf/bpf.md
> @@ -383,11 +383,11 @@
>          (match_operand:MM 1 "mov_src_operand"      " q,rIc,BC,r,I"))]
>    ""
>    "@
> -   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%0 = *(<smop> *) %1}\");
> -   *return bpf_output_move (operands, \"{mov\t%0,%1|%0 = %1}\");
> -   *return bpf_output_move (operands, \"{lddw\t%0,%1|%0 = %1 ll}\");
> -   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %0 = %1}\");
> -   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %0 = %1}\");"
> +   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%w0 = *(<smop> *) %w1}\");
> +   *return bpf_output_move (operands, \"{mov\t%0,%1|%w0 = %w1}\");
> +   *return bpf_output_move (operands, \"{lddw\t%0,%1|%w0 = %w1 ll}\");
> +   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");
> +   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");"

Based on our discussion in the BPF GCC meeting this morning
I think we decided that we should not emit wN for ld/st[x],
(though gas should accommodate them).

And the lddw pattern doesn't need to accommodate w regs since
it is exclusively for immediate values.

So then the only pattern that needs to change here would be
the register-register 'mov' insn itself. But the pattern is
matching a move from MM to MM, i.e. the source and dest must
already be the same mode. So are you sure we would need to emit
w reg in this case?

Maybe I am wrong but to me it seems the behavior in the test
below (and in the pr) must be more related to extendhisi.

>  [(set_attr "type" "ldx,alu,alu,stx,st")])
>  
>  ;;;; Shifts
> diff --git a/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c b/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
> new file mode 100644
> index 000000000000..6d0a4f280cd7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
> @@ -0,0 +1,14 @@
> +/* Return value of first call is arg to second call.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mcpu=v4" } */
> +
> +int ret_int ();
> +void arg_int (int);
> +
> +void foo () {
> +   arg_int(ret_int ());
> +}
> +
> +/* { dg-final { scan-assembler-not {r1 = r0} } } */
> +/* { dg-final { scan-assembler-times {w1 = w0} 1 } } */
  
Vineet Gupta March 30, 2026, 8:57 p.m. UTC | #4
Hi David,

Thanks for quickly reviewing the patches.

On 3/30/26 11:45 AM, David Faust wrote:
>> From: Vineet Gupta <gvineet@fb.com>

This somehow snuck in, will fix in v2.

>> movsi currently only generates DImode rN regs, despite RTL being SImode.
>>
>> | (insn 14 5 11 (set (reg:SI 1 %r1 [23])
>> |        (reg:SI 0 %r0))  {*movsi}
>> |     (expr_list:REG_DEAD (reg:SI 0 %r0)
>> |        (nil)))
>>
>> generates
>>
>> |  r1 = r0
>>
>> as opposed to
>>
>> | w1 = w0

[snip...]

>> -   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%0 = *(<smop> *) %1}\");
>> -   *return bpf_output_move (operands, \"{mov\t%0,%1|%0 = %1}\");
>> -   *return bpf_output_move (operands, \"{lddw\t%0,%1|%0 = %1 ll}\");
>> -   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %0 = %1}\");
>> -   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %0 = %1}\");"
>> +   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%w0 = *(<smop> *) %w1}\");
>> +   *return bpf_output_move (operands, \"{mov\t%0,%1|%w0 = %w1}\");
>> +   *return bpf_output_move (operands, \"{lddw\t%0,%1|%w0 = %w1 ll}\");
>> +   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");
>> +   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");"
> Based on our discussion in the BPF GCC meeting this morning
> I think we decided that we should not emit wN for ld/st[x],
> (though gas should accommodate them).

Correct, will do that in v2.

> And the lddw pattern doesn't need to accommodate w regs since
> it is exclusively for immediate values.

Correct, will revert that as well.

> So then the only pattern that needs to change here would be
> the register-register 'mov' insn itself. But the pattern is
> matching a move from MM to MM, i.e. the source and dest must
> already be the same mode. So are you sure we would need to emit
> w reg in this case?

Right, the modes not matching is not the issue, it emitting rN reg when 
both operands are SImode is the problem which I describe in changelog above.

> Maybe I am wrong but to me it seems the behavior in the test
> below (and in the pr) must be more related to extendhisi.

Yes and No :-)

You are right that for the test below, it starts of with extend, sidi2 
to be precise, which gets expanded to 2 shifts (Its on my todo to avoid 
doing this for cpuv4 for better combine outcomes, but anyways that's for 
later).

(insn 8 7 9 2 (set (reg:DI 19 [ _1 ])
         (ashift:DI (subreg:DI (reg:SI 21) 0)
             (const_int 32 [0x20])))
(insn 9 8 10 2 (set (reg:DI 19 [ _1 ])
         (ashiftrt:DI (reg:DI 19 [ _1 ])
             (const_int 32 [0x20])))
(insn 10 9 11 2 (set (reg:SI 1 %r1)
         (subreg/s/u:SI (reg:DI 19 [ _1 ]) 0))

Combine is able to see thru all this to generate insn 14

scanning new insn with uid = 14.
...
allowing combination of insns 8 and 9
deferring deletion of insn with uid = 8.
allowing combination of insns 9 and 10
deferring deletion of insn with uid = 9.
rescanning insn with uid = 10.

(insn 14 5 6 2 (set (reg:SI 23)
         (reg:SI 0 %r0))  {*movsi}

And the -dP annotated output also confirms movsi involvement for the 
actual asm.

#(insn 14 5 11 (set (reg:SI 1 %r1 [23])
#        (reg:SI 0 %r0)) 
"gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c":10:4 35 {*movsi}
#     (expr_list:REG_DEAD (reg:SI 0 %r0)
#        (nil)))
     r1 = r0    # 14    [c=4 l=8]  *movsi/1


Spurred by your comment, I added a variant with "short" ret/args and 
looks like movhi needs even more fixing as it needs to generate wN = 
(s16) wM. I need to see if that needs to fixed in asm pattern or will 
PROMOTE_MODE change later be able to handle that in the core.

Thx,
-Vineet
  
David Faust March 31, 2026, 10:42 p.m. UTC | #5
On 3/30/26 13:57, Vineet Gupta wrote:
> Hi David,
> 
> Thanks for quickly reviewing the patches.
> 
> On 3/30/26 11:45 AM, David Faust wrote:
>>> From: Vineet Gupta <gvineet@fb.com>
> 
> This somehow snuck in, will fix in v2.
> 
>>> movsi currently only generates DImode rN regs, despite RTL being SImode.
>>>
>>> | (insn 14 5 11 (set (reg:SI 1 %r1 [23])
>>> |        (reg:SI 0 %r0))  {*movsi}
>>> |     (expr_list:REG_DEAD (reg:SI 0 %r0)
>>> |        (nil)))
>>>
>>> generates
>>>
>>> |  r1 = r0
>>>
>>> as opposed to
>>>
>>> | w1 = w0
> 
> [snip...]
> 
>>> -   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%0 = *(<smop> *) %1}\");
>>> -   *return bpf_output_move (operands, \"{mov\t%0,%1|%0 = %1}\");
>>> -   *return bpf_output_move (operands, \"{lddw\t%0,%1|%0 = %1 ll}\");
>>> -   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %0 = %1}\");
>>> -   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %0 = %1}\");"
>>> +   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%w0 = *(<smop> *) %w1}\");
>>> +   *return bpf_output_move (operands, \"{mov\t%0,%1|%w0 = %w1}\");
>>> +   *return bpf_output_move (operands, \"{lddw\t%0,%1|%w0 = %w1 ll}\");
>>> +   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");
>>> +   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");"
>> Based on our discussion in the BPF GCC meeting this morning
>> I think we decided that we should not emit wN for ld/st[x],
>> (though gas should accommodate them).
> 
> Correct, will do that in v2.
> 
>> And the lddw pattern doesn't need to accommodate w regs since
>> it is exclusively for immediate values.
> 
> Correct, will revert that as well.
> 
>> So then the only pattern that needs to change here would be
>> the register-register 'mov' insn itself. But the pattern is
>> matching a move from MM to MM, i.e. the source and dest must
>> already be the same mode. So are you sure we would need to emit
>> w reg in this case?
> 
> Right, the modes not matching is not the issue, it emitting rN reg when 
> both operands are SImode is the problem which I describe in changelog above.

Hm...

> this can be a correctness issue where mov needs to zero out the upper bits.

I guess I am missing something but honestly I am confused by this part.
Why would mov need to zero out the upper bits?

I mean, this case is matching only say a (set (reg:SI rD) (reg:SI rS)).
i.e. the source register already has an SImode value.  We aren't counting
on the state of the bits outside that SImode value.  Moving either the
low 32-bits or all 64, doesn't make a difference to that SImode value.

Explicitly zeroing the upper 32 bits, while it may be achieved by using
a _BPF_ mov32 insn, I don't think is an exact match to the semantics of
a simple reg-to-reg move.

I understood from the testcase/pr that you are saying the issue is with
function return/argument passing and extension.  But I don't think this
simple case matching on set reg:MM reg:MM is the proper way to
address that.

> 
>> Maybe I am wrong but to me it seems the behavior in the test
>> below (and in the pr) must be more related to extendhisi.
> 
> Yes and No :-)
> 
> You are right that for the test below, it starts of with extend, sidi2 
> to be precise, which gets expanded to 2 shifts (Its on my todo to avoid 
> doing this for cpuv4 for better combine outcomes, but anyways that's for 
> later).
> 
> (insn 8 7 9 2 (set (reg:DI 19 [ _1 ])
>          (ashift:DI (subreg:DI (reg:SI 21) 0)
>              (const_int 32 [0x20])))
> (insn 9 8 10 2 (set (reg:DI 19 [ _1 ])
>          (ashiftrt:DI (reg:DI 19 [ _1 ])
>              (const_int 32 [0x20])))
> (insn 10 9 11 2 (set (reg:SI 1 %r1)
>          (subreg/s/u:SI (reg:DI 19 [ _1 ]) 0))
> 
> Combine is able to see thru all this to generate insn 14

OK, the value is sign extended to DImode, and then truncated
back to SImode.  The original value in SImode is unchanged...

> 
> scanning new insn with uid = 14.
> ...
> allowing combination of insns 8 and 9
> deferring deletion of insn with uid = 8.
> allowing combination of insns 9 and 10
> deferring deletion of insn with uid = 9.
> rescanning insn with uid = 10.
> 
> (insn 14 5 6 2 (set (reg:SI 23)
>          (reg:SI 0 %r0))  {*movsi}

... so this result from combine is valid; the net effect
is simply to move the SImode value from one reg to another,
keeping SImode.

I'm not sure exactly what is supposed to happen in the test.
The arg is passed as int, i.e. SImode. The origin of that value
is another int, also SImode. Why do the upper bits of the reg need
to be explicitly zeroed?

> 
> And the -dP annotated output also confirms movsi involvement for the 
> actual asm.
> 
> #(insn 14 5 11 (set (reg:SI 1 %r1 [23])
> #        (reg:SI 0 %r0)) 
> "gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c":10:4 35 {*movsi}
> #     (expr_list:REG_DEAD (reg:SI 0 %r0)
> #        (nil)))
>      r1 = r0    # 14    [c=4 l=8]  *movsi/1
> 
> 
> Spurred by your comment, I added a variant with "short" ret/args and 
> looks like movhi needs even more fixing as it needs to generate wN = 
> (s16) wM. I need to see if that needs to fixed in asm pattern or will 
> PROMOTE_MODE change later be able to handle that in the core.
> 
> Thx,
> -Vineet
  
Vineet Gupta March 31, 2026, 11:22 p.m. UTC | #6
On 3/31/26 3:42 PM, David Faust wrote:
>>> So then the only pattern that needs to change here would be
>>> the register-register 'mov' insn itself. But the pattern is
>>> matching a move from MM to MM, i.e. the source and dest must
>>> already be the same mode. So are you sure we would need to emit
>>> w reg in this case?
>> Right, the modes not matching is not the issue, it emitting rN reg when
>> both operands are SImode is the problem which I describe in changelog above.
> Hm...
>
>> this can be a correctness issue where mov needs to zero out the upper bits.
> I guess I am missing something but honestly I am confused by this part.
> Why would mov need to zero out the upper bits?
>
> I mean, this case is matching only say a (set (reg:SI rD) (reg:SI rS)).
> i.e. the source register already has an SImode value.  We aren't counting
> on the state of the bits outside that SImode value.  Moving either the
> low 32-bits or all 64, doesn't make a difference to that SImode value.

I think I'm starting to see where you are coming from.


> Explicitly zeroing the upper 32 bits, while it may be achieved by using
> a _BPF_ mov32 insn, I don't think is an exact match to the semantics of
> a simple reg-to-reg move.

Seems fair.

> I understood from the testcase/pr that you are saying the issue is with
> function return/argument passing and extension.  But I don't think this
> simple case matching on set reg:MM reg:MM is the proper way to
> address that.

Right and that's not the intent, this is not claiming to fix the ABI, 
that's a totally different change, this is just the precursor, at least 
in my mind's eye, but I'm willing to concede that view-point.

So let me step back, all of this originated from ABI change work and 
roughly based on HJ's v1 (not v2), I was starting to see the light. 
There was a grand total of 1 additional selftest failure which was 
attach_probe. The manually reduced test (attached here) showed the 
failing case (is essentially beefier version of the test in this patch, 
but same in spirit)

So with the ABI hooks and everything in place, what we were seeing was 
following RTL stream

(call_insn 7 6 8 2 (set (reg:SI 0 %r0)
         (call (mem:DI (symbol_ref:DI ("bpf_copy_from_user_str") [flags 
0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>) [0 
bpf_copy_from_user_str S8 A64])
             (const_int 0 [0])))
      (expr_list:REG_CALL_DECL (symbol_ref:DI ("bpf_copy_from_user_str") 
[flags 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>)
         (nil))
     (expr_list:SI (use (reg:SI 1 %r1))
         (nil)))

(insn 8 7 9 2 (set (reg/v:SI 19 [ ret ])
         (reg:SI 0 %r0))
      (nil))

(jump_insn 9 8 10 2 (set (pc)
       (if_then_else (ne (reg/v:SI 19 [ ret ])
                 (const_int 4 [0x4]))
             (label_ref:DI 33)
             (pc))) {*branch_on_si}
  -> 33)

(insn 11 10 12 4 (set (reg:SI 1 %r1)
         (const_int 4 [0x4]))
      (nil))

cse1 equivalence logic deduces that reg 19 = 4 = SI, guaranteed by 
control flow, so we end up with

(insn 11 10 12 3 (set (reg:SI 1 %r1)
        (reg/v:SI 19 [ ret ])) {*movsi}
      (expr_list:REG_EQUAL (const_int 4 [0x4])
         (nil)))

And this carries all the way to the end, which if generates rN is would 
be problematic and was the reason for this change.

>>> Maybe I am wrong but to me it seems the behavior in the test
>>> below (and in the pr) must be more related to extendhisi.
>> Yes and No :-)
>>
>> You are right that for the test below, it starts of with extend, sidi2
>> to be precise, which gets expanded to 2 shifts (Its on my todo to avoid
>> doing this for cpuv4 for better combine outcomes, but anyways that's for
>> later).
>>
>> (insn 8 7 9 2 (set (reg:DI 19 [ _1 ])
>>           (ashift:DI (subreg:DI (reg:SI 21) 0)
>>               (const_int 32 [0x20])))
>> (insn 9 8 10 2 (set (reg:DI 19 [ _1 ])
>>           (ashiftrt:DI (reg:DI 19 [ _1 ])
>>               (const_int 32 [0x20])))
>> (insn 10 9 11 2 (set (reg:SI 1 %r1)
>>           (subreg/s/u:SI (reg:DI 19 [ _1 ]) 0))
>>
>> Combine is able to see thru all this to generate insn 14
> OK, the value is sign extended to DImode, and then truncated
> back to SImode.  The original value in SImode is unchanged...
>
>> scanning new insn with uid = 14.
>> ...
>> allowing combination of insns 8 and 9
>> deferring deletion of insn with uid = 8.
>> allowing combination of insns 9 and 10
>> deferring deletion of insn with uid = 9.
>> rescanning insn with uid = 10.
>>
>> (insn 14 5 6 2 (set (reg:SI 23)
>>           (reg:SI 0 %r0))  {*movsi}
> ... so this result from combine is valid; the net effect
> is simply to move the SImode value from one reg to another,
> keeping SImode.
>
> I'm not sure exactly what is supposed to happen in the test.
> The arg is passed as int, i.e. SImode. The origin of that value
> is another int, also SImode. Why do the upper bits of the reg need
> to be explicitly zeroed?

You may be right that test case is more ABI leaning (and this pinskia 
said something on PR), but even with the ABI change, this adjustment 
will still be needed IMO.

Thx,
-Vineet
  
Jose E. Marchesi April 1, 2026, 5:50 p.m. UTC | #7
> On 3/31/26 3:42 PM, David Faust wrote:
>>>> So then the only pattern that needs to change here would be
>>>> the register-register 'mov' insn itself. But the pattern is
>>>> matching a move from MM to MM, i.e. the source and dest must
>>>> already be the same mode. So are you sure we would need to emit
>>>> w reg in this case?
>>> Right, the modes not matching is not the issue, it emitting rN reg when
>>> both operands are SImode is the problem which I describe in changelog above.
>> Hm...
>>
>>> this can be a correctness issue where mov needs to zero out the upper bits.
>> I guess I am missing something but honestly I am confused by this part.
>> Why would mov need to zero out the upper bits?
>>
>> I mean, this case is matching only say a (set (reg:SI rD) (reg:SI rS)).
>> i.e. the source register already has an SImode value.  We aren't counting
>> on the state of the bits outside that SImode value.  Moving either the
>> low 32-bits or all 64, doesn't make a difference to that SImode value.
>
> I think I'm starting to see where you are coming from.
>
>
>> Explicitly zeroing the upper 32 bits, while it may be achieved by using
>> a _BPF_ mov32 insn, I don't think is an exact match to the semantics of
>> a simple reg-to-reg move.
>
> Seems fair.
>
>> I understood from the testcase/pr that you are saying the issue is with
>> function return/argument passing and extension.  But I don't think this
>> simple case matching on set reg:MM reg:MM is the proper way to
>> address that.
>
> Right and that's not the intent, this is not claiming to fix the ABI,
> that's a totally different change, this is just the precursor, at
> least in my mind's eye, but I'm willing to concede that view-point.
>
> So let me step back, all of this originated from ABI change work and
> roughly based on HJ's v1 (not v2), I was starting to see the
> light. There was a grand total of 1 additional selftest failure which
> was attach_probe. The manually reduced test (attached here) showed the
> failing case (is essentially beefier version of the test in this
> patch, but same in spirit)
>
> So with the ABI hooks and everything in place, what we were seeing was
> following RTL stream
>
> (call_insn 7 6 8 2 (set (reg:SI 0 %r0)
>         (call (mem:DI (symbol_ref:DI ("bpf_copy_from_user_str") [flags
> 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>) [0
> bpf_copy_from_user_str S8 A64])
>             (const_int 0 [0])))
>      (expr_list:REG_CALL_DECL (symbol_ref:DI
> ("bpf_copy_from_user_str") [flags 0x41]  <function_decl 0x7f10c1547300
> bpf_copy_from_user_str>)
>         (nil))
>     (expr_list:SI (use (reg:SI 1 %r1))
>         (nil)))
>
> (insn 8 7 9 2 (set (reg/v:SI 19 [ ret ])
>         (reg:SI 0 %r0))
>      (nil))
>
> (jump_insn 9 8 10 2 (set (pc)
>       (if_then_else (ne (reg/v:SI 19 [ ret ])
>                 (const_int 4 [0x4]))
>             (label_ref:DI 33)
>             (pc))) {*branch_on_si}
>  -> 33)
>
> (insn 11 10 12 4 (set (reg:SI 1 %r1)
>         (const_int 4 [0x4]))
>      (nil))
>
> cse1 equivalence logic deduces that reg 19 = 4 = SI, guaranteed by
> control flow, so we end up with
>
> (insn 11 10 12 3 (set (reg:SI 1 %r1)
>        (reg/v:SI 19 [ ret ])) {*movsi}
>      (expr_list:REG_EQUAL (const_int 4 [0x4])
>         (nil)))
>
> And this carries all the way to the end, which if generates rN is
> would be problematic and was the reason for this change.
>
>>>> Maybe I am wrong but to me it seems the behavior in the test
>>>> below (and in the pr) must be more related to extendhisi.
>>> Yes and No :-)
>>>
>>> You are right that for the test below, it starts of with extend, sidi2
>>> to be precise, which gets expanded to 2 shifts (Its on my todo to avoid
>>> doing this for cpuv4 for better combine outcomes, but anyways that's for
>>> later).
>>>
>>> (insn 8 7 9 2 (set (reg:DI 19 [ _1 ])
>>>           (ashift:DI (subreg:DI (reg:SI 21) 0)
>>>               (const_int 32 [0x20])))
>>> (insn 9 8 10 2 (set (reg:DI 19 [ _1 ])
>>>           (ashiftrt:DI (reg:DI 19 [ _1 ])
>>>               (const_int 32 [0x20])))
>>> (insn 10 9 11 2 (set (reg:SI 1 %r1)
>>>           (subreg/s/u:SI (reg:DI 19 [ _1 ]) 0))
>>>
>>> Combine is able to see thru all this to generate insn 14
>> OK, the value is sign extended to DImode, and then truncated
>> back to SImode.  The original value in SImode is unchanged...
>>
>>> scanning new insn with uid = 14.
>>> ...
>>> allowing combination of insns 8 and 9
>>> deferring deletion of insn with uid = 8.
>>> allowing combination of insns 9 and 10
>>> deferring deletion of insn with uid = 9.
>>> rescanning insn with uid = 10.
>>>
>>> (insn 14 5 6 2 (set (reg:SI 23)
>>>           (reg:SI 0 %r0))  {*movsi}
>> ... so this result from combine is valid; the net effect
>> is simply to move the SImode value from one reg to another,
>> keeping SImode.
>>
>> I'm not sure exactly what is supposed to happen in the test.
>> The arg is passed as int, i.e. SImode. The origin of that value
>> is another int, also SImode. Why do the upper bits of the reg need
>> to be explicitly zeroed?
>
> You may be right that test case is more ABI leaning (and this pinskia
> said something on PR), but even with the ABI change, this adjustment
> will still be needed IMO.

Regardless of the context, ABI conventions, or any other consideration,
a move (set (reg:SI) (reg:SI)) should _not_ require mov32, because the
source register is supposed to already have the upper bits zeroed, or it
would not contain a value of mode SI.

The different movement/extension/etc insns conform a sort of a closure:
all the possible ways to put a SI value into a register, be it from
register or memory or immediate, must result into properly handled upper
32/48/56 bits.  Moves from registers and immediates with values of
different moves involve expansions of extend* and friends, moves from
memory involves ldx and friends, all of which handle upper unused bits
properly.  Given this, moves from registers of _the same_ mode should
not require anything extra than just copying all the 64 bits, i.e. mov.

So I think that if you can contrive an example in which you would need
emitting mov32 for such a move then the problem is somewhere else, and
you probably should ask yourself, why are the source register's contents
invalid.

You probably want to avoid optimizing while investigating this, because
otherwise a truncation + move will very likely be conflated into a
single perhaps-truncating-move and you will not be able to determine
what the source of the problem is..
  
Jose E. Marchesi April 1, 2026, 6:08 p.m. UTC | #8
> diff --git a/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c b/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
> new file mode 100644
> index 000000000000..6d0a4f280cd7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
> @@ -0,0 +1,14 @@
> +/* Return value of first call is arg to second call.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mcpu=v4" } */
> +
> +int ret_int ();
> +void arg_int (int);
> +
> +void foo () {
> +   arg_int(ret_int ());
> +}
> +
> +/* { dg-final { scan-assembler-not {r1 = r0} } } */
> +/* { dg-final { scan-assembler-times {w1 = w0} 1 } } */

As discussed in BZ 124688, these expected results are incorrect.  There
is absolutely nothing wrong with the mov.

As things stand now, it is the callee that converts the value returned,
not the caller.  Once we change the ABI to match llvm a conversion will
happen after the call, but even then a mov will be still the right
instruction to use to move a value of any mode from register to
register, provided the mode is not changed.
  
David Faust April 1, 2026, 6:28 p.m. UTC | #9
On 3/31/26 16:22, Vineet Gupta wrote:
> 
> On 3/31/26 3:42 PM, David Faust wrote:
>>>> So then the only pattern that needs to change here would be
>>>> the register-register 'mov' insn itself. But the pattern is
>>>> matching a move from MM to MM, i.e. the source and dest must
>>>> already be the same mode. So are you sure we would need to emit
>>>> w reg in this case?
>>> Right, the modes not matching is not the issue, it emitting rN reg when
>>> both operands are SImode is the problem which I describe in changelog above.
>> Hm...
>>
>>> this can be a correctness issue where mov needs to zero out the upper bits.
>> I guess I am missing something but honestly I am confused by this part.
>> Why would mov need to zero out the upper bits?
>>
>> I mean, this case is matching only say a (set (reg:SI rD) (reg:SI rS)).
>> i.e. the source register already has an SImode value.  We aren't counting
>> on the state of the bits outside that SImode value.  Moving either the
>> low 32-bits or all 64, doesn't make a difference to that SImode value.
> 
> I think I'm starting to see where you are coming from.
> 
> 
>> Explicitly zeroing the upper 32 bits, while it may be achieved by using
>> a _BPF_ mov32 insn, I don't think is an exact match to the semantics of
>> a simple reg-to-reg move.
> 
> Seems fair.
> 
>> I understood from the testcase/pr that you are saying the issue is with
>> function return/argument passing and extension.  But I don't think this
>> simple case matching on set reg:MM reg:MM is the proper way to
>> address that.> 
> Right and that's not the intent, this is not claiming to fix the ABI, 
> that's a totally different change, this is just the precursor, at least 
> in my mind's eye, but I'm willing to concede that view-point.
> 
> So let me step back, all of this originated from ABI change work and 
> roughly based on HJ's v1 (not v2), I was starting to see the light. 
> There was a grand total of 1 additional selftest failure which was 
> attach_probe. The manually reduced test (attached here) showed the 
> failing case (is essentially beefier version of the test in this patch, 
> but same in spirit)

Pasting here the attached test for reference:

> /* Extracted from BPF selftests attach_probe: verifier failure.
>    Return value of first call is reused as argument of second.
>    It either needs to be explicitly truncated to s32 or w regs need
>    to be used.  */
> /* { dg-do compile } */
> /* { dg-options "-O2 -mcpu=v4" } */
> 
> extern int bpf_copy_from_user_str(int dst__sz);
> 
> _Bool verify_sleepable_user_copy_str(void)
> {
>  int ret;
>  char data_short_pad[4];
> 
>  ret = bpf_copy_from_user_str(sizeof(data_short_pad));
>  if (ret != 4)
>   return false;
> 
>  /* ret above is guaranteed to be 4, same as sizeof.  */
>  ret = bpf_copy_from_user_str(sizeof(data_short_pad));
>  if (ret != 4)
>   return false;
> 
>   return true;
> }
> 
> /* Don't generate 64-bit mov for arg, do 32-bit zero extended mov.  */
> /* { dg-final { scan-assembler-not {r1 = r0} } } */
> /* { dg-final { scan-assembler-not {r1 = \(s32\) r0} } } */
> /* { dg-final { scan-assembler-times {w1 = w0} 1 } } */

I'm confused by the comment at the top.

First:

>    Return value of first call is reused as argument of second.

Where? The only calls I see are identical:
  ret = bpf_copy_from_user_str (sizeof (data_short_pad));

The return value is not reused as an argument.

Secondly:

>    It either needs to be explicitly truncated to s32 or w regs need
>    to be used.  */
Why?  Assuming the value were actually being reused, we would be
taking an int return value and passing it as an int argument to a
function taking an int. There is no conversion needed.

> 
> So with the ABI hooks and everything in place, what we were seeing was 
> following RTL stream

In the RTL below it's clear that the values are all being treated
in SImode:

> 
> (call_insn 7 6 8 2 (set (reg:SI 0 %r0)

So the call return value is SImode.

>          (call (mem:DI (symbol_ref:DI ("bpf_copy_from_user_str") [flags 
> 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>) [0 
> bpf_copy_from_user_str S8 A64])
>              (const_int 0 [0])))
>       (expr_list:REG_CALL_DECL (symbol_ref:DI ("bpf_copy_from_user_str") 
> [flags 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>)
>          (nil))
>      (expr_list:SI (use (reg:SI 1 %r1))

 (use (reg:SI 1 %r1)) : the arg is passed SI mode

>          (nil)))
> 
> (insn 8 7 9 2 (set (reg/v:SI 19 [ ret ])
>          (reg:SI 0 %r0))

Return value copied to pseudo r19 in SImode to be used for
the comparison.

>       (nil))
> 
> (jump_insn 9 8 10 2 (set (pc)
>        (if_then_else (ne (reg/v:SI 19 [ ret ])
>                  (const_int 4 [0x4]))
>              (label_ref:DI 33)
>              (pc))) {*branch_on_si}
>   -> 33)

> > (insn 11 10 12 4 (set (reg:SI 1 %r1)
>          (const_int 4 [0x4]))
>       (nil))
> 
> cse1 equivalence logic deduces that reg 19 = 4 = SI, guaranteed by 
> control flow, so we end up with

Note that r19 is SImode a priori, to match the mode of the value from %r0.
It is not SI mode as a result of cse1 determining that it holds 4.

> 
> (insn 11 10 12 3 (set (reg:SI 1 %r1)
>         (reg/v:SI 19 [ ret ])) {*movsi}
>       (expr_list:REG_EQUAL (const_int 4 [0x4])
>          (nil)))

> And this carries all the way to the end, which if generates rN is would 
> be problematic and was the reason for this change.

And again, what precisely is problematic about generating rN = rM
for this set?
If %r1 is subsequently used as the int argument to a function call,
then it is valid as is.
If it would be passed as a 'unsigned long' or something else requiring
a conversion then we would rather have e.g. a zero_extend:DI to do that,
not a simple set.


> 
>>>> Maybe I am wrong but to me it seems the behavior in the test
>>>> below (and in the pr) must be more related to extendhisi.
>>> Yes and No :-)
>>>
>>> You are right that for the test below, it starts of with extend, sidi2
>>> to be precise, which gets expanded to 2 shifts (Its on my todo to avoid
>>> doing this for cpuv4 for better combine outcomes, but anyways that's for
>>> later).
>>>
>>> (insn 8 7 9 2 (set (reg:DI 19 [ _1 ])
>>>           (ashift:DI (subreg:DI (reg:SI 21) 0)
>>>               (const_int 32 [0x20])))
>>> (insn 9 8 10 2 (set (reg:DI 19 [ _1 ])
>>>           (ashiftrt:DI (reg:DI 19 [ _1 ])
>>>               (const_int 32 [0x20])))
>>> (insn 10 9 11 2 (set (reg:SI 1 %r1)
>>>           (subreg/s/u:SI (reg:DI 19 [ _1 ]) 0))
>>>
>>> Combine is able to see thru all this to generate insn 14
>> OK, the value is sign extended to DImode, and then truncated
>> back to SImode.  The original value in SImode is unchanged...
>>
>>> scanning new insn with uid = 14.
>>> ...
>>> allowing combination of insns 8 and 9
>>> deferring deletion of insn with uid = 8.
>>> allowing combination of insns 9 and 10
>>> deferring deletion of insn with uid = 9.
>>> rescanning insn with uid = 10.
>>>
>>> (insn 14 5 6 2 (set (reg:SI 23)
>>>           (reg:SI 0 %r0))  {*movsi}
>> ... so this result from combine is valid; the net effect
>> is simply to move the SImode value from one reg to another,
>> keeping SImode.
>>
>> I'm not sure exactly what is supposed to happen in the test.
>> The arg is passed as int, i.e. SImode. The origin of that value
>> is another int, also SImode. Why do the upper bits of the reg need
>> to be explicitly zeroed?
> 
> You may be right that test case is more ABI leaning (and this pinskia 
> said something on PR), but even with the ABI change, this adjustment 
> will still be needed IMO.
> 
> Thx,
> -Vineet
  
Vineet Gupta April 2, 2026, 12:02 a.m. UTC | #10
On 4/1/26 11:28 AM, David Faust wrote:
>> attach_probe. The manually reduced test (attached here) showed the
>> failing case (is essentially beefier version of the test in this patch,
>> but same in spirit)
> Pasting here the attached test for reference:
>
>> /* Extracted from BPF selftests attach_probe: verifier failure.
>>     Return value of first call is reused as argument of second.
>>     It either needs to be explicitly truncated to s32 or w regs need
>>     to be used.  */
>> /* { dg-do compile } */
>> /* { dg-options "-O2 -mcpu=v4" } */
>>
>> extern int bpf_copy_from_user_str(int dst__sz);
>>
>> _Bool verify_sleepable_user_copy_str(void)
>> {
>>   int ret;
>>   char data_short_pad[4];
>>
>>   ret = bpf_copy_from_user_str(sizeof(data_short_pad));
>>   if (ret != 4)
>>    return false;
>>
>>   /* ret above is guaranteed to be 4, same as sizeof.  */
>>   ret = bpf_copy_from_user_str(sizeof(data_short_pad));
>>   if (ret != 4)
>>    return false;
>>
>>    return true;
>> }
>>
>> /* Don't generate 64-bit mov for arg, do 32-bit zero extended mov.  */
>> /* { dg-final { scan-assembler-not {r1 = r0} } } */
>> /* { dg-final { scan-assembler-not {r1 = \(s32\) r0} } } */
>> /* { dg-final { scan-assembler-times {w1 = w0} 1 } } */
> I'm confused by the comment at the top.
>
> First:
>
>>     Return value of first call is reused as argument of second.
> Where? The only calls I see are identical:
>    ret = bpf_copy_from_user_str (sizeof (data_short_pad));
>
> The return value is not reused as an argument.

I meant semantically. At the time of second call, @ret from first call 
is guaranteed to be 4 by virtue of control flow.
And sizeof (data_short_pad) happens to be 4 as well, hence the reason 
gcc:cse1 decides to reuse the reg/value for ret as arg for second call.

> Secondly:
>
>>     It either needs to be explicitly truncated to s32 or w regs need
>>     to be used.  */
> Why?  Assuming the value were actually being reused, we would be
> taking an int return value and passing it as an int argument to a
> function taking an int. There is no conversion needed.

Correct they are all int, so yes no conversion required, but we are 
forgetting that big boss "verifier" is watching everything and causes 
issues, see below...

>> So with the ABI hooks and everything in place, what we were seeing was
>> following RTL stream
> In the RTL below it's clear that the values are all being treated
> in SImode:
>
>> (call_insn 7 6 8 2 (set (reg:SI 0 %r0)
> So the call return value is SImode.
>
>>           (call (mem:DI (symbol_ref:DI ("bpf_copy_from_user_str") [flags
>> 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>) [0
>> bpf_copy_from_user_str S8 A64])
>>               (const_int 0 [0])))
>>        (expr_list:REG_CALL_DECL (symbol_ref:DI ("bpf_copy_from_user_str")
>> [flags 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>)
>>           (nil))
>>       (expr_list:SI (use (reg:SI 1 %r1))
>   (use (reg:SI 1 %r1)) : the arg is passed SI mode
>
>>           (nil)))
>>
>> (insn 8 7 9 2 (set (reg/v:SI 19 [ ret ])
>>           (reg:SI 0 %r0))
> Return value copied to pseudo r19 in SImode to be used for
> the comparison.

Yeah everything is SI mode and its fine from gcc's codegen point of view.

>
>>        (nil))
>>
>> (jump_insn 9 8 10 2 (set (pc)
>>         (if_then_else (ne (reg/v:SI 19 [ ret ])
>>                   (const_int 4 [0x4]))
>>               (label_ref:DI 33)
>>               (pc))) {*branch_on_si}
>>    -> 33)
>>> (insn 11 10 12 4 (set (reg:SI 1 %r1)
>>           (const_int 4 [0x4]))
>>        (nil))
>>
>> cse1 equivalence logic deduces that reg 19 = 4 = SI, guaranteed by
>> control flow, so we end up with
> Note that r19 is SImode a priori, to match the mode of the value from %r0.
> It is not SI mode as a result of cse1 determining that it holds 4.

Right.

>> (insn 11 10 12 3 (set (reg:SI 1 %r1)
>>          (reg/v:SI 19 [ ret ])) {*movsi}
>>        (expr_list:REG_EQUAL (const_int 4 [0x4])
>>           (nil)))
>> And this carries all the way to the end, which if generates rN is would
>> be problematic and was the reason for this change.
> And again, what precisely is problematic about generating rN = rM
> for this set?

Verifier trips up, #12 attach _probe 
(tools/testing/selftests/bpf/progs/test_attach_probe.c)

> If %r1 is subsequently used as the int argument to a function call,
> then it is valid as is.
> If it would be passed as a 'unsigned long' or something else requiring
> a conversion then we would rather have e.g. a zero_extend:DI to do that,
> not a simple set.

The rN based move copies the full 64-bits, but the subsequent arg is 
int: verifier wants to make sure top bits are cleared, which is not 
guaranteed given the current codegen.

Full log below, relevant snippets extracted:

22: (85) call bpf_copy_from_user_str#85440    ; R0=scalar()
...
24: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
...
29: (85) call bpf_strncmp#182         ; R0=scalar()
30: (55) if r0 != 0x0 goto pc+1       ; R0=0
31: (16) if w7 == 0x4 goto pc+2 34: R0=0 R6=map_value(map=test_att.bss,ks=4,vs=56) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R10=fp0 fp-16=???????m fp-24=mmmmmmmm fp-88=????mmmm
...

36: (bf) r2 = r7                      ; R2=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000))
...
39: (85) call bpf_copy_from_user_str#85440
R2 min value is negative, either use unsigned or 'var &= const'

Now, thinking about it, part of the problem could also be #31 using w7 
reg for comparison. If it had used the rN reg too, maybe verifier would 
be confident about the top bits too and won't trip up. But that again 
would point to impedance mismatch between movsi (only using rN regs) and 
branch_on_si, which would use wN regs.

I did a quick hack to have branch_on_ generate rN forms (there's no way 
this can be productized, since we might have even more issues with 
verifier dealing with 64-bits now.

-  case EQ: return  "{jeq<msuffix>\t%0,%1,%2|if %w0 == %w1 goto %2}"; break;
+  case EQ: return  "{jeq<msuffix>\t%0,%1,%2|if %0 == %1 goto %2}"; break;

And now it does generate rN for branch n compare

verify_sleepable_user_copy_str:
     r1 = 4
     call    bpf_copy_from_user_str
     if r0 == 4 goto .L5
     r0 = 0
     exit
.L5:
     r1 = r0
     call    bpf_copy_from_user_str
     w0 ^= 4
     w0 = w0
     r0 += -1
     r0 >>= 63
     exit

But this is besides the point, just a curiosity, since I'm pretty sure 
verifier will hit way more false positives with rN regs in branch and 
compare.


Full attach probe log (w/o the above hack) and code to generate it on my 
gh portal [1]
[1] git@github.com:vineetgarc/gcc.git  #bpf/promotion-n-abi


Thx,
-Vineet

--- #12      attach_probe:
libbpf: elf: skipping unrecognized data section(17) .comment
test_attach_probe:PASS:skel_open 0 nsec
libbpf: prog 'handle_uprobe_byname3_sleepable': BPF program load failed: -EACCES
libbpf: prog 'handle_uprobe_byname3_sleepable': -- BEGIN PROG LOAD LOG --
0: R1=ctx() R10=fp0
;  @ vmlinux.h:49637
0: (b7) r2 = 9                        ; R2=9
;  @ test_attach_probe.c:92
1: (18) r6 = 0xffa00000017cd000       ; R6=map_value(map=test_att.bss,ks=4,vs=56)
3: (bf) r1 = r10                      ; R1=fp0 R10=fp0
4: (79) r3 = *(u64 *)(r6 +0)          ; R3=scalar() R6=map_value(map=test_att.bss,ks=4,vs=56)
5: (07) r1 += -24                     ; R1=fp-24
6: (85) call bpf_copy_from_user#148   ; R0=scalar() fp-16=???????m fp-24=mmmmmmmm
;  @ test_attach_probe.c:93
7: (bf) r1 = r10                      ; R1=fp0 R10=fp0
8: (18) r3 = 0xff11000113510930       ; R3=map_value(map=.rodata.str1.1,ks=4,vs=10)
10: (b7) r2 = 9                       ; R2=9
11: (07) r1 += -24                    ; R1=fp-24
12: (85) call bpf_strncmp#182         ; R0=scalar()
;  @ test_attach_probe.c:147
13: (55) if r0 != 0x0 goto pc+3       ; R0=0
;  @ test_attach_probe.c:148
14: (18) r0 = 0xffa00000017cd01c      ; R0=map_value(map=test_att.bss,ks=4,vs=56,imm=28)
16: (62) *(u32 *)(r0 +0) = 9          ; R0=map_value(map=test_att.bss,ks=4,vs=56,imm=28)
;  @ test_attach_probe.c:105
17: (79) r3 = *(u64 *)(r6 +0)         ; R3=scalar() R6=map_value(map=test_att.bss,ks=4,vs=56)
18: (b7) r2 = 4                       ; R2=4
19: (b7) r4 = 0                       ; R4=0
20: (bf) r1 = r10                     ; R1=fp0 R10=fp0
21: (07) r1 += -88                    ; R1=fp-88
22: (85) call bpf_copy_from_user_str#85440    ; R0=scalar()
;  @ test_attach_probe.c:107
23: (bf) r1 = r10                     ; R1=fp0 R10=fp0
;  @ test_attach_probe.c:105
24: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
;  @ test_attach_probe.c:107
25: (18) r3 = 0xff11000113514d30      ; R3=map_value(map=test_att.rodata,ks=4,vs=16)
27: (b7) r2 = 4                       ; R2=4
28: (07) r1 += -88                    ; R1=fp-88
29: (85) call bpf_strncmp#182         ; R0=scalar()
30: (55) if r0 != 0x0 goto pc+1       ; R0=0
31: (16) if w7 == 0x4 goto pc+2 34: R0=0 R6=map_value(map=test_att.bss,ks=4,vs=56) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R10=fp0 fp-16=???????m fp-24=mmmmmmmm fp-88=????mmmm
;  @ test_attach_probe.c:110
34: (b7) r4 = 1                       ; R4=1
35: (79) r3 = *(u64 *)(r6 +0)         ; R3=scalar() R6=map_value(map=test_att.bss,ks=4,vs=56)
36: (bf) r2 = r7                      ; R2=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000))
37: (bf) r1 = r10                     ; R1=fp0 R10=fp0
38: (07) r1 += -80                    ; R1=fp-80
39: (85) call bpf_copy_from_user_str#85440
R2 min value is negative, either use unsigned or 'var &= const'
arg#0 arg#1 memory, len pair leads to invalid memory access
processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 0
-- END PROG LOAD LOG --
libbpf: prog 'handle_uprobe_byname3_sleepable': failed to load: -EACCES
libbpf: failed to load object 'test_attach_probe'
libbpf: failed to load BPF skeleton 'test_attach_probe': -EACCES
test_attach_probe:FAIL:skel_load unexpected error: -13 (errno 13)
test_attach_probe:PASS:uprobe_ref_ctr_cleanup 0 nsec
#12      attach_probe:FAIL
  
David Faust April 2, 2026, 5:02 p.m. UTC | #11
On 4/1/26 17:02, Vineet Gupta wrote:
> On 4/1/26 11:28 AM, David Faust wrote:
>>> attach_probe. The manually reduced test (attached here) showed the
>>> failing case (is essentially beefier version of the test in this patch,
>>> but same in spirit)
>> Pasting here the attached test for reference:
>>
>>> /* Extracted from BPF selftests attach_probe: verifier failure.
>>>     Return value of first call is reused as argument of second.
>>>     It either needs to be explicitly truncated to s32 or w regs need
>>>     to be used.  */
>>> /* { dg-do compile } */
>>> /* { dg-options "-O2 -mcpu=v4" } */
>>>
>>> extern int bpf_copy_from_user_str(int dst__sz);
>>>
>>> _Bool verify_sleepable_user_copy_str(void)
>>> {
>>>   int ret;
>>>   char data_short_pad[4];
>>>
>>>   ret = bpf_copy_from_user_str(sizeof(data_short_pad));
>>>   if (ret != 4)
>>>    return false;
>>>
>>>   /* ret above is guaranteed to be 4, same as sizeof.  */
>>>   ret = bpf_copy_from_user_str(sizeof(data_short_pad));
>>>   if (ret != 4)
>>>    return false;
>>>
>>>    return true;
>>> }
>>>
>>> /* Don't generate 64-bit mov for arg, do 32-bit zero extended mov.  */
>>> /* { dg-final { scan-assembler-not {r1 = r0} } } */
>>> /* { dg-final { scan-assembler-not {r1 = \(s32\) r0} } } */
>>> /* { dg-final { scan-assembler-times {w1 = w0} 1 } } */
>> I'm confused by the comment at the top.
>>
>> First:
>>
>>>     Return value of first call is reused as argument of second.
>> Where? The only calls I see are identical:
>>    ret = bpf_copy_from_user_str (sizeof (data_short_pad));
>>
>> The return value is not reused as an argument.
> 
> I meant semantically. At the time of second call, @ret from first call 
> is guaranteed to be 4 by virtue of control flow.
> And sizeof (data_short_pad) happens to be 4 as well, hence the reason 
> gcc:cse1 decides to reuse the reg/value for ret as arg for second call.
> 
>> Secondly:
>>
>>>     It either needs to be explicitly truncated to s32 or w regs need
>>>     to be used.  */
>> Why?  Assuming the value were actually being reused, we would be
>> taking an int return value and passing it as an int argument to a
>> function taking an int. There is no conversion needed.
> 
> Correct they are all int, so yes no conversion required, but we are 
> forgetting that big boss "verifier" is watching everything and causes 
> issues, see below...
> 
>>> So with the ABI hooks and everything in place, what we were seeing was
>>> following RTL stream
>> In the RTL below it's clear that the values are all being treated
>> in SImode:
>>
>>> (call_insn 7 6 8 2 (set (reg:SI 0 %r0)
>> So the call return value is SImode.
>>
>>>           (call (mem:DI (symbol_ref:DI ("bpf_copy_from_user_str") [flags
>>> 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>) [0
>>> bpf_copy_from_user_str S8 A64])
>>>               (const_int 0 [0])))
>>>        (expr_list:REG_CALL_DECL (symbol_ref:DI ("bpf_copy_from_user_str")
>>> [flags 0x41]  <function_decl 0x7f10c1547300 bpf_copy_from_user_str>)
>>>           (nil))
>>>       (expr_list:SI (use (reg:SI 1 %r1))
>>   (use (reg:SI 1 %r1)) : the arg is passed SI mode
>>
>>>           (nil)))
>>>
>>> (insn 8 7 9 2 (set (reg/v:SI 19 [ ret ])
>>>           (reg:SI 0 %r0))
>> Return value copied to pseudo r19 in SImode to be used for
>> the comparison.
> 
> Yeah everything is SI mode and its fine from gcc's codegen point of view.
> 
>>
>>>        (nil))
>>>
>>> (jump_insn 9 8 10 2 (set (pc)
>>>         (if_then_else (ne (reg/v:SI 19 [ ret ])
>>>                   (const_int 4 [0x4]))
>>>               (label_ref:DI 33)
>>>               (pc))) {*branch_on_si}
>>>    -> 33)
>>>> (insn 11 10 12 4 (set (reg:SI 1 %r1)
>>>           (const_int 4 [0x4]))
>>>        (nil))
>>>
>>> cse1 equivalence logic deduces that reg 19 = 4 = SI, guaranteed by
>>> control flow, so we end up with
>> Note that r19 is SImode a priori, to match the mode of the value from %r0.
>> It is not SI mode as a result of cse1 determining that it holds 4.
> 
> Right.
> 
>>> (insn 11 10 12 3 (set (reg:SI 1 %r1)
>>>          (reg/v:SI 19 [ ret ])) {*movsi}
>>>        (expr_list:REG_EQUAL (const_int 4 [0x4])
>>>           (nil)))
>>> And this carries all the way to the end, which if generates rN is would
>>> be problematic and was the reason for this change.
>> And again, what precisely is problematic about generating rN = rM
>> for this set?
> 
> Verifier trips up, #12 attach _probe 
> (tools/testing/selftests/bpf/progs/test_attach_probe.c)
> 
>> If %r1 is subsequently used as the int argument to a function call,
>> then it is valid as is.
>> If it would be passed as a 'unsigned long' or something else requiring
>> a conversion then we would rather have e.g. a zero_extend:DI to do that,
>> not a simple set.
> 
> The rN based move copies the full 64-bits, but the subsequent arg is 
> int: verifier wants to make sure top bits are cleared, which is not 
> guaranteed given the current codegen.

Ahhhhhh, so it is a verifier issue! Not a code correctness issue.
OK, that makes more sense.

> 
> Full log below, relevant snippets extracted:
> 
> 22: (85) call bpf_copy_from_user_str#85440    ; R0=scalar()
> ...
> 24: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
> ...
> 29: (85) call bpf_strncmp#182         ; R0=scalar()
> 30: (55) if r0 != 0x0 goto pc+1       ; R0=0
> 31: (16) if w7 == 0x4 goto pc+2 34: R0=0 R6=map_value(map=test_att.bss,ks=4,vs=56) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R10=fp0 fp-16=???????m fp-24=mmmmmmmm fp-88=????mmmm
> ...
> 
> 36: (bf) r2 = r7                      ; R2=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000))
> ...
> 39: (85) call bpf_copy_from_user_str#85440
> R2 min value is negative, either use unsigned or 'var &= const'
> 
> Now, thinking about it, part of the problem could also be #31 using w7 
> reg for comparison. If it had used the rN reg too, maybe verifier would 
> be confident about the top bits too and won't trip up. But that again 
> would point to impedance mismatch between movsi (only using rN regs) and 
> branch_on_si, which would use wN regs.

Hmm, looking at these verifier logs (thanks for providing), I think
the issue may stem from the bpf_copy_from_user_str call.
Ultimately the return value propagates r0 -> r7 -> r2.

BUT, I am confused.  This test passes on my system compiled with
gcc trunk.  After the call, we emit a movs32 when moving the return value
to a different register, which is enough for the verifier to know the
value is restricted to signed integer range.

Are you seeing this with trunk or with the PROMOTE_MODE or some other
change applied?

> 
> I did a quick hack to have branch_on_ generate rN forms (there's no way 
> this can be productized, since we might have even more issues with 
> verifier dealing with 64-bits now.
> 
> -  case EQ: return  "{jeq<msuffix>\t%0,%1,%2|if %w0 == %w1 goto %2}"; break;
> +  case EQ: return  "{jeq<msuffix>\t%0,%1,%2|if %0 == %1 goto %2}"; break;
> 
> And now it does generate rN for branch n compare
> 
> verify_sleepable_user_copy_str:
>      r1 = 4
>      call    bpf_copy_from_user_str
>      if r0 == 4 goto .L5
>      r0 = 0
>      exit
> .L5:
>      r1 = r0
>      call    bpf_copy_from_user_str
>      w0 ^= 4
>      w0 = w0
>      r0 += -1
>      r0 >>= 63
>      exit
> 
> But this is besides the point, just a curiosity, since I'm pretty sure 
> verifier will hit way more false positives with rN regs in branch and 
> compare.
> 
> 
> Full attach probe log (w/o the above hack) and code to generate it on my 
> gh portal [1]
> [1] git@github.com:vineetgarc/gcc.git  #bpf/promotion-n-abi
> 
> 
> Thx,
> -Vineet
> 
> --- #12      attach_probe:
> libbpf: elf: skipping unrecognized data section(17) .comment
> test_attach_probe:PASS:skel_open 0 nsec
> libbpf: prog 'handle_uprobe_byname3_sleepable': BPF program load failed: -EACCES
> libbpf: prog 'handle_uprobe_byname3_sleepable': -- BEGIN PROG LOAD LOG --
> 0: R1=ctx() R10=fp0
> ;  @ vmlinux.h:49637
> 0: (b7) r2 = 9                        ; R2=9
> ;  @ test_attach_probe.c:92
> 1: (18) r6 = 0xffa00000017cd000       ; R6=map_value(map=test_att.bss,ks=4,vs=56)
> 3: (bf) r1 = r10                      ; R1=fp0 R10=fp0
> 4: (79) r3 = *(u64 *)(r6 +0)          ; R3=scalar() R6=map_value(map=test_att.bss,ks=4,vs=56)
> 5: (07) r1 += -24                     ; R1=fp-24
> 6: (85) call bpf_copy_from_user#148   ; R0=scalar() fp-16=???????m fp-24=mmmmmmmm
> ;  @ test_attach_probe.c:93
> 7: (bf) r1 = r10                      ; R1=fp0 R10=fp0
> 8: (18) r3 = 0xff11000113510930       ; R3=map_value(map=.rodata.str1.1,ks=4,vs=10)
> 10: (b7) r2 = 9                       ; R2=9
> 11: (07) r1 += -24                    ; R1=fp-24
> 12: (85) call bpf_strncmp#182         ; R0=scalar()
> ;  @ test_attach_probe.c:147
> 13: (55) if r0 != 0x0 goto pc+3       ; R0=0
> ;  @ test_attach_probe.c:148
> 14: (18) r0 = 0xffa00000017cd01c      ; R0=map_value(map=test_att.bss,ks=4,vs=56,imm=28)
> 16: (62) *(u32 *)(r0 +0) = 9          ; R0=map_value(map=test_att.bss,ks=4,vs=56,imm=28)
> ;  @ test_attach_probe.c:105
> 17: (79) r3 = *(u64 *)(r6 +0)         ; R3=scalar() R6=map_value(map=test_att.bss,ks=4,vs=56)
> 18: (b7) r2 = 4                       ; R2=4
> 19: (b7) r4 = 0                       ; R4=0
> 20: (bf) r1 = r10                     ; R1=fp0 R10=fp0
> 21: (07) r1 += -88                    ; R1=fp-88
> 22: (85) call bpf_copy_from_user_str#85440    ; R0=scalar()
> ;  @ test_attach_probe.c:107
> 23: (bf) r1 = r10                     ; R1=fp0 R10=fp0
> ;  @ test_attach_probe.c:105
> 24: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
> ;  @ test_attach_probe.c:107
> 25: (18) r3 = 0xff11000113514d30      ; R3=map_value(map=test_att.rodata,ks=4,vs=16)
> 27: (b7) r2 = 4                       ; R2=4
> 28: (07) r1 += -88                    ; R1=fp-88
> 29: (85) call bpf_strncmp#182         ; R0=scalar()
> 30: (55) if r0 != 0x0 goto pc+1       ; R0=0
> 31: (16) if w7 == 0x4 goto pc+2 34: R0=0 R6=map_value(map=test_att.bss,ks=4,vs=56) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R10=fp0 fp-16=???????m fp-24=mmmmmmmm fp-88=????mmmm
> ;  @ test_attach_probe.c:110
> 34: (b7) r4 = 1                       ; R4=1
> 35: (79) r3 = *(u64 *)(r6 +0)         ; R3=scalar() R6=map_value(map=test_att.bss,ks=4,vs=56)
> 36: (bf) r2 = r7                      ; R2=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000))
> 37: (bf) r1 = r10                     ; R1=fp0 R10=fp0
> 38: (07) r1 += -80                    ; R1=fp-80
> 39: (85) call bpf_copy_from_user_str#85440
> R2 min value is negative, either use unsigned or 'var &= const'
> arg#0 arg#1 memory, len pair leads to invalid memory access
> processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 0
> -- END PROG LOAD LOG --
> libbpf: prog 'handle_uprobe_byname3_sleepable': failed to load: -EACCES
> libbpf: failed to load object 'test_attach_probe'
> libbpf: failed to load BPF skeleton 'test_attach_probe': -EACCES
> test_attach_probe:FAIL:skel_load unexpected error: -13 (errno 13)
> test_attach_probe:PASS:uprobe_ref_ctr_cleanup 0 nsec
> #12      attach_probe:FAIL
> 
> 
>
  
Vineet Gupta April 2, 2026, 8:44 p.m. UTC | #12
On 4/2/26 10:02 AM, David Faust wrote:
>>> And again, what precisely is problematic about generating rN = rM
>>> for this set?
>> Verifier trips up, #12 attach _probe
>> (tools/testing/selftests/bpf/progs/test_attach_probe.c)
>>
>>> If %r1 is subsequently used as the int argument to a function call,
>>> then it is valid as is.
>>> If it would be passed as a 'unsigned long' or something else requiring
>>> a conversion then we would rather have e.g. a zero_extend:DI to do that,
>>> not a simple set.
>> The rN based move copies the full 64-bits, but the subsequent arg is
>> int: verifier wants to make sure top bits are cleared, which is not
>> guaranteed given the current codegen.
> Ahhhhhh, so it is a verifier issue! Not a code correctness issue.
> OK, that makes more sense.

Yeah and my bad for not calling it in the changelog, but it was a 
logistics issue as I'd seperated it out of ABI change but needed ABI 
changes for the exact context.

>> Full log below, relevant snippets extracted:
>>
>> 22: (85) call bpf_copy_from_user_str#85440    ; R0=scalar()
>> ...
>> 24: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
>> ...
>> 29: (85) call bpf_strncmp#182         ; R0=scalar()
>> 30: (55) if r0 != 0x0 goto pc+1       ; R0=0
>> 31: (16) if w7 == 0x4 goto pc+2 34: R0=0 R6=map_value(map=test_att.bss,ks=4,vs=56) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R10=fp0 fp-16=???????m fp-24=mmmmmmmm fp-88=????mmmm
>> ...
>>
>> 36: (bf) r2 = r7                      ; R2=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000))
>> ...
>> 39: (85) call bpf_copy_from_user_str#85440
>> R2 min value is negative, either use unsigned or 'var &= const'
>>
>> Now, thinking about it, part of the problem could also be #31 using w7
>> reg for comparison. If it had used the rN reg too, maybe verifier would
>> be confident about the top bits too and won't trip up. But that again
>> would point to impedance mismatch between movsi (only using rN regs) and
>> branch_on_si, which would use wN regs.
> Hmm, looking at these verifier logs (thanks for providing), I think
> the issue may stem from the bpf_copy_from_user_str call.
> Ultimately the return value propagates r0 -> r7 -> r2.

Exactly.

> BUT, I am confused.  This test passes on my system compiled with
> gcc trunk.  After the call, we emit a movs32 when moving the return value
> to a different register, which is enough for the verifier to know the
> value is restricted to signed integer range.
>
> Are you seeing this with trunk or with the PROMOTE_MODE or some other
> change applied?

Yeah it's specifically with PROMOTE_MODE part of the ABI change, w/o 
touching target hooks.

Thx,
-Vineet
  
David Faust April 3, 2026, 4:09 p.m. UTC | #13
On 4/2/26 13:44, Vineet Gupta wrote:
> On 4/2/26 10:02 AM, David Faust wrote:
>>>> And again, what precisely is problematic about generating rN = rM
>>>> for this set?
>>> Verifier trips up, #12 attach _probe
>>> (tools/testing/selftests/bpf/progs/test_attach_probe.c)
>>>
>>>> If %r1 is subsequently used as the int argument to a function call,
>>>> then it is valid as is.
>>>> If it would be passed as a 'unsigned long' or something else requiring
>>>> a conversion then we would rather have e.g. a zero_extend:DI to do that,
>>>> not a simple set.
>>> The rN based move copies the full 64-bits, but the subsequent arg is
>>> int: verifier wants to make sure top bits are cleared, which is not
>>> guaranteed given the current codegen.
>> Ahhhhhh, so it is a verifier issue! Not a code correctness issue.
>> OK, that makes more sense.
> 
> Yeah and my bad for not calling it in the changelog, but it was a 
> logistics issue as I'd seperated it out of ABI change but needed ABI 
> changes for the exact context.
> 
>>> Full log below, relevant snippets extracted:
>>>
>>> 22: (85) call bpf_copy_from_user_str#85440    ; R0=scalar()
>>> ...
>>> 24: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
>>> ...
>>> 29: (85) call bpf_strncmp#182         ; R0=scalar()
>>> 30: (55) if r0 != 0x0 goto pc+1       ; R0=0
>>> 31: (16) if w7 == 0x4 goto pc+2 34: R0=0 R6=map_value(map=test_att.bss,ks=4,vs=56) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R10=fp0 fp-16=???????m fp-24=mmmmmmmm fp-88=????mmmm
>>> ...
>>>
>>> 36: (bf) r2 = r7                      ; R2=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000)) R7=scalar(id=1,smin=0x8000000000000004,smax=0x7fffffff00000004,umin=smin32=umin32=4,umax=0xffffffff00000004,smax32=umax32=4,var_off=(0x4; 0xffffffff00000000))
>>> ...
>>> 39: (85) call bpf_copy_from_user_str#85440
>>> R2 min value is negative, either use unsigned or 'var &= const'
>>>
>>> Now, thinking about it, part of the problem could also be #31 using w7
>>> reg for comparison. If it had used the rN reg too, maybe verifier would
>>> be confident about the top bits too and won't trip up. But that again
>>> would point to impedance mismatch between movsi (only using rN regs) and
>>> branch_on_si, which would use wN regs.
>> Hmm, looking at these verifier logs (thanks for providing), I think
>> the issue may stem from the bpf_copy_from_user_str call.
>> Ultimately the return value propagates r0 -> r7 -> r2.
> 
> Exactly.
> 
>> BUT, I am confused.  This test passes on my system compiled with
>> gcc trunk.  After the call, we emit a movs32 when moving the return value
>> to a different register, which is enough for the verifier to know the
>> value is restricted to signed integer range.
>>
>> Are you seeing this with trunk or with the PROMOTE_MODE or some other
>> change applied?
> 
> Yeah it's specifically with PROMOTE_MODE part of the ABI change, w/o 
> touching target hooks.

I see.

At this point it seems clear that there is no correctness issue with the
moves currently generated by gcc trunk.

Rather, the question is about the verifiability of the resulting code.
But verifier issues raised here also do not reproduce with gcc trunk.

Can you please send a reduced example showing precisely what code is
being rejected by the verifier (despite being semantically correct)
_as generated by gcc trunk_ ?

If these issues only appear with a patched gcc, then this patch to
address those issues should be part of a patchset including all other
changes which would necessitate this fix.

You are introducing a lot of confusion by sending a patch against
issues that do not seem to exist in trunk.

> 
> Thx,
> -Vineet
  
Vineet Gupta April 3, 2026, 4:28 p.m. UTC | #14
On 4/3/26 9:09 AM, David Faust wrote:
>>> BUT, I am confused.  This test passes on my system compiled with
>>> gcc trunk.  After the call, we emit a movs32 when moving the return value
>>> to a different register, which is enough for the verifier to know the
>>> value is restricted to signed integer range.
>>>
>>> Are you seeing this with trunk or with the PROMOTE_MODE or some other
>>> change applied?
>> Yeah it's specifically with PROMOTE_MODE part of the ABI change, w/o
>> touching target hooks.
> I see.
>
> At this point it seems clear that there is no correctness issue with the
> moves currently generated by gcc trunk.
>
> Rather, the question is about the verifiability of the resulting code.
> But verifier issues raised here also do not reproduce with gcc trunk.
>
> Can you please send a reduced example showing precisely what code is
> being rejected by the verifier (despite being semantically correct)
> _as generated by gcc trunk_ ?
>
> If these issues only appear with a patched gcc, then this patch to
> address those issues should be part of a patchset including all other
> changes which would necessitate this fix.

Yeah that's what I concluded as well on this thread and IRC.

> You are introducing a lot of confusion by sending a patch against
> issues that do not seem to exist in trunk.

Yep.

-Vineet
  

Patch

diff --git a/gcc/config/bpf/bpf.md b/gcc/config/bpf/bpf.md
index a2bceb8998d7..e6776a94e7e4 100644
--- a/gcc/config/bpf/bpf.md
+++ b/gcc/config/bpf/bpf.md
@@ -383,11 +383,11 @@ 
         (match_operand:MM 1 "mov_src_operand"      " q,rIc,BC,r,I"))]
   ""
   "@
-   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%0 = *(<smop> *) %1}\");
-   *return bpf_output_move (operands, \"{mov\t%0,%1|%0 = %1}\");
-   *return bpf_output_move (operands, \"{lddw\t%0,%1|%0 = %1 ll}\");
-   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %0 = %1}\");
-   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %0 = %1}\");"
+   *return bpf_output_move (operands, \"{ldx<mop>\t%0,%1|%w0 = *(<smop> *) %w1}\");
+   *return bpf_output_move (operands, \"{mov\t%0,%1|%w0 = %w1}\");
+   *return bpf_output_move (operands, \"{lddw\t%0,%1|%w0 = %w1 ll}\");
+   *return bpf_output_move (operands, \"{stx<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");
+   *return bpf_output_move (operands, \"{st<mop>\t%0,%1|*(<smop> *) %w0 = %w1}\");"
 [(set_attr "type" "ldx,alu,alu,stx,st")])
 
 ;;;; Shifts
diff --git a/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c b/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
new file mode 100644
index 000000000000..6d0a4f280cd7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/bpf/ret-reuse-arg-1.c
@@ -0,0 +1,14 @@ 
+/* Return value of first call is arg to second call.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=v4" } */
+
+int ret_int ();
+void arg_int (int);
+
+void foo () {
+   arg_int(ret_int ());
+}
+
+/* { dg-final { scan-assembler-not {r1 = r0} } } */
+/* { dg-final { scan-assembler-times {w1 = w0} 1 } } */