[0/7] ira/lra: Support subreg coalesce

Message ID	20231108034740.834590-1-lehua.ding@rivai.ai
Headers	DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D79543858D33 From: Lehua Ding <lehua.ding@rivai.ai> To: gcc-patches@gcc.gnu.org Cc: vmakarov@redhat.com, richard.sandiford@arm.com, juzhe.zhong@rivai.ai, lehua.ding@rivai.ai Subject: [PATCH 0/7] ira/lra: Support subreg coalesce Date: Wed, 8 Nov 2023 11:47:33 +0800 Message-Id: <20231108034740.834590-1-lehua.ding@rivai.ai> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Feedback-ID: bizesmtp:rivai.ai:qybglogicsvrgz:qybglogicsvrgz6a-0 Precedence: list Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
Series	ira/lra: Support subreg coalesce \| [0/7] ira/lra: Support subreg coalesce [1/7] ira: Refactor the handling of register conflicts to make it more general [2/7] ira: Add live_subreg problem and apply to ira pass [3/7] ira: Support subreg live range track [4/7] ira: Support subreg copy [5/7] ira: Add all nregs >= 2 pseudos to tracke subreg list [6/7] lra: Apply live_subreg df_problem to lra pass [7/7] lra: Support subreg live range track and conflict detect

Message ID

20231108034740.834590-1-lehua.ding@rivai.ai

Headers

DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D79543858D33
From: Lehua Ding <lehua.ding@rivai.ai>
To: gcc-patches@gcc.gnu.org
Cc: vmakarov@redhat.com, richard.sandiford@arm.com, juzhe.zhong@rivai.ai,
 lehua.ding@rivai.ai
Subject: [PATCH 0/7]  ira/lra: Support subreg coalesce
Date: Wed,  8 Nov 2023 11:47:33 +0800
Message-Id: <20231108034740.834590-1-lehua.ding@rivai.ai>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Feedback-ID: bizesmtp:rivai.ai:qybglogicsvrgz:qybglogicsvrgz6a-0
Precedence: list
Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org

Series

ira/lra: Support subreg coalesce |

Message

Lehua Ding Nov. 8, 2023, 3:47 a.m. UTC

  Hi,

These patchs try to support subreg coalesce feature in
register allocation passes (ira and lra).

Let's consider a RISC-V program (https://godbolt.org/z/ec51d91aT):

```
#include <riscv_vector.h>

void
foo (int32_t *in, int32_t *out, size_t m)
{
  vint32m2_t result = __riscv_vle32_v_i32m2 (in, 32);
  vint32m1_t v0 = __riscv_vget_v_i32m2_i32m1 (result, 0);
  vint32m1_t v1 = __riscv_vget_v_i32m2_i32m1 (result, 1);
  for (size_t i = 0; i < m; i++)
    {
      v0 = __riscv_vadd_vv_i32m1(v0, v0, 4);
      v1 = __riscv_vmul_vv_i32m1(v1, v1, 4);
    }
  *(vint32m1_t*)(out+4*0) = v0;
  *(vint32m1_t*)(out+4*1) = v1;
}
```

Before these patchs:

```
foo:
	li	a5,32
	vsetvli	zero,a5,e32,m2,ta,ma
	vle32.v	v4,0(a0)
	vmv1r.v	v2,v4
	vmv1r.v	v1,v5
	beq	a2,zero,.L2
	li	a5,0
	vsetivli	zero,4,e32,m1,ta,ma
.L3:
	addi	a5,a5,1
	vadd.vv	v2,v2,v2
	vmul.vv	v1,v1,v1
	bne	a2,a5,.L3
.L2:
	vs1r.v	v2,0(a1)
	addi	a1,a1,16
	vs1r.v	v1,0(a1)
	ret
```

After these patchs:

```
foo:
	li	a5,32
	vsetvli	zero,a5,e32,m2,ta,ma
	vle32.v	v2,0(a0)
	beq	a2,zero,.L2
	li	a5,0
	vsetivli	zero,4,e32,m1,ta,ma
.L3:
	addi	a5,a5,1
	vadd.vv	v2,v2,v2
	vmul.vv	v3,v3,v3
	bne	a2,a5,.L3
.L2:
	vs1r.v	v2,0(a1)
	addi	a1,a1,16
	vs1r.v	v3,0(a1)
	ret
```

As you can see, the two redundant vmv1r.v instructions were removed.
The reason for the two redundant vmv1r.v instructions is because
the current ira pass is being conservative in calculating the live
range of pseduo registers that occupy multil hardregs. As in the
following two RTL instructions. Where r134 occupies two physical
registers and r135 and r136 occupy one physical register.
At insn 12 point, ira considers the entire r134 pseudo register
to be live, so r135 is in conflict with r134, as shown in the ira
dump info. Then when the physical registers are allocated, r135 and
r134 are allocated first because they are inside the loop body and
have higher priority. This makes it difficult to assign r136 to
overlap with r134, i.e., to assign r136 to hr100, thus eliminating
the need for the vmv1r.v instruction. Thus two vmv1r.v instructions
appear.

If we refine the live information of r134 to the case of each subreg,
we can remove this conflict. We can then create copies of the set
with subreg reference, thus increasing the priority of the r134 allocation,
which allow registers with bigger alignment requirements to prioritize
the allocation of physical registers. In RVV, pseudo registers occupying
two physical registers need to be time-2 aligned.

```
(insn 11 10 12 2 (set (reg/v:RVVM1SI 135 [ v0 ])
        (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) 0)) "/app/example.c":7:19 998 {*movrvvm1si_whole}
     (nil))
(insn 12 11 13 2 (set (reg/v:RVVM1SI 136 [ v1 ])
        (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) [16, 16])) "/app/example.c":8:19 998 {*movrvvm1si_whole}
     (expr_list:REG_DEAD (reg/v:RVVM2SI 134 [ result ])
        (nil)))
```

ira dump:

;; a1(r136,l0) conflicts: a3(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a3(r135,l0) conflicts: a1(r136,l0) a6(r134,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a6(r134,l0) conflicts: a3(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;;
;; ...
      Popping a1(r135,l0)  --         assign reg 97
      Popping a3(r136,l0)  --         assign reg 98
      Popping a4(r137,l0)  --         assign reg 15
      Popping a5(r140,l0)  --         assign reg 12
      Popping a10(r145,l0)  --         assign reg 12
      Popping a2(r139,l0)  --         assign reg 11
      Popping a9(r144,l0)  --         assign reg 11
      Popping a0(r142,l0)  --         assign reg 11
      Popping a6(r134,l0)  --         assign reg 100
      Popping a7(r143,l0)  --         assign reg 10
      Popping a8(r141,l0)  --         assign reg 15

The AArch64 SVE has the same problem. Consider the following
code (https://godbolt.org/z/MYrK7Ghaj):

```
#include <arm_sve.h>

int bar (svbool_t pg, int64_t* base, int n, int64_t *in1, int64_t *in2, int64_t*out)
{
  svint64x4_t result = svld4_s64 (pg, base);
  svint64_t v0 = svget4_s64(result, 0);
  svint64_t v1 = svget4_s64(result, 1);
  svint64_t v2 = svget4_s64(result, 2);
  svint64_t v3 = svget4_s64(result, 3);

  for (int i = 0; i < n; i += 1)
    {
        svint64_t v18 = svld1_s64(pg, in1);
        svint64_t v19 = svld1_s64(pg, in2);
        v0 = svmad_s64_z(pg, v0, v18, v19);
        v1 = svmad_s64_z(pg, v1, v18, v19);
        v2 = svmad_s64_z(pg, v2, v18, v19);
        v3 = svmad_s64_z(pg, v3, v18, v19);
    }
  svst1_s64(pg, out+0,v0);
  svst1_s64(pg, out+1,v1);
  svst1_s64(pg, out+2,v2);
  svst1_s64(pg, out+3,v3);
}
```

Before these patchs:

```
bar:
	ld4d	{z4.d - z7.d}, p0/z, [x0]
	mov	z26.d, z4.d
	mov	z27.d, z5.d
	mov	z28.d, z6.d
	mov	z29.d, z7.d
	cmp	w1, 0
	...
```

After these patchs:

```
bar:
	ld4d	{z28.d - z31.d}, p0/z, [x0]
	cmp	w1, 0
	...
```

Lehua Ding (7):
  ira: Refactor the handling of register conflicts to make it more
    general
  ira: Add live_subreg problem and apply to ira pass
  ira: Support subreg live range track
  ira: Support subreg copy
  ira: Add all nregs >= 2 pseudos to tracke subreg list
  lra: Apply live_subreg df_problem to lra pass
  lra: Support subreg live range track and conflict detect

 gcc/Makefile.in          |   1 +
 gcc/df-problems.cc       | 889 ++++++++++++++++++++++++++++++++++++++-
 gcc/df.h                 |  93 +++-
 gcc/hard-reg-set.h       |  33 ++
 gcc/ira-build.cc         | 458 ++++++++++++++++----
 gcc/ira-color.cc         | 851 ++++++++++++++++++++++++++-----------
 gcc/ira-conflicts.cc     | 221 +++++++---
 gcc/ira-emit.cc          |  24 +-
 gcc/ira-int.h            |  67 ++-
 gcc/ira-lives.cc         | 527 +++++++++++++++++------
 gcc/ira.cc               |  77 ++--
 gcc/lra-assigns.cc       | 111 ++++-
 gcc/lra-coalesce.cc      |  20 +-
 gcc/lra-constraints.cc   | 111 +++--
 gcc/lra-int.h            |  33 ++
 gcc/lra-lives.cc         | 661 ++++++++++++++++++++++++-----
 gcc/lra-remat.cc         |  13 +-
 gcc/lra-spills.cc        |  22 +-
 gcc/lra.cc               | 139 +++++-
 gcc/reginfo.cc           |  14 +
 gcc/rtl.h                |  14 +
 gcc/subreg-live-range.cc | 649 ++++++++++++++++++++++++++++
 gcc/subreg-live-range.h  | 343 +++++++++++++++
 gcc/timevar.def          |   1 +
 24 files changed, 4564 insertions(+), 808 deletions(-)
 create mode 100644 gcc/subreg-live-range.cc
 create mode 100644 gcc/subreg-live-range.h

Comments

juzhe.zhong@rivai.ai Nov. 8, 2023, 3:55 a.m. UTC | #1

Thanks Lehua.

Appreciate for supporting subreg liveness tracking with tons of work.

A nit comments, I think you should mention these following PRs:

106694
89967
106146
99161 

No need send V2 now. You can send V2 after Richard and Vlad reviewed.



juzhe.zhong@rivai.ai
 
From: Lehua Ding
Date: 2023-11-08 11:47
To: gcc-patches
CC: vmakarov; richard.sandiford; juzhe.zhong; lehua.ding
Subject: [PATCH 0/7] ira/lra: Support subreg coalesce
Hi,
 
These patchs try to support subreg coalesce feature in
register allocation passes (ira and lra).
 
Let's consider a RISC-V program (https://godbolt.org/z/ec51d91aT):
 
```
#include <riscv_vector.h>
 
void
foo (int32_t *in, int32_t *out, size_t m)
{
  vint32m2_t result = __riscv_vle32_v_i32m2 (in, 32);
  vint32m1_t v0 = __riscv_vget_v_i32m2_i32m1 (result, 0);
  vint32m1_t v1 = __riscv_vget_v_i32m2_i32m1 (result, 1);
  for (size_t i = 0; i < m; i++)
    {
      v0 = __riscv_vadd_vv_i32m1(v0, v0, 4);
      v1 = __riscv_vmul_vv_i32m1(v1, v1, 4);
    }
  *(vint32m1_t*)(out+4*0) = v0;
  *(vint32m1_t*)(out+4*1) = v1;
}
```
 
Before these patchs:
 
```
foo:
li a5,32
vsetvli zero,a5,e32,m2,ta,ma
vle32.v v4,0(a0)
vmv1r.v v2,v4
vmv1r.v v1,v5
beq a2,zero,.L2
li a5,0
vsetivli zero,4,e32,m1,ta,ma
.L3:
addi a5,a5,1
vadd.vv v2,v2,v2
vmul.vv v1,v1,v1
bne a2,a5,.L3
.L2:
vs1r.v v2,0(a1)
addi a1,a1,16
vs1r.v v1,0(a1)
ret
```
 
After these patchs:
 
```
foo:
li a5,32
vsetvli zero,a5,e32,m2,ta,ma
vle32.v v2,0(a0)
beq a2,zero,.L2
li a5,0
vsetivli zero,4,e32,m1,ta,ma
.L3:
addi a5,a5,1
vadd.vv v2,v2,v2
vmul.vv v3,v3,v3
bne a2,a5,.L3
.L2:
vs1r.v v2,0(a1)
addi a1,a1,16
vs1r.v v3,0(a1)
ret
```
 
As you can see, the two redundant vmv1r.v instructions were removed.
The reason for the two redundant vmv1r.v instructions is because
the current ira pass is being conservative in calculating the live
range of pseduo registers that occupy multil hardregs. As in the
following two RTL instructions. Where r134 occupies two physical
registers and r135 and r136 occupy one physical register.
At insn 12 point, ira considers the entire r134 pseudo register
to be live, so r135 is in conflict with r134, as shown in the ira
dump info. Then when the physical registers are allocated, r135 and
r134 are allocated first because they are inside the loop body and
have higher priority. This makes it difficult to assign r136 to
overlap with r134, i.e., to assign r136 to hr100, thus eliminating
the need for the vmv1r.v instruction. Thus two vmv1r.v instructions
appear.
 
If we refine the live information of r134 to the case of each subreg,
we can remove this conflict. We can then create copies of the set
with subreg reference, thus increasing the priority of the r134 allocation,
which allow registers with bigger alignment requirements to prioritize
the allocation of physical registers. In RVV, pseudo registers occupying
two physical registers need to be time-2 aligned.
 
```
(insn 11 10 12 2 (set (reg/v:RVVM1SI 135 [ v0 ])
        (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) 0)) "/app/example.c":7:19 998 {*movrvvm1si_whole}
     (nil))
(insn 12 11 13 2 (set (reg/v:RVVM1SI 136 [ v1 ])
        (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) [16, 16])) "/app/example.c":8:19 998 {*movrvvm1si_whole}
     (expr_list:REG_DEAD (reg/v:RVVM2SI 134 [ result ])
        (nil)))
```
 
ira dump:
 
;; a1(r136,l0) conflicts: a3(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a3(r135,l0) conflicts: a1(r136,l0) a6(r134,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;; a6(r134,l0) conflicts: a3(r135,l0)
;;     total conflict hard regs:
;;     conflict hard regs:
;;
;; ...
      Popping a1(r135,l0)  --         assign reg 97
      Popping a3(r136,l0)  --         assign reg 98
      Popping a4(r137,l0)  --         assign reg 15
      Popping a5(r140,l0)  --         assign reg 12
      Popping a10(r145,l0)  --         assign reg 12
      Popping a2(r139,l0)  --         assign reg 11
      Popping a9(r144,l0)  --         assign reg 11
      Popping a0(r142,l0)  --         assign reg 11
      Popping a6(r134,l0)  --         assign reg 100
      Popping a7(r143,l0)  --         assign reg 10
      Popping a8(r141,l0)  --         assign reg 15
 
The AArch64 SVE has the same problem. Consider the following
code (https://godbolt.org/z/MYrK7Ghaj):
 
```
#include <arm_sve.h>
 
int bar (svbool_t pg, int64_t* base, int n, int64_t *in1, int64_t *in2, int64_t*out)
{
  svint64x4_t result = svld4_s64 (pg, base);
  svint64_t v0 = svget4_s64(result, 0);
  svint64_t v1 = svget4_s64(result, 1);
  svint64_t v2 = svget4_s64(result, 2);
  svint64_t v3 = svget4_s64(result, 3);
 
  for (int i = 0; i < n; i += 1)
    {
        svint64_t v18 = svld1_s64(pg, in1);
        svint64_t v19 = svld1_s64(pg, in2);
        v0 = svmad_s64_z(pg, v0, v18, v19);
        v1 = svmad_s64_z(pg, v1, v18, v19);
        v2 = svmad_s64_z(pg, v2, v18, v19);
        v3 = svmad_s64_z(pg, v3, v18, v19);
    }
  svst1_s64(pg, out+0,v0);
  svst1_s64(pg, out+1,v1);
  svst1_s64(pg, out+2,v2);
  svst1_s64(pg, out+3,v3);
}
```
 
Before these patchs:
 
```
bar:
ld4d {z4.d - z7.d}, p0/z, [x0]
mov z26.d, z4.d
mov z27.d, z5.d
mov z28.d, z6.d
mov z29.d, z7.d
cmp w1, 0
...
```
 
After these patchs:
 
```
bar:
ld4d {z28.d - z31.d}, p0/z, [x0]
cmp w1, 0
...
```
 
Lehua Ding (7):
  ira: Refactor the handling of register conflicts to make it more
    general
  ira: Add live_subreg problem and apply to ira pass
  ira: Support subreg live range track
  ira: Support subreg copy
  ira: Add all nregs >= 2 pseudos to tracke subreg list
  lra: Apply live_subreg df_problem to lra pass
  lra: Support subreg live range track and conflict detect
 
gcc/Makefile.in          |   1 +
gcc/df-problems.cc       | 889 ++++++++++++++++++++++++++++++++++++++-
gcc/df.h                 |  93 +++-
gcc/hard-reg-set.h       |  33 ++
gcc/ira-build.cc         | 458 ++++++++++++++++----
gcc/ira-color.cc         | 851 ++++++++++++++++++++++++++-----------
gcc/ira-conflicts.cc     | 221 +++++++---
gcc/ira-emit.cc          |  24 +-
gcc/ira-int.h            |  67 ++-
gcc/ira-lives.cc         | 527 +++++++++++++++++------
gcc/ira.cc               |  77 ++--
gcc/lra-assigns.cc       | 111 ++++-
gcc/lra-coalesce.cc      |  20 +-
gcc/lra-constraints.cc   | 111 +++--
gcc/lra-int.h            |  33 ++
gcc/lra-lives.cc         | 661 ++++++++++++++++++++++++-----
gcc/lra-remat.cc         |  13 +-
gcc/lra-spills.cc        |  22 +-
gcc/lra.cc               | 139 +++++-
gcc/reginfo.cc           |  14 +
gcc/rtl.h                |  14 +
gcc/subreg-live-range.cc | 649 ++++++++++++++++++++++++++++
gcc/subreg-live-range.h  | 343 +++++++++++++++
gcc/timevar.def          |   1 +
24 files changed, 4564 insertions(+), 808 deletions(-)
create mode 100644 gcc/subreg-live-range.cc
create mode 100644 gcc/subreg-live-range.h
 
-- 
2.36.3

Richard Sandiford Nov. 8, 2023, 9:40 a.m. UTC | #2

Lehua Ding <lehua.ding@rivai.ai> writes:
> Hi,
>
> These patchs try to support subreg coalesce feature in
> register allocation passes (ira and lra).

Thanks a lot for the series.  This is definitely something we've
needed for a while.

I probably won't be able to look at it in detail for a couple of weeks
(and the real review should come from Vlad anyway), but one initial
comment:

Tracking subreg liveness will sometimes expose dead code that
wasn't obvious without it.  PR89606 has an example of this.
There the dead code was introduced by init-regs, and there's a
debate about (a) whether init-regs should still be run and (b) if it
should still be run, whether it should use subreg liveness tracking too.

But I think such dead code is possible even without init-regs.
So for the purpose of this series, I think the init-regs behaviour
in that PR creates a helpful example.

I agree with Richi of course that compile-time is a concern.
The patch seems to add quite a bit of new data to ira_allocno,
but perhaps that's OK.  ira_object + ira_allocno is already quite big.

However:

@@ -387,8 +398,8 @@ struct ira_allocno
   /* An array of structures describing conflict information and live
      ranges for each object associated with the allocno.  There may be
      more than one such object in cases where the allocno represents a
-     multi-word register.  */
-  ira_object_t objects[2];
+     multi-hardreg pesudo.  */
+  std::vector<ira_object_t> objects;
   /* Registers clobbered by intersected calls.  */
    HARD_REG_SET crossed_calls_clobbered_regs;
   /* Array of usage costs (accumulated and the one updated during

adds an extra level of indirection (and separate extra storage) for
every allocno, not just multi-hardreg ones.  It'd be worth optimising
the data structures' representation of single-hardreg pseudos even if
that slows down the multi-hardreg code, since single-hardreg pseudos are
so much more common.  And the different single-hardreg and multi-hardreg
representations could be hidden behind accessors, to make life easier
for consumers.  (Of course, performance of the accessors is also then
an issue. :))

Richard

Dimitar Dimitrov Nov. 8, 2023, 4:56 p.m. UTC | #3

On Wed, Nov 08, 2023 at 11:47:33AM +0800, Lehua Ding wrote:
> Hi,
> 
> These patchs try to support subreg coalesce feature in
> register allocation passes (ira and lra).

Hi Lehua,

This patch set breaks the build for at least three embedded targets. See
below.

For avr the GCC build fails with:
/mnt/nvme/dinux/local-workspace/gcc/gcc/ira-lives.cc:149:39: error: call of overloaded ‘set_subreg_conflict_hard_regs(ira_allocno*&, int&)’ is ambiguous
  149 |         set_subreg_conflict_hard_regs (OBJECT_ALLOCNO (obj), regno);


For arm-none-eabi the newlib build fails with:
/mnt/nvme/dinux/local-workspace/newlib/newlib/libm/math/e_jn.c:279:1: internal compiler error: Floating point exception
  279 | }
      | ^
0x1176e0f crash_signal
        /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
0xf6008d get_range_hard_regs(int, subreg_range const&)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
0xf6008d get_range_hard_regs(int, subreg_range const&)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:601
0xf60312 new_insn_reg
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
0xf6064d add_regs_to_insn_regno_info
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1623
0xf62909 lra_update_insn_regno_info(rtx_insn*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
0xf62e46 lra_update_insn_regno_info(rtx_insn*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1762
0xf62e46 lra_push_insn_1
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
0xf62f2d lra_push_insn(rtx_insn*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
0xf62f2d push_insns
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
0xf63302 push_insns
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1966
0xf63302 lra(_IO_FILE*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
0xf0e399 do_reload 
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
0xf0e399 execute
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148


For pru-elf the GCC build fails with:
/mnt/nvme/dinux/local-workspace/gcc/libgcc/unwind-dw2-fde.c: In function 'linear_search_fdes':
/mnt/nvme/dinux/local-workspace/gcc/libgcc/unwind-dw2-fde.c:1035:1: internal compiler error: Floating point exception
 1035 | }
      | ^
0x1694f2e crash_signal
        /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
0x1313178 get_range_hard_regs(int, subreg_range const&)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
0x131343a new_insn_reg
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
0x13174f0 add_regs_to_insn_regno_info
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1608
0x1318479 lra_update_insn_regno_info(rtx_insn*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
0x13196ab lra_push_insn_1
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
0x13196de lra_push_insn(rtx_insn*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
0x13197da push_insns
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
0x131b6dc lra(_IO_FILE*)
        /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
0x129f237 do_reload
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
0x129f6c6 execute
        /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148


The divide by zero error above is interesting. I'm not sure why ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in the following rtx:
(debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [ encoding ])) -1
     (nil))

Regards,
Dimitar

Jeff Law Nov. 8, 2023, 7:13 p.m. UTC | #4

On 11/8/23 02:40, Richard Sandiford wrote:
> Lehua Ding <lehua.ding@rivai.ai> writes:
>> Hi,
>>
>> These patchs try to support subreg coalesce feature in
>> register allocation passes (ira and lra).
> 
> Thanks a lot for the series.  This is definitely something we've
> needed for a while.
> 
> I probably won't be able to look at it in detail for a couple of weeks
> (and the real review should come from Vlad anyway), but one initial
> comment:
Absolutely agreed on the above.

The other thing to ponder.  Jivan and I have been banging on Joern's 
sub-object tracking bits for a totally different problem in the RISC-V 
space.  But there may be some overlap.

Essentially Joern's code tracks liveness for a few chunks in registers. 
bits 0..7, bits 8..15, bits 16..31 and bits 32..63.  This includes 
propagating liveness from the destination through to the sources.  SO 
for example if we have

(set (reg:SI dest) (plus:SI (srcreg1:SI) (srcreg2:SI)))

If we had previously determined that only bits 0..15 were live in DEST, 
then we'll propagate that into the source registers.

The goal is to ultimately transform something like

(set (dest:mode) (any_extend:mode (reg:narrower_mode)))

into

(set (dest:mode) (subreg:mode (reg:narrower_mode)))

Where the latter typically will get simplified and propagated away.


Joern's code is a bit of a mess, but Jivan and I are slowly untangling 
it from a correctness standpoint.  It'll also need the usual cleanups.

Anyway, point being I think it'll be worth looking at Lehua's bits and 
Joern's bits to see if there's anything that can and should be shared. 
Given I'm getting fairly familiar with Joern's bits, that likely falls 
to me.

Jeff

> 
> Tracking subreg liveness will sometimes expose dead code that
> wasn't obvious without it.  PR89606 has an example of this.
> There the dead code was introduced by init-regs, and there's a
> debate about (a) whether init-regs should still be run and (b) if it
> should still be run, whether it should use subreg liveness tracking too.
> 
> But I think such dead code is possible even without init-regs.
> So for the purpose of this series, I think the init-regs behaviour
> in that PR creates a helpful example.
> 
> I agree with Richi of course that compile-time is a concern.
> The patch seems to add quite a bit of new data to ira_allocno,
> but perhaps that's OK.  ira_object + ira_allocno is already quite big.
> 
> However:
> 
> @@ -387,8 +398,8 @@ struct ira_allocno
>     /* An array of structures describing conflict information and live
>        ranges for each object associated with the allocno.  There may be
>        more than one such object in cases where the allocno represents a
> -     multi-word register.  */
> -  ira_object_t objects[2];
> +     multi-hardreg pesudo.  */
> +  std::vector<ira_object_t> objects;
>     /* Registers clobbered by intersected calls.  */
>      HARD_REG_SET crossed_calls_clobbered_regs;
>     /* Array of usage costs (accumulated and the one updated during
> 
> adds an extra level of indirection (and separate extra storage) for
> every allocno, not just multi-hardreg ones.  It'd be worth optimising
> the data structures' representation of single-hardreg pseudos even if
> that slows down the multi-hardreg code, since single-hardreg pseudos are
> so much more common.  And the different single-hardreg and multi-hardreg
> representations could be hidden behind accessors, to make life easier
> for consumers.  (Of course, performance of the accessors is also then
> an issue. :))
> 
> Richard

Vladimir Makarov Nov. 9, 2023, 8:24 p.m. UTC | #5

On 11/7/23 22:47, Lehua Ding wrote:
>
> Lehua Ding (7):
>    ira: Refactor the handling of register conflicts to make it more
>      general
>    ira: Add live_subreg problem and apply to ira pass
>    ira: Support subreg live range track
>    ira: Support subreg copy
>    ira: Add all nregs >= 2 pseudos to tracke subreg list
>    lra: Apply live_subreg df_problem to lra pass
>    lra: Support subreg live range track and conflict detect
>
Thank you very much for addressing subreg RA.  It is a big work.  I 
wanted to address this long time ago but have no time to do this by myself.

I tried to evaluate your patches on x86-64 (i7-9700k) release mode GCC.  
I used -O3 for SPEC2017 compilation.

Here are the results:

                baseline baseline(+patches)
specint2017:  8.51 vs 8.58 (+0.8%)
specfp2017:   21.1 vs 21.1 (+0%)
compile time: 2426.41s vs 2580.58s (+6.4%)

Spec2017 average code size change: -0.07%

Improving specint by 0.8% is impressive for me.

Unfortunately, it is achieved by decreasing compilation speed by 6.4% 
(although on smaller benchmark I saw only 3% slowdown). I don't know how 
but we should mitigate this speed degradation.  May be we can find a hot 
spot in the new code (but I think it is not a linear search pointed by 
Richard Biener as the object vectors most probably contain 1-2 elements) 
and this code spot can be improved, or we could use this only for 
-O3/fast, or the code can be function or target dependent.

I also find GCC consumes more memory with the patches. May be it can be 
improved too (although I am not sure about this).

I'll start to review the patches on the next week.  I don't expect that 
I'll find something serious to reject the patches but again we should 
work on mitigation of the compilation speed problem.  We can fill a new 
PR for this and resolve the problem during the release cycle.

Richard Biener Nov. 10, 2023, 7:59 a.m. UTC | #6

On Thu, Nov 9, 2023 at 9:25 PM Vladimir Makarov <vmakarov@redhat.com> wrote:
>
>
> On 11/7/23 22:47, Lehua Ding wrote:
> >
> > Lehua Ding (7):
> >    ira: Refactor the handling of register conflicts to make it more
> >      general
> >    ira: Add live_subreg problem and apply to ira pass
> >    ira: Support subreg live range track
> >    ira: Support subreg copy
> >    ira: Add all nregs >= 2 pseudos to tracke subreg list
> >    lra: Apply live_subreg df_problem to lra pass
> >    lra: Support subreg live range track and conflict detect
> >
> Thank you very much for addressing subreg RA.  It is a big work.  I
> wanted to address this long time ago but have no time to do this by myself.
>
> I tried to evaluate your patches on x86-64 (i7-9700k) release mode GCC.
> I used -O3 for SPEC2017 compilation.
>
> Here are the results:
>
>                 baseline baseline(+patches)
> specint2017:  8.51 vs 8.58 (+0.8%)
> specfp2017:   21.1 vs 21.1 (+0%)
> compile time: 2426.41s vs 2580.58s (+6.4%)
>
> Spec2017 average code size change: -0.07%
>
> Improving specint by 0.8% is impressive for me.
>
> Unfortunately, it is achieved by decreasing compilation speed by 6.4%
> (although on smaller benchmark I saw only 3% slowdown). I don't know how
> but we should mitigate this speed degradation.  May be we can find a hot
> spot in the new code (but I think it is not a linear search pointed by
> Richard Biener as the object vectors most probably contain 1-2 elements)
> and this code spot can be improved, or we could use this only for
> -O3/fast, or the code can be function or target dependent.
>
> I also find GCC consumes more memory with the patches. May be it can be
> improved too (although I am not sure about this).

Note I think it's important that this can be disabled by default for -O1
which we recommend when you feed GCC with large machine-generated
code which is also where I guess you'll find the effect is way worse.

That includes disabling the memory usage side-effect which I guess might
be hard given you grow generic data structures.

> I'll start to review the patches on the next week.  I don't expect that
> I'll find something serious to reject the patches but again we should
> work on mitigation of the compilation speed problem.  We can fill a new
> PR for this and resolve the problem during the release cycle.
>
>

Lehua Ding Nov. 10, 2023, 8:46 a.m. UTC | #7

Hi Dimitar,

Thanks for the tests.

> This patch set breaks the build for at least three embedded targets. See
> below.
> 
> For avr the GCC build fails with:
> /mnt/nvme/dinux/local-workspace/gcc/gcc/ira-lives.cc:149:39: error: call of overloaded ‘set_subreg_conflict_hard_regs(ira_allocno*&, int&)’ is ambiguous
>    149 |         set_subreg_conflict_hard_regs (OBJECT_ALLOCNO (obj), regno);

I think it's because `HARD_REG_SET` and `unsigned int` are of the same 
type on avr target(i.e. No more than 32 registers on avr target), so 
these two bellow function prototypes conflict, I'll adjust it.

static void
set_subreg_conflict_hard_regs (ira_allocno_t a, HARD_REG_SET regs)

static void
set_subreg_conflict_hard_regs (ira_allocno_t a, unsigned int regno)

> For arm-none-eabi the newlib build fails with:
> /mnt/nvme/dinux/local-workspace/newlib/newlib/libm/math/e_jn.c:279:1: internal compiler error: Floating point exception
>    279 | }
>        | ^
> 0x1176e0f crash_signal
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
> 0xf6008d get_range_hard_regs(int, subreg_range const&)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
> 0xf6008d get_range_hard_regs(int, subreg_range const&)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:601
> 0xf60312 new_insn_reg
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
> 0xf6064d add_regs_to_insn_regno_info
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1623
> 0xf62909 lra_update_insn_regno_info(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
> 0xf62e46 lra_update_insn_regno_info(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1762
> 0xf62e46 lra_push_insn_1
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
> 0xf62f2d lra_push_insn(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
> 0xf62f2d push_insns
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
> 0xf63302 push_insns
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1966
> 0xf63302 lra(_IO_FILE*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
> 0xf0e399 do_reload
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
> 0xf0e399 execute
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148
> 
> The divide by zero error above is interesting. I'm not sure why ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in the following rtx:
> (debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [ encoding ])) -1
>       (nil))

I just cross compiled an arm-none-eabi compiler and didn't encounter 
this error, can you give me a little more config info about build? For 
example, flags_for_target, etc. Thanks again.

Lehua Ding Nov. 10, 2023, 8:53 a.m. UTC | #8

>> The divide by zero error above is interesting. I'm not sure why 
>> ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in the 
>> following rtx:
>> (debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [ 
>> encoding ])) -1
>>       (nil))
> 
> I just cross compiled an arm-none-eabi compiler and didn't encounter 
> this error, can you give me a little more config info about build? For 
> example, flags_for_target, etc. Thanks again.
> 

Forgot, please also provide the version information of newlib code.

Lehua Ding Nov. 10, 2023, 9:26 a.m. UTC | #9

Hi Richard,

On 2023/11/8 17:40, Richard Sandiford wrote:
> Tracking subreg liveness will sometimes expose dead code that
> wasn't obvious without it.  PR89606 has an example of this.
> There the dead code was introduced by init-regs, and there's a
> debate about (a) whether init-regs should still be run and (b) if it
> should still be run, whether it should use subreg liveness tracking too.
> 
> But I think such dead code is possible even without init-regs.
> So for the purpose of this series, I think the init-regs behaviour
> in that PR creates a helpful example.

Yes, I think the init-regs should be enhanced to reduce unnecessary 
initialization. My previous internal patchs did this in a separate 
patch. Maybe I should split the live_subreg problem out of the second 
patch and not couple it with these patches. That way it can be reviewed 
separately.

> I agree with Richi of course that compile-time is a concern.
> The patch seems to add quite a bit of new data to ira_allocno,
> but perhaps that's OK.  ira_object + ira_allocno is already quite big.
> 
> However:
> 
> @@ -387,8 +398,8 @@ struct ira_allocno
>     /* An array of structures describing conflict information and live
>        ranges for each object associated with the allocno.  There may be
>        more than one such object in cases where the allocno represents a
> -     multi-word register.  */
> -  ira_object_t objects[2];
> +     multi-hardreg pesudo.  */
> +  std::vector<ira_object_t> objects;
>     /* Registers clobbered by intersected calls.  */
>      HARD_REG_SET crossed_calls_clobbered_regs;
>     /* Array of usage costs (accumulated and the one updated during
> 
> adds an extra level of indirection (and separate extra storage) for
> every allocno, not just multi-hardreg ones.  It'd be worth optimising
> the data structures' representation of single-hardreg pseudos even if
> that slows down the multi-hardreg code, since single-hardreg pseudos are
> so much more common.  And the different single-hardreg and multi-hardreg
> representations could be hidden behind accessors, to make life easier
> for consumers.  (Of course, performance of the accessors is also then
> an issue. :))

Okay, I'll try. Thank you so much.

Lehua Ding Nov. 10, 2023, 9:29 a.m. UTC | #10

On 2023/11/8 11:55, juzhe.zhong@rivai.ai wrote:
> Thanks Lehua.
> 
> Appreciate for supporting subreg liveness tracking with tons of work.
> 
> A nit comments, I think you should mention these following PRs:
> 
> 106694
> 89967
> 106146
> 99161
> 
> No need send V2 now. You can send V2 after Richard and Vlad reviewed.
Okay, thanks :)

Lehua Ding Nov. 10, 2023, 9:43 a.m. UTC | #11

Hi Jeff,

On 2023/11/9 3:13, Jeff Law wrote:
> The other thing to ponder.  Jivan and I have been banging on Joern's 
> sub-object tracking bits for a totally different problem in the RISC-V 
> space.  But there may be some overlap.
> 
> Essentially Joern's code tracks liveness for a few chunks in registers. 
> bits 0..7, bits 8..15, bits 16..31 and bits 32..63.  This includes 
> propagating liveness from the destination through to the sources.  SO 
> for example if we have
> 
> (set (reg:SI dest) (plus:SI (srcreg1:SI) (srcreg2:SI)))
> 
> If we had previously determined that only bits 0..15 were live in DEST, 
> then we'll propagate that into the source registers.
> 
> The goal is to ultimately transform something like
> 
> (set (dest:mode) (any_extend:mode (reg:narrower_mode)))
> 
> into
> 
> (set (dest:mode) (subreg:mode (reg:narrower_mode)))
> 
> Where the latter typically will get simplified and propagated away.
> 
> 
> Joern's code is a bit of a mess, but Jivan and I are slowly untangling 
> it from a correctness standpoint.  It'll also need the usual cleanups.
> 
> Anyway, point being I think it'll be worth looking at Lehua's bits and 
> Joern's bits to see if there's anything that can and should be shared. 
> Given I'm getting fairly familiar with Joern's bits, that likely falls 
> to me.

Maybe subreg live range track classes (in patch 2) could be shared. 
Including range's UNION, Diff, and other operations should be similar. 
I'll see if I'm going to extract a separate patch to review this part. 
What do you think?

Richard Sandiford Nov. 10, 2023, 10:16 a.m. UTC | #12

Lehua Ding <lehua.ding@rivai.ai> writes:
> Hi Richard,
>
> On 2023/11/8 17:40, Richard Sandiford wrote:
>> Tracking subreg liveness will sometimes expose dead code that
>> wasn't obvious without it.  PR89606 has an example of this.
>> There the dead code was introduced by init-regs, and there's a
>> debate about (a) whether init-regs should still be run and (b) if it
>> should still be run, whether it should use subreg liveness tracking too.
>> 
>> But I think such dead code is possible even without init-regs.
>> So for the purpose of this series, I think the init-regs behaviour
>> in that PR creates a helpful example.
>
> Yes, I think the init-regs should be enhanced to reduce unnecessary 
> initialization. My previous internal patchs did this in a separate 
> patch. Maybe I should split the live_subreg problem out of the second 
> patch and not couple it with these patches. That way it can be reviewed 
> separately.

But my point was that this kind of dead code is possible even without
init-regs.  So I think we should have something that removes the dead
code.  And we can try it on that PR (without changing init-regs).

Thanks,
Richard

>
>> I agree with Richi of course that compile-time is a concern.
>> The patch seems to add quite a bit of new data to ira_allocno,
>> but perhaps that's OK.  ira_object + ira_allocno is already quite big.
>> 
>> However:
>> 
>> @@ -387,8 +398,8 @@ struct ira_allocno
>>     /* An array of structures describing conflict information and live
>>        ranges for each object associated with the allocno.  There may be
>>        more than one such object in cases where the allocno represents a
>> -     multi-word register.  */
>> -  ira_object_t objects[2];
>> +     multi-hardreg pesudo.  */
>> +  std::vector<ira_object_t> objects;
>>     /* Registers clobbered by intersected calls.  */
>>      HARD_REG_SET crossed_calls_clobbered_regs;
>>     /* Array of usage costs (accumulated and the one updated during
>> 
>> adds an extra level of indirection (and separate extra storage) for
>> every allocno, not just multi-hardreg ones.  It'd be worth optimising
>> the data structures' representation of single-hardreg pseudos even if
>> that slows down the multi-hardreg code, since single-hardreg pseudos are
>> so much more common.  And the different single-hardreg and multi-hardreg
>> representations could be hidden behind accessors, to make life easier
>> for consumers.  (Of course, performance of the accessors is also then
>> an issue. :))
>
> Okay, I'll try. Thank you so much.

Lehua Ding Nov. 10, 2023, 10:30 a.m. UTC | #13

On 2023/11/10 18:16, Richard Sandiford wrote:
> Lehua Ding <lehua.ding@rivai.ai> writes:
>> Hi Richard,
>>
>> On 2023/11/8 17:40, Richard Sandiford wrote:
>>> Tracking subreg liveness will sometimes expose dead code that
>>> wasn't obvious without it.  PR89606 has an example of this.
>>> There the dead code was introduced by init-regs, and there's a
>>> debate about (a) whether init-regs should still be run and (b) if it
>>> should still be run, whether it should use subreg liveness tracking too.
>>>
>>> But I think such dead code is possible even without init-regs.
>>> So for the purpose of this series, I think the init-regs behaviour
>>> in that PR creates a helpful example.
>>
>> Yes, I think the init-regs should be enhanced to reduce unnecessary
>> initialization. My previous internal patchs did this in a separate
>> patch. Maybe I should split the live_subreg problem out of the second
>> patch and not couple it with these patches. That way it can be reviewed
>> separately.
> 
> But my point was that this kind of dead code is possible even without
> init-regs.  So I think we should have something that removes the dead
> code.  And we can try it on that PR (without changing init-regs).

Got it, so we should add a fast remove dead code job after init-regs pass.

Richard Sandiford Nov. 10, 2023, 10:39 a.m. UTC | #14

Lehua Ding <lehua.ding@rivai.ai> writes:
> On 2023/11/10 18:16, Richard Sandiford wrote:
>> Lehua Ding <lehua.ding@rivai.ai> writes:
>>> Hi Richard,
>>>
>>> On 2023/11/8 17:40, Richard Sandiford wrote:
>>>> Tracking subreg liveness will sometimes expose dead code that
>>>> wasn't obvious without it.  PR89606 has an example of this.
>>>> There the dead code was introduced by init-regs, and there's a
>>>> debate about (a) whether init-regs should still be run and (b) if it
>>>> should still be run, whether it should use subreg liveness tracking too.
>>>>
>>>> But I think such dead code is possible even without init-regs.
>>>> So for the purpose of this series, I think the init-regs behaviour
>>>> in that PR creates a helpful example.
>>>
>>> Yes, I think the init-regs should be enhanced to reduce unnecessary
>>> initialization. My previous internal patchs did this in a separate
>>> patch. Maybe I should split the live_subreg problem out of the second
>>> patch and not couple it with these patches. That way it can be reviewed
>>> separately.
>> 
>> But my point was that this kind of dead code is possible even without
>> init-regs.  So I think we should have something that removes the dead
>> code.  And we can try it on that PR (without changing init-regs).
>
> Got it, so we should add a fast remove dead code job after init-regs pass.

I'm just not sure how fast it would be, given that it needs the subreg
liveness info.  Could it be done during RA itself, during one of the existing
instruction walks?  E.g. if IRA sees a dead instruction, it could remove it
rather than recording conflict information for it.

Thanks,
Richard

Jeff Law Nov. 10, 2023, 2:28 p.m. UTC | #15

On 11/10/23 03:39, Richard Sandiford wrote:
> Lehua Ding <lehua.ding@rivai.ai> writes:
>> On 2023/11/10 18:16, Richard Sandiford wrote:
>>> Lehua Ding <lehua.ding@rivai.ai> writes:
>>>> Hi Richard,
>>>>
>>>> On 2023/11/8 17:40, Richard Sandiford wrote:
>>>>> Tracking subreg liveness will sometimes expose dead code that
>>>>> wasn't obvious without it.  PR89606 has an example of this.
>>>>> There the dead code was introduced by init-regs, and there's a
>>>>> debate about (a) whether init-regs should still be run and (b) if it
>>>>> should still be run, whether it should use subreg liveness tracking too.
>>>>>
>>>>> But I think such dead code is possible even without init-regs.
>>>>> So for the purpose of this series, I think the init-regs behaviour
>>>>> in that PR creates a helpful example.
>>>>
>>>> Yes, I think the init-regs should be enhanced to reduce unnecessary
>>>> initialization. My previous internal patchs did this in a separate
>>>> patch. Maybe I should split the live_subreg problem out of the second
>>>> patch and not couple it with these patches. That way it can be reviewed
>>>> separately.
>>>
>>> But my point was that this kind of dead code is possible even without
>>> init-regs.  So I think we should have something that removes the dead
>>> code.  And we can try it on that PR (without changing init-regs).
>>
>> Got it, so we should add a fast remove dead code job after init-regs pass.
> 
> I'm just not sure how fast it would be, given that it needs the subreg
> liveness info.  Could it be done during RA itself, during one of the existing
> instruction walks?  E.g. if IRA sees a dead instruction, it could remove it
> rather than recording conflict information for it.
> 
Yea, it's a real concern.  I haven't done the analysis yet, but I have a 
  sense that Joern's ext-dce work which Jivan and I are working on 
(which does sub-object liveness tracking) is having a compile-time 
impact as well.

Jeff

Dimitar Dimitrov Nov. 10, 2023, 4 p.m. UTC | #16

On Fri, Nov 10, 2023 at 04:53:57PM +0800, Lehua Ding wrote:
> > > The divide by zero error above is interesting. I'm not sure why
> > > ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in
> > > the following rtx:
> > > (debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [
> > > encoding ])) -1
> > >       (nil))
> > 
> > I just cross compiled an arm-none-eabi compiler and didn't encounter
> > this error, can you give me a little more config info about build? For
> > example, flags_for_target, etc. Thanks again.
> > 
> 
> Forgot, please also provide the version information of newlib code.
> 

These are the GIT commit hashes which I tested:
  gcc 39d81b667373b0033f44702a4b532a4618dde9ff
  binutils c96ceed9dce7617f270aa4742645706e535f74b7
  newlib 39f734a857e2692224715b03b99fc7bd83e94a0f

This is the script I'm using to build arm-none-eabi:
   https://github.com/dinuxbg/gnupru/blob/master/testing/manual-build-arm.sh
The build steps and config parameters are easily seen there.

Note that the Linaro CI is also detecting issues. It hits ICEs when
building libgcc:
  https://patchwork.sourceware.org/project/gcc/patch/20231108034740.834590-8-lehua.ding@rivai.ai/

Regards,
Dimitar

Richard Sandiford Nov. 11, 2023, 3:33 p.m. UTC | #17

Jeff Law <jeffreyalaw@gmail.com> writes:
> On 11/8/23 02:40, Richard Sandiford wrote:
>> Lehua Ding <lehua.ding@rivai.ai> writes:
>>> Hi,
>>>
>>> These patchs try to support subreg coalesce feature in
>>> register allocation passes (ira and lra).
>> 
>> Thanks a lot for the series.  This is definitely something we've
>> needed for a while.
>> 
>> I probably won't be able to look at it in detail for a couple of weeks
>> (and the real review should come from Vlad anyway), but one initial
>> comment:
> Absolutely agreed on the above.
>
> The other thing to ponder.  Jivan and I have been banging on Joern's 
> sub-object tracking bits for a totally different problem in the RISC-V 
> space.  But there may be some overlap.
>
> Essentially Joern's code tracks liveness for a few chunks in registers. 
> bits 0..7, bits 8..15, bits 16..31 and bits 32..63.  This includes 
> propagating liveness from the destination through to the sources.  SO 
> for example if we have
>
> (set (reg:SI dest) (plus:SI (srcreg1:SI) (srcreg2:SI)))
>
> If we had previously determined that only bits 0..15 were live in DEST, 
> then we'll propagate that into the source registers.
>
> The goal is to ultimately transform something like
>
> (set (dest:mode) (any_extend:mode (reg:narrower_mode)))
>
> into
>
> (set (dest:mode) (subreg:mode (reg:narrower_mode)))
>
> Where the latter typically will get simplified and propagated away.
>
>
> Joern's code is a bit of a mess, but Jivan and I are slowly untangling 
> it from a correctness standpoint.  It'll also need the usual cleanups.

Ah, nice!  How configurable are the bit ranges?  We might be able to use
something similar to track lanes in a vector operation, to detect the
dead code in:

   ins v0.b[4], w0
   ...
   ins v0.b[4], w1

It sounds like the bit ranges you have now would do that for some
common/useful cases, even if it doesn't handle the general case.

Maybe dead lanes are better tracked at the gimple level though, not sure.
(But AArch64 might need to lower lane operations more than it does now if
we want gimple to handle it.)

Richard

Jeff Law Nov. 11, 2023, 5:46 p.m. UTC | #18

On 11/11/23 08:33, Richard Sandiford wrote:

>> Joern's code is a bit of a mess, but Jivan and I are slowly untangling
>> it from a correctness standpoint.  It'll also need the usual cleanups.
> 
> Ah, nice!  How configurable are the bit ranges?  We might be able to use
> something similar to track lanes in a vector operation, to detect the
> dead code in:
> 
>     ins v0.b[4], w0
>     ...
>     ins v0.b[4], w1
> 
> It sounds like the bit ranges you have now would do that for some
> common/useful cases, even if it doesn't handle the general case.
It could probably be extended to handle more cases.  Right now the 
regions tracked are static.  Bits 0..7, 8..16, 16..31 and 32..64.  I 
don't think extending it to additional regions would be terribly hard.

> 
> Maybe dead lanes are better tracked at the gimple level though, not sure.
> (But AArch64 might need to lower lane operations more than it does now if
> we want gimple to handle it.)
I'd think the best place depends on what you want to do with the dead 
lane information.  THe more complex the transformation you want to make 
the more likely gimple is the right spot.  If you're looking to do 
something simplistic like Joern's code does when it finds dead chunks 
RTL seems like the natural choice.

jeff

juzhe.zhong@rivai.ai Nov. 12, 2023, 1:16 a.m. UTC | #19

Hi, Richard.

>> Maybe dead lanes are better tracked at the gimple level though, not sure.
>> (But AArch64 might need to lower lane operations more than it does now if
>> we want gimple to handle it.)

We were trying to address such issue at GIMPLE leve at the beginning.
Tracking subreg-lanes of tuple type may be enough for aarch64 since aarch64 only tuple types.
However, for RVV, that's not enough to address all issues.
Consider this following situation:
https://godbolt.org/z/fhTvEjvr8 

You can see comparing with LLVM, GCC has so many redundant mov instructions "vmv1r.v".
Since GCC is not able to tracking subreg liveness, wheras LLVM can.

The reason why tracking sub-lanes in GIMPLE can not address these redundant move issues for RVV：

1. RVV has tuple type like "vint8m1x2_t" which is totoally the same as aarch64 "svint8x1_t".
    It used by segment load/store which is similiar instruction "ld2r" instruction in ARM SVE (vec_load_lanes/vec_store_lanes)
    Support sub-lanes tracking in GIMPLE can fix this situation for both RVV and ARM SVE.
    
2. However, we are not having "vint8m1x2_t", we also have "vint8m2_t" (LMUL =2) which also occupies 2 regsiters
    which is not tuple type, instead, it is simple vector type. Such type is used by all simple operations.
    For example, "vadd" with vint8m1_t is doing PLUS operation on single vector registers, wheras same
    instruction "vadd“ with vint8m2_t is dong PLUS operation on 2 vector registers.  Such type we can't
    define them as tuple type for following reasons:
    1). we also have tuple type for LMUL > 1, for example, we also have "vint8m2x2_t" has tuple type.
         If we define "vint8m2_t" as tuple type, How about "vint8m2x2_t" ? , Tuple type with tuple or
         Array with array ? It makes type so strange.
    2). RVV instrinsic doc define vint8m2x2_t as tuple type, but vint8m2_t not tuple type. We are not able
         to change the documents.
    3). Clang has supported RVV intrinsics 3 years ago, vint8m2_t is not tuple type for 3 years and widely
         used, changing type definition will destroy ecosystem.  So for compability, we are not able define
         LMUL > 1 as tuple type.

For these reasons, we should be able to access highpart of vint8m2_t and lowpart of vint8m2_t, we provide
vget to generate subreg access of the vector mode.

So, at the discussion stage, we decided to address subpart access of vector mode in more generic way,
which is support subreg liveness tracking in RTL level. So that it can not only address issues happens on ARM SVE,
but also address issues for LMUL > 1.

3. After we decided to support subreg liveness tracking in RTL, we study LLVM.
    Actually, LLVM has a standalone PASS right before their linear scan RA (greedy) call register coalescer.
    So, the first draft of our solution is supporting register coalescing before RA which is opened source:
    riscv-gcc/gcc/ira-coalesce.cc at riscv-gcc-rvv-next ・ riscv-collab/riscv-gcc (github.com)
    by simulating LLVM solution. However, we don't think such solution is elegant and we have consulted
    Vlad.  Vlad suggested we should enhance IRA/LRA with subreg liveness tracking which turns to be
    more reasonable and elegant approach. 

So, after Lehua several experiments and investigations, he dedicate himself produce this series of patches.
And we think Lehua's approach should be generic and optimal solution to fix this subreg generic problems.

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-11-11 23:33
To: Jeff Law
CC: Lehua Ding; gcc-patches; vmakarov; juzhe.zhong
Subject: Re: [PATCH 0/7] ira/lra: Support subreg coalesce
Jeff Law <jeffreyalaw@gmail.com> writes:
> On 11/8/23 02:40, Richard Sandiford wrote:
>> Lehua Ding <lehua.ding@rivai.ai> writes:
>>> Hi,
>>>
>>> These patchs try to support subreg coalesce feature in
>>> register allocation passes (ira and lra).
>> 
>> Thanks a lot for the series.  This is definitely something we've
>> needed for a while.
>> 
>> I probably won't be able to look at it in detail for a couple of weeks
>> (and the real review should come from Vlad anyway), but one initial
>> comment:
> Absolutely agreed on the above.
>
> The other thing to ponder.  Jivan and I have been banging on Joern's 
> sub-object tracking bits for a totally different problem in the RISC-V 
> space.  But there may be some overlap.
>
> Essentially Joern's code tracks liveness for a few chunks in registers. 
> bits 0..7, bits 8..15, bits 16..31 and bits 32..63.  This includes 
> propagating liveness from the destination through to the sources.  SO 
> for example if we have
>
> (set (reg:SI dest) (plus:SI (srcreg1:SI) (srcreg2:SI)))
>
> If we had previously determined that only bits 0..15 were live in DEST, 
> then we'll propagate that into the source registers.
>
> The goal is to ultimately transform something like
>
> (set (dest:mode) (any_extend:mode (reg:narrower_mode)))
>
> into
>
> (set (dest:mode) (subreg:mode (reg:narrower_mode)))
>
> Where the latter typically will get simplified and propagated away.
>
>
> Joern's code is a bit of a mess, but Jivan and I are slowly untangling 
> it from a correctness standpoint.  It'll also need the usual cleanups.
 
Ah, nice!  How configurable are the bit ranges?  We might be able to use
something similar to track lanes in a vector operation, to detect the
dead code in:
 
   ins v0.b[4], w0
   ...
   ins v0.b[4], w1
 
It sounds like the bit ranges you have now would do that for some
common/useful cases, even if it doesn't handle the general case.
 
Maybe dead lanes are better tracked at the gimple level though, not sure.
(But AArch64 might need to lower lane operations more than it does now if
we want gimple to handle it.)
 
Richard

Lehua Ding Nov. 12, 2023, 6:06 a.m. UTC | #20

Hi Dimitar,

On 2023/11/11 0:00, Dimitar Dimitrov wrote:
> On Fri, Nov 10, 2023 at 04:53:57PM +0800, Lehua Ding wrote:
>>>> The divide by zero error above is interesting. I'm not sure why
>>>> ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in
>>>> the following rtx:
>>>> (debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [
>>>> encoding ])) -1
>>>>        (nil))
>>>
>>> I just cross compiled an arm-none-eabi compiler and didn't encounter
>>> this error, can you give me a little more config info about build? For
>>> example, flags_for_target, etc. Thanks again.
>>>
>>
>> Forgot, please also provide the version information of newlib code.
>>
> 
> These are the GIT commit hashes which I tested:
>    gcc 39d81b667373b0033f44702a4b532a4618dde9ff
>    binutils c96ceed9dce7617f270aa4742645706e535f74b7
>    newlib 39f734a857e2692224715b03b99fc7bd83e94a0f
> 
> This is the script I'm using to build arm-none-eabi:
>     https://github.com/dinuxbg/gnupru/blob/master/testing/manual-build-arm.sh
> The build steps and config parameters are easily seen there.
> 
> Note that the Linaro CI is also detecting issues. It hits ICEs when
> building libgcc:
>    https://patchwork.sourceware.org/project/gcc/patch/20231108034740.834590-8-lehua.ding@rivai.ai/

Thanks so much for the information, I can reproduce the problem now! I 
will fixed these bugs in the V2 patchs.

Lehua Ding Nov. 12, 2023, 10:08 a.m. UTC | #21

Hi Dimitar,

I solved the problem you reported in V2 patch 
(https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636166.html), 
is it possible for you to help confirm this? Thank you very much.

On 2023/11/9 0:56, Dimitar Dimitrov wrote:
> On Wed, Nov 08, 2023 at 11:47:33AM +0800, Lehua Ding wrote:
>> Hi,
>>
>> These patchs try to support subreg coalesce feature in
>> register allocation passes (ira and lra).
> 
> Hi Lehua,
> 
> This patch set breaks the build for at least three embedded targets. See
> below.
> 
> For avr the GCC build fails with:
> /mnt/nvme/dinux/local-workspace/gcc/gcc/ira-lives.cc:149:39: error: call of overloaded ‘set_subreg_conflict_hard_regs(ira_allocno*&, int&)’ is ambiguous
>    149 |         set_subreg_conflict_hard_regs (OBJECT_ALLOCNO (obj), regno);
> 
> 
> For arm-none-eabi the newlib build fails with:
> /mnt/nvme/dinux/local-workspace/newlib/newlib/libm/math/e_jn.c:279:1: internal compiler error: Floating point exception
>    279 | }
>        | ^
> 0x1176e0f crash_signal
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
> 0xf6008d get_range_hard_regs(int, subreg_range const&)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
> 0xf6008d get_range_hard_regs(int, subreg_range const&)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:601
> 0xf60312 new_insn_reg
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
> 0xf6064d add_regs_to_insn_regno_info
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1623
> 0xf62909 lra_update_insn_regno_info(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
> 0xf62e46 lra_update_insn_regno_info(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1762
> 0xf62e46 lra_push_insn_1
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
> 0xf62f2d lra_push_insn(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
> 0xf62f2d push_insns
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
> 0xf63302 push_insns
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1966
> 0xf63302 lra(_IO_FILE*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
> 0xf0e399 do_reload
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
> 0xf0e399 execute
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148
> 
> 
> For pru-elf the GCC build fails with:
> /mnt/nvme/dinux/local-workspace/gcc/libgcc/unwind-dw2-fde.c: In function 'linear_search_fdes':
> /mnt/nvme/dinux/local-workspace/gcc/libgcc/unwind-dw2-fde.c:1035:1: internal compiler error: Floating point exception
>   1035 | }
>        | ^
> 0x1694f2e crash_signal
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
> 0x1313178 get_range_hard_regs(int, subreg_range const&)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
> 0x131343a new_insn_reg
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
> 0x13174f0 add_regs_to_insn_regno_info
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1608
> 0x1318479 lra_update_insn_regno_info(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
> 0x13196ab lra_push_insn_1
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
> 0x13196de lra_push_insn(rtx_insn*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
> 0x13197da push_insns
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
> 0x131b6dc lra(_IO_FILE*)
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
> 0x129f237 do_reload
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
> 0x129f6c6 execute
>          /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148
> 
> 
> The divide by zero error above is interesting. I'm not sure why ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in the following rtx:
> (debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [ encoding ])) -1
>       (nil))
> 
> Regards,
> Dimitar
>

Richard Sandiford Nov. 12, 2023, 11:53 a.m. UTC | #22

钟居哲 <juzhe.zhong@rivai.ai> writes:
> Hi, Richard.
>
>>> Maybe dead lanes are better tracked at the gimple level though, not sure.
>>> (But AArch64 might need to lower lane operations more than it does now if
>>> we want gimple to handle it.)
>
> We were trying to address such issue at GIMPLE leve at the beginning.
> Tracking subreg-lanes of tuple type may be enough for aarch64 since aarch64 only tuple types.
> However, for RVV, that's not enough to address all issues.
> Consider this following situation:
> https://godbolt.org/z/fhTvEjvr8 
>
> You can see comparing with LLVM, GCC has so many redundant mov instructions "vmv1r.v".
> Since GCC is not able to tracking subreg liveness, wheras LLVM can.
>
> The reason why tracking sub-lanes in GIMPLE can not address these redundant move issues for RVV：
>
> 1. RVV has tuple type like "vint8m1x2_t" which is totoally the same as aarch64 "svint8x1_t".
>     It used by segment load/store which is similiar instruction "ld2r" instruction in ARM SVE (vec_load_lanes/vec_store_lanes)
>     Support sub-lanes tracking in GIMPLE can fix this situation for both RVV and ARM SVE.
>     
> 2. However, we are not having "vint8m1x2_t", we also have "vint8m2_t" (LMUL =2) which also occupies 2 regsiters
>     which is not tuple type, instead, it is simple vector type. Such type is used by all simple operations.
>     For example, "vadd" with vint8m1_t is doing PLUS operation on single vector registers, wheras same
>     instruction "vadd“ with vint8m2_t is dong PLUS operation on 2 vector registers.  Such type we can't
>     define them as tuple type for following reasons:
>     1). we also have tuple type for LMUL > 1, for example, we also have "vint8m2x2_t" has tuple type.
>          If we define "vint8m2_t" as tuple type, How about "vint8m2x2_t" ? , Tuple type with tuple or
>          Array with array ? It makes type so strange.
>     2). RVV instrinsic doc define vint8m2x2_t as tuple type, but vint8m2_t not tuple type. We are not able
>          to change the documents.
>     3). Clang has supported RVV intrinsics 3 years ago, vint8m2_t is not tuple type for 3 years and widely
>          used, changing type definition will destroy ecosystem.  So for compability, we are not able define
>          LMUL > 1 as tuple type.
>
> For these reasons, we should be able to access highpart of vint8m2_t and lowpart of vint8m2_t, we provide
> vget to generate subreg access of the vector mode.
>
> So, at the discussion stage, we decided to address subpart access of vector mode in more generic way,
> which is support subreg liveness tracking in RTL level. So that it can not only address issues happens on ARM SVE,
> but also address issues for LMUL > 1.
>
> 3. After we decided to support subreg liveness tracking in RTL, we study LLVM.
>     Actually, LLVM has a standalone PASS right before their linear scan RA (greedy) call register coalescer.
>     So, the first draft of our solution is supporting register coalescing before RA which is opened source:
>     riscv-gcc/gcc/ira-coalesce.cc at riscv-gcc-rvv-next · riscv-collab/riscv-gcc (github.com)
>     by simulating LLVM solution. However, we don't think such solution is elegant and we have consulted
>     Vlad.  Vlad suggested we should enhance IRA/LRA with subreg liveness tracking which turns to be
>     more reasonable and elegant approach. 
>
> So, after Lehua several experiments and investigations, he dedicate himself produce this series of patches.
> And we think Lehua's approach should be generic and optimal solution to fix this subreg generic problems.

Ah, sorry, I caused a misunderstanding.  In the message quoted above,
I'd moved on from talking about tracking liveness of vectors in a tuple.
I was instead talking about tracking the liveness of individual lanes
in a single vector.

I was responding to Jeff's description of the bit-level liveness tracking
pass.  That pass solves a generic issue: redundant sign and zero extensions.
But it sounded like it could also be reused for tracking lanes of a vector
(by using different bit ranges from the ones that Jeff listed).

The thing that I was saying might be better done on gimple was tracking
lanes of an individual vector.  In other words, I was arguing against
my own question.

I should have changed the subject line when responding, sorry.

I wasn't suggesting that we should avoid subreg tracking in the RA.
That's definitely needed for AArch64, and in general.

Thanks,
Richard

Lehua Ding Nov. 12, 2023, 12:01 p.m. UTC | #23

Hi Vladimir,

On 2023/11/10 4:24, Vladimir Makarov wrote:
> 
> On 11/7/23 22:47, Lehua Ding wrote:
>>
>> Lehua Ding (7):
>>    ira: Refactor the handling of register conflicts to make it more
>>      general
>>    ira: Add live_subreg problem and apply to ira pass
>>    ira: Support subreg live range track
>>    ira: Support subreg copy
>>    ira: Add all nregs >= 2 pseudos to tracke subreg list
>>    lra: Apply live_subreg df_problem to lra pass
>>    lra: Support subreg live range track and conflict detect
>>
> Thank you very much for addressing subreg RA.  It is a big work.  I 
> wanted to address this long time ago but have no time to do this by myself.
> 
> I tried to evaluate your patches on x86-64 (i7-9700k) release mode GCC. 
> I used -O3 for SPEC2017 compilation.
> 
> Here are the results:
> 
>                 baseline baseline(+patches)
> specint2017:  8.51 vs 8.58 (+0.8%)
> specfp2017:   21.1 vs 21.1 (+0%)
> compile time: 2426.41s vs 2580.58s (+6.4%)
> 
> Spec2017 average code size change: -0.07%
> 
> Improving specint by 0.8% is impressive for me.
> 
> Unfortunately, it is achieved by decreasing compilation speed by 6.4% 
> (although on smaller benchmark I saw only 3% slowdown). I don't know how 
> but we should mitigate this speed degradation.  May be we can find a hot 
> spot in the new code (but I think it is not a linear search pointed by 
> Richard Biener as the object vectors most probably contain 1-2 elements) 
> and this code spot can be improved, or we could use this only for 
> -O3/fast, or the code can be function or target dependent.
> 
> I also find GCC consumes more memory with the patches. May be it can be 
> improved too (although I am not sure about this).

Thanks for the specint performance data. I'll do my best to get the 
compile time and memory issues fixed. I'm very curious to know if the 
way used to solve the subreg coalesce problem makes sense to you?

> I'll start to review the patches on the next week.  I don't expect that 
> I'll find something serious to reject the patches but again we should 
> work on mitigation of the compilation speed problem.  We can fill a new 
> PR for this and resolve the problem during the release cycle.

Lehua Ding Nov. 12, 2023, 12:12 p.m. UTC | #24

Hi Vladimir,

While you're starting your review, please review v3 version that fixes 
some ICE issues, thanks.

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636178.html

On 2023/11/12 20:01, Lehua Ding wrote:
> Hi Vladimir,
> 
> On 2023/11/10 4:24, Vladimir Makarov wrote:
>>
>> On 11/7/23 22:47, Lehua Ding wrote:
>>>
>>> Lehua Ding (7):
>>>    ira: Refactor the handling of register conflicts to make it more
>>>      general
>>>    ira: Add live_subreg problem and apply to ira pass
>>>    ira: Support subreg live range track
>>>    ira: Support subreg copy
>>>    ira: Add all nregs >= 2 pseudos to tracke subreg list
>>>    lra: Apply live_subreg df_problem to lra pass
>>>    lra: Support subreg live range track and conflict detect
>>>
>> Thank you very much for addressing subreg RA.  It is a big work.  I 
>> wanted to address this long time ago but have no time to do this by 
>> myself.
>>
>> I tried to evaluate your patches on x86-64 (i7-9700k) release mode 
>> GCC. I used -O3 for SPEC2017 compilation.
>>
>> Here are the results:
>>
>>                 baseline baseline(+patches)
>> specint2017:  8.51 vs 8.58 (+0.8%)
>> specfp2017:   21.1 vs 21.1 (+0%)
>> compile time: 2426.41s vs 2580.58s (+6.4%)
>>
>> Spec2017 average code size change: -0.07%
>>
>> Improving specint by 0.8% is impressive for me.
>>
>> Unfortunately, it is achieved by decreasing compilation speed by 6.4% 
>> (although on smaller benchmark I saw only 3% slowdown). I don't know 
>> how but we should mitigate this speed degradation.  May be we can find 
>> a hot spot in the new code (but I think it is not a linear search 
>> pointed by Richard Biener as the object vectors most probably contain 
>> 1-2 elements) and this code spot can be improved, or we could use this 
>> only for -O3/fast, or the code can be function or target dependent.
>>
>> I also find GCC consumes more memory with the patches. May be it can 
>> be improved too (although I am not sure about this).
> 
> Thanks for the specint performance data. I'll do my best to get the 
> compile time and memory issues fixed. I'm very curious to know if the 
> way used to solve the subreg coalesce problem makes sense to you?
> 
>> I'll start to review the patches on the next week.  I don't expect 
>> that I'll find something serious to reject the patches but again we 
>> should work on mitigation of the compilation speed problem.  We can 
>> fill a new PR for this and resolve the problem during the release cycle.
>

juzhe.zhong@rivai.ai Nov. 13, 2023, 1:11 a.m. UTC | #25

>> Ah, nice!  How configurable are the bit ranges?
I think Lehua's patch is configurable for bit ranges.
Since his patch allow target flexible tracking subreg livenesss according to REGMODE_NATURAL_SIZE

+/* Return true if REGNO is a pseudo and MODE is a multil regs size.  */
+bool
+need_track_subreg (int regno, machine_mode reg_mode)
+{
+  poly_int64 total_size = GET_MODE_SIZE (reg_mode);
+  poly_int64 natural_size = REGMODE_NATURAL_SIZE (reg_mode);
+  return maybe_gt (total_size, natural_size)
+	 && multiple_p (total_size, natural_size)
+	 && regno >= FIRST_PSEUDO_REGISTER;
+}
It depends on how targets configure REGMODE_NATURAL_SIZE target hook.

If we return QImode size, his patch is enable tracking bit ranges 7 bits subreg.


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-11-12 19:53
To: 钟居哲
CC: Jeff Law; 丁乐华; gcc-patches; vmakarov
Subject: Re: [PATCH 0/7] ira/lra: Support subreg coalesce
钟居哲 <juzhe.zhong@rivai.ai> writes:
> Hi, Richard.
>
>>> Maybe dead lanes are better tracked at the gimple level though, not sure.
>>> (But AArch64 might need to lower lane operations more than it does now if
>>> we want gimple to handle it.)
>
> We were trying to address such issue at GIMPLE leve at the beginning.
> Tracking subreg-lanes of tuple type may be enough for aarch64 since aarch64 only tuple types.
> However, for RVV, that's not enough to address all issues.
> Consider this following situation:
> https://godbolt.org/z/fhTvEjvr8 
>
> You can see comparing with LLVM, GCC has so many redundant mov instructions "vmv1r.v".
> Since GCC is not able to tracking subreg liveness, wheras LLVM can.
>
> The reason why tracking sub-lanes in GIMPLE can not address these redundant move issues for RVV：
>
> 1. RVV has tuple type like "vint8m1x2_t" which is totoally the same as aarch64 "svint8x1_t".
>     It used by segment load/store which is similiar instruction "ld2r" instruction in ARM SVE (vec_load_lanes/vec_store_lanes)
>     Support sub-lanes tracking in GIMPLE can fix this situation for both RVV and ARM SVE.
>     
> 2. However, we are not having "vint8m1x2_t", we also have "vint8m2_t" (LMUL =2) which also occupies 2 regsiters
>     which is not tuple type, instead, it is simple vector type. Such type is used by all simple operations.
>     For example, "vadd" with vint8m1_t is doing PLUS operation on single vector registers, wheras same
>     instruction "vadd“ with vint8m2_t is dong PLUS operation on 2 vector registers.  Such type we can't
>     define them as tuple type for following reasons:
>     1). we also have tuple type for LMUL > 1, for example, we also have "vint8m2x2_t" has tuple type.
>          If we define "vint8m2_t" as tuple type, How about "vint8m2x2_t" ? , Tuple type with tuple or
>          Array with array ? It makes type so strange.
>     2). RVV instrinsic doc define vint8m2x2_t as tuple type, but vint8m2_t not tuple type. We are not able
>          to change the documents.
>     3). Clang has supported RVV intrinsics 3 years ago, vint8m2_t is not tuple type for 3 years and widely
>          used, changing type definition will destroy ecosystem.  So for compability, we are not able define
>          LMUL > 1 as tuple type.
>
> For these reasons, we should be able to access highpart of vint8m2_t and lowpart of vint8m2_t, we provide
> vget to generate subreg access of the vector mode.
>
> So, at the discussion stage, we decided to address subpart access of vector mode in more generic way,
> which is support subreg liveness tracking in RTL level. So that it can not only address issues happens on ARM SVE,
> but also address issues for LMUL > 1.
>
> 3. After we decided to support subreg liveness tracking in RTL, we study LLVM.
>     Actually, LLVM has a standalone PASS right before their linear scan RA (greedy) call register coalescer.
>     So, the first draft of our solution is supporting register coalescing before RA which is opened source:
>     riscv-gcc/gcc/ira-coalesce.cc at riscv-gcc-rvv-next · riscv-collab/riscv-gcc (github.com)
>     by simulating LLVM solution. However, we don't think such solution is elegant and we have consulted
>     Vlad.  Vlad suggested we should enhance IRA/LRA with subreg liveness tracking which turns to be
>     more reasonable and elegant approach. 
>
> So, after Lehua several experiments and investigations, he dedicate himself produce this series of patches.
> And we think Lehua's approach should be generic and optimal solution to fix this subreg generic problems.
 
Ah, sorry, I caused a misunderstanding.  In the message quoted above,
I'd moved on from talking about tracking liveness of vectors in a tuple.
I was instead talking about tracking the liveness of individual lanes
in a single vector.
 
I was responding to Jeff's description of the bit-level liveness tracking
pass.  That pass solves a generic issue: redundant sign and zero extensions.
But it sounded like it could also be reused for tracking lanes of a vector
(by using different bit ranges from the ones that Jeff listed).
 
The thing that I was saying might be better done on gimple was tracking
lanes of an individual vector.  In other words, I was arguing against
my own question.
 
I should have changed the subject line when responding, sorry.
 
I wasn't suggesting that we should avoid subreg tracking in the RA.
That's definitely needed for AArch64, and in general.
 
Thanks,
Richard

Lehua Ding Nov. 13, 2023, 3:34 a.m. UTC | #26

On 2023/11/13 9:11, juzhe.zhong@rivai.ai wrote:
>>> Ah, nice!  How configurable are the bit ranges?
> I think Lehua's patch is configurable for bit ranges.
> Since his patch allow target flexible tracking subreg livenesss 
> according to REGMODE_NATURAL_SIZE
> 
> +/* Return true if REGNO is a pseudo and MODE is a multil regs size.  */
> +bool
> +need_track_subreg (int regno, machine_mode reg_mode)
> +{
> +  poly_int64 total_size = GET_MODE_SIZE (reg_mode);
> +  poly_int64 natural_size = REGMODE_NATURAL_SIZE (reg_mode);
> +  return maybe_gt (total_size, natural_size)
> +	 && multiple_p (total_size, natural_size)
> +	 && regno >= FIRST_PSEUDO_REGISTER;
> +}
> 
> It depends on how targets configure REGMODE_NATURAL_SIZE target hook.
> 
> If we return QImode size, his patch is enable tracking bit ranges 7 bits 
> subreg.

Yes, the current subreg_ranges class provides 
remove_range/add_range/remove_ranges/add_ranges interfaces to modify 
ranges. Each subreg_range contains start and end fields representing the 
range [start, end). For live_subreg problem, the value returned by 
REGMODE_NATURAL_SIZE is used as the unit, for bit track like Jeff's 
side, it can be used bit as the unit.

Vladimir Makarov Nov. 13, 2023, 7:25 p.m. UTC | #27

On 11/12/23 07:01, Lehua Ding wrote:
> Thanks for the specint performance data. I'll do my best to get the 
> compile time and memory issues fixed. I'm very curious to know if the 
> way used to solve the subreg coalesce problem makes sense to you?
>
If it works,  it is ok for me.  There is always a room for any 
optimization even if it decreases compilation speed considerably. We 
just need to keep the same speed for optimization level <= 2.  We can 
put really expensive optimizations to -O3 or -Ofast.

Although the first thing I would try myself is to do subreg liveness 
analysis only locally (inside BBs).  The majority cases I saw to improve 
subreg RA were local (inside a BB).   For such approach, we probably 
would have only minor compiler speed slowdown and could use the 
optimization by default.