RISC-V: Fine tune gather load RA constraint

Message ID 20230313082855.248118-1-juzhe.zhong@rivai.ai
State Committed
Commit a010f0e08501b267ecb925ff88450f58e01dd991
Series RISC-V: Fine tune gather load RA constraint

Commit Message

钟居哲 March 13, 2023, 8:28 a.m. UTC
  From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

For DEST EEW < SOURCE EEW, the destination register group may partially
overlap the source register group according to the RVV ISA.
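
For example (a minimal sketch mirroring the new test; the function name is
illustrative), an indexed load whose index EEW is 64 but whose data EEW is 8:

  #include "riscv_vector.h"

  /* The index operand is a 64-bit EEW LMUL=8 register group while the
     destination is an 8-bit EEW LMUL=1 register, so the destination may
     now be allocated inside the index register group instead of always
     being forced into a disjoint (earlyclobber-only) allocation.  */
  void
  gather_i8_by_u64 (void *base, void *out, size_t vl)
  {
    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
    vint8m1_t v = __riscv_vluxei64_v_i8m1 (base, bindex, vl);
    __riscv_vse8_v_i8m1 (out, v, vl);
  }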

gcc/ChangeLog:

        * config/riscv/vector.md: Fix RA constraint.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/base/narrow_constraint-12.c: New test.

---
 gcc/config/riscv/vector.md                    |  54 ++--
 .../riscv/rvv/base/narrow_constraint-12.c     | 303 ++++++++++++++++++
 2 files changed, 330 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/narrow_constraint-12.c
  

Comments

Jeff Law March 14, 2023, 6:08 p.m. UTC | #1
On 3/13/23 02:28, juzhe.zhong@rivai.ai wrote:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> For DEST EEW < SOURCE EEW, the destination register group may partially
> overlap the source register group according to the RVV ISA.
> 
> gcc/ChangeLog:
> 
>          * config/riscv/vector.md: Fix RA constraint.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.target/riscv/rvv/base/narrow_constraint-12.c: New test.
Similarly.  I think this can wait for gcc-14.

jeff
  
钟居哲 March 15, 2023, 6:52 a.m. UTC | #2
Hi, Jeff. I really hope the current "fine tune RA constraint" patches can be merged into GCC-13.
These patches only make the RA constraints consistent with the RVV ISA, which I have double-checked.
These RA constraint changes are very safe.
This is the last thing I want to get into GCC-13.

Any further patches I send are expected to be merged into GCC-14.

Thanks.


juzhe.zhong@rivai.ai
 
  
Jeff Law March 19, 2023, 4:55 p.m. UTC | #3
On 3/15/23 00:52, juzhe.zhong@rivai.ai wrote:
> Hi, Jeff. I really hope the current "fine tune RA constraint" patches
> can be merged into GCC-13.
> These patches only make the RA constraints consistent with the RVV ISA,
> which I have double-checked.
> These RA constraint changes are very safe.
They may be very safe, but we're *way* past the point where we should be
making this kind of change.  When I agreed to not object to including
the RVV builtins in gcc-13, I never imagined we'd still be making
changes to that code in March.  My bad for not getting clarification on
how much work remained to be done.


Jeff
  
钟居哲 March 20, 2023, 12:49 a.m. UTC | #4
It's OK to defer them to GCC-14. I will keep testing and fixing bugs during these two months.
I won't add any more features or optimizations until GCC-14 is open.



juzhe.zhong@rivai.ai
 
  
Jeff Law April 21, 2023, 8:36 p.m. UTC | #5
On 3/13/23 02:28, juzhe.zhong@rivai.ai wrote:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> For DEST EEW < SOURCE EEW, the destination register group may partially
> overlap the source register group according to the RVV ISA.
> 
> gcc/ChangeLog:
> 
>          * config/riscv/vector.md: Fix RA constraint.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.target/riscv/rvv/base/narrow_constraint-12.c: New test.
This is OK.

The one question I keep having when I read these patterns is why we have 
the earlyclobber.

Earlyclobber means that the output is potentially written before the 
inputs are consumed.   Typically for a single instruction pattern such 
constraints wouldn't make a lot of sense as *usually* the inputs are 
consumed before the output is written.

Just looking for a clarification as to why the earlyclobbers are needed 
at all, particularly for non-reduction patterns.

jeff
  
钟居哲 April 24, 2023, 3:05 a.m. UTC | #6
The earlyclobber is added so that the dest operand does not overlap the source operand.
For example:
for a gather load, vluxei.v v8,(a5),v8 is illegal according to the RVV ISA.
GCC uses the same approach as LLVM, which also adds an earlyclobber to model the forbidden overlap between the dest and source operands.
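
In GCC constraint syntax this is the "&" (earlyclobber) modifier on the
output operand.  A minimal scalar sketch of what that modifier tells the
register allocator (illustrative only, not taken from vector.md):

  /* "=&r" marks OUT as earlyclobber: it must not share a register with
     any input.  Here that matters because OUT is written while B is
     still live; in vector.md the same modifier is used to keep the
     destination from overlapping the index operand in the cases the
     RVV ISA does not allow.  */
  long
  two_adds (long a, long b)
  {
    long out;
    asm ("add %0,%1,%2\n\t"
         "add %0,%0,%2"
         : "=&r" (out)
         : "r" (a), "r" (b));
    return out;
  }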



juzhe.zhong@rivai.ai
 
  
Kito Cheng April 26, 2023, 4:21 a.m. UTC | #7
Committed to trunk

  

Patch

diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 37a539b4852..4ea74372de5 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -1434,63 +1434,63 @@ 
 
 ;; DEST eew is smaller than SOURCE eew.
 (define_insn "@pred_indexed_<order>load<mode>_x2_smaller_eew"
-  [(set (match_operand:VEEWTRUNC2 0 "register_operand"                "=&vr,  &vr")
+  [(set (match_operand:VEEWTRUNC2 0 "register_operand"               "=vd, vd, vr, vr,  &vr,  &vr")
 	(if_then_else:VEEWTRUNC2
 	  (unspec:<VM>
-	    [(match_operand:<VM> 1 "vector_mask_operand"             "vmWc1,vmWc1")
-	     (match_operand 5 "vector_length_operand"                "   rK,   rK")
-	     (match_operand 6 "const_int_operand"                    "    i,    i")
-	     (match_operand 7 "const_int_operand"                    "    i,    i")
-	     (match_operand 8 "const_int_operand"                    "    i,    i")
+	    [(match_operand:<VM> 1 "vector_mask_operand"             " vm, vm,Wc1,Wc1,vmWc1,vmWc1")
+	     (match_operand 5 "vector_length_operand"                " rK, rK, rK, rK,   rK,   rK")
+	     (match_operand 6 "const_int_operand"                    "  i,  i,  i,  i,    i,    i")
+	     (match_operand 7 "const_int_operand"                    "  i,  i,  i,  i,    i,    i")
+	     (match_operand 8 "const_int_operand"                    "  i,  i,  i,  i,    i,    i")
 	     (reg:SI VL_REGNUM)
 	     (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
 	  (unspec:VEEWTRUNC2
-	    [(match_operand 3 "pmode_register_operand"               "    r,    r")
+	    [(match_operand 3 "pmode_register_operand"               "  r,  r,  r,  r,    r,    r")
 	     (mem:BLK (scratch))
-	     (match_operand:<VINDEX_DOUBLE_EXT> 4 "register_operand" "   vr,   vr")] ORDER)
-	  (match_operand:VEEWTRUNC2 2 "vector_merge_operand"         "   vu,    0")))]
+	     (match_operand:<VINDEX_DOUBLE_EXT> 4 "register_operand" "  0,  0,  0,  0,   vr,   vr")] ORDER)
+	  (match_operand:VEEWTRUNC2 2 "vector_merge_operand"         " vu,  0, vu,  0,   vu,    0")))]
   "TARGET_VECTOR"
   "vl<order>xei<double_ext_sew>.v\t%0,(%3),%4%p1"
   [(set_attr "type" "vld<order>x")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "@pred_indexed_<order>load<mode>_x4_smaller_eew"
-  [(set (match_operand:VEEWTRUNC4 0 "register_operand"              "=&vr,  &vr")
+  [(set (match_operand:VEEWTRUNC4 0 "register_operand"             "=vd, vd, vr, vr,  &vr,  &vr")
 	(if_then_else:VEEWTRUNC4
 	  (unspec:<VM>
-	    [(match_operand:<VM> 1 "vector_mask_operand"           "vmWc1,vmWc1")
-	     (match_operand 5 "vector_length_operand"              "   rK,   rK")
-	     (match_operand 6 "const_int_operand"                  "    i,    i")
-	     (match_operand 7 "const_int_operand"                  "    i,    i")
-	     (match_operand 8 "const_int_operand"                  "    i,    i")
+	    [(match_operand:<VM> 1 "vector_mask_operand"           " vm, vm,Wc1,Wc1,vmWc1,vmWc1")
+	     (match_operand 5 "vector_length_operand"              " rK, rK, rK, rK,   rK,   rK")
+	     (match_operand 6 "const_int_operand"                  "  i,  i,  i,  i,    i,    i")
+	     (match_operand 7 "const_int_operand"                  "  i,  i,  i,  i,    i,    i")
+	     (match_operand 8 "const_int_operand"                  "  i,  i,  i,  i,    i,    i")
 	     (reg:SI VL_REGNUM)
 	     (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
 	  (unspec:VEEWTRUNC4
-	    [(match_operand 3 "pmode_register_operand"             "    r,    r")
+	    [(match_operand 3 "pmode_register_operand"             "  r,  r,  r,  r,    r,    r")
 	     (mem:BLK (scratch))
-	     (match_operand:<VINDEX_QUAD_EXT> 4 "register_operand" "   vr,   vr")] ORDER)
-	  (match_operand:VEEWTRUNC4 2 "vector_merge_operand"       "   vu,    0")))]
+	     (match_operand:<VINDEX_QUAD_EXT> 4 "register_operand" "  0,  0,  0,  0,   vr,   vr")] ORDER)
+	  (match_operand:VEEWTRUNC4 2 "vector_merge_operand"       " vu,  0, vu,  0,   vu,    0")))]
   "TARGET_VECTOR"
   "vl<order>xei<quad_ext_sew>.v\t%0,(%3),%4%p1"
   [(set_attr "type" "vld<order>x")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "@pred_indexed_<order>load<mode>_x8_smaller_eew"
-  [(set (match_operand:VEEWTRUNC8 0 "register_operand"             "=&vr,  &vr")
+  [(set (match_operand:VEEWTRUNC8 0 "register_operand"            "=vd, vd, vr, vr,  &vr,  &vr")
 	(if_then_else:VEEWTRUNC8
 	  (unspec:<VM>
-	    [(match_operand:<VM> 1 "vector_mask_operand"          "vmWc1,vmWc1")
-	     (match_operand 5 "vector_length_operand"             "   rK,   rK")
-	     (match_operand 6 "const_int_operand"                 "    i,    i")
-	     (match_operand 7 "const_int_operand"                 "    i,    i")
-	     (match_operand 8 "const_int_operand"                 "    i,    i")
+	    [(match_operand:<VM> 1 "vector_mask_operand"          " vm, vm,Wc1,Wc1,vmWc1,vmWc1")
+	     (match_operand 5 "vector_length_operand"             " rK, rK, rK, rK,   rK,   rK")
+	     (match_operand 6 "const_int_operand"                 "  i,  i,  i,  i,    i,    i")
+	     (match_operand 7 "const_int_operand"                 "  i,  i,  i,  i,    i,    i")
+	     (match_operand 8 "const_int_operand"                 "  i,  i,  i,  i,    i,    i")
 	     (reg:SI VL_REGNUM)
 	     (reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
 	  (unspec:VEEWTRUNC8
-	    [(match_operand 3 "pmode_register_operand"            "    r,    r")
+	    [(match_operand 3 "pmode_register_operand"            "  r,  r,  r,  r,    r,    r")
 	     (mem:BLK (scratch))
-	     (match_operand:<VINDEX_OCT_EXT> 4 "register_operand" "   vr,   vr")] ORDER)
-	  (match_operand:VEEWTRUNC8 2 "vector_merge_operand"      "   vu,    0")))]
+	     (match_operand:<VINDEX_OCT_EXT> 4 "register_operand" "  0,  0,  0,  0,   vr,   vr")] ORDER)
+	  (match_operand:VEEWTRUNC8 2 "vector_merge_operand"      " vu,  0, vu,  0,   vu,    0")))]
   "TARGET_VECTOR"
   "vl<order>xei<oct_ext_sew>.v\t%0,(%3),%4%p1"
   [(set_attr "type" "vld<order>x")
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/narrow_constraint-12.c b/gcc/testsuite/gcc.target/riscv/rvv/base/narrow_constraint-12.c
new file mode 100644
index 00000000000..df5b2dc5c51
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/narrow_constraint-12.c
@@ -0,0 +1,303 @@ 
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3" } */
+
+#include "riscv_vector.h"
+
+void f0 (void *base,void *out,size_t vl)
+{
+    vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base, vl);
+    vint8mf8_t v = __riscv_vluxei64_v_i8mf8(base,bindex,vl);
+    __riscv_vse8_v_i8mf8 (out,v,vl);
+}
+
+void f1 (void *base,void *out,size_t vl)
+{
+    vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base, vl);
+    vint8mf8_t bindex2 = __riscv_vle8_v_i8mf8 ((void *)(base + 100), vl);
+    vint8mf8_t v = __riscv_vluxei64_v_i8mf8_tu(bindex2,base,bindex,vl);
+    __riscv_vse8_v_i8mf8 (out,v,vl);
+}
+
+void f2 (void *base,void *out,size_t vl)
+{
+    vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base, vl);
+    vint8mf8_t v = __riscv_vluxei64_v_i8mf8(base,bindex,vl);
+    vuint64m1_t v2 = __riscv_vadd_vv_u64m1 (bindex, bindex,vl);
+    __riscv_vse8_v_i8mf8 (out,v,vl);
+    __riscv_vse64_v_u64m1 ((void *)out,v2,vl);
+}
+
+void f3 (void *base,void *out,size_t vl, int n)
+{
+    for (int i = 0; i < n; i++){
+      vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base + 100*i, vl);
+      vint8mf8_t v = __riscv_vluxei64_v_i8mf8(base,bindex,vl);
+      vuint64m1_t v2 = __riscv_vadd_vv_u64m1 (bindex, bindex,vl);
+      __riscv_vse8_v_i8mf8 (out + 100*i,v,vl);
+      __riscv_vse64_v_u64m1 ((void *)(out + 200*i),v2,vl);
+    }
+}
+
+void f4 (void *base,void *out,size_t vl)
+{
+    vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base, vl);
+    vint8mf8_t v = __riscv_vluxei64_v_i8mf8(base,bindex,vl);
+    v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+    v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+    vuint64m1_t v2 = __riscv_vadd_vv_u64m1 (bindex, bindex,vl);
+    __riscv_vse8_v_i8mf8 (out,v,vl);
+    __riscv_vse64_v_u64m1 ((void *)out,v2,vl);
+}
+
+void f5 (void *base,void *base2,void *out,size_t vl, int n)
+{
+    vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base + 100, vl);
+    for (int i = 0; i < n; i++){
+      vbool64_t m = __riscv_vlm_v_b64 (base + i, vl);
+      vint8mf8_t v = __riscv_vluxei64_v_i8mf8_m(m,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vle8_v_i8mf8_tu (v, base2, vl);
+      __riscv_vse8_v_i8mf8 (out + 100*i,v,vl);
+    }
+}
+
+void f6 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    vint8m1_t v = __riscv_vluxei64_v_i8m1(base,bindex,vl);
+    __riscv_vse8_v_i8m1 (out,v,vl);
+}
+
+void f7 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    vint8m1_t src = __riscv_vle8_v_i8m1 ((void *)(base + 100), vl);
+    vint8m1_t v = __riscv_vluxei64_v_i8m1_tu(src,base,bindex,vl);
+    __riscv_vse8_v_i8m1 (out,v,vl);
+}
+
+void f8 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    vint8m1_t v = __riscv_vluxei64_v_i8m1(base,bindex,vl);
+    vuint64m8_t v2 = __riscv_vadd_vv_u64m8 (bindex, bindex,vl);
+    __riscv_vse8_v_i8m1 (out,v,vl);
+    __riscv_vse64_v_u64m8 ((void *)out,v2,vl);
+}
+
+void f9 (void *base,void *out,size_t vl, int n)
+{
+    for (int i = 0; i < n; i++){
+      vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100*i, vl);
+      vint8m1_t v = __riscv_vluxei64_v_i8m1(base,bindex,vl);
+      vuint64m8_t v2 = __riscv_vadd_vv_u64m8 (bindex, bindex,vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v,vl);
+      __riscv_vse64_v_u64m8 ((void *)(out + 200*i),v2,vl);
+    }
+}
+
+void f10 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    vint8m1_t v = __riscv_vluxei64_v_i8m1(base,bindex,vl);
+    v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+    v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+    vuint64m8_t v2 = __riscv_vadd_vv_u64m8 (bindex, bindex,vl);
+    __riscv_vse8_v_i8m1 (out,v,vl);
+    __riscv_vse64_v_u64m8 ((void *)out,v2,vl);
+}
+
+void f11 (void *base,void *base2,void *out,size_t vl, int n)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100, vl);
+    for (int i = 0; i < n; i++){
+      vbool8_t m = __riscv_vlm_v_b8 (base + i, vl);
+      vint8m1_t v = __riscv_vluxei64_v_i8m1_m(m,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vle8_v_i8m1_tu (v, base2, vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v,vl);
+    }
+}
+
+void f12 (void *base,void *out,size_t vl, int n)
+{
+    vint8mf8_t v = __riscv_vle8_v_i8mf8 ((void *)(base + 1000), vl);
+    for (int i = 0; i < n; i++){
+      vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base + 100*i, vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      __riscv_vse8_v_i8mf8 (out + 100*i,v,vl);
+    }
+}
+
+void f13 (void *base,void *out,size_t vl, int n)
+{
+    vint8m1_t v = __riscv_vle8_v_i8m1 ((void *)(base + 1000), vl);
+    for (int i = 0; i < n; i++){
+      vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100*i, vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v,vl);
+    }
+}
+
+void f14 (void *base,void *out,size_t vl, int n)
+{
+    for (int i = 0; i < n; i++){
+      vint8mf8_t v = __riscv_vle8_v_i8mf8 ((void *)(base + 1000 * i), vl);
+      vuint64m1_t bindex = __riscv_vle64_v_u64m1 (base + 100*i, vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex,vl);
+      __riscv_vse8_v_i8mf8 (out + 100*i,v,vl);
+    }
+}
+
+void f15 (void *base,void *out,size_t vl, int n)
+{
+    for (int i = 0; i < n; i++){
+      vint8m1_t v = __riscv_vle8_v_i8m1 ((void *)(base + 1000 * i), vl);
+      vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100*i, vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex,vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v,vl);
+    }
+}
+
+void f16 (void *base,void *out,size_t vl, int n)
+{
+    for (int i = 0; i < n; i++){
+      vint8mf8_t v = __riscv_vle8_v_i8mf8 ((void *)(base + 1000 * i), vl);
+      vuint64m1_t bindex1 = __riscv_vle64_v_u64m1 (base + 100*i, vl);
+      vuint64m1_t bindex2 = __riscv_vle64_v_u64m1 (base + 200*i, vl);
+      vuint64m1_t bindex3 = __riscv_vle64_v_u64m1 (base + 300*i, vl);
+      vuint64m1_t bindex4 = __riscv_vle64_v_u64m1 (base + 400*i, vl);
+      vuint64m1_t bindex5 = __riscv_vle64_v_u64m1 (base + 500*i, vl);
+      vuint64m1_t bindex6 = __riscv_vle64_v_u64m1 (base + 600*i, vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex1,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex2,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex3,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex4,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex5,vl);
+      v = __riscv_vluxei64_v_i8mf8_tu(v,base,bindex6,vl);
+      __riscv_vse8_v_i8mf8 (out + 100*i,v,vl);
+    }
+}
+
+void f17 (void *base,void *out,size_t vl, int n)
+{
+    for (int i = 0; i < n; i++){
+      vint8m1_t v = __riscv_vle8_v_i8m1 ((void *)(base + 1000 * i), vl);
+      vuint64m8_t bindex1 = __riscv_vle64_v_u64m8 (base + 100*i, vl);
+      vuint64m8_t bindex2 = __riscv_vle64_v_u64m8 (base + 200*i, vl);
+      vuint64m8_t bindex3 = __riscv_vle64_v_u64m8 (base + 300*i, vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex1,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex2,vl);
+      v = __riscv_vluxei64_v_i8m1_tu(v,base,bindex3,vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v,vl);
+    }
+}
+
+void f18 (void *base,void *base2,void *out,size_t vl, int n)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100, vl);
+    for (int i = 0; i < n; i++){
+      vbool8_t m = __riscv_vlm_v_b8 (base + i, vl);
+      vuint32m4_t v = __riscv_vluxei64_v_u32m4_m(m,base,bindex,vl);
+      vuint32m4_t v2 = __riscv_vle32_v_u32m4_tu (v, base2 + i, vl);
+      vint8m1_t v3 = __riscv_vluxei32_v_i8m1_m(m,base,v2,vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v3,vl);
+    }
+}
+
+void f19 (void *base,void *base2,void *out,size_t vl, int n)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100, vl);
+    for (int i = 0; i < n; i++){
+      vbool8_t m = __riscv_vlm_v_b8 (base + i, vl);
+      vuint64m8_t v = __riscv_vluxei64_v_u64m8_m(m,base,bindex,vl);
+      vuint64m8_t v2 = __riscv_vle64_v_u64m8_tu (v, base2 + i, vl);
+      vint8m1_t v3 = __riscv_vluxei64_v_i8m1_m(m,base,v,vl);
+      vint8m1_t v4 = __riscv_vluxei64_v_i8m1_m(m,base,v2,vl);
+      __riscv_vse8_v_i8m1 (out + 100*i,v3,vl);
+      __riscv_vse8_v_i8m1 (out + 222*i,v4,vl);
+    }
+}
+void f20 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    asm volatile("#" ::
+		 : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", 
+		   "v18", "v19", "v20", "v21", "v22", "v23");
+
+    vint8m1_t v = __riscv_vluxei64_v_i8m1(base,bindex,vl);
+    asm volatile("#" ::                                                        
+		 : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", 
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",     
+		   "v18", "v19", "v20", "v21", "v22", "v23", "v25",     
+		   "v26", "v27", "v28", "v29", "v30", "v31");
+
+    __riscv_vse8_v_i8m1 (out,v,vl);
+}
+
+void f21 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    vbool8_t m = __riscv_vlm_v_b8 (base, vl);
+    asm volatile("#" ::
+		 : "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", 
+		   "v18", "v19", "v20", "v21", "v22", "v23");
+
+    vint8m1_t v = __riscv_vluxei64_v_i8m1_m(m,base,bindex,vl);
+    asm volatile("#" ::                                                        
+		 : "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", 
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",     
+		   "v18", "v19", "v20", "v21", "v22", "v23", "v25",     
+		   "v26", "v27", "v28", "v29", "v30", "v31");
+
+    __riscv_vse8_v_i8m1 (out,v,vl);
+}
+
+void f22 (void *base,void *out,size_t vl)
+{
+    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base, vl);
+    asm volatile("#" ::
+		 : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", 
+		   "v18", "v19", "v20", "v21", "v22", "v23");
+
+    vint8m1_t v = __riscv_vluxei64_v_i8m1(base,bindex,vl);
+    asm volatile("#" ::                                                        
+		 : "v0", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", 
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",     
+		   "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25",     
+		   "v26", "v27", "v28", "v29", "v30", "v31");
+    v = __riscv_vadd_vv_i8m1 (v,v,vl);
+    asm volatile("#" ::                                                        
+		 : "v0", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", 
+		   "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",     
+		   "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25",     
+		   "v26", "v27", "v28", "v29", "v30", "v31");
+
+    __riscv_vse8_v_i8m1 (out,v,vl);
+}
+
+/* { dg-final { scan-assembler-times {vmv} 1 } } */
+/* { dg-final { scan-assembler-not {csrr} } } */