From patchwork Wed Jul 12 04:21:24 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai>
X-Patchwork-Id: 72538
From: juzhe.zhong@rivai.ai
To: gcc-patches@gcc.gnu.org
Cc: richard.sandiford@arm.com, rguenther@suse.de, Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
Subject: [PATCH] VECT: Apply COND_LEN_* into vectorizable_operation
Date: Wed, 12 Jul 2023 12:21:24 +0800
Message-Id: <20230712042124.111818-1-juzhe.zhong@rivai.ai>

From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

Hi, Richard and Richi.

As we discussed before, COND_LEN_* patterns were added for multiple
situations.  This patch applies COND_LEN_* to the following situation
in "vectorizable_operation":

  /* If operating on inactive elements could generate spurious traps,
     we need to restrict the operation to active lanes.  Note that this
     specifically doesn't apply to unhoisted invariants, since they
     operate on the same value for every lane.

     Similarly, if this operation is part of a reduction, a fully-masked
     loop should only change the active lanes of the reduction chain,
     keeping the inactive lanes as-is.  */
  bool mask_out_inactive = ((!is_invariant && gimple_could_trap_p (stmt))
			    || reduc_idx >= 0);

mask_out_inactive can also be true when the loop is controlled by a
length, so with this patch we can handle the following two cases:

1. Integer division:

   #define TEST_TYPE(TYPE)					\
     __attribute__((noipa))					\
     void vrem_##TYPE (TYPE *dst, TYPE *a, TYPE *b, int n)	\
     {								\
       for (int i = 0; i < n; i++)				\
	 dst[i] = a[i] % b[i];					\
     }
   #define TEST_ALL()	\
     TEST_TYPE(int8_t)
   TEST_ALL()

   With this patch:

     _61 = .SELECT_VL (ivtmp_59, POLY_INT_CST [4, 4]);
     ivtmp_45 = _61 * 4;
     vect__4.8_48 = .LEN_MASK_LOAD (vectp_a.6_46, 32B, _61, 0, { -1, ... });
     vect__6.11_52 = .LEN_MASK_LOAD (vectp_b.9_50, 32B, _61, 0, { -1, ... });
     vect__8.12_53 = .COND_LEN_ADD ({ -1, ... }, vect__4.8_48, vect__6.11_52, vect__4.8_48, _61, 0);
     .LEN_MASK_STORE (vectp_dst.13_55, 32B, _61, 0, { -1, ... }, vect__8.12_53);

2. Floating-point arithmetic **WITHOUT** -ffast-math:

   #define TEST_TYPE(TYPE)					\
     __attribute__((noipa))					\
     void vadd_##TYPE (TYPE *dst, TYPE *a, TYPE *b, int n)	\
     {								\
       for (int i = 0; i < n; i++)				\
	 dst[i] = a[i] + b[i];					\
     }
   #define TEST_ALL()	\
     TEST_TYPE(float)
   TEST_ALL()

   With this patch:

     _61 = .SELECT_VL (ivtmp_59, POLY_INT_CST [4, 4]);
     ivtmp_45 = _61 * 4;
     vect__4.8_48 = .LEN_MASK_LOAD (vectp_a.6_46, 32B, _61, 0, { -1, ... });
     vect__6.11_52 = .LEN_MASK_LOAD (vectp_b.9_50, 32B, _61, 0, { -1, ... });
     vect__8.12_53 = .COND_LEN_ADD ({ -1, ... }, vect__4.8_48, vect__6.11_52, vect__4.8_48, _61, 0);
     .LEN_MASK_STORE (vectp_dst.13_55, 32B, _61, 0, { -1, ... }, vect__8.12_53);

With this patch, we can make sure the operation won't trap in the
inactive lanes that mask_out_inactive identifies.
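To make the .COND_LEN_* operand order above concrete, here is a minimal
scalar model in plain C (a sketch for illustration only; the function
and parameter names are made up and this is not GCC code).  It mirrors
a call .COND_LEN_* (mask, a, b, else, len, bias): lanes below
len + bias whose mask bit is set perform the operation, and every other
lane takes the fallback value, so a trapping operation such as the %
above is never evaluated in an inactive lane.

  #include <stdint.h>
  #include <stdio.h>

  /* Scalar model of LHS = .COND_LEN_MOD (MASK, A, B, ELSE, LEN, BIAS)
     over N lanes.  Inactive lanes take ELSE and never evaluate A % B,
     so a zero divisor beyond LEN cannot trap.  */
  static void
  cond_len_mod_model (int32_t *lhs, const _Bool *mask, const int32_t *a,
		      const int32_t *b, const int32_t *else_vals,
		      int len, int bias, int n)
  {
    for (int i = 0; i < n; i++)
      lhs[i] = (i < len + bias && mask[i]) ? a[i] % b[i] : else_vals[i];
  }

  int
  main (void)
  {
    int32_t a[4] = { 7, 8, 9, 10 };
    int32_t b[4] = { 2, 3, 0, 0 };	   /* Zero divisors in the tail.  */
    const _Bool mask[4] = { 1, 1, 1, 1 };  /* All-true dummy mask.  */
    int32_t lhs[4];

    /* len = 2, bias = 0: only lanes 0 and 1 are active; lanes 2 and 3
       take the else value (here: a), so b[2] and b[3] never divide.  */
    cond_len_mod_model (lhs, mask, a, b, a, 2, 0, 4);
    for (int i = 0; i < 4; i++)
      printf ("%d\n", (int) lhs[i]);	/* Prints 1 2 9 10.  */
    return 0;
  }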
gcc/ChangeLog:

	* internal-fn.cc (FOR_EACH_CODE_LEN_MAPPING): Add COND_LEN_*.
	(get_conditional_len_internal_fn): New function.
	(CASE): Add COND_LEN_*.
	* internal-fn.h (get_conditional_len_internal_fn): New function.
	* tree-vect-stmts.cc (vectorizable_operation): Apply COND_LEN_*
	to operations that could trap.

---
 gcc/internal-fn.cc     | 48 +++++++++++++++++++++++++++++++++
 gcc/internal-fn.h      |  1 +
 gcc/tree-vect-stmts.cc | 60 ++++++++++++++++++++++++++++++++++++++----
 3 files changed, 104 insertions(+), 5 deletions(-)

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index f9aaf66cf2a..e46dd57b7f0 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -4337,6 +4337,54 @@ conditional_internal_fn_code (internal_fn ifn)
     }
 }
 
+/* Invoke T(CODE, IFN) for each conditional len function IFN that maps to a
+   tree code CODE.  */
+#define FOR_EACH_CODE_LEN_MAPPING(T) \
+  T (PLUS_EXPR, IFN_COND_LEN_ADD) \
+  T (MINUS_EXPR, IFN_COND_LEN_SUB) \
+  T (MULT_EXPR, IFN_COND_LEN_MUL) \
+  T (TRUNC_DIV_EXPR, IFN_COND_LEN_DIV) \
+  T (TRUNC_MOD_EXPR, IFN_COND_LEN_MOD) \
+  T (RDIV_EXPR, IFN_COND_LEN_RDIV) \
+  T (MIN_EXPR, IFN_COND_LEN_MIN) \
+  T (MAX_EXPR, IFN_COND_LEN_MAX) \
+  T (BIT_AND_EXPR, IFN_COND_LEN_AND) \
+  T (BIT_IOR_EXPR, IFN_COND_LEN_IOR) \
+  T (BIT_XOR_EXPR, IFN_COND_LEN_XOR) \
+  T (LSHIFT_EXPR, IFN_COND_LEN_SHL) \
+  T (RSHIFT_EXPR, IFN_COND_LEN_SHR) \
+  T (NEGATE_EXPR, IFN_COND_LEN_NEG)
+
+/* Return a function that only performs CODE when a certain condition is met
+   and that uses a given fallback value otherwise.  For example, if CODE is
+   a binary operation associated with conditional function FN:
+
+     LHS = FN (COND, A, B, ELSE, LEN)
+
+   is equivalent to the C expression:
+
+     for (int i = 0; i < LEN; i++)
+       LHS[i] = COND[i] ? A[i] CODE B[i] : ELSE[i];
+
+   operating elementwise if the operands are vectors.
+
+   Return IFN_LAST if no such function exists.  */
+
+internal_fn
+get_conditional_len_internal_fn (tree_code code)
+{
+  switch (code)
+    {
+#define CASE(CODE, IFN) \
+    case CODE: \
+      return IFN;
+      FOR_EACH_CODE_LEN_MAPPING (CASE)
+#undef CASE
+    default:
+      return IFN_LAST;
+    }
+}
+
 /* Invoke T(IFN) for each internal function IFN that also has an
    IFN_COND_* form.  */
 #define FOR_EACH_COND_FN_PAIR(T) \
diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
index 4234bbfed87..dd1bab0bddf 100644
--- a/gcc/internal-fn.h
+++ b/gcc/internal-fn.h
@@ -224,6 +224,7 @@ extern bool set_edom_supported_p (void);
 
 extern internal_fn get_conditional_internal_fn (tree_code);
 extern internal_fn get_conditional_internal_fn (internal_fn);
+extern internal_fn get_conditional_len_internal_fn (tree_code);
 extern tree_code conditional_internal_fn_code (internal_fn);
 extern internal_fn get_unconditional_internal_fn (internal_fn);
 extern bool can_interpret_as_conditional_op_p (gimple *, tree *,
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 10e71178ce7..1c90b54bdb7 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6711,7 +6711,9 @@ vectorizable_operation (vec_info *vinfo,
   int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
   vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
+  vec_loop_lens *lens = (loop_vinfo ? &LOOP_VINFO_LENS (loop_vinfo) : NULL);
   internal_fn cond_fn = get_conditional_internal_fn (code);
+  internal_fn cond_len_fn = get_conditional_len_internal_fn (code);
 
   /* If operating on inactive elements could generate spurious traps,
      we need to restrict the operation to active lanes.  Note that this
@@ -6734,11 +6736,19 @@ vectorizable_operation (vec_info *vinfo,
 	  || !direct_internal_fn_supported_p (cond_fn, vectype,
 					      OPTIMIZE_FOR_SPEED))
 	{
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "can't use a fully-masked loop because no"
-			     " conditional operation is available.\n");
-	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  if (cond_len_fn != IFN_LAST
+	      && direct_internal_fn_supported_p (cond_len_fn, vectype,
+						 OPTIMIZE_FOR_SPEED))
+	    vect_record_loop_len (loop_vinfo, lens, ncopies * vec_num,
+				  vectype, 1);
+	  else
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "can't use a fully-masked loop because no"
+				 " conditional operation is available.\n");
+	      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	    }
 	}
       else
 	vect_record_loop_mask (loop_vinfo, masks, ncopies * vec_num,
@@ -6805,6 +6815,7 @@ vectorizable_operation (vec_info *vinfo,
 		     "transform binary/unary operation.\n");
 
   bool masked_loop_p = loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
+  bool len_loop_p = loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo);
 
   /* POINTER_DIFF_EXPR has pointer arguments which are vectorized as
      vectors with unsigned elements, but the result is signed.  So, we
@@ -6971,6 +6982,45 @@ vectorizable_operation (vec_info *vinfo,
 	  gimple_assign_set_lhs (new_stmt, new_temp);
 	  vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
 	}
+      else if (len_loop_p && mask_out_inactive)
+	{
+	  tree len = vect_get_loop_len (loop_vinfo, gsi, lens,
+					vec_num * ncopies, vectype, i, 1);
+	  signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+	  tree bias = build_int_cst (intQI_type_node, biasval);
+	  /* Dummy mask.  */
+	  tree mask_vectype = truth_type_for (vectype);
+	  tree mask = build_minus_one_cst (mask_vectype);
+	  auto_vec<tree> vops (6);
+	  vops.quick_push (mask);
+	  vops.quick_push (vop0);
+	  if (vop1)
+	    vops.quick_push (vop1);
+	  if (vop2)
+	    vops.quick_push (vop2);
+	  if (reduc_idx >= 0)
+	    {
+	      /* Perform the operation on active elements only and take
+		 inactive elements from the reduction chain input.  */
+	      gcc_assert (!vop2);
+	      vops.quick_push (reduc_idx == 1 ? vop1 : vop0);
+	    }
+	  else
+	    {
+	      auto else_value
+		= targetm.preferred_else_value (cond_len_fn, vectype,
+						vops.length () - 1, &vops[1]);
+	      vops.quick_push (else_value);
+	    }
+	  vops.quick_push (len);
+	  vops.quick_push (bias);
+	  gcall *call = gimple_build_call_internal_vec (cond_len_fn, vops);
+	  new_temp = make_ssa_name (vec_dest, call);
+	  gimple_call_set_lhs (call, new_temp);
+	  gimple_call_set_nothrow (call, true);
+	  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
+	  new_stmt = call;
+	}
       else if (masked_loop_p && mask_out_inactive)
 	{
 	  tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
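
As a usage sketch (not additional patch content; it assumes the
GCC-internal context and reuses the names from the diff above), the new
query is meant to be used like the vectorizable_operation hunk:

  /* Sketch: map the scalar tree code to its length-predicated IFN and
     record a loop length when the target supports it, otherwise give
     up on partial vectors -- condensed from the hunk above.  */
  internal_fn cond_len_fn = get_conditional_len_internal_fn (TRUNC_MOD_EXPR);
  if (cond_len_fn != IFN_LAST	/* IFN_COND_LEN_MOD via the new mapping.  */
      && direct_internal_fn_supported_p (cond_len_fn, vectype,
					 OPTIMIZE_FOR_SPEED))
    vect_record_loop_len (loop_vinfo, lens, ncopies * vec_num, vectype, 1);
  else
    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;

Tree codes without a COND_LEN_* form (anything outside
FOR_EACH_CODE_LEN_MAPPING) simply return IFN_LAST and take the
fallback path.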