[v2] Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases.
Message ID | 20221129100446.3875697-1-manolis.tsamis@vrull.eu |
---|---|
State | New |
Headers |
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 509913858039 for <patchwork@sourceware.org>; Tue, 29 Nov 2022 10:06:04 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from mail-lf1-x12d.google.com (mail-lf1-x12d.google.com [IPv6:2a00:1450:4864:20::12d]) by sourceware.org (Postfix) with ESMTPS id 32C133858D1E for <gcc-patches@gcc.gnu.org>; Tue, 29 Nov 2022 10:05:45 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 32C133858D1E Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=vrull.eu Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=vrull.eu Received: by mail-lf1-x12d.google.com with SMTP id u27so10669468lfc.9 for <gcc-patches@gcc.gnu.org>; Tue, 29 Nov 2022 02:05:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vrull.eu; s=google; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=+0UXXzEKT7dDqVeNJ1MZiIf66Z9BKaRCcPTFoYbTrfo=; b=bYwIepM+f78q91SajRw8EuQMFXG0gmIurtWDvMz6jhmHwkICR/J6pybu/sqgbMSKXx kEUI2ZnP3zcbtUABBd9a9JD7YMefxdtX42/rcIGyUgC2wkGewObbglVVUTi56oHMgQHb oSiUtlYT2sUTbmbhvx7r+6KIhChzAckEXBOQ0pctmwo+AvSicUbnBVSCJyYirBq0InSN sd9dZFDzgwKDsRPkx845uNu7OV2ViuhoeZvF8YGt3xz+aXGr6yJ1u8Pmk2iBDdhE7Jpq amHGDDHVQVcuoro8NXfyCCuII//Lx1IIGqUZp/AuO1d7ULeKZNl+Y01ABucDFUB/GugA JWMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=+0UXXzEKT7dDqVeNJ1MZiIf66Z9BKaRCcPTFoYbTrfo=; b=mPDW0hv4eqDC6EueZMPS3yEZy+6zFOe2AogJ9ezTudO9NqLq/BPfc7YVF0M11xotZd lxeWr5sexiZRP0G/RR37gNTbgvlX70t3boNy06a3JIE9MPY9D1CbPeMVRzfdGnDZk6eV V1R9lduiWHbCs4m0xyH1noI0MHZakEkgEYPA93ZjccYWOIhQsM4YqQEnxIqr71W8ucej W8+5liYr34l/q2/JxJwOye9VPZ8THsX1/tChafJUaxQAlYxh/N7kXNJf1/u7n4GlpiGG MGa6ueDe1Bye9mewEQ1UymIlyZ3+uja+jfni9FTCN8mZZxMGilfeVFjL0sZPzqa26q9n 26Ag== X-Gm-Message-State: ANoB5pkaAvpAv7MTBfqXvk3k7jYpknD2mLVtUv8XAsZ3wwNJ9I7BqCSh 6jg6tgMIZLM5Gs6gpPVAKayduovL5vAuRDXc X-Google-Smtp-Source: AA0mqf7+4pk97TdYm6Q7Vh1j/wAkKSPKHBLx6n7f1ExNQaVruJLe+zbsFZ2mDTMWyp66DgXTEL/yyA== X-Received: by 2002:a19:8c50:0:b0:4b1:cbeb:c578 with SMTP id i16-20020a198c50000000b004b1cbebc578mr19333920lfj.86.1669716342753; Tue, 29 Nov 2022 02:05:42 -0800 (PST) Received: from helsinki-03.engr ([2a01:4f9:6b:2a47::2]) by smtp.gmail.com with ESMTPSA id m1-20020a056512114100b004972b0bb426sm2162829lfg.257.2022.11.29.02.05.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 29 Nov 2022 02:05:41 -0800 (PST) From: Manolis Tsamis <manolis.tsamis@vrull.eu> To: gcc-patches@gcc.gnu.org Cc: Philipp Tomsich <philipp.tomsich@vrull.eu>, Tamar Christina <Tamar.Christina@arm.com>, Richard Biener <richard.guenther@gmail.com>, jiangning.liu@amperecomputing.com, Christoph Muellner <christoph.muellner@vrull.eu>, Manolis Tsamis <manolis.tsamis@vrull.eu> Subject: [PATCH v2] Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases. Date: Tue, 29 Nov 2022 11:04:46 +0100 Message-Id: <20221129100446.3875697-1-manolis.tsamis@vrull.eu> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-11.4 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, JMQ_SPF_NEUTRAL, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> |
Series |
[v2] Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases.
|
|
Commit Message
Manolis Tsamis
Nov. 29, 2022, 10:04 a.m. UTC
When using SWAR (SIMD in a register) techniques a comparison operation within
such a register can be made by using a combination of shifts, bitwise and and
multiplication. If code using this scheme is vectorized then there is potential
to replace all these operations with a single vector comparison, by reinterpreting
the vector types to match the width of the SWAR register.
For example, for the test function packed_cmp_16_32, the original generated code is:
ldr q0, [x0]
add w1, w1, 1
ushr v0.4s, v0.4s, 15
and v0.16b, v0.16b, v2.16b
shl v1.4s, v0.4s, 16
sub v0.4s, v1.4s, v0.4s
str q0, [x0], 16
cmp w2, w1
bhi .L20
with this pattern the above can be optimized to:
ldr q0, [x0]
add w1, w1, 1
cmlt v0.8h, v0.8h, #0
str q0, [x0], 16
cmp w2, w1
bhi .L20
The effect is similar for x86-64.
Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
gcc/ChangeLog:
* match.pd: Simplify vector shift + bit_and + multiply in some cases.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/swar_to_vec_cmp.c: New test.
---
Changes in v2:
- Changed pattern to use vec_cond_expr.
- Changed pattern to work with VLA vector.
- Added more checks and comments.
gcc/match.pd | 60 ++++++++++++++++
.../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++
2 files changed, 132 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
Comments
On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote: > > When using SWAR (SIMD in a register) techniques a comparison operation within > such a register can be made by using a combination of shifts, bitwise and and > multiplication. If code using this scheme is vectorized then there is potential > to replace all these operations with a single vector comparison, by reinterpreting > the vector types to match the width of the SWAR register. > > For example, for the test function packed_cmp_16_32, the original generated code is: > > ldr q0, [x0] > add w1, w1, 1 > ushr v0.4s, v0.4s, 15 > and v0.16b, v0.16b, v2.16b > shl v1.4s, v0.4s, 16 > sub v0.4s, v1.4s, v0.4s > str q0, [x0], 16 > cmp w2, w1 > bhi .L20 > > with this pattern the above can be optimized to: > > ldr q0, [x0] > add w1, w1, 1 > cmlt v0.8h, v0.8h, #0 > str q0, [x0], 16 > cmp w2, w1 > bhi .L20 > > The effect is similar for x86-64. > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > gcc/ChangeLog: > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > --- > > Changes in v2: > - Changed pattern to use vec_cond_expr. > - Changed pattern to work with VLA vector. > - Added more checks and comments. > > gcc/match.pd | 60 ++++++++++++++++ > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++ > 2 files changed, 132 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > diff --git a/gcc/match.pd b/gcc/match.pd > index 67a0a682f31..05e7fc79ba8 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (view_convert (bit_and:itype (view_convert @0) > (ne @1 { build_zero_cst (type); }))))))) > > +/* In SWAR (SIMD in a register) code a signed comparison of packed data can > + be constructed with a particular combination of shift, bitwise and, > + and multiplication by constants. If that code is vectorized we can > + convert this pattern into a more efficient vector comparison. */ > +(simplify > + (mult (bit_and (rshift @0 uniform_integer_cst_p@1) > + uniform_integer_cst_p@2) > + uniform_integer_cst_p@3) Please use VECTOR_CST in the match instead of uniform_integer_cst_p and instead ... > + (with { > + tree rshift_cst = uniform_integer_cst_p (@1); > + tree bit_and_cst = uniform_integer_cst_p (@2); > + tree mult_cst = uniform_integer_cst_p (@3); > + } > + /* Make sure we're working with vectors and uniform vector constants. */ > + (if (VECTOR_TYPE_P (type) ... test for non-NULL *_cst here where you can use uniform_vector_p instead of uniform_integer_cst_p. You can elide the VECTOR_TYPE_P check then and instead do INTEGRAL_TYPE_P (TREE_TYPE (type)). > + && tree_fits_uhwi_p (rshift_cst) > + && tree_fits_uhwi_p (mult_cst) > + && tree_fits_uhwi_p (bit_and_cst)) > + /* Compute what constants would be needed for this to represent a packed > + comparison based on the shift amount denoted by RSHIFT_CST. */ > + (with { > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); > + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); > + poly_int64 vec_bits = vec_elem_bits * vec_nelts; > + > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > + > + mult_i = tree_to_uhwi (mult_cst); > + bit_and_i = tree_to_uhwi (bit_and_cst); > + target_bit_and_i = 0; > + > + /* The bit pattern in BIT_AND_I should be a mask for the least > + significant bit of each packed element that is CMP_BITS wide. */ > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > + } > + (if ((exact_log2 (cmp_bits_i)) >= 0 > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > + && multiple_p (vec_bits, cmp_bits_i) > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > + && target_mult_i == mult_i > + && target_bit_and_i == bit_and_i) > + /* Compute the vector shape for the comparison and check if the target is > + able to expand the comparison with that type. */ > + (with { > + /* We're doing a signed comparison. */ > + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0); > + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); > + tree vector_cmp_type = build_vector_type (cmp_type, vector_type_nelts); > + tree zeros = build_zero_cst (vector_cmp_type); > + tree ones = build_all_ones_cst (vector_cmp_type); > + } > + (if (expand_vec_cmp_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR)) > + (view_convert:type (vec_cond (lt (view_convert:vector_cmp_type @0) > + { zeros; }) > + { ones; } { zeros; }))))))))) You are testing whether we can expand the comparison but not whether we can expand the vector condition here? The set of combinations supported are tricky, I think your test is conservatively OK but it might fail to match AVX512 and will fail targets with masks (SVE, GCN?). I think adding || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR) would fix that (but there will be extra cost in producing the final result vector from the comparison mask). I do wonder if, as this is targeted at vectorization, this shouldn't be a vectorizer pattern instead of a post-processing transform? That way we would get the costing in the vectorizer correct and not rely on integer vector shift or multiplication and also get the cost of producing the result vector correct? I can't fully decipher the trick though, but assume that vector_cmp_type always has smaller elements than 'type'. Thanks, Richard. > + > (for cmp (gt ge lt le) > outp (convert convert negate negate) > outn (negate negate convert convert) > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > new file mode 100644 > index 00000000000..26f9ad9ef28 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > @@ -0,0 +1,72 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize" } */ > + > +typedef unsigned char uint8_t; > +typedef unsigned short uint16_t; > +typedef unsigned int uint32_t; > + > +/* 8-bit SWAR tests. */ > + > +static uint8_t packed_cmp_8_8(uint8_t a) > +{ > + return ((a >> 7) & 0x1U) * 0xffU; > +} > + > +/* 16-bit SWAR tests. */ > + > +static uint16_t packed_cmp_8_16(uint16_t a) > +{ > + return ((a >> 7) & 0x101U) * 0xffU; > +} > + > +static uint16_t packed_cmp_16_16(uint16_t a) > +{ > + return ((a >> 15) & 0x1U) * 0xffffU; > +} > + > +/* 32-bit SWAR tests. */ > + > +static uint32_t packed_cmp_8_32(uint32_t a) > +{ > + return ((a >> 7) & 0x1010101U) * 0xffU; > +} > + > +static uint32_t packed_cmp_16_32(uint32_t a) > +{ > + return ((a >> 15) & 0x10001U) * 0xffffU; > +} > + > +static uint32_t packed_cmp_32_32(uint32_t a) > +{ > + return ((a >> 31) & 0x1U) * 0xffffffffU; > +} > + > +/* Driver function to test the vectorized code generated for the different > + packed_cmp variants. */ > + > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > + void vectorized_cmp_##FUNC(T* a, int n) \ > + { \ > + n = (n / 32) * 32; \ > + for(int i = 0; i < n; i += 4) \ > + { \ > + a[i + 0] = FUNC(a[i + 0]); \ > + a[i + 1] = FUNC(a[i + 1]); \ > + a[i + 2] = FUNC(a[i + 2]); \ > + a[i + 3] = FUNC(a[i + 3]); \ > + } \ > + } > + > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > + > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > + > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > + > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > -- > 2.34.1 >
On Wed, Nov 30, 2022 at 9:44 AM Richard Biener <richard.guenther@gmail.com> wrote: > > On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote: > > > > When using SWAR (SIMD in a register) techniques a comparison operation within > > such a register can be made by using a combination of shifts, bitwise and and > > multiplication. If code using this scheme is vectorized then there is potential > > to replace all these operations with a single vector comparison, by reinterpreting > > the vector types to match the width of the SWAR register. > > > > For example, for the test function packed_cmp_16_32, the original generated code is: > > > > ldr q0, [x0] > > add w1, w1, 1 > > ushr v0.4s, v0.4s, 15 > > and v0.16b, v0.16b, v2.16b > > shl v1.4s, v0.4s, 16 > > sub v0.4s, v1.4s, v0.4s > > str q0, [x0], 16 > > cmp w2, w1 > > bhi .L20 > > > > with this pattern the above can be optimized to: > > > > ldr q0, [x0] > > add w1, w1, 1 > > cmlt v0.8h, v0.8h, #0 > > str q0, [x0], 16 > > cmp w2, w1 > > bhi .L20 > > > > The effect is similar for x86-64. > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > > gcc/ChangeLog: > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > --- > > > > Changes in v2: > > - Changed pattern to use vec_cond_expr. > > - Changed pattern to work with VLA vector. > > - Added more checks and comments. > > > > gcc/match.pd | 60 ++++++++++++++++ > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++ > > 2 files changed, 132 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > diff --git a/gcc/match.pd b/gcc/match.pd > > index 67a0a682f31..05e7fc79ba8 100644 > > --- a/gcc/match.pd > > +++ b/gcc/match.pd > > @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > (view_convert (bit_and:itype (view_convert @0) > > (ne @1 { build_zero_cst (type); }))))))) > > > > +/* In SWAR (SIMD in a register) code a signed comparison of packed data can > > + be constructed with a particular combination of shift, bitwise and, > > + and multiplication by constants. If that code is vectorized we can > > + convert this pattern into a more efficient vector comparison. */ > > +(simplify > > + (mult (bit_and (rshift @0 uniform_integer_cst_p@1) > > + uniform_integer_cst_p@2) > > + uniform_integer_cst_p@3) > > Please use VECTOR_CST in the match instead of uniform_integer_cst_p > and instead ... > Will do. > > + (with { > > + tree rshift_cst = uniform_integer_cst_p (@1); > > + tree bit_and_cst = uniform_integer_cst_p (@2); > > + tree mult_cst = uniform_integer_cst_p (@3); > > + } > > + /* Make sure we're working with vectors and uniform vector constants. */ > > + (if (VECTOR_TYPE_P (type) > > ... test for non-NULL *_cst here where you can use uniform_vector_p instead > of uniform_integer_cst_p. You can elide the VECTOR_TYPE_P check then > and instead do INTEGRAL_TYPE_P (TREE_TYPE (type)). > Will do. > > + && tree_fits_uhwi_p (rshift_cst) > > + && tree_fits_uhwi_p (mult_cst) > > + && tree_fits_uhwi_p (bit_and_cst)) > > + /* Compute what constants would be needed for this to represent a packed > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > + (with { > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); > > + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); > > + poly_int64 vec_bits = vec_elem_bits * vec_nelts; > > + > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > + > > + mult_i = tree_to_uhwi (mult_cst); > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > + target_bit_and_i = 0; > > + > > + /* The bit pattern in BIT_AND_I should be a mask for the least > > + significant bit of each packed element that is CMP_BITS wide. */ > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > + } > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > + && multiple_p (vec_bits, cmp_bits_i) > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > + && target_mult_i == mult_i > > + && target_bit_and_i == bit_and_i) > > + /* Compute the vector shape for the comparison and check if the target is > > + able to expand the comparison with that type. */ > > + (with { > > + /* We're doing a signed comparison. */ > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0); > > + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); > > + tree vector_cmp_type = build_vector_type (cmp_type, vector_type_nelts); > > + tree zeros = build_zero_cst (vector_cmp_type); > > + tree ones = build_all_ones_cst (vector_cmp_type); > > + } > > + (if (expand_vec_cmp_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR)) > > + (view_convert:type (vec_cond (lt (view_convert:vector_cmp_type @0) > > + { zeros; }) > > + { ones; } { zeros; }))))))))) > > You are testing whether we can expand the comparison but not whether we can > expand the vector condition here? The set of combinations supported are > tricky, I think your test is conservatively OK but it might fail to match AVX512 > and will fail targets with masks (SVE, GCN?). I think adding > > || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR) > > would fix that (but there will be extra cost in producing the final > result vector > from the comparison mask). > I didn't understand that these needed separate checks, thanks for pointing it out. Will do. > I do wonder if, as this is targeted at vectorization, this shouldn't > be a vectorizer pattern instead of a post-processing transform? That My initial implementation for this was actually targeting the vectorization patterns but it didn't fit with it very well. One issue that was raised is that these patterns need the internal vec calls and adding one internal function for this optimization would be hardly justifiable and possibly ugly. The way I see it this is a peephole optimization and not something to make a pattern for. After that match.pd was looking like a better place to achieve the optimization in a clean way and not pollute the vectorizer. My understanding of the vectorization machinery is limited so please correct me if I'm wrong. > way we would > get the costing in the vectorizer correct and not rely on integer > vector shift or > multiplication and also get the cost of producing the result vector correct? I > can't fully decipher the trick though, but assume that vector_cmp_type > always has > smaller elements than 'type'. > That is correct, all this applies when vector_cmp_type has smaller elements than type. Otherwise if vector_cmp_type == type there is no SWAR and GCC optimizes it nicely already). But I haven't added any special case for that as the pattern is general and works nicely in both cases. The trick is that in scalar code where simd-within-a-register techniques are used a signed comparison can be created by 1) right shifting the sign bits to the lsb 2) masking off everything else and 3) multiply by an appropriate 0xffff... constant to create a mask. GCC computes this as 4 instructions since the multiply is usually replaced by shift and subtract. But if this sequence gets vectorized then all this sequence can be replaced by a single vector compare-less, since the original code was emulating what a vector comapre-less does anyway. Thanks! Manolis > Thanks, > Richard. > > > + > > (for cmp (gt ge lt le) > > outp (convert convert negate negate) > > outn (negate negate convert convert) > > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > new file mode 100644 > > index 00000000000..26f9ad9ef28 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > @@ -0,0 +1,72 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > + > > +typedef unsigned char uint8_t; > > +typedef unsigned short uint16_t; > > +typedef unsigned int uint32_t; > > + > > +/* 8-bit SWAR tests. */ > > + > > +static uint8_t packed_cmp_8_8(uint8_t a) > > +{ > > + return ((a >> 7) & 0x1U) * 0xffU; > > +} > > + > > +/* 16-bit SWAR tests. */ > > + > > +static uint16_t packed_cmp_8_16(uint16_t a) > > +{ > > + return ((a >> 7) & 0x101U) * 0xffU; > > +} > > + > > +static uint16_t packed_cmp_16_16(uint16_t a) > > +{ > > + return ((a >> 15) & 0x1U) * 0xffffU; > > +} > > + > > +/* 32-bit SWAR tests. */ > > + > > +static uint32_t packed_cmp_8_32(uint32_t a) > > +{ > > + return ((a >> 7) & 0x1010101U) * 0xffU; > > +} > > + > > +static uint32_t packed_cmp_16_32(uint32_t a) > > +{ > > + return ((a >> 15) & 0x10001U) * 0xffffU; > > +} > > + > > +static uint32_t packed_cmp_32_32(uint32_t a) > > +{ > > + return ((a >> 31) & 0x1U) * 0xffffffffU; > > +} > > + > > +/* Driver function to test the vectorized code generated for the different > > + packed_cmp variants. */ > > + > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > + { \ > > + n = (n / 32) * 32; \ > > + for(int i = 0; i < n; i += 4) \ > > + { \ > > + a[i + 0] = FUNC(a[i + 0]); \ > > + a[i + 1] = FUNC(a[i + 1]); \ > > + a[i + 2] = FUNC(a[i + 2]); \ > > + a[i + 3] = FUNC(a[i + 3]); \ > > + } \ > > + } > > + > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > + > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > + > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > + > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > -- > > 2.34.1 > >
On Wed, Nov 30, 2022 at 9:59 AM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote: > > On Wed, Nov 30, 2022 at 9:44 AM Richard Biener > <richard.guenther@gmail.com> wrote: > > > > On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote: > > > > > > When using SWAR (SIMD in a register) techniques a comparison operation within > > > such a register can be made by using a combination of shifts, bitwise and and > > > multiplication. If code using this scheme is vectorized then there is potential > > > to replace all these operations with a single vector comparison, by reinterpreting > > > the vector types to match the width of the SWAR register. > > > > > > For example, for the test function packed_cmp_16_32, the original generated code is: > > > > > > ldr q0, [x0] > > > add w1, w1, 1 > > > ushr v0.4s, v0.4s, 15 > > > and v0.16b, v0.16b, v2.16b > > > shl v1.4s, v0.4s, 16 > > > sub v0.4s, v1.4s, v0.4s > > > str q0, [x0], 16 > > > cmp w2, w1 > > > bhi .L20 > > > > > > with this pattern the above can be optimized to: > > > > > > ldr q0, [x0] > > > add w1, w1, 1 > > > cmlt v0.8h, v0.8h, #0 > > > str q0, [x0], 16 > > > cmp w2, w1 > > > bhi .L20 > > > > > > The effect is similar for x86-64. > > > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > > > > gcc/ChangeLog: > > > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > > > --- > > > > > > Changes in v2: > > > - Changed pattern to use vec_cond_expr. > > > - Changed pattern to work with VLA vector. > > > - Added more checks and comments. > > > > > > gcc/match.pd | 60 ++++++++++++++++ > > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++ > > > 2 files changed, 132 insertions(+) > > > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd > > > index 67a0a682f31..05e7fc79ba8 100644 > > > --- a/gcc/match.pd > > > +++ b/gcc/match.pd > > > @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > (view_convert (bit_and:itype (view_convert @0) > > > (ne @1 { build_zero_cst (type); }))))))) > > > > > > +/* In SWAR (SIMD in a register) code a signed comparison of packed data can > > > + be constructed with a particular combination of shift, bitwise and, > > > + and multiplication by constants. If that code is vectorized we can > > > + convert this pattern into a more efficient vector comparison. */ > > > +(simplify > > > + (mult (bit_and (rshift @0 uniform_integer_cst_p@1) > > > + uniform_integer_cst_p@2) > > > + uniform_integer_cst_p@3) > > > > Please use VECTOR_CST in the match instead of uniform_integer_cst_p > > and instead ... > > > > Will do. > > > > + (with { > > > + tree rshift_cst = uniform_integer_cst_p (@1); > > > + tree bit_and_cst = uniform_integer_cst_p (@2); > > > + tree mult_cst = uniform_integer_cst_p (@3); > > > + } > > > + /* Make sure we're working with vectors and uniform vector constants. */ > > > + (if (VECTOR_TYPE_P (type) > > > > ... test for non-NULL *_cst here where you can use uniform_vector_p instead > > of uniform_integer_cst_p. You can elide the VECTOR_TYPE_P check then > > and instead do INTEGRAL_TYPE_P (TREE_TYPE (type)). > > > > Will do. > > > > + && tree_fits_uhwi_p (rshift_cst) > > > + && tree_fits_uhwi_p (mult_cst) > > > + && tree_fits_uhwi_p (bit_and_cst)) > > > + /* Compute what constants would be needed for this to represent a packed > > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > > + (with { > > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); > > > + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); > > > + poly_int64 vec_bits = vec_elem_bits * vec_nelts; > > > + > > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > > + > > > + mult_i = tree_to_uhwi (mult_cst); > > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > > + target_bit_and_i = 0; > > > + > > > + /* The bit pattern in BIT_AND_I should be a mask for the least > > > + significant bit of each packed element that is CMP_BITS wide. */ > > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > > + } > > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > > + && multiple_p (vec_bits, cmp_bits_i) > > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > > + && target_mult_i == mult_i > > > + && target_bit_and_i == bit_and_i) > > > + /* Compute the vector shape for the comparison and check if the target is > > > + able to expand the comparison with that type. */ > > > + (with { > > > + /* We're doing a signed comparison. */ > > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0); > > > + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); > > > + tree vector_cmp_type = build_vector_type (cmp_type, vector_type_nelts); > > > + tree zeros = build_zero_cst (vector_cmp_type); > > > + tree ones = build_all_ones_cst (vector_cmp_type); > > > + } > > > + (if (expand_vec_cmp_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR)) > > > + (view_convert:type (vec_cond (lt (view_convert:vector_cmp_type @0) > > > + { zeros; }) > > > + { ones; } { zeros; }))))))))) > > > > You are testing whether we can expand the comparison but not whether we can > > expand the vector condition here? The set of combinations supported are > > tricky, I think your test is conservatively OK but it might fail to match AVX512 > > and will fail targets with masks (SVE, GCN?). I think adding > > > > || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR) > > > > would fix that (but there will be extra cost in producing the final > > result vector > > from the comparison mask). > > > > I didn't understand that these needed separate checks, thanks for > pointing it out. > Will do. > > > I do wonder if, as this is targeted at vectorization, this shouldn't > > be a vectorizer pattern instead of a post-processing transform? That > > My initial implementation for this was actually targeting the vectorization > patterns but it didn't fit with it very well. One issue that was > raised is that these > patterns need the internal vec calls and adding one internal function for this > optimization would be hardly justifiable and possibly ugly. The way I see it > this is a peephole optimization and not something to make a pattern for. > After that match.pd was looking like a better place to achieve the optimization > in a clean way and not pollute the vectorizer. > My understanding of the vectorization machinery is limited so please correct > me if I'm wrong. > > > way we would > > get the costing in the vectorizer correct and not rely on integer > > vector shift or > > multiplication and also get the cost of producing the result vector correct? I > > can't fully decipher the trick though, but assume that vector_cmp_type > > always has > > smaller elements than 'type'. > > > > That is correct, all this applies when vector_cmp_type has smaller elements > than type. Otherwise if vector_cmp_type == type there is no SWAR and GCC > optimizes it nicely already). But I haven't added any special case for that as > the pattern is general and works nicely in both cases. So a vectorizer pattern would be to replace the multiplication with (with smaller type T and original type U) (T)@0 < 0 ? (U)-1 :(U)0 ? That is, this should also work on the scalar side? (if it doesn't then it indeed gets more tricky) > The trick is that in scalar code where simd-within-a-register > techniques are used > a signed comparison can be created by 1) right shifting the sign bits to the lsb > 2) masking off everything else and 3) multiply by an appropriate > 0xffff... constant > to create a mask. GCC computes this as 4 instructions since the > multiply is usually > replaced by shift and subtract. But if this sequence gets vectorized > then all this > sequence can be replaced by a single vector compare-less, since the > original code > was emulating what a vector comapre-less does anyway. > > Thanks! > Manolis > > > Thanks, > > Richard. > > > > > + > > > (for cmp (gt ge lt le) > > > outp (convert convert negate negate) > > > outn (negate negate convert convert) > > > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > new file mode 100644 > > > index 00000000000..26f9ad9ef28 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > @@ -0,0 +1,72 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > > + > > > +typedef unsigned char uint8_t; > > > +typedef unsigned short uint16_t; > > > +typedef unsigned int uint32_t; > > > + > > > +/* 8-bit SWAR tests. */ > > > + > > > +static uint8_t packed_cmp_8_8(uint8_t a) > > > +{ > > > + return ((a >> 7) & 0x1U) * 0xffU; > > > +} > > > + > > > +/* 16-bit SWAR tests. */ > > > + > > > +static uint16_t packed_cmp_8_16(uint16_t a) > > > +{ > > > + return ((a >> 7) & 0x101U) * 0xffU; > > > +} > > > + > > > +static uint16_t packed_cmp_16_16(uint16_t a) > > > +{ > > > + return ((a >> 15) & 0x1U) * 0xffffU; > > > +} > > > + > > > +/* 32-bit SWAR tests. */ > > > + > > > +static uint32_t packed_cmp_8_32(uint32_t a) > > > +{ > > > + return ((a >> 7) & 0x1010101U) * 0xffU; > > > +} > > > + > > > +static uint32_t packed_cmp_16_32(uint32_t a) > > > +{ > > > + return ((a >> 15) & 0x10001U) * 0xffffU; > > > +} > > > + > > > +static uint32_t packed_cmp_32_32(uint32_t a) > > > +{ > > > + return ((a >> 31) & 0x1U) * 0xffffffffU; > > > +} > > > + > > > +/* Driver function to test the vectorized code generated for the different > > > + packed_cmp variants. */ > > > + > > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > > + { \ > > > + n = (n / 32) * 32; \ > > > + for(int i = 0; i < n; i += 4) \ > > > + { \ > > > + a[i + 0] = FUNC(a[i + 0]); \ > > > + a[i + 1] = FUNC(a[i + 1]); \ > > > + a[i + 2] = FUNC(a[i + 2]); \ > > > + a[i + 3] = FUNC(a[i + 3]); \ > > > + } \ > > > + } > > > + > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > > + > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > > + > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > > + > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > > -- > > > 2.34.1 > > >
> -----Original Message----- > From: Richard Biener <richard.guenther@gmail.com> > Sent: Wednesday, November 30, 2022 1:19 PM > To: Manolis Tsamis <manolis.tsamis@vrull.eu> > Cc: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.tomsich@vrull.eu>; > Tamar Christina <Tamar.Christina@arm.com>; > jiangning.liu@amperecomputing.com; Christoph Muellner > <christoph.muellner@vrull.eu> > Subject: Re: [PATCH v2] Add pattern to convert vector shift + bitwise and + > multiply to vector compare in some cases. > > On Wed, Nov 30, 2022 at 9:59 AM Manolis Tsamis <manolis.tsamis@vrull.eu> > wrote: > > > > On Wed, Nov 30, 2022 at 9:44 AM Richard Biener > > <richard.guenther@gmail.com> wrote: > > > > > > On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis > <manolis.tsamis@vrull.eu> wrote: > > > > > > > > When using SWAR (SIMD in a register) techniques a comparison > > > > operation within such a register can be made by using a > > > > combination of shifts, bitwise and and multiplication. If code > > > > using this scheme is vectorized then there is potential to replace > > > > all these operations with a single vector comparison, by reinterpreting > the vector types to match the width of the SWAR register. > > > > > > > > For example, for the test function packed_cmp_16_32, the original > generated code is: > > > > > > > > ldr q0, [x0] > > > > add w1, w1, 1 > > > > ushr v0.4s, v0.4s, 15 > > > > and v0.16b, v0.16b, v2.16b > > > > shl v1.4s, v0.4s, 16 > > > > sub v0.4s, v1.4s, v0.4s > > > > str q0, [x0], 16 > > > > cmp w2, w1 > > > > bhi .L20 > > > > > > > > with this pattern the above can be optimized to: > > > > > > > > ldr q0, [x0] > > > > add w1, w1, 1 > > > > cmlt v0.8h, v0.8h, #0 > > > > str q0, [x0], 16 > > > > cmp w2, w1 > > > > bhi .L20 > > > > > > > > The effect is similar for x86-64. > > > > > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > > > > > > gcc/ChangeLog: > > > > > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > > > > > --- > > > > > > > > Changes in v2: > > > > - Changed pattern to use vec_cond_expr. > > > > - Changed pattern to work with VLA vector. > > > > - Added more checks and comments. > > > > > > > > gcc/match.pd | 60 ++++++++++++++++ > > > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 > +++++++++++++++++++ > > > > 2 files changed, 132 insertions(+) create mode 100644 > > > > gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > 67a0a682f31..05e7fc79ba8 100644 > > > > --- a/gcc/match.pd > > > > +++ b/gcc/match.pd > > > > @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > (view_convert (bit_and:itype (view_convert @0) > > > > (ne @1 { build_zero_cst (type); > > > > }))))))) > > > > > > > > +/* In SWAR (SIMD in a register) code a signed comparison of packed > data can > > > > + be constructed with a particular combination of shift, bitwise and, > > > > + and multiplication by constants. If that code is vectorized we can > > > > + convert this pattern into a more efficient vector comparison. > > > > +*/ (simplify (mult (bit_and (rshift @0 uniform_integer_cst_p@1) > > > > + uniform_integer_cst_p@2) > > > > + uniform_integer_cst_p@3) > > > > > > Please use VECTOR_CST in the match instead of uniform_integer_cst_p > > > and instead ... > > > > > > > Will do. > > > > > > + (with { > > > > + tree rshift_cst = uniform_integer_cst_p (@1); > > > > + tree bit_and_cst = uniform_integer_cst_p (@2); > > > > + tree mult_cst = uniform_integer_cst_p (@3); } > > > > + /* Make sure we're working with vectors and uniform vector > > > > + constants. */ (if (VECTOR_TYPE_P (type) > > > > > > ... test for non-NULL *_cst here where you can use uniform_vector_p > > > instead of uniform_integer_cst_p. You can elide the VECTOR_TYPE_P > > > check then and instead do INTEGRAL_TYPE_P (TREE_TYPE (type)). > > > > > > > Will do. > > > > > > + && tree_fits_uhwi_p (rshift_cst) > > > > + && tree_fits_uhwi_p (mult_cst) > > > > + && tree_fits_uhwi_p (bit_and_cst)) > > > > + /* Compute what constants would be needed for this to represent a > packed > > > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > > > + (with { > > > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); > > > > + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); > > > > + poly_int64 vec_bits = vec_elem_bits * vec_nelts; > > > > + > > > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > > > + > > > > + mult_i = tree_to_uhwi (mult_cst); > > > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > > > + target_bit_and_i = 0; > > > > + > > > > + /* The bit pattern in BIT_AND_I should be a mask for the least > > > > + significant bit of each packed element that is CMP_BITS wide. */ > > > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > > > + } > > > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > > > + && multiple_p (vec_bits, cmp_bits_i) > > > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > > > + && target_mult_i == mult_i > > > > + && target_bit_and_i == bit_and_i) > > > > + /* Compute the vector shape for the comparison and check if the > target is > > > > + able to expand the comparison with that type. */ > > > > + (with { > > > > + /* We're doing a signed comparison. */ > > > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0); > > > > + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); > > > > + tree vector_cmp_type = build_vector_type (cmp_type, > vector_type_nelts); > > > > + tree zeros = build_zero_cst (vector_cmp_type); > > > > + tree ones = build_all_ones_cst (vector_cmp_type); > > > > + } > > > > + (if (expand_vec_cmp_expr_p (vector_cmp_type, > vector_cmp_type, LT_EXPR)) > > > > + (view_convert:type (vec_cond (lt > (view_convert:vector_cmp_type @0) > > > > + { zeros; }) > > > > + { ones; } { zeros; }))))))))) > > > > > > You are testing whether we can expand the comparison but not whether > > > we can expand the vector condition here? The set of combinations > > > supported are tricky, I think your test is conservatively OK but it > > > might fail to match AVX512 and will fail targets with masks (SVE, > > > GCN?). I think adding > > > > > > || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, > > > LT_EXPR) > > > > > > would fix that (but there will be extra cost in producing the final > > > result vector from the comparison mask). > > > > > > > I didn't understand that these needed separate checks, thanks for > > pointing it out. > > Will do. > > > > > I do wonder if, as this is targeted at vectorization, this shouldn't > > > be a vectorizer pattern instead of a post-processing transform? > > > That > > > > My initial implementation for this was actually targeting the > > vectorization patterns but it didn't fit with it very well. One issue > > that was raised is that these patterns need the internal vec calls and > > adding one internal function for this optimization would be hardly > > justifiable and possibly ugly. The way I see it this is a peephole > > optimization and not something to make a pattern for. > > After that match.pd was looking like a better place to achieve the > > optimization in a clean way and not pollute the vectorizer. > > My understanding of the vectorization machinery is limited so please > > correct me if I'm wrong. > > > > > way we would > > > get the costing in the vectorizer correct and not rely on integer > > > vector shift or multiplication and also get the cost of producing > > > the result vector correct? I can't fully decipher the trick though, > > > but assume that vector_cmp_type always has smaller elements than > > > 'type'. > > > > > > > That is correct, all this applies when vector_cmp_type has smaller > > elements than type. Otherwise if vector_cmp_type == type there is no > > SWAR and GCC optimizes it nicely already). But I haven't added any > > special case for that as the pattern is general and works nicely in both > cases. > > So a vectorizer pattern would be to replace the multiplication with (with > smaller type T and original type U) > > (T)@0 < 0 ? (U)-1 :(U)0 > > ? That is, this should also work on the scalar side? (if it doesn't then it indeed > gets more tricky) Hmm don't think it does, as the narrowing on the scalar size will result in pack/unpack operations during vectorization would it not? The < needs to be done without changing the VF. i.e. it's only valid as a vector -> vector transform by reinterpreting the bits in the vector. Tamar > > > The trick is that in scalar code where simd-within-a-register > > techniques are used a signed comparison can be created by 1) right > > shifting the sign bits to the lsb > > 2) masking off everything else and 3) multiply by an appropriate > > 0xffff... constant to create a mask. GCC computes this as 4 > > instructions since the multiply is usually replaced by shift and > > subtract. But if this sequence gets vectorized then all this sequence > > can be replaced by a single vector compare-less, since the original > > code was emulating what a vector comapre-less does anyway. > > > > Thanks! > > Manolis > > > > > Thanks, > > > Richard. > > > > > > > + > > > > (for cmp (gt ge lt le) > > > > outp (convert convert negate negate) > > > > outn (negate negate convert convert) diff --git > > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > new file mode 100644 > > > > index 00000000000..26f9ad9ef28 > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > @@ -0,0 +1,72 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > > > + > > > > +typedef unsigned char uint8_t; > > > > +typedef unsigned short uint16_t; > > > > +typedef unsigned int uint32_t; > > > > + > > > > +/* 8-bit SWAR tests. */ > > > > + > > > > +static uint8_t packed_cmp_8_8(uint8_t a) { > > > > + return ((a >> 7) & 0x1U) * 0xffU; } > > > > + > > > > +/* 16-bit SWAR tests. */ > > > > + > > > > +static uint16_t packed_cmp_8_16(uint16_t a) { > > > > + return ((a >> 7) & 0x101U) * 0xffU; } > > > > + > > > > +static uint16_t packed_cmp_16_16(uint16_t a) { > > > > + return ((a >> 15) & 0x1U) * 0xffffU; } > > > > + > > > > +/* 32-bit SWAR tests. */ > > > > + > > > > +static uint32_t packed_cmp_8_32(uint32_t a) { > > > > + return ((a >> 7) & 0x1010101U) * 0xffU; } > > > > + > > > > +static uint32_t packed_cmp_16_32(uint32_t a) { > > > > + return ((a >> 15) & 0x10001U) * 0xffffU; } > > > > + > > > > +static uint32_t packed_cmp_32_32(uint32_t a) { > > > > + return ((a >> 31) & 0x1U) * 0xffffffffU; } > > > > + > > > > +/* Driver function to test the vectorized code generated for the > different > > > > + packed_cmp variants. */ > > > > + > > > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > > > + { \ > > > > + n = (n / 32) * 32; \ > > > > + for(int i = 0; i < n; i += 4) \ > > > > + { \ > > > > + a[i + 0] = FUNC(a[i + 0]); \ > > > > + a[i + 1] = FUNC(a[i + 1]); \ > > > > + a[i + 2] = FUNC(a[i + 2]); \ > > > > + a[i + 3] = FUNC(a[i + 3]); \ > > > > + } \ > > > > + } > > > > + > > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > > > + > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > > > + > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > > > + > > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > > > -- > > > > 2.34.1 > > > >
> -----Original Message----- > From: Gcc-patches <gcc-patches- > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Tamar > Christina via Gcc-patches > Sent: Wednesday, November 30, 2022 1:24 PM > To: Richard Biener <richard.guenther@gmail.com>; Manolis Tsamis > <manolis.tsamis@vrull.eu> > Cc: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.tomsich@vrull.eu>; > jiangning.liu@amperecomputing.com; Christoph Muellner > <christoph.muellner@vrull.eu> > Subject: RE: [PATCH v2] Add pattern to convert vector shift + bitwise and + > multiply to vector compare in some cases. > > > -----Original Message----- > > From: Richard Biener <richard.guenther@gmail.com> > > Sent: Wednesday, November 30, 2022 1:19 PM > > To: Manolis Tsamis <manolis.tsamis@vrull.eu> > > Cc: gcc-patches@gcc.gnu.org; Philipp Tomsich > > <philipp.tomsich@vrull.eu>; Tamar Christina <Tamar.Christina@arm.com>; > > jiangning.liu@amperecomputing.com; Christoph Muellner > > <christoph.muellner@vrull.eu> > > Subject: Re: [PATCH v2] Add pattern to convert vector shift + bitwise > > and + multiply to vector compare in some cases. > > > > On Wed, Nov 30, 2022 at 9:59 AM Manolis Tsamis > > <manolis.tsamis@vrull.eu> > > wrote: > > > > > > On Wed, Nov 30, 2022 at 9:44 AM Richard Biener > > > <richard.guenther@gmail.com> wrote: > > > > > > > > On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis > > <manolis.tsamis@vrull.eu> wrote: > > > > > > > > > > When using SWAR (SIMD in a register) techniques a comparison > > > > > operation within such a register can be made by using a > > > > > combination of shifts, bitwise and and multiplication. If code > > > > > using this scheme is vectorized then there is potential to > > > > > replace all these operations with a single vector comparison, by > > > > > reinterpreting > > the vector types to match the width of the SWAR register. > > > > > > > > > > For example, for the test function packed_cmp_16_32, the > > > > > original > > generated code is: > > > > > > > > > > ldr q0, [x0] > > > > > add w1, w1, 1 > > > > > ushr v0.4s, v0.4s, 15 > > > > > and v0.16b, v0.16b, v2.16b > > > > > shl v1.4s, v0.4s, 16 > > > > > sub v0.4s, v1.4s, v0.4s > > > > > str q0, [x0], 16 > > > > > cmp w2, w1 > > > > > bhi .L20 > > > > > > > > > > with this pattern the above can be optimized to: > > > > > > > > > > ldr q0, [x0] > > > > > add w1, w1, 1 > > > > > cmlt v0.8h, v0.8h, #0 > > > > > str q0, [x0], 16 > > > > > cmp w2, w1 > > > > > bhi .L20 > > > > > > > > > > The effect is similar for x86-64. > > > > > > > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > * match.pd: Simplify vector shift + bit_and + multiply in some > cases. > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > > > > > > > --- > > > > > > > > > > Changes in v2: > > > > > - Changed pattern to use vec_cond_expr. > > > > > - Changed pattern to work with VLA vector. > > > > > - Added more checks and comments. > > > > > > > > > > gcc/match.pd | 60 ++++++++++++++++ > > > > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 > > +++++++++++++++++++ > > > > > 2 files changed, 132 insertions(+) create mode 100644 > > > > > gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > > 67a0a682f31..05e7fc79ba8 100644 > > > > > --- a/gcc/match.pd > > > > > +++ b/gcc/match.pd > > > > > @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > > (view_convert (bit_and:itype (view_convert @0) > > > > > (ne @1 { build_zero_cst (type); > > > > > }))))))) > > > > > > > > > > +/* In SWAR (SIMD in a register) code a signed comparison of > > > > > +packed > > data can > > > > > + be constructed with a particular combination of shift, bitwise and, > > > > > + and multiplication by constants. If that code is vectorized we can > > > > > + convert this pattern into a more efficient vector comparison. > > > > > +*/ (simplify (mult (bit_and (rshift @0 uniform_integer_cst_p@1) > > > > > + uniform_integer_cst_p@2) > > > > > + uniform_integer_cst_p@3) > > > > > > > > Please use VECTOR_CST in the match instead of > > > > uniform_integer_cst_p and instead ... > > > > > > > > > > Will do. > > > > > > > > + (with { > > > > > + tree rshift_cst = uniform_integer_cst_p (@1); > > > > > + tree bit_and_cst = uniform_integer_cst_p (@2); > > > > > + tree mult_cst = uniform_integer_cst_p (@3); } > > > > > + /* Make sure we're working with vectors and uniform vector > > > > > + constants. */ (if (VECTOR_TYPE_P (type) > > > > > > > > ... test for non-NULL *_cst here where you can use > > > > uniform_vector_p instead of uniform_integer_cst_p. You can elide > > > > the VECTOR_TYPE_P check then and instead do INTEGRAL_TYPE_P > (TREE_TYPE (type)). > > > > > > > > > > Will do. > > > > > > > > + && tree_fits_uhwi_p (rshift_cst) > > > > > + && tree_fits_uhwi_p (mult_cst) > > > > > + && tree_fits_uhwi_p (bit_and_cst)) > > > > > + /* Compute what constants would be needed for this to > > > > > + represent a > > packed > > > > > + comparison based on the shift amount denoted by RSHIFT_CST. > */ > > > > > + (with { > > > > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); > > > > > + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); > > > > > + poly_int64 vec_bits = vec_elem_bits * vec_nelts; > > > > > + > > > > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > > > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > > > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > > > > + > > > > > + mult_i = tree_to_uhwi (mult_cst); > > > > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > > > > + target_bit_and_i = 0; > > > > > + > > > > > + /* The bit pattern in BIT_AND_I should be a mask for the least > > > > > + significant bit of each packed element that is CMP_BITS wide. */ > > > > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > > > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > > > > + } > > > > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > > > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > > > > + && multiple_p (vec_bits, cmp_bits_i) > > > > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > > > > + && target_mult_i == mult_i > > > > > + && target_bit_and_i == bit_and_i) > > > > > + /* Compute the vector shape for the comparison and check > > > > > + if the > > target is > > > > > + able to expand the comparison with that type. */ > > > > > + (with { > > > > > + /* We're doing a signed comparison. */ > > > > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, > 0); > > > > > + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); > > > > > + tree vector_cmp_type = build_vector_type (cmp_type, > > vector_type_nelts); > > > > > + tree zeros = build_zero_cst (vector_cmp_type); > > > > > + tree ones = build_all_ones_cst (vector_cmp_type); > > > > > + } > > > > > + (if (expand_vec_cmp_expr_p (vector_cmp_type, > > vector_cmp_type, LT_EXPR)) > > > > > + (view_convert:type (vec_cond (lt > > (view_convert:vector_cmp_type @0) > > > > > + { zeros; }) > > > > > + { ones; } { zeros; }))))))))) > > > > > > > > You are testing whether we can expand the comparison but not > > > > whether we can expand the vector condition here? The set of > > > > combinations supported are tricky, I think your test is > > > > conservatively OK but it might fail to match AVX512 and will fail > > > > targets with masks (SVE, GCN?). I think adding > > > > > > > > || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, > > > > LT_EXPR) > > > > > > > > would fix that (but there will be extra cost in producing the > > > > final result vector from the comparison mask). > > > > > > > > > > I didn't understand that these needed separate checks, thanks for > > > pointing it out. > > > Will do. > > > > > > > I do wonder if, as this is targeted at vectorization, this > > > > shouldn't be a vectorizer pattern instead of a post-processing > transform? > > > > That > > > > > > My initial implementation for this was actually targeting the > > > vectorization patterns but it didn't fit with it very well. One > > > issue that was raised is that these patterns need the internal vec > > > calls and adding one internal function for this optimization would > > > be hardly justifiable and possibly ugly. The way I see it this is a > > > peephole optimization and not something to make a pattern for. > > > After that match.pd was looking like a better place to achieve the > > > optimization in a clean way and not pollute the vectorizer. > > > My understanding of the vectorization machinery is limited so please > > > correct me if I'm wrong. > > > > > > > way we would > > > > get the costing in the vectorizer correct and not rely on integer > > > > vector shift or multiplication and also get the cost of producing > > > > the result vector correct? I can't fully decipher the trick > > > > though, but assume that vector_cmp_type always has smaller > > > > elements than 'type'. > > > > > > > > > > That is correct, all this applies when vector_cmp_type has smaller > > > elements than type. Otherwise if vector_cmp_type == type there is no > > > SWAR and GCC optimizes it nicely already). But I haven't added any > > > special case for that as the pattern is general and works nicely in > > > both > > cases. > > > > So a vectorizer pattern would be to replace the multiplication with > > (with smaller type T and original type U) > > > > (T)@0 < 0 ? (U)-1 :(U)0 > > > > ? That is, this should also work on the scalar side? (if it doesn't > > then it indeed gets more tricky) > > Hmm don't think it does, as the narrowing on the scalar size will result in > pack/unpack operations during vectorization would it not? The < needs to be > done without changing the VF. > > i.e. it's only valid as a vector -> vector transform by reinterpreting the bits in > the vector. Additionally, I don't think a scalar representation is actually correct. Each scalar value still contains multiple packed values. So e.g. uint16_t contains two uint8_t values which have to be individually compared. So a cast on the scalar value is incorrect as it's just truncating. A type convert on a scalar would be incorrect as well as the semantics of < is incorrect then (it'll compare it as a whole). Regards, Tamar > > Tamar > > > > > > The trick is that in scalar code where simd-within-a-register > > > techniques are used a signed comparison can be created by 1) right > > > shifting the sign bits to the lsb > > > 2) masking off everything else and 3) multiply by an appropriate > > > 0xffff... constant to create a mask. GCC computes this as 4 > > > instructions since the multiply is usually replaced by shift and > > > subtract. But if this sequence gets vectorized then all this > > > sequence can be replaced by a single vector compare-less, since the > > > original code was emulating what a vector comapre-less does anyway. > > > > > > Thanks! > > > Manolis > > > > > > > Thanks, > > > > Richard. > > > > > > > > > + > > > > > (for cmp (gt ge lt le) > > > > > outp (convert convert negate negate) > > > > > outn (negate negate convert convert) diff --git > > > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > new file mode 100644 > > > > > index 00000000000..26f9ad9ef28 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > @@ -0,0 +1,72 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > > > > + > > > > > +typedef unsigned char uint8_t; > > > > > +typedef unsigned short uint16_t; typedef unsigned int uint32_t; > > > > > + > > > > > +/* 8-bit SWAR tests. */ > > > > > + > > > > > +static uint8_t packed_cmp_8_8(uint8_t a) { > > > > > + return ((a >> 7) & 0x1U) * 0xffU; } > > > > > + > > > > > +/* 16-bit SWAR tests. */ > > > > > + > > > > > +static uint16_t packed_cmp_8_16(uint16_t a) { > > > > > + return ((a >> 7) & 0x101U) * 0xffU; } > > > > > + > > > > > +static uint16_t packed_cmp_16_16(uint16_t a) { > > > > > + return ((a >> 15) & 0x1U) * 0xffffU; } > > > > > + > > > > > +/* 32-bit SWAR tests. */ > > > > > + > > > > > +static uint32_t packed_cmp_8_32(uint32_t a) { > > > > > + return ((a >> 7) & 0x1010101U) * 0xffU; } > > > > > + > > > > > +static uint32_t packed_cmp_16_32(uint32_t a) { > > > > > + return ((a >> 15) & 0x10001U) * 0xffffU; } > > > > > + > > > > > +static uint32_t packed_cmp_32_32(uint32_t a) { > > > > > + return ((a >> 31) & 0x1U) * 0xffffffffU; } > > > > > + > > > > > +/* Driver function to test the vectorized code generated for > > > > > +the > > different > > > > > + packed_cmp variants. */ > > > > > + > > > > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > > > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > > > > + { \ > > > > > + n = (n / 32) * 32; \ > > > > > + for(int i = 0; i < n; i += 4) \ > > > > > + { \ > > > > > + a[i + 0] = FUNC(a[i + 0]); \ > > > > > + a[i + 1] = FUNC(a[i + 1]); \ > > > > > + a[i + 2] = FUNC(a[i + 2]); \ > > > > > + a[i + 3] = FUNC(a[i + 3]); \ > > > > > + } \ > > > > > + } > > > > > + > > > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > > > > + > > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > > > > + > > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > > > > + > > > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > > > > -- > > > > > 2.34.1 > > > > >
On Wed, Nov 30, 2022 at 9:44 AM Richard Biener <richard.guenther@gmail.com> wrote: > > On Tue, Nov 29, 2022 at 11:05 AM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote: > > > > When using SWAR (SIMD in a register) techniques a comparison operation within > > such a register can be made by using a combination of shifts, bitwise and and > > multiplication. If code using this scheme is vectorized then there is potential > > to replace all these operations with a single vector comparison, by reinterpreting > > the vector types to match the width of the SWAR register. > > > > For example, for the test function packed_cmp_16_32, the original generated code is: > > > > ldr q0, [x0] > > add w1, w1, 1 > > ushr v0.4s, v0.4s, 15 > > and v0.16b, v0.16b, v2.16b > > shl v1.4s, v0.4s, 16 > > sub v0.4s, v1.4s, v0.4s > > str q0, [x0], 16 > > cmp w2, w1 > > bhi .L20 > > > > with this pattern the above can be optimized to: > > > > ldr q0, [x0] > > add w1, w1, 1 > > cmlt v0.8h, v0.8h, #0 > > str q0, [x0], 16 > > cmp w2, w1 > > bhi .L20 > > > > The effect is similar for x86-64. > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > > gcc/ChangeLog: > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > --- > > > > Changes in v2: > > - Changed pattern to use vec_cond_expr. > > - Changed pattern to work with VLA vector. > > - Added more checks and comments. > > > > gcc/match.pd | 60 ++++++++++++++++ > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++ > > 2 files changed, 132 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > diff --git a/gcc/match.pd b/gcc/match.pd > > index 67a0a682f31..05e7fc79ba8 100644 > > --- a/gcc/match.pd > > +++ b/gcc/match.pd > > @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > (view_convert (bit_and:itype (view_convert @0) > > (ne @1 { build_zero_cst (type); }))))))) > > > > +/* In SWAR (SIMD in a register) code a signed comparison of packed data can > > + be constructed with a particular combination of shift, bitwise and, > > + and multiplication by constants. If that code is vectorized we can > > + convert this pattern into a more efficient vector comparison. */ > > +(simplify > > + (mult (bit_and (rshift @0 uniform_integer_cst_p@1) > > + uniform_integer_cst_p@2) > > + uniform_integer_cst_p@3) > > Please use VECTOR_CST in the match instead of uniform_integer_cst_p > and instead ... > > > + (with { > > + tree rshift_cst = uniform_integer_cst_p (@1); > > + tree bit_and_cst = uniform_integer_cst_p (@2); > > + tree mult_cst = uniform_integer_cst_p (@3); > > + } > > + /* Make sure we're working with vectors and uniform vector constants. */ > > + (if (VECTOR_TYPE_P (type) > > ... test for non-NULL *_cst here where you can use uniform_vector_p instead > of uniform_integer_cst_p. You can elide the VECTOR_TYPE_P check then > and instead do INTEGRAL_TYPE_P (TREE_TYPE (type)). > I implemented this solution but it didn't match the pattern and after re-examining the GIMPLE code I remembered that it looks like this: vect__5.151_72 = MEM <vector(4) unsigned intD.10> [(uint32_tD.4389 *)_58]; vect__38.152_73 = vect__5.151_72 >> 15; vect__39.153_74 = vect__38.152_73 & { 65537, 65537, 65537, 65537 }; vect__40.154_75 = vect__39.153_74 * { 65535, 65535, 65535, 65535 }; Even if all the operations are on vectors the shift is not a VECTOR_CST. I don't know if this is guaranteed to be the case, i.e. vector shifts are always a single constant? I used uniform_integer_cst_p to handle this difference in the constants involved without making an assumption about the shift being a single constant. Is there a better way to deal with this subtlety? Manolis > > > + && tree_fits_uhwi_p (rshift_cst) > > + && tree_fits_uhwi_p (mult_cst) > > + && tree_fits_uhwi_p (bit_and_cst)) > > + /* Compute what constants would be needed for this to represent a packed > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > + (with { > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); > > + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); > > + poly_int64 vec_bits = vec_elem_bits * vec_nelts; > > + > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > + > > + mult_i = tree_to_uhwi (mult_cst); > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > + target_bit_and_i = 0; > > + > > + /* The bit pattern in BIT_AND_I should be a mask for the least > > + significant bit of each packed element that is CMP_BITS wide. */ > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > + } > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > + && multiple_p (vec_bits, cmp_bits_i) > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > + && target_mult_i == mult_i > > + && target_bit_and_i == bit_and_i) > > + /* Compute the vector shape for the comparison and check if the target is > > + able to expand the comparison with that type. */ > > + (with { > > + /* We're doing a signed comparison. */ > > + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0); > > + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); > > + tree vector_cmp_type = build_vector_type (cmp_type, vector_type_nelts); > > + tree zeros = build_zero_cst (vector_cmp_type); > > + tree ones = build_all_ones_cst (vector_cmp_type); > > + } > > + (if (expand_vec_cmp_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR)) > > + (view_convert:type (vec_cond (lt (view_convert:vector_cmp_type @0) > > + { zeros; }) > > + { ones; } { zeros; }))))))))) > > You are testing whether we can expand the comparison but not whether we can > expand the vector condition here? The set of combinations supported are > tricky, I think your test is conservatively OK but it might fail to match AVX512 > and will fail targets with masks (SVE, GCN?). I think adding > > || expand_vec_cond_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR) > > would fix that (but there will be extra cost in producing the final > result vector > from the comparison mask). > > I do wonder if, as this is targeted at vectorization, this shouldn't > be a vectorizer pattern instead of a post-processing transform? That > way we would > get the costing in the vectorizer correct and not rely on integer > vector shift or > multiplication and also get the cost of producing the result vector correct? I > can't fully decipher the trick though, but assume that vector_cmp_type > always has > smaller elements than 'type'. > > Thanks, > Richard. > > > + > > (for cmp (gt ge lt le) > > outp (convert convert negate negate) > > outn (negate negate convert convert) > > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > new file mode 100644 > > index 00000000000..26f9ad9ef28 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > @@ -0,0 +1,72 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > + > > +typedef unsigned char uint8_t; > > +typedef unsigned short uint16_t; > > +typedef unsigned int uint32_t; > > + > > +/* 8-bit SWAR tests. */ > > + > > +static uint8_t packed_cmp_8_8(uint8_t a) > > +{ > > + return ((a >> 7) & 0x1U) * 0xffU; > > +} > > + > > +/* 16-bit SWAR tests. */ > > + > > +static uint16_t packed_cmp_8_16(uint16_t a) > > +{ > > + return ((a >> 7) & 0x101U) * 0xffU; > > +} > > + > > +static uint16_t packed_cmp_16_16(uint16_t a) > > +{ > > + return ((a >> 15) & 0x1U) * 0xffffU; > > +} > > + > > +/* 32-bit SWAR tests. */ > > + > > +static uint32_t packed_cmp_8_32(uint32_t a) > > +{ > > + return ((a >> 7) & 0x1010101U) * 0xffU; > > +} > > + > > +static uint32_t packed_cmp_16_32(uint32_t a) > > +{ > > + return ((a >> 15) & 0x10001U) * 0xffffU; > > +} > > + > > +static uint32_t packed_cmp_32_32(uint32_t a) > > +{ > > + return ((a >> 31) & 0x1U) * 0xffffffffU; > > +} > > + > > +/* Driver function to test the vectorized code generated for the different > > + packed_cmp variants. */ > > + > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > + { \ > > + n = (n / 32) * 32; \ > > + for(int i = 0; i < n; i += 4) \ > > + { \ > > + a[i + 0] = FUNC(a[i + 0]); \ > > + a[i + 1] = FUNC(a[i + 1]); \ > > + a[i + 2] = FUNC(a[i + 2]); \ > > + a[i + 3] = FUNC(a[i + 3]); \ > > + } \ > > + } > > + > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > + > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > + > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > + > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > -- > > 2.34.1 > >
diff --git a/gcc/match.pd b/gcc/match.pd index 67a0a682f31..05e7fc79ba8 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -301,6 +301,66 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (view_convert (bit_and:itype (view_convert @0) (ne @1 { build_zero_cst (type); }))))))) +/* In SWAR (SIMD in a register) code a signed comparison of packed data can + be constructed with a particular combination of shift, bitwise and, + and multiplication by constants. If that code is vectorized we can + convert this pattern into a more efficient vector comparison. */ +(simplify + (mult (bit_and (rshift @0 uniform_integer_cst_p@1) + uniform_integer_cst_p@2) + uniform_integer_cst_p@3) + (with { + tree rshift_cst = uniform_integer_cst_p (@1); + tree bit_and_cst = uniform_integer_cst_p (@2); + tree mult_cst = uniform_integer_cst_p (@3); + } + /* Make sure we're working with vectors and uniform vector constants. */ + (if (VECTOR_TYPE_P (type) + && tree_fits_uhwi_p (rshift_cst) + && tree_fits_uhwi_p (mult_cst) + && tree_fits_uhwi_p (bit_and_cst)) + /* Compute what constants would be needed for this to represent a packed + comparison based on the shift amount denoted by RSHIFT_CST. */ + (with { + HOST_WIDE_INT vec_elem_bits = vector_element_bits (type); + poly_int64 vec_nelts = TYPE_VECTOR_SUBPARTS (type); + poly_int64 vec_bits = vec_elem_bits * vec_nelts; + + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; + + mult_i = tree_to_uhwi (mult_cst); + bit_and_i = tree_to_uhwi (bit_and_cst); + target_bit_and_i = 0; + + /* The bit pattern in BIT_AND_I should be a mask for the least + significant bit of each packed element that is CMP_BITS wide. */ + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; + } + (if ((exact_log2 (cmp_bits_i)) >= 0 + && cmp_bits_i < HOST_BITS_PER_WIDE_INT + && multiple_p (vec_bits, cmp_bits_i) + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT + && target_mult_i == mult_i + && target_bit_and_i == bit_and_i) + /* Compute the vector shape for the comparison and check if the target is + able to expand the comparison with that type. */ + (with { + /* We're doing a signed comparison. */ + tree cmp_type = build_nonstandard_integer_type (cmp_bits_i, 0); + poly_int64 vector_type_nelts = exact_div (vec_bits, cmp_bits_i); + tree vector_cmp_type = build_vector_type (cmp_type, vector_type_nelts); + tree zeros = build_zero_cst (vector_cmp_type); + tree ones = build_all_ones_cst (vector_cmp_type); + } + (if (expand_vec_cmp_expr_p (vector_cmp_type, vector_cmp_type, LT_EXPR)) + (view_convert:type (vec_cond (lt (view_convert:vector_cmp_type @0) + { zeros; }) + { ones; } { zeros; }))))))))) + (for cmp (gt ge lt le) outp (convert convert negate negate) outn (negate negate convert convert) diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c new file mode 100644 index 00000000000..26f9ad9ef28 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c @@ -0,0 +1,72 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize" } */ + +typedef unsigned char uint8_t; +typedef unsigned short uint16_t; +typedef unsigned int uint32_t; + +/* 8-bit SWAR tests. */ + +static uint8_t packed_cmp_8_8(uint8_t a) +{ + return ((a >> 7) & 0x1U) * 0xffU; +} + +/* 16-bit SWAR tests. */ + +static uint16_t packed_cmp_8_16(uint16_t a) +{ + return ((a >> 7) & 0x101U) * 0xffU; +} + +static uint16_t packed_cmp_16_16(uint16_t a) +{ + return ((a >> 15) & 0x1U) * 0xffffU; +} + +/* 32-bit SWAR tests. */ + +static uint32_t packed_cmp_8_32(uint32_t a) +{ + return ((a >> 7) & 0x1010101U) * 0xffU; +} + +static uint32_t packed_cmp_16_32(uint32_t a) +{ + return ((a >> 15) & 0x10001U) * 0xffffU; +} + +static uint32_t packed_cmp_32_32(uint32_t a) +{ + return ((a >> 31) & 0x1U) * 0xffffffffU; +} + +/* Driver function to test the vectorized code generated for the different + packed_cmp variants. */ + +#define VECTORIZED_PACKED_CMP(T, FUNC) \ + void vectorized_cmp_##FUNC(T* a, int n) \ + { \ + n = (n / 32) * 32; \ + for(int i = 0; i < n; i += 4) \ + { \ + a[i + 0] = FUNC(a[i + 0]); \ + a[i + 1] = FUNC(a[i + 1]); \ + a[i + 2] = FUNC(a[i + 2]); \ + a[i + 3] = FUNC(a[i + 3]); \ + } \ + } + +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); + +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); + +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); + +/* { dg-final { scan-assembler {\tcmlt\t} } } */ +/* { dg-final { scan-assembler-not {\tushr\t} } } */ +/* { dg-final { scan-assembler-not {\tshl\t} } } */ +/* { dg-final { scan-assembler-not {\tmul\t} } } */