[1/2] match.pd: Fold ((X >> C1) & C2) * (1 << C1)

Message ID 20250312185837.2863788-2-richard.sandiford@arm.com
State New
Series Two match.pd folds for sve/pr98119.c

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_simplebootstrap_build--master-arm-bootstrap success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_simplebootstrap_build--master-aarch64-bootstrap success Build passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Test passed

Commit Message

Richard Sandiford March 12, 2025, 6:58 p.m. UTC
Using a combination of rules, we were able to fold

  ((X >> C1) & C2) * (1 << C1)  -->  X & (C2 << C1)

if everything was done at the same precision, but we couldn't fold
it if the AND was done at a different precision.  The optimisation is
often (but not always) valid for that case too.

This patch adds a dedicated rule for the case where different precisions
are involved.
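
For example (a minimal sketch modelled on the new tests; the function name is
arbitrary), the AND below is performed in a wider unsigned type, yet the whole
sequence can still be reduced to a single mask:

  unsigned int
  sketch (unsigned int x)
  {
      x >>= 4;              /* X >> C1, with C1 == 4 */
      unsigned long y = x;  /* convert to a wider type */
      y &= 0xff;            /* ... & C2 in that type */
      x = y;                /* convert back */
      x *= 1 << 4;          /* * (1 << C1) */
      return x;             /* folds to x & (0xff << 4) */
  }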

An alternative would be to extend the individual folds that together
handle the same-precision case so that those rules handle differing
precisions.  But the risk is that that could replace narrow operations
with wide operations, which would be especially harmful on targets
like avr.  It's also not obviously free of cycles.

I also wondered whether the converts should be non-optional.

gcc/
	* match.pd: Fold ((X >> C1) & C2) * (1 << C1) to X & (C2 << C1).

gcc/testsuite/
	* gcc.dg/fold-mul-and-lshift-1.c: New test.
	* gcc.dg/fold-mul-and-lshift-2.c: Likewise.
---
 gcc/match.pd                                 | 29 ++++++++++
 gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c | 59 ++++++++++++++++++++
 gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c | 15 +++++
 3 files changed, 103 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
 create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
  

Comments

Andrew Pinski March 12, 2025, 7:26 p.m. UTC | #1
On Wed, Mar 12, 2025 at 12:00 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Using a combination of rules, we were able to fold
>
>   ((X >> C1) & C2) * (1 << C1)  -->  X & (C2 << C1)
>
> if everything was done at the same precision, but we couldn't fold
> it if the AND was done at a different precision.  The optimisation is
> often (but not always) valid for that case too.
>
> This patch adds a dedicated rule for the case where different precisions
> are involved.
>
> An alternative would be to extend the individual folds that together
> handle the same-precision case so that those rules handle differing
> precisions.  But the risk is that that could replace narrow operations
> with wide operations, which would be especially harmful on targets
> like avr.  It's also not obviously free of cycles.
>
> I also wondered whether the converts should be non-optional.
>
> gcc/
>         * match.pd: Fold ((X >> C1) & C2) * (1 << C1) to X & (C2 << C1).
>
> gcc/testsuite/
>         * gcc.dg/fold-mul-and-lshift-1.c: New test.
>         * gcc.dg/fold-mul-and-lshift-2.c: Likewise.
> ---
>  gcc/match.pd                                 | 29 ++++++++++
>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c | 59 ++++++++++++++++++++
>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c | 15 +++++
>  3 files changed, 103 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 5c679848bdf..3197d1cac75 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -5231,6 +5231,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>       (if (mask)
>        (bit_op (shift (convert @0) @1) { mask; })))))))
>
> +/* Fold ((X >> C1) & C2) * (1 << C1) into X & (C2 << C1), including cases where
> +   the & happens in a different type.  It is the conversion case that isn't
> +   a composition of other folds.
> +
> +   Let the type of the * and >> be T1 and the type of the & be T2.
> +   The fold is valid if the conversion to T2 preserves all information;
> +   that is, if T2 is wider than T1 or drops no more than C1 bits from T1.
> +   In that case, the & might operate on bits that are dropped by the
> +   later conversion to T1 and the multiplication by (1 << C1), but those
> +   bits are also dropped by ANDing with C2 << C1 (converted to T1).
> +
> +   If the conversion to T2 is not information-preserving, we have to be
> +   careful about the later conversion to T1 acting as a sign extension.
> +   We need either T2 to be unsigned or the top (sign) bit of C2 to be clear.
> +   That is equivalent to testing whether C2 is nonnegative.  */
> +(simplify
> + (mult
> +  (convert? (bit_and (convert? (rshift @0 INTEGER_CST@1)) INTEGER_CST@2))
> +  INTEGER_CST@3)
> + (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
> +  (with { auto prec = element_precision (type); }
Since we know this needs to be a scalar, using TYPE_PRECISION here is fine too.

> +   (if (wi::ltu_p (wi::to_widest (@1), prec))

I think using wi::to_wide is better than using wi::to_widest here.
The other place in match which checks shift count does:
       /* Use a signed compare to leave negative shift counts alone.  */
       && wi::ges_p (wi::to_wide (uniform_integer_cst_p (@1)),
                     element_precision (type)))


> +    (with { auto shift = tree_to_uhwi (@1); }
> +     (if ((prec <= element_precision (TREE_TYPE (@2)) + shift
> +          || wi::to_widest (@2) >= 0)

I think `wi::to_widest (@2) >= 0` can be written as
`!tree_int_cst_sign_bit (@2)`.

Other than that the patch looks good.

Thanks,
Andrew


> +         && wi::to_wide (@3) == wi::set_bit_in_zero (shift, prec))
> +      (with { auto mask = wide_int::from (wi::to_wide (@2), prec, UNSIGNED); }
> +       (bit_and @0 { wide_int_to_tree (type, mask << shift); }))))))))
> +
>  /* ~(~X >> Y) -> X >> Y (for arithmetic shift).  */
>  (simplify
>   (bit_not (convert1?:s (rshift:s (convert2?@0 (bit_not @1)) @2)))
> diff --git a/gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
> new file mode 100644
> index 00000000000..b1ce10495e3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
> @@ -0,0 +1,59 @@
> +/* { dg-do compile { target lp64 } } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { scan-tree-dump-not { >> } "optimized" } } */
> +/* { dg-final { scan-tree-dump-not { \* } "optimized" } } */
> +
> +unsigned int
> +f1 (unsigned int x)
> +{
> +    x >>= 1;
> +    unsigned long y = x;
> +    y &= 255;
> +    x = y;
> +    x *= 2;
> +    return x;
> +}
> +
> +unsigned int
> +f2 (unsigned int x)
> +{
> +    x >>= 1;
> +    unsigned long y = x;
> +    y &= -2UL;
> +    x = y;
> +    x *= 2;
> +    return x;
> +}
> +
> +unsigned int
> +f3 (unsigned int x)
> +{
> +    x >>= 1;
> +    unsigned short y = x;
> +    y &= 255;
> +    x = y;
> +    x *= 2;
> +    return x;
> +}
> +
> +unsigned int
> +f4 (unsigned int x)
> +{
> +    x >>= 1;
> +    short y = x;
> +    y &= (unsigned short) ~0U >> 1;
> +    x = y;
> +    x *= 2;
> +    return x;
> +}
> +
> +unsigned int
> +f5 (unsigned int x)
> +{
> +    x >>= 16;
> +    short y = x;
> +    y &= -2;
> +    x = y;
> +    x *= 1 << 16;
> +    return x;
> +}
> diff --git a/gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
> new file mode 100644
> index 00000000000..86eabef0fef
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile { target int32 } } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +/* { dg-final { scan-tree-dump { >> 15;} "optimized" } } */
> +/* { dg-final { scan-tree-dump { \* 32768;} "optimized" } } */
> +
> +unsigned int
> +f1 (unsigned int x)
> +{
> +    x >>= 15;
> +    short y = x;
> +    y &= -2;
> +    x = y;
> +    x *= 1 << 15;
> +    return x;
> +}
> --
> 2.25.1
>
  
Richard Sandiford March 12, 2025, 8:38 p.m. UTC | #2
Thanks for the review!

Andrew Pinski <pinskia@gmail.com> writes:
> On Wed, Mar 12, 2025 at 12:00 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Using a combination of rules, we were able to fold
>>
>>   ((X >> C1) & C2) * (1 << C1)  -->  X & (C2 << C1)
>>
>> if everything was done at the same precision, but we couldn't fold
>> it if the AND was done at a different precision.  The optimisation is
>> often (but not always) valid for that case too.
>>
>> This patch adds a dedicated rule for the case where different precisions
>> are involved.
>>
>> An alternative would be to extend the individual folds that together
>> handle the same-precision case so that those rules handle differing
>> precisions.  But the risk is that that could replace narrow operations
>> with wide operations, which would be especially harmful on targets
>> like avr.  It's also not obviously free of cycles.
>>
>> I also wondered whether the converts should be non-optional.
>>
>> gcc/
>>         * match.pd: Fold ((X >> C1) & C2) * (1 << C1) to X & (C2 << C1).
>>
>> gcc/testsuite/
>>         * gcc.dg/fold-mul-and-lshift-1.c: New test.
>>         * gcc.dg/fold-mul-and-lshift-2.c: Likewise.
>> ---
>>  gcc/match.pd                                 | 29 ++++++++++
>>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c | 59 ++++++++++++++++++++
>>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c | 15 +++++
>>  3 files changed, 103 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
>>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
>>
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 5c679848bdf..3197d1cac75 100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -5231,6 +5231,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>       (if (mask)
>>        (bit_op (shift (convert @0) @1) { mask; })))))))
>>
>> +/* Fold ((X >> C1) & C2) * (1 << C1) into X & (C2 << C1), including cases where
>> +   the & happens in a different type.  It is the conversion case that isn't
>> +   a composition of other folds.
>> +
>> +   Let the type of the * and >> be T1 and the type of the & be T2.
>> +   The fold is valid if the conversion to T2 preserves all information;
>> +   that is, if T2 is wider than T1 or drops no more than C1 bits from T1.
>> +   In that case, the & might operate on bits that are dropped by the
>> +   later conversion to T1 and the multiplication by (1 << C1), but those
>> +   bits are also dropped by ANDing with C2 << C1 (converted to T1).
>> +
>> +   If the conversion to T2 is not information-preserving, we have to be
>> +   careful about the later conversion to T1 acting as a sign extension.
>> +   We need either T2 to be unsigned or the top (sign) bit of C2 to be clear.
>> +   That is equivalent to testing whether C2 is nonnegative.  */
>> +(simplify
>> + (mult
>> +  (convert? (bit_and (convert? (rshift @0 INTEGER_CST@1)) INTEGER_CST@2))
>> +  INTEGER_CST@3)
>> + (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
>> +  (with { auto prec = element_precision (type); }
> Since we know this needs to be a scalar, using TYPE_PRECISION here is fine too.

Yeah, agreed.  I'd wondered whether to use TYPE_PRECISION instead,
but then I'd also wondered about trying to make the fold work for vectors.
Guess I ended up between two stools.

>> +   (if (wi::ltu_p (wi::to_widest (@1), prec))
>
> I think using wi::to_wide is better than using wi::to_widest here.

What's the reason for preferring wi::to_wide?  wi::to_widest should
usually be more efficient for this kind of check, since the tree
representation allows the underlying HWIs to be used directly.
wi::to_wide instead requires masking off bits above the precision.

E.g. on an --enable-checking=release compiler:

bool
foo (tree t, unsigned int n)
{
  return wi::ltu_p (wi::to_widest (t), n);
}

gives:

    188c:       79400c02        ldrh    w2, [x0, #6]
    1890:       7100045f        cmp     w2, #0x1
    1894:       54000060        b.eq    18a0 <foo(tree_node*, unsigned int)+0x14>  // b.none
    1898:       52800000        mov     w0, #0x0                        // #0
    189c:       d65f03c0        ret
    18a0:       f9400800        ldr     x0, [x0, #16]
    18a4:       eb21401f        cmp     x0, w1, uxtw
    18a8:       1a9f27e0        cset    w0, cc  // cc = lo, ul, last
    18ac:       d65f03c0        ret

whereas:

bool
foo (tree t, unsigned int n)
{
  return wi::ltu_p (wi::to_wide (t), n);
}

gives:

    188c:       79400802        ldrh    w2, [x0, #4]
    1890:       7100045f        cmp     w2, #0x1
    1894:       54000060        b.eq    18a0 <foo(tree_node*, unsigned int)+0x14>  // b.none
    1898:       52800000        mov     w0, #0x0                        // #0
    189c:       d65f03c0        ret
    18a0:       a9408800        ldp     x0, x2, [x0, #8]
    18a4:       79406c03        ldrh    w3, [x0, #54]
    18a8:       92800000        mov     x0, #0xffffffffffffffff         // #-1
    18ac:       7100fc7f        cmp     w3, #0x3f
    18b0:       9ac32000        lsl     x0, x0, x3
    18b4:       8a200040        bic     x0, x2, x0
    18b8:       9a829002        csel    x2, x0, x2, ls  // ls = plast
    18bc:       eb21405f        cmp     x2, w1, uxtw
    18c0:       1a9f27e0        cset    w0, cc  // cc = lo, ul, last
    18c4:       d65f03c0        ret

> The other place in match which checks shift count does:
>        /* Use a signed compare to leave negative shift counts alone.  */
>        && wi::ges_p (wi::to_wide (uniform_integer_cst_p (@1)),
>                      element_precision (type)))
>
>
>> +    (with { auto shift = tree_to_uhwi (@1); }
>> +     (if ((prec <= element_precision (TREE_TYPE (@2)) + shift
>> +          || wi::to_widest (@2) >= 0)
>
> I think `wi::to_widest (@2) >= 0` can be written as
> `!tree_int_cst_sign_bit (@2)`.

Ah, yeah, thanks.

Richard
  
Andrew Pinski March 12, 2025, 8:40 p.m. UTC | #3
On Wed, Mar 12, 2025 at 1:38 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Thanks for the review!
>
> Andrew Pinski <pinskia@gmail.com> writes:
> > On Wed, Mar 12, 2025 at 12:00 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Using a combination of rules, we were able to fold
> >>
> >>   ((X >> C1) & C2) * (1 << C1)  -->  X & (C2 << C1)
> >>
> >> if everything was done at the same precision, but we couldn't fold
> >> it if the AND was done at a different precision.  The optimisation is
> >> often (but not always) valid for that case too.
> >>
> >> This patch adds a dedicated rule for the case where different precisions
> >> are involved.
> >>
> >> An alternative would be to extend the individual folds that together
> >> handle the same-precision case so that those rules handle differing
> >> precisions.  But the risk is that that could replace narrow operations
> >> with wide operations, which would be especially harmful on targets
> >> like avr.  It's also not obviously free of cycles.
> >>
> >> I also wondered whether the converts should be non-optional.
> >>
> >> gcc/
> >>         * match.pd: Fold ((X >> C1) & C2) * (1 << C1) to X & (C2 << C1).
> >>
> >> gcc/testsuite/
> >>         * gcc.dg/fold-mul-and-lshift-1.c: New test.
> >>         * gcc.dg/fold-mul-and-lshift-2.c: Likewise.
> >> ---
> >>  gcc/match.pd                                 | 29 ++++++++++
> >>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c | 59 ++++++++++++++++++++
> >>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c | 15 +++++
> >>  3 files changed, 103 insertions(+)
> >>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
> >>
> >> diff --git a/gcc/match.pd b/gcc/match.pd
> >> index 5c679848bdf..3197d1cac75 100644
> >> --- a/gcc/match.pd
> >> +++ b/gcc/match.pd
> >> @@ -5231,6 +5231,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >>       (if (mask)
> >>        (bit_op (shift (convert @0) @1) { mask; })))))))
> >>
> >> +/* Fold ((X >> C1) & C2) * (1 << C1) into X & (C2 << C1), including cases where
> >> +   the & happens in a different type.  It is the conversion case that isn't
> >> +   a composition of other folds.
> >> +
> >> +   Let the type of the * and >> be T1 and the type of the & be T2.
> >> +   The fold is valid if the conversion to T2 preserves all information;
> >> +   that is, if T2 is wider than T1 or drops no more than C1 bits from T1.
> >> +   In that case, the & might operate on bits that are dropped by the
> >> +   later conversion to T1 and the multiplication by (1 << C1), but those
> >> +   bits are also dropped by ANDing with C2 << C1 (converted to T1).
> >> +
> >> +   If the conversion to T2 is not information-preserving, we have to be
> >> +   careful about the later conversion to T1 acting as a sign extension.
> >> +   We need either T2 to be unsigned or the top (sign) bit of C2 to be clear.
> >> +   That is equivalent to testing whether C2 is nonnegative.  */
> >> +(simplify
> >> + (mult
> >> +  (convert? (bit_and (convert? (rshift @0 INTEGER_CST@1)) INTEGER_CST@2))
> >> +  INTEGER_CST@3)
> >> + (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
> >> +  (with { auto prec = element_precision (type); }
> > Since we know this needs to be a scalar, using TYPE_PRECISION here is fine too.
>
> Yeah, agreed.  I'd wondered whether to use TYPE_PRECISION instead,
> but then I'd also wondered about trying to make the fold work for vectors.
> Guess I ended up between two stools.
>
> >> +   (if (wi::ltu_p (wi::to_widest (@1), prec))
> >
> > I think using wi::to_wide is better than using wi::to_widest here.
>
> What's the reason for preferring wi::to_wide?  wi::to_widest should
> usually be more efficient for this kind of check, since the tree
> representation allows the underlying HWIs to be used directly.
> wi::to_wide instead requires masking off bits above the precision.
>
> E.g. on an --enable-checking=release compiler:
>
> bool
> foo (tree t, unsigned int n)
> {
>   return wi::ltu_p (wi::to_widest (t), n);
> }
>
> gives:
>
>     188c:       79400c02        ldrh    w2, [x0, #6]
>     1890:       7100045f        cmp     w2, #0x1
>     1894:       54000060        b.eq    18a0 <foo(tree_node*, unsigned int)+0x14>  // b.none
>     1898:       52800000        mov     w0, #0x0                        // #0
>     189c:       d65f03c0        ret
>     18a0:       f9400800        ldr     x0, [x0, #16]
>     18a4:       eb21401f        cmp     x0, w1, uxtw
>     18a8:       1a9f27e0        cset    w0, cc  // cc = lo, ul, last
>     18ac:       d65f03c0        ret
>
> whereas:
>
> bool
> foo (tree t, unsigned int n)
> {
>   return wi::ltu_p (wi::to_wide (t), n);
> }
>
> gives:
>
>     188c:       79400802        ldrh    w2, [x0, #4]
>     1890:       7100045f        cmp     w2, #0x1
>     1894:       54000060        b.eq    18a0 <foo(tree_node*, unsigned int)+0x14>  // b.none
>     1898:       52800000        mov     w0, #0x0                        // #0
>     189c:       d65f03c0        ret
>     18a0:       a9408800        ldp     x0, x2, [x0, #8]
>     18a4:       79406c03        ldrh    w3, [x0, #54]
>     18a8:       92800000        mov     x0, #0xffffffffffffffff         // #-1
>     18ac:       7100fc7f        cmp     w3, #0x3f
>     18b0:       9ac32000        lsl     x0, x0, x3
>     18b4:       8a200040        bic     x0, x2, x0
>     18b8:       9a829002        csel    x2, x0, x2, ls  // ls = plast
>     18bc:       eb21405f        cmp     x2, w1, uxtw
>     18c0:       1a9f27e0        cset    w0, cc  // cc = lo, ul, last
>     18c4:       d65f03c0        ret

Oh I guess I didn't realize that. Maybe there are other places which
should get this handling too (obviously for another time).

Thanks,
Andrew

>
> > The other place in match which checks shift count does:
> >        /* Use a signed compare to leave negative shift counts alone.  */
> >        && wi::ges_p (wi::to_wide (uniform_integer_cst_p (@1)),
> >                      element_precision (type)))
> >
> >
> >> +    (with { auto shift = tree_to_uhwi (@1); }
> >> +     (if ((prec <= element_precision (TREE_TYPE (@2)) + shift
> >> +          || wi::to_widest (@2) >= 0)
> >
> > I think `wi::to_widest (@2) >= 0` can be written as
> > `!tree_int_cst_sign_bit (@2)`.
>
> Ah, yeah, thanks.
>
> Richard
  
Richard Biener March 13, 2025, 11:44 a.m. UTC | #4
On Wed, 12 Mar 2025, Richard Sandiford wrote:

> Thanks for the review!
> 
> Andrew Pinski <pinskia@gmail.com> writes:
> > On Wed, Mar 12, 2025 at 12:00 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Using a combination of rules, we were able to fold
> >>
> >>   ((X >> C1) & C2) * (1 << C1)  -->  X & (C2 << C1)
> >>
> >> if everything was done at the same precision, but we couldn't fold
> >> it if the AND was done at a different precision.  The optimisation is
> >> often (but not always) valid for that case too.
> >>
> >> This patch adds a dedicated rule for the case where different precisions
> >> are involved.
> >>
> >> An alternative would be to extend the individual folds that together
> >> handle the same-precision case so that those rules handle differing
> >> precisions.  But the risk is that that could replace narrow operations
> >> with wide operations, which would be especially harmful on targets
> >> like avr.  It's also not obviously free of cycles.
> >>
> >> I also wondered whether the converts should be non-optional.
> >>
> >> gcc/
> >>         * match.pd: Fold ((X >> C1) & C2) * (1 << C1) to X & (C2 << C1).
> >>
> >> gcc/testsuite/
> >>         * gcc.dg/fold-mul-and-lshift-1.c: New test.
> >>         * gcc.dg/fold-mul-and-lshift-2.c: Likewise.
> >> ---
> >>  gcc/match.pd                                 | 29 ++++++++++
> >>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c | 59 ++++++++++++++++++++
> >>  gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c | 15 +++++
> >>  3 files changed, 103 insertions(+)
> >>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
> >>
> >> diff --git a/gcc/match.pd b/gcc/match.pd
> >> index 5c679848bdf..3197d1cac75 100644
> >> --- a/gcc/match.pd
> >> +++ b/gcc/match.pd
> >> @@ -5231,6 +5231,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >>       (if (mask)
> >>        (bit_op (shift (convert @0) @1) { mask; })))))))
> >>
> >> +/* Fold ((X >> C1) & C2) * (1 << C1) into X & (C2 << C1), including cases where
> >> +   the & happens in a different type.  It is the conversion case that isn't
> >> +   a composition of other folds.
> >> +
> >> +   Let the type of the * and >> be T1 and the type of the & be T2.
> >> +   The fold is valid if the conversion to T2 preserves all information;
> >> +   that is, if T2 is wider than T1 or drops no more than C1 bits from T1.
> >> +   In that case, the & might operate on bits that are dropped by the
> >> +   later conversion to T1 and the multiplication by (1 << C1), but those
> >> +   bits are also dropped by ANDing with C2 << C1 (converted to T1).
> >> +
> >> +   If the conversion to T2 is not information-preserving, we have to be
> >> +   careful about the later conversion to T1 acting as a sign extension.
> >> +   We need either T2 to be unsigned or the top (sign) bit of C2 to be clear.
> >> +   That is equivalent to testing whether C2 is nonnegative.  */
> >> +(simplify
> >> + (mult
> >> +  (convert? (bit_and (convert? (rshift @0 INTEGER_CST@1)) INTEGER_CST@2))
> >> +  INTEGER_CST@3)
> >> + (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
> >> +  (with { auto prec = element_precision (type); }
> > Since we know this needs to be a scalar, using TYPE_PRECISION here is fine too.
> 
> Yeah, agreed.  I'd wondered whether to use TYPE_PRECISION instead,
> but then I'd also wondered about trying to make the fold work for vectors.
> Guess I ended up between two stools.
> 
> >> +   (if (wi::ltu_p (wi::to_widest (@1), prec))
> >
> > I think using wi::to_wide is better than using wi::to_widest here.
> 
> What's the reason for preferring wi::to_wide?  wi::to_widest should
> usually be more efficient for this kind of check, since the tree
> representation allows the underlying HWIs to be used directly.
> wi::to_wide instead requires masking off bits above the precision.

I prefer ::to_widest.

The original patch is OK with changing element_precision to 
TYPE_PRECISION.

Richard.

> E.g. on an --enable-checking=release compiler:
> 
> bool
> foo (tree t, unsigned int n)
> {
>   return wi::ltu_p (wi::to_widest (t), n);
> }
> 
> gives:
> 
>     188c:       79400c02        ldrh    w2, [x0, #6]
>     1890:       7100045f        cmp     w2, #0x1
>     1894:       54000060        b.eq    18a0 <foo(tree_node*, unsigned int)+0x14>  // b.none
>     1898:       52800000        mov     w0, #0x0                        // #0
>     189c:       d65f03c0        ret
>     18a0:       f9400800        ldr     x0, [x0, #16]
>     18a4:       eb21401f        cmp     x0, w1, uxtw
>     18a8:       1a9f27e0        cset    w0, cc  // cc = lo, ul, last
>     18ac:       d65f03c0        ret
> 
> whereas:
> 
> bool
> foo (tree t, unsigned int n)
> {
>   return wi::ltu_p (wi::to_wide (t), n);
> }
> 
> gives:
> 
>     188c:       79400802        ldrh    w2, [x0, #4]
>     1890:       7100045f        cmp     w2, #0x1
>     1894:       54000060        b.eq    18a0 <foo(tree_node*, unsigned int)+0x14>  // b.none
>     1898:       52800000        mov     w0, #0x0                        // #0
>     189c:       d65f03c0        ret
>     18a0:       a9408800        ldp     x0, x2, [x0, #8]
>     18a4:       79406c03        ldrh    w3, [x0, #54]
>     18a8:       92800000        mov     x0, #0xffffffffffffffff         // #-1
>     18ac:       7100fc7f        cmp     w3, #0x3f
>     18b0:       9ac32000        lsl     x0, x0, x3
>     18b4:       8a200040        bic     x0, x2, x0
>     18b8:       9a829002        csel    x2, x0, x2, ls  // ls = plast
>     18bc:       eb21405f        cmp     x2, w1, uxtw
>     18c0:       1a9f27e0        cset    w0, cc  // cc = lo, ul, last
>     18c4:       d65f03c0        ret
> 
> > The other place in match which checks shift count does:
> >        /* Use a signed compare to leave negative shift counts alone.  */
> >        && wi::ges_p (wi::to_wide (uniform_integer_cst_p (@1)),
> >                      element_precision (type)))
> >
> >
> >> +    (with { auto shift = tree_to_uhwi (@1); }
> >> +     (if ((prec <= element_precision (TREE_TYPE (@2)) + shift
> >> +          || wi::to_widest (@2) >= 0)
> >
> > I think `wi::to_widest (@2) >= 0` can be written as
> > `!tree_int_cst_sign_bit (@2)`.
> 
> Ah, yeah, thanks.
> 
> Richard
>
  

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 5c679848bdf..3197d1cac75 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5231,6 +5231,35 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
      (if (mask)
       (bit_op (shift (convert @0) @1) { mask; })))))))
 
+/* Fold ((X >> C1) & C2) * (1 << C1) into X & (C2 << C1), including cases where
+   the & happens in a different type.  It is the conversion case that isn't
+   a composition of other folds.
+
+   Let the type of the * and >> be T1 and the type of the & be T2.
+   The fold is valid if the conversion to T2 preserves all information;
+   that is, if T2 is wider than T1 or drops no more than C1 bits from T1.
+   In that case, the & might operate on bits that are dropped by the
+   later conversion to T1 and the multiplication by (1 << C1), but those
+   bits are also dropped by ANDing with C2 << C1 (converted to T1).
+
+   If the conversion to T2 is not information-preserving, we have to be
+   careful about the later conversion to T1 acting as a sign extension.
+   We need either T2 to be unsigned or the top (sign) bit of C2 to be clear.
+   That is equivalent to testing whether C2 is nonnegative.  */
+(simplify
+ (mult
+  (convert? (bit_and (convert? (rshift @0 INTEGER_CST@1)) INTEGER_CST@2))
+  INTEGER_CST@3)
+ (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
+  (with { auto prec = element_precision (type); }
+   (if (wi::ltu_p (wi::to_widest (@1), prec))
+    (with { auto shift = tree_to_uhwi (@1); }
+     (if ((prec <= element_precision (TREE_TYPE (@2)) + shift
+	   || wi::to_widest (@2) >= 0)
+	  && wi::to_wide (@3) == wi::set_bit_in_zero (shift, prec))
+      (with { auto mask = wide_int::from (wi::to_wide (@2), prec, UNSIGNED); }
+       (bit_and @0 { wide_int_to_tree (type, mask << shift); }))))))))
+
 /* ~(~X >> Y) -> X >> Y (for arithmetic shift).  */
 (simplify
  (bit_not (convert1?:s (rshift:s (convert2?@0 (bit_not @1)) @2)))
diff --git a/gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
new file mode 100644
index 00000000000..b1ce10495e3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-1.c
@@ -0,0 +1,59 @@ 
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-not { >> } "optimized" } } */
+/* { dg-final { scan-tree-dump-not { \* } "optimized" } } */
+
+unsigned int
+f1 (unsigned int x)
+{
+    x >>= 1;
+    unsigned long y = x;
+    y &= 255;
+    x = y;
+    x *= 2;
+    return x;
+}
+
+unsigned int
+f2 (unsigned int x)
+{
+    x >>= 1;
+    unsigned long y = x;
+    y &= -2UL;
+    x = y;
+    x *= 2;
+    return x;
+}
+
+unsigned int
+f3 (unsigned int x)
+{
+    x >>= 1;
+    unsigned short y = x;
+    y &= 255;
+    x = y;
+    x *= 2;
+    return x;
+}
+
+unsigned int
+f4 (unsigned int x)
+{
+    x >>= 1;
+    short y = x;
+    y &= (unsigned short) ~0U >> 1;
+    x = y;
+    x *= 2;
+    return x;
+}
+
+unsigned int
+f5 (unsigned int x)
+{
+    x >>= 16;
+    short y = x;
+    y &= -2;
+    x = y;
+    x *= 1 << 16;
+    return x;
+}
diff --git a/gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
new file mode 100644
index 00000000000..86eabef0fef
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-mul-and-lshift-2.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile { target int32 } } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump { >> 15;} "optimized" } } */
+/* { dg-final { scan-tree-dump { \* 32768;} "optimized" } } */
+
+unsigned int
+f1 (unsigned int x)
+{
+    x >>= 15;
+    short y = x;
+    y &= -2;
+    x = y;
+    x *= 1 << 15;
+    return x;
+}