middle-end Add an RPO pass after successful vectorization

Message ID patch-15007-tamar@arm.com
State Committed
Series middle-end Add an RPO pass after successful vectorization

Commit Message

Tamar Christina Nov. 2, 2021, 1:50 p.m. UTC
  Hi All,

Following my current SVE predicate optimization series a problem has presented
itself: the way vector masks are generated for masked operations relies
on CSE to share masks efficiently.

The issue, however, is that masking is done using the & operator, and since
& is associative, reassoc decides to reassociate the masked operations.

This then makes CSE unable to CSE an unmasked and a masked operation,
leading to duplicate operations being performed.
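
As an illustration, vectorizing a loop of roughly this shape with SVE loop
masking needs two predicates that share the a[i] > b[i] comparison (a
hypothetical example, not one of the testcases from the series):

  void
  f (int *restrict x, int *restrict y,
     int *restrict a, int *restrict b, int *restrict c, int n)
  {
    for (int i = 0; i < n; i++)
      {
        if (a[i] > b[i])
          x[i] = 0;
        if (a[i] > b[i] && a[i] > c[i])
          y[i] = 0;
      }
  }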

To counter this we want to add an RPO VN pass over the vectorized loop body
when vectorization succeeds.  This makes the mask sharing no longer reliant
on the RTL-level CSE.

I have not added a testcase for this as it requires the changes in my patch
series; however, the entire series relies on this patch to work, so all the
tests there cover it.

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
	successful vectorization.

--- inline copy of patch -- 
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "opt-problem.h"
 #include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -1323,6 +1323,27 @@ vectorize_loops (void)
 	 ???  Also while we try hard to update loop-closed SSA form we fail
 	 to properly do this in some corner-cases (see PR56286).  */
       rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
+
+      for (i = 1; i < number_of_loops (cfun); i++)
+	{
+	  loop = get_loop (cfun, i);
+	  if (!loop || !single_exit (loop))
+	    continue;
+
+	  bitmap exit_bbs;
+	  /* Perform local CSE, this esp. helps because we emit code for
+	     predicates that need to be shared for optimal predicate usage.
+	     However reassoc will re-order them and prevent CSE from working
+	     as it should.  CSE only the loop body, not the entry.  */
+	  exit_bbs = BITMAP_ALLOC (NULL);
+	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+	  bitmap_set_bit (exit_bbs, loop->latch->index);
+
+	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
+
+	  BITMAP_FREE (exit_bbs);
+	}
+
       return TODO_cleanup_cfg;
     }

--
  

Comments

Richard Biener Nov. 2, 2021, 2:23 p.m. UTC | #1
On Tue, 2 Nov 2021, Tamar Christina wrote:

> Hi All,
> 
> Following my current SVE predicate optimization series a problem has presented
> itself in that the way vector masks are generated for masked operations relies
> on CSE to share masks efficiently.
> 
> The issue however is that masking is done using the & operand and & is
> associative and so reassoc decides to reassociate the masked operations.

But it does this for the purpose of canonicalization and thus CSE.

> This makes CSE then unable to CSE an unmasked and a masked operation leading to
> duplicate operations being performed.
> 
> To counter this we want to add an RPO pass over the vectorized loop body when
> vectorization succeeds.  This makes it then no longer reliant on the RTL level
> CSE.
> 
> I have not added a testcase for this as it requires the changes in my patch
> series, however the entire series relies on this patch to work so all the
> tests there cover it.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu,
> x86_64-linux-gnu and no issues.
> 
> Ok for master?

You are running VN over _all_ loop bodies rather than only those vectorized.
We loop over the vectorized loops earlier for optimizing masked store
sequences.  I suppose you could hook in there.  I'll also note
that we have pass_pre_slp_scalar_cleanup which eventually runs,
plus we have a late FRE.  So I don't understand why it doesn't
work to CSE later.

  for (i = 1; i < number_of_loops (cfun); i++)
    {
      loop_vec_info loop_vinfo;
      bool has_mask_store;

      loop = get_loop (cfun, i);
      if (!loop || !loop->aux)
        continue;
      loop_vinfo = (loop_vec_info) loop->aux;
      has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
      delete loop_vinfo;
      if (has_mask_store
          && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
        optimize_mask_stores (loop);
      loop->aux = NULL;
    }


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
> 	successful vectorization.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "opt-problem.h"
>  #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>  
>  /* Loop or bb location, with hotness information.  */
>  dump_user_location_t vect_location;
> @@ -1323,6 +1323,27 @@ vectorize_loops (void)
>  	 ???  Also while we try hard to update loop-closed SSA form we fail
>  	 to properly do this in some corner-cases (see PR56286).  */
>        rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
> +
> +      for (i = 1; i < number_of_loops (cfun); i++)
> +	{
> +	  loop = get_loop (cfun, i);
> +	  if (!loop || !single_exit (loop))
> +	    continue;
> +
> +	  bitmap exit_bbs;
> +	  /* Perform local CSE, this esp. helps because we emit code for
> +	     predicates that need to be shared for optimal predicate usage.
> +	     However reassoc will re-order them and prevent CSE from working
> +	     as it should.  CSE only the loop body, not the entry.  */
> +	  exit_bbs = BITMAP_ALLOC (NULL);
> +	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +	  bitmap_set_bit (exit_bbs, loop->latch->index);
> +
> +	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> +
> +	  BITMAP_FREE (exit_bbs);
> +	}
> +
>        return TODO_cleanup_cfg;
>      }
>  
> 
> 
>
  
Tamar Christina Nov. 2, 2021, 2:36 p.m. UTC | #2
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, November 2, 2021 2:24 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> vectorization
> 
> On Tue, 2 Nov 2021, Tamar Christina wrote:
> 
> > Hi All,
> >
> > Following my current SVE predicate optimization series a problem has
> > presented itself in that the way vector masks are generated for masked
> > operations relies on CSE to share masks efficiently.
> >
> > The issue however is that masking is done using the & operand and & is
> > associative and so reassoc decides to reassociate the masked operations.
> 
> But it does this for the purpose of canonicalization and thus CSE.

Yes, but it turns something like

(a & b) & mask into a & (b & mask).

When (a & b) is used somewhere else you now lose the CSE.  So it's actually
hurting in this case.
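
Illustratively (a sketch, not from a real dump):

  t = a & b;           /* t is also used elsewhere */
  m = (a & b) & mask;  /* before reassoc: a & b CSEs with t */
  m = a & (b & mask);  /* after reassoc: a & b is gone, no CSE with t */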

> 
> > This makes CSE then unable to CSE an unmasked and a masked operation
> > leading to duplicate operations being performed.
> >
> > To counter this we want to add an RPO pass over the vectorized loop
> > body when vectorization succeeds.  This makes it then no longer
> > reliant on the RTL level CSE.
> >
> > I have not added a testcase for this as it requires the changes in my
> > patch series, however the entire series relies on this patch to work
> > so all the tests there cover it.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
> > no issues.
> >
> > Ok for master?
> 
> You are running VN over _all_ loop bodies rather only those vectorized.
> We loop over vectorized loops earlier for optimizing masked store sequences.
> I suppose you could hook in there.  I'll also notice that we have
> pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
> So I don't understand why it doesn't work to CSE later.
> 

Atm, say you have the conditions a > b, and a > b & a > c

We generate

mask1 = (a > b) & loop_mask
mask2 = (a > b & a > c) & loop_mask

with the intention that mask1 can be re-used in mask2.

Reassoc changes this to mask2 = a > b & (a > c & loop_mask)

Which has now unmasked (a > b) in mask2, and leaves us unable to combine
mask1 and mask2.  It doesn't generate incorrect code, just inefficient code.
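
The form we want CSE to produce instead is (a sketch of the intended
sharing):

  mask1 = (a > b) & loop_mask
  mask2 = mask1 & (a > c)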

>   for (i = 1; i < number_of_loops (cfun); i++)
>     {
>       loop_vec_info loop_vinfo;
>       bool has_mask_store;
> 
>       loop = get_loop (cfun, i);
>       if (!loop || !loop->aux)
>         continue;
>       loop_vinfo = (loop_vec_info) loop->aux;
>       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
>       delete loop_vinfo;
>       if (has_mask_store
>           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>         optimize_mask_stores (loop);
>       loop->aux = NULL;
>     }
> 

Ah thanks, I'll make the changes.

Thanks,
Tamar

> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN
> upon
> > 	successful vectorization.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> >
> 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c
> 660
> > af4b6d32dc51 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
> > #include "gimple-pretty-print.h"
> >  #include "opt-problem.h"
> >  #include "internal-fn.h"
> > -
> > +#include "tree-ssa-sccvn.h"
> >
> >  /* Loop or bb location, with hotness information.  */
> > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@
> > vectorize_loops (void)
> >  	 ???  Also while we try hard to update loop-closed SSA form we fail
> >  	 to properly do this in some corner-cases (see PR56286).  */
> >        rewrite_into_loop_closed_ssa (NULL,
> > TODO_update_ssa_only_virtuals);
> > +
> > +      for (i = 1; i < number_of_loops (cfun); i++)
> > +	{
> > +	  loop = get_loop (cfun, i);
> > +	  if (!loop || !single_exit (loop))
> > +	    continue;
> > +
> > +	  bitmap exit_bbs;
> > +	  /* Perform local CSE, this esp. helps because we emit code for
> > +	     predicates that need to be shared for optimal predicate usage.
> > +	     However reassoc will re-order them and prevent CSE from working
> > +	     as it should.  CSE only the loop body, not the entry.  */
> > +	  exit_bbs = BITMAP_ALLOC (NULL);
> > +	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > +	  bitmap_set_bit (exit_bbs, loop->latch->index);
> > +
> > +	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > +
> > +	  BITMAP_FREE (exit_bbs);
> > +	}
> > +
> >        return TODO_cleanup_cfg;
> >      }
> >
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
  
Richard Biener Nov. 2, 2021, 3:29 p.m. UTC | #3
On Tue, 2 Nov 2021, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, November 2, 2021 2:24 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > vectorization
> > 
> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > Following my current SVE predicate optimization series a problem has
> > > presented itself in that the way vector masks are generated for masked
> > > operations relies on CSE to share masks efficiently.
> > >
> > > The issue however is that masking is done using the & operand and & is
> > > associative and so reassoc decides to reassociate the masked operations.
> > 
> > But it does this for the purpose of canonicalization and thus CSE.
> 
> Yes, but it turns something like
> 
> (a & b) & mask into a & (b & mask).
> 
> When (a & b) is used somewhere else you now lose the CSE.  So it's actually hurting
> In this case.

OK, so that's a known "issue" with reassoc: it doesn't consider global
CSE opportunities, and I guess it pushes 'mask' to a leaf if it is loop
carried.

> > 
> > > This makes CSE then unable to CSE an unmasked and a masked operation
> > > leading to duplicate operations being performed.
> > >
> > > To counter this we want to add an RPO pass over the vectorized loop
> > > body when vectorization succeeds.  This makes it then no longer
> > > reliant on the RTL level CSE.
> > >
> > > I have not added a testcase for this as it requires the changes in my
> > > patch series, however the entire series relies on this patch to work
> > > so all the tests there cover it.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
> > > no issues.
> > >
> > > Ok for master?
> > 
> > You are running VN over _all_ loop bodies rather only those vectorized.
> > We loop over vectorized loops earlier for optimizing masked store sequences.
> > I suppose you could hook in there.  I'll also notice that we have
> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
> > So I don't understand why it doesn't work to CSE later.
> > 
> 
> Atm, say you have the conditions a > b, and a > b & a > c
> 
> We generate
> 
> mask1 = (a > b) & loop_mask
> mask2 = (a > b & a > c) & loop_mask
> 
> with the intention that mask1 can be re-used in mask2.
> 
> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
> 
> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
> the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.
> 
> >   for (i = 1; i < number_of_loops (cfun); i++)
> >     {
> >       loop_vec_info loop_vinfo;
> >       bool has_mask_store;
> > 
> >       loop = get_loop (cfun, i);
> >       if (!loop || !loop->aux)
> >         continue;
> >       loop_vinfo = (loop_vec_info) loop->aux;
> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> >       delete loop_vinfo;
> >       if (has_mask_store
> >           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> >         optimize_mask_stores (loop);
> >       loop->aux = NULL;
> >     }
> > 
> 
> Ah thanks, I'll make the changes.

Note I think that full-blown CSE is a bit overkill just to counter
a deficient reassoc (or VN).  At least it is supposed to be "cheap"
and can be conditionalized on loop masks being used as well.
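
For illustration, a minimal sketch of such a gate in the existing per-loop
walk (hypothetical, not what was committed; the flag would have to be read
from the loop_vec_info before it is deleted, just like has_mask_store is):

  bool used_loop_masks = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
  delete loop_vinfo;
  ...
  if (used_loop_masks)
    /* CSE only this loop's body, as in the patch.  */
    do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);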

> Thanks,
> Tamar
> 
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN
> > upon
> > > 	successful vectorization.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> > >
> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c
> > 660
> > > af4b6d32dc51 100644
> > > --- a/gcc/tree-vectorizer.c
> > > +++ b/gcc/tree-vectorizer.c
> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
> > > #include "gimple-pretty-print.h"
> > >  #include "opt-problem.h"
> > >  #include "internal-fn.h"
> > > -
> > > +#include "tree-ssa-sccvn.h"
> > >
> > >  /* Loop or bb location, with hotness information.  */
> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@
> > > vectorize_loops (void)
> > >  	 ???  Also while we try hard to update loop-closed SSA form we fail
> > >  	 to properly do this in some corner-cases (see PR56286).  */
> > >        rewrite_into_loop_closed_ssa (NULL,
> > > TODO_update_ssa_only_virtuals);
> > > +
> > > +      for (i = 1; i < number_of_loops (cfun); i++)
> > > +	{
> > > +	  loop = get_loop (cfun, i);
> > > +	  if (!loop || !single_exit (loop))
> > > +	    continue;
> > > +
> > > +	  bitmap exit_bbs;
> > > +	  /* Perform local CSE, this esp. helps because we emit code for
> > > +	     predicates that need to be shared for optimal predicate usage.
> > > +	     However reassoc will re-order them and prevent CSE from working
> > > +	     as it should.  CSE only the loop body, not the entry.  */
> > > +	  exit_bbs = BITMAP_ALLOC (NULL);
> > > +	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > +	  bitmap_set_bit (exit_bbs, loop->latch->index);
> > > +
> > > +	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > > +
> > > +	  BITMAP_FREE (exit_bbs);
> > > +	}
> > > +
> > >        return TODO_cleanup_cfg;
> > >      }
> > >
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
>
  
Richard Sandiford Nov. 2, 2021, 5:06 p.m. UTC | #4
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Tue, 2 Nov 2021, Tamar Christina wrote:
>
>> > -----Original Message-----
>> > From: Richard Biener <rguenther@suse.de>
>> > Sent: Tuesday, November 2, 2021 2:24 PM
>> > To: Tamar Christina <Tamar.Christina@arm.com>
>> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
>> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
>> > vectorization
>> > 
>> > On Tue, 2 Nov 2021, Tamar Christina wrote:
>> > 
>> > > Hi All,
>> > >
>> > > Following my current SVE predicate optimization series a problem has
>> > > presented itself in that the way vector masks are generated for masked
>> > > operations relies on CSE to share masks efficiently.
>> > >
>> > > The issue however is that masking is done using the & operand and & is
>> > > associative and so reassoc decides to reassociate the masked operations.
>> > 
>> > But it does this for the purpose of canonicalization and thus CSE.
>> 
>> Yes, but it turns something like
>> 
>> (a & b) & mask into a & (b & mask).
>> 
>> When (a & b) is used somewhere else you now lose the CSE.  So it's actually hurting
>> In this case.
>
> OK, so that's a known "issue" with reassoc, it doesn't consider global
> CSE opportunities and I guess it pushes 'mask' to leaf if it is loop
> carried.
>
>> > 
>> > > This makes CSE then unable to CSE an unmasked and a masked operation
>> > > leading to duplicate operations being performed.
>> > >
>> > > To counter this we want to add an RPO pass over the vectorized loop
>> > > body when vectorization succeeds.  This makes it then no longer
>> > > reliant on the RTL level CSE.
>> > >
>> > > I have not added a testcase for this as it requires the changes in my
>> > > patch series, however the entire series relies on this patch to work
>> > > so all the tests there cover it.
>> > >
>> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
>> > > no issues.
>> > >
>> > > Ok for master?
>> > 
>> > You are running VN over _all_ loop bodies rather only those vectorized.
>> > We loop over vectorized loops earlier for optimizing masked store sequences.
>> > I suppose you could hook in there.  I'll also notice that we have
>> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
>> > So I don't understand why it doesn't work to CSE later.
>> > 
>> 
>> Atm, say you have the conditions a > b, and a > b & a > c
>> 
>> We generate
>> 
>> mask1 = (a > b) & loop_mask
>> mask2 = (a > b & a > c) & loop_mask
>> 
>> with the intention that mask1 can be re-used in mask2.
>> 
>> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
>> 
>> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
>> the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.
>> 
>> >   for (i = 1; i < number_of_loops (cfun); i++)
>> >     {
>> >       loop_vec_info loop_vinfo;
>> >       bool has_mask_store;
>> > 
>> >       loop = get_loop (cfun, i);
>> >       if (!loop || !loop->aux)
>> >         continue;
>> >       loop_vinfo = (loop_vec_info) loop->aux;
>> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
>> >       delete loop_vinfo;
>> >       if (has_mask_store
>> >           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>> >         optimize_mask_stores (loop);
>> >       loop->aux = NULL;
>> >     }
>> > 
>> 
>> Ah thanks, I'll make the changes.
>
> Note I think that full-blown CSE is a bit overkill just to counter
> a deficient reassoc (or VN).  At least it is supposed to be "cheap"
> and can be conditionalized on loop masks being used as well.

Not sure we should make this conditional on loop masks being used.
It seems either that:

(a) the vectoriser is supposed to avoid creating code that has folding
    or VN opportunities, in which case we need to generate the vectorised
    code in a smarter way or

(b) the vectoriser is allowed to create code that has folding or VN
    opportunities, in which case it would be good to have a defined
    place to get rid of them.

I'm just worried that if we make it conditional on loop masks,
we could see cases in which non-loop-mask stuff is optimised
differently based on whether the loop has masks or not.  E.g.
we might get worse code with an unpredicated main loop and
a predicated epilogue compared to a predicated main loop.

Thanks,
Richard
>
>> Thanks,
>> Tamar
>> 
>> > 
>> > > Thanks,
>> > > Tamar
>> > >
>> > > gcc/ChangeLog:
>> > >
>> > > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN
>> > upon
>> > > 	successful vectorization.
>> > >
>> > > --- inline copy of patch --
>> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
>> > >
>> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c
>> > 660
>> > > af4b6d32dc51 100644
>> > > --- a/gcc/tree-vectorizer.c
>> > > +++ b/gcc/tree-vectorizer.c
>> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>> > > #include "gimple-pretty-print.h"
>> > >  #include "opt-problem.h"
>> > >  #include "internal-fn.h"
>> > > -
>> > > +#include "tree-ssa-sccvn.h"
>> > >
>> > >  /* Loop or bb location, with hotness information.  */
>> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@
>> > > vectorize_loops (void)
>> > >  	 ???  Also while we try hard to update loop-closed SSA form we fail
>> > >  	 to properly do this in some corner-cases (see PR56286).  */
>> > >        rewrite_into_loop_closed_ssa (NULL,
>> > > TODO_update_ssa_only_virtuals);
>> > > +
>> > > +      for (i = 1; i < number_of_loops (cfun); i++)
>> > > +	{
>> > > +	  loop = get_loop (cfun, i);
>> > > +	  if (!loop || !single_exit (loop))
>> > > +	    continue;
>> > > +
>> > > +	  bitmap exit_bbs;
>> > > +	  /* Perform local CSE, this esp. helps because we emit code for
>> > > +	     predicates that need to be shared for optimal predicate usage.
>> > > +	     However reassoc will re-order them and prevent CSE from working
>> > > +	     as it should.  CSE only the loop body, not the entry.  */
>> > > +	  exit_bbs = BITMAP_ALLOC (NULL);
>> > > +	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
>> > > +	  bitmap_set_bit (exit_bbs, loop->latch->index);
>> > > +
>> > > +	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
>> > > +
>> > > +	  BITMAP_FREE (exit_bbs);
>> > > +	}
>> > > +
>> > >        return TODO_cleanup_cfg;
>> > >      }
>> > >
>> > >
>> > >
>> > >
>> > 
>> > --
>> > Richard Biener <rguenther@suse.de>
>> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
>> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
>>
  
Richard Biener Nov. 2, 2021, 6:22 p.m. UTC | #5
On Tue, 2 Nov 2021, Richard Sandiford wrote:

> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> >
> >> > -----Original Message-----
> >> > From: Richard Biener <rguenther@suse.de>
> >> > Sent: Tuesday, November 2, 2021 2:24 PM
> >> > To: Tamar Christina <Tamar.Christina@arm.com>
> >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> >> > vectorization
> >> > 
> >> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> >> > 
> >> > > Hi All,
> >> > >
> >> > > Following my current SVE predicate optimization series a problem has
> >> > > presented itself in that the way vector masks are generated for masked
> >> > > operations relies on CSE to share masks efficiently.
> >> > >
> >> > > The issue however is that masking is done using the & operand and & is
> >> > > associative and so reassoc decides to reassociate the masked operations.
> >> > 
> >> > But it does this for the purpose of canonicalization and thus CSE.
> >> 
> >> Yes, but it turns something like
> >> 
> >> (a & b) & mask into a & (b & mask).
> >> 
> >> When (a & b) is used somewhere else you now lose the CSE.  So it's actually hurting
> >> In this case.
> >
> > OK, so that's a known "issue" with reassoc, it doesn't consider global
> > CSE opportunities and I guess it pushes 'mask' to leaf if it is loop
> > carried.
> >
> >> > 
> >> > > This makes CSE then unable to CSE an unmasked and a masked operation
> >> > > leading to duplicate operations being performed.
> >> > >
> >> > > To counter this we want to add an RPO pass over the vectorized loop
> >> > > body when vectorization succeeds.  This makes it then no longer
> >> > > reliant on the RTL level CSE.
> >> > >
> >> > > I have not added a testcase for this as it requires the changes in my
> >> > > patch series, however the entire series relies on this patch to work
> >> > > so all the tests there cover it.
> >> > >
> >> > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-linux-gnu and
> >> > > no issues.
> >> > >
> >> > > Ok for master?
> >> > 
> >> > You are running VN over _all_ loop bodies rather only those vectorized.
> >> > We loop over vectorized loops earlier for optimizing masked store sequences.
> >> > I suppose you could hook in there.  I'll also notice that we have
> >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a late FRE.
> >> > So I don't understand why it doesn't work to CSE later.
> >> > 
> >> 
> >> Atm, say you have the conditions a > b, and a > b & a > c
> >> 
> >> We generate
> >> 
> >> mask1 = (a > b) & loop_mask
> >> mask2 = (a > b & a > c) & loop_mask
> >> 
> >> with the intention that mask1 can be re-used in mask2.
> >> 
> >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
> >> 
> >> Which has now unmasked (a > b) in mask2, which leaves us unable to combine
> >> the mask1 and mask2.  It doesn't generate incorrect code, just inefficient.
> >> 
> >> >   for (i = 1; i < number_of_loops (cfun); i++)
> >> >     {
> >> >       loop_vec_info loop_vinfo;
> >> >       bool has_mask_store;
> >> > 
> >> >       loop = get_loop (cfun, i);
> >> >       if (!loop || !loop->aux)
> >> >         continue;
> >> >       loop_vinfo = (loop_vec_info) loop->aux;
> >> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> >> >       delete loop_vinfo;
> >> >       if (has_mask_store
> >> >           && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
> >> >         optimize_mask_stores (loop);
> >> >       loop->aux = NULL;
> >> >     }
> >> > 
> >> 
> >> Ah thanks, I'll make the changes.
> >
> > Note I think that full-blown CSE is a bit overkill just to counter
> > a deficient reassoc (or VN).  At least it is supposed to be "cheap"
> > and can be conditionalized on loop masks being used as well.
> 
> Not sure we should make this conditional on loop masks being used.
> It seems either that:
> 
> (a) the vectoriser is supposed to avoid creating code that has folding
>     or VN opportunities, in which case we need to generate the vectorised
>     code in a smarter way or
> 
> (b) the vectoriser is allowed to create code that has folding or VN
>     opportunities, in which case it would be good to have a defined
>     place to get rid of them.

It's certainly (b), and the definitive place to get rid of those is
the post-loop optimizer FRE pass.  That just happens to be after
a reassoc pass which makes FRE run into the pre-existing issue
that we fail to capture all (or the best) possible CSE opportunity
from separate associatable chains.

> I'm just worried that if we make it conditional on loop masks,
> we could see cases that in which non-loop-mask stuff is optimised
> differently based on whether the loop has masks or not.  E.g.
> we might get worse code with an unpredicated main loop and
> a predicated epilogue compared to a predicated main loop.

Sure.  Note for loop vectorization we can indeed reasonably easily
CSE the main body, and RPO VN should be O(region size) and cheap
enough for this case (we could even add an extra cheap entry
for single basic blocks).  For BB vectorization we have to rely
on the full-function FRE pass though.

Richard.

> Thanks,
> Richard
> >
> >> Thanks,
> >> Tamar
> >> 
> >> > 
> >> > > Thanks,
> >> > > Tamar
> >> > >
> >> > > gcc/ChangeLog:
> >> > >
> >> > > 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN
> >> > upon
> >> > > 	successful vectorization.
> >> > >
> >> > > --- inline copy of patch --
> >> > > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index
> >> > >
> >> > 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c
> >> > 660
> >> > > af4b6d32dc51 100644
> >> > > --- a/gcc/tree-vectorizer.c
> >> > > +++ b/gcc/tree-vectorizer.c
> >> > > @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
> >> > > #include "gimple-pretty-print.h"
> >> > >  #include "opt-problem.h"
> >> > >  #include "internal-fn.h"
> >> > > -
> >> > > +#include "tree-ssa-sccvn.h"
> >> > >
> >> > >  /* Loop or bb location, with hotness information.  */
> >> > > dump_user_location_t vect_location; @@ -1323,6 +1323,27 @@
> >> > > vectorize_loops (void)
> >> > >  	 ???  Also while we try hard to update loop-closed SSA form we fail
> >> > >  	 to properly do this in some corner-cases (see PR56286).  */
> >> > >        rewrite_into_loop_closed_ssa (NULL,
> >> > > TODO_update_ssa_only_virtuals);
> >> > > +
> >> > > +      for (i = 1; i < number_of_loops (cfun); i++)
> >> > > +	{
> >> > > +	  loop = get_loop (cfun, i);
> >> > > +	  if (!loop || !single_exit (loop))
> >> > > +	    continue;
> >> > > +
> >> > > +	  bitmap exit_bbs;
> >> > > +	  /* Perform local CSE, this esp. helps because we emit code for
> >> > > +	     predicates that need to be shared for optimal predicate usage.
> >> > > +	     However reassoc will re-order them and prevent CSE from working
> >> > > +	     as it should.  CSE only the loop body, not the entry.  */
> >> > > +	  exit_bbs = BITMAP_ALLOC (NULL);
> >> > > +	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> >> > > +	  bitmap_set_bit (exit_bbs, loop->latch->index);
> >> > > +
> >> > > +	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> >> > > +
> >> > > +	  BITMAP_FREE (exit_bbs);
> >> > > +	}
> >> > > +
> >> > >        return TODO_cleanup_cfg;
> >> > >      }
> >> > >
> >> > >
> >> > >
> >> > >
> >> > 
> >> > --
> >> > Richard Biener <rguenther@suse.de>
> >> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> >> > Nuernberg, Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
> >> 
>
  
Tamar Christina Nov. 5, 2021, 11:45 a.m. UTC | #6
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, November 2, 2021 6:22 PM
> To: Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> vectorization
> 
> On Tue, 2 Nov 2021, Richard Sandiford wrote:
> 
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > >
> > >> > -----Original Message-----
> > >> > From: Richard Biener <rguenther@suse.de>
> > >> > Sent: Tuesday, November 2, 2021 2:24 PM
> > >> > To: Tamar Christina <Tamar.Christina@arm.com>
> > >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > >> > vectorization
> > >> >
> > >> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > Following my current SVE predicate optimization series a
> > >> > > problem has presented itself in that the way vector masks are
> > >> > > generated for masked operations relies on CSE to share masks
> efficiently.
> > >> > >
> > >> > > The issue however is that masking is done using the & operand
> > >> > > and & is associative and so reassoc decides to reassociate the
> masked operations.
> > >> >
> > >> > But it does this for the purpose of canonicalization and thus CSE.
> > >>
> > >> Yes, but it turns something like
> > >>
> > >> (a & b) & mask into a & (b & mask).
> > >>
> > >> When (a & b) is used somewhere else you now lose the CSE.  So it's
> > >> actually hurting In this case.
> > >
> > > OK, so that's a known "issue" with reassoc, it doesn't consider
> > > global CSE opportunities and I guess it pushes 'mask' to leaf if it
> > > is loop carried.
> > >
> > >> >
> > >> > > This makes CSE then unable to CSE an unmasked and a masked
> > >> > > operation leading to duplicate operations being performed.
> > >> > >
> > >> > > To counter this we want to add an RPO pass over the vectorized
> > >> > > loop body when vectorization succeeds.  This makes it then no
> > >> > > longer reliant on the RTL level CSE.
> > >> > >
> > >> > > I have not added a testcase for this as it requires the changes
> > >> > > in my patch series, however the entire series relies on this
> > >> > > patch to work so all the tests there cover it.
> > >> > >
> > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > >> > > x86_64-linux-gnu and no issues.
> > >> > >
> > >> > > Ok for master?
> > >> >
> > >> > You are running VN over _all_ loop bodies rather only those
> vectorized.
> > >> > We loop over vectorized loops earlier for optimizing masked store
> sequences.
> > >> > I suppose you could hook in there.  I'll also notice that we have
> > >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a
> late FRE.
> > >> > So I don't understand why it doesn't work to CSE later.
> > >> >
> > >>
> > >> Atm, say you have the conditions a > b, and a > b & a > c
> > >>
> > >> We generate
> > >>
> > >> mask1 = (a > b) & loop_mask
> > >> mask2 = (a > b & a > c) & loop_mask
> > >>
> > >> with the intention that mask1 can be re-used in mask2.
> > >>
> > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
> > >>
> > >> Which has now unmasked (a > b) in mask2, which leaves us unable to
> > >> combine the mask1 and mask2.  It doesn't generate incorrect code, just
> inefficient.
> > >>
> > >> >   for (i = 1; i < number_of_loops (cfun); i++)
> > >> >     {
> > >> >       loop_vec_info loop_vinfo;
> > >> >       bool has_mask_store;
> > >> >
> > >> >       loop = get_loop (cfun, i);
> > >> >       if (!loop || !loop->aux)
> > >> >         continue;
> > >> >       loop_vinfo = (loop_vec_info) loop->aux;
> > >> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > >> >       delete loop_vinfo;
> > >> >       if (has_mask_store
> > >> >           && targetm.vectorize.empty_mask_is_expensive
> (IFN_MASK_STORE))
> > >> >         optimize_mask_stores (loop);
> > >> >       loop->aux = NULL;
> > >> >     }
> > >> >
> > >>
> > >> Ah thanks, I'll make the changes.
> > >
> > > Note I think that full-blown CSE is a bit overkill just to counter a
> > > deficient reassoc (or VN).  At least it is supposed to be "cheap"
> > > and can be conditionalized on loop masks being used as well.
> >
> > Not sure we should make this conditional on loop masks being used.
> > It seems either that:
> >
> > (a) the vectoriser is supposed to avoid creating code that has folding
> >     or VN opportunities, in which case we need to generate the vectorised
> >     code in a smarter way or
> >
> > (b) the vectoriser is allowed to create code that has folding or VN
> >     opportunities, in which case it would be good to have a defined
> >     place to get rid of them.
> 
> It's certainly (b), and the definitive place to get rid of those is the post-loop
> optimizer FRE pass.  That just happens to be after a reassoc pass which
> makes FRE run into the pre-existing issue that we fail to capture all (or the
> best) possible CSE opportunity from separate associatable chains.
> 
> > I'm just worried that if we make it conditional on loop masks, we
> > could see cases that in which non-loop-mask stuff is optimised
> > differently based on whether the loop has masks or not.  E.g.
> > we might get worse code with an unpredicated main loop and a
> > predicated epilogue compared to a predicated main loop.
> 
> Sure.  Note for loop vectorization we can indeed reasonably easy CSE the
> main body and RPO VN should be O(region size) and cheap enough for this
> case (we could even add an extra cheap entry for single basic-blocks).  For BB
> vectorization we have to rely on the full function FRE pass though.
> 

I've moved the code so it works only on the vectorized loops.  I don't think
there needs to be any code change to handle a single BB efficiently, no?  The
walk stops at the loop exit block, so it's already optimal in that case,
unless I misunderstood?

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
	successful vectorization.

--- inline copy of patch ---

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "opt-problem.h"
 #include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -1298,6 +1298,20 @@ vectorize_loops (void)
       if (has_mask_store
 	  && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
 	optimize_mask_stores (loop);
+
+      bitmap exit_bbs;
+      /* Perform local CSE, this esp. helps because we emit code for
+	 predicates that need to be shared for optimal predicate usage.
+	 However reassoc will re-order them and prevent CSE from working
+	 as it should.  CSE only the loop body, not the entry.  */
+      exit_bbs = BITMAP_ALLOC (NULL);
+      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+      bitmap_set_bit (exit_bbs, loop->latch->index);
+
+      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
+
+      BITMAP_FREE (exit_bbs);
+
       loop->aux = NULL;
     }
  
Richard Biener Nov. 5, 2021, 12:24 p.m. UTC | #7
On Fri, 5 Nov 2021, Tamar Christina wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, November 2, 2021 6:22 PM
> > To: Richard Sandiford <Richard.Sandiford@arm.com>
> > Cc: Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>; Tamar
> > Christina <Tamar.Christina@arm.com>; nd <nd@arm.com>
> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > vectorization
> > 
> > On Tue, 2 Nov 2021, Richard Sandiford wrote:
> > 
> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > > >
> > > >> > -----Original Message-----
> > > >> > From: Richard Biener <rguenther@suse.de>
> > > >> > Sent: Tuesday, November 2, 2021 2:24 PM
> > > >> > To: Tamar Christina <Tamar.Christina@arm.com>
> > > >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > > >> > Subject: Re: [PATCH]middle-end Add an RPO pass after successful
> > > >> > vectorization
> > > >> >
> > > >> > On Tue, 2 Nov 2021, Tamar Christina wrote:
> > > >> >
> > > >> > > Hi All,
> > > >> > >
> > > >> > > Following my current SVE predicate optimization series a
> > > >> > > problem has presented itself in that the way vector masks are
> > > >> > > generated for masked operations relies on CSE to share masks
> > efficiently.
> > > >> > >
> > > >> > > The issue however is that masking is done using the & operand
> > > >> > > and & is associative and so reassoc decides to reassociate the
> > masked operations.
> > > >> >
> > > >> > But it does this for the purpose of canonicalization and thus CSE.
> > > >>
> > > >> Yes, but it turns something like
> > > >>
> > > >> (a & b) & mask into a & (b & mask).
> > > >>
> > > >> When (a & b) is used somewhere else you now lose the CSE.  So it's
> > > >> actually hurting In this case.
> > > >
> > > > OK, so that's a known "issue" with reassoc, it doesn't consider
> > > > global CSE opportunities and I guess it pushes 'mask' to leaf if it
> > > > is loop carried.
> > > >
> > > >> >
> > > >> > > This makes CSE then unable to CSE an unmasked and a masked
> > > >> > > operation leading to duplicate operations being performed.
> > > >> > >
> > > >> > > To counter this we want to add an RPO pass over the vectorized
> > > >> > > loop body when vectorization succeeds.  This makes it then no
> > > >> > > longer reliant on the RTL level CSE.
> > > >> > >
> > > >> > > I have not added a testcase for this as it requires the changes
> > > >> > > in my patch series, however the entire series relies on this
> > > >> > > patch to work so all the tests there cover it.
> > > >> > >
> > > >> > > Bootstrapped Regtested on aarch64-none-linux-gnu,
> > > >> > > x86_64-linux-gnu and no issues.
> > > >> > >
> > > >> > > Ok for master?
> > > >> >
> > > >> > You are running VN over _all_ loop bodies rather only those
> > vectorized.
> > > >> > We loop over vectorized loops earlier for optimizing masked store
> > sequences.
> > > >> > I suppose you could hook in there.  I'll also notice that we have
> > > >> > pass_pre_slp_scalar_cleanup which eventually runs plus we have a
> > late FRE.
> > > >> > So I don't understand why it doesn't work to CSE later.
> > > >> >
> > > >>
> > > >> Atm, say you have the conditions a > b, and a > b & a > c
> > > >>
> > > >> We generate
> > > >>
> > > >> mask1 = (a > b) & loop_mask
> > > >> mask2 = (a > b & a > c) & loop_mask
> > > >>
> > > >> with the intention that mask1 can be re-used in mask2.
> > > >>
> > > >> Reassoc changes this to mask2 = a > b & (a > c & loop_mask)
> > > >>
> > > >> Which has now unmasked (a > b) in mask2, which leaves us unable to
> > > >> combine the mask1 and mask2.  It doesn't generate incorrect code, just
> > inefficient.
> > > >>
> > > >> >   for (i = 1; i < number_of_loops (cfun); i++)
> > > >> >     {
> > > >> >       loop_vec_info loop_vinfo;
> > > >> >       bool has_mask_store;
> > > >> >
> > > >> >       loop = get_loop (cfun, i);
> > > >> >       if (!loop || !loop->aux)
> > > >> >         continue;
> > > >> >       loop_vinfo = (loop_vec_info) loop->aux;
> > > >> >       has_mask_store = LOOP_VINFO_HAS_MASK_STORE (loop_vinfo);
> > > >> >       delete loop_vinfo;
> > > >> >       if (has_mask_store
> > > >> >           && targetm.vectorize.empty_mask_is_expensive
> > (IFN_MASK_STORE))
> > > >> >         optimize_mask_stores (loop);
> > > >> >       loop->aux = NULL;
> > > >> >     }
> > > >> >
> > > >>
> > > >> Ah thanks, I'll make the changes.
> > > >
> > > > Note I think that full-blown CSE is a bit overkill just to counter a
> > > > deficient reassoc (or VN).  At least it is supposed to be "cheap"
> > > > and can be conditionalized on loop masks being used as well.
> > >
> > > Not sure we should make this conditional on loop masks being used.
> > > It seems either that:
> > >
> > > (a) the vectoriser is supposed to avoid creating code that has folding
> > >     or VN opportunities, in which case we need to generate the vectorised
> > >     code in a smarter way or
> > >
> > > (b) the vectoriser is allowed to create code that has folding or VN
> > >     opportunities, in which case it would be good to have a defined
> > >     place to get rid of them.
> > 
> > It's certainly (b), and the definitive place to get rid of those is the post-loop
> > optimizer FRE pass.  That just happens to be after a reassoc pass which
> > makes FRE run into the pre-existing issue that we fail to capture all (or the
> > best) possible CSE opportunity from separate associatable chains.
> > 
> > > I'm just worried that if we make it conditional on loop masks, we
> > > could see cases that in which non-loop-mask stuff is optimised
> > > differently based on whether the loop has masks or not.  E.g.
> > > we might get worse code with an unpredicated main loop and a
> > > predicated epilogue compared to a predicated main loop.
> > 
> > Sure.  Note for loop vectorization we can indeed reasonably easy CSE the
> > main body and RPO VN should be O(region size) and cheap enough for this
> > case (we could even add an extra cheap entry for single basic-blocks).  For BB
> > vectorization we have to rely on the full function FRE pass though.
> > 
> 
> I've moved the code so it works only on the vector loops. I don't think there needs
> to be any code change to handle single BB efficiently no? the walk stop at the loop
> exit block so it's already optimal in the case, unless I misunderstood?

Yeah, it's at most some constant time work we could avoid.

> Bootstrapped Regtested on aarch64-none-linux-gnu,
> x86_64-linux-gnu and no issues.
> 
> Ok for master?

See below...

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* tree-vectorizer.c (vectorize_loops): Do local CSE through RPVN upon
> 	successful vectorization.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index edb7538a67f00cd80a608ee82510cf437fe88083..9d8015ea35984963db1ddb90af32935d08c8a0e8 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "opt-problem.h"
>  #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>  
>  /* Loop or bb location, with hotness information.  */
>  dump_user_location_t vect_location;
> @@ -1298,6 +1298,20 @@ vectorize_loops (void)
>        if (has_mask_store
>  	  && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>  	optimize_mask_stores (loop);
> +
> +      bitmap exit_bbs;
> +      /* Perform local CSE, this esp. helps because we emit code for
> +	 predicates that need to be shared for optimal predicate usage.
> +	 However reassoc will re-order them and prevent CSE from working
> +	 as it should.  CSE only the loop body, not the entry.  */
> +      exit_bbs = BITMAP_ALLOC (NULL);

use auto_bitmap exit_bbs; and the allocation and ...

> +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +      bitmap_set_bit (exit_bbs, loop->latch->index);

treating the latch as exit is probably premature optimization
(yes, it's empty).

> +
> +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> +
> +      BITMAP_FREE (exit_bbs);

... deallocation can go.  Note I wonder whether, if we are already
spinning up VN, we should include the preheader in the operation?
We regularly end up emitting redundant vector initializers that could
be cleaned up earlier this way.

Otherwise the change looks OK.

Thanks,
Richard.
  
Tamar Christina Nov. 9, 2021, 3:04 p.m. UTC | #8
> > +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > +      bitmap_set_bit (exit_bbs, loop->latch->index);
> 
> treating the latch as exit is probably premature optimization (yes, it's empty).
> 
> > +
> > +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > +
> > +      BITMAP_FREE (exit_bbs);
> 
> ... deallocation can go.  Note I wonder whether, if we are already spinning up
> VN, we should include the preheader in the operation?
> We regularly end up emitting redundant vector initializers that could be
> cleaned up earlier this way.

I've changed it to include the preheader, but it looks like this breaks
bootstrap on both x86 and AArch64.

On x86 the following testcase

double matmul_c8_vanilla_bbase_0;
double *matmul_c8_vanilla_dest;
matmul_c8_vanilla_x;
matmul_c8_vanilla() {
  for (; matmul_c8_vanilla_x; matmul_c8_vanilla_x++)
    matmul_c8_vanilla_dest[matmul_c8_vanilla_x] += matmul_c8_vanilla_bbase_0;
}

ICEs with -std=gnu11 -ffast-math -ftree-vectorize -O2 with:

internal compiler error: tree check: expected ssa_name, have var_decl in SSA_VAL, at tree-ssa-sccvn.c:535
0x80731c tree_check_failed(tree_node const*, char const*, int, char const*, ...)
        ../gcc-dsg/gcc/tree.c:8689
0x7ebda2 tree_check(tree_node*, char const*, int, char const*, tree_code)
        ../gcc-dsg/gcc/tree.h:3433
0x7ebda2 SSA_VAL(tree_node*, bool*)
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:535
0x7ebda2 vuse_ssa_val
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:553
0x7ebda2 vn_reference_lookup(tree_node*, tree_node*, vn_lookup_kind, vn_reference_s**, bool, tree_node**, tree_node*)
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:3664
0x10d8ca5 visit_reference_op_load
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:5166
0x10d8ca5 visit_stmt
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:5615
0x10d976c process_bb
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:7344
0x10dafe5 do_rpo_vn
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:7942
0x10dc828 do_rpo_vn(function*, edge_def*, bitmap_head*)
        ../gcc-dsg/gcc/tree-ssa-sccvn.c:8039
0x119c39c vectorize_loops()
        ../gcc-dsg/gcc/tree-vectorizer.c:1304

on AArch64 this one ICEs with -ffast-math -ftree-vectorize -O2

_Complex *a;
_Complex b;
c, d;
fn1() {
  _Complex e;
  for (; c; ++c)
    e = d * a[c];
  b = e;
}

With the message 

internal compiler error: tree check: expected ssa_name, have var_decl in VN_INFO, at tree-ssa-sccvn.c:451
0x734073 tree_check_failed(tree_node const*, char const*, int, char const*, ...)
        ../../gcc-fsf/gcc/tree.c:8691
0x10e2e2f tree_check(tree_node*, char const*, int, char const*, tree_code)
        ../../gcc-fsf/gcc/tree.h:3433
0x10e2e2f VN_INFO(tree_node*)
        ../../gcc-fsf/gcc/tree-ssa-sccvn.c:451
0x10ed223 process_bb
        ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7331
0x10eea43 do_rpo_vn
        ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7944
0x10efe2b do_rpo_vn(function*, edge_def*, bitmap_head*)
        ../../gcc-fsf/gcc/tree-ssa-sccvn.c:8039
0x11c436b vectorize_loops()
        ../../gcc-fsf/gcc/tree-vectorizer.c:1304

Any ideas?

Thanks,
Tamar

> 
> Otherwise the change looks OK.
> 

--- inline copy of patch ---

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index edb7538a67f00cd80a608ee82510cf437fe88083..029d59016c9652f87d80fc5500f89532c79a66d0 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "opt-problem.h"
 #include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -1298,6 +1298,17 @@ vectorize_loops (void)
       if (has_mask_store
 	  && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
 	optimize_mask_stores (loop);
+
+      auto_bitmap exit_bbs;
+      /* Perform local CSE, this esp. helps because we emit code for
+	 predicates that need to be shared for optimal predicate usage.
+	 However reassoc will re-order them and prevent CSE from working
+	 as it should.  CSE only the loop body, not the entry.  */
+      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+
+      edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
+      do_rpo_vn (cfun, entry, exit_bbs);
+
       loop->aux = NULL;
     }
  
Richard Biener Nov. 10, 2021, 7:17 a.m. UTC | #9
On Tue, 9 Nov 2021, Tamar Christina wrote:

> > > +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > +      bitmap_set_bit (exit_bbs, loop->latch->index);
> > 
> > treating the latch as exit is probably premature optimization (yes, it's empty).
> > 
> > > +
> > > +      do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> > > +
> > > +      BITMAP_FREE (exit_bbs);
> > 
> > ... deallocation can go.  Note I wonder whether, if we are already spinning up
> > VN, we should include the preheader in the operation?
> > We regularly end up emitting redundant vector initializers that could be
> > cleaned up earlier this way.
> 
> I've change it to include the preheader but it looks like this breaks bootstrap on both
> x86 and AArch64.
> 
> On x86 the following testcase
> 
> double matmul_c8_vanilla_bbase_0;
> double *matmul_c8_vanilla_dest;
> matmul_c8_vanilla_x;
> matmul_c8_vanilla() {
>   for (; matmul_c8_vanilla_x; matmul_c8_vanilla_x++)
>     matmul_c8_vanilla_dest[matmul_c8_vanilla_x] += matmul_c8_vanilla_bbase_0;
> }
> 
> ICEs with -std=gnu11 -ffast-math -ftree-vectorize -O2 with:
> 
> internal compiler error: tree check: expected ssa_name, have var_decl in SSA_VAL, at tree-ssa-sccvn.c:535
> 0x80731c tree_check_failed(tree_node const*, char const*, int, char const*, ...)
>         ../gcc-dsg/gcc/tree.c:8689
> 0x7ebda2 tree_check(tree_node*, char const*, int, char const*, tree_code)
>         ../gcc-dsg/gcc/tree.h:3433
> 0x7ebda2 SSA_VAL(tree_node*, bool*)
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:535
> 0x7ebda2 vuse_ssa_val
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:553
> 0x7ebda2 vn_reference_lookup(tree_node*, tree_node*, vn_lookup_kind, vn_reference_s**, bool, tree_node**, tree_node*)
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:3664
> 0x10d8ca5 visit_reference_op_load
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:5166
> 0x10d8ca5 visit_stmt
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:5615
> 0x10d976c process_bb
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:7344
> 0x10dafe5 do_rpo_vn
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:7942
> 0x10dc828 do_rpo_vn(function*, edge_def*, bitmap_head*)
>         ../gcc-dsg/gcc/tree-ssa-sccvn.c:8039
> 0x119c39c vectorize_loops()
>         ../gcc-dsg/gcc/tree-vectorizer.c:1304

OK, as I thought, this is due to .MEMs not being in SSA form.  We're fixing
that up later, so maybe try the attached, which re-orders the
postprocessing after vectorization to do that earlier.
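
A rough, hypothetical sketch of the described re-ordering (the attachment
itself is not part of this archive): run the deferred virtual SSA update
before the per-loop walk, so the region VN sees the .MEM operands in SSA
form:

  /* Fix up virtual operands first ...  */
  update_ssa (TODO_update_ssa_only_virtuals);

  /* ... then walk the vectorized loops and run do_rpo_vn on each body.  */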

Richard.

> on AArch64 this one ICEs with -ffast-math -ftree-vectorize -O2
> 
> _Complex *a;
> _Complex b;
> c, d;
> fn1() {
>   _Complex e;
>   for (; c; ++c)
>     e = d * a[c];
>   b = e;
> }
> 
> With the message 
> 
> internal compiler error: tree check: expected ssa_name, have var_decl in VN_INFO, at tree-ssa-sccvn.c:451
> 0x734073 tree_check_failed(tree_node const*, char const*, int, char const*, ...)
>         ../../gcc-fsf/gcc/tree.c:8691
> 0x10e2e2f tree_check(tree_node*, char const*, int, char const*, tree_code)
>         ../../gcc-fsf/gcc/tree.h:3433
> 0x10e2e2f VN_INFO(tree_node*)
>         ../../gcc-fsf/gcc/tree-ssa-sccvn.c:451
> 0x10ed223 process_bb
>         ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7331
> 0x10eea43 do_rpo_vn
>         ../../gcc-fsf/gcc/tree-ssa-sccvn.c:7944
> 0x10efe2b do_rpo_vn(function*, edge_def*, bitmap_head*)
>         ../../gcc-fsf/gcc/tree-ssa-sccvn.c:8039
> 0x11c436b vectorize_loops()
>         ../../gcc-fsf/gcc/tree-vectorizer.c:1304
> 
> Any ideas?
> 
> Thanks,
> Tamar
> 
> > 
> > Otherwise the change looks OK.
> > 
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> index edb7538a67f00cd80a608ee82510cf437fe88083..029d59016c9652f87d80fc5500f89532c79a66d0 100644
> --- a/gcc/tree-vectorizer.c
> +++ b/gcc/tree-vectorizer.c
> @@ -81,7 +81,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "gimple-pretty-print.h"
>  #include "opt-problem.h"
>  #include "internal-fn.h"
> -
> +#include "tree-ssa-sccvn.h"
>  
>  /* Loop or bb location, with hotness information.  */
>  dump_user_location_t vect_location;
> @@ -1298,6 +1298,17 @@ vectorize_loops (void)
>        if (has_mask_store
>  	  && targetm.vectorize.empty_mask_is_expensive (IFN_MASK_STORE))
>  	optimize_mask_stores (loop);
> +
> +      auto_bitmap exit_bbs;
> +      /* Perform local CSE, this esp. helps because we emit code for
> +	 predicates that need to be shared for optimal predicate usage.
> +	 However reassoc will re-order them and prevent CSE from working
> +	 as it should.  CSE only the loop body, not the entry.  */
> +      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +
> +      edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> +      do_rpo_vn (cfun, entry, exit_bbs);
> +
>        loop->aux = NULL;
>      }
> 
>
  

Patch

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 4712dc6e7f907637774482a71036a0bd381c2bd2..1e370d60fb19b03c3b6bce45c660af4b6d32dc51 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -81,7 +81,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "gimple-pretty-print.h"
 #include "opt-problem.h"
 #include "internal-fn.h"
-
+#include "tree-ssa-sccvn.h"
 
 /* Loop or bb location, with hotness information.  */
 dump_user_location_t vect_location;
@@ -1323,6 +1323,27 @@  vectorize_loops (void)
 	 ???  Also while we try hard to update loop-closed SSA form we fail
 	 to properly do this in some corner-cases (see PR56286).  */
       rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa_only_virtuals);
+
+      for (i = 1; i < number_of_loops (cfun); i++)
+	{
+	  loop = get_loop (cfun, i);
+	  if (!loop || !single_exit (loop))
+	    continue;
+
+	  bitmap exit_bbs;
+	  /* Perform local CSE, this esp. helps because we emit code for
+	     predicates that need to be shared for optimal predicate usage.
+	     However reassoc will re-order them and prevent CSE from working
+	     as it should.  CSE only the loop body, not the entry.  */
+	  exit_bbs = BITMAP_ALLOC (NULL);
+	  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+	  bitmap_set_bit (exit_bbs, loop->latch->index);
+
+	  do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
+
+	  BITMAP_FREE (exit_bbs);
+	}
+
       return TODO_cleanup_cfg;
     }