[v2] Re: OpenMP: Generate SIMD clones for functions with "declare target"

Message ID 001679b1-814a-c1db-5611-c663f6931d11@codesourcery.com
State New
Headers
Series [v2] Re: OpenMP: Generate SIMD clones for functions with "declare target" |

Commit Message

Sandra Loosemore Sept. 22, 2022, 3:17 a.m. UTC
  On 9/14/22 12:12, Jakub Jelinek wrote:

> If it is pure optimization thing and purely keyed on the definition,
> all the simd clones should be local to the TU, never exported from it.

OK, here is a revised patch that addresses that.  x86_64 target also 
generates a different set of clones for functions with internal linkage 
vs external so I hacked that to treat these implicit clones in the same 
way as other internal clones.

There is an existing problem with internal "declare simd" clones in that 
nothing ever DCEs clones that end up not being useful, or does a scan of 
the code in the compilation unit before clone generation to avoid 
generating useless clones in the first place.  I haven't tried to solve 
that problem, but I did attempt to mitigate it for these implicit 
"declare target" clones by tagging the option 
OPT_LEVELS_2_PLUS_SPEED_ONLY (instead of enabling it by default all the 
time) so the clones are not generated by default at -Os and -Og.  I 
added a couple new test cases to check this.

On 9/14/22 15:45, Thomas Schwinge wrote:
> However, OpenACC and OpenMP support may be active at the same time...
> 
>> +  if (attr == NULL_TREE
>> +      && flag_openmp_target_simd_clone && !flag_openacc)
> 
> ..., so '!flag_openacc' is not the right check here.  Instead you'd do
> '!oacc_get_fn_attrib (DECL_ATTRIBUTES (node->decl))' (untested) or
> similar.

This is fixed now too.

OK to check in?

-Sandra
  

Comments

Jakub Jelinek Sept. 30, 2022, 10:37 a.m. UTC | #1
On Wed, Sep 21, 2022 at 09:17:18PM -0600, Sandra Loosemore wrote:
> On 9/14/22 12:12, Jakub Jelinek wrote:
> 
> > If it is pure optimization thing and purely keyed on the definition,
> > all the simd clones should be local to the TU, never exported from it.
> 
> OK, here is a revised patch that addresses that.  x86_64 target also
> generates a different set of clones for functions with internal linkage vs
> external so I hacked that to treat these implicit clones in the same way as
> other internal clones.
> 
> There is an existing problem with internal "declare simd" clones in that
> nothing ever DCEs clones that end up not being useful, or does a scan of the
> code in the compilation unit before clone generation to avoid generating
> useless clones in the first place.  I haven't tried to solve that problem,
> but I did attempt to mitigate it for these implicit "declare target" clones
> by tagging the option OPT_LEVELS_2_PLUS_SPEED_ONLY (instead of enabling it
> by default all the time) so the clones are not generated by default at -Os
> and -Og.  I added a couple new test cases to check this.

We've discussed this at Cauldron.  Especially for this patch, but less
urgently for explicit declare simd on non-exported functions (less urgently
just because people don't mark everything declare simd usually) solving the
above is essential.  I don't say it can't be done incrementally, but if the
patch is added to trunk, it needs to be solved before 13 branches.
We need to arrange cgraph to process the declare simd clones after the
callers of the corresponding main function, so that by the time we try to
post-IPA optimize the clones we can see if they were actually used or not
and if not, throw them away.

On the other side, for the implicit declare simd (in explicit case it is
user's choice), maybe it might be useful to actually see if the function clone
is vectorizable before deciding whether to actually make use of it.
Because I doubt it will be a good optimization if we clone it, push
arguments into vectors, then because vectorization failed take it appart,
do a serial loop, create return vector from the scalar results and return.
Though, thinking more about it, for the amdgcn case maybe it is worth even
in that case if we manage to vectorize the caller.  Because if failed
vectorization on admgcn means we perform significantly slower, it can be
helpful to have even partial vectorization, vectorize statements that can
be vectorized and for others use a scalar loop.  Our vectorizer is not
prepared to do that right now I believe (which is why e.g. for
#pragma omp ordered simd we just make the whole loop non-vectorizable,
rather than using a scalar loop for stuff in there and vectorize the rest),
but with this optimization we'd effectively achieve that at least at
function call boundaries (though, only in one direction, if the caller can
be vectorized and callee can't; no optimization if caller can't and callee
could be).

> +/* Helper function for mark_auto_simd_clone; return false if the statement
> +   violates restrictions for an "omp declare simd" function.  Specifically,
> +   the function must not
> +   - throw or call setjmp/longjmp
> +   - write memory that could alias parallel calls
> +   - include openmp directives or calls
> +   - call functions that might do those things */
> +
> +static bool
> +auto_simd_check_stmt (gimple *stmt, tree outer)
> +{
> +  tree decl;
> +
> +  switch (gimple_code (stmt))
> +    {
> +    case GIMPLE_CALL:
> +      decl = gimple_call_fndecl (stmt);
> +
> +      /* We can't know whether indirect calls are safe.  */
> +      if (decl == NULL_TREE)
> +	return false;

What about internal function calls?  Are all of them undesirable, or
some of them?  We do have const / pure ifns, ...
> +
> +      /* Calls to functions that are CONST or PURE are ok.  */
> +      if (gimple_call_flags (stmt) & (ECF_CONST | ECF_PURE))
> +	break;
> +
> +      /* Calls to functions that are already marked "omp declare simd" are
> +	 OK.  */
> +      if (lookup_attribute ("omp declare simd", DECL_ATTRIBUTES (decl)))
> +	break;

You could instead look up the cgraph simd clone info for the function...

> +      /* OpenMP directives are not permitted.  */
> +    CASE_GIMPLE_OMP:
> +      return false;

This makes no sense.  The function is called on low GIMPLE during IPA,
there are no GOMP_* statements at this point in the IL, everything has
been expanded.  Most of OpenMP directives though end up calling
libgomp APIs which aren't pure/const and don't have declare simd
attribute...
Exception can be say master construct, or static scheduling nowait
worksharing loop.

> +      /* Conservatively reject all EH-related constructs.  */
> +    case GIMPLE_CATCH:
> +    case GIMPLE_EH_FILTER:
> +    case GIMPLE_EH_MUST_NOT_THROW:
> +    case GIMPLE_EH_ELSE:
> +    case GIMPLE_EH_DISPATCH:
> +    case GIMPLE_RESX:
> +    case GIMPLE_TRY:

Most of these won't appear in low gimple either, I think GIMPLE_RESX
does and GIMPLE_EH_DISPATCH too, the rest probably can't.

> +      return false;
> +
> +      /* Asms are not permitted since we don't know what they do.  */
> +    case GIMPLE_ASM:
> +      return false;

What about volatile stmts?  Even volatile loads should be punted on.

> +
> +    default:
> +      break;
> +    }
> +
> +  /* Memory writes are not permitted.
> +     FIXME: this could be relaxed a little to permit writes to
> +     function-local variables that could not alias other instances
> +     of the function running in parallel.  */
> +  if (gimple_store_p (stmt))
> +    return false;
> +  else
> +    return true;
> +}

> +  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (decl))
> +    {
> +      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> +	   gsi_next (&gsi))
> +	if (!auto_simd_check_stmt (gsi_stmt (gsi), decl))
> +	  return NULL_TREE;
> +    }

If you want to punt on exceptions, I guess you could punt on EDGE_EH
or EDGE_ABNORMAL edges out of basic blocks.

> +  attr = lookup_attribute ("omp declare simd",
> +			   DECL_ATTRIBUTES (node->decl));
> +
> +  /* See if we can add an "omp declare simd" directive implicitly
> +     before giving up.  */
> +  /* FIXME: OpenACC "#pragma acc routine" translates into
> +     "omp declare target", but appears also to have some other effects
> +     that conflict with generating SIMD clones, causing ICEs.  So don't
> +     do this if we've got OpenACC instead of OpenMP.  */
> +  if (attr == NULL_TREE
> +      && flag_openmp_target_simd_clone
> +      && !oacc_get_fn_attrib (node->decl))

I admit I don't remember where exactly the simd clone happens wrt. other
IPA passes, but I think it is late pass; so, does it happen for GCN
offloading only in the lto1 offloading compiler?
Shouldn't the auto optimization be then done only in the offloading
lto1 for GCN then (say guard on targetm boolean)?

Otherwise, if we do it say for host offloading fallback as well
(I think it is still undesirable for PTX offloading because it is a waste of
time, there is no vectorization there, it is SIMT instead), it might be
a good idea to check cgraph that the function has at least one caller.

> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-options "-fopenmp -O2" } */
> +
> +/* Test that simd clones are generated for functions with "declare target".  */
> +
> +#pragma omp declare target
> +int addit(int a, int b, int c)
> +{
> +  return a + b;
> +}
> +#pragma omp end declare target

Because in cases like this where nothing calls it in the same TU and not LTO
optimizing, creating the internal clones is pure waste of energy.  Nothing
will vectorize using those.

	Jakub
  
Sandra Loosemore Oct. 17, 2022, 1:23 a.m. UTC | #2
On 9/30/22 04:37, Jakub Jelinek wrote:

> We've discussed this at Cauldron.  Especially for this patch, but less
> urgently for explicit declare simd on non-exported functions (less urgently
> just because people don't mark everything declare simd usually) solving the
> above is essential.  I don't say it can't be done incrementally, but if the
> patch is added to trunk, it needs to be solved before 13 branches.
> We need to arrange cgraph to process the declare simd clones after the
> callers of the corresponding main function, so that by the time we try to
> post-IPA optimize the clones we can see if they were actually used or not
> and if not, throw them away.
> 
> On the other side, for the implicit declare simd (in explicit case it is
> user's choice), maybe it might be useful to actually see if the function clone
> is vectorizable before deciding whether to actually make use of it.
> Because I doubt it will be a good optimization if we clone it, push
> arguments into vectors, then because vectorization failed take it appart,
> do a serial loop, create return vector from the scalar results and return.
> Though, thinking more about it, for the amdgcn case maybe it is worth even
> in that case if we manage to vectorize the caller.  Because if failed
> vectorization on admgcn means we perform significantly slower, it can be
> helpful to have even partial vectorization, vectorize statements that can
> be vectorized and for others use a scalar loop.  Our vectorizer is not
> prepared to do that right now I believe (which is why e.g. for
> #pragma omp ordered simd we just make the whole loop non-vectorizable,
> rather than using a scalar loop for stuff in there and vectorize the rest),
> but with this optimization we'd effectively achieve that at least at
> function call boundaries (though, only in one direction, if the caller can
> be vectorized and callee can't; no optimization if caller can't and callee
> could be).

My sense is that the first approach would be more straightforward than 
the second one, and I am willing to continue to work on that.  However, 
I think I need some direction to get started, as I presently know 
nothing about cgraph and I was unable to find any useful overview or 
interface documentation in the GCC internals manual.  Is this as simple 
as inserting an existing pass into the passlist to clean up after 
vectorization, or does it involve writing something more or less from 
scratch?

> 
>> +      /* OpenMP directives are not permitted.  */
>> +    CASE_GIMPLE_OMP:
>> +      return false;
> 
> This makes no sense.  The function is called on low GIMPLE during IPA,
> there are no GOMP_* statements at this point in the IL, everything has
> been expanded.  Most of OpenMP directives though end up calling
> libgomp APIs which aren't pure/const and don't have declare simd
> attribute...
> Exception can be say master construct, or static scheduling nowait
> worksharing loop.
> 
>> +      /* Conservatively reject all EH-related constructs.  */
>> +    case GIMPLE_CATCH:
>> +    case GIMPLE_EH_FILTER:
>> +    case GIMPLE_EH_MUST_NOT_THROW:
>> +    case GIMPLE_EH_ELSE:
>> +    case GIMPLE_EH_DISPATCH:
>> +    case GIMPLE_RESX:
>> +    case GIMPLE_TRY:
> 
> Most of these won't appear in low gimple either, I think GIMPLE_RESX
> does and GIMPLE_EH_DISPATCH too, the rest probably can't.

OK, this was my bad.  I cut and pasted this from some code that was 
originally for the OMP lowering pass.  I've moved the entire 
plausibility filter to a new pass that runs just before OMP lowering. 
It seems easier to detect the things that are invalid in a cloneable 
function when they are still in a form closer to the source constructs.
>> +      return false;
>> +
>> +      /* Asms are not permitted since we don't know what they do.  */
>> +    case GIMPLE_ASM:
>> +      return false;
> 
> What about volatile stmts?  Even volatile loads should be punted on.

That's fixed now too.
> 
>> +  attr = lookup_attribute ("omp declare simd",
>> +			   DECL_ATTRIBUTES (node->decl));
>> +
>> +  /* See if we can add an "omp declare simd" directive implicitly
>> +     before giving up.  */
>> +  /* FIXME: OpenACC "#pragma acc routine" translates into
>> +     "omp declare target", but appears also to have some other effects
>> +     that conflict with generating SIMD clones, causing ICEs.  So don't
>> +     do this if we've got OpenACC instead of OpenMP.  */
>> +  if (attr == NULL_TREE
>> +      && flag_openmp_target_simd_clone
>> +      && !oacc_get_fn_attrib (node->decl))
> 
> I admit I don't remember where exactly the simd clone happens wrt. other
> IPA passes, but I think it is late pass; so, does it happen for GCN
> offloading only in the lto1 offloading compiler?
> Shouldn't the auto optimization be then done only in the offloading
> lto1 for GCN then (say guard on targetm boolean)?

I'm afraid I don't know much about offloading, but I was under the 
impression it all goes through the same compilation process, just with a 
different target?

> Otherwise, if we do it say for host offloading fallback as well
> (I think it is still undesirable for PTX offloading because it is a waste of
> time, there is no vectorization there, it is SIMT instead), it might be
> a good idea to check cgraph that the function has at least one caller.

As I said previously, I don't understand cgraph, but in my new patch I 
arranged things so that the implicit clones are only created if there is 
also a call to the function found in an OMP loop (not just one caller 
anywhere).  So this should be fixed now.

New patch attached.  Is this one OK for mainline?

-Sandra
  
Jakub Jelinek Oct. 20, 2022, 2:07 p.m. UTC | #3
On Sun, Oct 16, 2022 at 07:23:05PM -0600, Sandra Loosemore wrote:
> My sense is that the first approach would be more straightforward than the
> second one, and I am willing to continue to work on that.  However, I think
> I need some direction to get started, as I presently know nothing about
> cgraph and I was unable to find any useful overview or interface
> documentation in the GCC internals manual.  Is this as simple as inserting
> an existing pass into the passlist to clean up after vectorization, or does
> it involve writing something more or less from scratch?

We (as I've discovered during the work on assumptions) have
TODO_discard_function which when returned from an execute pass throws away
a function completely (except now assumption functions for which it doesn't
release body; this could be done in some pass shortly after IPA, or
alternatively before expansion).  But another thing that needs to be done is for the
non-public declare simd clones (both explicit and implicit from your patch)
to be ordered in cgraph after anything that has a cgraph edge to its
original function.  I don't know how to do that, you should talk to Honza,
Richi or Martin about that.
I think the current behavior is that callees are processed before callers
if possible (unless there are cycles), which is certainly what we want for
say assume functions, or IPA RA etc.  But in case of non-public simd clones
we want to do it the other way around (at the expense of IPA RA), so that
we can throw away functions which aren't needed.

> > I admit I don't remember where exactly the simd clone happens wrt. other
> > IPA passes, but I think it is late pass; so, does it happen for GCN
> > offloading only in the lto1 offloading compiler?
> > Shouldn't the auto optimization be then done only in the offloading
> > lto1 for GCN then (say guard on targetm boolean)?
> 
> I'm afraid I don't know much about offloading, but I was under the
> impression it all goes through the same compilation process, just with a
> different target?

I've looked at it today and it seems late ipa passes are executed after LTO
bytecode is streamed back in.
If you say try:
#pragma omp declare simd
int foo (int x) { return x; }

int
main ()
{
  int a[64] = {};
  #pragma omp target map(a)
  #pragma omp simd
  for (int i = 0; i < 64; i++)
    a[i] = foo (a[i]);
}
with
gcc -foffload-options='-fdump-tree-all -fdump-ipa-all' -fdump-tree-all -fdump-ipa-all -O2 -fopenmp a.c -o a
you ought to see the simdclone dump both as a.c.*i.simdclone and a.x*.mkoffload.*i.simdclone
where the former is what is done for the host code (and host fallback),
while the latter is what is done in the offloading lto.
Can't verify it 100% because I have only nvptx-none offloading configured
and in that case pass_omp_simd_clone::gate is disabled in offloading lto
because targetm.simd_clone.compute_vecsize_and_simdlen is NULL for nvptx.
But it is non-NULL for gcn.

Thus, IMHO it is exactly the pass_omp_simd_clone pass where you want to
implement this auto-simdization discovery, guarded with
#ifdef ACCEL_COMPILER and the new option (which means it will be done
only for gcn and not on the host right now).  And do it at the start of
ipa_omp_simd_clone, before the
  FOR_EACH_FUNCTION (node)
    expand_simd_clones (node);
loop, or, if it is purely local decision for each function, at the
start of expand_simd_clones with similar guarding, punt on functions
with "noclone" attribute, or !node->definition.  You need to repeat the
  if (node->has_gimple_body_p ())
    node->get_body ();
to get body before you analyze it.

And please put the new functions for such analysis into omp-simd-clone.cc
where they belong.

	Jakub
  
Sandra Loosemore Oct. 27, 2022, 2:27 a.m. UTC | #4
On 10/20/22 08:07, Jakub Jelinek wrote:
> Thus, IMHO it is exactly the pass_omp_simd_clone pass where you want to
> implement this auto-simdization discovery, guarded with
> #ifdef ACCEL_COMPILER and the new option (which means it will be done
> only for gcn and not on the host right now).

I'm running into a practical difficulty with making this controlled by a 
static #ifdef: namely, testing.

One of my test cases examines the .s output to make sure that the clones 
are emitted as local symbols and not global.  I have not been able to 
find the symbol linkage information in any of the dump files, and I have 
also not been able to figure out how to get a .s file from the offload 
compiler even outside of the DejaGnu test harness.  (It's possible I am 
just an extreme dummy about the latter problem, but so far none of my 
colleagues here has been able to give me a recipe either.)

On top of that, I worry that this should be tested more broadly than for 
the one target we're presently focusing on (AMD GCN), and we'll get much 
more regular test coverage if it's also enabled for x86_64 target which 
has the necessary compute_vecsize_and_simdlen target hook.

I remember Carlos O'Donnell used to have a favorite mantra, "design for 
test".  So, maybe generalize the new -fopenmp-target-simd-clone option 
to take a parameter to force clones to be generated on the OpenMP host 
for test purposes?  The "declare target" directive already has a clause

device_type(host|nohost|any)

that defaults to "any"; maybe we could use that syntax like
-fopenmp-target-simd-clone=any
and use the intersection of the two sets to determine what to 
auto-generate clones for?

-Sandra
  
Thomas Schwinge Oct. 27, 2022, 10:09 a.m. UTC | #5
Hi!

On 2022-10-26T20:27:19-0600, Sandra Loosemore <sandra@codesourcery.com> wrote:
> On 10/20/22 08:07, Jakub Jelinek wrote:
>> Thus, IMHO it is exactly the pass_omp_simd_clone pass where you want to
>> implement this auto-simdization discovery, guarded with
>> #ifdef ACCEL_COMPILER and the new option (which means it will be done
>> only for gcn and not on the host right now).
>
> I'm running into a practical difficulty with making this controlled by a
> static #ifdef: namely, testing.
>
> One of my test cases examines the .s output to make sure that the clones
> are emitted as local symbols and not global.  I have not been able to
> find the symbol linkage information in any of the dump files

Hmm, also some of '-fdump-ipa-all-details' doesn't help here?

> and I have
> also not been able to figure out how to get a .s file from the offload
> compiler even outside of the DejaGnu test harness.  (It's possible I am
> just an extreme dummy about the latter problem, but so far none of my
> colleagues here has been able to give me a recipe either.)

Right, currently only 'scan-offload-tree-dump[...]',
'scan-offload-rtl-dump[...]' are implemented; I assume
'scan-offload-assembler[...]' could be added without too much effort.

> On top of that, I worry that this should be tested more broadly than for
> the one target we're presently focusing on (AMD GCN), and we'll get much
> more regular test coverage if it's also enabled for x86_64 target which
> has the necessary compute_vecsize_and_simdlen target hook.
>
> I remember Carlos O'Donnell used to have a favorite mantra, "design for
> test".

Heh, I don't remember him ever saying that to me -- but maybe that's
because this is what I do anyway.  ;-P

> So, maybe generalize the new -fopenmp-target-simd-clone option
> to take a parameter to force clones to be generated on the OpenMP host
> for test purposes?  The "declare target" directive already has a clause
>
> device_type(host|nohost|any)
>
> that defaults to "any"; maybe we could use that syntax like
> -fopenmp-target-simd-clone=any
> and use the intersection of the two sets to determine what to
> auto-generate clones for?

Seems reasonable to me (but I'm missing a lot of context here).


There anyway is a goal (far out) to get rid of compilation-time
'#ifdef ACCEL_COMPILER' etc., and instead make such code dependent on a
command-line flag (or some other state), so that it's possible to use the
the same compiler for target (host) as well as offload target compilation.
(For example, to simulate offloading compilation with standard
x86_64-pc-linux-gnu GCC.)


And/or, where you implement the logic to "make sure that the clones
are emitted as local symbols and not global", do emit some "tag" in the
dump file, and the scan for that?

Random examples that I just remembered:

'gcc/omp-offload.cc:execute_oacc_loop_designation' handling of
'OMP_CLAUSE_NOHOST', and how that's scanned (host-side) in test cases
such as 'libgomp/testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c',
'libgomp/testsuite/libgomp.oacc-fortran/routine-nohost-1.f90'.

'gcc/config/nvptx/nvptx.cc:nvptx_find_sese' doing
'fprintf (dump_file, "SESE regions:"); [...]', and that's scanned in:

    libgomp/testsuite/libgomp.oacc-c-c++-common/nvptx-sese-1.c-/* Match {N->N(.N)+} */
    libgomp/testsuite/libgomp.oacc-c-c++-common/nvptx-sese-1.c:/* { dg-final { scan-offload-rtl-dump "SESE regions:.* \[0-9\]+{\[0-9\]+->\[0-9\]+(\\.\[0-9\]+)+}" "mach" } } */

(You'd be doing this at the 'scan-offload-tree-dump[...]' level, I
suppose.)


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Sandra Loosemore Oct. 27, 2022, 8:40 p.m. UTC | #6
On 10/27/22 04:09, Thomas Schwinge wrote:
> Hi!
> 
> On 2022-10-26T20:27:19-0600, Sandra Loosemore <sandra@codesourcery.com> wrote:
>> One of my test cases examines the .s output to make sure that the clones
>> are emitted as local symbols and not global.  I have not been able to
>> find the symbol linkage information in any of the dump files
> 
> Hmm, also some of '-fdump-ipa-all-details' doesn't help here?

Maybe I'm not looking at the right dump file, but all I see is names of 
functions in the dumps and nothing about symbol linkage/visibility, even 
with -details.

> And/or, where you implement the logic to "make sure that the clones
> are emitted as local symbols and not global", do emit some "tag" in the
> dump file, and the scan for that?
> 
> Random examples that I just remembered:
> 
> 'gcc/omp-offload.cc:execute_oacc_loop_designation' handling of
> 'OMP_CLAUSE_NOHOST', and how that's scanned (host-side) in test cases
> such as 'libgomp/testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c',
> 'libgomp/testsuite/libgomp.oacc-fortran/routine-nohost-1.f90'.
> 
> 'gcc/config/nvptx/nvptx.cc:nvptx_find_sese' doing
> 'fprintf (dump_file, "SESE regions:"); [...]', and that's scanned in:
> 
>      libgomp/testsuite/libgomp.oacc-c-c++-common/nvptx-sese-1.c-/* Match {N->N(.N)+} */
>      libgomp/testsuite/libgomp.oacc-c-c++-common/nvptx-sese-1.c:/* { dg-final { scan-offload-rtl-dump "SESE regions:.* \[0-9\]+{\[0-9\]+->\[0-9\]+(\\.\[0-9\]+)+}" "mach" } } */
> 
> (You'd be doing this at the 'scan-offload-tree-dump[...]' level, I
> suppose.)

I guess customizing the dump output from the simdclone pass with the 
information I need is the easiest solution.  I'm still concerned about 
getting adequate routine test coverage, though, when it's so specialized 
to a particular offload target.

Thanks for the help!  :-)

-Sandra
  

Patch

From dfdb9a2162978b964863f351c814211dca8e9a3f Mon Sep 17 00:00:00 2001
From: Sandra Loosemore <sandra@codesourcery.com>
Date: Thu, 22 Sep 2022 02:16:42 +0000
Subject: [PATCH] OpenMP: Generate SIMD clones for functions with "declare
 target"

This patch causes the IPA simdclone pass to generate clones for
functions with the "omp declare target" attribute as if they had
"omp declare simd", provided the function appears to be suitable for
SIMD execution.  The filter is conservative, rejecting functions
that write memory or that call other functions not known to be safe.
A new option -fopenmp-target-simd-clone is added to control this
transformation; it's enabled at -O2 and higher.

gcc/ChangeLog:

	* common.opt (fopenmp-target-simd-clone): New option.
	* opts.cc (default_options_table): Add -fopenmp-target-simd-clone.
	* doc/invoke.texi (-fopenmp-target-simd-clone): Document.
	* omp-simd-clone.cc (auto_simd_check_stmt): New function.
	(mark_auto_simd_clone): New function.
	(simd_clone_create): Add force_local argument, make the symbol
	have internal linkage if it is true.
	(expand_simd_clones): Also check for cloneable functions with
	"omp declare target".  Pass explicit_p argument to
	simd_clone.compute_vecsize_and_simdlen target hook.
	* target.def (TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN):
	Add bool explicit_p argument.
	* doc/tm.texi: Regenerated.
	* config/aarch64/aarch64.cc
	(aarch64_simd_clone_compute_vecsize_and_simdlen): Update.
	* config/gcn/gcn.cc
	(gcn_simd_clone_compute_vecsize_and_simdlen): Update.
	* config/i386/i386.cc
	(ix86_simd_clone_compute_vecsize_and_simdlen): Update.

gcc/testsuite/ChangeLog:

	* gcc.dg/gomp/target-simd-clone-1.c: New.
	* gcc.dg/gomp/target-simd-clone-2.c: New.
	* gcc.dg/gomp/target-simd-clone-3.c: New.
	* gcc.dg/gomp/target-simd-clone-4.c: New.
	* gcc.dg/gomp/target-simd-clone-5.c: New.
	* gcc.dg/gomp/target-simd-clone-6.c: New.
---
 gcc/common.opt                                |   4 +
 gcc/config/aarch64/aarch64.cc                 |  24 +-
 gcc/config/gcn/gcn.cc                         |  10 +-
 gcc/config/i386/i386.cc                       |  27 +-
 gcc/doc/invoke.texi                           |  12 +-
 gcc/doc/tm.texi                               |   2 +-
 gcc/omp-simd-clone.cc                         | 237 ++++++++++++++++--
 gcc/opts.cc                                   |   1 +
 gcc/target.def                                |   2 +-
 .../gcc.dg/gomp/target-simd-clone-1.c         |  18 ++
 .../gcc.dg/gomp/target-simd-clone-2.c         |  18 ++
 .../gcc.dg/gomp/target-simd-clone-3.c         |  17 ++
 .../gcc.dg/gomp/target-simd-clone-4.c         |  16 ++
 .../gcc.dg/gomp/target-simd-clone-5.c         |  13 +
 .../gcc.dg/gomp/target-simd-clone-6.c         |  13 +
 15 files changed, 362 insertions(+), 52 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-2.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-3.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-4.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-5.c
 create mode 100644 gcc/testsuite/gcc.dg/gomp/target-simd-clone-6.c

diff --git a/gcc/common.opt b/gcc/common.opt
index fba90ff6dcb..c735c62a8d4 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2217,6 +2217,10 @@  fomit-frame-pointer
 Common Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
 
+fopenmp-target-simd-clone
+Common Var(flag_openmp_target_simd_clone) Optimization
+Generate SIMD clones for functions with the OpenMP declare target directive.
+
 fopt-info
 Common Var(flag_opt_info) Optimization
 Enable all optimization info dumps on stderr.
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index f199e77cd42..c6d282c55ef 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26612,7 +26612,8 @@  currently_supported_simd_type (tree t, tree b)
 static int
 aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 					struct cgraph_simd_clone *clonei,
-					tree base_type, int num)
+					tree base_type, int num,
+					bool explicit_p)
 {
   tree t, ret_type;
   unsigned int elt_bits, count;
@@ -26630,8 +26631,9 @@  aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	  || const_simdlen > 1024
 	  || (const_simdlen & (const_simdlen - 1)) != 0))
     {
-      warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		  "unsupported simdlen %wd", const_simdlen);
+      if (explicit_p)
+	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+		    "unsupported simdlen %wd", const_simdlen);
       return 0;
     }
 
@@ -26639,7 +26641,9 @@  aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
   if (TREE_CODE (ret_type) != VOID_TYPE
       && !currently_supported_simd_type (ret_type, base_type))
     {
-      if (TYPE_SIZE (ret_type) != TYPE_SIZE (base_type))
+      if (!explicit_p)
+	;
+      else if (TYPE_SIZE (ret_type) != TYPE_SIZE (base_type))
 	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
 		    "GCC does not currently support mixed size types "
 		    "for %<simd%> functions");
@@ -26666,7 +26670,9 @@  aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
       if (clonei->args[i].arg_type != SIMD_CLONE_ARG_TYPE_UNIFORM
 	  && !currently_supported_simd_type (arg_type, base_type))
 	{
-	  if (TYPE_SIZE (arg_type) != TYPE_SIZE (base_type))
+	  if (!explicit_p)
+	    ;
+	  else if (TYPE_SIZE (arg_type) != TYPE_SIZE (base_type))
 	    warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
 			"GCC does not currently support mixed size types "
 			"for %<simd%> functions");
@@ -26696,9 +26702,11 @@  aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
       if (clonei->simdlen.is_constant (&const_simdlen)
 	  && maybe_ne (vec_bits, 64U) && maybe_ne (vec_bits, 128U))
 	{
-	  warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		      "GCC does not currently support simdlen %wd for type %qT",
-		      const_simdlen, base_type);
+	  if (explicit_p)
+	    warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+			"GCC does not currently support simdlen %wd for "
+			"type %qT",
+			const_simdlen, base_type);
 	  return 0;
 	}
     }
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index ceb69000807..5c80b8df852 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -4562,7 +4562,8 @@  static int
 gcn_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *ARG_UNUSED (node),
 					    struct cgraph_simd_clone *clonei,
 					    tree base_type,
-					    int ARG_UNUSED (num))
+					    int ARG_UNUSED (num),
+					    bool explicit_p)
 {
   unsigned int elt_bits = GET_MODE_BITSIZE (SCALAR_TYPE_MODE (base_type));
 
@@ -4572,9 +4573,10 @@  gcn_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *ARG_UNUSED (node
     {
       /* Note that x86 has a similar message that is likely to trigger on
 	 sizes that are OK for gcn; the user can't win.  */
-      warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		  "unsupported simdlen %wd (amdgcn)",
-		  clonei->simdlen.to_constant ());
+      if (explicit_p)
+	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+		    "unsupported simdlen %wd (amdgcn)",
+		    clonei->simdlen.to_constant ());
       return 0;
     }
 
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index c4d0e36e9c0..99ae388ad56 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23647,7 +23647,8 @@  ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 static int
 ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 					     struct cgraph_simd_clone *clonei,
-					     tree base_type, int num)
+					     tree base_type, int num,
+					     bool explicit_p)
 {
   int ret = 1;
 
@@ -23656,8 +23657,9 @@  ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	  || clonei->simdlen > 1024
 	  || (clonei->simdlen & (clonei->simdlen - 1)) != 0))
     {
-      warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		  "unsupported simdlen %wd", clonei->simdlen.to_constant ());
+      if (explicit_p)
+	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+		    "unsupported simdlen %wd", clonei->simdlen.to_constant ());
       return 0;
     }
 
@@ -23677,8 +23679,9 @@  ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	  break;
 	/* FALLTHRU */
       default:
-	warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		    "unsupported return type %qT for simd", ret_type);
+	if (explicit_p)
+	  warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+		      "unsupported return type %qT for simd", ret_type);
 	return 0;
       }
 
@@ -23707,13 +23710,14 @@  ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	default:
 	  if (clonei->args[i].arg_type == SIMD_CLONE_ARG_TYPE_UNIFORM)
 	    break;
-	  warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		      "unsupported argument type %qT for simd", arg_type);
+	  if (explicit_p)
+	    warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+			"unsupported argument type %qT for simd", arg_type);
 	  return 0;
 	}
     }
 
-  if (!TREE_PUBLIC (node->decl))
+  if (!TREE_PUBLIC (node->decl) || !explicit_p)
     {
       /* If the function isn't exported, we can pick up just one ISA
 	 for the clones.  */
@@ -23784,9 +23788,10 @@  ix86_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node,
 	cnt /= clonei->vecsize_float;
       if (cnt > (TARGET_64BIT ? 16 : 8))
 	{
-	  warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
-		      "unsupported simdlen %wd",
-		      clonei->simdlen.to_constant ());
+	  if (explicit_p)
+	    warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
+			"unsupported simdlen %wd",
+			clonei->simdlen.to_constant ());
 	  return 0;
 	}
       }
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8def6baa904..e05739a334c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -204,7 +204,7 @@  in the following sections.
 -flax-vector-conversions  -fms-extensions @gol
 -foffload=@var{arg}  -foffload-options=@var{arg} @gol
 -fopenacc  -fopenacc-dim=@var{geom} @gol
--fopenmp  -fopenmp-simd @gol
+-fopenmp  -fopenmp-simd  -fopenmp-target-simd-clone @gol
 -fpermitted-flt-eval-methods=@var{standard} @gol
 -fplan9-extensions  -fsigned-bitfields  -funsigned-bitfields @gol
 -fsigned-char  -funsigned-char  -fsso-struct=@var{endianness}}
@@ -2749,6 +2749,16 @@  Enable handling of OpenMP's SIMD directives with @code{#pragma omp}
 in C/C++ and @code{!$omp} in Fortran. Other OpenMP directives
 are ignored.
 
+@item -fopenmp-target-simd-clone
+@opindex fopenmp-target-simd-clone
+@cindex OpenMP target SIMD clone
+In addition to generating SIMD clones for functions marked with the
+@code{declare simd} directive, GCC also generates clones
+for functions marked with the OpenMP @code{declare target} directive
+that are suitable for vectorization when this option is in effect.
+It is enabled by default at @option{-O2} and higher (but not @option{-Os}
+or @option{-Og}).
+
 @item -fpermitted-flt-eval-methods=@var{style}
 @opindex fpermitted-flt-eval-methods
 @opindex fpermitted-flt-eval-methods=c11
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c3001c6ded9..d0a366f1908 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6249,7 +6249,7 @@  The default is @code{NULL_TREE} which means to not vectorize scatter
 stores.
 @end deftypefn
 
-@deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int})
+@deftypefn {Target Hook} int TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN (struct cgraph_node *@var{}, struct cgraph_simd_clone *@var{}, @var{tree}, @var{int}, @var{bool})
 This hook should set @var{vecsize_mangle}, @var{vecsize_int}, @var{vecsize_float}
 fields in @var{simd_clone} structure pointed by @var{clone_info} argument and also
 @var{simdlen} field if it was previously 0.
diff --git a/gcc/omp-simd-clone.cc b/gcc/omp-simd-clone.cc
index 34cbee5afcd..f9e98b099d1 100644
--- a/gcc/omp-simd-clone.cc
+++ b/gcc/omp-simd-clone.cc
@@ -51,6 +51,152 @@  along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "attribs.h"
 #include "omp-simd-clone.h"
+#include "omp-low.h"
+#include "omp-general.h"
+
+/* Helper function for mark_auto_simd_clone; return false if the statement
+   violates restrictions for an "omp declare simd" function.  Specifically,
+   the function must not
+   - throw or call setjmp/longjmp
+   - write memory that could alias parallel calls
+   - include openmp directives or calls
+   - call functions that might do those things */
+
+static bool
+auto_simd_check_stmt (gimple *stmt, tree outer)
+{
+  tree decl;
+
+  switch (gimple_code (stmt))
+    {
+    case GIMPLE_CALL:
+      decl = gimple_call_fndecl (stmt);
+
+      /* We can't know whether indirect calls are safe.  */
+      if (decl == NULL_TREE)
+	return false;
+
+      /* Calls to functions that are CONST or PURE are ok.  */
+      if (gimple_call_flags (stmt) & (ECF_CONST | ECF_PURE))
+	break;
+
+      /* Calls to functions that are already marked "omp declare simd" are
+	 OK.  */
+      if (lookup_attribute ("omp declare simd", DECL_ATTRIBUTES (decl)))
+	break;
+
+      /* Let recursive calls to the current function through.  */
+      if (decl == outer)
+	break;
+
+      /* Other function calls are not permitted.  */
+      return false;
+
+      /* OpenMP directives are not permitted.  */
+    CASE_GIMPLE_OMP:
+      return false;
+
+      /* Conservatively reject all EH-related constructs.  */
+    case GIMPLE_CATCH:
+    case GIMPLE_EH_FILTER:
+    case GIMPLE_EH_MUST_NOT_THROW:
+    case GIMPLE_EH_ELSE:
+    case GIMPLE_EH_DISPATCH:
+    case GIMPLE_RESX:
+    case GIMPLE_TRY:
+      return false;
+
+      /* Asms are not permitted since we don't know what they do.  */
+    case GIMPLE_ASM:
+      return false;
+
+    default:
+      break;
+    }
+
+  /* Memory writes are not permitted.
+     FIXME: this could be relaxed a little to permit writes to
+     function-local variables that could not alias other instances
+     of the function running in parallel.  */
+  if (gimple_store_p (stmt))
+    return false;
+  else
+    return true;
+}
+
+/* If the function NODE appears suitable for auto-annotation with "declare
+   simd", add and return such an attribute, otherwise return null.  */
+
+static tree
+mark_auto_simd_clone (struct cgraph_node *node)
+{
+  tree decl = node->decl;
+  tree t;
+  machine_mode m;
+  tree result;
+  basic_block bb;
+
+  /* Nothing to do if the function isn't a definition or doesn't
+     have a body.  */
+  if (!node->definition || !node->has_gimple_body_p ())
+    return NULL_TREE;
+
+  /* Nothing to do if the function already has the "omp declare simd"
+     attribute, is marked noclone, or is not "omp declare target".  */
+  if (lookup_attribute ("omp declare simd", DECL_ATTRIBUTES (decl))
+      || lookup_attribute ("noclone", DECL_ATTRIBUTES (decl))
+      || !lookup_attribute ("omp declare target", DECL_ATTRIBUTES (decl)))
+    return NULL_TREE;
+
+  /* Backends will check for vectorizable arguments/return types in a
+     target-specific way, but we can immediately filter out functions
+     that have non-scalar arguments/return types.  Also, atomic types
+     trigger warnings in simd_clone_clauses_extract.  */
+  t = TREE_TYPE (TREE_TYPE (decl));
+  m = TYPE_MODE (t);
+  if (!(VOID_TYPE_P (t) || is_a <scalar_mode> (m)) || TYPE_ATOMIC (t))
+    return NULL_TREE;
+
+  if (TYPE_ARG_TYPES (TREE_TYPE (decl)))
+    {
+      for (tree temp = TYPE_ARG_TYPES (TREE_TYPE (decl));
+	   temp; temp = TREE_CHAIN (temp))
+	{
+	  t = TREE_VALUE (temp);
+	  m = TYPE_MODE (t);
+	  if (!(VOID_TYPE_P (t) || is_a <scalar_mode> (m)) || TYPE_ATOMIC (t))
+	    return NULL_TREE;
+	}
+    }
+  else
+    {
+      for (tree temp = DECL_ARGUMENTS (decl); temp; temp = DECL_CHAIN (temp))
+	{
+	  t = TREE_TYPE (temp);
+	  m = TYPE_MODE (t);
+	  if (!(VOID_TYPE_P (t) || is_a <scalar_mode> (m)) || TYPE_ATOMIC (t))
+	    return NULL_TREE;
+	}
+    }
+
+  /* Scan the function body to see if it is suitable for SIMD-ization.  */
+  node->get_body ();
+
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (decl))
+    {
+      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	if (!auto_simd_check_stmt (gsi_stmt (gsi), decl))
+	  return NULL_TREE;
+    }
+
+  /* All is good.  */
+  result = tree_cons (get_identifier ("omp declare simd"), NULL,
+		      DECL_ATTRIBUTES (decl));
+  DECL_ATTRIBUTES (decl) = result;
+  return result;
+}
+
 
 /* Return the number of elements in vector type VECTYPE, which is associated
    with a SIMD clone.  At present these always have a constant length.  */
@@ -430,10 +576,12 @@  simd_clone_mangle (struct cgraph_node *node,
   return get_identifier (str);
 }
 
-/* Create a simd clone of OLD_NODE and return it.  */
+/* Create a simd clone of OLD_NODE and return it.  If FORCE_LOCAL is true,
+   create it as a local symbol, otherwise copy the symbol linkage and
+   visibility attributes from OLD_NODE.  */
 
 static struct cgraph_node *
-simd_clone_create (struct cgraph_node *old_node)
+simd_clone_create (struct cgraph_node *old_node, bool force_local)
 {
   struct cgraph_node *new_node;
   if (old_node->definition)
@@ -463,23 +611,38 @@  simd_clone_create (struct cgraph_node *old_node)
     return new_node;
 
   set_decl_built_in_function (new_node->decl, NOT_BUILT_IN, 0);
-  TREE_PUBLIC (new_node->decl) = TREE_PUBLIC (old_node->decl);
-  DECL_COMDAT (new_node->decl) = DECL_COMDAT (old_node->decl);
-  DECL_WEAK (new_node->decl) = DECL_WEAK (old_node->decl);
-  DECL_EXTERNAL (new_node->decl) = DECL_EXTERNAL (old_node->decl);
-  DECL_VISIBILITY_SPECIFIED (new_node->decl)
-    = DECL_VISIBILITY_SPECIFIED (old_node->decl);
-  DECL_VISIBILITY (new_node->decl) = DECL_VISIBILITY (old_node->decl);
-  DECL_DLLIMPORT_P (new_node->decl) = DECL_DLLIMPORT_P (old_node->decl);
-  if (DECL_ONE_ONLY (old_node->decl))
-    make_decl_one_only (new_node->decl, DECL_ASSEMBLER_NAME (new_node->decl));
-
-  /* The method cgraph_version_clone_with_body () will force the new
-     symbol local.  Undo this, and inherit external visibility from
-     the old node.  */
-  new_node->local = old_node->local;
-  new_node->externally_visible = old_node->externally_visible;
-  new_node->calls_declare_variant_alt = old_node->calls_declare_variant_alt;
+  if (force_local)
+    {
+      TREE_PUBLIC (new_node->decl) = 0;
+      DECL_COMDAT (new_node->decl) = 0;
+      DECL_WEAK (new_node->decl) = 0;
+      DECL_EXTERNAL (new_node->decl) = 0;
+      DECL_VISIBILITY_SPECIFIED (new_node->decl) = 0;
+      DECL_VISIBILITY (new_node->decl) = VISIBILITY_DEFAULT;
+      DECL_DLLIMPORT_P (new_node->decl) = 0;
+    }
+  else
+    {
+      TREE_PUBLIC (new_node->decl) = TREE_PUBLIC (old_node->decl);
+      DECL_COMDAT (new_node->decl) = DECL_COMDAT (old_node->decl);
+      DECL_WEAK (new_node->decl) = DECL_WEAK (old_node->decl);
+      DECL_EXTERNAL (new_node->decl) = DECL_EXTERNAL (old_node->decl);
+      DECL_VISIBILITY_SPECIFIED (new_node->decl)
+	= DECL_VISIBILITY_SPECIFIED (old_node->decl);
+      DECL_VISIBILITY (new_node->decl) = DECL_VISIBILITY (old_node->decl);
+      DECL_DLLIMPORT_P (new_node->decl) = DECL_DLLIMPORT_P (old_node->decl);
+      if (DECL_ONE_ONLY (old_node->decl))
+	make_decl_one_only (new_node->decl,
+			    DECL_ASSEMBLER_NAME (new_node->decl));
+
+      /* The method cgraph_version_clone_with_body () will force the new
+	 symbol local.  Undo this, and inherit external visibility from
+	 the old node.  */
+      new_node->local = old_node->local;
+      new_node->externally_visible = old_node->externally_visible;
+      new_node->calls_declare_variant_alt
+	= old_node->calls_declare_variant_alt;
+    }
 
   return new_node;
 }
@@ -1683,13 +1846,32 @@  simd_clone_adjust (struct cgraph_node *node)
 void
 expand_simd_clones (struct cgraph_node *node)
 {
-  tree attr = lookup_attribute ("omp declare simd",
-				DECL_ATTRIBUTES (node->decl));
-  if (attr == NULL_TREE
-      || node->inlined_to
+  tree attr;
+  bool explicit_p = true;
+
+  if (node->inlined_to
       || lookup_attribute ("noclone", DECL_ATTRIBUTES (node->decl)))
     return;
 
+  attr = lookup_attribute ("omp declare simd",
+			   DECL_ATTRIBUTES (node->decl));
+
+  /* See if we can add an "omp declare simd" directive implicitly
+     before giving up.  */
+  /* FIXME: OpenACC "#pragma acc routine" translates into
+     "omp declare target", but appears also to have some other effects
+     that conflict with generating SIMD clones, causing ICEs.  So don't
+     do this if we've got OpenACC instead of OpenMP.  */
+  if (attr == NULL_TREE
+      && flag_openmp_target_simd_clone
+      && !oacc_get_fn_attrib (node->decl))
+    {
+      attr = mark_auto_simd_clone (node);
+      explicit_p = false;
+    }
+  if (attr == NULL_TREE)
+    return;
+
   /* Ignore
      #pragma omp declare simd
      extern int foo ();
@@ -1714,13 +1896,15 @@  expand_simd_clones (struct cgraph_node *node)
 
       poly_uint64 orig_simdlen = clone_info->simdlen;
       tree base_type = simd_clone_compute_base_data_type (node, clone_info);
+
       /* The target can return 0 (no simd clones should be created),
 	 1 (just one ISA of simd clones should be created) or higher
 	 count of ISA variants.  In that case, clone_info is initialized
 	 for the first ISA variant.  */
       int count
 	= targetm.simd_clone.compute_vecsize_and_simdlen (node, clone_info,
-							  base_type, 0);
+							  base_type, 0,
+							  explicit_p);
       if (count == 0)
 	continue;
 
@@ -1745,7 +1929,8 @@  expand_simd_clones (struct cgraph_node *node)
 	      /* And call the target hook again to get the right ISA.  */
 	      targetm.simd_clone.compute_vecsize_and_simdlen (node, clone,
 							      base_type,
-							      i / 2);
+							      i / 2,
+							      explicit_p);
 	      if ((i & 1) != 0)
 		clone->inbranch = 1;
 	    }
@@ -1763,7 +1948,7 @@  expand_simd_clones (struct cgraph_node *node)
 	  /* Only when we are sure we want to create the clone actually
 	     clone the function (or definitions) or create another
 	     extern FUNCTION_DECL (for prototypes without definitions).  */
-	  struct cgraph_node *n = simd_clone_create (node);
+	  struct cgraph_node *n = simd_clone_create (node, !explicit_p);
 	  if (n == NULL)
 	    {
 	      if (i == 0)
diff --git a/gcc/opts.cc b/gcc/opts.cc
index 54e57f36755..b8ca6fdca82 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -658,6 +658,7 @@  static const struct default_options default_options_table[] =
       REORDER_BLOCKS_ALGORITHM_STC },
     { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_ftree_loop_vectorize, NULL, 1 },
     { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_ftree_slp_vectorize, NULL, 1 },
+    { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_fopenmp_target_simd_clone, NULL, 1 },
 #ifdef INSN_SCHEDULING
   /* Only run the pre-regalloc scheduling pass if optimizing for speed.  */
     { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_fschedule_insns, NULL, 1 },
diff --git a/gcc/target.def b/gcc/target.def
index 4d49ffc2c88..6e830bed52a 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1634,7 +1634,7 @@  fields in @var{simd_clone} structure pointed by @var{clone_info} argument and al
 not determined by the bitsize (in which case @var{simdlen} is always used).\n\
 The hook should return 0 if SIMD clones shouldn't be emitted,\n\
 or number of @var{vecsize_mangle} variants that should be emitted.",
-int, (struct cgraph_node *, struct cgraph_simd_clone *, tree, int), NULL)
+int, (struct cgraph_node *, struct cgraph_simd_clone *, tree, int, bool), NULL)
 
 DEFHOOK
 (adjust,
diff --git a/gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c
new file mode 100644
index 00000000000..ab027a60970
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-1.c
@@ -0,0 +1,18 @@ 
+/* { dg-options "-fopenmp -O2" } */
+
+/* Test that simd clones are generated for functions with "declare target".  */
+
+#pragma omp declare target
+int addit(int a, int b, int c)
+{
+  return a + b;
+}
+#pragma omp end declare target
+
+/* Although addit has external linkage, we expect clones to be generated as
+   for a function with internal linkage.  */
+
+/* { dg-final { scan-assembler "\\.type.*_ZGVbN4vvv_addit,.*function" { target i?86-*-* x86_64-*-* } } } */
+/* { dg-final { scan-assembler "\\.type.*_ZGVbM4vvv_addit,.*function" { target i?86-*-* x86_64-*-* } } } */
+/* { dg-final { scan-assembler-not "\\.globl.*_ZGVbN4vvv_addit" { target i?86-*-* x86_64-*-* } } } */
+/* { dg-final { scan-assembler-not "\\.globl.*_ZGVbM4vvv_addit" { target i?86-*-* x86_64-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/target-simd-clone-2.c b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-2.c
new file mode 100644
index 00000000000..0ccbfe1d765
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-2.c
@@ -0,0 +1,18 @@ 
+/* { dg-options "-fopenmp -O2" } */
+
+/* Test that simd clones are not generated for functions with 
+   "declare target" but unsuitable arguments.  */
+
+struct s {
+  int a;
+  int b;
+};
+  
+#pragma omp declare target
+int addit (struct s x)
+{
+  return x.a + x.b;
+}
+#pragma omp end declare target
+
+/* { dg-final { scan-assembler-not "_Z.*_addit" { target i?86-*-* x86_64-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/target-simd-clone-3.c b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-3.c
new file mode 100644
index 00000000000..c313cfe53b0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-3.c
@@ -0,0 +1,17 @@ 
+/* { dg-options "-fopenmp -O2" } */
+
+/* Test that simd clones are not generated for functions with 
+   "declare target" but that call possibly side-effecting functions 
+   in the body.  */
+
+extern int f (int);
+
+#pragma omp declare target
+int addit(int a, int b, int c)
+{
+  return f(a) + b;
+}
+#pragma omp end declare target
+
+/* { dg-final { scan-assembler-not "_Z.*_addit" { target i?86-*-* x86_64-*-* } } } */
+
diff --git a/gcc/testsuite/gcc.dg/gomp/target-simd-clone-4.c b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-4.c
new file mode 100644
index 00000000000..e32b22f6a59
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-4.c
@@ -0,0 +1,16 @@ 
+/* { dg-options "-fopenmp -O2" } */
+
+/* Test that simd clones are not generated for functions with 
+   "declare target" but that write memory in the body.  */
+
+extern int save;
+
+#pragma omp declare target
+int addit(int a, int b, int c)
+{
+  save = c;
+  return a + b;
+}
+#pragma omp end declare target
+
+/* { dg-final { scan-assembler-not "_Z.*_addit" { target i?86-*-* x86_64-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/target-simd-clone-5.c b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-5.c
new file mode 100644
index 00000000000..d39a9ab737f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-5.c
@@ -0,0 +1,13 @@ 
+/* { dg-options "-fopenmp -Os" } */
+
+/* Test that simd clones are not generated for functions with 
+   "declare target" at -Os.  */
+
+#pragma omp declare target
+int addit(int a, int b, int c)
+{
+  return a + b;
+}
+#pragma omp end declare target
+
+/* { dg-final { scan-assembler-not "_Z.*_addit" { target i?86-*-* x86_64-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/target-simd-clone-6.c b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-6.c
new file mode 100644
index 00000000000..a0c529b1c4e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/target-simd-clone-6.c
@@ -0,0 +1,13 @@ 
+/* { dg-options "-fopenmp -Og" } */
+
+/* Test that simd clones are not generated for functions with 
+   "declare target" at -Og.  */
+
+#pragma omp declare target
+int addit(int a, int b, int c)
+{
+  return a + b;
+}
+#pragma omp end declare target
+
+/* { dg-final { scan-assembler-not "_Z.*_addit" { target i?86-*-* x86_64-*-* } } } */
-- 
2.31.1