libgomp/nvptx: Prepare for reverse-offload callback handling

Message ID 57b3ae5e-8f15-8bea-fa09-39bccbaa2414@codesourcery.com
State New

Commit Message

Tobias Burnus Aug. 26, 2022, 9:07 a.m. UTC
  @Tom and Alexander: Better suggestions are welcome for the busy loop in
libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
its value.


PRE-REMARK

As nvptx (and all other plugins) returns <= 0 for
GOMP_OFFLOAD_get_num_devices if GOMP_REQUIRES_REVERSE_OFFLOAD is
set, this patch is currently still a no-op.

The patch is almost stand-alone, except that it either needs a
  void *rev_fn_table = NULL;
in GOMP_OFFLOAD_load_image or the following patch:
  [Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
  https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch).

The reverse-offload ME (middle-end) patch is not required for this patch to
compile, but this patch is based on its ideas/code; the ME patch adds calls to
  GOMP_target_ext (omp_initial_device, ...)
which for host-fallback code are processed by the normal GOMP_target_ext and
for device code by the GOMP_target_ext of this patch.
→ "[Patch] OpenMP: Support reverse offload (middle end part)"
  https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html

 * * *

This patch adds initial reverse-offload support for nvptx.
When the nvptx device-side GOMP_target_ext is called, it takes a lock,
fills a struct with the argument pointers (addrs, sizes, kinds) and its
device number, and finally sets the function pointer address.
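
For reference, the device side of this handshake, condensed from the
config/nvptx/target.c changes in the attached patch:

  /* In GOMP_target_ext on the device: serialize reverse offloads, publish
     the arguments, set 'fn' last, and wait until the host clears it.  */
  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
    ;  /* spin  */

  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
                    GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);

  /* 'fn' must be last; the host polls it and calls gomp_target_rev.  */
  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);

  /* Processed on the host - when done, fn is set back to 0 there.  */
  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
    ;  /* spin  */
  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);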

On the host side, the last-written field ('fn') is checked: if fn_addr != NULL,
all arguments are passed on to the generic (target.c) gomp_target_rev
to do the actual reverse offload.

CUDA locks up when trying to copy data from the currently running
stream; hence, a new stream is created to do the memory copying.
Just having managed memory is not enough - it needs to be concurrently
accessible - otherwise, accessing it on the host segfaults once it has
been migrated to the device.

OK for mainline?

 * * *

Future work for nvptx:
* Adjust the 'sleep', possibly using different values with and without USM and
  doing shorter sleeps than usleep(1)?
* Set a flag whether there is any reverse-offload function at all, to avoid
  running the more expensive check when 'requires reverse_offload' is set but
  no actual reverse-offload functions are present.
  (Recall that the '2/3' patch, mentioned above, only has fn != NULL for
  reverse-offload functions.)
* Document in libgomp.texi that reverse offload may cause some performance
  overhead for all target regions, and that reverse offloads are run serialized.

And obviously: submitting the missing bits to get reverse offload working,
but that's mostly not an nvptx topic.

Tobias


  

Comments

Alexander Monakov Aug. 26, 2022, 2:56 p.m. UTC | #1
On Fri, 26 Aug 2022, Tobias Burnus wrote:

> @Tom and Alexander: Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

I think to do that without polling you can use PTX 'brkpt' instruction on the
device and CUDA Debugger API on the host (but you'd have to be careful about
interactions with the real debugger).

How did the standardization process for this feature look, and how did it pass
if it's not efficiently implementable for the major offloading targets?

Alexander
  
Jakub Jelinek Sept. 9, 2022, 3:49 p.m. UTC | #2
On Fri, Aug 26, 2022 at 05:56:09PM +0300, Alexander Monakov via Gcc-patches wrote:
> 
> On Fri, 26 Aug 2022, Tobias Burnus wrote:
> 
> > @Tom and Alexander: Better suggestions are welcome for the busy loop in
> > libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> > its value.
> 
> I think to do that without polling you can use PTX 'brkpt' instruction on the
> device and CUDA Debugger API on the host (but you'd have to be careful about
> interactions with the real debugger).
> 
> How did the standardization process for this feature look, and how did it pass
> if it's not efficiently implementable for the major offloading targets?

It doesn't have to be implementable on all major offloading targets, it is
enough when it can work on some.  As one needs to request the reverse
offloading through a declarative directive, it is always possible in that
case to just pretend devices that don't support it don't exist.

But it would be really nice to support it even on PTX.

Are there any other implementations of reverse offloading to PTX already?

	Jakub
  
Jakub Jelinek Sept. 9, 2022, 3:51 p.m. UTC | #3
On Fri, Aug 26, 2022 at 11:07:28AM +0200, Tobias Burnus wrote:
> @Tom and Alexander: Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

I'm afraid you need Alexander or Tom here, I don't feel I can review it;
I could rubber stamp it if they are ok with it.

	Jakub
  
Tobias Burnus Sept. 13, 2022, 7:07 a.m. UTC | #4
@Alexander/@Tom – Can you comment on both libgomp/config/nvptx + libgomp/plugin/plugin-nvptx.c ? (Comments on the rest are welcome, too)

(Updated patch enclosed)

Because Jakub asked:

I'm afraid you need Alexander or Tom here, I don't feel I can review it;
I could rubber stamp it if they are ok with it.


Regarding:

How did the standardization process for this feature look, and how did it pass
if it's not efficiently implementable for the major offloading targets?

It doesn't have to be implementable on all major offloading targets, it is enough when it can work on some. As one needs to request the reverse offloading through a declarative directive, it is always possible in that case to just pretend devices that don't support it don't exist.

First, I think it is better to provide a feature, even if it is slow, than not to provide it at all. Secondly, as Jakub mentioned, it is not required that all devices support this feature well; it is sufficient that some do.

I believe one of the main uses is debugging, and for that use the performance is not critical. This patch attempts to have no overhead if the feature is not used (no 'omp requires reverse_offload' and no actual reverse offload).

Additionally, for GCN, it can be implemented with almost no overhead by using the mechanism already used for I/O. (CUDA implements 'printf' internally – but does not permit piggybacking on this mechanism.)

* * *

I think in the future, we could additionally pass information to GOMP_target_ext on whether a target region is known not to do reverse offload – both by checking what's in the region and by utilizing an 'absent(target)' assumption placed in the outer target region via an '(begin) assume(s)' directive. That should at least help with the common case of having no reverse offload – even if it does not help for some large kernel which does use reverse offload for non-debugging purposes (e.g. to trigger file I/O or inter-node communication).
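
Purely for illustration (not part of this patch), such an assumption in user
code could look like the following sketch; 'foo' and the doubling are just
placeholders:

  void
  foo (int *x)
  {
    /* The 'assume absent(target)' promises that no 'target' construct -
       and thus no reverse offload - occurs inside this region, so an
       implementation could skip the host-side polling for it.  */
  #pragma omp target map(tofrom: x[0:1])
  #pragma omp assume absent(target)
    {
      x[0] = 2 * x[0];  /* plain device code, no nested 'target' here  */
    }
  }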

* * *

Regarding the implementation: I left in 'usleep(1)' for now – 1µs seems to be not too bad and I have no idea what's better.

I also don't have any idea what the overhead is for accessing concurrently accessible managed memory from the host (does it stay on the host until requested from the device – or does it always live on the device and need to be copied/migrated to the host for every access?). Likewise, I don't know how much overhead the D2H copy of the memory via the second CUDA stream causes.

Suggestions are welcome. But as this code is strictly confined to a single function, it can easily be modified later.

Documentation: I have not mentioned caveats in https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html as the reverse offload is not yet enabled, even with this patch.

  
Alexander Monakov Sept. 21, 2022, 8:06 p.m. UTC | #5
Hi.

On the high level, I'd be highly uncomfortable with this. I guess we are in
vague agreement that it cannot be efficiently implemented. It also goes
against the good practice of accelerator programming, which requires queueing
work on the accelerator and letting it run asynchronously with the CPU with high
occupancy.

(I know libgomp still waits for the GPU to finish in each GOMP_offload_run,
but maybe it's better to improve *that* instead of piling on new slowness)

What I said above also applies to MPI+GPU scenarios: a well-designed algorithm
should arrange for MPI communications to happen in parallel with some useful
offloaded calculations. I don't see the value in implementing the ability to
invoke an MPI call from the accelerator in such inefficient fashion.

(so yes, I disagree with "it is better to provide a feature even if it is slow –
than not providing it at all", when it is advertised as a general-purpose
feature, not a purely debugging helper)


On to the patch itself. IIRC one of the questions was use of CUDA managed
memory. I think it is unsafe because device-issued atomics are not guaranteed
to appear atomic to the host, unless compiling for compute capability 6.0 or
above, and using system-scope atomics ("atom.sys").

And for non-USM code path you're relying on cudaMemcpy observing device-side
atomics in the right order.

Atomics aside, CUDA pinned memory would be a natural choice for such a tiny
structure. Did you rule it out for some reason?

Some remarks on the diff below, not intended to be a complete review.

Alexander


> --- a/libgomp/config/nvptx/target.c
> +++ b/libgomp/config/nvptx/target.c
> @@ -26,7 +26,29 @@
>  #include "libgomp.h"
>  #include <limits.h>
>  
> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var

Shouldn't this be in a header (needs to be in sync with the plugin).

> +
> +/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
> +struct rev_offload {
> +  uint64_t fn;
> +  uint64_t mapnum;
> +  uint64_t addrs;
> +  uint64_t sizes;
> +  uint64_t kinds;
> +  int32_t dev_num;
> +  uint32_t lock;
> +};

Likewise.

> +
> +#if (__SIZEOF_SHORT__ != 2 \
> +     || __SIZEOF_SIZE_T__ != 8 \
> +     || __SIZEOF_POINTER__ != 8)
> +#error "Data-type conversion required for rev_offload"
> +#endif

Huh? This is not a requirement that is new for reverse offload, it has always
been like that for offloading (all ABI rules regarding type sizes, struct
layout, bitfield layout, endianness must match).

> +
> +
>  extern int __gomp_team_num __attribute__((shared));
> +extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
> +volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
>  
>  bool
>  GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
> @@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>  		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
>  		 unsigned int flags, void **depend, void **args)
>  {
> -  (void) device;
> -  (void) fn;
> -  (void) mapnum;
> -  (void) hostaddrs;
> -  (void) sizes;
> -  (void) kinds;
>    (void) flags;
>    (void) depend;
>    (void) args;
> -  __builtin_unreachable ();
> +
> +  if (device != GOMP_DEVICE_HOST_FALLBACK
> +      || fn == NULL
> +      || GOMP_REV_OFFLOAD_VAR == NULL)
> +    return;

Shouldn't this be an 'assert' instead?

> +
> +  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
> +    ;  /* spin  */
> +
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
> +		    GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);

Looks like all these can be plain stores, you only need ...

> +
> +  /* 'fn' must be last.  */
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);

... this to be atomic with 'release' semantics in the usual producer-consumer
pattern.

> +
> +  /* Processed on the host - when done, fn is set to NULL.  */
> +  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
> +    ;  /* spin  */
> +  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
>  }
>  
>  void
> diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
> index 9d4cc62..316de74 100644
> --- a/libgomp/libgomp-plugin.c
> +++ b/libgomp/libgomp-plugin.c
> @@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
>    gomp_vfatal (msg, ap);
>    va_end (ap);
>  }
> +
> +void
> +GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
> +			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> +			void (*dev_to_host_cpy) (void *, const void *, size_t,
> +						 void *),
> +			void (*host_to_dev_cpy) (void *, const void *, size_t,
> +						 void *), void *token)
> +{
> +  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
> +		   dev_to_host_cpy, host_to_dev_cpy, token);
> +}
> diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
> index 6ab5ac6..875f967 100644
> --- a/libgomp/libgomp-plugin.h
> +++ b/libgomp/libgomp-plugin.h
> @@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
>  extern void GOMP_PLUGIN_fatal (const char *, ...)
>  	__attribute__ ((noreturn, format (printf, 1, 2)));
>  
> +extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
> +				    uint64_t, int,
> +				    void (*) (void *, const void *, size_t,
> +					      void *),
> +				    void (*) (void *, const void *, size_t,
> +					      void *), void *);
> +
>  /* Prototypes for functions implemented by libgomp plugins.  */
>  extern const char *GOMP_OFFLOAD_get_name (void);
>  extern unsigned int GOMP_OFFLOAD_get_caps (void);
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 7519274..5803683 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
>  extern void gomp_init_targets_once (void);
>  extern int gomp_get_num_devices (void);
>  extern bool gomp_target_task_fn (void *);
> +extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
> +			     int,
> +			     void (*) (void *, const void *, size_t, void *),
> +			     void (*) (void *, const void *, size_t, void *),
> +			     void *);
>  
>  /* Splay tree definitions.  */
>  typedef struct splay_tree_node_s *splay_tree_node;
> diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
> index 46d5f10..12f76f7 100644
> --- a/libgomp/libgomp.map
> +++ b/libgomp/libgomp.map
> @@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
>  	GOMP_PLUGIN_goacc_profiling_dispatch;
>  	GOMP_PLUGIN_goacc_thread;
>  } GOMP_PLUGIN_1.2;
> +
> +GOMP_PLUGIN_1.4 {
> +  global:
> +	GOMP_PLUGIN_target_rev;
> +} GOMP_PLUGIN_1.3;
> diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
> index cd91b39..61359c7 100644
> --- a/libgomp/plugin/cuda-lib.def
> +++ b/libgomp/plugin/cuda-lib.def
> @@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
>  CUDA_ONE_CALL (cuLinkDestroy)
>  CUDA_ONE_CALL (cuMemAlloc)
>  CUDA_ONE_CALL (cuMemAllocHost)
> +CUDA_ONE_CALL (cuMemAllocManaged)
>  CUDA_ONE_CALL (cuMemcpy)
>  CUDA_ONE_CALL (cuMemcpyDtoDAsync)
>  CUDA_ONE_CALL (cuMemcpyDtoH)
> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index ba6b229..1bd9ee2 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -54,6 +54,8 @@
>  #include <assert.h>
>  #include <errno.h>
>  
> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
> +
>  /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
>     block to cache between kernel invocations.  For soft-stacks blocks bigger
>     than this, we will free the block before attempting another GPU memory
> @@ -274,6 +276,17 @@ struct targ_fn_descriptor
>    int max_threads_per_block;
>  };
>  
> +/* Reverse offload. Must match version used in config/nvptx/target.c. */
> +struct rev_offload {
> +  uint64_t fn;
> +  uint64_t mapnum;
> +  uint64_t addrs;
> +  uint64_t sizes;
> +  uint64_t kinds;
> +  int32_t dev_num;
> +  uint32_t lock;
> +};
> +
>  /* A loaded PTX image.  */
>  struct ptx_image_data
>  {
> @@ -302,6 +315,7 @@ struct ptx_device
>    bool map;
>    bool concur;
>    bool mkern;
> +  bool concurr_managed_access;
>    int mode;
>    int clock_khz;
>    int num_sms;
> @@ -329,6 +343,9 @@ struct ptx_device
>        pthread_mutex_t lock;
>      } omp_stacks;
>  
> +  struct rev_offload *rev_data;
> +  CUdeviceptr rev_data_dev;
> +
>    struct ptx_device *next;
>  };
>  
> @@ -423,7 +440,7 @@ nvptx_open_device (int n)
>    struct ptx_device *ptx_dev;
>    CUdevice dev, ctx_dev;
>    CUresult r;
> -  int async_engines, pi;
> +  int pi;
>  
>    CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
>  
> @@ -519,11 +536,17 @@ nvptx_open_device (int n)
>  		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
>    ptx_dev->max_threads_per_multiprocessor = pi;
>  
> +#if 0
> +  int async_engines;
>    r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
>  			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
>    if (r != CUDA_SUCCESS)
>      async_engines = 1;
> +#endif
>  
> +  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
> +			 CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
> +  ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
>    for (int i = 0; i != GOMP_DIM_MAX; i++)
>      ptx_dev->default_dims[i] = 0;
>  
> @@ -1380,7 +1403,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
>    else if (rev_fn_table)
>      {
>        CUdeviceptr var;
> -      size_t bytes;
> +      size_t bytes, i;
>        r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
>  			     "$offload_func_table");
>        if (r != CUDA_SUCCESS)
> @@ -1390,6 +1413,47 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
>        r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
>        if (r != CUDA_SUCCESS)
>  	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
> +      /* Free if only NULL entries.  */
> +      for (i = 0; i < fn_entries; ++i)
> +	if (rev_fn_table[i] != 0)
> +	  break;
> +      if (i == fn_entries)
> +	{
> +	  free (*rev_fn_table);
> +	  *rev_fn_table = NULL;
> +	}
> +    }
> +
> +  if (rev_fn_table && dev->rev_data == NULL)
> +    {
> +      CUdeviceptr dp = 0;
> +      if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
> +	{
> +	  CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
> +			    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
> +	  dp = (CUdeviceptr) dev->rev_data;
> +	}
> +      else
> +	{
> +	  CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
> +			    sizeof (*dev->rev_data));
> +	  memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
> +	  CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
> +			    sizeof (*dev->rev_data));
> +	  dp = dev->rev_data_dev;
> +	}
> +      CUdeviceptr device_rev_offload_var;
> +      size_t device_rev_offload_size;
> +      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
> +				      &device_rev_offload_var,
> +				      &device_rev_offload_size, module,
> +				      XSTRING (GOMP_REV_OFFLOAD_VAR));
> +      if (r != CUDA_SUCCESS)
> +	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
> +      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
> +			     sizeof (dp));
> +      if (r != CUDA_SUCCESS)
> +	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
>      }
>  
>    nvptx_set_clocktick (module, dev);
> @@ -2001,6 +2065,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
>    return (void *) ptx_dev->omp_stacks.ptr;
>  }
>  
> +
> +void
> +rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
> +			 CUstream stream)
> +{
> +  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
> +  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
> +}
> +
> +void
> +rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
> +			 CUstream stream)
> +{
> +  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
> +  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
> +}
> +
>  void
>  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>  {
> @@ -2035,6 +2116,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>  
>    size_t stack_size = nvptx_stacks_size ();
> +  bool reverse_off = ptx_dev->rev_data != NULL;
> +  bool has_usm = (ptx_dev->concurr_managed_access
> +		  && CUDA_CALL_EXISTS (cuMemAllocManaged));
> +  CUstream copy_stream = NULL;
>  
>    pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>    void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
> @@ -2048,12 +2133,62 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
>  		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
>  		     __FUNCTION__, fn_name, teams, threads);
> +  if (reverse_off)
> +    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>    r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>  			 32, threads, 1, 0, NULL, NULL, config);
>    if (r != CUDA_SUCCESS)
>      GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
> -
> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> +  if (reverse_off)
> +    while (true)
> +      {
> +	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
> +	if (r == CUDA_SUCCESS)
> +	  break;
> +	if (r == CUDA_ERROR_LAUNCH_FAILED)
> +	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
> +			     maybe_abort_msg);
> +	else if (r != CUDA_ERROR_NOT_READY)
> +	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
> +	if (!has_usm)
> +	  {
> +	    CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
> +			      ptx_dev->rev_data_dev,
> +			      sizeof (*ptx_dev->rev_data), copy_stream);
> +	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
> +	  }
> +	if (ptx_dev->rev_data->fn != 0)

Surely this needs to be an atomic load with 'acquire' semantics in has_usm case?

> +	  {
> +	    struct rev_offload *rev_data = ptx_dev->rev_data;
> +	    uint64_t fn_ptr = rev_data->fn;
> +	    uint64_t mapnum = rev_data->mapnum;
> +	    uint64_t addr_ptr = rev_data->addrs;
> +	    uint64_t sizes_ptr = rev_data->sizes;
> +	    uint64_t kinds_ptr = rev_data->kinds;
> +	    int dev_num = (int) rev_data->dev_num;
> +	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
> +				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
> +				    rev_off_host_to_dev_cpy, copy_stream);
> +	    rev_data->fn = 0;

Atomic store?

> +	    if (!has_usm)
> +	      {
> +		/* fn is the first element. */
> +		r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
> +				       ptx_dev->rev_data_dev,
> +				       ptx_dev->rev_data,
> +				       sizeof (ptx_dev->rev_data->fn),
> +				       copy_stream);
> +		if (r != CUDA_SUCCESS)
> +		  GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
> +		CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
> +	      }
> +	  }
> +	usleep (1);
> +      }
> +  else
> +    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> +  if (reverse_off)
> +    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>    if (r == CUDA_ERROR_LAUNCH_FAILED)
>      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
>  		       maybe_abort_msg);
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 5763483..9377de0 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>      htab_free (refcount_set);
>  }
>  
> +/* Handle reverse offload. This is called by the device plugins for a
> +   reverse offload; it is not called if the outer target runs on the host.  */
> +
> +void
> +gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
> +		 uint64_t mapnum __attribute__ ((unused)),
> +		 uint64_t devaddrs_ptr __attribute__ ((unused)),
> +		 uint64_t sizes_ptr __attribute__ ((unused)),
> +		 uint64_t kinds_ptr __attribute__ ((unused)),
> +		 int dev_num __attribute__ ((unused)),
> +		 void (*dev_to_host_cpy) (void *, const void *, size_t,
> +					  void *) __attribute__ ((unused)),
> +		 void (*host_to_dev_cpy) (void *, const void *, size_t,
> +					  void *) __attribute__ ((unused)),
> +		 void *token __attribute__ ((unused)))
> +{
> +  __builtin_unreachable ();
> +}
> +
>  /* Host fallback for GOMP_target_data{,_ext} routines.  */
>  
>  static void
  
Tobias Burnus Sept. 26, 2022, 3:07 p.m. UTC | #6
Hi Alexander,

On 21.09.22 22:06, Alexander Monakov wrote:
> It also goes
> against the good practice of accelerator programming, which requires queueing
> work on the accelerator and letting it run asynchronously with the CPU with high
> occupancy.
> (I know libgomp still waits for the GPU to finish in each GOMP_offload_run,
> but maybe it's better to improve *that* instead of piling on new slowness)

Doesn't OpenMP 'nowait' permit this? (+ 'depend' clause if needed).

> On to the patch itself.

> And for non-USM code path you're relying on cudaMemcpy observing device-side
> atomics in the right order.
> Atomics aside, CUDA pinned memory would be a natural choice for such a tiny
> structure. Did you rule it out for some reason?

I did use pinned memory (cuMemAllocHost) – but somehow it did escape me
that:

"All host memory allocated in all contexts using cuMemAllocHost() and
cuMemHostAlloc() is always directly accessible from all contexts on all
devices that support unified addressing."

I have now updated the patch (but using cuMemHostAlloc instead, with a flag, in
the hope that this choice is a tad faster).
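
Roughly along these lines - a sketch only, not the exact code of the updated
patch; in particular, the cuMemHostAlloc flag shown (CU_MEMHOSTALLOC_DEVICEMAP)
is just an assumption, and cuMemHostGetDevicePointer would additionally need an
entry in cuda-lib.def:

  /* Allocate the tiny rev_offload struct in pinned host memory that the
     device can access directly.  */
  struct rev_offload *rev_data;
  CUdeviceptr rev_data_dev;
  CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &rev_data,
                    sizeof (*rev_data), CU_MEMHOSTALLOC_DEVICEMAP);
  memset (rev_data, '\0', sizeof (*rev_data));
  /* Device-side address aliasing the same physical memory; with unified
     addressing this is numerically the same as the host pointer.  */
  CUDA_CALL_ASSERT (cuMemHostGetDevicePointer, &rev_data_dev,
                    (void *) rev_data, 0);
  /* rev_data_dev is then written into the device's GOMP_REV_OFFLOAD_VAR,
     as done in GOMP_OFFLOAD_load_image.  */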

>> +++ b/libgomp/config/nvptx/target.c
>> ...
>> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
> Shouldn't this be in a header (needs to be in sync with the plugin).
I have now created one.
>> +
>> +#if (__SIZEOF_SHORT__ != 2 \
>> +     || __SIZEOF_SIZE_T__ != 8 \
>> +     || __SIZEOF_POINTER__ != 8)
>> +#error "Data-type conversion required for rev_offload"
>> +#endif
> Huh? This is not a requirement that is new for reverse offload, it has always
> been like that for offloading (all ABI rules regarding type sizes, struct
> layout, bitfield layout, endianness must match).

In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
-foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.
In practice, -m64 everywhere is required. I just want to make sure that
for this code the sizes are fine because, here, I am sure it breaks. For
other parts, I think the 64bit assumption is coded in but I am not
completely sure that's really the case.

>> +  if (device != GOMP_DEVICE_HOST_FALLBACK
>> +      || fn == NULL
>> +      || GOMP_REV_OFFLOAD_VAR == NULL)
>> +    return;
> Shouldn't this be an 'assert' instead?

This tries to mimic what was there before – doing nothing. In any case,
this code path is unspecified or implementation defined (I forgot which
of the two), but a user might still be able to construct such code.

I leave it to Jakub whether he likes to have an assert, a error/warning
message, or just the return here.

>> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
>> +                GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);
> Looks like all these can be plain stores, you only need ...
>
>> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
> ... this to be atomic with 'release' semantics in the usual producer-consumer
> pattern.
>
>> +  if (ptx_dev->rev_data->fn != 0)
> Surely this needs to be an atomic load with 'acquire' semantics in has_usm case?
>> +        rev_data->fn = 0;
>
> Atomic store?

Done so – updated patch attached. Thanks for the comments.
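
For reference, the host-side consumer in the updated patch then boils down to
the following (condensed from the GOMP_OFFLOAD_run polling loop quoted in full
later in this thread):

  if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
    {
      /* A reverse-offload request: run it on the host, then release-store 0
         so the device-side spin in GOMP_target_ext can continue.  */
      struct rev_offload *rev_data = ptx_dev->rev_data;
      GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum, rev_data->addrs,
                              rev_data->sizes, rev_data->kinds,
                              rev_data->dev_num, rev_off_dev_to_host_cpy,
                              rev_off_host_to_dev_cpy, copy_stream);
      CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
      __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
    }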

Tobias
  
Alexander Monakov Sept. 26, 2022, 5:45 p.m. UTC | #7
Hi.

My main concerns remain not addressed:

1) what I said in the opening paragraphs of my previous email;

2) device-issued atomics are not guaranteed to appear atomic to the host
unless using atom.sys and translating for CUDA compute capability 6.0+.

Item 2 is a correctness issue. Item 1 I think is a matter of policy that
is up to you to hash out with Jakub.

On Mon, 26 Sep 2022, Tobias Burnus wrote:

> In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
> -foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.

I have no words.

Alexander
  
Tobias Burnus Sept. 27, 2022, 9:23 a.m. UTC | #8
Hi,

On 26.09.22 19:45, Alexander Monakov wrote:

My main concerns remain not addressed:
1) what I said in the opening paragraphs of my previous email;

(i.e. the general disagreement about whether the feature itself should be implemented for nvptx or not.)

2) device-issued atomics are not guaranteed to appear atomic to the host
unless using atom.sys and translating for CUDA compute capability 6.0+.

As you seem to have no other rough review comments, this can now be addressed :-)

We do support
  #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
and we also can configure GCC with
  --with-arch=sm_70 (or sm_80 or ...)
Thus, adding atomics with .sys scope is possible.

See attached patch. This seems to work fine and I hope I got the
assembly right in terms of atomic use. (And I do believe that the
.release/.acquire do not need an additional __sync_synchronize()/"membar.sys".)
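
The inline assembly in the attached patch ends up roughly as follows (the full
code is quoted later in this thread); shown here only as a condensed summary:

  #if __PTX_SM__ >= 700
    asm volatile ("st.global.release.sys.u64 [%0], %1;"
                  : : "r" (addr_struct_fn), "r" (fn) : "memory");
    /* ... and, on the waiting side, the matching acquire load: */
    asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
                  : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
  #else
    __sync_synchronize ();  /* membar.sys */
    asm volatile ("st.volatile.global.u64 [%0], %1;"
                  : : "r" (addr_struct_fn), "r" (fn) : "memory");
  #endif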

Ignoring (1), does the overall patch and this part otherwise look okay(ish)?


Caveat: The .sys scope works well with >= sm_60 but does not handle older
versions. For those, the __atomic_{load/store}_n are used.
I do not see a good solution beyond documentation. In the way it is used
(only one thread setting an on/off flag, no atomic increments etc.), I think it is
unlikely to cause races without .sys scope, but as always it is difficult to rule out
some special unfortunate case where it does. At least we now have some
documentation (in general) - which still needs to be expanded and improved.
For this feature, I did not add any wording in this patch: until the feature
is actually enabled, it would be more confusing than helpful.


On Mon, 26 Sep 2022, Tobias Burnus wrote:


In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
-foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.


I have no words.

@node Nvidia PTX Options
...
@item -m64
@opindex m64
Ignored, but preserved for backward compatibility.  Only 64-bit ABI is
supported.

And in config/nvptx/mkoffload.cc you also still find leftovers from -m32.

Tobias


  
Alexander Monakov Sept. 28, 2022, 1:16 p.m. UTC | #9
On Tue, 27 Sep 2022, Tobias Burnus wrote:

> Ignoring (1), does the overall patch and this part otherwise look okay(ish)?
> 
> 
> Caveat: The .sys scope works well with >= sm_60 but does not handle
> older versions. For those, the __atomic_{load/store}_n are used.  I do not
> see a good solution beyond documentation. In the way it is used (only one
> thread setting an on/off flag, no atomic increments etc.), I think
> it is unlikely to cause races without .sys scope, but as always it is
> difficult to rule out some special unfortunate case where it does. At
> least we now have some documentation (in general) - which still needs
> to be expanded and improved.  For this feature, I did not add any wording
> in this patch: until the feature is actually enabled, it would be more
> confusing than helpful.

If the implication is that distros will ship a racy-by-default implementation,
unless they know about the problem and configure for sm_60, then no, that
doesn't look fine to me. A possible solution is not enabling a feature that
has a known correctness issue.

Alexander
  
Tobias Burnus Oct. 2, 2022, 6:13 p.m. UTC | #10
On 27.09.22 11:23, Tobias Burnus wrote:

We do support
  #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
and we also can configure GCC with
  --with-arch=sm_70 (or sm_80 or ...)
Thus, adding atomics with .sys scope is possible.

See attached patch. This seems to work fine and I hope I got the
assembly right in terms of atomic use. (And I do believe that the
.release/.acquire do not need an additional __sync_synchronize()/"membar.sys".)

Regarding this:

While 'atom.op' (op = and/or/xor/cas/exch/add/inc/dec/min/max)
with scope is a sm_60 feature, the used 'st/ld' with scope qualifier
and .relaxed, .release / .relaxed, .acquire require sm_70.

(Does not really matter much, as currently only ..., sm_53 and sm_70, ... are
supported but not sm_60; still, the #if should obviously be fixed.)

 * * *

Looking at the code generated without the inline assembler, we have, instead of
  st.global.release.sys.u64 [%r27],%r39;
and
  ld.acquire.sys.global.u64 %r62,[%r27];
for the older-systems (__PTX_SM < 700) the code:
  @ %r69 membar.sys;
  @ %r69 atom.exch.b64 _,[%r27],%r41;
and
  ld.global.u64 %r64,[__gomp_rev_offload_var];
  ld.u64 %r36,[%r64];
  membar.sys;

In my understanding, the membar.sys ensures - similar to
  st.release / ld.acquire
that the memory handling is done in the correct order in scope .sys.
As the 'fn' variable is initially 0 - and then only set via the device,
i.e. there is eventually a DMA write device->host, which is atomic
as the whole int64_t is written at once (and not, e.g., first the lower
and then the upper half) - the 'st'/'atom.exch' should work fine, despite
having no .sys scope.

Likewise, the membar.sys also applies in the other direction. Or did I
miss something? If so, would an explicit __sync_synchronize() (= membar.sys)
help between the 'st' and the 'ld'?

Tobias


  
Tobias Burnus Oct. 7, 2022, 2:26 p.m. UTC | #11
Updated patch enclosed. Changes:

* Fixes the sm >= 700 issue, I noted before (cf. below)

* The < sm_70 code is still in, but disabled at user-compile time, with a warning, if libgomp.a wasn't compiled with sm_70 or higher. (mkoffload strips the nvptx offload code)

* Some minor cleanup

OK for mainline?

Tobias

On 02.10.22 20:13, Tobias Burnus wrote:
On 27.09.22 11:23, Tobias Burnus wrote:

We do support
  #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
and we also can configure GCC with
  --with-arch=sm_70 (or sm_80 or ...)
Thus, adding atomics with .sys scope is possible.

See attached patch. This seems to work fine and I hope I got the
assembly right in terms of atomic use. (And I do believe that the
.release/.acquire do not need an additional __sync_synchronize()/"membar.sys".)

Regarding this:

While 'atom.op' (op = and/or/xor/cas/exch/add/inc/dec/min/max)
with scope is a sm_60 feature, the used 'st/ld' with scope qualifier
and .relaxed, .release / .relaxed, .acquire require sm_70.

(Does not really matter much, as currently only ..., sm_53 and sm_70, ... are
supported but not sm_60; still, the #if should obviously be fixed.)

 * * *

Looking at the code generated without the inline assembler, we have, instead of
  st.global.release.sys.u64 [%r27],%r39;
and
  ld.acquire.sys.global.u64 %r62,[%r27];
for the older-systems (__PTX_SM < 700) the code:
  @ %r69 membar.sys;
  @ %r69 atom.exch.b64 _,[%r27],%r41;
and
  ld.global.u64 %r64,[__gomp_rev_offload_var];
  ld.u64 %r36,[%r64];
  membar.sys;

In my understanding, the membar.sys ensures - similar to
  st.release / ld.acquire
that the memory handling is done in the correct order in scope .sys.
As the 'fn' variable is initially 0 - and then only set via the device,
i.e. there is eventually a DMA write device->host, which is atomic
as the whole int64_t is written at once (and not, e.g., first the lower
and then the upper half) - the 'st'/'atom.exch' should work fine, despite
having no .sys scope.

Likewise, the membar.sys also applies in the other direction. Or did I
miss something? If so, would an explicit __sync_synchronize() (= membar.sys)
help between the 'st' and the 'ld'?

Tobias


-------------- next part --------------
A non-text attachment was scrubbed...
Name: rev-offload-run-nvptx-v5.diff
Type: text/x-patch
Size: 23685 bytes
Desc: not available
URL: <https://gcc.gnu.org/pipermail/gcc-patches/attachments/20221007/a2525d4a/attachment-0001.bin>
  
Thomas Schwinge April 4, 2023, 2:40 p.m. UTC | #12
Hi!

During GCC/OpenMP/nvptx reverse offload investigations, about how to
replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
something re:

On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

> On the host side, the last address is checked - if fn_addr != NULL,
> it passes all arguments on to the generic (target.c) gomp_target_rev
> to do the actual offloading.
>
> CUDA does lockup when trying to copy data from the currently running
> stream; hence, a new stream is generated to do the memory copying.

> Future work for nvptx:
> * Adjust 'sleep', possibly [...]
>   to do shorter sleeps than usleep(1)?

... this busy loop.

Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':

    [...]
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
      r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                             32, threads, 1, 0, NULL, NULL, config);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
      if (reverse_offload)
        while (true)
          {
            r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
            if (r == CUDA_SUCCESS)
              break;
            if (r == CUDA_ERROR_LAUNCH_FAILED)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
                                 maybe_abort_msg);
            else if (r != CUDA_ERROR_NOT_READY)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));

            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
              {
                struct rev_offload *rev_data = ptx_dev->rev_data;
                GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                        rev_data->addrs, rev_data->sizes,
                                        rev_data->kinds, rev_data->dev_num,
                                        rev_off_dev_to_host_cpy,
                                        rev_off_host_to_dev_cpy, copy_stream);
                CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
                __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
              }
            usleep (1);
          }
      else
        r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
    [...]

Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
to use "Stream Memory Operations",
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
that allow to "Wait on a memory location", "until the given condition on
the memory is satisfied"?

For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':

    [...]
      GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
      GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
      GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
      GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
      GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

      /* Set 'fn' to trigger processing on the host; wait for completion,
         which is flagged by setting 'fn' back to 0 on the host.  */
      uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
    #if __PTX_SM__ >= 700
      asm volatile ("st.global.release.sys.u64 [%0], %1;"
                    : : "r"(addr_struct_fn), "r" (fn) : "memory");
    #else
      __sync_synchronize ();  /* membar.sys */
      asm volatile ("st.volatile.global.u64 [%0], %1;"
                    : : "r"(addr_struct_fn), "r" (fn) : "memory");
    #endif

    #if __PTX_SM__ >= 700
      uint64_t fn2;
      do
        {
          asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
                        : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
        }
      while (fn2 != 0);
    #else
      /* ld.global.u64 %r64,[__gomp_rev_offload_var];
         ld.u64 %r36,[%r64];
         membar.sys;  */
      while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
        ;  /* spin  */
    #endif
    [...]


Regards
 Thomas
  
Tobias Burnus April 28, 2023, 8:28 a.m. UTC | #13
Hi Thomas,

maybe I misunderstood your suggestion, but "Wait on a memory location"
assumes that there will be a change – but if a target region happens to
have no reverse offload, the memory location will never change, and yet
the target region should still return to the host.

What we would need: Wait on memory location – and return if either the
kernel stopped *or* the memory location changed.

My impression is that "return if the kernel stopped" is not really
guaranteed. Of did I miss some fineprint?

Tobias

  
Thomas Schwinge April 28, 2023, 9:23 a.m. UTC | #14
Hi Tobias!

On 2023-04-28T10:28:22+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> maybe I misunderstood your suggestion, but

First, note that those CUDA "Stream Memory Operations" are something that
I found by chance, and don't have any actual experience with.  I can't
seem to find a lot of documentation/usage of this API?

By the way, a similar thing also exists for AMD GPUs:
'hipStreamWaitValue32', etc.


> "Wait on a memory location"
> assumes that there will be a change – but if a target region happens to
> have no reverse offload, the memory location will never change, but
> still the target region should return to the host.

Oh indeed.  ;-) Details...

> What we would need: Wait on memory location – and return if either the
> kernel stopped *or* the memory location changed.

Or, have a way to "cancel", from the host, the 'cuStreamWaitValue32',
'cuStreamWaitValue64', after the actual 'target' kernel completed?

> My impression is that "return if the kernel stopped" is not really
> guaranteed. Of did I miss some fineprint?

No, you're right.  I suppose this is as designed: for example, generally,
there may be additional kernel launches, and the "wait" will then
eventually trigger.

Could we, after the actual 'target' kernel completed, issue a host-side
"write" ('cuStreamWriteValue32', 'cuStreamWriteValue64') to that memory
location, to signal end of processing for reverse offloads?

That is:

  - enqueue 'cuLaunchKernel'
  - enqueue 'cuStreamWriteValue' (to signal end of processing for reverse offloads)
  - loop on 'cuStreamWaitValue' (until end of processing for reverse offloads)
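
A very rough, untested sketch of that sequence in the plugin - assuming the
rev_offload struct lives in pinned/USM memory visible to both sides, that the
device supports stream memory operations, that cuStreamWriteValue64 and
cuStreamWaitValue64 get entries in cuda-lib.def, and that the sentinel value
below never collides with a real 'fn' address:

  CUdeviceptr fn_dptr = ptx_dev->rev_data_dev
                        + offsetof (struct rev_offload, fn);
  const uint64_t DONE = ~(uint64_t) 0;  /* assumed sentinel  */

  CUDA_CALL_ASSERT (cuLaunchKernel, function, teams, 1, 1,
                    32, threads, 1, 0, NULL, NULL, config);
  /* Enqueued after the kernel on the same stream: once the kernel has
     finished, 'fn' is overwritten with the sentinel so the wait below
     always wakes up.  */
  CUDA_CALL_ASSERT (cuStreamWriteValue64, NULL, fn_dptr, DONE, 0);

  while (true)
    {
      /* Block until 'fn' != 0: either a reverse-offload request or DONE.
         CU_STREAM_WAIT_VALUE_AND waits for (*addr & value) != 0.  */
      CUDA_CALL_ASSERT (cuStreamWaitValue64, copy_stream, fn_dptr,
                        ~(uint64_t) 0, CU_STREAM_WAIT_VALUE_AND);
      CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);

      uint64_t fn = __atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE);
      if (fn == DONE)
        break;
      /* ... handle the reverse offload as today, then clear 'fn'.  */
      __atomic_store_n (&ptx_dev->rev_data->fn, 0, __ATOMIC_RELEASE);
    }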


Regards
 Thomas


  

Patch

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the device-side GOMP_target_ext
call to the host: the device fills a struct with the call's arguments, the
host queries that struct while the kernel runs and invokes gomp_target_rev
on the result.
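
For context, a minimal (hypothetical) example of the user-level construct this
serves; the inner region is what the device-side GOMP_target_ext forwards to
the host:

#pragma omp requires reverse_offload

void
scale_on_host (int *a, int n)
{
  #pragma omp target map(tofrom: a[0:n])
  {
    /* Runs on the GPU ...  */
    #pragma omp target device(ancestor: 1) map(tofrom: a[0:n])
    {
      /* ... while this block runs back on the host: GOMP_target_ext fills
         the rev_offload struct, the plugin sees the request and calls
         GOMP_PLUGIN_target_rev, i.e. gomp_target_rev.  */
      for (int i = 0; i < n; i++)
        a[i] *= 2;
    }
  }
}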

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS.
	(enum CUmemAttach_flags): New stub with only member
	CU_MEM_ATTACH_GLOBAL.
	(cuMemAllocManaged): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/target.c (GOMP_REV_OFFLOAD_VAR): #define as
	variable-name string and use it to define the variable.
	(GOMP_DEVICE_NUM_VAR): Declare this extern global var.
	(struct rev_offload): Define.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemAllocManaged): Add.
	* plugin/plugin-nvptx.c (GOMP_REV_OFFLOAD_VAR): #define var string.
	(struct rev_offload): New.
	(struct ptx_device): Add concurr_managed_access, rev_data
	and rev_data_dev.
	(nvptx_open_device): Set ptx_device's concurr_managed_access;
	'#if 0' unused async_engines.
	(GOMP_OFFLOAD_load_image): Allocate rev_data variable.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offloading.

 include/cuda/cuda.h               |   8 ++-
 libgomp/config/nvptx/icv-device.c |   2 +-
 libgomp/config/nvptx/target.c     |  52 ++++++++++++--
 libgomp/libgomp-plugin.c          |  12 ++++
 libgomp/libgomp-plugin.h          |   7 ++
 libgomp/libgomp.h                 |   5 ++
 libgomp/libgomp.map               |   5 ++
 libgomp/plugin/cuda-lib.def       |   1 +
 libgomp/plugin/plugin-nvptx.c     | 148 +++++++++++++++++++++++++++++++++++++-
 libgomp/target.c                  |  18 +++++
 10 files changed, 246 insertions(+), 12 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05d150..08e496a2e98 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,9 +77,14 @@  typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89
 } CUdevice_attribute;
 
+typedef enum {
+  CU_MEM_ATTACH_GLOBAL = 0x1
+} CUmemAttach_flags;
+
 enum {
   CU_EVENT_DEFAULT = 0,
   CU_EVENT_DISABLE_TIMING = 2
@@ -169,6 +174,7 @@  CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemAllocManaged (CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index faf90f9947c..f4f18cdac5e 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -60,7 +60,7 @@  omp_is_initial_device (void)
 
 /* This is set to the device number of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile int GOMP_DEVICE_NUM_VAR;
+volatile int GOMP_DEVICE_NUM_VAR;
 
 int
 omp_get_device_num (void)
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d20e15..06f6cd8b611 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -26,7 +26,29 @@ 
 #include "libgomp.h"
 #include <limits.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+
 extern int __gomp_team_num __attribute__((shared));
+extern volatile int GOMP_DEVICE_NUM_VAR;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +110,32 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+    ;  /* spin  */
+
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num, GOMP_DEVICE_NUM_VAR,
+		    __ATOMIC_SEQ_CST);
+
+  /* 'fn' must be last.  */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
+    ;  /* spin  */
+  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc623a10..316de749f69 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@  GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index ab3ed638475..40dfb52e44e 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@  extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index c243c4d6cf4..bbab5b4b0af 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1014,6 +1014,11 @@  extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10f3e1..12f76f7e48f 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@  GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39b1d2..61359c7e74e 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@  CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bc63e274cdf..7ab9421b060 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -54,6 +54,8 @@ 
 #include <assert.h>
 #include <errno.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
 /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
    block to cache between kernel invocations.  For soft-stacks blocks bigger
    than this, we will free the block before attempting another GPU memory
@@ -274,6 +276,17 @@  struct targ_fn_descriptor
   int max_threads_per_block;
 };
 
+/* Reverse offload. Must match version used in config/nvptx/target.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
@@ -302,6 +315,7 @@  struct ptx_device
   bool map;
   bool concur;
   bool mkern;
+  bool concurr_managed_access;
   int mode;
   int clock_khz;
   int num_sms;
@@ -329,6 +343,9 @@  struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
+  CUdeviceptr rev_data_dev;
+
   struct ptx_device *next;
 };
 
@@ -423,7 +440,7 @@  nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,11 +536,17 @@  nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
 
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
+  ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
@@ -1313,6 +1336,38 @@  GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
 				 * fn_entries);
 
+  if (rev_fn_table && dev->rev_data == NULL)
+    {
+      CUdeviceptr dp = 0;
+      if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
+			    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
+	  dp = (CUdeviceptr) dev->rev_data;
+	}
+      else
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
+			    sizeof (*dev->rev_data));
+	  memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
+			    sizeof (*dev->rev_data));
+	  dp = dev->rev_data_dev;
+	}
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+    }
+
   *target_table = targ_tbl;
 
   new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
@@ -1373,6 +1428,22 @@  GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
     targ_tbl->start = targ_tbl->end = 0;
   targ_tbl++;
 
+  if (rev_fn_table)
+    {
+      CUdeviceptr var;
+      size_t bytes;
+      r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
+			     "$offload_func_table");
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      assert (bytes == sizeof (uint64_t) * fn_entries);
+      *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
+      r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+    }
+
+
   nvptx_set_clocktick (module, dev);
 
   return fn_entries + var_entries + other_entries;
@@ -1982,6 +2053,23 @@  nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2016,6 +2104,10 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_off = ptx_dev->rev_data != NULL;
+  bool has_usm = (ptx_dev->concurr_managed_access
+		  && CUDA_CALL_EXISTS (cuMemAllocManaged));
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2029,12 +2121,62 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+	if (!has_usm)
+	  {
+	    CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
+			      ptx_dev->rev_data_dev,
+			      sizeof (*ptx_dev->rev_data), copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	  }
+	if (ptx_dev->rev_data->fn != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    uint64_t fn_ptr = rev_data->fn;
+	    uint64_t mapnum = rev_data->mapnum;
+	    uint64_t addr_ptr = rev_data->addrs;
+	    uint64_t sizes_ptr = rev_data->sizes;
+	    uint64_t kinds_ptr = rev_data->kinds;
+	    int dev_num = (int) rev_data->dev_num;
+	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    rev_data->fn = 0;
+	    if (!has_usm)
+	      {
+		/* fn is the first element. */
+		r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
+				       ptx_dev->rev_data_dev,
+				       ptx_dev->rev_data,
+				       sizeof (ptx_dev->rev_data->fn),
+				       copy_stream);
+		if (r != CUDA_SUCCESS)
+		  GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+		CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	      }
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 135db1d88ab..0c6fad690f1 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2856,6 +2856,24 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is not called for the host. */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void