[v5] libgomp/nvptx: Prepare for reverse-offload callback handling

Message ID 798d7ee1-2ffa-a591-38cb-a9ad421265d0@codesourcery.com
State New
Headers
Series [v5] libgomp/nvptx: Prepare for reverse-offload callback handling |

Commit Message

Tobias Burnus Oct. 7, 2022, 2:26 p.m. UTC
  Updated patch enclosed. Changes:

* Fixes the sm >= 700 issue, I noted before (cf. below)

* The < sm_70 code is still in, but disabled at user-compile time, with a warning, if libgomp.a wasn't compiled with sm_70 or higher. (mkoffload strips the nvptx offload code)

* Some minor cleanup

OK for mainline?

Tobias

On 02.10.22 20:13, Tobias Burnus wrote:
On 27.09.22 11:23, Tobias Burnus wrote:

We do support
  #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
and we also can configure GCC with
  --with-arch=sm_70 (or sm_80 or ...)
Thus, adding atomics with .sys scope is possible.

See attached patch. This seems to work fine and I hope I got the
assembly right in terms of atomic use. (And I do believe that the
.release/.acquire do not need an additional __sync_syncronize()/"membar.sys".)

Regarding this:

While 'atom.op' (op = and/or/xor/cas/exch/add/inc/dec/min/max)
with scope is a sm_60 feature, the used 'st/ld' with scope qualifier
and .relaxed, .release / .relaxed, .acquire require sm_70.

(Does not really matter as only ..., sm_53 and sm_70, ... is currently
supported but not sm_60, but the #if should be obviously fixed.)

 * * *

Looking at the generated code for without inline assembler, we have instead of
  st.global.release.sys.u64 [%r27],%r39;
and
  ld.acquire.sys.global.u64 %r62,[%r27];
for the older-systems (__PTX_SM < 700) the code:
  @ %r69 membar.sys;
  @ %r69 atom.exch.b64 _,[%r27],%r41;
and
  ld.global.u64 %r64,[__gomp_rev_offload_var];
  ld.u64 %r36,[%r64];
  membar.sys;

In my understanding, the membar.sys ensures - similar to
  st.release / ld.acquire
that the memory handling is done in the correct order in scope .sys.
As the 'fn' variable is initially 0 - and then only set via the device
i.e. there is eventually a DMA write device->host, which is atomically
as the will int64_t is written at once (and not first, e.g. the lower
and then the upper half). The 'st'/'atom.exch' should work fine, despite
having no .sys scope.

Likewise, the membar.sys applies also in the other direction. Or did I
miss something. If so, would an explicit __sync_synchronize() (= membar.sys)
help between the 'st' and the 'ld'?

Tobias


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  

Comments

Jakub Jelinek Oct. 11, 2022, 10:49 a.m. UTC | #1
On Fri, Oct 07, 2022 at 04:26:58PM +0200, Tobias Burnus wrote:
> libgomp/nvptx: Prepare for reverse-offload callback handling
> 
> This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
> later handle the reverse offload.
> For nvptx, it adds support for forwarding the offload gomp_target_ext call
> to the host by setting values in a struct on the device and querying it on
> the host - invoking gomp_target_rev on the result.
> 
> For host-device consistency guarantee reasons, reverse offload is currently
> limited -march=sm_70 (for libgomp).
> 
> gcc/ChangeLog:
> 
> 	* config/nvptx/mkoffload.cc (process): Warn if the linked-in libgomp.a
> 	has not been compiled with sm_70 or higher and disable code gen then.
> 
> include/ChangeLog:
> 
> 	* cuda/cuda.h (enum CUdevice_attribute): Add
> 	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
> 	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
> 	(cuMemHostAlloc): Add prototype.
> 
> libgomp/ChangeLog:
> 
> 	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
> 	'static' for this variable.
> 	* config/nvptx/libgomp-nvptx.h: New file.
> 	* config/nvptx/target.c: Include it.
> 	(GOMP_ADDITIONAL_ICVS): Declare extern var.
> 	(GOMP_REV_OFFLOAD_VAR): Declare var.
> 	(GOMP_target_ext): Handle reverse offload.
> 	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
> 	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
> 	* target.c (gomp_target_rev): ... this new stub function.
> 	* libgomp.h (gomp_target_rev): Declare.
> 	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
> 	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
> 	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
> 	(struct ptx_device): Add rev_data member. 
> 	(nvptx_open_device): #if 0 unused check; add
> 	unified address assert check.
> 	(GOMP_OFFLOAD_get_num_devices): Claim unified address
> 	support.
> 	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
> 	offload functions exist. Make offload var available
> 	on host and device.
> 	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
> 	(GOMP_OFFLOAD_run): Handle reverse offload.

So, does this mean one has to have gcc configured --with-arch=sm_70
or later to make reverse offloading work (and then on the other
side no support for older PTX arches at all)?
If yes, I was kind of hoping we could arrange for it to be more
user-friendly, build libgomp.a normally (sm_35 or what is the default),
build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70
and arrange for mkoffload to link in the sm_70 stuff only if the user
wants reverse offload (or has requires reverse_offload?).  In that case
ignore sm_60 and older devices, if reverse offload isn't wanted, don't link
in the part that needs sm_70 and make stuff working on sm_35 and later.
Or perhaps have 2 versions of target.o, one sm_35 and one sm_70 and let
mkoffload choose among them.

> +      /* The code for nvptx for GOMP_target_ext in libgomp/config/nvptx/target.c
> +	 for < sm_70 exists but is disabled here as it is unclear whether there
> +	 is the required consistency between host and device.
> +	 See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
> +	 for details.  */
> +      warning_at (input_location, 0,
> +		  "Disabling offload-code generation for this device type: "
> +		  "%<omp requires reverse_offload%> can only be fulfilled "
> +		  "for %<sm_70%> or higher");
> +      inform (UNKNOWN_LOCATION,
> +	      "Reverse offload requires that GCC is configured with "
> +	      "%<--with-arch=sm_70%> or higher and not overridden by a lower "
> +	      "value for %<-foffload-options=nvptx-none=-march=%>");

Diagnostics (sure, Fortran FE is an exception) shouldn't start with capital
letters).

> @@ -519,10 +523,20 @@ nvptx_open_device (int n)
>  		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
>    ptx_dev->max_threads_per_multiprocessor = pi;
>  
> +#if 0
> +  int async_engines;
>    r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
>  			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
>    if (r != CUDA_SUCCESS)
>      async_engines = 1;
> +#endif

Please avoid #if 0 code.

> +
> +  /* Required below for reverse offload as implemented, but with compute
> +     capability >= 2.0 and 64bit device processes, this should be universally be
> +     the case; hence, an assert.  */
> +  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
> +			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
> +  assert (r == CUDA_SUCCESS && pi);
>  
>    for (int i = 0; i != GOMP_DIM_MAX; i++)
>      ptx_dev->default_dims[i] = 0;
> @@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
>  {
>    int num_devices = nvptx_get_num_devices ();
>    /* Return -1 if no omp_requires_mask cannot be fulfilled but
> -     devices were present.  */
> -  if (num_devices > 0 && omp_requires_mask != 0)
> +     devices were present. Unified-shared address: see comment in

2 spaces after . rather than 1.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>      htab_free (refcount_set);
>  }
>  
> +/* Handle reverse offload. This is called by the device plugins for a
> +   reverse offload; it is not called if the outer target runs on the host.  */

Likewise.

	Jakub
  
Alexander Monakov Oct. 11, 2022, 11:12 a.m. UTC | #2
On Tue, 11 Oct 2022, Jakub Jelinek wrote:

> So, does this mean one has to have gcc configured --with-arch=sm_70
> or later to make reverse offloading work (and then on the other
> side no support for older PTX arches at all)?
> If yes, I was kind of hoping we could arrange for it to be more
> user-friendly, build libgomp.a normally (sm_35 or what is the default),
> build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70
> and arrange for mkoffload to link in the sm_70 stuff only if the user
> wants reverse offload (or has requires reverse_offload?).  In that case
> ignore sm_60 and older devices, if reverse offload isn't wanted, don't link
> in the part that needs sm_70 and make stuff working on sm_35 and later.
> Or perhaps have 2 versions of target.o, one sm_35 and one sm_70 and let
> mkoffload choose among them.

My understanding is such trickery should not be necessary with
the barrier-based approach, i.e. the sequence of PTX instructions

  st   % plain store
  membar.sys
  st.volatile

should be enough to guarantee that the former store is visible on the host
before the latter, and work all the way back to sm_20.

Alexander
  
Tobias Burnus Oct. 12, 2022, 8:55 a.m. UTC | #3
On 11.10.22 13:12, Alexander Monakov wrote:
> My understanding is such trickery should not be necessary with
> the barrier-based approach, i.e. the sequence of PTX instructions
>
>    st   % plain store
>    membar.sys
>    st.volatile
>
> should be enough to guarantee that the former store is visible on the host
> before the latter, and work all the way back to sm_20.

If I understand it correctly, you mean:

   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

   __sync_synchronize ();  /* membar.sys */
   asm volatile ("st.volatile.global.u64 [%0], %1;"
                 : : "r"(addr_struct_fn), "r" (fn) : "memory");


And then directly followed by the busy wait:

   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
     ;  /* spin  */

which GCC expands to:

   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
      ld.u64 %r36,[%r64];
      membar.sys;  */

The such updated patch is attached.

(This is the only change + removing the mkoffload.cc part is the only
larger change. Otherwise, it only handles the minor comments by Jakub.
The now removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef i.e.
it is a really old left over!)

Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
sm_35 and sm_70.

Tobias

*With some added code; until GOMP_OFFLOAD_get_num_devices accepts
GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
gets passed a non-NULL for rev_fn_table, the current patch is a no op.

Planned next is the related GCN patch – and the actual change
in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Tobias Burnus Oct. 17, 2022, 7:35 a.m. UTC | #4
On 12.10.22 10:55, Tobias Burnus wrote:
> On 11.10.22 13:12, Alexander Monakov wrote:
>> My understanding is such trickery should not be necessary with
>> the barrier-based approach, i.e. the sequence of PTX instructions
>>
>>    st   % plain store
>>    membar.sys
>>    st.volatile
>>
>> should be enough to guarantee that the former store is visible on the
>> host
>> before the latter, and work all the way back to sm_20.
>
> If I understand it correctly, you mean:
>
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
>                 : : "r"(addr_struct_fn), "r" (fn) : "memory");
>
>
> And then directly followed by the busy wait:
>
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE)
> != 0)
>     ;  /* spin  */
>
> which GCC expands to:
>
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>      ld.u64 %r36,[%r64];
>      membar.sys;  */
>
> The such updated patch is attached.
>
> (This is the only change + removing the mkoffload.cc part is the only
> larger change. Otherwise, it only handles the minor comments by Jakub.
> The now removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef i.e.
> it is a really old left over!)
>
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
>
> Tobias
>
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
>
> Planned next is the related GCN patch – and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Alexander Monakov Oct. 19, 2022, 3:53 p.m. UTC | #5
On Wed, 12 Oct 2022, Tobias Burnus wrote:

> On 11.10.22 13:12, Alexander Monakov wrote:
> > My understanding is such trickery should not be necessary with
> > the barrier-based approach, i.e. the sequence of PTX instructions
> >
> >    st   % plain store
> >    membar.sys
> >    st.volatile
> >
> > should be enough to guarantee that the former store is visible on the host
> > before the latter, and work all the way back to sm_20.
> 
> If I understand it correctly, you mean:
> 
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
> 
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
>                 : : "r"(addr_struct_fn), "r" (fn) : "memory");
> 
> 
> And then directly followed by the busy wait:
> 
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>     ;  /* spin  */
> 
> which GCC expands to:
> 
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>      ld.u64 %r36,[%r64];
>      membar.sys;  */
> 
> The such updated patch is attached.

I think the topic for which I was Cc'ed (memory space and access method for
the synchronization variable) has been resolved nicely. I am not satisfied
with some other points raised in the conversation, I hope they are noted.

Alexander

> (This is the only change + removing the mkoffload.cc part is the only
> larger change. Otherwise, it only handles the minor comments by Jakub.
> The now removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef i.e.
> it is a really old left over!)
> 
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
> 
> Tobias
> 
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
> 
> Planned next is the related GCN patch – and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
  
Jakub Jelinek Oct. 24, 2022, 2:07 p.m. UTC | #6
On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
> libgomp/nvptx: Prepare for reverse-offload callback handling
> 
> This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
> later handle the reverse offload.
> For nvptx, it adds support for forwarding the offload gomp_target_ext call
> to the host by setting values in a struct on the device and querying it on
> the host - invoking gomp_target_rev on the result.
> 
> include/ChangeLog:
> 
> 	* cuda/cuda.h (enum CUdevice_attribute): Add
> 	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
> 	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
> 	(cuMemHostAlloc): Add prototype.
> 
> libgomp/ChangeLog:
> 
> 	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
> 	'static' for this variable.
> 	* config/nvptx/libgomp-nvptx.h: New file.
> 	* config/nvptx/target.c: Include it.
> 	(GOMP_ADDITIONAL_ICVS): Declare extern var.
> 	(GOMP_REV_OFFLOAD_VAR): Declare var.
> 	(GOMP_target_ext): Handle reverse offload.
> 	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
> 	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
> 	* target.c (gomp_target_rev): ... this new stub function.
> 	* libgomp.h (gomp_target_rev): Declare.
> 	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
> 	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
> 	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
> 	(struct ptx_device): Add rev_data member. 
> 	(nvptx_open_device): Remove async_engines query, last used in
> 	r10-304-g1f4c5b9b; add unified-address assert check.
> 	(GOMP_OFFLOAD_get_num_devices): Claim unified address
> 	support.
> 	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
> 	offload functions exist. Make offload var available
> 	on host and device.
> 	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
> 	(GOMP_OFFLOAD_run): Handle reverse offload.

Ok, thanks.

	Jakub
  
Thomas Schwinge Oct. 24, 2022, 7:05 p.m. UTC | #7
Hi Tobias!

On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>> libgomp/nvptx: Prepare for reverse-offload callback handling

> Ok, thanks.

Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
I'm seeing a lot of libgomp execution test regressions.  Random
example, 'libgomp.c-c++-common/error-1.c':

    [...]
      GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]

    Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
    0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
    2127            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
    (gdb) print ptx_dev
    $1 = (struct ptx_device *) 0x6a55a0
    (gdb) print ptx_dev->rev_data
    $2 = (struct rev_offload *) 0xffffffff00000000
    (gdb) print ptx_dev->rev_data->fn
    Cannot access memory at address 0xffffffff00000000

Why is it even taking this 'if (reverse_offload)' code path, which isn't
applicable to this test case (as far as I understand)?  (Well, the answer
is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Thomas Schwinge Oct. 24, 2022, 7:11 p.m. UTC | #8
Hi Tobias!

On 2022-10-24T21:05:46+0200, I wrote:
> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>> libgomp/nvptx: Prepare for reverse-offload callback handling
>
>> Ok, thanks.
>
> Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
> "libgomp/nvptx: Prepare for reverse-offload callback handling",
> I'm seeing a lot of libgomp execution test regressions.  Random
> example, 'libgomp.c-c++-common/error-1.c':
>
>     [...]
>       GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]
>
>     Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
>     0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
>     2127            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>     (gdb) print ptx_dev
>     $1 = (struct ptx_device *) 0x6a55a0
>     (gdb) print ptx_dev->rev_data
>     $2 = (struct rev_offload *) 0xffffffff00000000
>     (gdb) print ptx_dev->rev_data->fn
>     Cannot access memory at address 0xffffffff00000000
>
> Why is it even taking this 'if (reverse_offload)' code path, which isn't
> applicable to this test case (as far as I understand)?  (Well, the answer
> is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)

Well.

    --- a/libgomp/plugin/plugin-nvptx.c
    +++ b/libgomp/plugin/plugin-nvptx.c

    @@ -329,6 +332,7 @@ struct ptx_device
           pthread_mutex_t lock;
         } omp_stacks;

    +  struct rev_offload *rev_data;
       struct ptx_device *next;
     };

... but as far as I can tell, this is never initialized in
'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
Would the following be the correct fix (currently testing)?

    --- libgomp/plugin/plugin-nvptx.c
    +++ libgomp/plugin/plugin-nvptx.c
    @@ -546,6 +546,8 @@ nvptx_open_device (int n)
       ptx_dev->omp_stacks.size = 0;
       pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);

    +  ptx_dev->rev_data = NULL;
    +
       return ptx_dev;
     }



Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  
Tobias Burnus Oct. 24, 2022, 7:46 p.m. UTC | #9
Hi Tobias!

On 24.10.22 21:11, Thomas Schwinge wrote:
> On 2022-10-24T21:05:46+0200, I wrote:
>> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>>> libgomp/nvptx: Prepare for reverse-offload callback handling
> Well.
>      +  struct rev_offload *rev_data;
> ... but as far as I can tell, this is never initialized in
> 'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
> Would the following be the correct fix (currently testing)?
>
>      --- libgomp/plugin/plugin-nvptx.c
>      +++ libgomp/plugin/plugin-nvptx.c
>      @@ -546,6 +546,8 @@ nvptx_open_device (int n)
>         ptx_dev->omp_stacks.size = 0;
>         pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
>
>      +  ptx_dev->rev_data = NULL;
>      +
>         return ptx_dev;
>       }

LGTM and I think it is obvious – albeit I am not sure why it did not
fail when testing it here.

Thanks,

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
  

Patch

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

For host-device consistency guarantee reasons, reverse offload is currently
limited -march=sm_70 (for libgomp).

gcc/ChangeLog:

	* config/nvptx/mkoffload.cc (process): Warn if the linked-in libgomp.a
	has not been compiled with sm_70 or higher and disable code gen then.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
	(cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/libgomp-nvptx.h: New file.
	* config/nvptx/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare extern var.
	(GOMP_REV_OFFLOAD_VAR): Declare var.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
	(struct ptx_device): Add rev_data member. 
	(nvptx_open_device): #if 0 unused check; add
	unified address assert check.
	(GOMP_OFFLOAD_get_num_devices): Claim unified address
	support.
	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
	offload functions exist. Make offload var available
	on host and device.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offload.

 gcc/config/nvptx/mkoffload.cc        |  60 +++++++++++++++-----
 include/cuda/cuda.h                  |   3 +
 libgomp/config/nvptx/icv-device.c    |   2 +-
 libgomp/config/nvptx/libgomp-nvptx.h |  51 +++++++++++++++++
 libgomp/config/nvptx/target.c        |  61 +++++++++++++++++---
 libgomp/libgomp-plugin.c             |  12 ++++
 libgomp/libgomp-plugin.h             |   7 +++
 libgomp/libgomp.h                    |   5 ++
 libgomp/libgomp.map                  |   5 ++
 libgomp/plugin/cuda-lib.def          |   1 +
 libgomp/plugin/plugin-nvptx.c        | 107 +++++++++++++++++++++++++++++++++--
 libgomp/target.c                     |  19 +++++++
 12 files changed, 304 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/mkoffload.cc b/gcc/config/nvptx/mkoffload.cc
index 854cd72..aa2e042 100644
--- a/gcc/config/nvptx/mkoffload.cc
+++ b/gcc/config/nvptx/mkoffload.cc
@@ -258,6 +258,7 @@  process (FILE *in, FILE *out, uint32_t omp_requires)
   unsigned ix;
   const char *sm_ver = NULL, *version = NULL;
   const char *sm_ver2 = NULL, *version2 = NULL;
+  const char *sm_libgomp = NULL;
   size_t file_cnt = 0;
   size_t *file_idx = XALLOCAVEC (size_t, len);
 
@@ -268,6 +269,7 @@  process (FILE *in, FILE *out, uint32_t omp_requires)
   for (size_t i = 0; i != len;)
     {
       char c;
+      bool is_libgomp = false;
       bool output_fn_ptr = false;
       file_idx[file_cnt++] = i;
 
@@ -291,6 +293,13 @@  process (FILE *in, FILE *out, uint32_t omp_requires)
 		  version = input + i + strlen (".version ");
 		  continue;
 		}
+	      if (UNLIKELY (startswith (input + i,
+					"// BEGIN GLOBAL FUNCTION "
+					"DEF: GOMP_target_ext")))
+		{
+		  is_libgomp = true;
+		  continue;
+		}
 	      while (startswith (input + i, "//:"))
 		{
 		  i += 3;
@@ -319,28 +328,49 @@  process (FILE *in, FILE *out, uint32_t omp_requires)
 	  putc (c, out);
 	}
       fprintf (out, "\";\n\n");
+      if (is_libgomp)
+	sm_libgomp = sm_ver;
       if (output_fn_ptr
 	  && (omp_requires & GOMP_REQUIRES_REVERSE_OFFLOAD) != 0)
 	{
-	  if (sm_ver && sm_ver[0] == '3' && sm_ver[1] == '0'
-	      && sm_ver[2] == '\n')
-	    {
-	      warning_at (input_location, 0,
-			  "%<omp requires reverse_offload%> requires at "
-			  "least %<sm_35%> for "
-			  "%<-foffload-options=nvptx-none=-march=%> - disabling"
-			  " offload-code generation for this device type");
-	      /* As now an empty file is compiled and there is no call to
-		 GOMP_offload_register_ver, this device type is effectively
-		 disabled.  */
-	      fflush (out);
-	      ftruncate (fileno (out), 0);
-	      return;
-	    }
 	  sm_ver2 = sm_ver;
 	  version2 = version;
 	}
     }
+  if (sm_ver2 && sm_libgomp
+      && sm_libgomp[0] < '7' && sm_libgomp[1] && sm_libgomp[2] == '\n')
+    {
+      /* The code for nvptx for GOMP_target_ext in libgomp/config/nvptx/target.c
+	 for < sm_70 exists but is disabled here as it is unclear whether there
+	 is the required consistency between host and device.
+	 See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
+	 for details.  */
+      warning_at (input_location, 0,
+		  "Disabling offload-code generation for this device type: "
+		  "%<omp requires reverse_offload%> can only be fulfilled "
+		  "for %<sm_70%> or higher");
+      inform (UNKNOWN_LOCATION,
+	      "Reverse offload requires that GCC is configured with "
+	      "%<--with-arch=sm_70%> or higher and not overridden by a lower "
+	      "value for %<-foffload-options=nvptx-none=-march=%>");
+      /* As now an empty file is compiled and there is no call to
+	 GOMP_offload_register_ver, this device type is effectively disabled.  */
+      fflush (out);
+      ftruncate (fileno (out), 0);
+      return;
+    }
+  if (sm_ver2 && sm_ver2[0] == '3' && sm_ver2[1] == '0' && sm_ver[2] == '\n')
+    {
+      warning_at (input_location, 0,
+		  "%<omp requires reverse_offload%> requires at least %<sm_35%> "
+		  "for %<-foffload-options=nvptx-none=-march=%> - disabling "
+		  "offload-code generation for this device type");
+      /* As now an empty file is compiled and there is no call to
+	 GOMP_offload_register_ver, this device type is effectively disabled.  */
+      fflush (out);
+      ftruncate (fileno (out), 0);
+      return;
+    }
 
   /* Create function-pointer array, required for reverse
      offload function-pointer lookup.  */
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..e081f04 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,6 +77,7 @@  typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
@@ -113,6 +114,7 @@  enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
 
 enum {
   CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@  CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@ 
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
new file mode 100644
index 0000000..5da9aae
--- /dev/null
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -0,0 +1,51 @@ 
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needef for this target.  */
+
+#ifndef LIBGOMP_NVPTX_H
+#define LIBGOMP_NVPTX_H 1
+
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_NVPTX_H */
+
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..6470ae8 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -24,9 +24,12 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-nvptx.h"  /* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
 #include <limits.h>
 
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +91,60 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
+  static int lock = 0;  /* == gomp_mutex_t lock; gomp_mutex_init (&lock); */
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  gomp_mutex_lock (&lock);
+
+  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* 'fn' must be last.  */
+#if __PTX_SM__ >= 700
+  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+  asm volatile ("st.global.release.sys.u64 [%0], %1;"
+		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+#else
+/* The following has been effectively disabled via mkoffload as it is unclear
+   whether there is the required consistency between host and device.
+   See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
+   Note: Using atomic with scope = .sys is already supported since >= 600.
+   The generated code is:
+     @ %r69 membar.sys;
+     @ %r69 atom.exch.b64 _,[%r27],%r41; */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_RELEASE);
+#endif
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+#if __PTX_SM__ >= 700
+  uint64_t fn2;
+  do
+    {
+      asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
+		    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+    }
+  while (fn2 != 0);
+#else
+/* See remark above. The generated memory-access code is
+     ld.global.u64 %r64,[__gomp_rev_offload_var];
+     ld.u64 %r36,[%r64];
+     membar.sys;  */
+  __sync_synchronize ();  /* membar.sys */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+    ;  /* spin  */
+#endif
+
+  gomp_mutex_unlock (&lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@  GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@  extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@  extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@  GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..dff42d6 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@  CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemHostAlloc)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..de24398 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -40,6 +40,9 @@ 
 #include "gomp-constants.h"
 #include "oacc-int.h"
 
+/* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
+#include "config/nvptx/libgomp-nvptx.h"
+
 #include <pthread.h>
 #ifndef PLUGIN_NVPTX_INCLUDE_SYSTEM_CUDA_H
 # include "cuda/cuda.h"
@@ -329,6 +332,7 @@  struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
   struct ptx_device *next;
 };
 
@@ -423,7 +427,7 @@  nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,10 +523,20 @@  nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
+
+  /* Required below for reverse offload as implemented, but with compute
+     capability >= 2.0 and 64bit device processes, this should be universally be
+     the case; hence, an assert.  */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
+  assert (r == CUDA_SUCCESS && pi);
 
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
@@ -1179,8 +1193,10 @@  GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
-     devices were present.  */
-  if (num_devices > 0 && omp_requires_mask != 0)
+     devices were present. Unified-shared address: see comment in
+     nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
+  if (num_devices > 0
+      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
     return -1;
   return num_devices;
 }
@@ -1380,7 +1396,7 @@  GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1406,37 @@  GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if ((*rev_fn_table)[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && *rev_fn_table && dev->rev_data == NULL)
+    {
+      /* cuMemHostAlloc memory is accessible on the device, if unified-shared
+	 address is supported; this is assumed - see comment in
+	 nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.   */
+      CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &dev->rev_data,
+			sizeof (*dev->rev_data), CU_MEMHOSTALLOC_DEVICEMAP);
+      CUdeviceptr dp = (CUdeviceptr) dev->rev_data;
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error - GOMP_REV_OFFLOAD_VAR: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2048,23 @@  nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2099,8 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_offload = ptx_dev->rev_data != NULL;
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2114,41 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_offload)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_offload)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+
+	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
+				    rev_data->addrs, rev_data->sizes,
+				    rev_data->kinds, rev_data->dev_num,
+				    rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_offload)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..71bcb05 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void