[2/4] libgomp: let plugins handle allocating the target variable table

Message ID 20260505134929.3522938-3-aarsenovic@baylibre.com
State New
Headers
Series GCN: Target offload overhead improvements, batch 2 |

Commit Message

Arsen Arsenović May 5, 2026, 1:14 p.m. UTC
  In my examination of BabelStream results on AMD GCN, I've found that,
for each BabelStream kernel execution, we spend significant time in
allocating and initializing memory in gomp_map_vars (~55µs, whereas the
actual BabelStream code executes in ~746µs, meaning we increase the time
BabelStream measures by 7% just on that).

Upon further examination, I've found that the only reason gomp_map_vars
decides to allocate and map any memory in the first place is because it
is constructing the table of pointers to variables on the target, which
I've taken to calling the "target variable table".  Given that the GCN
plugin already must perform some memory allocation before starting up a
kernel, namely to allocate kernel arguments, it would be beneficial if
we could merge this allocation with the kernel arguments allocation.

In addition, since the kernel arguments live in host memory, populating
them can be performed using string functions, without any need to call
for expensive host2dev copies.

This patch introduces an opaque type for "offload sessions".  This type
is defined by each plugin and allows it to store data related to a
single offload job.  The sessions are allocated and managed by libgomp,
and initialized and utilized by the plugin.  Their lifetime starts with
a call to GOMP_OFFLOAD_session_start, and ends with
GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.

The patch then uses this framework to make management of the target
variable table more flexible: the plugin may elect to implement
GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
to attempt to allocate the target variable table in host memory.

If it fails, or if the plugin does not provide this function, libgomp
will perform this allocation as it does today - in target memory - and
tell the session about it using
GOMP_OFFLOAD_session_set_target_var_table.

In the case of AMD GCN, upon a call to
GOMP_OFFLOAD_session_allocate_target_var_table, the plugin will
immediately allocate kernel arguments with enough space for the target
variable table, no matter what size the plugin asks for[1], and return
that pointer to libgomp.

This results in the runtime of gomp_map_vars effectively disappearing
from traces.

[1] It may be beneficial to limit this, to some fixed amount, to make it
    so that the future allocation cache has a higher cache hit rate.  It
    may also depend on whether hsa_memory_allocate for kernel arguments
    takes runtime proportional to the number of bytes it needs to
    allocate.

include/ChangeLog:

	* gomp-constants.h (GOMP_VERSION): Bump.  Signature of
	GOMP_OFFLOAD_run et al changed.

libgomp/ChangeLog:

	* libgomp-plugin.h (GOMP_OFFLOAD_run, GOMP_OFFLOAD_exec)
	(GOMP_OFFLOAD_async_run, GOMP_OFFLOAD_openacc_async_exec): Pass
	session in place of target variable table and devices.
	(struct gomp_offload_session): New.
	(GOMP_OFFLOAD_session_size): New
	(GOMP_OFFLOAD_check_session_struct): New.
	(GOMP_OFFLOAD_session_boilerplate): New.
	(GOMP_OFFLOAD_session_start): New.
	(GOMP_OFFLOAD_session_allocate_target_var_table): New.
	(GOMP_OFFLOAD_session_set_target_var_table): New.
	* libgomp.h (struct gomp_target_task): Add offload_session
	field.
	(struct gomp_device_descr): Add offload session management
	functions.
	(gomp_offload_session_new): New.
	(goacc_map_vars): Add SESSION to signature
	* oacc-host.c (struct gomp_offload_session): Define, for host
	offload fallback case.
	(host_session_size): New.  Implements GOMP_OFFLOAD_session_size.
	(host_session_start): New.  Implements
	GOMP_OFFLOAD_session_start.
	(host_session_set_target_var_table): New.  Implements
	GOMP_OFFLOAD_session_set_target_var_table.
	(host_run): Adjust to match GOMP_OFFLOAD_run.
	(host_openacc_exec): Adjust to match GOMP_OFFLOAD_openacc_exec.
	(host_openacc_async_exec): Adjust to match
	GOMP_OFFLOAD_openacc_async_exec.
	* oacc-mem.c (acc_map_data): Adjust call to goacc_map_vars.
	(goacc_enter_datum): Ditto.
	(goacc_enter_data_internal): Ditto.
	* oacc-parallel.c (GOACC_parallel_keyed): Allocate and pass
	offload session.
	(GOACC_data_start): Adjust call to goacc_map_vars.
	* plugin/plugin-gcn.c (struct kernel_dispatch): Remove
	kernarg_cache_node.
	(struct kernargs): Add a flexible array member for the target
	variable table.
	(struct kernel_launch): Store an offload session rather than
	target var. table pointer.
	(print_kernel_dispatch): Receive kernargs as parameter.
	(struct gomp_offload_session): Define.
	(init_session): New.
	(GOMP_OFFLOAD_session_start): Implement, using init_session.
	(release_session): New.
	(alloc_kernargs_on_agent): Rename to...
	(allocate_session_kernargs): ... this, store result in
	passed-in SESSION, and allocate extra room for target variable
	table (rounding it up to nearest multiple of 64 pointers).
	(GOMP_OFFLOAD_session_allocate_target_var_table): Implement
	using the previous function.
	(GOMP_OFFLOAD_session_set_target_var_table): Ditto.
	(create_kernel_dispatch): Remove kernarg allocation, instead
	receiving it as an argument.
	(release_kernel_dispatch): Receive kernargs as an argument,
	don't release them.
	(run_kernel): Adjust to use sessions.
	(destroy_module): Ditto.
	(GOMP_OFFLOAD_load_image): Ditto.
	(execute_queue_entry): Adjust to match changed struct
	kernel_launch.
	(queue_push_launch): Ditto.
	(gcn_exec): Receive and pass along session.
	(GOMP_OFFLOAD_run): Ditto.
	(GOMP_OFFLOAD_async_run): Ditto.
	(GOMP_OFFLOAD_openacc_exec): Ditto.
	(GOMP_OFFLOAD_openacc_async_exec): Ditto.
	* plugin/plugin-nvptx.c (struct gomp_offload_session): Define.
	(GOMP_OFFLOAD_session_start): Implement.
	(GOMP_OFFLOAD_session_set_target_var_table): Implement.
	(GOMP_OFFLOAD_openacc_exec): Adjust to receive session.
	(GOMP_OFFLOAD_openacc_async_exec): Ditto.
	(GOMP_OFFLOAD_run): Ditto.
	* target.c (gomp_get_tvt_size): Extract helper from...
	(gomp_map_vars_internal): ... here.  Receive SESSION, iff doing
	target offload.  Use a target variable table on the host
	allocated by GOMP_OFFLOAD_session_allocate_target_var_table if
	possible, or call GOMP_OFFLOAD_session_set_target_var_table with
	an allocated device pointer otherwise.
	(gomp_map_vars): Update to pass along session.
	(goacc_map_vars): Ditto.
	(GOMP_target): Allocate and pass along session.
	(GOMP_target_ext): Ditto.
	(gomp_target_data_fallback): Adjust call to gomp_map_vars.
	(GOMP_target_data): Ditto.
	(GOMP_target_data_ext): Ditto.
	(GOMP_target_enter_exit_data): Ditto.
	(gomp_target_task_fn): Start and pass along session, the storage
	for which is allocated by gomp_create_target_task.
	(DLSYM2): Rename from DLSYM, adding a new parameter for the
	variable to populate, akin to DLSYM_OPT.
	(DLSYM): Delegate to DLSYM2.
	(gomp_load_plugin_for_device): Populate session-related fields.
	* task.c (gomp_create_target_task): Allocate enough storage for
	an offload session.
	* testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c: New test.
	* testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c: New test.
---
 include/gomp-constants.h                      |   2 +-
 libgomp/libgomp-plugin.h                      |  81 +++++-
 libgomp/libgomp.h                             |  27 +-
 libgomp/oacc-host.c                           |  63 ++++-
 libgomp/oacc-mem.c                            |   8 +-
 libgomp/oacc-parallel.c                       |  24 +-
 libgomp/plugin/plugin-gcn.c                   | 254 ++++++++++++------
 libgomp/plugin/plugin-nvptx.c                 |  45 +++-
 libgomp/target.c                              | 191 ++++++++-----
 libgomp/task.c                                |  33 ++-
 .../gcn-kernel-launch-no-tvt-alloc.c          |  51 ++++
 .../gcn-kernel-launch-tvt-alloc.c             |  16 ++
 12 files changed, 604 insertions(+), 191 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
  

Comments

Andrew Stubbs May 12, 2026, 10:55 a.m. UTC | #1
On 05/05/2026 14:14, Arsen Arsenović wrote:
> In my examination of BabelStream results on AMD GCN, I've found that,
> for each BabelStream kernel execution, we spend significant time in
> allocating and initializing memory in gomp_map_vars (~55µs, whereas the
> actual BabelStream code executes in ~746µs, meaning we increase the time
> BabelStream measures by 7% just on that).
> 
> Upon further examination, I've found that the only reason gomp_map_vars
> decides to allocate and map any memory in the first place is because it
> is constructing the table of pointers to variables on the target, which
> I've taken to calling the "target variable table".  Given that the GCN
> plugin already must perform some memory allocation before starting up a
> kernel, namely to allocate kernel arguments, it would be beneficial if
> we could merge this allocation with the kernel arguments allocation.
> 
> In addition, since the kernel arguments live in host memory, populating
> them can be performed using string functions, without any need to call
> for expensive host2dev copies.
> 
> This patch introduces an opaque type for "offload sessions".  This type
> is defined by each plugin and allows it to store data related to a
> single offload job.  The sessions are allocated and managed by libgomp,
> and initialized and utilized by the plugin.  Their lifetime starts with
> a call to GOMP_OFFLOAD_session_start, and ends with
> GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.
> 
> The patch then uses this framework to make management of the target
> variable table more flexible: the plugin may elect to implement
> GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
> to attempt to allocate the target variable table in host memory.

Even though this patch is part of a "GCN" patch series, I think this 
patch needs review by a libgomp maintainer (CC'd).

However, see my comment below....

> If it fails, or if the plugin does not provide this function, libgomp
> will perform this allocation as it does today - in target memory - and
> tell the session about it using
> GOMP_OFFLOAD_session_set_target_var_table.
> 
> In the case of AMD GCN, upon a call to
> GOMP_OFFLOAD_session_allocate_target_var_table, the plugin will
> immediately allocate kernel arguments with enough space for the target
> variable table, no matter what size the plugin asks for[1], and return
> that pointer to libgomp.
> 
> This results in the runtime of gomp_map_vars effectively disappearing
> from traces.
> 
> [1] It may be beneficial to limit this, to some fixed amount, to make it
>      so that the future allocation cache has a higher cache hit rate.  It
>      may also depend on whether hsa_memory_allocate for kernel arguments
>      takes runtime proportional to the number of bytes it needs to
>      allocate.
> 
> include/ChangeLog:

> +/* Get new kernargs for SESSION such that it can store TABLE_SIZE char units of
> +   target variable table, reusing cached kernargs allocations, if possible.  */
> +
> +static inline struct kernargs *
> +allocate_session_kernargs (struct gomp_offload_session *session,
> +			   size_t table_size)
> +{
> +  GCN_DEBUG ("Session %p asked for allocation of kernargs+%zu...\n", session, table_size);
> +  struct agent_info *agent = session->agent;
> +  assert (!session->kernarg_cache_node);
> +
> +  /* To increase chance of cache hit, round up size of the target variable
> +     table to a multiple of (64*sizeof(void*)), and ensure that this size is
> +     nonzero.  */
> +  if (!table_size)
> +    table_size++;
> +
> +  {
> +    constexpr size_t rounding_factor = 64 * sizeof (void*);
> +    table_size += rounding_factor - 1;
> +    table_size = (table_size / rounding_factor) * table_size;
> +  }

This looks wrong. You probably don't mean to multiply by table_size there.

Andrew
  
Arsen Arsenović May 12, 2026, 10:58 a.m. UTC | #2
Andrew Stubbs <ams@baylibre.com> writes:

>> +  {
>> +    constexpr size_t rounding_factor = 64 * sizeof (void*);
>> +    table_size += rounding_factor - 1;
>> +    table_size = (table_size / rounding_factor) * table_size;
>> +  }
>
> This looks wrong. You probably don't mean to multiply by table_size there.

Yes, you're right.  Fixed locally.  Nice find!

Thanks.
  
Arsen Arsenović May 20, 2026, 10:43 a.m. UTC | #3
Arsen Arsenović <aarsenovic@baylibre.com> writes:

> In my examination of BabelStream results on AMD GCN, I've found that,
> for each BabelStream kernel execution, we spend significant time in
> allocating and initializing memory in gomp_map_vars (~55µs, whereas the
> actual BabelStream code executes in ~746µs, meaning we increase the time
> BabelStream measures by 7% just on that).
>
> Upon further examination, I've found that the only reason gomp_map_vars
> decides to allocate and map any memory in the first place is because it
> is constructing the table of pointers to variables on the target, which
> I've taken to calling the "target variable table".  Given that the GCN
> plugin already must perform some memory allocation before starting up a
> kernel, namely to allocate kernel arguments, it would be beneficial if
> we could merge this allocation with the kernel arguments allocation.
>
> In addition, since the kernel arguments live in host memory, populating
> them can be performed using string functions, without any need to call
> for expensive host2dev copies.
>
> This patch introduces an opaque type for "offload sessions".  This type
> is defined by each plugin and allows it to store data related to a
> single offload job.  The sessions are allocated and managed by libgomp,
> and initialized and utilized by the plugin.  Their lifetime starts with
> a call to GOMP_OFFLOAD_session_start, and ends with
> GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.
>
> The patch then uses this framework to make management of the target
> variable table more flexible: the plugin may elect to implement
> GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
> to attempt to allocate the target variable table in host memory.
>
> If it fails, or if the plugin does not provide this function, libgomp
> will perform this allocation as it does today - in target memory - and
> tell the session about it using
> GOMP_OFFLOAD_session_set_target_var_table.
>
> In the case of AMD GCN, upon a call to
> GOMP_OFFLOAD_session_allocate_target_var_table, the plugin will
> immediately allocate kernel arguments with enough space for the target
> variable table, no matter what size the plugin asks for[1], and return
> that pointer to libgomp.
>
> This results in the runtime of gomp_map_vars effectively disappearing
> from traces.
>
> [1] It may be beneficial to limit this, to some fixed amount, to make it
>     so that the future allocation cache has a higher cache hit rate.  It
>     may also depend on whether hsa_memory_allocate for kernel arguments
>     takes runtime proportional to the number of bytes it needs to
>     allocate.
>
> include/ChangeLog:
>
> 	* gomp-constants.h (GOMP_VERSION): Bump.  Signature of
> 	GOMP_OFFLOAD_run et al changed.
>
> libgomp/ChangeLog:
>
> 	* libgomp-plugin.h (GOMP_OFFLOAD_run, GOMP_OFFLOAD_exec)
> 	(GOMP_OFFLOAD_async_run, GOMP_OFFLOAD_openacc_async_exec): Pass
> 	session in place of target variable table and devices.
> 	(struct gomp_offload_session): New.
> 	(GOMP_OFFLOAD_session_size): New
> 	(GOMP_OFFLOAD_check_session_struct): New.
> 	(GOMP_OFFLOAD_session_boilerplate): New.
> 	(GOMP_OFFLOAD_session_start): New.
> 	(GOMP_OFFLOAD_session_allocate_target_var_table): New.
> 	(GOMP_OFFLOAD_session_set_target_var_table): New.
> 	* libgomp.h (struct gomp_target_task): Add offload_session
> 	field.
> 	(struct gomp_device_descr): Add offload session management
> 	functions.
> 	(gomp_offload_session_new): New.
> 	(goacc_map_vars): Add SESSION to signature
> 	* oacc-host.c (struct gomp_offload_session): Define, for host
> 	offload fallback case.
> 	(host_session_size): New.  Implements GOMP_OFFLOAD_session_size.
> 	(host_session_start): New.  Implements
> 	GOMP_OFFLOAD_session_start.
> 	(host_session_set_target_var_table): New.  Implements
> 	GOMP_OFFLOAD_session_set_target_var_table.
> 	(host_run): Adjust to match GOMP_OFFLOAD_run.
> 	(host_openacc_exec): Adjust to match GOMP_OFFLOAD_openacc_exec.
> 	(host_openacc_async_exec): Adjust to match
> 	GOMP_OFFLOAD_openacc_async_exec.
> 	* oacc-mem.c (acc_map_data): Adjust call to goacc_map_vars.
> 	(goacc_enter_datum): Ditto.
> 	(goacc_enter_data_internal): Ditto.
> 	* oacc-parallel.c (GOACC_parallel_keyed): Allocate and pass
> 	offload session.
> 	(GOACC_data_start): Adjust call to goacc_map_vars.
> 	* plugin/plugin-gcn.c (struct kernel_dispatch): Remove
> 	kernarg_cache_node.
> 	(struct kernargs): Add a flexible array member for the target
> 	variable table.
> 	(struct kernel_launch): Store an offload session rather than
> 	target var. table pointer.
> 	(print_kernel_dispatch): Receive kernargs as parameter.
> 	(struct gomp_offload_session): Define.
> 	(init_session): New.
> 	(GOMP_OFFLOAD_session_start): Implement, using init_session.
> 	(release_session): New.
> 	(alloc_kernargs_on_agent): Rename to...
> 	(allocate_session_kernargs): ... this, store result in
> 	passed-in SESSION, and allocate extra room for target variable
> 	table (rounding it up to nearest multiple of 64 pointers).
> 	(GOMP_OFFLOAD_session_allocate_target_var_table): Implement
> 	using the previous function.
> 	(GOMP_OFFLOAD_session_set_target_var_table): Ditto.
> 	(create_kernel_dispatch): Remove kernarg allocation, instead
> 	receiving it as an argument.
> 	(release_kernel_dispatch): Receive kernargs as an argument,
> 	don't release them.
> 	(run_kernel): Adjust to use sessions.
> 	(destroy_module): Ditto.
> 	(GOMP_OFFLOAD_load_image): Ditto.
> 	(execute_queue_entry): Adjust to match changed struct
> 	kernel_launch.
> 	(queue_push_launch): Ditto.
> 	(gcn_exec): Receive and pass along session.
> 	(GOMP_OFFLOAD_run): Ditto.
> 	(GOMP_OFFLOAD_async_run): Ditto.
> 	(GOMP_OFFLOAD_openacc_exec): Ditto.
> 	(GOMP_OFFLOAD_openacc_async_exec): Ditto.
> 	* plugin/plugin-nvptx.c (struct gomp_offload_session): Define.
> 	(GOMP_OFFLOAD_session_start): Implement.
> 	(GOMP_OFFLOAD_session_set_target_var_table): Implement.
> 	(GOMP_OFFLOAD_openacc_exec): Adjust to receive session.
> 	(GOMP_OFFLOAD_openacc_async_exec): Ditto.
> 	(GOMP_OFFLOAD_run): Ditto.
> 	* target.c (gomp_get_tvt_size): Extract helper from...
> 	(gomp_map_vars_internal): ... here.  Receive SESSION, iff doing
> 	target offload.  Use a target variable table on the host
> 	allocated by GOMP_OFFLOAD_session_allocate_target_var_table if
> 	possible, or call GOMP_OFFLOAD_session_set_target_var_table with
> 	an allocated device pointer otherwise.
> 	(gomp_map_vars): Update to pass along session.
> 	(goacc_map_vars): Ditto.
> 	(GOMP_target): Allocate and pass along session.
> 	(GOMP_target_ext): Ditto.
> 	(gomp_target_data_fallback): Adjust call to gomp_map_vars.
> 	(GOMP_target_data): Ditto.
> 	(GOMP_target_data_ext): Ditto.
> 	(GOMP_target_enter_exit_data): Ditto.
> 	(gomp_target_task_fn): Start and pass along session, the storage
> 	for which is allocated by gomp_create_target_task.
> 	(DLSYM2): Rename from DLSYM, adding a new parameter for the
> 	variable to populate, akin to DLSYM_OPT.
> 	(DLSYM): Delegate to DLSYM2.
> 	(gomp_load_plugin_for_device): Populate session-related fields.
> 	* task.c (gomp_create_target_task): Allocate enough storage for
> 	an offload session.
> 	* testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c: New test.
> 	* testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c: New test.
> ---
>  include/gomp-constants.h                      |   2 +-
>  libgomp/libgomp-plugin.h                      |  81 +++++-
>  libgomp/libgomp.h                             |  27 +-
>  libgomp/oacc-host.c                           |  63 ++++-
>  libgomp/oacc-mem.c                            |   8 +-
>  libgomp/oacc-parallel.c                       |  24 +-
>  libgomp/plugin/plugin-gcn.c                   | 254 ++++++++++++------
>  libgomp/plugin/plugin-nvptx.c                 |  45 +++-
>  libgomp/target.c                              | 191 ++++++++-----
>  libgomp/task.c                                |  33 ++-
>  .../gcn-kernel-launch-no-tvt-alloc.c          |  51 ++++
>  .../gcn-kernel-launch-tvt-alloc.c             |  16 ++
>  12 files changed, 604 insertions(+), 191 deletions(-)
>  create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
>  create mode 100644 libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c

Ping.
  
Tobias Burnus May 28, 2026, 11:10 p.m. UTC | #4
First, I want to note that:
* [PATCH 1/4] libgomp/gcn: parallelize initializing threads of a team
   has been approved by Andrew (with request to expand a comment)
* [PATCH 3/4] libgomp/plugin-gcn: remove unneeded heap allocation in 
run_kernel
   has been approved by Andrew
* [PATCH 4/4] libgomp/oacc-mem: add missing assert to goacc_enter_datum
   is trivial and has been approved by Thomas.

Hence, only his patch remains to be reviewed.

On May 5, 2026, Arsen Arsenović wrote:
> In my examination of BabelStream results on AMD GCN, I've found that,
> for each BabelStream kernel execution, we spend significant time in
> allocating and initializing memory in gomp_map_vars (~55µs, whereas the
> actual BabelStream code executes in ~746µs, meaning we increase the time
> BabelStream measures by 7% just on that).

This overhead matters - not only for BabelStream but also for some
real world code that uses many tiny kernels rather than one
kernel and then doing lengthier work.

> Upon further examination, I've found that the only reason gomp_map_vars
> decides to allocate and map any memory in the first place is because it
> is constructing the table of pointers to variables on the target, which
> I've taken to calling the "target variable table".  Given that the GCN
> plugin already must perform some memory allocation before starting up a
> kernel, namely to allocate kernel arguments, it would be beneficial if
> we could merge this allocation with the kernel arguments allocation.
> 
> In addition, since the kernel arguments live in host memory, populating
> them can be performed using string functions, without any need to call
> for expensive host2dev copies.

I note that Nvptx (at least as currently implemented) does not
profit from this - and effectively will do the same as previously.

However, the obfuscation due to the 'session' handling is
still okayish enough - and for GCN there is a clear benefit!

> This patch introduces an opaque type for "offload sessions".  This type
> is defined by each plugin and allows it to store data related to a
> single offload job.  The sessions are allocated and managed by libgomp,
> and initialized and utilized by the plugin.  Their lifetime starts with
> a call to GOMP_OFFLOAD_session_start, and ends with
> GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.
> 
> The patch then uses this framework to make management of the target
> variable table more flexible: the plugin may elect to implement
> GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
> to attempt to allocate the target variable table in host memory.
> 
> If it fails, or if the plugin does not provide this function, libgomp
> will perform this allocation as it does today - in target memory - and
> tell the session about it using
> GOMP_OFFLOAD_session_set_target_var_table.

Namely, the code either asks the runtime to allocate the memory,
which works for GCN and is cached (as kernel args in general)
or as for 'target {enter,exit,} data' or for Nvptx also for target,
the allocation is handled by libgomp - and the mentioned function
is then used to add it to the session opaque state structure.

* * *

As a general remark, I find the following hard to read, when
glancing at the code (appears twice):

   struct gomp_offload_session *session = (gomp_offload_session_new
                                           (devicep, alloca));

I think it is easier to read if written as either

   struct gomp_offload_session *session
     = gomp_offload_session_new (devicep, alloca);

or

   struct gomp_offload_session *session = gomp_offload_session_new (devicep,
                                                                    alloca);

?

The problem is that due to the ( and the alignment of
the indentation, it takes a while to realize that this is
a function (ok: macro) call and to disentangle what's the
function name and what are the arguments. It doesn't look
to different from code like:

   struct mystruct_t *var = {abc, def, ghm};


Same issue:

+      host_tvt = (devicep->session.alloc_tvt_func
+                 (session, gomp_get_tvt_size (mapnum)));

how about:

       host_tvt = devicep->session.alloc_tvt_func (session,
                                                   gomp_get_tvt_size 
(mapnum));

(That's 78 characters.)


This might be the only two cases; it felt as if there were more,
but possibly that's it.

* * *

> --- a/libgomp/libgomp-plugin.h
> +++ b/libgomp/libgomp-plugin.h
> +/* Check that the 'struct gomp_offload_struct' declaration is acceptable, and
> +   implement GOMP_OFFLOAD_session_size.  */

I wonder whether adding - "Call this in the plugin
after defining the struct." makes this clearer?

> +#define GOMP_OFFLOAD_session_boilerplate()		\
...

* * *

> +/* Attempt to allocate a target variable table in host memory for SESSION.
> +   This table must be of at least table_size bytes and aligned to
> +   __BIGGEST_ALIGNMENT__.
> +
> +   This function will be called at most once per SESSION.
> +
> +   If this function returns NULL, or if libgomp never calls it,
> +   GOMP_OFFLOAD_session_set_target_var_table will be called instead, with
> +   memory allocated by libgomp for the purpose.
> +
> +   If this function is omitted, libgomp will behave as if it always returns
> +   NULL.  */
> +extern void **GOMP_OFFLOAD_session_allocate_target_var_table

I gather that this is mostly written from the point of a plugin
writer. Still, it is also read when trying to understand the code
in general and as libgomp/*.c programmer. Still, I find the last
three paragraphs rather confusing. How about something like that:

This function is optional and might not be implemented. If not
implemented or when it returns NULL, the memory allocation shall
be done by the caller followed by a call to
GOMP_OFFLOAD_session_allocate_target_var_table.

Or something like that?

* * *
> +/* Set TABLE, a device pointer, as the pointer to the target variable table.
> +   It may be NULL, in which case there's no target variable table.
> +
> +   Called iff GOMP_OFFLOAD_session_allocate_target_var_table did not succeed
> +   or was not called.  */

Either this function or GOMP_OFFLOAD_session_allocate_target_var_table
(returning a non-NULL pointer) must be called before using the
target variable table.

Or something like that?

> +extern void GOMP_OFFLOAD_session_set_target_var_table
...

* * *
> +  struct {
> +    /* Cached below as 'size'.  */
> +    __typeof (GOMP_OFFLOAD_session_size) *size_func;
...
> +    /* Size of a single gomp_offload_session object, as specified by
> +       GOMP_OFFLOAD_session_size.  */
> +    size_t size;
> +  } session;

"specified by ..." sounds odd - "as returned by ..." ?


Is there actually a need to add the function pointer to the struct?

+      DLSYM2 (session.size, session_size);
+      device->session.size = device->session.size_func ();

It feels as if one could just do:

   __typeof (GOMP_OFFLOAD_session_size) *size_func
     = dlsym (plugin, GOMP_OFFLOAD_session_size);
   device->session_size = size_func ();

which seems to be possibly clearer and smaller?

However, also the current version is fine.

* * *
> +/* Allocate an offload session using for gomp_device_descr DEV using ALLOC, and
> +   initialize it.  Provided as a macro, so that 'alloca' can be used as
> +   ALLOC. */
> +#define gomp_offload_session_new(devicep, alloc)		\

This somehow reads odd. Maybe:

Allocate an offload session for the gomp_device_descr DEVICEP
using ALLOC and initialize it.

?

* * *

> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c

> +allocate_session_kernargs (struct gomp_offload_session *session,
> +			   size_t table_size)
> +{
> +  GCN_DEBUG ("Session %p asked for allocation of kernargs+%zu...\n", session, table_size);

First, this line is too long.
Second, I wonder whether it is okay to use %zu or not.

For the current implementation, it probably is - but if
we ever want to move to Windows, it probably isn't. At
least for MinGW, GCC warns when using %zu - even though it
is new since C99, i.e. a long time.

Cf. -> 
https://www.eevblog.com/forum/programming/why-does-gccs-format-function-attribute-no-longer-think-zu-is-valid/

I am torn between replacing it by something
different and leaving it as is, given that we
target Linux and C99 is old and somewhat common.
(And depending what is linked on Windows for printf,
it actually works - besides: if ever compiling for
MinGW, GCC seemingly will warn.)

* * *

> +GOMP_OFFLOAD_session_allocate_target_var_table (struct gomp_offload_session *session,
> +						size_t table_size)
> +{
> +  GCN_DEBUG ("Session %p asked to allocate\n", session);
> +  /* libgomp wants us to handle the TVT.  */
> +  assert (!session->target_var_table);
> +
> +  if (secure_getenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE"))
> +    /* ... but the user does not.  Used for testing.  */
> +    return NULL;

I am a bit unsure whether it makes sense that this is called
every time. On one hand, getenv has some overhead - and it
will be called once per kernel launch. On the other hand,
there is so much action happening and, hopefully, the time
spend here is neglibile compared with all other overhead.

Still, one option would be to cache the value.

* * *

> +  /* Target variable table in host memory.  If we're doing target offloading,
> +     we'll let the plugin attempt to allocate it..  */


s/.././

Otherwise LGTM.

Thanks for the patch!

Tobias
  
Arsen Arsenović May 29, 2026, 11:38 a.m. UTC | #5
Tobias Burnus <tburnus@baylibre.com> writes:

> However, the obfuscation due to the 'session' handling is
> still okayish enough - and for GCN there is a clear benefit!

Hopefully, the session stuff is broad enough to be arbitrarily extended
in the future also.  Its lifetime is meant to match that of one target
region, and ergo, anything that must be communicated between libgomp and
its plugins for a single target region could (and, in future, should) be
stored in sessions.

Or, at least, that's what I was going for with this.

>> This patch introduces an opaque type for "offload sessions".  This type
>> is defined by each plugin and allows it to store data related to a
>> single offload job.  The sessions are allocated and managed by libgomp,
>> and initialized and utilized by the plugin.  Their lifetime starts with
>> a call to GOMP_OFFLOAD_session_start, and ends with
>> GOMP_OFFLOAD_{openacc_{async_,}exec,{async_,}run}.
>> The patch then uses this framework to make management of the target
>> variable table more flexible: the plugin may elect to implement
>> GOMP_OFFLOAD_session_allocate_target_var_table, which allows the plugin
>> to attempt to allocate the target variable table in host memory.
>> If it fails, or if the plugin does not provide this function, libgomp
>> will perform this allocation as it does today - in target memory - and
>> tell the session about it using
>> GOMP_OFFLOAD_session_set_target_var_table.
>
> Namely, the code either asks the runtime to allocate the memory,
> which works for GCN and is cached (as kernel args in general)
> or as for 'target {enter,exit,} data' or for Nvptx also for target,
> the allocation is handled by libgomp - and the mentioned function
> is then used to add it to the session opaque state structure.
>
> * * *
>
> As a general remark, I find the following hard to read, when
> glancing at the code (appears twice):
>
>   struct gomp_offload_session *session = (gomp_offload_session_new
>                                           (devicep, alloca));
>
> I think it is easier to read if written as either
>
>   struct gomp_offload_session *session
>     = gomp_offload_session_new (devicep, alloca);

This seems OK to me also - it just seemed to me that the former is more
prevalent in GCC (or, at least, in GNU code more broadly).  Will change.

>> --- a/libgomp/libgomp-plugin.h
>> +++ b/libgomp/libgomp-plugin.h
>> +/* Check that the 'struct gomp_offload_struct' declaration is acceptable, and
>> +   implement GOMP_OFFLOAD_session_size.  */
>
> I wonder whether adding - "Call this in the plugin
> after defining the struct." makes this clearer?

Yes, that's a good idea.  Will add.

>> +#define GOMP_OFFLOAD_session_boilerplate()		\
> ...
>
> * * *
>
>> +/* Attempt to allocate a target variable table in host memory for SESSION.
>> +   This table must be of at least table_size bytes and aligned to
>> +   __BIGGEST_ALIGNMENT__.
>> +
>> +   This function will be called at most once per SESSION.
>> +
>> +   If this function returns NULL, or if libgomp never calls it,
>> +   GOMP_OFFLOAD_session_set_target_var_table will be called instead, with
>> +   memory allocated by libgomp for the purpose.
>> +
>> +   If this function is omitted, libgomp will behave as if it always returns
>> +   NULL.  */
>> +extern void **GOMP_OFFLOAD_session_allocate_target_var_table
>
> I gather that this is mostly written from the point of a plugin
> writer. Still, it is also read when trying to understand the code
> in general and as libgomp/*.c programmer.
>
> Still, I find the last three paragraphs rather confusing. How about
> something like that:
>
> This function is optional and might not be implemented. If not
> implemented or when it returns NULL, the memory allocation shall
> be done by the caller followed by a call to
> GOMP_OFFLOAD_session_allocate_target_var_table.
>
> Or something like that?

The original description leaves open the possibility of the function
being implemented but libgomp nonetheless opting not to call it.

I specifically wanted to provide the guarantee of exactly
'session_allocate_target_var_table' or 'session_set_target_var_table'
being called successfully, without allowing plugin authors to rely on
which one it will be.

Obviously, this implies that session_allocate_target_var_table failing
forces a call to session_set_target_var_table, but this formualtion does
not guarantee or imply that session_allocate_target_var_table will be
attempted if present.

The reason for this is that it is valid for libgomp to not want any
target variable table, but in that case it still must call
session_set_target_var_table with a NULL pointer.

> * * *
>> +/* Set TABLE, a device pointer, as the pointer to the target variable table.
>> +   It may be NULL, in which case there's no target variable table.
>> +
>> +   Called iff GOMP_OFFLOAD_session_allocate_target_var_table did not succeed
>> +   or was not called.  */
>
> Either this function or GOMP_OFFLOAD_session_allocate_target_var_table
> (returning a non-NULL pointer) must be called before using the
> target variable table.
>
> Or something like that?

Perhaps "before dispatching the offload session kernel" or such, but
yes, that seems fine (it is otherwise unclear what "using" here means -
it is already used to store pointers by libgomp before the session is
finished, either after session_allocate_target_var_table if the plugin
allocates, or before session_set_target_var_table if libgomp does).

>> +extern void GOMP_OFFLOAD_session_set_target_var_table
> ...
>
> * * *
>> +  struct {
>> +    /* Cached below as 'size'.  */
>> +    __typeof (GOMP_OFFLOAD_session_size) *size_func;
> ...
>> +    /* Size of a single gomp_offload_session object, as specified by
>> +       GOMP_OFFLOAD_session_size.  */
>> +    size_t size;
>> +  } session;
>
> "specified by ..." sounds odd - "as returned by ..." ?

Seems OK to me.

> Is there actually a need to add the function pointer to the struct?
>
> +      DLSYM2 (session.size, session_size);
> +      device->session.size = device->session.size_func ();
>
> It feels as if one could just do:
>
>   __typeof (GOMP_OFFLOAD_session_size) *size_func
>     = dlsym (plugin, GOMP_OFFLOAD_session_size);
>   device->session_size = size_func ();
>
> which seems to be possibly clearer and smaller?
>
> However, also the current version is fine.

Yes, that can be removed, good point.  I only added the cache later
while writing the patch and it didn't occur to me to remove the
redundant function.

> * * *
>> +/* Allocate an offload session using for gomp_device_descr DEV using ALLOC, and
>> +   initialize it.  Provided as a macro, so that 'alloca' can be used as
>> +   ALLOC. */
>> +#define gomp_offload_session_new(devicep, alloc)		\
>
> This somehow reads odd. Maybe:
>
> Allocate an offload session for the gomp_device_descr DEVICEP
> using ALLOC and initialize it.

... yes, there's clearly a typo there.

I'll replace the former sentence with your version.

> ?
>
> * * *
>
>> --- a/libgomp/plugin/plugin-gcn.c
>> +++ b/libgomp/plugin/plugin-gcn.c
>
>> +allocate_session_kernargs (struct gomp_offload_session *session,
>> +			   size_t table_size)
>> +{
>> +  GCN_DEBUG ("Session %p asked for allocation of kernargs+%zu...\n", session, table_size);
>
> First, this line is too long.
> Second, I wonder whether it is okay to use %zu or not.
>
> For the current implementation, it probably is - but if
> we ever want to move to Windows, it probably isn't. At
> least for MinGW, GCC warns when using %zu - even though it
> is new since C99, i.e. a long time.

> Cf. ->
> https://www.eevblog.com/forum/programming/why-does-gccs-format-function-attribute-no-longer-think-zu-is-valid/
>
> I am torn between replacing it by something
> different and leaving it as is, given that we
> target Linux and C99 is old and somewhat common.
> (And depending what is linked on Windows for printf,
> it actually works - besides: if ever compiling for
> MinGW, GCC seemingly will warn.)

Apparently it's only standard since C23 - I was under the impression
it's older.

%ull could work - I expect even that is more than any value that can
reasonably appear here and, since it's for a mere debug print, it isn't
too important.

Hopefully we can lose this nonsense with the C++ conversion eventually.

>
>> +GOMP_OFFLOAD_session_allocate_target_var_table (struct gomp_offload_session *session,
>> +						size_t table_size)
>> +{
>> +  GCN_DEBUG ("Session %p asked to allocate\n", session);
>> +  /* libgomp wants us to handle the TVT.  */
>> +  assert (!session->target_var_table);
>> +
>> +  if (secure_getenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE"))
>> +    /* ... but the user does not.  Used for testing.  */
>> +    return NULL;
>
> I am a bit unsure whether it makes sense that this is called
> every time. On one hand, getenv has some overhead - and it
> will be called once per kernel launch. On the other hand,
> there is so much action happening and, hopefully, the time
> spend here is neglibile compared with all other overhead.
>
> Still, one option would be to cache the value.

Yes, caching this in init_environment_variables seems fair.  I'll do
that.

Here's a range diff (i.e. a diff-of-diffs) of the (untested) updated
patch series, including the comment Andrew asked for on the parallel
init patch:

~/gcc/gcc 130 $ git --no-pager range-diff ed3abd237854...HEAD
1:  9462a78715ed ! 1:  38cacbb6b203 libgomp/gcn: parallelize initializing threads of a team
    @@ Commit message
         be able to read from, and only initializes each remaining thread in the
         team with a few pointers.
     
    -    No functional changes intended in this commit.
    +    No functional changes intended in this commit.  It may seem like there
    +    is a functional change, as gomp_prep_our_thread no longer sets
    +    icv.nthreads_var, whereas the old code did, but the value that was being
    +    set by old code was always equal to the value already present in the
    +    ICV, because both are initialized from parent tasks ICV (or global ICV
    +    if that's missing) and, hence, the write was always redundant.
     
         libgomp/ChangeLog:
     
2:  3f95756e73fb ! 2:  33ac16dc133b libgomp: let plugins handle allocating the target variable table
    @@ libgomp/libgomp-plugin.h: extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, si
     +[[gnu::const]] extern size_t GOMP_OFFLOAD_session_size (void);
     +
     +/* Check that the 'struct gomp_offload_struct' declaration is acceptable, and
    -+   implement GOMP_OFFLOAD_session_size.  */
    ++   implement GOMP_OFFLOAD_session_size.  Call this in the plugin after defining
    ++   the aforementioned struct.  */
     +#define GOMP_OFFLOAD_session_boilerplate()                \
     +  GOMP_OFFLOAD_check_session_struct ();                   \
     +  [[gnu::const]] size_t                                   \
    @@ libgomp/libgomp-plugin.h: extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, si
     +/* Set TABLE, a device pointer, as the pointer to the target variable table.
     +   It may be NULL, in which case there's no target variable table.
     +
    -+   Called iff GOMP_OFFLOAD_session_allocate_target_var_table did not succeed
    -+   or was not called.  */
    ++   Before dispatching the offload kernel associated with this session, exactly
    ++   a successful call to GOMP_OFFLOAD_session_allocate_target_var_table or a
    ++   call to this function must happen, but not both.  */
     +extern void GOMP_OFFLOAD_session_set_target_var_table
     +  (struct gomp_offload_session *session, void **table);
     +
    @@ libgomp/libgomp.h: struct gomp_device_descr
        __typeof (GOMP_OFFLOAD_memcpy3d) *memcpy3d_func;
        __typeof (GOMP_OFFLOAD_memset) *memset_func;
     +  struct {
    -+    /* Cached below as 'size'.  */
    -+    __typeof (GOMP_OFFLOAD_session_size) *size_func;
     +    __typeof (GOMP_OFFLOAD_session_start) *start_func;
     +    __typeof (GOMP_OFFLOAD_session_allocate_target_var_table) *alloc_tvt_func;
     +    __typeof (GOMP_OFFLOAD_session_set_target_var_table) *set_tvt_func;
     +
    -+    /* Size of a single gomp_offload_session object, as specified by
    ++    /* Size of a single gomp_offload_session object, as returned by
     +       GOMP_OFFLOAD_session_size.  */
     +    size_t size;
     +  } session;
    @@ libgomp/libgomp.h: struct gomp_device_descr
        acc_dispatch_t openacc;
      };
      
    -+/* Allocate an offload session using for gomp_device_descr DEV using ALLOC, and
    -+   initialize it.  Provided as a macro, so that 'alloca' can be used as
    ++/* Allocate an offload session for the gomp_device_descr DEVICEP using ALLOC,
    ++   and initialize it.  Provided as a macro, so that 'alloca' can be used as
     +   ALLOC. */
     +#define gomp_offload_session_new(devicep, alloc)          \
     +  ({                                                              \
    @@ libgomp/oacc-host.c
     +};
     +_Static_assert (_Alignof (struct host_offload_session) < __BIGGEST_ALIGNMENT__,
     +          "gomp_offload_session requires too high alignment");
    -+
    -+static size_t
    -+host_session_size (void)
    -+{ return sizeof (struct host_offload_session); }
      
      static struct gomp_device_descr host_dispatch;
      
    @@ libgomp/oacc-host.c: static struct gomp_device_descr host_dispatch =
          .run_func = host_run,
      
     +    .session = {
    -+      .size_func = host_session_size,
     +      .start_func = host_session_start,
     +      .set_tvt_func = host_session_set_target_var_table,
     +      .size = sizeof (struct host_offload_session),
    @@ libgomp/plugin/plugin-gcn.c: struct kernargs {
        struct GOMP_kernel_launch_attributes kla;
      };
      
    +@@ libgomp/plugin/plugin-gcn.c: static int lowlat_size = -1;
    + 
    + static bool debug;
    + 
    ++/* Flag to decide whether to prevent merging the kernel arguments with the
    ++   target variable table, i.e. whether to always fail
    ++   GOMP_OFFLOAD_session_allocate_target_var_table.  Set from the
    ++   GCN_INHIBIT_KERNARGS_TVT_MERGE env var.  */
    ++
    ++static bool inhibit_kernargs_tvt_merge;
    ++
    + /* Flag to decide if the runtime should suppress a possible fallback to host
    +    execution.  */
    + 
     @@ libgomp/plugin/plugin-gcn.c: dump_executable_symbols (hsa_executable_t executable)
      
      /* Dump kernel DISPATCH data structure and indent it by INDENT spaces.  */
    @@ libgomp/plugin/plugin-gcn.c: dump_executable_symbols (hsa_executable_t executabl
        fprintf (stderr, "%*sthis: %p\n", indent, "", dispatch);
        fprintf (stderr, "%*squeue: %p\n", indent, "", dispatch->queue);
        fprintf (stderr, "%*skernarg_address: %p\n", indent, "", kernargs);
    +@@ libgomp/plugin/plugin-gcn.c: init_environment_variables (void)
    +   const char *lowlat = secure_getenv ("GOMP_GCN_LOWLAT_POOL");
    +   if (lowlat)
    +     lowlat_size = atoi (lowlat);
    ++
    ++  inhibit_kernargs_tvt_merge
    ++    = (bool) secure_getenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE");
    + }
    + 
    + /* Return malloc'd string with name of SYMBOL.  */
     @@ libgomp/plugin/plugin-gcn.c: max_isa_vgprs (int isa)
      
      /* }}}  */
    @@ libgomp/plugin/plugin-gcn.c: max_isa_vgprs (int isa)
     +allocate_session_kernargs (struct gomp_offload_session *session,
     +                     size_t table_size)
     +{
    -+  GCN_DEBUG ("Session %p asked for allocation of kernargs+%zu...\n", session, table_size);
    ++  GCN_DEBUG ("Session %p asked for allocation of kernargs+%llu...\n",
    ++       session, (unsigned long long) table_size);
     +  struct agent_info *agent = session->agent;
     +  assert (!session->kernarg_cache_node);
     +
    @@ libgomp/plugin/plugin-gcn.c: max_isa_vgprs (int isa)
     +  /* libgomp wants us to handle the TVT.  */
     +  assert (!session->target_var_table);
     +
    -+  if (secure_getenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE"))
    ++  if (inhibit_kernargs_tvt_merge)
     +    /* ... but the user does not.  Used for testing.  */
     +    return NULL;
     +
    @@ libgomp/target.c: GOMP_target (int device, void (*fn) (void *), const void *unus
            || !(fn_addr = gomp_get_target_fn_addr (devicep, fn)))
          return gomp_target_fallback (fn, hostaddrs, devicep, NULL);
      
    -+  struct gomp_offload_session *session = (gomp_offload_session_new
    -+                                    (devicep, alloca));
    ++  struct gomp_offload_session *session
    ++    = gomp_offload_session_new (devicep, alloca);
     +
        htab_t refcount_set = htab_create (mapnum);
        struct target_mem_desc *tgt_vars
    @@ libgomp/target.c: GOMP_target_ext (int device, void (*fn) (void *), size_t mapnu
        struct target_mem_desc *tgt_vars;
        htab_t refcount_set = NULL;
      
    -+  struct gomp_offload_session *session = (gomp_offload_session_new
    -+                                    (devicep, alloca));
    ++  struct gomp_offload_session *session
    ++    = gomp_offload_session_new (devicep, alloca);
     +
        if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
          {
    @@ libgomp/target.c: gomp_load_plugin_for_device (struct gomp_device_descr *device,
     +  device->session.size = 0;
        if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
          {
    -+      DLSYM2 (session.size, session_size);
    -+      device->session.size = device->session.size_func ();
    ++      __typeof (GOMP_OFFLOAD_session_size) *size_func
    ++  = dlsym (plugin_handle, "GOMP_OFFLOAD_session_size");
    ++      device->session.size = size_func ();
     +      DLSYM2 (session.start, session_start);
     +      DLSYM_OPT (session.alloc_tvt, session_allocate_target_var_table);
     +      DLSYM2 (session.set_tvt, session_set_target_var_table);
3:  845319e68d1f = 3:  ee8ead4d9759 libgomp/plugin-gcn: remove unneeded heap allocation in run_kernel
4:  ed3abd237854 = 4:  7e06df7286eb libgomp/oacc-mem: add missing assert to goacc_enter_datum
  

Patch

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index 0a0761043f96..8304ae839fd1 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -332,7 +332,7 @@  enum gomp_map_kind
 /* Versions of libgomp and device-specific plugins.  GOMP_VERSION
    should be incremented whenever an ABI-incompatible change is introduced
    to the plugin interface defined in libgomp/libgomp.h.  */
-#define GOMP_VERSION	3
+#define GOMP_VERSION	4
 #define GOMP_VERSION_NVIDIA_PTX 1
 #define GOMP_VERSION_GCN 3
 
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index bb4d577b66d7..66351e826d96 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -188,11 +188,76 @@  extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t, void *,
 				  size_t);
 extern bool GOMP_OFFLOAD_memset (int, void *, int, size_t);
 extern bool GOMP_OFFLOAD_can_run (void *);
-extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
-extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *);
 
-extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
-				       void **, unsigned *, void *);
+/* An opaque type, encapsulating the state required to launch a single
+   'target' region.  This type is expected to have alignment no greater than
+   the alignment 'malloc' and 'alloca' can provide.
+
+   The lifetime of the memory reserved for an offload session is managed by
+   libgomp.  It will ensure that it is deallocated only after a kernel is done
+   executing.
+
+   Per offload session, exactly one of GOMP_OFFLOAD{,_async}_run or
+   GOMP_OFFLOAD_openacc{,_async}_exec will be called.  This is also the last
+   operation performed on a session.  */
+struct gomp_offload_session;
+
+/* Validate that a 'struct gomp_offload_session' declaration is acceptable.  */
+#define GOMP_OFFLOAD_check_session_struct()				\
+  _Static_assert (_Alignof (struct gomp_offload_session) < __BIGGEST_ALIGNMENT__, \
+		 "gomp_offload_session requires too high alignment")
+
+/* Return size of a gomp_offload_session instance.  libgomp takes care of
+   allocating and deallocating enough memory to store the session.  */
+[[gnu::const]] extern size_t GOMP_OFFLOAD_session_size (void);
+
+/* Check that the 'struct gomp_offload_struct' declaration is acceptable, and
+   implement GOMP_OFFLOAD_session_size.  */
+#define GOMP_OFFLOAD_session_boilerplate()		\
+  GOMP_OFFLOAD_check_session_struct ();			\
+  [[gnu::const]] size_t					\
+  GOMP_OFFLOAD_session_size (void)			\
+  { return sizeof (struct gomp_offload_session); }
+
+/* Initialize SESSION for executing a kernel on DEVICE.  */
+extern void GOMP_OFFLOAD_session_start (struct gomp_offload_session *session,
+					int device);
+
+/* Attempt to allocate a target variable table in host memory for SESSION.
+   This table must be of at least table_size bytes and aligned to
+   __BIGGEST_ALIGNMENT__.
+
+   This function will be called at most once per SESSION.
+
+   If this function returns NULL, or if libgomp never calls it,
+   GOMP_OFFLOAD_session_set_target_var_table will be called instead, with
+   memory allocated by libgomp for the purpose.
+
+   If this function is omitted, libgomp will behave as if it always returns
+   NULL.  */
+extern void **GOMP_OFFLOAD_session_allocate_target_var_table
+  (struct gomp_offload_session *session, size_t table_size);
+
+/* Set TABLE, a device pointer, as the pointer to the target variable table.
+   It may be NULL, in which case there's no target variable table.
+
+   Called iff GOMP_OFFLOAD_session_allocate_target_var_table did not succeed
+   or was not called.  */
+extern void GOMP_OFFLOAD_session_set_target_var_table
+  (struct gomp_offload_session *session, void **table);
+
+extern void GOMP_OFFLOAD_run (struct gomp_offload_session *session,
+			      void *fn_ptr,
+			      void **args);
+extern void GOMP_OFFLOAD_async_run (struct gomp_offload_session *session,
+				    void *tgt_fn,
+				    void **args,
+				    void *async_data);
+
+extern void GOMP_OFFLOAD_openacc_exec (struct gomp_offload_session *session,
+				       void (*tgt_fn) (void *),
+				       size_t mapnum, void **hostaddrs,
+				       unsigned *dims, void *targ_mem_desc);
 extern void *GOMP_OFFLOAD_openacc_create_thread_data (int);
 extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
 extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (int);
@@ -203,9 +268,11 @@  extern bool GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
 						  struct goacc_asyncqueue *);
 extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
 						       void (*)(void *), void *);
-extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
-					     void **, unsigned *, void *,
-					     struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_exec (struct gomp_offload_session *session,
+					     void (*fn_ptr) (void *),
+					     size_t mapnum, void **hostaddrs,
+					     unsigned *dims, void *targ_mem_desc,
+					     struct goacc_asyncqueue *aq);
 extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t,
 						 struct goacc_asyncqueue *);
 extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t,
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index c51bd680713f..2b0327ebf557 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -771,6 +771,9 @@  struct gomp_target_task
   struct gomp_task *task;
   struct gomp_team *team;
   /* Device-specific target arguments.  */
+
+  /* Pointer to the offload session for this task.  */
+  struct gomp_offload_session *offload_session;
   void **args;
   void *hostaddrs[];
 };
@@ -1465,6 +1468,17 @@  struct gomp_device_descr
   __typeof (GOMP_OFFLOAD_memcpy2d) *memcpy2d_func;
   __typeof (GOMP_OFFLOAD_memcpy3d) *memcpy3d_func;
   __typeof (GOMP_OFFLOAD_memset) *memset_func;
+  struct {
+    /* Cached below as 'size'.  */
+    __typeof (GOMP_OFFLOAD_session_size) *size_func;
+    __typeof (GOMP_OFFLOAD_session_start) *start_func;
+    __typeof (GOMP_OFFLOAD_session_allocate_target_var_table) *alloc_tvt_func;
+    __typeof (GOMP_OFFLOAD_session_set_target_var_table) *set_tvt_func;
+
+    /* Size of a single gomp_offload_session object, as specified by
+       GOMP_OFFLOAD_session_size.  */
+    size_t size;
+  } session;
   __typeof (GOMP_OFFLOAD_can_run) *can_run_func;
   __typeof (GOMP_OFFLOAD_run) *run_func;
   __typeof (GOMP_OFFLOAD_async_run) *async_run_func;
@@ -1491,6 +1505,16 @@  struct gomp_device_descr
   acc_dispatch_t openacc;
 };
 
+/* Allocate an offload session using for gomp_device_descr DEV using ALLOC, and
+   initialize it.  Provided as a macro, so that 'alloca' can be used as
+   ALLOC. */
+#define gomp_offload_session_new(devicep, alloc)		\
+  ({								\
+    void *session = alloc (devicep->session.size);	\
+    devicep->session.start_func (session, devicep->target_id);	\
+    session;							\
+  })
+
 /* Kind of the pragma, for which gomp_map_vars () is called.  */
 enum gomp_map_vars_kind
 {
@@ -1524,7 +1548,8 @@  extern struct target_mem_desc *goacc_map_vars (struct gomp_device_descr *,
 					       struct goacc_asyncqueue *,
 					       size_t, void **, void **,
 					       size_t *, void *, bool,
-					       enum gomp_map_vars_kind);
+					       enum gomp_map_vars_kind,
+					       struct gomp_offload_session *);
 extern void goacc_unmap_vars (struct target_mem_desc *, bool,
 			      struct goacc_asyncqueue *);
 extern void gomp_init_device (struct gomp_device_descr *);
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 028a5c943b7e..cdfd6822afe7 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -30,8 +30,24 @@ 
 #include "oacc-int.h"
 #include "gomp-constants.h"
 
+#include <assert.h>
 #include <stdbool.h>
 #include <stddef.h>
+#include <string.h>
+
+/* Defined under a name other than gomp_offload_session to make debugging with
+   GDB easier.  If this struct was called gomp_offload_session, GDB would
+   frequently ignore the plugin-specific definition.  */
+struct host_offload_session
+{
+  void *vars;
+};
+_Static_assert (_Alignof (struct host_offload_session) < __BIGGEST_ALIGNMENT__,
+		"gomp_offload_session requires too high alignment");
+
+static size_t
+host_session_size (void)
+{ return sizeof (struct host_offload_session); }
 
 static struct gomp_device_descr host_dispatch;
 
@@ -128,19 +144,41 @@  host_host2dev (int n __attribute__ ((unused)),
 }
 
 static void
-host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars,
-	  void **args __attribute__((unused)))
+host_session_start (struct gomp_offload_session *osession, int dev)
 {
-  void (*fn)(void *) = (void (*)(void *)) fn_ptr;
-
-  fn (vars);
+  (void) dev;
+  struct host_offload_session *session;
+  memcpy (&session, &osession, sizeof (session));
+  *session = (struct host_offload_session) {
+    .vars = NULL,
+  };
 }
 
 static void
-host_openacc_exec (void (*fn) (void *),
+host_session_set_target_var_table (struct gomp_offload_session *osession,
+				   void **table)
+{
+  struct host_offload_session *session;
+  memcpy (&session, &osession, sizeof (session));
+  assert (!session->vars);
+  session->vars = table;
+}
+
+static void
+host_run (struct gomp_offload_session *osession, void *fn_ptr, void **args)
+{
+  struct host_offload_session *session;
+  memcpy (&session, &osession, sizeof (session));
+  void (*fn)(void *) = (void (*)(void *)) fn_ptr;
+
+  fn (session->vars);
+}
+
+static void
+host_openacc_exec (struct gomp_offload_session *session __attribute__((unused)),
+		   void (*fn) (void *),
 		   size_t mapnum __attribute__ ((unused)),
 		   void **hostaddrs,
-		   void **devaddrs __attribute__ ((unused)),
 		   unsigned *dims __attribute__ ((unused)),
 		   void *targ_mem_desc __attribute__ ((unused)))
 {
@@ -148,10 +186,10 @@  host_openacc_exec (void (*fn) (void *),
 }
 
 static void
-host_openacc_async_exec (void (*fn) (void *),
+host_openacc_async_exec (struct gomp_offload_session *session __attribute__((unused)),
+			 void (*fn) (void *),
 			 size_t mapnum __attribute__ ((unused)),
 			 void **hostaddrs,
-			 void **devaddrs __attribute__ ((unused)),
 			 unsigned *dims __attribute__ ((unused)),
 			 void *targ_mem_desc __attribute__ ((unused)),
 			 struct goacc_asyncqueue *aq __attribute__ ((unused)))
@@ -288,6 +326,13 @@  static struct gomp_device_descr host_dispatch =
     .memcpy3d_func = NULL,
     .run_func = host_run,
 
+    .session = {
+      .size_func = host_session_size,
+      .start_func = host_session_start,
+      .set_tvt_func = host_session_set_target_var_table,
+      .size = sizeof (struct host_offload_session),
+    },
+
     .mem_map = { NULL },
     .mem_map_rev = { NULL },
     /* .lock initialized in goacc_host_init.  */
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 738281f5701c..5601daf13957 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -421,7 +421,7 @@  acc_map_data (void *h, void *d, size_t s)
 
       struct target_mem_desc *tgt
 	= goacc_map_vars (acc_dev, NULL, mapnum, &hostaddrs, &devaddrs, &sizes,
-			  &kinds, true, GOMP_MAP_VARS_ENTER_DATA);
+			  &kinds, true, GOMP_MAP_VARS_ENTER_DATA, NULL);
       assert (tgt);
       assert (tgt->list_count == 1);
       splay_tree_key n = tgt->list[0].key;
@@ -586,7 +586,7 @@  goacc_enter_datum (void **hostaddrs, size_t *sizes, void *kinds, int async)
 
       struct target_mem_desc *tgt
 	= goacc_map_vars (acc_dev, aq, mapnum, hostaddrs, NULL, sizes,
-			  kinds, true, GOMP_MAP_VARS_ENTER_DATA);
+			  kinds, true, GOMP_MAP_VARS_ENTER_DATA, NULL);
       assert (tgt);
       assert (tgt->list_count == 1);
       n = tgt->list[0].key;
@@ -1225,7 +1225,7 @@  goacc_enter_data_internal (struct gomp_device_descr *acc_dev, size_t mapnum,
 	      struct target_mem_desc *tgt_ __attribute__((unused))
 		= goacc_map_vars (acc_dev, aq, groupnum, &hostaddrs[i], NULL,
 				  &sizes[i], &kinds[i], true,
-				  GOMP_MAP_VARS_ENTER_DATA);
+				  GOMP_MAP_VARS_ENTER_DATA, NULL);
 	      assert (tgt_ == NULL);
 	      gomp_mutex_lock (&acc_dev->lock);
 
@@ -1276,7 +1276,7 @@  goacc_enter_data_internal (struct gomp_device_descr *acc_dev, size_t mapnum,
 	  struct target_mem_desc *tgt
 	    = goacc_map_vars (acc_dev, aq, groupnum, &hostaddrs[i], NULL,
 			      &sizes[i], &kinds[i], true,
-			      GOMP_MAP_VARS_ENTER_DATA);
+			      GOMP_MAP_VARS_ENTER_DATA, NULL);
 	  assert (tgt);
 
 	  gomp_mutex_lock (&acc_dev->lock);
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 9f48c8b7f644..04ff26f28e59 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -291,9 +291,14 @@  GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 
   goacc_aq aq = get_goacc_asyncqueue (async);
 
+  /* Prepare an offload session.  */
+  struct gomp_offload_session *session
+    = (aq ? gomp_offload_session_new (acc_dev, gomp_malloc)
+       : gomp_offload_session_new (acc_dev, alloca));
+
   struct target_mem_desc *tgt
     = goacc_map_vars (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		      GOMP_MAP_VARS_TARGET);
+		      GOMP_MAP_VARS_TARGET, session);
 
   if (profiling_p)
     {
@@ -304,13 +309,11 @@  GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 				&api_info);
     }
 
-  void **devaddrs = (void **) tgt->tgt_start;
   if (aq == NULL)
-    acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, dims,
-				tgt);
+    acc_dev->openacc.exec_func (session, tgt_fn, mapnum, hostaddrs, dims, tgt);
   else
-    acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-				      dims, tgt, aq);
+    acc_dev->openacc.async.exec_func (session, tgt_fn, mapnum, hostaddrs, dims,
+				      tgt, aq);
 
   if (profiling_p)
     {
@@ -324,6 +327,11 @@  GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
   /* If running synchronously (aq == NULL), this will unmap immediately.  */
   goacc_unmap_vars (tgt, true, aq);
 
+  if (aq)
+      /* We need to clean up the above-allocated session later if executing
+	 asynchronously. */
+      acc_dev->openacc.async.queue_callback_func (aq, free, session);
+
   if (profiling_p)
     {
       prof_info.event_type = acc_ev_exit_data_end;
@@ -454,7 +462,7 @@  GOACC_data_start (int flags_m, size_t mapnum,
     {
       prof_info.device_type = acc_device_host;
       api_info.device_type = prof_info.device_type;
-      tgt = goacc_map_vars (NULL, NULL, 0, NULL, NULL, NULL, NULL, true, 0);
+      tgt = goacc_map_vars (NULL, NULL, 0, NULL, NULL, NULL, NULL, true, 0, NULL);
       tgt->prev = thr->mapped_data;
       thr->mapped_data = tgt;
 
@@ -463,7 +471,7 @@  GOACC_data_start (int flags_m, size_t mapnum,
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
   tgt = goacc_map_vars (acc_dev, NULL, mapnum, hostaddrs, NULL, sizes, kinds,
-			true, 0);
+			true, 0, NULL);
   gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
   tgt->prev = thr->mapped_data;
   thr->mapped_data = tgt;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 32f573f1b7f6..99ba65e14243 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -282,9 +282,6 @@  struct kernel_dispatch
   struct agent_info *agent;
   /* Pointer to a command queue associated with a kernel dispatch agent.  */
   void *queue;
-  /* Pointer to a memory space used for kernel arguments passing, wrapped in a
-     node from the agent kernel argument cache.  */
-  struct alloc_cache_node *kernarg_cache_node;
   /* Kernel object.  */
   uint64_t object;
   /* Synchronization signal used for dispatch synchronization.  */
@@ -305,14 +302,18 @@  struct kernargs {
 
   /* Output data.  */
   struct output output_data;
+
+  /* Target variable table.  Size determined by gomp_map_vars.  See
+     GOMP_OFFLOAD_session_allocate_target_var_table.  */
+  _Alignas (__BIGGEST_ALIGNMENT__) void *target_variable_table[];
 };
 
 /* A queue entry for a future asynchronous launch.  */
 
 struct kernel_launch
 {
+  struct gomp_offload_session *session;
   struct kernel_info *kernel;
-  void *vars;
   struct GOMP_kernel_launch_attributes kla;
 };
 
@@ -1085,11 +1086,10 @@  dump_executable_symbols (hsa_executable_t executable)
 
 /* Dump kernel DISPATCH data structure and indent it by INDENT spaces.  */
 
-static void
-print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent)
+static inline void
+print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent,
+		       struct kernargs *kernargs)
 {
-  struct kernargs *kernargs = dispatch->kernarg_cache_node->allocation;
-
   fprintf (stderr, "%*sthis: %p\n", indent, "", dispatch);
   fprintf (stderr, "%*squeue: %p\n", indent, "", dispatch->queue);
   fprintf (stderr, "%*skernarg_address: %p\n", indent, "", kernargs);
@@ -1824,6 +1824,118 @@  max_isa_vgprs (int isa)
 
 /* }}}  */
 /* {{{ Run  */
+struct gomp_offload_session
+{
+  /* Pointer to a memory space used for kernel arguments passing, wrapped in a
+     node from the agent kernel argument cache.  */
+  struct alloc_cache_node *kernarg_cache_node;
+  /* Pointer to the actual target variable table.  */
+  void **target_var_table;
+  /* Device executing the kernel for this offload session.  */
+  struct agent_info *agent;
+};
+GOMP_OFFLOAD_session_boilerplate ();
+
+/* Prepare SESSION for use by AGENT.  */
+void
+init_session (struct gomp_offload_session *session, struct agent_info *agent)
+{
+  assert (agent);
+  *session = (struct gomp_offload_session) {
+    .kernarg_cache_node = NULL,
+    .target_var_table = NULL,
+    .agent = agent,
+  };
+}
+
+void
+GOMP_OFFLOAD_session_start (struct gomp_offload_session *session, int device)
+{
+  GCN_DEBUG ("Starting session %p\n", session);
+  assert ((((uintptr_t) session) % __BIGGEST_ALIGNMENT__) == 0);
+  init_session (session, get_agent_info (device));
+}
+
+/* Release resources held by SESSION (but not SESSION itself).  */
+void
+release_session (struct gomp_offload_session *session)
+{
+  release_alloc_cache_node (session->kernarg_cache_node);
+}
+
+/* Get new kernargs for SESSION such that it can store TABLE_SIZE char units of
+   target variable table, reusing cached kernargs allocations, if possible.  */
+
+static inline struct kernargs *
+allocate_session_kernargs (struct gomp_offload_session *session,
+			   size_t table_size)
+{
+  GCN_DEBUG ("Session %p asked for allocation of kernargs+%zu...\n", session, table_size);
+  struct agent_info *agent = session->agent;
+  assert (!session->kernarg_cache_node);
+
+  /* To increase chance of cache hit, round up size of the target variable
+     table to a multiple of (64*sizeof(void*)), and ensure that this size is
+     nonzero.  */
+  if (!table_size)
+    table_size++;
+
+  {
+    constexpr size_t rounding_factor = 64 * sizeof (void*);
+    table_size += rounding_factor - 1;
+    table_size = (table_size / rounding_factor) * table_size;
+  }
+  size_t kernargs_size = sizeof (struct kernargs) + table_size;
+
+  session->kernarg_cache_node = (alloc_cache_try_find
+				 (&agent->kernarg_cache,
+				  kernargs_size));
+
+  if (!session->kernarg_cache_node)
+    {
+      /* Cache miss.  */
+      void *ka_addr;
+      hsa_status_t status = hsa_fns.hsa_memory_allocate_fn
+	(agent->kernarg_region, sizeof (struct kernargs), &ka_addr);
+      if (status != HSA_STATUS_SUCCESS)
+	hsa_fatal ("Could not allocate memory for GCN kernel arguments", status);
+
+      session->kernarg_cache_node = (alloc_cache_add_taken_node
+				     (&agent->kernarg_cache, ka_addr,
+				      kernargs_size));
+      if (!session->kernarg_cache_node)
+	GOMP_PLUGIN_fatal ("Could not allocate cache node for kernel arguments");
+    }
+
+  return session->kernarg_cache_node->allocation;
+}
+
+void **
+GOMP_OFFLOAD_session_allocate_target_var_table (struct gomp_offload_session *session,
+						size_t table_size)
+{
+  GCN_DEBUG ("Session %p asked to allocate\n", session);
+  /* libgomp wants us to handle the TVT.  */
+  assert (!session->target_var_table);
+
+  if (secure_getenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE"))
+    /* ... but the user does not.  Used for testing.  */
+    return NULL;
+
+  struct kernargs *kernargs = allocate_session_kernargs (session, table_size);
+  return session->target_var_table = &kernargs->target_variable_table[0];
+}
+
+void
+GOMP_OFFLOAD_session_set_target_var_table (struct gomp_offload_session *session,
+					   void **table)
+{
+  GCN_DEBUG ("Session %p will use TVT %p...\n", session, table);
+  assert (!session->target_var_table);
+  /* libgomp wants to handle the TVT.  */
+  allocate_session_kernargs (session, 0);
+  session->target_var_table = table;
+}
 
 /* Create or reuse a team arena and stack space.
  
@@ -2010,40 +2122,12 @@  alloc_by_agent (struct agent_info *agent, size_t size)
   return ptr;
 }
 
-/* Get a cached kernargs from AGENT, returning an existing one if any are
-   available.  Returns an alloc_cache_node whose value is this allocation.  */
-
-static struct alloc_cache_node *
-alloc_kernargs_on_agent (struct agent_info *agent, size_t size)
-{
-  struct alloc_cache_node *ka_node = (alloc_cache_try_find
-				      (&agent->kernarg_cache, size));
-
-  /* The cache was empty.  */
-  if (!ka_node)
-    {
-      void *ka_addr;
-      hsa_status_t status = hsa_fns.hsa_memory_allocate_fn
-	(agent->kernarg_region, sizeof (struct kernargs), &ka_addr);
-      if (status != HSA_STATUS_SUCCESS)
-	hsa_fatal ("Could not allocate memory for GCN kernel arguments", status);
-
-      ka_node = alloc_cache_add_taken_node (&agent->kernarg_cache,
-					    ka_addr,
-					    size);
-      if (!ka_node)
-	GOMP_PLUGIN_fatal ("Could not allocate cache node for kernel arguments");
-    }
-
-  return ka_node;
-}
-
 /* Create kernel dispatch data structure for given KERNEL, along with
    the necessary device signals and memory allocations.  */
 
 static struct kernel_dispatch *
 create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
-			int num_threads)
+			int num_threads, struct kernargs *kernargs)
 {
   struct agent_info *agent = kernel->agent;
   struct kernel_dispatch *shadow
@@ -2088,11 +2172,6 @@  create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
       return NULL;
     }
 
-  /* Get an allocation, if possible from the cache.  */
-  shadow->kernarg_cache_node = (alloc_kernargs_on_agent
-				(agent, sizeof (struct kernargs)));
-  struct kernargs *kernargs = shadow->kernarg_cache_node->allocation;
-
   /* Zero-initialize the output_data (minimum needed).  */
   kernargs->abi.out_ptr = (int64_t)&kernargs->output_data;
   kernargs->output_data.next_output = 0;
@@ -2185,19 +2264,17 @@  console_output (struct kernel_info *kernel, struct kernargs *kernargs,
 /* Release data structure created for a kernel dispatch in SHADOW argument,
    and clean up the signal and memory allocations.  */
 
-static void
-release_kernel_dispatch (struct kernel_dispatch *shadow)
+static inline void
+release_kernel_dispatch (struct kernel_dispatch *shadow,
+			 struct kernargs *kernargs)
 {
   GCN_DEBUG ("Released kernel dispatch: %p\n", shadow);
 
-  struct kernargs *kernargs = shadow->kernarg_cache_node->allocation;
   void *addr = (void *)kernargs->abi.arena_ptr;
   if (!addr)
     addr = (void *)kernargs->abi.stack_ptr;
   release_ephemeral_memories (shadow->agent, addr);
 
-  release_alloc_cache_node (shadow->kernarg_cache_node);
-
   hsa_signal_t s;
   s.handle = shadow->signal;
   hsa_fns.hsa_signal_destroy_fn (s);
@@ -2295,15 +2372,16 @@  init_kernel (struct kernel_info *kernel)
 		       "mutex");
 }
 
-/* Run KERNEL on its agent, pass VARS to it as arguments and take
-   launch attributes from KLA.
+/* Run KERNEL on its agent as part of SESSION and take launch attributes from
+   KLA.
    
    MODULE_LOCKED indicates that the caller already holds the lock and
    run_kernel need not lock it again.
    If AQ is NULL then agent->sync_queue will be used.  */
 
 static void
-run_kernel (struct kernel_info *kernel, void *vars,
+run_kernel (struct gomp_offload_session *session,
+	    struct kernel_info *kernel,
 	    struct GOMP_kernel_launch_attributes *kla,
 	    struct goacc_asyncqueue *aq, bool module_locked)
 {
@@ -2389,6 +2467,9 @@  run_kernel (struct kernel_info *kernel, void *vars,
 					     packet->grid_size_x,
 					     kla->wdims[0]);
 
+  struct kernargs *kernargs = session->kernarg_cache_node->allocation;
+  packet->kernarg_address = kernargs;
+
   if (kla->ndim >= 2)
     {
       packet->grid_size_y = kla->gdims[1];
@@ -2426,27 +2507,25 @@  run_kernel (struct kernel_info *kernel, void *vars,
 
   struct kernel_dispatch *shadow
     = create_kernel_dispatch (kernel, packet->grid_size_x,
-			      packet->grid_size_z);
+			      packet->grid_size_z, kernargs);
   shadow->queue = command_q;
 
   if (debug)
     {
       fprintf (stderr, "\nKernel has following dependencies:\n");
-      print_kernel_dispatch (shadow, 2);
+      print_kernel_dispatch (shadow, 2, kernargs);
     }
 
   packet->private_segment_size = shadow->private_segment_size;
   packet->group_segment_size = shadow->group_segment_size;
   packet->kernel_object = shadow->object;
-  struct kernargs *kernargs = (packet->kernarg_address
-			       = shadow->kernarg_cache_node->allocation);
   hsa_signal_t s;
   s.handle = shadow->signal;
   packet->completion_signal = s;
   hsa_fns.hsa_signal_store_relaxed_fn (s, 1);
-  memcpy (kernargs, &vars, sizeof (vars));
 
-  GCN_DEBUG ("Copying kernel runtime pointer to kernarg_address\n");
+  GCN_DEBUG ("Copying kernel runtime pointer %p to kernarg_address\n", session->target_var_table);
+  memcpy (kernargs, &session->target_var_table, sizeof (session->target_var_table));
 
   uint16_t header;
   header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
@@ -2476,7 +2555,8 @@  run_kernel (struct kernel_info *kernel, void *vars,
 
   unsigned int return_value = (unsigned int)kernargs->output_data.return_value;
 
-  release_kernel_dispatch (shadow);
+  release_kernel_dispatch (shadow, kernargs);
+  release_session (session);
 
   if (!module_locked && pthread_rwlock_unlock (&agent->module_rwlock))
     GOMP_PLUGIN_fatal ("Unable to unlock a GCN agent rwlock");
@@ -2788,7 +2868,10 @@  destroy_module (struct module_info *module, bool locked)
   if (module->fini_array_func)
     {
       init_kernel (module->fini_array_func);
-      run_kernel (module->fini_array_func, NULL, &kla, NULL, locked);
+      struct gomp_offload_session session;
+      init_session (&session, module->fini_array_func->agent);
+      GOMP_OFFLOAD_session_set_target_var_table (&session, NULL);
+      run_kernel (&session, module->fini_array_func, &kla, NULL, locked);
     }
   module->constructors_run_p = false;
 
@@ -2820,8 +2903,8 @@  execute_queue_entry (struct goacc_asyncqueue *aq, int index)
       if (DEBUG_QUEUES)
 	GCN_DEBUG ("Async thread %d:%d: Executing launch entry (%d)\n",
 		   aq->agent->device_id, aq->id, index);
-      run_kernel (entry->u.launch.kernel,
-		  entry->u.launch.vars,
+      run_kernel (entry->u.launch.session,
+		  entry->u.launch.kernel,
 		  &entry->u.launch.kla, aq, false);
       if (DEBUG_QUEUES)
 	GCN_DEBUG ("Async thread %d:%d: Executing launch entry (%d) done\n",
@@ -2983,8 +3066,10 @@  wait_for_queue_nonfull (struct goacc_asyncqueue *aq)
    kernel to run.  */
 
 static void
-queue_push_launch (struct goacc_asyncqueue *aq, struct kernel_info *kernel,
-		   void *vars, struct GOMP_kernel_launch_attributes *kla)
+queue_push_launch (struct goacc_asyncqueue *aq,
+		   struct gomp_offload_session *session,
+		   struct kernel_info *kernel,
+		   struct GOMP_kernel_launch_attributes *kla)
 {
   assert (aq->agent == kernel->agent);
 
@@ -2999,8 +3084,8 @@  queue_push_launch (struct goacc_asyncqueue *aq, struct kernel_info *kernel,
 	       aq->id, queue_last);
 
   aq->queue[queue_last].type = KERNEL_LAUNCH;
+  aq->queue[queue_last].u.launch.session = session;
   aq->queue[queue_last].u.launch.kernel = kernel;
-  aq->queue[queue_last].u.launch.vars = vars;
   aq->queue[queue_last].u.launch.kla = *kla;
 
   aq->queue_n++;
@@ -3401,8 +3486,8 @@  managed_heap_create (struct agent_info *agent, size_t size)
 /* Execute an OpenACC kernel, synchronously or asynchronously.  */
 
 static void
-gcn_exec (struct kernel_info *kernel,
-	  void **devaddrs, unsigned *dims, void *targ_mem_desc, bool async,
+gcn_exec (struct kernel_info *kernel, struct gomp_offload_session *session,
+	  unsigned *dims, void *targ_mem_desc, bool async,
 	  struct goacc_asyncqueue *aq)
 {
   if (!GOMP_OFFLOAD_can_run (kernel))
@@ -3522,9 +3607,9 @@  gcn_exec (struct kernel_info *kernel,
     }
 
   if (!async)
-    run_kernel (kernel, devaddrs, &kla, NULL, false);
+    run_kernel (session, kernel, &kla, NULL, false);
   else
-    queue_push_launch (aq, kernel, devaddrs, &kla);
+    queue_push_launch (aq, session, kernel, &kla);
 
   if (profiling_dispatch_p)
     {
@@ -4096,7 +4181,10 @@  GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   if (module->init_array_func)
     {
       init_kernel (module->init_array_func);
-      run_kernel (module->init_array_func, NULL, &kla, NULL, false);
+      struct gomp_offload_session session;
+      init_session (&session, agent);
+      GOMP_OFFLOAD_session_set_target_var_table (&session, NULL);
+      run_kernel (&session, module->init_array_func, &kla, NULL, false);
     }
   module->constructors_run_p = true;
 
@@ -5232,9 +5320,9 @@  GOMP_OFFLOAD_get_interop_type_desc (struct interop_obj_t *obj,
    specified device.  */
 
 void
-GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars, void **args)
+GOMP_OFFLOAD_run (struct gomp_offload_session *session, void *fn_ptr, void **args)
 {
-  struct agent_info *agent = get_agent_info (device);
+  struct agent_info *agent = session->agent;
   struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
   struct GOMP_kernel_launch_attributes def;
   struct GOMP_kernel_launch_attributes *kla;
@@ -5248,7 +5336,7 @@  GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars, void **args)
       GCN_WARNING ("Will not run GCN kernel because the grid size is zero\n");
       return;
     }
-  run_kernel (kernel, vars, kla, NULL, false);
+  run_kernel (session, kernel, kla, NULL, false);
 }
 
 /* Run an asynchronous OpenMP kernel on DEVICE.  This is similar to
@@ -5256,11 +5344,13 @@  GOMP_OFFLOAD_run (int device, void *fn_ptr, void *vars, void **args)
    GOMP_PLUGIN_target_task_completion when it has finished.  */
 
 void
-GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
-			void **args, void *async_data)
+GOMP_OFFLOAD_async_run (struct gomp_offload_session *session,
+			void *tgt_fn,
+			void **args,
+			void *async_data)
 {
   GCN_DEBUG ("GOMP_OFFLOAD_async_run invoked\n");
-  struct agent_info *agent = get_agent_info (device);
+  struct agent_info *agent = session->agent;
   struct kernel_info *kernel = (struct kernel_info *) tgt_fn;
   struct GOMP_kernel_launch_attributes def;
   struct GOMP_kernel_launch_attributes *kla;
@@ -5278,7 +5368,7 @@  GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
   maybe_init_omp_async (agent);
   if (!agent->omp_async_queue)
     GOMP_PLUGIN_fatal ("Asynchronous queue initialization failed");
-  queue_push_launch (agent->omp_async_queue, kernel, tgt_vars, kla);
+  queue_push_launch (agent->omp_async_queue, session, kernel, kla);
   queue_push_callback (agent->omp_async_queue,
 		       GOMP_PLUGIN_target_task_completion, async_data);
 }
@@ -5422,30 +5512,30 @@  GOMP_OFFLOAD_is_accessible_ptr (int device, const void *ptr, size_t size)
    already-loaded KERNEL.  */
 
 void
-GOMP_OFFLOAD_openacc_exec (void (*fn_ptr) (void *),
+GOMP_OFFLOAD_openacc_exec (struct gomp_offload_session *session,
+			   void (*fn_ptr) (void *),
 			   size_t mapnum __attribute__((unused)),
 			   void **hostaddrs __attribute__((unused)),
-			   void **devaddrs, unsigned *dims,
-			   void *targ_mem_desc)
+			   unsigned *dims, void *targ_mem_desc)
 {
   struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
 
-  gcn_exec (kernel, devaddrs, dims, targ_mem_desc, false, NULL);
+  gcn_exec (kernel, session, dims, targ_mem_desc, false, NULL);
 }
 
 /* Run an asynchronous OpenACC kernel on the specified queue.  */
 
 void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn_ptr) (void *),
+GOMP_OFFLOAD_openacc_async_exec (struct gomp_offload_session *session,
+				 void (*fn_ptr) (void *),
 				 size_t mapnum __attribute__((unused)),
 				 void **hostaddrs __attribute__((unused)),
-				 void **devaddrs,
 				 unsigned *dims, void *targ_mem_desc,
 				 struct goacc_asyncqueue *aq)
 {
   struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
 
-  gcn_exec (kernel, devaddrs, dims, targ_mem_desc, true, aq);
+  gcn_exec (kernel, session, dims, targ_mem_desc, true, aq);
 }
 
 /* Create a new asynchronous thread and queue for running future kernels.  */
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index a540e9d4cce8..a0100ec3f97a 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -828,6 +828,33 @@  link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
   return true;
 }
 
+/* The NVPTX plugin can't make much use of this abstraction, so it has the bare
+   minimum possible.  */
+struct gomp_offload_session
+{
+  int device;
+  void **target_var_table;
+};
+GOMP_OFFLOAD_session_boilerplate();
+
+void
+GOMP_OFFLOAD_session_start (struct gomp_offload_session *session, int device)
+{
+  assert ((((uintptr_t) session) % __BIGGEST_ALIGNMENT__) == 0);
+  *session = (struct gomp_offload_session) {
+    .device = device,
+    .target_var_table = NULL,
+  };
+}
+
+void
+GOMP_OFFLOAD_session_set_target_var_table (struct gomp_offload_session *session,
+					   void **table)
+{
+  assert (!session->target_var_table);
+  session->target_var_table = table;
+}
+
 static void
 nvptx_exec (void (*fn), unsigned *dims, void *targ_mem_desc,
 	    CUdeviceptr dp, CUstream stream)
@@ -1991,15 +2018,15 @@  GOMP_OFFLOAD_page_locked_host_free (void *ptr)
 }
 
 void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *),
+GOMP_OFFLOAD_openacc_exec (struct gomp_offload_session *session,
+			   void (*fn) (void *),
 			   size_t mapnum  __attribute__((unused)),
 			   void **hostaddrs __attribute__((unused)),
-			   void **devaddrs,
 			   unsigned *dims, void *targ_mem_desc)
 {
   GOMP_PLUGIN_debug (0, "nvptx %s\n", __FUNCTION__);
 
-  CUdeviceptr dp = (CUdeviceptr) devaddrs;
+  CUdeviceptr dp = (CUdeviceptr) session->target_var_table;
   nvptx_exec (fn, dims, targ_mem_desc, dp, NULL);
 
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
@@ -2012,16 +2039,16 @@  GOMP_OFFLOAD_openacc_exec (void (*fn) (void *),
 }
 
 void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *),
+GOMP_OFFLOAD_openacc_async_exec (struct gomp_offload_session *session,
+				 void (*fn) (void *),
 				 size_t mapnum __attribute__((unused)),
 				 void **hostaddrs __attribute__((unused)),
-				 void **devaddrs,
 				 unsigned *dims, void *targ_mem_desc,
 				 struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, "nvptx %s\n", __FUNCTION__);
 
-  CUdeviceptr dp = (CUdeviceptr) devaddrs;
+  CUdeviceptr dp = (CUdeviceptr) session->target_var_table;
   nvptx_exec (fn, dims, targ_mem_desc, dp, aq->cuda_stream);
 }
 
@@ -2957,7 +2984,7 @@  GOMP_OFFLOAD_get_interop_type_desc (struct interop_obj_t *obj,
 }
 
 void
-GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
+GOMP_OFFLOAD_run (struct gomp_offload_session *session, void *tgt_fn, void **args)
 {
   struct targ_fn_descriptor *tgt_fn_desc
     = (struct targ_fn_descriptor *) tgt_fn;
@@ -2965,7 +2992,7 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   const struct targ_fn_launch *launch = tgt_fn_desc->launch;
   const char *fn_name = launch->fn;
   CUresult r;
-  struct ptx_device *ptx_dev = ptx_devices[ord];
+  struct ptx_device *ptx_dev = ptx_devices[session->device];
   const char *maybe_abort_msg = "(perhaps abort was called)";
   int teams = 0, threads = 0;
 
@@ -3003,7 +3030,7 @@  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
-  void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
+  void *fn_args[] = {session->target_var_table, stacks, (void *) stack_size};
   size_t fn_args_size = sizeof fn_args;
   void *config[] = {
     CU_LAUNCH_PARAM_BUFFER_POINTER, fn_args,
diff --git a/libgomp/target.c b/libgomp/target.c
index d562b0493eac..8f1612b74824 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1173,14 +1173,27 @@  gomp_present_fatal (void *addr, size_t size, struct gomp_device_descr *devicep)
 #endif
 }
 
+/* Get size of region required for target variable table for MAPNUM
+   mappings.  */
+
+static inline size_t
+gomp_get_tvt_size (size_t mapnum)
+{
+  return mapnum * sizeof (void *);
+}
+
 static inline __attribute__((always_inline)) struct target_mem_desc *
 gomp_map_vars_internal (struct gomp_device_descr *devicep,
 			struct goacc_asyncqueue *aq, size_t mapnum,
 			void **hostaddrs, void **devaddrs, size_t *sizes,
 			void *kinds, bool short_mapkind,
 			htab_t *refcount_set,
-			enum gomp_map_vars_kind pragma_kind)
+			enum gomp_map_vars_kind pragma_kind,
+			struct gomp_offload_session *session)
 {
+  bool target_p = pragma_kind & GOMP_MAP_VARS_TARGET;
+  assert (/* SESSION must be present iff doing target offload.  */
+	  !!session == target_p);
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   bool has_firstprivate = false;
   bool has_always_ptrset = false;
@@ -1206,32 +1219,12 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
     {
       tgt->tgt_start = 0;
       tgt->tgt_end = 0;
+      if (session)
+	/* We promise to always call either set_tvt or alloc_tvt.  */
+	devicep->session.set_tvt_func (session, NULL);
       return tgt;
     }
 
-  tgt_align = sizeof (void *);
-  tgt_size = 0;
-  cbuf.chunks = NULL;
-  cbuf.chunk_cnt = -1;
-  cbuf.use_cnt = 0;
-  cbuf.buf = NULL;
-  if (mapnum > 1 || (pragma_kind & GOMP_MAP_VARS_TARGET))
-    {
-      size_t chunks_size = (mapnum + 1) * sizeof (struct gomp_coalesce_chunk);
-      cbuf.chunks = (struct gomp_coalesce_chunk *) gomp_alloca (chunks_size);
-      cbuf.chunk_cnt = 0;
-    }
-  if (pragma_kind & GOMP_MAP_VARS_TARGET)
-    {
-      size_t align = 4 * sizeof (void *);
-      tgt_align = align;
-      tgt_size = mapnum * sizeof (void *);
-      cbuf.chunk_cnt = 1;
-      cbuf.use_cnt = 1 + (mapnum > 1);
-      cbuf.chunks[0].start = 0;
-      cbuf.chunks[0].end = tgt_size;
-    }
-
   gomp_mutex_lock (&devicep->lock);
   if (devicep->state == GOMP_DEVICE_FINALIZED)
     {
@@ -1240,6 +1233,48 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
       return NULL;
     }
 
+  /* Target variable table in host memory.  If we're doing target offloading,
+     we'll let the plugin attempt to allocate it..  */
+  void **host_tvt = NULL;
+  if (target_p && devicep->session.alloc_tvt_func)
+    {
+      host_tvt = (devicep->session.alloc_tvt_func
+		  (session, gomp_get_tvt_size (mapnum)));
+      assert ((((uintptr_t)host_tvt) % __BIGGEST_ALIGNMENT__) == 0);
+    }
+
+  /* True if we need to allocate the target var table, i.e. when doing
+     offloading and when we fail to allocate it above.  */
+  bool need_tvt_alloc = !host_tvt && target_p;
+
+  /* Initialize size tracking variables.  */
+  tgt_align = sizeof (void *);
+  tgt_size = 0;
+
+  /* Prepare coalesce buffer.  */
+  cbuf.chunks = NULL;
+  cbuf.chunk_cnt = -1;
+  cbuf.use_cnt = 0;
+  cbuf.buf = NULL;
+  if (mapnum > 1 || !host_tvt)
+    {
+      size_t chunks_size = (mapnum + 1) * sizeof (struct gomp_coalesce_chunk);
+      cbuf.chunks = (struct gomp_coalesce_chunk *) gomp_alloca (chunks_size);
+      cbuf.chunk_cnt = 0;
+    }
+
+  if (need_tvt_alloc)
+    {
+      /* Prepare for allocating the target variable table.  */
+      size_t align = 4 * sizeof (void *);
+      tgt_align = align;
+      tgt_size = gomp_get_tvt_size (mapnum);
+      cbuf.chunk_cnt = 1;
+      cbuf.use_cnt = 1 + (mapnum > 1);
+      cbuf.chunks[0].start = 0;
+      cbuf.chunks[0].end = tgt_size;
+    }
+
   for (i = 0; i < mapnum; i++)
     {
       int kind = get_kind (short_mapkind, kinds, i);
@@ -1496,7 +1531,7 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
       tgt->tgt_start = (uintptr_t) tgt->to_free;
       tgt->tgt_end = tgt->tgt_start + sizes[0];
     }
-  else if (not_found_cnt || (pragma_kind & GOMP_MAP_VARS_TARGET))
+  else if (not_found_cnt || need_tvt_alloc || has_firstprivate)
     {
       /* Allocate tgt_align aligned tgt_size block of memory.  */
       /* FIXME: Perhaps change interface to allocate properly aligned
@@ -1534,8 +1569,8 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
     }
 
   tgt_size = 0;
-  if (pragma_kind & GOMP_MAP_VARS_TARGET)
-    tgt_size = mapnum * sizeof (void *);
+  if (need_tvt_alloc)
+    tgt_size = gomp_get_tvt_size (mapnum);
 
   tgt->array = NULL;
   if (not_found_cnt || has_firstprivate || has_always_ptrset)
@@ -2068,7 +2103,7 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  }
     }
 
-  if (pragma_kind & GOMP_MAP_VARS_TARGET)
+  if (target_p)
     {
       /* The target variables table is constructed with maps using iterators
 	 unexpanded. Now that the iterator maps are expanded, we will need to
@@ -2080,12 +2115,24 @@  gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	if (!iterator_count || iterator_count[i] <= 1)
 	  {
 	    cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
-	    gomp_copy_host2dev (devicep, aq,
-				(void *) (tgt->tgt_start + map_num * sizeof (void *)),
-				(void *) &cur_node.tgt_offset, sizeof (void *),
-				true, cbufp);
+	    if (host_tvt)
+	      /* In this case, it's on the host.  */
+	      memcpy (&host_tvt[map_num], &cur_node.tgt_offset,
+		      sizeof (void *));
+	    else
+	      /* Otherwise, the table is on the device.  */
+	      gomp_copy_host2dev (devicep, aq,
+				  (void *) (tgt->tgt_start + map_num * sizeof (void *)),
+				  (void *) &cur_node.tgt_offset, sizeof (void *),
+				  true, cbufp);
 	    map_num++;
 	  }
+
+      if (!host_tvt)
+	/* The call to GOMP_OFFLOAD_session_allocate_target_var_table failed,
+	   so we must inform the session about the target var table we
+	   allocated.  */
+	devicep->session.set_tvt_func (session, (void **) tgt->tgt_start);
     }
 
   if (cbufp)
@@ -2133,7 +2180,8 @@  static struct target_mem_desc *
 gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	       void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
 	       bool short_mapkind, htab_t *refcount_set,
-	       enum gomp_map_vars_kind pragma_kind)
+	       enum gomp_map_vars_kind pragma_kind,
+	       struct gomp_offload_session *session)
 {
   /* This management of a local refcount_set is for convenience of callers
      who do not share a refcount_set over multiple map/unmap uses.  */
@@ -2147,7 +2195,7 @@  gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   struct target_mem_desc *tgt;
   tgt = gomp_map_vars_internal (devicep, NULL, mapnum, hostaddrs, devaddrs,
 				sizes, kinds, short_mapkind, refcount_set,
-				pragma_kind);
+				pragma_kind, session);
   if (local_refcount_set)
     htab_free (local_refcount_set);
 
@@ -2159,11 +2207,13 @@  goacc_map_vars (struct gomp_device_descr *devicep,
 		struct goacc_asyncqueue *aq, size_t mapnum,
 		void **hostaddrs, void **devaddrs, size_t *sizes,
 		void *kinds, bool short_mapkind,
-		enum gomp_map_vars_kind pragma_kind)
+		enum gomp_map_vars_kind pragma_kind,
+		struct gomp_offload_session *session)
 {
   return gomp_map_vars_internal (devicep, aq, mapnum, hostaddrs, devaddrs,
 				 sizes, kinds, short_mapkind, NULL,
-				 GOMP_MAP_VARS_OPENACC | pragma_kind);
+				 GOMP_MAP_VARS_OPENACC | pragma_kind,
+				 session);
 }
 
 static void
@@ -3200,12 +3250,14 @@  GOMP_target (int device, void (*fn) (void *), const void *unused,
       || !(fn_addr = gomp_get_target_fn_addr (devicep, fn)))
     return gomp_target_fallback (fn, hostaddrs, devicep, NULL);
 
+  struct gomp_offload_session *session = (gomp_offload_session_new
+					  (devicep, alloca));
+
   htab_t refcount_set = htab_create (mapnum);
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
-		     &refcount_set, GOMP_MAP_VARS_TARGET);
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start,
-		     NULL);
+		     &refcount_set, GOMP_MAP_VARS_TARGET, session);
+  devicep->run_func (session, fn_addr, NULL);
   htab_clear (refcount_set);
   gomp_unmap_vars (tgt_vars, true, &refcount_set);
   htab_free (refcount_set);
@@ -3524,6 +3576,9 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
   struct target_mem_desc *tgt_vars;
   htab_t refcount_set = NULL;
 
+  struct gomp_offload_session *session = (gomp_offload_session_new
+					  (devicep, alloca));
+
   if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     {
       if (!fpc_done)
@@ -3538,16 +3593,16 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	    }
 	}
       tgt_vars = NULL;
+      devicep->session.set_tvt_func (session, hostaddrs);
     }
   else
     {
       refcount_set = htab_create (mapnum);
       tgt_vars = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds,
-				true, &refcount_set, GOMP_MAP_VARS_TARGET);
+				true, &refcount_set, GOMP_MAP_VARS_TARGET,
+				session);
     }
-  devicep->run_func (devicep->target_id, fn_addr,
-		     tgt_vars ? (void *) tgt_vars->tgt_start : hostaddrs,
-		     new_args);
+  devicep->run_func (session, fn_addr, new_args);
   if (tgt_vars)
     {
       htab_clear (refcount_set);
@@ -4146,7 +4201,7 @@  gomp_target_data_fallback (struct gomp_device_descr *devicep)
          would get out of sync.  */
       struct target_mem_desc *tgt
 	= gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, false,
-			 NULL, GOMP_MAP_VARS_DATA);
+			 NULL, GOMP_MAP_VARS_DATA, NULL);
       tgt->prev = icv->target_data;
       icv->target_data = tgt;
     }
@@ -4165,7 +4220,7 @@  GOMP_target_data (int device, const void *unused, size_t mapnum,
 
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
-		     NULL, GOMP_MAP_VARS_DATA);
+		     NULL, GOMP_MAP_VARS_DATA, NULL);
   struct gomp_task_icv *icv = gomp_icv (true);
   tgt->prev = icv->target_data;
   icv->target_data = tgt;
@@ -4184,7 +4239,7 @@  GOMP_target_data_ext (int device, size_t mapnum, void **hostaddrs,
 
   struct target_mem_desc *tgt
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		     NULL, GOMP_MAP_VARS_DATA);
+		     NULL, GOMP_MAP_VARS_DATA, NULL);
   struct gomp_task_icv *icv = gomp_icv (true);
   tgt->prev = icv->target_data;
   icv->target_data = tgt;
@@ -4509,7 +4564,7 @@  GOMP_target_enter_exit_data (int device, size_t mapnum, void **hostaddrs,
 	{
 	  gomp_map_vars (devicep, sizes[i] + 1, &hostaddrs[i], NULL, &sizes[i],
 			 &kinds[i], true, &refcount_set,
-			 GOMP_MAP_VARS_ENTER_DATA);
+			 GOMP_MAP_VARS_ENTER_DATA, NULL);
 	  i += sizes[i];
 	}
       else if ((kinds[i] & 0xff) == GOMP_MAP_TO_PSET)
@@ -4520,7 +4575,7 @@  GOMP_target_enter_exit_data (int device, size_t mapnum, void **hostaddrs,
 	      break;
 	  gomp_map_vars (devicep, j-i, &hostaddrs[i], NULL, &sizes[i],
 			 &kinds[i], true, &refcount_set,
-			 GOMP_MAP_VARS_ENTER_DATA);
+			 GOMP_MAP_VARS_ENTER_DATA, NULL);
 	  i += j - i - 1;
 	}
       else if (i + 1 < mapnum
@@ -4531,12 +4586,12 @@  GOMP_target_enter_exit_data (int device, size_t mapnum, void **hostaddrs,
 	  /* An attach operation must be processed together with the mapped
 	     base-pointer list item.  */
 	  gomp_map_vars (devicep, 2, &hostaddrs[i], NULL, &sizes[i], &kinds[i],
-			 true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA);
+			 true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA, NULL);
 	  i += 1;
 	}
       else
 	gomp_map_vars (devicep, 1, &hostaddrs[i], NULL, &sizes[i], &kinds[i],
-		       true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA);
+		       true, &refcount_set, GOMP_MAP_VARS_ENTER_DATA, NULL);
   else
     gomp_exit_data (devicep, mapnum, hostaddrs, sizes, kinds, &refcount_set);
   htab_free (refcount_set);
@@ -4569,24 +4624,23 @@  gomp_target_task_fn (void *data)
 	  return false;
 	}
 
-      void *actual_arguments;
+      struct gomp_offload_session *session = ttask->offload_session;
+      devicep->session.start_func (session, devicep->target_id);
+
       if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	{
 	  ttask->tgt = NULL;
-	  actual_arguments = ttask->hostaddrs;
+	  devicep->session.set_tvt_func (session, ttask->hostaddrs);
 	}
       else
-	{
-	  ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
-				      NULL, ttask->sizes, ttask->kinds, true,
-				      NULL, GOMP_MAP_VARS_TARGET);
-	  actual_arguments = (void *) ttask->tgt->tgt_start;
-	}
+	ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
+				    NULL, ttask->sizes, ttask->kinds, true,
+				    NULL, GOMP_MAP_VARS_TARGET,
+				    session);
       ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
 
       assert (devicep->async_run_func);
-      devicep->async_run_func (devicep->target_id, fn_addr, actual_arguments,
-			       ttask->args, (void *) ttask);
+      devicep->async_run_func (session, fn_addr, ttask->args, (void *) ttask);
       return true;
     }
   else if (devicep == NULL
@@ -4608,13 +4662,13 @@  gomp_target_task_fn (void *data)
 	    {
 	      gomp_map_vars (devicep, ttask->sizes[i] + 1, &ttask->hostaddrs[i],
 			     NULL, &ttask->sizes[i], &ttask->kinds[i], true,
-			     &refcount_set, GOMP_MAP_VARS_ENTER_DATA);
+			     &refcount_set, GOMP_MAP_VARS_ENTER_DATA, NULL);
 	      i += ttask->sizes[i];
 	    }
 	  else
 	    gomp_map_vars (devicep, 1, &ttask->hostaddrs[i], NULL, &ttask->sizes[i],
 			   &ttask->kinds[i], true, &refcount_set,
-			   GOMP_MAP_VARS_ENTER_DATA);
+			   GOMP_MAP_VARS_ENTER_DATA, NULL);
       else
 	gomp_exit_data (devicep, ttask->mapnum, ttask->hostaddrs, ttask->sizes,
 			ttask->kinds, &refcount_set);
@@ -6020,8 +6074,9 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
   /* Check if all required functions are available in the plugin and store
      their handlers.  None of the symbols can legitimately be NULL,
      so we don't need to check dlerror all the time.  */
-#define DLSYM(f)							\
-  if (!(device->f##_func = dlsym (plugin_handle, "GOMP_OFFLOAD_" #f)))	\
+#define DLSYM(f) DLSYM2(f, f)
+#define DLSYM2(f, n)							\
+  if (!(device->f##_func = dlsym (plugin_handle, "GOMP_OFFLOAD_" #n)))	\
     goto dl_fail
   /* Similar, but missing functions are not an error.  Return false if
      failed, true otherwise.  */
@@ -6065,8 +6120,15 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
     }
 
   device->capabilities = device->get_caps_func ();
+  device->session.size = 0;
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {
+      DLSYM2 (session.size, session_size);
+      device->session.size = device->session.size_func ();
+      DLSYM2 (session.start, session_start);
+      DLSYM_OPT (session.alloc_tvt, session_allocate_target_var_table);
+      DLSYM2 (session.set_tvt, session_set_target_var_table);
+
       DLSYM (run);
       DLSYM_OPT (async_run, async_run);
       DLSYM_OPT (can_run, can_run);
@@ -6114,6 +6176,7 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
 	}
     }
 #undef DLSYM
+#undef DLSYM2
 #undef DLSYM_OPT
 
   return 1;
diff --git a/libgomp/task.c b/libgomp/task.c
index cbba28516e3f..89dafb872208 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -936,12 +936,25 @@  gomp_create_target_task (struct gomp_device_descr *devicep,
 	}
     }
 
-  task = gomp_malloc (sizeof (*task) + depend_size
-		      + sizeof (*ttask)
-		      + args_cnt * sizeof (void *)
-		      + mapnum * (sizeof (void *) + sizeof (size_t)
-				  + sizeof (unsigned short))
-		      + tgt_size);
+  size_t task_alloc_size = (sizeof (*task) + depend_size
+		       + sizeof (*ttask)
+		       + args_cnt * sizeof (void *)
+		       + mapnum * (sizeof (void *) + sizeof (size_t)
+				   + sizeof (unsigned short))
+		       + tgt_size);
+  size_t session_start_offset = 0;
+  if (devicep && devicep->session.size)
+    {
+      /* gomp_malloc always aligns to __BIGGEST_ALIGNMENT__, so, we can just
+	 round up the size to preserve that alignment...  */
+      size_t align = __BIGGEST_ALIGNMENT__ - 1;
+      task_alloc_size = (task_alloc_size + align) & ~align;
+      session_start_offset = task_alloc_size;
+
+      /* ... and reserve enough room.  */
+      task_alloc_size += devicep->session.size;
+    }
+  task = gomp_malloc (task_alloc_size);
   gomp_init_task (task, parent, gomp_icv (false));
   task->priority = 0;
   task->kind = GOMP_TASK_WAITING;
@@ -951,6 +964,14 @@  gomp_create_target_task (struct gomp_device_descr *devicep,
   ttask->devicep = devicep;
   ttask->fn = fn;
   ttask->mapnum = mapnum;
+
+  ttask->offload_session = NULL;
+  if (session_start_offset)
+    {
+      uintptr_t session_ptr = (uintptr_t) task + session_start_offset;
+      ttask->offload_session = (void *) session_ptr;
+    }
+
   memcpy (ttask->hostaddrs, hostaddrs, mapnum * sizeof (void *));
   if (args_cnt)
     {
diff --git a/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
new file mode 100644
index 000000000000..7494c5a5f4c8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-no-tvt-alloc.c
@@ -0,0 +1,51 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target offload_device_gcn } */
+#include <assert.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int
+main ()
+{
+  setenv ("GCN_DEBUG", "1", true);
+#ifdef INHIBIT_KERNARGS_MERGE
+  /* See gcn-kernel-launch-tvt-alloc.c  */
+  setenv ("GCN_INHIBIT_KERNARGS_TVT_MERGE", "1", true);
+#endif
+
+  int i[1] = { 0 };
+
+#pragma omp target data map(tofrom: i[0:1])
+  {
+    fprintf (stderr, "================\n");
+
+#pragma omp target
+    { i[0] = 1; }
+  }
+
+  assert (i[0] == 1);
+}
+
+/* Here, we want to ensure that we have no allocations after the point
+   delimited by ===...
+
+   Past that point, the only data to map onto the device is the target
+   variable table, which should be passed as kernel arguments.  The GCN plugin
+   currently does not log allocating those.  We rely on that here.
+
+   So, dg-output lets us match the entire output with a regex.  Multiple
+   dg-output invocations will have their regexes concatenated in order.  The
+   following is that regex, broken down by function:
+
+   Ignore ===... marker and everything before it.
+     { dg-output {^.*================[\r\n]+} }
+   Then, each further line is either...
+     { dg-output {((} }
+   ... a line not starting with "GCN debug: "...
+     { dg-output {(?!GCN debug:)[^\r\n]+} }
+   ... or a "GCN debug: ..." line that is not an allocation:
+     { dg-output {|GCN debug: (?!Allocating )[^\r\n]*} }
+   ... followed by a line terminator, of course.
+     { dg-output {)[\r\n]+)*} }
+   There should be nothing left.
+     { dg-output {$} }  */
diff --git a/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
new file mode 100644
index 000000000000..ab5ed2dc4336
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/gcn-kernel-launch-tvt-alloc.c
@@ -0,0 +1,16 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target offload_device_gcn } */
+
+/* Test that the no-merge case still works.  */
+
+#define INHIBIT_KERNARGS_MERGE
+#include "./gcn-kernel-launch-no-tvt-alloc.c"
+
+/* See commentary in <gcn-kernel-launch-no-tvt-alloc.c>.
+
+   Ignore ===... marker and everything before it.
+     { dg-output {^.*================[\r\n]+} }
+   We expect at least "GCN debug: Allocating \d+ bytes..."
+     { dg-output {.*[\r\n]+GCN debug: Allocating \d+ bytes.*} }
+   There should be nothing left.
+     { dg-output {$} }  */