From patchwork Sun Dec 3 00:32:22 2023
X-Patchwork-Submitter: Andrew Stubbs
X-Patchwork-Id: 81214
From: Andrew Stubbs
To:
Subject: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
Date: Sun, 3 Dec 2023 00:32:22 +0000
Message-ID: <20231203003224.1638841-2-ams@codesourcery.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20231203003224.1638841-1-ams@codesourcery.com>
References: <20231203003224.1638841-1-ams@codesourcery.com>
List-Id: Gcc-patches mailing list

This patch adds support for allocating low-latency ".shared" memory on the
NVPTX GPU device, via omp_low_lat_mem_space and omp_alloc.  The memory can
be allocated, reallocated, and freed using a basic but fast algorithm; the
allocator is thread safe, and the size of the low-latency heap can be
configured using the GOMP_NVPTX_LOWLAT_POOL environment variable.  The use
of the PTX dynamic_smem_size feature means that the low-latency allocator
will not work with the PTX 3.1 multilib.
For now, the omp_low_lat_mem_alloc allocator also works, but that will
change when I implement the access traits.

libgomp/ChangeLog:

	* allocator.c (MEMSPACE_ALLOC): New macro.
	(MEMSPACE_CALLOC): New macro.
	(MEMSPACE_REALLOC): New macro.
	(MEMSPACE_FREE): New macro.
	(predefined_alloc_mapping): New array.  Add _Static_assert to match.
	(ARRAY_SIZE): New macro.
	(omp_aligned_alloc): Use MEMSPACE_ALLOC.
	Implement fall-backs for predefined allocators.
	Simplify existing fall-backs.
	(omp_free): Use MEMSPACE_FREE.
	(omp_calloc): Use MEMSPACE_CALLOC.
	Implement fall-backs for predefined allocators.
	Simplify existing fall-backs.
	(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
	Implement fall-backs for predefined allocators.
	Simplify existing fall-backs.
	* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
	(__nvptx_lowlat_init): New prototype.
	(gomp_nvptx_main): Call __nvptx_lowlat_init.
	* libgomp.texi: Update memory space table.
	* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
	(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
	(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
	* basic-allocator.c: New file.
	* config/nvptx/allocator.c: New file.
	* testsuite/libgomp.c/omp_alloc-1.c: New test.
	* testsuite/libgomp.c/omp_alloc-2.c: New test.
	* testsuite/libgomp.c/omp_alloc-3.c: New test.
	* testsuite/libgomp.c/omp_alloc-4.c: New test.
	* testsuite/libgomp.c/omp_alloc-5.c: New test.
	* testsuite/libgomp.c/omp_alloc-6.c: New test.
Co-authored-by: Kwok Cheung Yeung
Co-Authored-By: Thomas Schwinge
---
 libgomp/allocator.c                       | 246 ++++++++------
 libgomp/basic-allocator.c                 | 380 ++++++++++++++++++++++
 libgomp/config/nvptx/allocator.c          | 120 +++++++
 libgomp/config/nvptx/team.c               |  18 +
 libgomp/libgomp.texi                      |   9 +-
 libgomp/plugin/plugin-nvptx.c             |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  56 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  64 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  42 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 196 +++++++++++
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  63 ++++
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 117 +++++++
 12 files changed, 1231 insertions(+), 103 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index b4e50e2ad72..fa398128368 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -37,6 +37,47 @@

 #define omp_max_predefined_alloc omp_thread_mem_alloc

+/* These macros may be overridden in config//allocator.c.
+   The following definitions (ab)use comma operators to avoid unused
+   variable errors.
*/ +#ifndef MEMSPACE_ALLOC +#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \ + malloc (((void)(MEMSPACE), (SIZE))) +#endif +#ifndef MEMSPACE_CALLOC +#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \ + calloc (1, (((void)(MEMSPACE), (SIZE)))) +#endif +#ifndef MEMSPACE_REALLOC +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \ + realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE)))) +#endif +#ifndef MEMSPACE_FREE +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \ + free (((void)(MEMSPACE), (void)(SIZE), (ADDR))) +#endif + +/* Map the predefined allocators to the correct memory space. + The index to this table is the omp_allocator_handle_t enum value. + When the user calls omp_alloc with a predefined allocator this + table determines what memory they get. */ +static const omp_memspace_handle_t predefined_alloc_mapping[] = { + omp_default_mem_space, /* omp_null_allocator doesn't actually use this. */ + omp_default_mem_space, /* omp_default_mem_alloc. */ + omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */ + omp_const_mem_space, /* omp_const_mem_alloc. */ + omp_high_bw_mem_space, /* omp_high_bw_mem_alloc. */ + omp_low_lat_mem_space, /* omp_low_lat_mem_alloc. */ + omp_low_lat_mem_space, /* omp_cgroup_mem_alloc (implementation defined). */ + omp_low_lat_mem_space, /* omp_pteam_mem_alloc (implementation defined). */ + omp_low_lat_mem_space, /* omp_thread_mem_alloc (implementation defined). */ +}; + +#define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0])) +_Static_assert (ARRAY_SIZE (predefined_alloc_mapping) + == omp_max_predefined_alloc + 1, + "predefined_alloc_mapping must match omp_memspace_handle_t"); + enum gomp_numa_memkind_kind { GOMP_MEMKIND_NONE = 0, @@ -533,7 +574,7 @@ retry: } else #endif - ptr = malloc (new_size); + ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size); if (ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -565,7 +606,13 @@ retry: } else #endif - ptr = malloc (new_size); + { + omp_memspace_handle_t memspace; + memspace = (allocator_data + ? 
allocator_data->memspace + : predefined_alloc_mapping[allocator]); + ptr = MEMSPACE_ALLOC (memspace, new_size); + } if (ptr == NULL) goto fail; } @@ -582,36 +629,26 @@ retry: ((struct omp_mem_header *) ret)[-1].allocator = allocator; return ret; -fail: - if (allocator_data) +fail:; + int fallback = (allocator_data + ? allocator_data->fallback + : allocator == omp_default_mem_alloc + ? omp_atv_null_fb + : omp_atv_default_mem_fb); + switch (fallback) { - switch (allocator_data->fallback) - { - case omp_atv_default_mem_fb: - if ((new_alignment > sizeof (void *) && new_alignment > alignment) -#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA) - || memkind -#endif - || (allocator_data - && allocator_data->pool_size < ~(uintptr_t) 0)) - { - allocator = omp_default_mem_alloc; - goto retry; - } - /* Otherwise, we've already performed default mem allocation - and if that failed, it won't succeed again (unless it was - intermittent. Return NULL then, as that is the fallback. */ - break; - case omp_atv_null_fb: - break; - default: - case omp_atv_abort_fb: - gomp_fatal ("Out of memory allocating %lu bytes", - (unsigned long) size); - case omp_atv_allocator_fb: - allocator = allocator_data->fb_data; - goto retry; - } + case omp_atv_default_mem_fb: + allocator = omp_default_mem_alloc; + goto retry; + case omp_atv_null_fb: + break; + default: + case omp_atv_abort_fb: + gomp_fatal ("Out of memory allocating %lu bytes", + (unsigned long) size); + case omp_atv_allocator_fb: + allocator = allocator_data->fb_data; + goto retry; } return NULL; } @@ -644,6 +681,7 @@ void omp_free (void *ptr, omp_allocator_handle_t allocator) { struct omp_mem_header *data; + omp_memspace_handle_t memspace = omp_default_mem_space; if (ptr == NULL) return; @@ -683,10 +721,12 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) return; } #endif + + memspace = allocator_data->memspace; } -#ifdef LIBGOMP_USE_MEMKIND else { +#ifdef LIBGOMP_USE_MEMKIND enum gomp_numa_memkind_kind memkind = 
GOMP_MEMKIND_NONE; if (data->allocator == omp_high_bw_mem_alloc) memkind = GOMP_MEMKIND_HBW_PREFERRED; @@ -702,9 +742,12 @@ omp_free (void *ptr, omp_allocator_handle_t allocator) return; } } - } #endif - free (data->ptr); + + memspace = predefined_alloc_mapping[data->allocator]; + } + + MEMSPACE_FREE (memspace, data->ptr, data->size); } ialias (omp_free) @@ -831,7 +874,7 @@ retry: } else #endif - ptr = calloc (1, new_size); + ptr = MEMSPACE_CALLOC (allocator_data->memspace, new_size); if (ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -865,7 +908,13 @@ retry: } else #endif - ptr = calloc (1, new_size); + { + omp_memspace_handle_t memspace; + memspace = (allocator_data + ? allocator_data->memspace + : predefined_alloc_mapping[allocator]); + ptr = MEMSPACE_CALLOC (memspace, new_size); + } if (ptr == NULL) goto fail; } @@ -882,36 +931,26 @@ retry: ((struct omp_mem_header *) ret)[-1].allocator = allocator; return ret; -fail: - if (allocator_data) +fail:; + int fallback = (allocator_data + ? allocator_data->fallback + : allocator == omp_default_mem_alloc + ? omp_atv_null_fb + : omp_atv_default_mem_fb); + switch (fallback) { - switch (allocator_data->fallback) - { - case omp_atv_default_mem_fb: - if ((new_alignment > sizeof (void *) && new_alignment > alignment) -#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA) - || memkind -#endif - || (allocator_data - && allocator_data->pool_size < ~(uintptr_t) 0)) - { - allocator = omp_default_mem_alloc; - goto retry; - } - /* Otherwise, we've already performed default mem allocation - and if that failed, it won't succeed again (unless it was - intermittent. Return NULL then, as that is the fallback. 
*/ - break; - case omp_atv_null_fb: - break; - default: - case omp_atv_abort_fb: - gomp_fatal ("Out of memory allocating %lu bytes", - (unsigned long) (size * nmemb)); - case omp_atv_allocator_fb: - allocator = allocator_data->fb_data; - goto retry; - } + case omp_atv_default_mem_fb: + allocator = omp_default_mem_alloc; + goto retry; + case omp_atv_null_fb: + break; + default: + case omp_atv_abort_fb: + gomp_fatal ("Out of memory allocating %lu bytes", + (unsigned long) (size * nmemb)); + case omp_atv_allocator_fb: + allocator = allocator_data->fb_data; + goto retry; } return NULL; } @@ -1101,9 +1140,10 @@ retry: else #endif if (prev_size) - new_ptr = realloc (data->ptr, new_size); + new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr, + data->size, new_size); else - new_ptr = malloc (new_size); + new_ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size); if (new_ptr == NULL) { #ifdef HAVE_SYNC_BUILTINS @@ -1151,7 +1191,13 @@ retry: } else #endif - new_ptr = realloc (data->ptr, new_size); + { + omp_memspace_handle_t memspace; + memspace = (allocator_data + ? allocator_data->memspace + : predefined_alloc_mapping[allocator]); + new_ptr = MEMSPACE_REALLOC (memspace, data->ptr, data->size, new_size); + } if (new_ptr == NULL) goto fail; ret = (char *) new_ptr + sizeof (struct omp_mem_header); @@ -1178,7 +1224,13 @@ retry: } else #endif - new_ptr = malloc (new_size); + { + omp_memspace_handle_t memspace; + memspace = (allocator_data + ? allocator_data->memspace + : predefined_alloc_mapping[allocator]); + new_ptr = MEMSPACE_ALLOC (memspace, new_size); + } if (new_ptr == NULL) goto fail; } @@ -1227,39 +1279,35 @@ retry: return ret; } #endif - free (data->ptr); + { + omp_memspace_handle_t was_memspace; + was_memspace = (free_allocator_data + ? 
free_allocator_data->memspace + : predefined_alloc_mapping[free_allocator]); + MEMSPACE_FREE (was_memspace, data->ptr, data->size); + } return ret; -fail: - if (allocator_data) +fail:; + int fallback = (allocator_data + ? allocator_data->fallback + : allocator == omp_default_mem_alloc + ? omp_atv_null_fb + : omp_atv_default_mem_fb); + switch (fallback) { - switch (allocator_data->fallback) - { - case omp_atv_default_mem_fb: - if (new_alignment > sizeof (void *) -#if defined(LIBGOMP_USE_MEMKIND) || defined(LIBGOMP_USE_LIBNUMA) - || memkind -#endif - || (allocator_data - && allocator_data->pool_size < ~(uintptr_t) 0)) - { - allocator = omp_default_mem_alloc; - goto retry; - } - /* Otherwise, we've already performed default mem allocation - and if that failed, it won't succeed again (unless it was - intermittent. Return NULL then, as that is the fallback. */ - break; - case omp_atv_null_fb: - break; - default: - case omp_atv_abort_fb: - gomp_fatal ("Out of memory allocating %lu bytes", - (unsigned long) size); - case omp_atv_allocator_fb: - allocator = allocator_data->fb_data; - goto retry; - } + case omp_atv_default_mem_fb: + allocator = omp_default_mem_alloc; + goto retry; + case omp_atv_null_fb: + break; + default: + case omp_atv_abort_fb: + gomp_fatal ("Out of memory allocating %lu bytes", + (unsigned long) size); + case omp_atv_allocator_fb: + allocator = allocator_data->fb_data; + goto retry; } return NULL; } diff --git a/libgomp/basic-allocator.c b/libgomp/basic-allocator.c new file mode 100644 index 00000000000..35c7439bed6 --- /dev/null +++ b/libgomp/basic-allocator.c @@ -0,0 +1,380 @@ +/* Copyright (C) 2023 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. 
+ + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* This is a basic "malloc" implementation intended for use with small, + low-latency memories. + + To use this template, define BASIC_ALLOC_PREFIX, and then #include the + source file. The other configuration macros are optional. + + The root heap descriptor is stored in the first bytes of the heap, and each + free chunk contains a similar descriptor for the next free chunk in the + chain. + + The descriptor is two values: offset and size, which describe the + location of a chunk of memory available for allocation. The offset is + relative to the base of the heap. The special offset value 0xffffffff + indicates that the heap (free chain) is locked. The offset and size are + 32-bit values so the base alignment can be 8-bytes. + + Memory is allocated to the first free chunk that fits. The free chain + is always stored in order of the offset to assist coalescing adjacent + chunks. */ + +#include "libgomp.h" + +#ifndef BASIC_ALLOC_PREFIX +#error "BASIC_ALLOC_PREFIX not defined." +#endif + +#ifndef BASIC_ALLOC_YIELD +#define BASIC_ALLOC_YIELD +#endif + +#define ALIGN(VAR) (((VAR) + 7) & ~7) /* 8-byte granularity. 
*/ + +#define fn1(prefix, name) prefix ## _ ## name +#define fn(prefix, name) fn1 (prefix, name) +#define basic_alloc_init fn(BASIC_ALLOC_PREFIX,init) +#define basic_alloc_alloc fn(BASIC_ALLOC_PREFIX,alloc) +#define basic_alloc_calloc fn(BASIC_ALLOC_PREFIX,calloc) +#define basic_alloc_free fn(BASIC_ALLOC_PREFIX,free) +#define basic_alloc_realloc fn(BASIC_ALLOC_PREFIX,realloc) + +typedef struct { + uint32_t offset; + uint32_t size; +} heapdesc; + +void +basic_alloc_init (char *heap, size_t limit) +{ + if (heap == NULL) + return; + + /* Initialize the head of the free chain. */ + heapdesc *root = (heapdesc *) heap; + root->offset = ALIGN(1); + root->size = limit - root->offset; + + /* And terminate the chain. */ + heapdesc *next = (heapdesc *) (heap + root->offset); + next->offset = 0; + next->size = 0; +} + +static void * +basic_alloc_alloc (char *heap, size_t size) +{ + if (heap == NULL) + return NULL; + + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc *) heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + BASIC_ALLOC_YIELD; + } + while (1); + + /* Walk the free chain. */ + heapdesc chunk = root; + heapdesc *prev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc *) (heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && (uint32_t) size > chunk.size) + { + chunk = onward_chain; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc *) (heap + chunk.offset); + onward_chain = *chunkptr; + } + + void *result = NULL; + if (chunk.size != 0) + { + /* Allocation successful. */ + result = chunkptr; + + /* Update the free chain. 
*/ + heapdesc stillfree = chunk; + stillfree.offset += size; + stillfree.size -= size; + heapdesc *stillfreeptr = (heapdesc *) (heap + stillfree.offset); + + if (stillfree.size == 0) + /* The whole chunk was used. */ + stillfree = onward_chain; + else + /* The chunk was split, so restore the onward chain. */ + *stillfreeptr = onward_chain; + + /* The previous free slot or root now points to stillfree. */ + if (prev_chunkptr) + *prev_chunkptr = stillfree; + else + root = stillfree; + } + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + + return result; +} + +static void * +basic_alloc_calloc (char *heap, size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + uint64_t *result = basic_alloc_alloc (heap, size); + if (result) + /* Inline memset in which we know size is a multiple of 8. */ + for (unsigned i = 0; i < (unsigned) size / 8; i++) + result[i] = 0; + + return result; +} + +static void +basic_alloc_free (char *heap, void *addr, size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + size = ALIGN (size); + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc *) heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. */ + } + while (1); + + /* Walk the free chain to find where to insert a new entry. 
*/ + heapdesc chunk = root, prev_chunk = {0}; + heapdesc *prev_chunkptr = NULL, *prevprev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc *) (heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && addr > (void *) chunkptr) + { + prev_chunk = chunk; + chunk = onward_chain; + prevprev_chunkptr = prev_chunkptr; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc *) (heap + chunk.offset); + onward_chain = *chunkptr; + } + + /* Create the new chunk descriptor. */ + heapdesc newfreechunk; + newfreechunk.offset = (uint32_t) ((uintptr_t) addr - (uintptr_t) heap); + newfreechunk.size = (uint32_t) size; + + /* Coalesce adjacent free chunks. */ + if (newfreechunk.offset + size == chunk.offset) + { + /* Free chunk follows. */ + newfreechunk.size += chunk.size; + chunk = onward_chain; + } + if (prev_chunkptr) + { + if (prev_chunk.offset + prev_chunk.size + == newfreechunk.offset) + { + /* Free chunk precedes. */ + newfreechunk.offset = prev_chunk.offset; + newfreechunk.size += prev_chunk.size; + addr = heap + prev_chunk.offset; + prev_chunkptr = prevprev_chunkptr; + } + } + + /* Update the free chain in the new and previous chunks. */ + *(heapdesc *) addr = chunk; + if (prev_chunkptr) + *prev_chunkptr = newfreechunk; + else + root = newfreechunk; + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + +} + +static void * +basic_alloc_realloc (char *heap, void *addr, size_t oldsize, + size_t size) +{ + /* Memory is allocated in N-byte granularity. */ + oldsize = ALIGN (oldsize); + size = ALIGN (size); + + if (oldsize == size) + return addr; + + /* Acquire a lock on the low-latency heap. */ + heapdesc root, *root_ptr = (heapdesc *) heap; + do + { + root.offset = __atomic_exchange_n (&root_ptr->offset, 0xffffffff, + MEMMODEL_ACQUIRE); + if (root.offset != 0xffffffff) + { + root.size = root_ptr->size; + break; + } + /* Spin. 
*/ + } + while (1); + + /* Walk the free chain. */ + heapdesc chunk = root; + heapdesc *prev_chunkptr = NULL; + heapdesc *chunkptr = (heapdesc *) (heap + chunk.offset); + heapdesc onward_chain = *chunkptr; + while (chunk.size != 0 && (void *) chunkptr < addr) + { + chunk = onward_chain; + prev_chunkptr = chunkptr; + chunkptr = (heapdesc *) (heap + chunk.offset); + onward_chain = *chunkptr; + } + + void *result = NULL; + if (size < oldsize) + { + /* The new allocation is smaller than the old; we can always + shrink an allocation in place. */ + result = addr; + + heapdesc *nowfreeptr = (heapdesc *) (addr + size); + + /* Update the free chain. */ + heapdesc nowfree; + nowfree.offset = (char *) nowfreeptr - heap; + nowfree.size = oldsize - size; + + if (nowfree.offset + size == chunk.offset) + { + /* Coalesce following free chunk. */ + nowfree.size += chunk.size; + *nowfreeptr = onward_chain; + } + else + *nowfreeptr = chunk; + + /* The previous free slot or root now points to nowfree. */ + if (prev_chunkptr) + *prev_chunkptr = nowfree; + else + root = nowfree; + } + else if (chunk.size != 0 + && (char *) addr + oldsize == (char *) chunkptr + && chunk.size >= size-oldsize) + { + /* The new allocation is larger than the old, and we found a + large enough free block right after the existing block, + so we extend into that space. */ + result = addr; + + uint32_t delta = size-oldsize; + + /* Update the free chain. */ + heapdesc stillfree = chunk; + stillfree.offset += delta; + stillfree.size -= delta; + heapdesc *stillfreeptr = (heapdesc *) (heap + stillfree.offset); + + if (stillfree.size == 0) + /* The whole chunk was used. */ + stillfree = onward_chain; + else + /* The chunk was split, so restore the onward chain. */ + *stillfreeptr = onward_chain; + + /* The previous free slot or root now points to stillfree. */ + if (prev_chunkptr) + *prev_chunkptr = stillfree; + else + root = stillfree; + } + /* Else realloc in-place has failed and result remains NULL. 
*/ + + /* Update the free chain root and release the lock. */ + root_ptr->size = root.size; + __atomic_store_n (&root_ptr->offset, root.offset, MEMMODEL_RELEASE); + + if (result == NULL) + { + /* The allocation could not be extended in place, so we simply + allocate fresh memory and move the data. If we can't allocate + from low-latency memory then we leave the original allocation + intact and return NULL. + We could do a fall-back to main memory, but we don't know what + the fall-back trait said to do. */ + result = basic_alloc_alloc (heap, size); + if (result != NULL) + { + /* Inline memcpy in which we know oldsize is a multiple of 8. */ + uint64_t *from = addr, *to = result; + for (unsigned i = 0; i < (unsigned) oldsize / 8; i++) + to[i] = from[i]; + + basic_alloc_free (heap, addr, oldsize); + } + } + + return result; +} + +#undef ALIGN +#undef fn1 +#undef fn +#undef basic_alloc_init +#undef basic_alloc_alloc +#undef basic_alloc_free +#undef basic_alloc_realloc diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c new file mode 100644 index 00000000000..6014fba177f --- /dev/null +++ b/libgomp/config/nvptx/allocator.c @@ -0,0 +1,120 @@ +/* Copyright (C) 2023 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. 
+ + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Library Exception along with this program; + see the files COPYING3 and COPYING.RUNTIME respectively. If not, see + . */ + +/* The low-latency allocators use space reserved in .shared memory when the + kernel is launched. The heap is initialized in gomp_nvptx_main and all + allocations are forgotten when the kernel exits. Allocations to other + memory spaces all use the system malloc syscall. + + The root heap descriptor is stored elsewhere in shared memory, and each + free chunk contains a similar descriptor for the next free chunk in the + chain. + + The descriptor is two 16-bit values: offset and size, which describe the + location of a chunk of memory available for allocation. The offset is + relative to the base of the heap. The special value 0xffff, 0xffff + indicates that the heap is locked. The descriptor is encoded into a + single 32-bit integer so that it may be easily accessed atomically. + + Memory is allocated to the first free chunk that fits. The free chain + is always stored in order of the offset to assist coalescing adjacent + chunks. */ + +#include "libgomp.h" +#include + +#define BASIC_ALLOC_PREFIX __nvptx_lowlat +#include "../../basic-allocator.c" + +/* There should be some .shared space reserved for us. There's no way to + express this magic extern sizeless array in C so use asm. 
*/ +asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n"); + +static void * +nvptx_memspace_alloc (omp_memspace_handle_t memspace, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool)); + + return __nvptx_lowlat_alloc (shared_pool, size); + } + else + return malloc (size); +} + +static void * +nvptx_memspace_calloc (omp_memspace_handle_t memspace, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool)); + + return __nvptx_lowlat_calloc (shared_pool, size); + } + else + return calloc (1, size); +} + +static void +nvptx_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool)); + + __nvptx_lowlat_free (shared_pool, addr, size); + } + else + free (addr); +} + +static void * +nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, + size_t oldsize, size_t size) +{ + if (memspace == omp_low_lat_mem_space) + { + char *shared_pool; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r" (shared_pool)); + + return __nvptx_lowlat_realloc (shared_pool, addr, oldsize, size); + } + else + return realloc (addr, size); +} + +#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \ + nvptx_memspace_alloc (MEMSPACE, SIZE) +#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \ + nvptx_memspace_calloc (MEMSPACE, SIZE) +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \ + nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE) +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \ + nvptx_memspace_free (MEMSPACE, ADDR, SIZE) + +#include "../../allocator.c" diff --git a/libgomp/config/nvptx/team.c b/libgomp/config/nvptx/team.c index 59521fabd99..9243774e41a 100644 --- a/libgomp/config/nvptx/team.c +++ b/libgomp/config/nvptx/team.c @@ -37,6 +37,12 @@ 
int __gomp_team_num __attribute__((shared,nocommon)); static void gomp_thread_start (struct gomp_thread_pool *); extern void build_indirect_map (void); +/* There should be some .shared space reserved for us. There's no way to + express this magic extern sizeless array in C so use asm. */ +asm (".extern .shared .u8 __nvptx_lowlat_pool[];\n"); + +/* Defined in basic-allocator.c via config/nvptx/allocator.c. */ +void __nvptx_lowlat_init (void *heap, size_t size); /* This externally visible function handles target region entry. It sets up a per-team thread pool and transfers control by calling FN (FN_DATA) @@ -68,6 +74,18 @@ gomp_nvptx_main (void (*fn) (void *), void *fn_data) nvptx_thrs = alloca (ntids * sizeof (*nvptx_thrs)); memset (nvptx_thrs, 0, ntids * sizeof (*nvptx_thrs)); + /* Find the low-latency heap details .... */ + uint32_t *shared_pool; + uint32_t shared_pool_size = 0; + asm ("cvta.shared.u64\t%0, __nvptx_lowlat_pool;" : "=r"(shared_pool)); +#if __PTX_ISA_VERSION_MAJOR__ > 4 \ + || (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MINOR__ >= 1) + asm ("mov.u32\t%0, %%dynamic_smem_size;\n" + : "=r"(shared_pool_size)); +#endif + __nvptx_lowlat_init (shared_pool, shared_pool_size); + + /* Initialize the thread pool. */ struct gomp_thread_pool *pool = alloca (sizeof (*pool)); pool->threads = alloca (ntids * sizeof (*pool->threads)); for (tid = 0; tid < ntids; tid++) diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index e5fe7af76af..39d0749e7b3 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3012,11 +3012,14 @@ value. 
@item omp_const_mem_alloc @tab omp_const_mem_space @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space -@item omp_cgroup_mem_alloc @tab -- -@item omp_pteam_mem_alloc @tab -- -@item omp_thread_mem_alloc @tab -- +@item omp_cgroup_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_pteam_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_thread_mem_alloc @tab omp_low_lat_mem_space (implementation defined) @end multitable +The @code{omp_low_lat_mem_space} is only available on supported devices. +See @ref{Offload-Target Specifics}. + The predefined allocators use the default values for the traits, as listed below. Except that the last three allocators have the @code{access} trait set to @code{cgroup}, @code{pteam}, and diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index 0548e7e09e5..d4a254ed4f0 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -341,6 +341,11 @@ struct ptx_device static struct ptx_device **ptx_devices; +/* OpenMP kernels reserve a small amount of ".shared" space for use by + omp_alloc. The size is configured using GOMP_NVPTX_LOWLAT_POOL, but the + default is set here. 
*/ +static unsigned lowlat_pool_size = 8 * 1024; + static inline struct nvptx_thread * nvptx_thread (void) { @@ -1219,6 +1224,22 @@ GOMP_OFFLOAD_init_device (int n) instantiated_devices++; } + const char *var_name = "GOMP_NVPTX_LOWLAT_POOL"; + const char *env_var = secure_getenv (var_name); + notify_var (var_name, env_var); + + if (env_var != NULL) + { + char *endptr; + unsigned long val = strtoul (env_var, &endptr, 10); + if (endptr == NULL || *endptr != '\0' + || errno == ERANGE || errno == EINVAL + || val > UINT_MAX) + GOMP_PLUGIN_error ("Error parsing %s", var_name); + else + lowlat_pool_size = val; + } + pthread_mutex_unlock (&ptx_dev_lock); return dev != NULL; @@ -2178,7 +2199,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args) " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n", __FUNCTION__, fn_name, teams, threads); r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1, - 32, threads, 1, 0, NULL, NULL, config); + 32, threads, 1, lowlat_pool_size, NULL, NULL, config); if (r != CUDA_SUCCESS) GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r)); if (reverse_offload) diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-1.c b/libgomp/testsuite/libgomp.c/omp_alloc-1.c new file mode 100644 index 00000000000..f4e594f1e98 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/omp_alloc-1.c @@ -0,0 +1,56 @@ +/* { dg-do run } */ + +/* Test that omp_alloc returns usable memory. 
*/ + +#include + +#pragma omp requires dynamic_allocators + +void +test (int n, omp_allocator_handle_t allocator) +{ + #pragma omp target map(to:n) map(to:allocator) + { + int *a; + a = (int *) omp_alloc (n * sizeof (int), allocator); + + #pragma omp parallel + for (int i = 0; i < n; i++) + a[i] = i; + + for (int i = 0; i < n; i++) + if (a[i] != i) + { + __builtin_printf ("data mismatch at %i\n", i); + __builtin_abort (); + } + + omp_free (a, allocator); + } +} + +int +main () +{ + // Smaller than low-latency memory limit + test (10, omp_default_mem_alloc); + test (10, omp_large_cap_mem_alloc); + test (10, omp_const_mem_alloc); + test (10, omp_high_bw_mem_alloc); + test (10, omp_low_lat_mem_alloc); + test (10, omp_cgroup_mem_alloc); + test (10, omp_pteam_mem_alloc); + test (10, omp_thread_mem_alloc); + + // Larger than low-latency memory limit + test (100000, omp_default_mem_alloc); + test (100000, omp_large_cap_mem_alloc); + test (100000, omp_const_mem_alloc); + test (100000, omp_high_bw_mem_alloc); + test (100000, omp_low_lat_mem_alloc); + test (100000, omp_cgroup_mem_alloc); + test (100000, omp_pteam_mem_alloc); + test (100000, omp_thread_mem_alloc); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-2.c b/libgomp/testsuite/libgomp.c/omp_alloc-2.c new file mode 100644 index 00000000000..e9fd1602946 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/omp_alloc-2.c @@ -0,0 +1,64 @@ +/* { dg-do run } */ + +/* Test concurrent and repeated allocations. */ + +#include + +#pragma omp requires dynamic_allocators + +void +test (int n, omp_allocator_handle_t allocator) +{ + #pragma omp target map(to:n) map(to:allocator) + { + int **a; + a = (int **) omp_alloc (n * sizeof (int *), allocator); + + #pragma omp parallel for + for (int i = 0; i < n; i++) + { + /*Use 10x to ensure we do activate low-latency fall-back. 
*/ + a[i] = omp_alloc (sizeof (int) * 10, allocator); + a[i][0] = i; + } + + for (int i = 0; i < n; i++) + if (a[i][0] != i) + { + __builtin_printf ("data mismatch at %i\n", i); + __builtin_abort (); + } + + #pragma omp parallel for + for (int i = 0; i < n; i++) + omp_free (a[i], allocator); + + omp_free (a, allocator); + } +} + +int +main () +{ + // Smaller than low-latency memory limit + test (10, omp_default_mem_alloc); + test (10, omp_large_cap_mem_alloc); + test (10, omp_const_mem_alloc); + test (10, omp_high_bw_mem_alloc); + test (10, omp_low_lat_mem_alloc); + test (10, omp_cgroup_mem_alloc); + test (10, omp_pteam_mem_alloc); + test (10, omp_thread_mem_alloc); + + // Larger than low-latency memory limit (on aggregate) + test (1000, omp_default_mem_alloc); + test (1000, omp_large_cap_mem_alloc); + test (1000, omp_const_mem_alloc); + test (1000, omp_high_bw_mem_alloc); + test (1000, omp_low_lat_mem_alloc); + test (1000, omp_cgroup_mem_alloc); + test (1000, omp_pteam_mem_alloc); + test (1000, omp_thread_mem_alloc); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-3.c b/libgomp/testsuite/libgomp.c/omp_alloc-3.c new file mode 100644 index 00000000000..792e2200f30 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/omp_alloc-3.c @@ -0,0 +1,42 @@ +/* { dg-do run } */ + +/* Stress-test omp_alloc/omp_free under concurrency.
*/ + +#include +#include +#include + +#pragma omp requires dynamic_allocators + +#define N 1000 + +void +test (omp_allocator_handle_t allocator) +{ + #pragma omp target map(to:allocator) + { + #pragma omp parallel for + for (int i = 0; i < N; i++) + for (int j = 0; j < N; j++) + { + int *p = omp_alloc (sizeof (int), allocator); + omp_free (p, allocator); + } + } +} + +int +main () +{ + // Smaller than low-latency memory limit + test (omp_default_mem_alloc); + test (omp_large_cap_mem_alloc); + test (omp_const_mem_alloc); + test (omp_high_bw_mem_alloc); + test (omp_low_lat_mem_alloc); + test (omp_cgroup_mem_alloc); + test (omp_pteam_mem_alloc); + test (omp_thread_mem_alloc); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-4.c b/libgomp/testsuite/libgomp.c/omp_alloc-4.c new file mode 100644 index 00000000000..66e13c09234 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/omp_alloc-4.c @@ -0,0 +1,196 @@ +/* { dg-do run } */ + +/* Test that low-latency free chains are sound. */ + +#include +#include + +#pragma omp requires dynamic_allocators + +void +check (int cond, const char *msg) +{ + if (!cond) + { + __builtin_printf ("%s\n", msg); + __builtin_abort (); + } +} + +int +main () +{ + #pragma omp target + { + /* Ensure that the memory we get *is* low-latency with a null-fallback. */ + omp_alloctrait_t traits[1] + = { { omp_atk_fallback, omp_atv_null_fb } }; + omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space, + 1, traits); + + int size = 4; + + char *a = omp_alloc (size, lowlat); + char *b = omp_alloc (size, lowlat); + char *c = omp_alloc (size, lowlat); + char *d = omp_alloc (size, lowlat); + + /* There are headers and padding to account for. */ + int size2 = size + (b-a); + int size3 = size + (c-a); + int size4 = size + (d-a) + 100; /* Random larger amount. 
*/ + + check (a != NULL && b != NULL && c != NULL && d != NULL, + "omp_alloc returned NULL\n"); + + omp_free (a, lowlat); + char *p = omp_alloc (size, lowlat); + check (p == a, "allocate did not reuse first chunk"); + + omp_free (b, lowlat); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not reuse second chunk"); + + omp_free (c, lowlat); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not reuse third chunk"); + + omp_free (a, lowlat); + omp_free (b, lowlat); + p = omp_alloc (size2, lowlat); + check (p == a, "allocate did not coalesce first two chunks"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == a, "allocate did not split first chunk (1)"); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split first chunk (2)"); + + omp_free (b, lowlat); + omp_free (c, lowlat); + p = omp_alloc (size2, lowlat); + check (p == b, "allocate did not coalesce middle two chunks"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split second chunk (1)"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split second chunk (2)"); + + omp_free (b, lowlat); + omp_free (a, lowlat); + p = omp_alloc (size2, lowlat); + check (p == a, "allocate did not coalesce first two chunks, reverse free"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == a, "allocate did not split first chunk (1), reverse free"); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split first chunk (2), reverse free"); + + omp_free (c, lowlat); + omp_free (b, lowlat); + p = omp_alloc (size2, lowlat); + check (p == b, "allocate did not coalesce second two chunks, reverse free"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split second chunk (1), reverse free"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split second chunk (2), reverse free"); + + omp_free (a, lowlat); + omp_free 
(b, lowlat); + omp_free (c, lowlat); + p = omp_alloc (size3, lowlat); + check (p == a, "allocate did not coalesce first three chunks"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == a, "allocate did not split first chunk (1)"); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split first chunk (2)"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split first chunk (3)"); + + omp_free (b, lowlat); + omp_free (c, lowlat); + omp_free (d, lowlat); + p = omp_alloc (size3, lowlat); + check (p == b, "allocate did not coalesce last three chunks"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split second chunk (1)"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split second chunk (2)"); + p = omp_alloc (size, lowlat); + check (p == d, "allocate did not split second chunk (3)"); + + omp_free (c, lowlat); + omp_free (b, lowlat); + omp_free (a, lowlat); + p = omp_alloc (size3, lowlat); + check (p == a, "allocate did not coalesce first three chunks, reverse free"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == a, "allocate did not split first chunk (1), reverse free"); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split first chunk (2), reverse free"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split first chunk (3), reverse free"); + + omp_free (d, lowlat); + omp_free (c, lowlat); + omp_free (b, lowlat); + p = omp_alloc (size3, lowlat); + check (p == b, "allocate did not coalesce second three chunks, reverse free"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split second chunk (1), reverse free"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split second chunk (2), reverse free"); + p = omp_alloc (size, lowlat); + check (p == d, "allocate did not split second chunk (3), reverse free"); + + omp_free (c, 
lowlat); + omp_free (a, lowlat); + omp_free (b, lowlat); + p = omp_alloc (size3, lowlat); + check (p == a, "allocate did not coalesce first three chunks, mixed free"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == a, "allocate did not split first chunk (1), mixed free"); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split first chunk (2), mixed free"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split first chunk (3), mixed free"); + + omp_free (d, lowlat); + omp_free (b, lowlat); + omp_free (c, lowlat); + p = omp_alloc (size3, lowlat); + check (p == b, "allocate did not coalesce second three chunks, mixed free"); + + omp_free (p, lowlat); + p = omp_alloc (size, lowlat); + check (p == b, "allocate did not split second chunk (1), mixed free"); + p = omp_alloc (size, lowlat); + check (p == c, "allocate did not split second chunk (2), mixed free"); + p = omp_alloc (size, lowlat); + check (p == d, "allocate did not split second chunk (3), mixed free"); + + omp_free (a, lowlat); + omp_free (b, lowlat); + omp_free (c, lowlat); + omp_free (d, lowlat); + p = omp_alloc (size4, lowlat); + check (p == a, "allocate did not coalesce all memory"); + } + + return 0; +} + diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-5.c b/libgomp/testsuite/libgomp.c/omp_alloc-5.c new file mode 100644 index 00000000000..10805ded6d0 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/omp_alloc-5.c @@ -0,0 +1,63 @@ +/* { dg-do run } */ + +/* Test calloc with omp_alloc. 
*/ + +#include + +#pragma omp requires dynamic_allocators + +void +test (int n, omp_allocator_handle_t allocator) +{ + #pragma omp target map(to:n) map(to:allocator) + { + int *a; + a = (int *) omp_calloc (n, sizeof (int), allocator); + + for (int i = 0; i < n; i++) + if (a[i] != 0) + { + __builtin_printf ("memory not zeroed at %i\n", i); + __builtin_abort (); + } + + #pragma omp parallel + for (int i = 0; i < n; i++) + a[i] = i; + + for (int i = 0; i < n; i++) + if (a[i] != i) + { + __builtin_printf ("data mismatch at %i\n", i); + __builtin_abort (); + } + + omp_free (a, allocator); + } +} + +int +main () +{ + // Smaller than low-latency memory limit + test (10, omp_default_mem_alloc); + test (10, omp_large_cap_mem_alloc); + test (10, omp_const_mem_alloc); + test (10, omp_high_bw_mem_alloc); + test (10, omp_low_lat_mem_alloc); + test (10, omp_cgroup_mem_alloc); + test (10, omp_pteam_mem_alloc); + test (10, omp_thread_mem_alloc); + + // Larger than low-latency memory limit + test (100000, omp_default_mem_alloc); + test (100000, omp_large_cap_mem_alloc); + test (100000, omp_const_mem_alloc); + test (100000, omp_high_bw_mem_alloc); + test (100000, omp_low_lat_mem_alloc); + test (100000, omp_cgroup_mem_alloc); + test (100000, omp_pteam_mem_alloc); + test (100000, omp_thread_mem_alloc); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-6.c b/libgomp/testsuite/libgomp.c/omp_alloc-6.c new file mode 100644 index 00000000000..66bf69b0455 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/omp_alloc-6.c @@ -0,0 +1,117 @@ +/* { dg-do run } */ + +/* Test that low-latency realloc and free chains are sound. */ + +#include +#include + +#pragma omp requires dynamic_allocators + +void +check (int cond, const char *msg) +{ + if (!cond) + { + __builtin_printf ("%s\n", msg); + __builtin_abort (); + } +} + +int +main () +{ + #pragma omp target + { + /* Ensure that the memory we get *is* low-latency with a null-fallback. 
*/ + omp_alloctrait_t traits[1] + = { { omp_atk_fallback, omp_atv_null_fb } }; + omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space, + 1, traits); + + int size = 16; + + char *a = (char *) omp_alloc (size, lowlat); + char *b = (char *) omp_alloc (size, lowlat); + char *c = (char *) omp_alloc (size, lowlat); + char *d = (char *) omp_alloc (size, lowlat); + + /* There are headers and padding to account for. */ + int size2 = size + (b-a); + int size3 = size + (c-a); + int size4 = size + (d-a) + 100; /* Random larger amount. */ + + check (a != NULL && b != NULL && c != NULL && d != NULL, + "omp_alloc returned NULL\n"); + + char *p = omp_realloc (b, size, lowlat, lowlat); + check (p == b, "realloc did not reuse same size chunk, no space after"); + + p = omp_realloc (b, size-8, lowlat, lowlat); + check (p == b, "realloc did not reuse smaller chunk, no space after"); + + p = omp_realloc (b, size, lowlat, lowlat); + check (p == b, "realloc did not reuse original size chunk, no space after"); + + /* Make space after b. 
*/ + omp_free (c, lowlat); + + p = omp_realloc (b, size, lowlat, lowlat); + check (p == b, "realloc did not reuse same size chunk"); + + p = omp_realloc (b, size-8, lowlat, lowlat); + check (p == b, "realloc did not reuse smaller chunk"); + + p = omp_realloc (b, size, lowlat, lowlat); + check (p == b, "realloc did not reuse original size chunk"); + + p = omp_realloc (b, size+8, lowlat, lowlat); + check (p == b, "realloc did not extend in place by a little"); + + p = omp_realloc (b, size2, lowlat, lowlat); + check (p == b, "realloc did not extend into whole next chunk"); + + p = omp_realloc (b, size3, lowlat, lowlat); + check (p != b, "realloc did not move b elsewhere"); + omp_free (p, lowlat); + + + p = omp_realloc (a, size, lowlat, lowlat); + check (p == a, "realloc did not reuse same size chunk, first position"); + + p = omp_realloc (a, size-8, lowlat, lowlat); + check (p == a, "realloc did not reuse smaller chunk, first position"); + + p = omp_realloc (a, size, lowlat, lowlat); + check (p == a, "realloc did not reuse original size chunk, first position"); + + p = omp_realloc (a, size+8, lowlat, lowlat); + check (p == a, "realloc did not extend in place by a little, first position"); + + p = omp_realloc (a, size3, lowlat, lowlat); + check (p == a, "realloc did not extend into whole next chunk, first position"); + + p = omp_realloc (a, size4, lowlat, lowlat); + check (p != a, "realloc did not move a elsewhere, first position"); + omp_free (p, lowlat); + + + p = omp_realloc (d, size, lowlat, lowlat); + check (p == d, "realloc did not reuse same size chunk, last position"); + + p = omp_realloc (d, size-8, lowlat, lowlat); + check (p == d, "realloc did not reuse smaller chunk, last position"); + + p = omp_realloc (d, size, lowlat, lowlat); + check (p == d, "realloc did not reuse original size chunk, last position"); + + p = omp_realloc (d, size+8, lowlat, lowlat); + check (p == d, "realloc did not extend in place by a little, last position"); + + /* Larger than low
latency memory. */ + p = omp_realloc (d, 100000000, lowlat, lowlat); + check (p == NULL, "realloc did not fail on OOM"); + } + + return 0; +} + From patchwork Sun Dec 3 00:32:23 2023 X-Patchwork-Submitter: Andrew Stubbs X-Patchwork-Id: 81213 From: Andrew Stubbs To: Subject: [PATCH v3 2/3] openmp, nvptx: low-lat memory access traits Date: Sun, 3 Dec 2023 00:32:23 +0000 Message-ID: <20231203003224.1638841-3-ams@codesourcery.com> In-Reply-To: <20231203003224.1638841-1-ams@codesourcery.com> References: <20231203003224.1638841-1-ams@codesourcery.com> The NVPTX low latency memory is not accessible outside the team that allocates it, and therefore should be unavailable for allocators with the access trait "all". This change means that the omp_low_lat_mem_alloc predefined allocator no longer works (but omp_cgroup_mem_alloc still does). libgomp/ChangeLog: * allocator.c (MEMSPACE_VALIDATE): New macro.
(omp_init_allocator): Use MEMSPACE_VALIDATE. (omp_aligned_alloc): Use OMP_LOW_LAT_MEM_ALLOC_INVALID. (omp_aligned_calloc): Likewise. (omp_realloc): Likewise. * config/nvptx/allocator.c (nvptx_memspace_validate): New function. (MEMSPACE_VALIDATE): New macro. (OMP_LOW_LAT_MEM_ALLOC_INVALID): New define. * libgomp.texi: Document low-latency implementation details. * testsuite/libgomp.c/omp_alloc-1.c (main): Add gpu_lowlat. * testsuite/libgomp.c/omp_alloc-2.c (main): Add gpu_lowlat. * testsuite/libgomp.c/omp_alloc-3.c (main): Add gpu_lowlat. * testsuite/libgomp.c/omp_alloc-4.c (main): Add access trait. * testsuite/libgomp.c/omp_alloc-5.c (main): Add gpu_lowlat. * testsuite/libgomp.c/omp_alloc-6.c (main): Add access trait. * testsuite/libgomp.c/omp_alloc-traits.c: New test. --- libgomp/allocator.c | 20 ++++++ libgomp/config/nvptx/allocator.c | 21 ++++++ libgomp/libgomp.texi | 18 +++++ libgomp/testsuite/libgomp.c/omp_alloc-1.c | 10 +++ libgomp/testsuite/libgomp.c/omp_alloc-2.c | 8 +++ libgomp/testsuite/libgomp.c/omp_alloc-3.c | 7 ++ libgomp/testsuite/libgomp.c/omp_alloc-4.c | 7 +- libgomp/testsuite/libgomp.c/omp_alloc-5.c | 8 +++ libgomp/testsuite/libgomp.c/omp_alloc-6.c | 7 +- .../testsuite/libgomp.c/omp_alloc-traits.c | 66 +++++++++++++++++++ 10 files changed, 166 insertions(+), 6 deletions(-) create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index fa398128368..a8a80f8028d 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -56,6 +56,10 @@ #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \ free (((void)(MEMSPACE), (void)(SIZE), (ADDR))) #endif +#ifndef MEMSPACE_VALIDATE +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \ + (((void)(MEMSPACE), (void)(ACCESS), 1)) +#endif /* Map the predefined allocators to the correct memory space. The index to this table is the omp_allocator_handle_t enum value.
@@ -439,6 +443,10 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits, if (data.pinned) return omp_null_allocator; + /* Reject unsupported memory spaces. */ + if (!MEMSPACE_VALIDATE (data.memspace, data.access)) + return omp_null_allocator; + ret = gomp_malloc (sizeof (struct omp_allocator_data)); *ret = data; #ifndef HAVE_SYNC_BUILTINS @@ -522,6 +530,10 @@ retry: new_size += new_alignment - sizeof (void *); if (__builtin_add_overflow (size, new_size, &new_size)) goto fail; +#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID + if (allocator == omp_low_lat_mem_alloc) + goto fail; +#endif if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) @@ -820,6 +832,10 @@ retry: goto fail; if (__builtin_add_overflow (size_temp, new_size, &new_size)) goto fail; +#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID + if (allocator == omp_low_lat_mem_alloc) + goto fail; +#endif if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) @@ -1054,6 +1070,10 @@ retry: if (__builtin_add_overflow (size, new_size, &new_size)) goto fail; old_size = data->size; +#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID + if (allocator == omp_low_lat_mem_alloc) + goto fail; +#endif if (__builtin_expect (allocator_data && allocator_data->pool_size < ~(uintptr_t) 0, 0)) diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c index 6014fba177f..a3302411bcb 100644 --- a/libgomp/config/nvptx/allocator.c +++ b/libgomp/config/nvptx/allocator.c @@ -108,6 +108,21 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, return realloc (addr, size); } +static inline int +nvptx_memspace_validate (omp_memspace_handle_t memspace, unsigned access) +{ +#if __PTX_ISA_VERSION_MAJOR__ > 4 \ + || (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MINOR__ >= 1) + /* Disallow use of low-latency memory when it must be accessible by + all threads.
*/ + return (memspace != omp_low_lat_mem_space + || access != omp_atv_all); +#else + /* Low-latency memory is not available before PTX 4.1. */ + return (memspace != omp_low_lat_mem_space); +#endif +} + #define MEMSPACE_ALLOC(MEMSPACE, SIZE) \ nvptx_memspace_alloc (MEMSPACE, SIZE) #define MEMSPACE_CALLOC(MEMSPACE, SIZE) \ @@ -116,5 +131,11 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr, nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE) #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \ nvptx_memspace_free (MEMSPACE, ADDR, SIZE) +#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \ + nvptx_memspace_validate (MEMSPACE, ACCESS) + +/* The default low-latency memspace implies omp_atv_all, which is incompatible + with the .shared memory space. */ +#define OMP_LOW_LAT_MEM_ALLOC_INVALID 1 #include "../../allocator.c" diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index 39d0749e7b3..7fdd6fe9410 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -5769,6 +5769,9 @@ Additional notes regarding the traits: @item The @code{sync_hint} trait has no effect. @end itemize +See also: +@ref{Offload-Target Specifics} + @c --------------------------------------------------------------------- @c Offload-Target Specifics @c --------------------------------------------------------------------- @@ -5902,6 +5905,21 @@ The implementation remark: directive for non-contiguous list items will use the 2D and 3D memory-copy functions of the CUDA library. Higher dimensions will call those functions in a loop and are therefore supported. +@item Low-latency memory (@code{omp_low_lat_mem_space}) is supported when + the @code{access} trait is set to @code{cgroup}, the ISA is at least + @code{sm_53}, and the PTX version is at least 4.1. The default pool size + is 8 kiB per team, but may be adjusted at runtime by setting environment + variable @code{GOMP_NVPTX_LOWLAT_POOL=@var{bytes}}.
The maximum value is + limited by the available hardware, and care should be taken that the + selected pool size does not unduly limit the number of teams that can + run simultaneously. +@item @code{omp_low_lat_mem_alloc} cannot be used with true low-latency memory + because the definition implies the @code{omp_atv_all} trait; main + graphics memory is used instead. +@item @code{omp_cgroup_mem_alloc}, @code{omp_pteam_mem_alloc}, and + @code{omp_thread_mem_alloc}, all use low-latency memory as first + preference, and fall back to main graphics memory when the low-latency + pool is exhausted. @end itemize diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-1.c b/libgomp/testsuite/libgomp.c/omp_alloc-1.c index f4e594f1e98..7f7f440c12c 100644 --- a/libgomp/testsuite/libgomp.c/omp_alloc-1.c +++ b/libgomp/testsuite/libgomp.c/omp_alloc-1.c @@ -32,12 +32,21 @@ test (int n, omp_allocator_handle_t allocator) int main () { + /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU. */ + omp_allocator_handle_t gpu_lowlat = 0; + #pragma omp target map(from:gpu_lowlat) + { + omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } }; + gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits); + } + // Smaller than low-latency memory limit test (10, omp_default_mem_alloc); test (10, omp_large_cap_mem_alloc); test (10, omp_const_mem_alloc); test (10, omp_high_bw_mem_alloc); test (10, omp_low_lat_mem_alloc); + test (10, gpu_lowlat); test (10, omp_cgroup_mem_alloc); test (10, omp_pteam_mem_alloc); test (10, omp_thread_mem_alloc); @@ -48,6 +57,7 @@ main () test (100000, omp_const_mem_alloc); test (100000, omp_high_bw_mem_alloc); test (100000, omp_low_lat_mem_alloc); + test (100000, gpu_lowlat); test (100000, omp_cgroup_mem_alloc); test (100000, omp_pteam_mem_alloc); test (100000, omp_thread_mem_alloc); diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-2.c b/libgomp/testsuite/libgomp.c/omp_alloc-2.c index e9fd1602946..54523f1061e 100644 --- 
a/libgomp/testsuite/libgomp.c/omp_alloc-2.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-2.c
@@ -40,12 +40,19 @@ test (int n, omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+  omp_allocator_handle_t gpu_lowlat;
+  #pragma omp target map(from:gpu_lowlat)
+  gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+
   // Smaller than low-latency memory limit
   test (10, omp_default_mem_alloc);
   test (10, omp_large_cap_mem_alloc);
   test (10, omp_const_mem_alloc);
   test (10, omp_high_bw_mem_alloc);
   test (10, omp_low_lat_mem_alloc);
+  test (10, gpu_lowlat);
   test (10, omp_cgroup_mem_alloc);
   test (10, omp_pteam_mem_alloc);
   test (10, omp_thread_mem_alloc);
@@ -56,6 +63,7 @@ main ()
   test (1000, omp_const_mem_alloc);
   test (1000, omp_high_bw_mem_alloc);
   test (1000, omp_low_lat_mem_alloc);
+  test (1000, gpu_lowlat);
   test (1000, omp_cgroup_mem_alloc);
   test (1000, omp_pteam_mem_alloc);
   test (1000, omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-3.c b/libgomp/testsuite/libgomp.c/omp_alloc-3.c
index 792e2200f30..682d149d379 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-3.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-3.c
@@ -28,12 +28,19 @@ test (omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+  omp_allocator_handle_t gpu_lowlat;
+  #pragma omp target map(from:gpu_lowlat)
+  gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+
   // Smaller than low-latency memory limit
   test (omp_default_mem_alloc);
   test (omp_large_cap_mem_alloc);
   test (omp_const_mem_alloc);
   test (omp_high_bw_mem_alloc);
   test (omp_low_lat_mem_alloc);
+  test (gpu_lowlat);
   test (omp_cgroup_mem_alloc);
   test (omp_pteam_mem_alloc);
   test (omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-4.c b/libgomp/testsuite/libgomp.c/omp_alloc-4.c
index 66e13c09234..dd8fcfbeeba 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-4.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-4.c
@@ -23,10 +23,11 @@ main ()
   #pragma omp target
   {
     /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
-    omp_alloctrait_t traits[1]
-      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_alloctrait_t traits[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+          { omp_atk_access, omp_atv_cgroup } };
     omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
-                                                        1, traits);
+                                                        2, traits);
 
     int size = 4;
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-5.c b/libgomp/testsuite/libgomp.c/omp_alloc-5.c
index 10805ded6d0..26bf38c1ca6 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-5.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-5.c
@@ -39,12 +39,19 @@ test (int n, omp_allocator_handle_t allocator)
 int
 main ()
 {
+  /* omp_low_lat_mem_alloc doesn't actually get low-latency memory on GPU.  */
+  omp_alloctrait_t traits[1] = { { omp_atk_access, omp_atv_cgroup } };
+  omp_allocator_handle_t gpu_lowlat;
+  #pragma omp target map(from:gpu_lowlat)
+  gpu_lowlat = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
+
   // Smaller than low-latency memory limit
   test (10, omp_default_mem_alloc);
   test (10, omp_large_cap_mem_alloc);
   test (10, omp_const_mem_alloc);
   test (10, omp_high_bw_mem_alloc);
   test (10, omp_low_lat_mem_alloc);
+  test (10, gpu_lowlat);
   test (10, omp_cgroup_mem_alloc);
   test (10, omp_pteam_mem_alloc);
   test (10, omp_thread_mem_alloc);
@@ -55,6 +62,7 @@ main ()
   test (100000, omp_const_mem_alloc);
   test (100000, omp_high_bw_mem_alloc);
   test (100000, omp_low_lat_mem_alloc);
+  test (100000, gpu_lowlat);
   test (100000, omp_cgroup_mem_alloc);
   test (100000, omp_pteam_mem_alloc);
   test (100000, omp_thread_mem_alloc);
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-6.c b/libgomp/testsuite/libgomp.c/omp_alloc-6.c
index 66bf69b0455..947a0ed23f8 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-6.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-6.c
@@ -23,10 +23,11 @@ main ()
   #pragma omp target
   {
     /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
-    omp_alloctrait_t traits[1]
-      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_alloctrait_t traits[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+          { omp_atk_access, omp_atv_cgroup } };
     omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
-                                                        1, traits);
+                                                        2, traits);
 
     int size = 16;
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-traits.c b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
new file mode 100644
index 00000000000..4ff0fca4986
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
@@ -0,0 +1,66 @@
+/* { dg-do run } */
+
+/* { dg-require-effective-target offload_device } */
+/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */
+
+/* Test that GPU low-latency allocation is limited to team access.
+   */
+
+#include <stddef.h>
+#include <omp.h>
+
+#pragma omp requires dynamic_allocators
+
+int
+main ()
+{
+  #pragma omp target
+  {
+    /* Ensure that the memory we get *is* low-latency with a null-fallback.  */
+    omp_alloctrait_t traits[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+          { omp_atk_access, omp_atv_cgroup } };
+    omp_allocator_handle_t lowlat = omp_init_allocator (omp_low_lat_mem_space,
+                                                        2, traits); // good
+
+    omp_alloctrait_t traits_all[2]
+      = { { omp_atk_fallback, omp_atv_null_fb },
+          { omp_atk_access, omp_atv_all } };
+    omp_allocator_handle_t lowlat_all
+      = omp_init_allocator (omp_low_lat_mem_space, 2, traits_all); // bad
+
+    omp_alloctrait_t traits_default[1]
+      = { { omp_atk_fallback, omp_atv_null_fb } };
+    omp_allocator_handle_t lowlat_default
+      = omp_init_allocator (omp_low_lat_mem_space, 1, traits_default); // bad
+
+    if (lowlat_all != omp_null_allocator
+        || lowlat_default != omp_null_allocator)
+      __builtin_abort ();
+
+    void *a = omp_alloc (1, lowlat); // good
+
+    if (!a)
+      __builtin_abort ();
+
+    omp_free (a, lowlat);
+
+
+    a = omp_calloc (1, 1, lowlat); // good
+
+    if (!a)
+      __builtin_abort ();
+
+    omp_free (a, lowlat);
+
+
+    a = omp_realloc (NULL, 1, lowlat, lowlat); // good
+
+    if (!a)
+      __builtin_abort ();
+
+    omp_free (a, lowlat);
+  }
+
+  return 0;
+}

From patchwork Sun Dec 3 00:32:24 2023
From: Andrew Stubbs
To: gcc-patches@gcc.gnu.org
Subject: [PATCH v3 3/3] amdgcn, libgomp: low-latency allocator
Date: Sun, 3 Dec 2023 00:32:24 +0000
Message-ID: <20231203003224.1638841-4-ams@codesourcery.com>
In-Reply-To: <20231203003224.1638841-1-ams@codesourcery.com>
References: <20231203003224.1638841-1-ams@codesourcery.com>

This implements the OpenMP low-latency memory allocator for AMD GCN using
the small per-team LDS memory (Local Data Store).

Since addresses can now refer to LDS space, the "Global" address space is
no longer compatible.  This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories).  A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.

gcc/ChangeLog:

	* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
	* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
	addressing.
	(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.

libgomp/ChangeLog:

	* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	(GCN_LOWLAT_HEAP): New.
	* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
	(__gcn_lowlat_init): New prototype.
	(gomp_gcn_enter_kernel): Initialize the low-latency heap.
	* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
	(TEAM_ARENA_FREE): Likewise.
	(TEAM_ARENA_END): Likewise.
	* plugin/plugin-gcn.c (lowlat_size): New variable.
	(print_kernel_dispatch): Label the group_segment_size purpose.
	(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
	(create_kernel_dispatch): Pass low-latency heap allocation to kernel.
	(run_kernel): Use shadow; don't assume values.
	* testsuite/libgomp.c/omp_alloc-traits.c: Enable for amdgcn.
	* config/gcn/allocator.c: New file.
	* libgomp.texi: Document low-latency implementation details.
---
 gcc/config/gcn/gcn-builtins.def            |   2 +
 gcc/config/gcn/gcn.cc                      |  16 ++-
 libgomp/config/gcn/allocator.c             | 127 ++++++++++++++++++
 libgomp/config/gcn/libgomp-gcn.h           |   6 +
 libgomp/config/gcn/team.c                  |  12 ++
 libgomp/libgomp.h                          |   3 -
 libgomp/libgomp.texi                       |  13 ++
 libgomp/plugin/plugin-gcn.c                |  35 ++++-
 .../testsuite/libgomp.c/omp_alloc-traits.c |   2 +-
 9 files changed, 205 insertions(+), 11 deletions(-)
 create mode 100644 libgomp/config/gcn/allocator.c

diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def
index 636a8e7a1a9..471457d7c23 100644
--- a/gcc/config/gcn/gcn-builtins.def
+++ b/gcc/config/gcn/gcn-builtins.def
@@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN,
 	     _A1 (GCN_BTI_BOOL), gcn_expand_builtin_1)
 DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
 	     gcn_expand_builtin_1)
+DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
+	     gcn_expand_builtin_1)
 DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN,
 	     _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1)
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 22d2b6ebf6d..d70238820dd 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -110,7 +110,8 @@ gcn_init_machine_status (void)
 
   f = ggc_cleared_alloc<machine_function> ();
 
-  if (TARGET_GCN3)
+  // FIXME: re-enable global addressing with safety for LDS-flat addresses
+  //if (TARGET_GCN3)
     f->use_flat_addressing = true;
 
   return f;
@@ -4881,6 +4882,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ ,
 	  }
 	return ptr;
       }
+    case GCN_BUILTIN_DISPATCH_PTR:
+      {
+	rtx ptr;
+	if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0)
+	  ptr = gen_rtx_REG (DImode,
+			     cfun->machine->args.reg[DISPATCH_PTR_ARG]);
+	else
+	  {
+	    ptr = gen_reg_rtx (DImode);
+	    emit_move_insn (ptr, const0_rtx);
+	  }
+	return ptr;
+      }
     case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P:
       {
 	/* Stash a marker in the unused upper 16 bits of s[0:1] to indicate
diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c
new file mode 100644
index 00000000000..e9a95d683f9
--- /dev/null
+++ b/libgomp/config/gcn/allocator.c
@@ -0,0 +1,127 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The low-latency allocators use space reserved in LDS memory when the
+   kernel is launched.  The heap is initialized in gomp_gcn_enter_kernel and
+   all allocations are forgotten when the kernel exits.  Allocations to other
+   memory spaces all use the system malloc syscall.
+
+   The pointers returned are 64-bit "Flat" addresses indistinguishable from
+   regular pointers, but only compatible with the "flat_load/store"
+   instructions.  The compiler has been coded to assign default address
+   spaces accordingly.
+
+   LDS memory is not visible to other teams, and therefore may only be used
+   when the memspace access trait is set accordingly.  */
+
+#include "libgomp.h"
+#include <stdlib.h>
+
+#define BASIC_ALLOC_PREFIX __gcn_lowlat
+#define BASIC_ALLOC_YIELD asm ("s_sleep 1" ::: "memory")
+#include "../../basic-allocator.c"
+
+/* The low-latency heap is located in LDS memory, but we need the __flat
+   address space for compatibility reasons.  */
+#define FLAT_HEAP_PTR \
+  ((void *) (uintptr_t) (void __flat *) (void __lds *) GCN_LOWLAT_HEAP)
+
+static void *
+gcn_memspace_alloc (omp_memspace_handle_t memspace, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      return __gcn_lowlat_alloc (shared_pool, size);
+    }
+  else
+    return malloc (size);
+}
+
+static void *
+gcn_memspace_calloc (omp_memspace_handle_t memspace, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      return __gcn_lowlat_calloc (shared_pool, size);
+    }
+  else
+    return calloc (1, size);
+}
+
+static void
+gcn_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      __gcn_lowlat_free (shared_pool, addr, size);
+    }
+  else
+    free (addr);
+}
+
+static void *
+gcn_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
+		      size_t oldsize, size_t size)
+{
+  if (memspace == omp_low_lat_mem_space)
+    {
+      char *shared_pool = FLAT_HEAP_PTR;
+
+      return __gcn_lowlat_realloc (shared_pool, addr, oldsize, size);
+    }
+  else
+    return realloc (addr, size);
+}
+
+static inline int
+gcn_memspace_validate (omp_memspace_handle_t memspace, unsigned access)
+{
+  /* Disallow use of low-latency memory when it must be accessible by
+     all threads.  */
+  return (memspace != omp_low_lat_mem_space
+	  || access != omp_atv_all);
+}
+
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  gcn_memspace_alloc (MEMSPACE, SIZE)
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  gcn_memspace_calloc (MEMSPACE, SIZE)
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  gcn_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  gcn_memspace_free (MEMSPACE, ADDR, SIZE)
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+  gcn_memspace_validate (MEMSPACE, ACCESS)
+
+/* The default low-latency memspace implies omp_atv_all, which is incompatible
+   with the LDS memory space.  */
+#define OMP_LOW_LAT_MEM_ALLOC_INVALID 1
+
+#include "../../allocator.c"
diff --git a/libgomp/config/gcn/libgomp-gcn.h b/libgomp/config/gcn/libgomp-gcn.h
index f62b7dde0e7..05b6fb60cc9 100644
--- a/libgomp/config/gcn/libgomp-gcn.h
+++ b/libgomp/config/gcn/libgomp-gcn.h
@@ -33,6 +33,12 @@
 #define DEFAULT_GCN_STACK_SIZE (32*1024)
 #define DEFAULT_TEAM_ARENA_SIZE (64*1024)
 
+/* These define the LDS location of data needed by OpenMP.  */
+#define TEAM_ARENA_START 16 /* LDS offset of free pointer.  */
+#define TEAM_ARENA_FREE 24  /* LDS offset of free pointer.  */
+#define TEAM_ARENA_END 32   /* LDS offset of end pointer.  */
+#define GCN_LOWLAT_HEAP 40  /* LDS offset of the OpenMP low-latency heap.  */
+
 struct heap
 {
   int64_t size;
diff --git a/libgomp/config/gcn/team.c b/libgomp/config/gcn/team.c
index fb20cbbcf9f..7ee6115b666 100644
--- a/libgomp/config/gcn/team.c
+++ b/libgomp/config/gcn/team.c
@@ -29,6 +29,12 @@
 #include <stdlib.h>
 #include <string.h>
 
+#define LITTLEENDIAN_CPU
+#include "hsa.h"
+
+/* Defined in basic-allocator.c via config/amdgcn/allocator.c.
+   */
+void __gcn_lowlat_init (void *heap, size_t size);
+
 static void gomp_thread_start (struct gomp_thread_pool *);
 extern void build_indirect_map (void);
 
@@ -75,6 +81,12 @@ gomp_gcn_enter_kernel (void)
   *arena_free = team_arena;
   *arena_end = team_arena + kernargs->arena_size_per_team;
 
+  /* Initialize the low-latency heap.  The header is the size.  */
+  void __lds *lowlat = (void __lds *)GCN_LOWLAT_HEAP;
+  hsa_kernel_dispatch_packet_t *queue_ptr = __builtin_gcn_dispatch_ptr ();
+  __gcn_lowlat_init ((void*)(uintptr_t)(void __flat*)lowlat,
+		     queue_ptr->group_segment_size - GCN_LOWLAT_HEAP);
+
   /* Allocate and initialize the team-local-storage data.  */
   struct gomp_thread *thrs = team_malloc_cleared (sizeof (*thrs)
						  * numthreads);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 15a767cf317..fa29f428976 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -114,9 +114,6 @@ extern void gomp_aligned_free (void *);
 #ifdef __AMDGCN__
 #include "libgomp-gcn.h"
 /* The arena is initialized in config/gcn/team.c.  */
-#define TEAM_ARENA_START 16 /* LDS offset of free pointer.  */
-#define TEAM_ARENA_FREE 24  /* LDS offset of free pointer.  */
-#define TEAM_ARENA_END 32   /* LDS offset of end pointer.  */
 
 static inline void * __attribute__((malloc))
 team_malloc (size_t size)
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 7fdd6fe9410..9d0aee72b33 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -5838,6 +5838,19 @@ The implementation remark:
   available devices (``host fallback'').
 @item The available stack size can be changed using the @code{GCN_STACK_SIZE}
   environment variable; the default is 32 kiB per thread.
+@item Low-latency memory (@code{omp_low_lat_mem_space}) is supported when
+  the @code{access} trait is set to @code{cgroup}.  The default pool size
+  is automatically scaled to share the 64 kiB LDS memory between the number
+  of teams configured to run on each compute-unit, but may be adjusted at
+  runtime by setting environment variable
+  @code{GOMP_GCN_LOWLAT_POOL=@var{bytes}}.
+@item @code{omp_low_lat_mem_alloc} cannot be used with true low-latency memory
+  because the definition implies the @code{omp_atv_all} trait; main
+  graphics memory is used instead.
+@item @code{omp_cgroup_mem_alloc}, @code{omp_pteam_mem_alloc}, and
+  @code{omp_thread_mem_alloc} all use low-latency memory as first
+  preference, and fall back to main graphics memory when the low-latency
+  pool is exhausted.
 @end itemize
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 8aabbd99881..7f8178c78b7 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -550,6 +550,7 @@ static size_t gcn_kernel_heap_size = DEFAULT_GCN_HEAP_SIZE;
 
 static int team_arena_size = DEFAULT_TEAM_ARENA_SIZE;
 static int stack_size = DEFAULT_GCN_STACK_SIZE;
+static int lowlat_size = -1;
 
 /* Flag to decide whether print to stderr information about what is going on.
    Set in init_debug depending on environment variables.  */
@@ -1016,8 +1017,8 @@ print_kernel_dispatch (struct kernel_dispatch *dispatch, unsigned indent)
   fprintf (stderr, "%*sobject: %lu\n", indent, "", dispatch->object);
   fprintf (stderr, "%*sprivate_segment_size: %u\n", indent, "",
	   dispatch->private_segment_size);
-  fprintf (stderr, "%*sgroup_segment_size: %u\n", indent, "",
-	   dispatch->group_segment_size);
+  fprintf (stderr, "%*sgroup_segment_size: %u (low-latency pool)\n", indent,
+	   "", dispatch->group_segment_size);
   fprintf (stderr, "\n");
 }
 
@@ -1088,6 +1089,10 @@ init_environment_variables (void)
       if (tmp)
	stack_size = tmp;;
     }
+
+  const char *lowlat = secure_getenv ("GOMP_GCN_LOWLAT_POOL");
+  if (lowlat)
+    lowlat_size = atoi (lowlat);
 }
 
 /* Return malloc'd string with name of SYMBOL.
   */
@@ -1930,7 +1935,25 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
   shadow->signal = sync_signal.handle;
 
   shadow->private_segment_size = kernel->private_segment_size;
-  shadow->group_segment_size = kernel->group_segment_size;
+
+  if (lowlat_size < 0)
+    {
+      /* Divide the LDS between the number of running teams.
+	 Allocate not less than is defined in the kernel metadata.  */
+      int teams_per_cu = num_teams / get_cu_count (agent);
+      int LDS_per_team = (teams_per_cu ? 65536 / teams_per_cu : 65536);
+      shadow->group_segment_size
+	= (kernel->group_segment_size > LDS_per_team
+	   ? kernel->group_segment_size
+	   : LDS_per_team);
+    }
+  else if (lowlat_size < GCN_LOWLAT_HEAP+8)
+    /* Ensure that there's space for the OpenMP libgomp data.  */
+    shadow->group_segment_size = GCN_LOWLAT_HEAP+8;
+  else
+    shadow->group_segment_size = (lowlat_size > 65536
+				  ? 65536
+				  : lowlat_size);
 
   /* We expect kernels to request a single pointer, explicitly, and the
      rest of struct kernargs, implicitly.  If they request anything else
@@ -2290,9 +2313,9 @@ run_kernel (struct kernel_info *kernel, void *vars,
       print_kernel_dispatch (shadow, 2);
     }
 
-  packet->private_segment_size = kernel->private_segment_size;
-  packet->group_segment_size = kernel->group_segment_size;
-  packet->kernel_object = kernel->object;
+  packet->private_segment_size = shadow->private_segment_size;
+  packet->group_segment_size = shadow->group_segment_size;
+  packet->kernel_object = shadow->object;
   packet->kernarg_address = shadow->kernarg_address;
   hsa_signal_t s;
   s.handle = shadow->signal;
diff --git a/libgomp/testsuite/libgomp.c/omp_alloc-traits.c b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
index 4ff0fca4986..e9acc8673a3 100644
--- a/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
+++ b/libgomp/testsuite/libgomp.c/omp_alloc-traits.c
@@ -1,7 +1,7 @@
 /* { dg-do run } */
 
 /* { dg-require-effective-target offload_device } */
-/* { dg-xfail-if "not implemented" { ! offload_target_nvptx } } */
+/* { dg-xfail-if "not implemented" { ! { offload_target_nvptx || offload_target_amdgcn } } } */
 
 /* Test that GPU low-latency allocation is limited to team access.  */
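The cover letter describes the low-latency heap as "a basic but fast algorithm", and the team.c comment above notes that each block's header is its size. As a rough host-side illustration only: the sketch below is a first-fit pool allocator with per-block size headers over a fixed buffer. All names (`pool_alloc`, `pool_free`), the 32-bit header, the pool size, and the first-fit policy are invented for this sketch; the real `basic-allocator.c` lives in LDS, is made thread-safe with atomic operations, and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a size-header pool allocator; NOT libgomp's
   basic-allocator.c.  The low bit of each stored size marks "in use",
   which works because sizes are kept 8-byte aligned.  */

#define POOL_SIZE 1024
#define HDR sizeof (uint32_t)	/* Each block is preceded by its size.  */

static char pool[POOL_SIZE];	/* Zero-initialized, so headers start 0.  */

static void *
pool_alloc (size_t size)
{
  size = (size + 7) & ~(size_t) 7;	/* Align; frees the low header bit.  */
  size_t offset = 0;
  while (offset + HDR + size <= POOL_SIZE)
    {
      uint32_t hdr;
      memcpy (&hdr, pool + offset, HDR);
      if (hdr == 0)			/* Untouched tail of the pool.  */
	{
	  uint32_t used = (uint32_t) size | 1;
	  memcpy (pool + offset, &used, HDR);
	  return pool + offset + HDR;
	}
      if (!(hdr & 1) && (hdr & ~1u) >= size)	/* Free and big enough.  */
	{
	  uint32_t used = hdr | 1;	/* Reuse whole block; keep its size.  */
	  memcpy (pool + offset, &used, HDR);
	  return pool + offset + HDR;
	}
      offset += HDR + (hdr & ~1u);	/* Skip to the next header.  */
    }
  return NULL;	/* Pool exhausted; cf. the omp_atv_null_fb fallback.  */
}

static void
pool_free (void *ptr)
{
  char *base = (char *) ptr - HDR;
  uint32_t hdr;
  memcpy (&hdr, base, HDR);
  hdr &= ~1u;			/* Clear "used"; size stays for reuse.  */
  memcpy (base, &hdr, HDR);
}
```

The null return on exhaustion mirrors why the tests above pair the `cgroup` access trait with `omp_atv_null_fb`: a fixed on-chip pool can run out, and the caller must be able to observe that.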