From patchwork Wed Jan 11 11:49:24 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Thomas Schwinge <thomas@codesourcery.com>
X-Patchwork-Id: 62943
Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 77A013857803
	for <patchwork@sourceware.org>; Wed, 11 Jan 2023 11:49:47 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from esa2.mentor.iphmx.com (esa2.mentor.iphmx.com [68.232.141.98])
 by sourceware.org (Postfix) with ESMTPS id 922923858407
 for <gcc-patches@gcc.gnu.org>; Wed, 11 Jan 2023 11:49:29 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 922923858407
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="5.96,315,1665475200"; d="scan'208,223";a="93935076"
Received: from orw-gwy-02-in.mentorg.com ([192.94.38.167])
 by esa2.mentor.iphmx.com with ESMTP; 11 Jan 2023 03:49:28 -0800
IronPort-SDR: 
 sDtJzw1pQRqofRxGlS4hg3VFl4YzMBbWq7e0NHlRTUlqWNlzefHoyDkkDj3YSxMBVnEcJgYj5+
 yeOthYxYH3zf3zbQcT+0dWKV7oEJGHzvfm43k4dIaR6D5FmmI9IACOyPVvCqNEv4fJMniYY1rK
 onBeuLvPozIe5hJ2iJZtGhUf+k13w6B31cRMhP1TtQ3vVDMxMGd+i2G7113lQTJVFdyRfy6Sh2
 dDLWLHmAaSrkqc4DS90GA0M0tzJrIGOtfn0Wf+FLAkUYh6LTOh4iMi4+g+zxitcmYsIFoWCsoU
 UyY=
From: Thomas Schwinge <thomas@codesourcery.com>
To: <gcc-patches@gcc.gnu.org>, Tom de Vries <tdevries@suse.de>
Subject: [PING] nvptx: Support global constructors/destructors via
 'collect2' for offloading (was: nvptx: Support global
 constructors/destructors via 'collect2')
In-Reply-To: <87o7rup7f8.fsf@euler.schwinge.homeip.net>
References: <878rjqaku5.fsf@dem-tschwing-1.ger.mentorg.com>
 <87y1rq7wt4.fsf@dem-tschwing-1.ger.mentorg.com>
 <87r0wqp7jf.fsf@euler.schwinge.homeip.net>
 <87o7rup7f8.fsf@euler.schwinge.homeip.net>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/28.2
 (x86_64-pc-linux-gnu)
Date: Wed, 11 Jan 2023 12:49:24 +0100
Message-ID: <87fschl29n.fsf@euler.schwinge.homeip.net>
MIME-Version: 1.0
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-13.mgc.mentorg.com (139.181.222.13) To
 svr-ies-mbx-10.mgc.mentorg.com (139.181.222.10)
X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, RCVD_IN_MSPIKE_H2,
 SPF_HELO_PASS, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
Sender: "Gcc-patches"
 <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>

Hi!

Ping.


Regards
 Thomas


On 2022-12-23T14:37:47+0100, I wrote:
> Hi!
>
> On 2022-12-23T14:35:16+0100, I wrote:
>> On 2022-12-02T14:35:35+0100, I wrote:
>>> On 2022-12-01T22:13:38+0100, I wrote:
>>>> I'm working on support for global constructors/destructors with
>>>> GCC/nvptx
>>>
>>> See "nvptx: Support global constructors/destructors via 'collect2'"
>>> [posted before]
>>
>> Building on that, attached is now the additional "for offloading" piece:
>> "nvptx: Support global constructors/destructors via 'collect2' for offloading".
>> OK to push?
>
> Now really attached.
>
>> I did manually test this (by putting a few constructors/destructors into
>> 'libgomp/config/nvptx/oacc-parallel.c', and observing that they get executed),
>> and also in my WIP development tree with standard libgfortran
>> constructors (with 'LIBGFOR_MINIMAL' disabled).
>
>
> Regards
>  Thomas
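
For reference, the kind of manual test quoted above can be sketched on the
host.  This is only an illustration, not the code that was put into
'libgomp/config/nvptx/oacc-parallel.c': all 'demo_*' names are made up, and
the sketch assumes the GCC 'constructor'/'destructor' function attributes.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter: number of constructors that have run so far.  */
static int demo_ctors_run;

/* Runs before 'main' (or, for a shared object, when it is loaded).  */
__attribute__((constructor))
static void
demo_ctor_a (void)
{
  ++demo_ctors_run;
  fprintf (stderr, "demo_ctor_a\n");
}

/* A second constructor; relative order of unprioritized constructors is
   not relied upon here.  */
__attribute__((constructor))
static void
demo_ctor_b (void)
{
  ++demo_ctors_run;
  fprintf (stderr, "demo_ctor_b\n");
}

/* Runs at normal program termination.  */
__attribute__((destructor))
static void
demo_dtor (void)
{
  fprintf (stderr, "demo_dtor\n");
}

/* Report how many constructors have been observed.  */
int
demo_ctors_seen (void)
{
  return demo_ctors_run;
}
```

As I read the patch, functions like these in offloaded code would be expected
to run when the image gets loaded and unloaded, respectively.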



From fb67006eeca0c8e2bfdf86576ed3109dacaf6868 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 30 Nov 2022 22:09:35 +0100
Subject: [PATCH] nvptx: Support global constructors/destructors via 'collect2'
 for offloading

This extends "nvptx: Support global constructors/destructors via 'collect2'"
for offloading.

	libgcc/
	* config/nvptx/crtstuff.c ["mgomp"]
	(__do_global_ctors__entry__mgomp)
	(__do_global_dtors__entry__mgomp): New.
	[!"mgomp"] (__do_global_ctors__entry, __do_global_dtors__entry):
	New.
	libgomp/
	* plugin/plugin-nvptx.c (nvptx_do_global_cdtors): New.
	(nvptx_close_device, GOMP_OFFLOAD_load_image)
	(GOMP_OFFLOAD_unload_image): Call it.
---
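Not part of the commit: a minimal host-side sketch of the name fallback that
'nvptx_do_global_cdtors' performs, assuming a made-up symbol table in place of
'cuModuleGetFunction'.  All 'demo_*' names are hypothetical; only the
'__mgomp' suffix and the entry-point names are taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the '.entry' symbols present in one module;
   here, a module built with '-mgomp'.  */
static const char *const demo_entries[] = {
  "__do_global_ctors__entry__mgomp",
  "__do_global_dtors__entry__mgomp",
};

/* Return nonzero if NAME is one of the module's entry points.  */
static int
demo_has_entry (const char *name)
{
  for (size_t i = 0; i < sizeof demo_entries / sizeof demo_entries[0]; i++)
    if (strcmp (demo_entries[i], name) == 0)
      return 1;
  return 0;
}

/* Look up FUNCNAME, then 'FUNCNAME__mgomp'.  Return a malloc'ed copy of the
   name that resolved, or NULL if neither exists (or allocation failed).
   *USED_MGOMP is set to 1 only if the suffixed name resolved, which is the
   case where a soft stack would have to be passed to the kernel.  */
static char *
demo_resolve_entry (const char *funcname, int *used_mgomp)
{
  size_t len = strlen (funcname);
  const char *suffix = "__mgomp";
  size_t suffix_len = strlen (suffix);

  *used_mgomp = 0;

  /* Room for the suffixed name; the plain name fits as well.  */
  char *name = malloc (len + suffix_len + 1);
  if (name == NULL)
    return NULL;

  memcpy (name, funcname, len + 1);
  if (demo_has_entry (name))
    return name;

  /* Append the suffix, including its terminating NUL.  */
  memcpy (name + len, suffix, suffix_len + 1);
  if (demo_has_entry (name))
    {
      *used_mgomp = 1;
      return name;
    }

  /* Neither variant exists: nothing to run, which is not an error.  */
  free (name);
  return NULL;
}
```

Calling 'demo_resolve_entry ("__do_global_ctors__entry", &used)' against the
table above resolves to the '__mgomp' variant and sets 'used'; a name absent
in both forms yields NULL, mirroring the 'CUDA_ERROR_NOT_FOUND' case that the
plugin treats as "no constructors/destructors".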
 libgcc/config/nvptx/crtstuff.c |  64 ++++++++++++++++++-
 libgomp/plugin/plugin-nvptx.c  | 113 ++++++++++++++++++++++++++++++++-
 2 files changed, 175 insertions(+), 2 deletions(-)

diff --git a/libgcc/config/nvptx/crtstuff.c b/libgcc/config/nvptx/crtstuff.c
index 0823fc49901..8dc80687e0a 100644
--- a/libgcc/config/nvptx/crtstuff.c
+++ b/libgcc/config/nvptx/crtstuff.c
@@ -29,6 +29,14 @@
    files (via 'CRT_BEGIN' and 'CRT_END'): 'crtbegin.o' and 'crtend.o', but we
    do so anyway, for symmetry with other configurations.  */
 
+
+/* See 'crt0.c', 'mgomp.c'.  */
+#if defined(__nvptx_softstack__) && defined(__nvptx_unisimt__)
+extern void *__nvptx_stacks[32] __attribute__((shared,nocommon));
+extern unsigned __nvptx_uni[32] __attribute__((shared,nocommon));
+#endif
+
+
 #ifdef CRT_BEGIN
 
 void
@@ -37,6 +45,33 @@ __do_global_ctors (void)
   DO_GLOBAL_CTORS_BODY;
 }
 
+/* Need '.entry' wrapper for offloading.  */
+
+# if defined(__nvptx_softstack__) && defined(__nvptx_unisimt__)
+
+__attribute__((kernel)) void __do_global_ctors__entry__mgomp (void *);
+
+void
+__do_global_ctors__entry__mgomp (void *nvptx_stacks_0)
+{
+  __nvptx_stacks[0] = nvptx_stacks_0;
+  __nvptx_uni[0] = 0;
+
+  __do_global_ctors ();
+}
+
+# else
+
+__attribute__((kernel)) void __do_global_ctors__entry (void);
+
+void
+__do_global_ctors__entry (void)
+{
+  __do_global_ctors ();
+}
+
+# endif
+
 #elif defined(CRT_END) /* ! CRT_BEGIN */
 
 void
@@ -45,7 +80,7 @@ __do_global_dtors (void)
   /* In this configuration here, there's no way that "this routine is run more
      than once [...] when exit is called recursively": for nvptx target, the
      call to '__do_global_dtors' is registered via 'atexit', which doesn't
-     re-enter a function already run.
+     re-enter a function already run; nor does the nvptx offload target.
      Therefore, we do *not* "arrange to remember where in the list we left off
      processing".  */
   func_ptr *p;
@@ -53,6 +88,33 @@ __do_global_dtors (void)
     (*p++) ();
 }
 
+/* Need '.entry' wrapper for offloading.  */
+
+# if defined(__nvptx_softstack__) && defined(__nvptx_unisimt__)
+
+__attribute__((kernel)) void __do_global_dtors__entry__mgomp (void *);
+
+void
+__do_global_dtors__entry__mgomp (void *nvptx_stacks_0)
+{
+  __nvptx_stacks[0] = nvptx_stacks_0;
+  __nvptx_uni[0] = 0;
+
+  __do_global_dtors ();
+}
+
+# else
+
+__attribute__((kernel)) void __do_global_dtors__entry (void);
+
+void
+__do_global_dtors__entry (void)
+{
+  __do_global_dtors ();
+}
+
+# endif
+
 #else /* ! CRT_BEGIN && ! CRT_END */
 #error "One of CRT_BEGIN or CRT_END must be defined."
 #endif
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index fcc97c6e0d5..395639537e8 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -338,6 +338,11 @@ struct ptx_device
 
 static struct ptx_device **ptx_devices;
 
+static bool nvptx_do_global_cdtors (CUmodule, struct ptx_device *,
+				    const char *);
+static size_t nvptx_stacks_size (void);
+static void *nvptx_stacks_acquire (struct ptx_device *, size_t, int);
+
 static inline struct nvptx_thread *
 nvptx_thread (void)
 {
@@ -557,6 +562,17 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   if (!ptx_dev)
     return true;
 
+  bool ret = true;
+
+  for (struct ptx_image_data *image = ptx_dev->images;
+       image != NULL;
+       image = image->next)
+    {
+      if (!nvptx_do_global_cdtors (image->module, ptx_dev,
+				   "__do_global_dtors__entry"))
+	ret = false;
+    }
+
   for (struct ptx_free_block *b = ptx_dev->free_blocks; b;)
     {
       struct ptx_free_block *b_next = b->next;
@@ -577,7 +593,8 @@ nvptx_close_device (struct ptx_device *ptx_dev)
     CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
   free (ptx_dev);
-  return true;
+
+  return ret;
 }
 
 static int
@@ -1280,6 +1297,93 @@ nvptx_set_clocktick (CUmodule module, struct ptx_device *dev)
     GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
 }
 
+/* Invoke MODULE's global constructors/destructors.  */
+
+static bool
+nvptx_do_global_cdtors (CUmodule module, struct ptx_device *ptx_dev,
+			const char *funcname)
+{
+  bool ret = true;
+  char *funcname_mgomp = NULL;
+  CUresult r;
+  CUfunction funcptr;
+  r = CUDA_CALL_NOCHECK (cuModuleGetFunction,
+			 &funcptr, module, funcname);
+  GOMP_PLUGIN_debug (0, "cuModuleGetFunction (%s): %s\n",
+		     funcname, cuda_error (r));
+  if (r == CUDA_ERROR_NOT_FOUND)
+    {
+      /* Try '[funcname]__mgomp'.  */
+
+      size_t funcname_len = strlen (funcname);
+      const char *mgomp_suffix = "__mgomp";
+      size_t mgomp_suffix_len = strlen (mgomp_suffix);
+      funcname_mgomp
+	= GOMP_PLUGIN_malloc (funcname_len + mgomp_suffix_len + 1);
+      memcpy (funcname_mgomp, funcname, funcname_len);
+      memcpy (funcname_mgomp + funcname_len,
+	      mgomp_suffix, mgomp_suffix_len + 1);
+      funcname = funcname_mgomp;
+
+      r = CUDA_CALL_NOCHECK (cuModuleGetFunction,
+			     &funcptr, module, funcname);
+      GOMP_PLUGIN_debug (0, "cuModuleGetFunction (%s): %s\n",
+			 funcname, cuda_error (r));
+    }
+  if (r == CUDA_ERROR_NOT_FOUND)
+    ;
+  else if (r != CUDA_SUCCESS)
+    {
+      GOMP_PLUGIN_error ("cuModuleGetFunction (%s) error: %s",
+			 funcname, cuda_error (r));
+      ret = false;
+    }
+  else
+    {
+      /* If necessary, set up soft stack.  */
+      void *nvptx_stacks_0;
+      void *kargs[1];
+      if (funcname_mgomp)
+	{
+	  size_t stack_size = nvptx_stacks_size ();
+	  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+	  nvptx_stacks_0 = nvptx_stacks_acquire (ptx_dev, stack_size, 1);
+	  nvptx_stacks_0 += stack_size;
+	  kargs[0] = &nvptx_stacks_0;
+	}
+      r = CUDA_CALL_NOCHECK (cuLaunchKernel,
+			     funcptr,
+			     1, 1, 1, 1, 1, 1,
+			     /* sharedMemBytes */ 0,
+			     /* hStream */ NULL,
+			     /* kernelParams */ funcname_mgomp ? kargs : NULL,
+			     /* extra */ NULL);
+      if (r != CUDA_SUCCESS)
+	{
+	  GOMP_PLUGIN_error ("cuLaunchKernel (%s) error: %s",
+			     funcname, cuda_error (r));
+	  ret = false;
+	}
+
+      r = CUDA_CALL_NOCHECK (cuStreamSynchronize,
+			     NULL);
+      if (r != CUDA_SUCCESS)
+	{
+	  GOMP_PLUGIN_error ("cuStreamSynchronize (%s) error: %s",
+			     funcname, cuda_error (r));
+	  ret = false;
+	}
+
+      if (funcname_mgomp)
+	pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
+    }
+
+  if (funcname_mgomp)
+    free (funcname_mgomp);
+
+  return ret;
+}
+
 /* Load the (partial) program described by TARGET_DATA to device
    number ORD.  Allocate and return TARGET_TABLE.  If not NULL, REV_FN_TABLE
    will contain the on-device addresses of the functions for reverse offload.
@@ -1452,6 +1556,9 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
 
   nvptx_set_clocktick (module, dev);
 
+  if (!nvptx_do_global_cdtors (module, dev, "__do_global_ctors__entry"))
+    return -1;
+
   return fn_entries + var_entries + other_entries;
 }
 
@@ -1477,6 +1584,10 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
   for (prev_p = &dev->images; (image = *prev_p) != 0; prev_p = &image->next)
     if (image->target_data == target_data)
       {
+	if (!nvptx_do_global_cdtors (image->module, dev,
+				     "__do_global_dtors__entry"))
+	  ret = false;
+
 	*prev_p = image->next;
 	if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
 	  ret = false;
-- 
2.25.1