From patchwork Mon Aug 29 10:54:33 2022
X-Patchwork-Submitter: Tobias Burnus
X-Patchwork-Id: 57134
Date: Mon, 29 Aug 2022 12:54:33 +0200
From: Tobias Burnus
To: gcc-patches, Jakub Jelinek
Cc: Andrew Stubbs, Thomas Schwinge
Subject: [Patch] libgomp.texi: Document libmemkind + nvptx/gcn specifics

I have had this patch lying around for about half a year. I tweaked and
augmented it a bit today, but finally want to get rid of it (locally - by
getting it committed) ...

This patch changes -misa to -march for nvptx (the latter is now an alias
for the former), adds a new section about libmemkind, and adds some
information about the internals of our nvptx/gcn implementation. (The
latter should be mostly correct, but I might have missed some fine print
or a more recent update.)

OK for mainline?

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

libgomp.texi: Document libmemkind + nvptx/gcn specifics

libgomp/ChangeLog:

	* libgomp.texi (OpenMP-Implementation Specifics): New; add
	libmemkind section; move OpenMP Context Selectors from ...
	(Offload-Target Specifics): ... here; add 'AMD Radeon (GCN)'
	and 'nvptx' sections.
 libgomp/libgomp.texi | 132 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 6298de8254c..4c5903b55cc 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -113,6 +113,8 @@ changed to GNU Offloading and Multi Processing Runtime Library.
 * OpenACC Library Interoperability:: OpenACC library interoperability with the
                                NVIDIA CUBLAS library.
 * OpenACC Profiling Interface::
+* OpenMP-Implementation Specifics:: Notes on specifics of this OpenMP
+                               implementation
 * Offload-Target Specifics:: Notes on offload-target specific internals
 * The libgomp ABI:: Notes on the external ABI presented by libgomp.
 * Reporting Bugs:: How to report bugs in the GNU Offloading and
@@ -4280,16 +4282,15 @@ offloading devices (it's not clear if they should be):
 @end itemize
 
 @c ---------------------------------------------------------------------
-@c Offload-Target Specifics
+@c OpenMP-Implementation Specifics
 @c ---------------------------------------------------------------------
 
-@node Offload-Target Specifics
-@chapter Offload-Target Specifics
-
-The following sections present notes on the offload-target specifics.
+@node OpenMP-Implementation Specifics
+@chapter OpenMP-Implementation Specifics
 
 @menu
 * OpenMP Context Selectors::
+* Memory allocation with libmemkind::
 @end menu
 
 @node OpenMP Context Selectors
@@ -4308,9 +4309,128 @@ The following sections present notes on the offload-target specifics.
       @tab See @code{-march=} in ``AMD GCN Options''
 @item @code{nvptx}
       @tab @code{gpu}
-      @tab See @code{-misa=} in ``Nvidia PTX Options''
+      @tab See @code{-march=} in ``Nvidia PTX Options''
 @end multitable
 
+@node Memory allocation with libmemkind
+@section Memory allocation with libmemkind
+
+On Linux systems, where the @uref{https://github.com/memkind/memkind, memkind
+library} (@code{libmemkind.so.0}) is available at runtime, it is used when
+creating memory allocators requesting
+
+@itemize
+@item the memory space @code{omp_high_bw_mem_space}
+@item the memory space @code{omp_large_cap_mem_space}
+@item the partition trait @code{omp_atv_interleaved}
+@end itemize
+
+
+@c ---------------------------------------------------------------------
+@c Offload-Target Specifics
+@c ---------------------------------------------------------------------
+
+@node Offload-Target Specifics
+@chapter Offload-Target Specifics
+
+The following sections present notes on the offload-target specifics.
+
+@menu
+* AMD Radeon::
+* nvptx::
+@end menu
+
+@node AMD Radeon
+@section AMD Radeon (GCN)
+
+On the hardware side, there is the following hierarchy (fine to coarse):
+@itemize
+@item work item (thread)
+@item wavefront
+@item work group
+@item compute unit (CU)
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to work items (thread)
+@item OpenMP's threads (``parallel'') and OpenACC's workers map
+      to wavefronts
+@item OpenMP's teams and OpenACC's gang use a threadpool with the
+      size of the number of teams or gangs, respectively.
+@end itemize
+
+The used sizes are
+@itemize
+@item Number of teams is the specified @code{num_teams} (OpenMP) or
+      @code{num_gangs} (OpenACC) or otherwise the number of CUs
+@item Number of wavefronts is 4 for gfx900 and 16 otherwise;
+      @code{num_threads} (OpenMP) and @code{num_workers} (OpenACC)
+      overrides this if smaller.
+@item The wavefront has 102 scalars and 64 vectors
+@item Number of workitems is always 64
+@item The hardware permits maximally 40 workgroups/CU and
+      16 wavefronts/workgroup up to a limit of 40 wavefronts in total per CU.
+@item 80 scalar registers and 24 vector registers in non-kernel functions
+      (the chosen procedure-calling API).
+@item For the kernel itself: as many as register pressure demands (number of
+      teams and number of threads, scaled down if registers are exhausted)
+@end itemize
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions and the Fortran
+      @code{print}/@code{write} statements.
+@end itemize
+
+
+@node nvptx
+@section nvptx
+
+On the hardware side, there is the following hierarchy (fine to coarse):
+@itemize
+@item thread
+@item warp
+@item thread block
+@item streaming multiprocessor
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to threads
+@item OpenMP's threads (``parallel'') and OpenACC's workers map to warps
+@item OpenMP's teams and OpenACC's gang use a threadpool with the
+      size of the number of teams or gangs, respectively.
+@end itemize
+
+The used sizes are
+@itemize
+@item The @code{warp_size} is always 32
+@item CUDA kernel launched: @code{dim=@{#teams,1,1@}, blocks=@{#threads,warp_size,1@}}.
+@end itemize
+
+Additional information can be obtained by setting the environment variable
+@code{GOMP_DEBUG=1} (very verbose; grep for @code{kernel.*launch} for launch
+parameters).
+
+GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA,
+which caches the JIT result in the user's directory (see the CUDA
+documentation; this can be tuned by the environment variables
+@code{CUDA_CACHE_@{DISABLE,MAXSIZE,PATH@}}).
+
+Note: While the generated PTX ISA is generic, the @code{-mptx=} and
+@code{-march=} command-line options still affect the used PTX ISA code and,
+thus, the requirements on CUDA version and hardware.
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions. Note that the Fortran
+      @code{print}/@code{write} statements are not yet supported.
+@end itemize
+
 @c ---------------------------------------------------------------------
 @c The libgomp ABI