From patchwork Mon Aug 29 10:54:33 2022
X-Patchwork-Submitter: Tobias Burnus
X-Patchwork-Id: 57134
Date: Mon, 29 Aug 2022 12:54:33 +0200
From: Tobias Burnus
To: gcc-patches, Jakub Jelinek
Cc: Andrew Stubbs, Thomas Schwinge
Subject: [Patch] libgomp.texi: Document libmemkind + nvptx/gcn specifics

I have had this patch lying around for about half a year. I tweaked and
augmented it a bit today, but finally want to get rid of it (locally - by
getting it committed) ...

This patch changes -misa to -march for nvptx (the latter is now an alias
for the former), adds a new section about libmemkind, and adds some
information about the internals of our nvptx/gcn implementation. (The
latter should be mostly correct, but I might have missed some fine print
or a more recent update.)

OK for mainline?

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

libgomp.texi: Document libmemkind + nvptx/gcn specifics

libgomp/ChangeLog:

	* libgomp.texi (OpenMP-Implementation Specifics): New; add
	libmemkind section; move OpenMP Context Selectors from ...
	(Offload-Target Specifics): ... here; add 'AMD Radeon (GCN)'
	and 'nvptx' sections.
 libgomp/libgomp.texi | 132 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 6298de8254c..4c5903b55cc 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -113,6 +113,8 @@ changed to GNU Offloading and Multi Processing Runtime Library.
 * OpenACC Library Interoperability:: OpenACC library interoperability with the
                                NVIDIA CUBLAS library.
 * OpenACC Profiling Interface::
+* OpenMP-Implementation Specifics:: Notes on specifics of this OpenMP
+                               implementation
 * Offload-Target Specifics:: Notes on offload-target specific internals
 * The libgomp ABI:: Notes on the external ABI presented by libgomp.
 * Reporting Bugs:: How to report bugs in the GNU Offloading and
@@ -4280,16 +4282,15 @@ offloading devices (it's not clear if they should be):
 @end itemize
 
 @c ---------------------------------------------------------------------
-@c Offload-Target Specifics
+@c OpenMP-Implementation Specifics
 @c ---------------------------------------------------------------------
 
-@node Offload-Target Specifics
-@chapter Offload-Target Specifics
-
-The following sections present notes on the offload-target specifics.
+@node OpenMP-Implementation Specifics
+@chapter OpenMP-Implementation Specifics
 
 @menu
 * OpenMP Context Selectors::
+* Memory allocation with libmemkind::
 @end menu
 
 @node OpenMP Context Selectors
@@ -4308,9 +4309,128 @@ The following sections present notes on the offload-target specifics.
       @tab See @code{-march=} in ``AMD GCN Options''
 @item @code{nvptx}
       @tab @code{gpu}
-      @tab See @code{-misa=} in ``Nvidia PTX Options''
+      @tab See @code{-march=} in ``Nvidia PTX Options''
 @end multitable
 
+@node Memory allocation with libmemkind
+@section Memory allocation with libmemkind
+
+On Linux systems, where the @uref{https://github.com/memkind/memkind, memkind
+library} (@code{libmemkind.so.0}) is available at runtime, it is used when
+creating memory allocators requesting
+
+@itemize
+@item the memory space @code{omp_high_bw_mem_space}
+@item the memory space @code{omp_large_cap_mem_space}
+@item the partition trait @code{omp_atv_interleaved}
+@end itemize
+
+
+@c ---------------------------------------------------------------------
+@c Offload-Target Specifics
+@c ---------------------------------------------------------------------
+
+@node Offload-Target Specifics
+@chapter Offload-Target Specifics
+
+The following sections present notes on the offload-target specifics.
+
+@menu
+* AMD Radeon::
+* nvptx::
+@end menu
+
+@node AMD Radeon
+@section AMD Radeon (GCN)
+
+On the hardware side, there is the following hierarchy (fine to coarse):
+@itemize
+@item work item (thread)
+@item wavefront
+@item work group
+@item compute unit (CU)
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to work items (thread)
+@item OpenMP's threads (``parallel'') and OpenACC's workers map
+      to wavefronts
+@item OpenMP's teams and OpenACC's gang use a threadpool with the
+      size of the number of teams or gangs, respectively.
+@end itemize
+
+The used sizes are
+@itemize
+@item Number of teams is the specified @code{num_teams} (OpenMP) or
+      @code{num_gangs} (OpenACC) or otherwise the number of CUs
+@item Number of wavefronts is 4 for gfx900 and 16 otherwise;
+      @code{num_threads} (OpenMP) and @code{num_workers} (OpenACC)
+      overrides this if smaller.
+@item The wavefront has 102 scalars and 64 vectors
+@item Number of workitems is always 64
+@item The hardware permits maximally 40 workgroups/CU and
+      16 wavefronts/workgroup up to a limit of 40 wavefronts in total per CU.
+@item 80 scalar registers and 24 vector registers in non-kernel functions
+      (the chosen procedure-calling API).
+@item For the kernel itself: as many as register pressure demands (number of
+      teams and number of threads, scaled down if registers are exhausted)
+@end itemize
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions and the Fortran
+      @code{print}/@code{write} statements.
+@end itemize
+
+
+@node nvptx
+@section nvptx
+
+On the hardware side, there is the following hierarchy (fine to coarse):
+@itemize
+@item thread
+@item warp
+@item thread block
+@item streaming multiprocessor
+@end itemize
+
+All OpenMP and OpenACC levels are used, i.e.
+@itemize
+@item OpenMP's simd and OpenACC's vector map to threads
+@item OpenMP's threads (``parallel'') and OpenACC's workers map to warps
+@item OpenMP's teams and OpenACC's gang use a threadpool with the
+      size of the number of teams or gangs, respectively.
+@end itemize
+
+The used sizes are
+@itemize
+@item The @code{warp_size} is always 32
+@item CUDA kernel launched: @code{dim=@{#teams,1,1@}, blocks=@{#threads,warp_size,1@}}.
+@end itemize
+
+Additional information can be obtained by setting the environment variable
+@code{GOMP_DEBUG=1} (very verbose; grep for @code{kernel.*launch} for launch
+parameters).
+
+GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA,
+which caches the JIT result in the user's directory (see the CUDA
+documentation; this can be tuned by the environment variables
+@code{CUDA_CACHE_@{DISABLE,MAXSIZE,PATH@}}).
+
+Note: While the generated PTX ISA is generic, the @code{-mptx=} and
+@code{-march=} command-line options still affect the used PTX ISA code and,
+thus, the requirements on CUDA version and hardware.
+
+Implementation remarks:
+@itemize
+@item I/O within OpenMP target regions and OpenACC parallel/kernels is supported
+      using the C library @code{printf} functions. Note that the Fortran
+      @code{print}/@code{write} statements are not yet supported.
+@end itemize
+
 @c ---------------------------------------------------------------------
 @c The libgomp ABI