From patchwork Sun Jan 16 16:33:42 2022
X-Patchwork-Submitter: Thomas Schwinge
X-Patchwork-Id: 50080
From: Thomas Schwinge
To: gcc-patches@gcc.gnu.org
Cc: Andrew Stubbs, Kwok Cheung Yeung, Tobias Burnus
Subject: amdgcn: Tune default OpenMP/OpenACC GPU utilization
In-Reply-To: <08b8cdb2-11ef-1ceb-efc2-b8495bda6bef@codesourcery.com>
References: <08b8cdb2-11ef-1ceb-efc2-b8495bda6bef@codesourcery.com>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/27.1
 (x86_64-pc-linux-gnu)
Date: Sun, 16 Jan 2022 17:33:42 +0100
Message-ID: <87lezfskrd.fsf@euler.schwinge.homeip.net>

Hi!

On 2020-07-15T21:49:11+0100, Andrew Stubbs wrote:
> This patch tunes the default GPU thread count for OpenMP and OpenACC on
> AMD GCN devices. It chooses a sensible default if no attributes are
> given at all, increases the number of OpenACC gangs if only one worker
> per gang is specified, and increases the number of workers otherwise.
> The tuning is still a work in progress as we fix issues that limit
> occupancy.

Pushed in commit a78b1ab1df9ca44acc5638e8f9d0ae2e62bd65ed
"amdgcn: Tune default OpenMP/OpenACC GPU utilization", see attached.
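For reference, the new OpenACC defaulting heuristic boils down to the
following standalone sketch (illustration only, not the plugin code
itself: pick_oacc_dims and cu_count are hypothetical names, cu_count
standing in for get_cu_count (kernel->agent); the constants mirror the
patch attached below):

    /* Sketch of the gang/worker defaulting heuristic from gcn_exec.  */
    static void
    pick_oacc_dims (int cu_count, int *gangs, int *workers)
    {
      if (*gangs == 0 && *workers == 0)
        {
          /* Nothing specified: a reasonable number of gangs and workers.  */
          *gangs = cu_count * 4;
          *workers = 8;
        }
      else if (*gangs == 0)
        /* Workers requested: auto-scale the number of gangs.  */
        *gangs = cu_count * (32 / *workers);
      else if (*workers == 0)
        {
          /* Gangs requested: auto-scale the number of workers, clamped
             to the range [1, 16].  */
          *workers = cu_count * 32 / *gangs;
          if (*workers == 0)
            *workers = 1;
          if (*workers > 16)
            *workers = 16;
        }
    }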
Tobias, this should've unblocked your "[wwwdocs] gcc-12/changes.html
(GCN): >1 workers per gang" patch.

Regards
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; registered address:
Arnulfstraße 201, 80634 München; limited liability company; managing
directors: Thomas Heurung, Frank Thürauf; registered office: München;
commercial register: München, HRB 106955

From a78b1ab1df9ca44acc5638e8f9d0ae2e62bd65ed Mon Sep 17 00:00:00 2001
From: Kwok Cheung Yeung
Date: Thu, 29 Aug 2019 10:16:42 -0700
Subject: [PATCH] amdgcn: Tune default OpenMP/OpenACC GPU utilization

libgomp/
	* plugin/plugin-gcn.c (parse_target_attributes): Automatically set
	the number of teams and threads if necessary.
	(gcn_exec): Automatically set the number of gangs and workers if
	necessary.

Co-Authored-By: Andrew Stubbs
---
 libgomp/plugin/plugin-gcn.c | 82 ++++++++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 15 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index d0f05b28bf3..f305d726874 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1219,24 +1219,55 @@ parse_target_attributes (void **input,
 
   if (gcn_dims_found)
     {
+      bool gfx900_workaround_p = false;
+
       if (agent->device_isa == EF_AMDGPU_MACH_AMDGCN_GFX900
           && gcn_threads == 0 && override_z_dim == 0)
         {
-          gcn_threads = 4;
+          gfx900_workaround_p = true;
           GCN_WARNING ("VEGA BUG WORKAROUND: reducing default number of "
-                       "threads to 4 per team.\n");
+                       "threads to at most 4 per team.\n");
           GCN_WARNING (" - If this is not a Vega 10 device, please use "
                        "GCN_NUM_THREADS=16\n");
         }
 
+      /* Ideally, when a dimension isn't explicitly specified, we should
+         tune it to run 40 (or 32?) threads per CU with no threads getting
+         queued.  In practice, we tune for peak performance on BabelStream,
+         which for OpenACC is currently 32 threads per CU.  */
       def->ndim = 3;
-      /* Fiji has 64 CUs, but Vega20 has 60.  */
-      def->gdims[0] = (gcn_teams > 0) ? gcn_teams : get_cu_count (agent);
-      /* Each thread is 64 work items wide.  */
-      def->gdims[1] = 64;
-      /* A work group can have 16 wavefronts.  */
-      def->gdims[2] = (gcn_threads > 0) ? gcn_threads : 16;
-      def->wdims[0] = 1; /* Single team per work-group.  */
+      if (gcn_teams <= 0 && gcn_threads <= 0)
+        {
+          /* Set up a reasonable number of teams and threads.  */
+          gcn_threads = gfx900_workaround_p ? 4 : 16; // 8;
+          def->gdims[0] = get_cu_count (agent); // * (40 / gcn_threads);
+          def->gdims[2] = gcn_threads;
+        }
+      else if (gcn_teams <= 0 && gcn_threads > 0)
+        {
+          /* Auto-scale the number of teams with the number of threads.  */
+          def->gdims[0] = get_cu_count (agent); // * (40 / gcn_threads);
+          def->gdims[2] = gcn_threads;
+        }
+      else if (gcn_teams > 0 && gcn_threads <= 0)
+        {
+          int max_threads = gfx900_workaround_p ? 4 : 16;
+
+          /* Auto-scale the number of threads with the number of teams.  */
+          def->gdims[0] = gcn_teams;
+          def->gdims[2] = 16; // get_cu_count (agent) * 40 / gcn_teams;
+          if (def->gdims[2] == 0)
+            def->gdims[2] = 1;
+          else if (def->gdims[2] > max_threads)
+            def->gdims[2] = max_threads;
+        }
+      else
+        {
+          def->gdims[0] = gcn_teams;
+          def->gdims[2] = gcn_threads;
+        }
+      def->gdims[1] = 64; /* Each thread is 64 work items wide.  */
+      def->wdims[0] = 1;  /* Single team per work-group.  */
       def->wdims[1] = 64;
       def->wdims[2] = 16;
       *result = def;
@@ -3031,13 +3062,34 @@ gcn_exec (struct kernel_info *kernel, size_t mapnum, void **hostaddrs,
   if (hsa_kernel_desc->oacc_dims[2] > 0)
     dims[2] = hsa_kernel_desc->oacc_dims[2];
 
-  /* If any of the OpenACC dimensions remain 0 then we get to pick a number.
-     There isn't really a correct answer for this without a clue about the
-     problem size, so let's do a reasonable number of single-worker gangs.
-     64 gangs matches a typical Fiji device.  */
+  /* Ideally, when a dimension isn't explicitly specified, we should
+     tune it to run 40 (or 32?) threads per CU with no threads getting
+     queued.  In practice, we tune for peak performance on BabelStream,
+     which for OpenACC is currently 32 threads per CU.  */
+  if (dims[0] == 0 && dims[1] == 0)
+    {
+      /* If any of the OpenACC dimensions remain 0 then we get to pick a
+         number.  There isn't really a correct answer for this without a clue
+         about the problem size, so let's do a reasonable number of workers
+         and gangs.  */
 
-  if (dims[0] == 0) dims[0] = get_cu_count (kernel->agent); /* Gangs.  */
-  if (dims[1] == 0) dims[1] = 16; /* Workers.  */
+      dims[0] = get_cu_count (kernel->agent) * 4; /* Gangs.  */
+      dims[1] = 8; /* Workers.  */
+    }
+  else if (dims[0] == 0 && dims[1] > 0)
+    {
+      /* Auto-scale the number of gangs with the requested number of workers.  */
+      dims[0] = get_cu_count (kernel->agent) * (32 / dims[1]);
+    }
+  else if (dims[0] > 0 && dims[1] == 0)
+    {
+      /* Auto-scale the number of workers with the requested number of gangs.  */
+      dims[1] = get_cu_count (kernel->agent) * 32 / dims[0];
+      if (dims[1] == 0)
+        dims[1] = 1;
+      if (dims[1] > 16)
+        dims[1] = 16;
+    }
 
   /* The incoming dimensions are expressed in terms of gangs, workers, and
      vectors.  The HSA dimensions are expressed in terms of "work-items",
-- 
2.34.1
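To make those defaults concrete, here is a hypothetical run of the
pick_oacc_dims () sketch from above for a device with 64 CUs (the CU
count is an assumption for illustration; the plugin queries the real
value from the HSA agent at run time):

    #include <stdio.h>

    int
    main (void)
    {
      /* {gangs, workers} as they arrive in gcn_exec: unspecified
         dimensions are 0.  */
      int cases[][2] = { { 0, 0 }, { 0, 16 }, { 512, 0 } };
      for (int i = 0; i < 3; i++)
        {
          int gangs = cases[i][0], workers = cases[i][1];
          pick_oacc_dims (64, &gangs, &workers);
          printf ("gangs=%d workers=%d\n", gangs, workers);
        }
      /* Prints:
           gangs=256 workers=8   (nothing specified)
           gangs=128 workers=16  (16 workers requested)
           gangs=512 workers=4   (512 gangs requested)  */
      return 0;
    }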