From patchwork Tue Mar 22 13:41:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tom de Vries X-Patchwork-Id: 52217 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id A36B1389EC60 for ; Tue, 22 Mar 2022 13:43:08 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A36B1389EC60 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1647956588; bh=GuJYezQ5c3U8RGJDF7P/pksc/GWVNNaGHeoeRwVajgc=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=CDnitgV7MbxKhwYSZripnWGtQ/QbLRVtNBStNZ8VO/NevZStEjIO0OAXnIDSBvnJM bLLbxxihmvXuQ2ZgQtHBW6i7BHPbHGoM7gBZy3/qI4Dyyv6pPVf93S5lN4Rim7ploU yYZdJQNojV9IEI/xaLWPGl3QXrsWj3N/O2Qup5AQ= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by sourceware.org (Postfix) with ESMTPS id 7EF7A389942E for ; Tue, 22 Mar 2022 13:41:12 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 7EF7A389942E Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 4A3F71F387 for ; Tue, 22 Mar 2022 13:41:11 +0000 (UTC) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 3577E133B6 for ; Tue, 22 Mar 2022 13:41:11 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id 1n7mC/fROWLlEAAAMHmgww (envelope-from ) for ; Tue, 22 Mar 2022 13:41:11 +0000 Date: Tue, 22 Mar 2022 14:41:09 +0100 To: gcc-patches@gcc.gnu.org Subject: [committed][nvptx] Add warp sync at simt exit Message-ID: <20220322134108.GA32364@delia.home> MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) X-Spam-Status: No, score=-12.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Tom de Vries via Gcc-patches From: Tom de Vries Reply-To: Tom de Vries Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" Hi, Consider this code (with N defined to 1024): ... float v = 0.0; #pragma omp target map(tofrom: v) #pragma omp parallel for simd for (int i = 0 ; i < N; i++) { #pragma omp atomic update v = v + 1.0; } ... It hangs when executing on target board unix/-foffload=-misa=sm_75, using drivers 470.103.01 and 510.54 on a T400 board (sm_75). I'm tentatively identifying the problem as a bug in -muniform-simt for architectures that support Independent Thread Scheduling (sm_70 and later). The problem -muniform-simt is trying to address is to make sure that a register produced outside an openmp simd region is available when used in any lane inside an simd region. The solution is to, outside an simd region, execute in all warp lanes, thus producing consistent values in result registers in each warp thread. This approach doesn't work when executing in all warp lanes multiplies the side effects from 1 to 32 separate side effects, which is the case for atomic insns. So atomic insns are rewritten to execute only in lane 0, and if there are any results, those are propagated to the other threads in the warp. [ And likewise for system calls malloc, free, vprintf. ] Now, consider a non-atomic update: ld, add, store. The store has side effects, are those multiplied or not? Pre-sm_70 we can assume that at the end of an SIMT region, any divergent control flow has reconverged, and we have a uniform warp, executing in lock step. So: - the load will load the same value into the result register across the warp, - the add will write the same value into the result register across the warp, - the store will write the same value to the same memory location, 32 times, at once, having the result of a single store. So, no side-effect multiplication (well, at least that's the observation). Starting sm_70, the threads in a warp are no longer guaranteed to reconverge after divergence. There's a "Convergence Optimizer" that can can identify that it is safe for a warp to reconverge, but that works only as long as the code does not contain "synchronizing operations". Consequently, the ld, add, store sequence can be executed by a non-uniform warp, which means the side effects can have multiplied, and the registers are no longer guarantueed to be in sync. The atomic update in the example above is translated using an atom.cas loop, which means that we have divergence (because only one thread is allowed to succeed at a time) and the "Convergence Optimizer" doesn't reconverge probably because the atom.cas counts as a "synchronizing operation". So, it seems plausible that the root cause for the mentioned hang is the problem described above. Fix this by adding an explicit warp sync at simt exit. Note that we're assuming here that the warp will stay uniform until the next SIMT region entry. Tested on x86_64 with nvptx accelerator. Committed to trunk. Thanks, - Tom [nvptx] Add warp sync at simt exit gcc/ChangeLog: 2022-03-09 Tom de Vries PR target/104916 PR target/104783 * config/nvptx/nvptx.md (define_expand "omp_simt_exit"): Emit warp sync (or uniform warp check for mptx < 6.0). libgomp/ChangeLog: 2022-03-15 Tom de Vries PR target/104916 PR target/104783 * testsuite/libgomp.c/pr104783-2.c: New test. --- gcc/config/nvptx/nvptx.md | 4 ++++ libgomp/testsuite/libgomp.c/pr104783-2.c | 25 +++++++++++++++++++++++++ 2 files changed, 29 insertions(+) diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 1dec7caa0d1..5550ce25513 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -1881,6 +1881,10 @@ (define_expand "omp_simt_exit" "" { emit_insn (gen_omp_simt_exit (Pmode, operands[0])); + if (TARGET_PTX_6_0) + emit_insn (gen_nvptx_warpsync ()); + else + emit_insn (gen_nvptx_uniform_warp_check ()); DONE; }) diff --git a/libgomp/testsuite/libgomp.c/pr104783-2.c b/libgomp/testsuite/libgomp.c/pr104783-2.c new file mode 100644 index 00000000000..8750d915d01 --- /dev/null +++ b/libgomp/testsuite/libgomp.c/pr104783-2.c @@ -0,0 +1,25 @@ +#define N (32 * 32) + +#define TYPE float +#define VAR v +#define INIT 0.0 +#define UPDATE + 1.0 +#define EXPECTED N + +int +main (void) +{ + TYPE VAR = INIT; + #pragma omp target map(tofrom: VAR) + #pragma omp parallel for simd + for (int i = 0 ; i < N; i++) + { + #pragma omp atomic update + VAR = VAR UPDATE; + } + + if (VAR != EXPECTED) + __builtin_abort (); + + return 0; +}