From patchwork Wed Sep 21 07:45:36 2022
X-Patchwork-Submitter: Chung-Lin Tang
X-Patchwork-Id: 57835
Message-ID: <8b974d21-e288-4596-7500-277a43c92771@gmail.com>
Date: Wed, 21 Sep 2022 15:45:36 +0800
Subject: [PATCH, nvptx, 1/2] Reimplement libgomp barriers for nvptx
To: gcc-patches, Tom de Vries, Catherine Moore
From: Chung-Lin Tang

Hi Tom,

I had a patch submitted earlier, where I reported that the current way of
implementing barriers in libgomp on nvptx created a quite significant
performance drop on some SPEChpc2021 benchmarks:
https://gcc.gnu.org/pipermail/gcc-patches/2022-September/600818.html

That previous patch wasn't well received (admittedly, it was kind of a hack),
so in this patch I have (mostly) re-implemented team barriers for nvptx.
Basically, instead of trying to have the GPU do CPU-with-OS-like things that
it isn't suited for, barriers are implemented simply with bar.*
synchronization instructions.  Tasks are processed after threads have joined,
and only if team->task_count != 0.  Arguably, a little performance is
forfeited here, since earlier-arriving threads could have been used to process
tasks ahead of the other threads; but that again would require implementing
complex futex-wait/wake-like behavior, and that kind of tasking is not what
target offloading is usually used for anyway.

Implementation highlights:

1. gomp_team_barrier_wake() is now an empty function (threads never "wake"
   in the usual manner).
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() is now implemented using "bar.arrive".
4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
   the main synchronization is done using a 'bar.red' instruction.  This
   reduces the condition (team->task_count != 0) across all threads, to
   enable the task processing down below if any thread created a task.
   (This bar.red usage is what requires the second GCC patch in this series.)

This patch has been tested on x86_64/powerpc64le with nvptx offloading, using
the libgomp, ovo, omptests, and sollve_vv testsuites, all without regressions.
I also verified that the SPEChpc 2021 521.miniswp_t and 534.hpgmgfv_t
performance regressions that occurred in the GCC 12 cycle have been restored
to devel/omp/gcc-11 (OG11) branch levels.

Is this okay for trunk?  (I also suggest backporting to the GCC 12 branch,
if the performance regression can be considered a defect.)

Thanks,
Chung-Lin

libgomp/ChangeLog:

2022-09-21  Chung-Lin Tang

	* config/nvptx/bar.c (generation_to_barrier): Remove.
	(futex_wait, futex_wake, do_spin, do_wait): Remove.
	(GOMP_WAIT_H): Remove.
	(#include "../linux/bar.c"): Remove.
	(gomp_barrier_wait_end): New function.
	(gomp_barrier_wait): Likewise.
	(gomp_barrier_wait_last): Likewise.
	(gomp_team_barrier_wait_end): Likewise.
	(gomp_team_barrier_wait): Likewise.
	(gomp_team_barrier_wait_final): Likewise.
	(gomp_team_barrier_wait_cancel_end): Likewise.
	(gomp_team_barrier_wait_cancel): Likewise.
	(gomp_team_barrier_cancel): Likewise.
	* config/nvptx/bar.h (gomp_team_barrier_wake): Remove prototype,
	add new static inline function.

diff --git a/libgomp/config/nvptx/bar.c b/libgomp/config/nvptx/bar.c
index eee2107..0b958ed 100644
--- a/libgomp/config/nvptx/bar.c
+++ b/libgomp/config/nvptx/bar.c
@@ -30,137 +30,143 @@
 #include <limits.h>
 #include "libgomp.h"
 
-/* For cpu_relax.  */
-#include "doacross.h"
-
-/* Assuming ADDR is &bar->generation, return bar.  Copied from
-   rtems/bar.c.  */
+void
+gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    {
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
+			MEMMODEL_RELEASE);
+    }
+  if (bar->total > 1)
+    asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
+}
 
-static gomp_barrier_t *
-generation_to_barrier (int *addr)
+void
+gomp_barrier_wait (gomp_barrier_t *bar)
 {
-  char *bar
-    = (char *) addr - __builtin_offsetof (gomp_barrier_t, generation);
-  return (gomp_barrier_t *)bar;
+  gomp_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Implement futex_wait-like behaviour to plug into the linux/bar.c
-   implementation.  Assumes ADDR is &bar->generation.  */
+/* Like gomp_barrier_wait, except that if the encountering thread
+   is not the last one to hit the barrier, it returns immediately.
+   The intended usage is that a thread which intends to gomp_barrier_destroy
+   this barrier calls gomp_barrier_wait, while all other threads
+   call gomp_barrier_wait_last.  When gomp_barrier_wait returns,
+   the barrier can be safely destroyed.  */
 
-static inline void
-futex_wait (int *addr, int val)
+void
+gomp_barrier_wait_last (gomp_barrier_t *bar)
 {
-  gomp_barrier_t *bar = generation_to_barrier (addr);
+  /* The above described behavior matches 'bar.arrive' perfectly.  */
+  if (bar->total > 1)
+    asm ("bar.arrive 1, %0;" : : "r" (32 * bar->total));
+}
 
-  if (bar->total < 2)
-    /* A barrier with less than two threads, nop.  */
-    return;
+void
+gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
+{
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_team *team = thr->ts.team;
 
-  gomp_mutex_lock (&bar->lock);
+  bool run_tasks = (team->task_count != 0);
+  if (bar->total > 1)
+    run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
+					    (team->task_count != 0));
 
-  /* Futex semantics: only go to sleep if *addr == val.  */
-  if (__builtin_expect (__atomic_load_n (addr, MEMMODEL_ACQUIRE) != val, 0))
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
     {
-      gomp_mutex_unlock (&bar->lock);
-      return;
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
     }
 
-  /* Register as waiter.  */
-  unsigned int waiters
-    = __atomic_add_fetch (&bar->waiters, 1, MEMMODEL_ACQ_REL);
-  if (waiters == 0)
-    __builtin_abort ();
-  unsigned int waiter_id = waiters;
-
-  if (waiters > 1)
+  if (__builtin_expect (run_tasks == true, 0))
     {
-      /* Wake other threads in bar.sync.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
+      while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
+	     & BAR_TASK_PENDING)
+	gomp_barrier_handle_tasks (state);
 
-      /* Ensure that they have updated waiters.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * waiters));
+      if (bar->total > 1)
+	asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
     }
+}
 
-  gomp_mutex_unlock (&bar->lock);
-
-  while (1)
-    {
-      /* Wait for next thread in barrier.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
-
-      /* Get updated waiters.  */
-      unsigned int updated_waiters
-	= __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
-
-      /* Notify that we have updated waiters.  */
-      asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
-
-      waiters = updated_waiters;
-
-      if (waiter_id > waiters)
-	/* A wake happened, and we're in the group of woken threads.  */
-	break;
-
-      /* Continue waiting.  */
-    }
+void
+gomp_team_barrier_wait (gomp_barrier_t *bar)
+{
+  gomp_team_barrier_wait_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Implement futex_wake-like behaviour to plug into the linux/bar.c
-   implementation.  Assumes ADDR is &bar->generation.  */
+void
+gomp_team_barrier_wait_final (gomp_barrier_t *bar)
+{
+  gomp_barrier_state_t state = gomp_barrier_wait_final_start (bar);
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
+    bar->awaited_final = bar->total;
+  gomp_team_barrier_wait_end (bar, state);
+}
 
-static inline void
-futex_wake (int *addr, int count)
+bool
+gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
+				   gomp_barrier_state_t state)
 {
-  gomp_barrier_t *bar = generation_to_barrier (addr);
+  struct gomp_thread *thr = gomp_thread ();
+  struct gomp_team *team = thr->ts.team;
 
-  if (bar->total < 2)
-    /* A barrier with less than two threads, nop.  */
-    return;
+  bool run_tasks = (team->task_count != 0);
+  if (bar->total > 1)
+    run_tasks = __builtin_nvptx_bar_red_or (1, 32 * bar->total, true,
+					    (team->task_count != 0));
+  if (state & BAR_CANCELLED)
+    return true;
 
-  gomp_mutex_lock (&bar->lock);
-  unsigned int waiters = __atomic_load_n (&bar->waiters, MEMMODEL_ACQUIRE);
-  if (waiters == 0)
+  if (__builtin_expect (state & BAR_WAS_LAST, 0))
     {
-      /* No threads to wake.  */
-      gomp_mutex_unlock (&bar->lock);
-      return;
+      /* Note: BAR_CANCELLED should never be set in state here, because
+	 cancellation means that at least one of the threads has been
+	 cancelled, thus on a cancellable barrier we should never see
+	 all threads to arrive.  */
+
+      /* Next time we'll be awaiting TOTAL threads again.  */
+      bar->awaited = bar->total;
+      team->work_share_cancelled = 0;
    }
 
-  if (count == INT_MAX)
-    /* Release all threads.  */
-    __atomic_store_n (&bar->waiters, 0, MEMMODEL_RELEASE);
-  else if (count < bar->total)
-    /* Release count threads.  */
-    __atomic_add_fetch (&bar->waiters, -count, MEMMODEL_ACQ_REL);
-  else
-    /* Count has an illegal value.  */
-    __builtin_abort ();
-
-  /* Wake other threads in bar.sync.  */
-  asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
+  if (__builtin_expect (run_tasks == true, 0))
+    {
+      while (__atomic_load_n (&bar->generation, MEMMODEL_ACQUIRE)
+	     & BAR_TASK_PENDING)
+	gomp_barrier_handle_tasks (state);
 
-  /* Let them get the updated waiters.  */
-  asm volatile ("bar.sync 1, %0;" : : "r" (32 * (waiters + 1)));
+      if (bar->total > 1)
+	asm volatile ("bar.sync 1, %0;" : : "r" (32 * bar->total));
    }
 
-  gomp_mutex_unlock (&bar->lock);
+  return false;
 }
 
-/* Copied from linux/wait.h.  */
-
-static inline int do_spin (int *addr, int val)
+bool
+gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
 {
-  /* The current implementation doesn't spin.  */
-  return 1;
+  return gomp_team_barrier_wait_cancel_end (bar, gomp_barrier_wait_start (bar));
 }
 
-/* Copied from linux/wait.h.  */
-
-static inline void do_wait (int *addr, int val)
+void
+gomp_team_barrier_cancel (struct gomp_team *team)
 {
-  if (do_spin (addr, val))
-    futex_wait (addr, val);
-}
+  gomp_mutex_lock (&team->task_lock);
+  if (team->barrier.generation & BAR_CANCELLED)
+    {
+      gomp_mutex_unlock (&team->task_lock);
+      return;
+    }
+  team->barrier.generation |= BAR_CANCELLED;
+  gomp_mutex_unlock (&team->task_lock);
 
-/* Reuse the linux implementation.  */
-#define GOMP_WAIT_H 1
-#include "../linux/bar.c"
+  /* The 'exit' instruction cancels this thread and also fulfills any other
+     CTA threads waiting on barriers.  */
+  asm volatile ("exit;");
+}
diff --git a/libgomp/config/nvptx/bar.h b/libgomp/config/nvptx/bar.h
index 28bf7f4..ddda33e 100644
--- a/libgomp/config/nvptx/bar.h
+++ b/libgomp/config/nvptx/bar.h
@@ -83,10 +83,16 @@ extern void gomp_team_barrier_wait_end (gomp_barrier_t *,
 extern bool gomp_team_barrier_wait_cancel (gomp_barrier_t *);
 extern bool gomp_team_barrier_wait_cancel_end (gomp_barrier_t *,
					       gomp_barrier_state_t);
-extern void gomp_team_barrier_wake (gomp_barrier_t *, int);
 struct gomp_team;
 extern void gomp_team_barrier_cancel (struct gomp_team *);
 
+static inline void
+gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
+{
+  /* We never "wake up" threads on nvptx.  Threads wait at barrier
+     instructions till the barrier is fulfilled.  Do nothing here.  */
+}
+
 static inline gomp_barrier_state_t
 gomp_barrier_wait_start (gomp_barrier_t *bar)
 {

From patchwork Wed Sep 21 07:45:54 2022
X-Patchwork-Submitter: Chung-Lin Tang
X-Patchwork-Id: 57836
Message-ID: <16675a67-3dd2-fc62-fd38-6eaa24da66f7@gmail.com>
Date: Wed, 21 Sep 2022 15:45:54 +0800
Subject: [PATCH, nvptx, 2/2] Reimplement libgomp barriers for nvptx: bar.red instruction support in GCC
To: gcc-patches, Tom de Vries, Catherine Moore
From: Chung-Lin Tang

Hi Tom,

following up on the first patch: the new barrier implementation posted there
uses the 'bar.red' instruction.  Usually this could have been done with a
single line of inline assembly.  However, I quickly realized that because the
nvptx GCC port is implemented with all-virtual general registers, we don't
have a register constraint usable to select "predicate registers".  Since
bar.red takes predicate-typed values, it can't be created directly with
inline asm; so the simplest way of accessing it appears to be a target
builtin.
The attached patch adds bar.red instructions to the nvptx port, and
__builtin_nvptx_bar_red_* builtins to use them.  The code supports all
variations of bar.red (the and, or, and popc operations).  (This support is
used to implement the first libgomp barrier patch, so the two must be
approved together.)

Thanks,
Chung-Lin

2022-09-21  Chung-Lin Tang

gcc/ChangeLog:

	* config/nvptx/nvptx.cc (nvptx_print_operand): Add 'p' case,
	adjust comments.
	(enum nvptx_builtins): Add NVPTX_BUILTIN_BAR_RED_AND,
	NVPTX_BUILTIN_BAR_RED_OR, and NVPTX_BUILTIN_BAR_RED_POPC.
	(nvptx_expand_bar_red): New function.
	(nvptx_init_builtins): Add DEFs of
	__builtin_nvptx_bar_red_[and/or/popc].
	(nvptx_expand_builtin): Use nvptx_expand_bar_red to expand
	NVPTX_BUILTIN_BAR_RED_[AND/OR/POPC] cases.
	* config/nvptx/nvptx.md (define_c_enum "unspecv"): Add
	UNSPECV_BARRED_AND, UNSPECV_BARRED_OR, and UNSPECV_BARRED_POPC.
	(BARRED): New int iterator.
	(barred_op, barred_mode, barred_ptxtype): New int attrs.
	(nvptx_barred_<barred_op>): New define_insn.

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index 49cc681..afc3a890 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -2879,6 +2879,7 @@ nvptx_mem_maybe_shared_p (const_rtx x)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
+   p -- print a '!' for constant 0.
    x -- print a destination operand that may also be a bit bucket.  */
 
 static void
@@ -3012,6 +3013,11 @@ nvptx_print_operand (FILE *file, rtx x, int code)
	fprintf (file, "@!");
       goto common;
 
+    case 'p':
+      if (INTVAL (x) == 0)
+	fprintf (file, "!");
+      break;
+
     case 'c':
       mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -6151,9 +6157,90 @@ enum nvptx_builtins
   NVPTX_BUILTIN_CMP_SWAPLL,
   NVPTX_BUILTIN_MEMBAR_GL,
   NVPTX_BUILTIN_MEMBAR_CTA,
+  NVPTX_BUILTIN_BAR_RED_AND,
+  NVPTX_BUILTIN_BAR_RED_OR,
+  NVPTX_BUILTIN_BAR_RED_POPC,
   NVPTX_BUILTIN_MAX
 };
 
+/* Expander for 'bar.red' instruction builtins.  */
+
+static rtx
+nvptx_expand_bar_red (tree exp, rtx target,
+		      machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))
+{
+  int code = DECL_MD_FUNCTION_CODE (TREE_OPERAND (CALL_EXPR_FN (exp), 0));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+
+  if (!target)
+    target = gen_reg_rtx (mode);
+
+  rtx pred, dst;
+  rtx bar = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx nthr = expand_expr (CALL_EXPR_ARG (exp, 1),
+			  NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx cpl = expand_expr (CALL_EXPR_ARG (exp, 2),
+			 NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx redop = expand_expr (CALL_EXPR_ARG (exp, 3),
+			   NULL_RTX, SImode, EXPAND_NORMAL);
+  if (CONST_INT_P (bar))
+    {
+      if (INTVAL (bar) < 0 || INTVAL (bar) > 15)
+	{
+	  error_at (EXPR_LOCATION (exp),
+		    "barrier value must be within [0,15]");
+	  return const0_rtx;
+	}
+    }
+  else if (!REG_P (bar))
+    bar = copy_to_mode_reg (SImode, bar);
+
+  if (!CONST_INT_P (nthr) && !REG_P (nthr))
+    nthr = copy_to_mode_reg (SImode, nthr);
+
+  if (!CONST_INT_P (cpl))
+    {
+      error_at (EXPR_LOCATION (exp),
+		"complement argument must be constant");
+      return const0_rtx;
+    }
+
+  pred = gen_reg_rtx (BImode);
+  if (!REG_P (redop))
+    redop = copy_to_mode_reg (SImode, redop);
+  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, redop, GEN_INT (0))));
+  redop = pred;
+
+  rtx pat;
+  switch (code)
+    {
+    case NVPTX_BUILTIN_BAR_RED_AND:
+      dst = gen_reg_rtx (BImode);
+      pat = gen_nvptx_barred_and (dst, bar, nthr, cpl, redop);
+      break;
+    case NVPTX_BUILTIN_BAR_RED_OR:
+      dst = gen_reg_rtx (BImode);
+      pat = gen_nvptx_barred_or (dst, bar, nthr, cpl, redop);
+      break;
+    case NVPTX_BUILTIN_BAR_RED_POPC:
+      dst = gen_reg_rtx (SImode);
+      pat = gen_nvptx_barred_popc (dst, bar, nthr, cpl, redop);
+      break;
+    default:
+      gcc_unreachable ();
+    }
+  emit_insn (pat);
+  if (GET_MODE (dst) == BImode)
+    {
+      rtx tmp = gen_reg_rtx (mode);
+      emit_insn (gen_rtx_SET (tmp, gen_rtx_NE (mode, dst, GEN_INT (0))));
+      dst = tmp;
+    }
+  emit_move_insn (target, dst);
+  return target;
+}
+
 static GTY(()) tree nvptx_builtin_decls[NVPTX_BUILTIN_MAX];
 
 /* Return the NVPTX builtin for CODE.  */
@@ -6194,6 +6281,13 @@ nvptx_init_builtins (void)
   DEF (MEMBAR_GL, "membar_gl", (VOID, VOID, NULL_TREE));
   DEF (MEMBAR_CTA, "membar_cta", (VOID, VOID, NULL_TREE));
 
+  DEF (BAR_RED_AND, "bar_red_and",
+       (UINT, UINT, UINT, UINT, UINT, NULL_TREE));
+  DEF (BAR_RED_OR, "bar_red_or",
+       (UINT, UINT, UINT, UINT, UINT, NULL_TREE));
+  DEF (BAR_RED_POPC, "bar_red_popc",
+       (UINT, UINT, UINT, UINT, UINT, NULL_TREE));
+
 #undef DEF
 #undef ST
 #undef UINT
@@ -6236,6 +6330,11 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       emit_insn (gen_nvptx_membar_cta ());
       return NULL_RTX;
 
+    case NVPTX_BUILTIN_BAR_RED_AND:
+    case NVPTX_BUILTIN_BAR_RED_OR:
+    case NVPTX_BUILTIN_BAR_RED_POPC:
+      return nvptx_expand_bar_red (exp, target, mode, ignore);
+
     default: gcc_unreachable ();
     }
 }
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 8ed6850..740c4de 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -58,6 +58,9 @@
    UNSPECV_CAS_LOCAL
    UNSPECV_XCHG
    UNSPECV_ST
+   UNSPECV_BARRED_AND
+   UNSPECV_BARRED_OR
+   UNSPECV_BARRED_POPC
    UNSPECV_BARSYNC
    UNSPECV_WARPSYNC
    UNSPECV_UNIFORM_WARP_CHECK
@@ -2274,6 +2277,35 @@
   "TARGET_PTX_6_0"
   "%.\\tbar.warp.sync\\t0xffffffff;")
 
+(define_int_iterator BARRED
+  [UNSPECV_BARRED_AND
+   UNSPECV_BARRED_OR
+   UNSPECV_BARRED_POPC])
+(define_int_attr barred_op
+  [(UNSPECV_BARRED_AND "and")
+   (UNSPECV_BARRED_OR "or")
+   (UNSPECV_BARRED_POPC "popc")])
+(define_int_attr barred_mode
+  [(UNSPECV_BARRED_AND "BI")
+   (UNSPECV_BARRED_OR "BI")
+   (UNSPECV_BARRED_POPC "SI")])
+(define_int_attr barred_ptxtype
+  [(UNSPECV_BARRED_AND "pred")
+   (UNSPECV_BARRED_OR "pred")
+   (UNSPECV_BARRED_POPC "u32")])
+
+(define_insn "nvptx_barred_<barred_op>"
+  [(set (match_operand:<barred_mode> 0 "nvptx_register_operand" "=R")
+	(unspec_volatile
+	  [(match_operand:SI 1 "nvptx_nonmemory_operand" "Ri")
+	   (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")
+	   (match_operand:SI 3 "const_int_operand" "i")
+	   (match_operand:BI 4 "nvptx_register_operand" "R")]
+	  BARRED))]
+  ""
+  "\\tbar.red.<barred_op>.<barred_ptxtype> \\t%0, %1, %2, %p3%4;"
+  [(set_attr "predicable" "no")])
+
 (define_insn "nvptx_uniform_warp_check"
   [(unspec_volatile [(const_int 0)] UNSPECV_UNIFORM_WARP_CHECK)]
   ""