From patchwork Wed Jan 11 12:06:18 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Thomas Schwinge
X-Patchwork-Id: 62945
From: Thomas Schwinge
To: Richard Biener, Tom de Vries
CC: Janne Blomqvist, Alexander Monakov
Subject: [PING] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)
In-Reply-To: <87ili2p60p.fsf@euler.schwinge.homeip.net>
References: <87mtaigz3l.fsf@dem-tschwing-1.ger.mentorg.com>
 <87zgcxoa05.fsf@euler.schwinge.homeip.net>
 <87ili2p60p.fsf@euler.schwinge.homeip.net>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/28.2 (x86_64-pc-linux-gnu)
Date: Wed, 11 Jan 2023 13:06:18 +0100
Message-ID: <87cz7ll1hh.fsf@euler.schwinge.homeip.net>
List-Id: Gcc-patches mailing list

Hi!

Ping -- the '-mframe-malloc-threshold' idea, at least.

Note that while this issue originally did pop up for Fortran I/O, it's
likewise relevant for other functions that maintain big frames, for
example in newlib:

    libc/string/libc_a-memmem.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strcasestr.o:.local .align 16 .b8 %frame_ar[2064];
    libc/string/libc_a-strstr.o:.local .align 16 .b8 %frame_ar[2064];
    libm/math/libm_a-k_rem_pio2.o:.local .align 16 .b8 %frame_ar[560];

Therefore a generic solution (or, workaround if you'd like) does seem
appropriate.


Regards
 Thomas


On 2022-12-23T15:08:06+0100, I wrote:
> Hi!
>
> On 2022-11-11T15:35:44+0100, Richard Biener via Fortran wrote:
>> On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge wrote:
>>> For example, for Fortran code like:
>>>
>>>     write (*,*) "Hello world"
>>>
>>> ..., 'gfortran' creates:
>>>
>>>     struct __st_parameter_dt dt_parm.0;
>>>
>>>     try
>>>       {
>>>         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
>>>         dt_parm.0.common.line = 29;
>>>         dt_parm.0.common.flags = 128;
>>>         dt_parm.0.common.unit = 6;
>>>         _gfortran_st_write (&dt_parm.0);
>>>         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
>>>         _gfortran_st_write_done (&dt_parm.0);
>>>       }
>>>     finally
>>>       {
>>>         dt_parm.0 = {CLOBBER(eol)};
>>>       }
>>>
>>> The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
>>> really! -- there's a lot of state in Fortran I/O apparently). That's a
>>> problem for GPU execution -- here: OpenACC/nvptx -- where typically you
>>> have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
>>> GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
>>> "Use custom stacks instead of local memory for automatic storage".)
>>>
>>> Now, the Nvidia Driver tries to accommodate for such largish stack usage,
>>> and dynamically increases the per-thread stack as necessary (thereby
>>> potentially reducing parallelism) -- if it manages to understand the call
>>> graph. In case of libgfortran I/O, it evidently doesn't. Not being able
>>> to disprove existence of recursion is the common problem, as I've read.
>>> At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
>>>
>>>     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined
>>>
>>> That's still not an actual problem: if the GPU kernel's stack usage still
>>> fits into 1 KiB.
>>> Very often it does, but if, as happens in libgfortran
>>> I/O handling, there is another such 'dt_parm' put onto the stack, the
>>> stack then overflows; device-side SIGSEGV.
>>>
>>> (There is, by the way, some similar analysis by Tom de Vries in
>>> "[nvptx, openacc, openmp, testsuite]
>>> Recursive tests may fail due to thread stack limit".)
>>>
>>> Of course, you shouldn't really be doing I/O in GPU kernels, but people
>>> do like their occasional "'printf' debugging", so we ought to make that
>>> work (... without pessimizing any "normal" code).
>>>
>>> I assume that generally reducing the size of 'dt_parm' etc. is out of
>>> scope.
>>>
>>> There is a way to manually set a per-thread stack size, but it's not
>>> obvious which size to set: that size needs to work for the whole GPU
>>> kernel, and should be as low as possible (to maximize parallelism).
>>> I assume that even if GCC did an accurate call graph analysis of the GPU
>>> kernel's maximum stack usage, that still wouldn't help: that's before the
>>> PTX JIT does its own code transformations, including stack spilling.
>>>
>>> There exists a 'CU_JIT_LTO' flag to "Enable link-time optimization
>>> (-dlto) for device code". This might help, assuming that it manages to
>>> simplify the libgfortran I/O code such that the PTX JIT then understands
>>> the call graph. But: that's available only starting with recent
>>> CUDA 11.4, so not a general solution -- if it works at all, which I've
>>> not tested.
>>>
>>> Similarly, we could enable GCC's LTO for device code generation -- but
>>> that's a big project, out of scope at this time. And again, we don't
>>> know if that at all helps this case.
>>>
>>> I see a few options:
>>>
>>> (a) Figure out what it is in the libgfortran I/O implementation that
>>> causes "Stack size [...] cannot be statically determined", and re-work
>>> that code to avoid that, or even disable certain things for nvptx, if
>>> feasible.
>
>> Shrink st_parameter_dt (it's part of the ABI though, kind of). Lots of the
>> bloat is from things that are unused for simpler I/O cases (so some
>> "inheritance" could help), and lots of the bloat is from using
>> string/length pairs using char * + size_t for what looks like could be
>> encoded a lot more efficiently.
>>
>> There's probably not much low-hanging fruit.
>
> (Similar comments in Janne's email.)
>
>
> Well, as had to be expected, libgfortran I/O is really just one example,
> but the underlying problem may also be triggered in other ways (via other
> newlib/libc functions, for example).
>
> So, really a generic solution seems to be called for.
>
>>> (b) Also for GCC/OpenACC/nvptx use the GCC/OpenMP/nvptx '-msoft-stack'.
>>> I don't really want to do that however: it does introduce a bit of
>>> complexity in all the generated device code and run-time overhead that we
>>> generally would like to avoid.
>
> Directly using '-msoft-stack' isn't actually possible: it does implement
> "one stack per 32-threads warp", but for OpenACC we need "one stack per
> thread of a warp" (that is, each OpenACC 'vector' independently), and
> pre-allocating from device memory all those stacks (which may be a lot!)
> I foresee to really negatively impact overall performance?
>
>>> (c) I'm contemplating a tweak/compiler pass for transforming such large
>>> stack objects into heap allocation (during nvptx offloading compilation).
>>> 'malloc'/'free' do exist; they're slow, but that's not a problem for the
>>> code paths this is to affect. (Might also add some compile-time
>>> diagnostic, of course.) Could maybe even limit this to only be used
>>> during libgfortran compilation? This is then conceptually a bit similar
>>> to (b), but localized to relevant parts only. Has such a thing been done
>>> before in GCC, that I could build upon?
>>>
>>> Any other clever ideas?
>
>> Converting to heap allocation is difficult outside of the frontend and you
>> have to be very careful with memleaks.
>
> Heh, in fact it seems to be pretty simple! (Famous last words?) See
> "[WIP] nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold'"
> attached. What do people think about such a thing?
>
> Still to be discussed are '-Wframe-malloc-threshold' (default-on vs.
> '-Wextra'; or '-fopt-info' 'missed: [...]' or 'note: [...]' instead?),
> default value for '-mframe-malloc-threshold=[...]' (potentially different
> for GCC/nvptx target libraries build vs. user-compiled code?), etc.
>
>
>> The library is written in C and
>> I see heap allocated temporaries there but in at least one
>> place a stack one is used:
>>
>> void
>> st_endfile (st_parameter_filepos *fpp)
>> {
>> ...
>>   if (u->current_record)
>>     {
>>       st_parameter_dt dtp;
>>       dtp.common = fpp->common;
>>       memset (&dtp.u.p, 0, sizeof (dtp.u.p));
>>       dtp.u.p.current_unit = u;
>>       next_record (&dtp, 1);
>>
>> that might be a mistake though - maybe it's enough to change that
>> to a heap allocation? It might be also totally superfluous since
>> only 'u' should matter here ... (not sure if the above is the case
>> you are running into).
>
> (Have not yet looked into that; won't solve the general issue.)
>
>
> Regards
>  Thomas
-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company (Gesellschaft mit beschränkter Haftung); Managing Directors: Thomas Heurung, Frank Thürauf; Registered office: München; Commercial register: Registergericht München, HRB 106955

From 3f5524adacff23710cf1cab393a56bf23853cafa Mon Sep 17 00:00:00 2001
From: Thomas Schwinge
Date: Wed, 21 Dec 2022 21:25:19 +0100
Subject: [PATCH] [WIP] nvptx: '-mframe-malloc-threshold',
 '-Wframe-malloc-threshold'

---
 gcc/config/nvptx/nvptx.cc                     | 102 ++++++++++++++++--
 gcc/config/nvptx/nvptx.h                      |   3 +
 gcc/config/nvptx/nvptx.opt                    |  12 +++
 gcc/doc/invoke.texi                           |  16 ++-
 .../nvptx/frame-malloc-threshold-1.c          |  29 +++++
 .../nvptx/frame-malloc-threshold-2.c          |  13 +++
 .../nvptx/frame-malloc-threshold-3.c          |  14 +++
 .../nvptx/frame-malloc-threshold-4.c          |  16 +++
 .../nvptx/frame-malloc-threshold-5.c          |  15 +++
 .../nvptx/frame-malloc-threshold-6.c          |  15 +++
 .../nvptx/frame-malloc-threshold-7.c          |  15 +++
 11 files changed, 240 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
 create mode 100644 gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c

diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
index b93a253ab318..2efd70595991 100644
--- a/gcc/config/nvptx/nvptx.cc
+++ b/gcc/config/nvptx/nvptx.cc
@@ -178,6 +178,16 @@ static hash_map gang_private_shared_hmap;
 /* Global lock variable, needed for 128bit worker & gang reductions. */
 static GTY(()) tree global_lock_var;
 
+/* True if any function 'has_malloc_frame'.
+   Because of 'nvptx_name_replacement', we can't just:
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_FREE));
+       nvptx_record_fndecl (builtin_decl_explicit (BUILT_IN_MALLOC));
+   ..., but instead have to track them individually.
+*/
+static bool need_free_malloc_decl;
+static bool have_free_decl;
+static bool have_malloc_decl;
+
 /* True if any function references __nvptx_stacks. */
 static bool need_softstack_decl;
 static bool have_softstack_decl;
@@ -976,6 +986,11 @@ write_fn_marker (std::stringstream &s, bool is_defn, bool globalize,
     s << " GLOBAL";
   s << " FUNCTION " << (is_defn ? "DEF: " : "DECL: ");
   s << name << "\n";
+
+  if (strcmp (name, "free") == 0)
+    have_free_decl = true;
+  else if (strcmp (name, "malloc") == 0)
+    have_malloc_decl = true;
 }
 
 /* Emit a linker marker for a variable decl or defn. */
@@ -1231,22 +1246,66 @@ nvptx_maybe_record_fnsym (rtx sym)
     nvptx_record_needed_fndecl (decl);
 }
 
+//TODO
 /* Emit a local array to hold some part of a conventional stack frame
    and initialize REGNO to point to it. If the size is zero, it'll
    never be valid to dereference, so we can simply initialize to
    zero. */
 
 static void
-init_frame (FILE *file, int regno, unsigned align, unsigned size)
+init_frame (FILE *file, int regno, int align, HOST_WIDE_INT size)
 {
-  if (size)
-    fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n",
-             align, reg_names[regno], size);
   fprintf (file, "\t.reg.u%d %s;\n",
            POINTER_SIZE, reg_names[regno]);
-  fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
-                  : "\tmov.u%d %s, 0;\n"),
-           POINTER_SIZE, reg_names[regno], reg_names[regno]);
+
+  if (regno == FRAME_POINTER_REGNUM
+      && ((unsigned HOST_WIDE_INT) size
+          >= (unsigned HOST_WIDE_INT) nvptx_frame_malloc_threshold))
+    {
+      warning_at (DECL_SOURCE_LOCATION (current_function_decl),
+                  OPT_Wframe_malloc_threshold,
+                  "using %<malloc%> for frame with size of %wu bytes", size);
+
+      /*
+         (2022-12-21, v12.0) states that in addition to the "in-kernel
+         'malloc()' function" there also exists an "in-kernel
+         '__nv_aligned_device_malloc()' function", where "the address of the
+         allocated memory will be a multiple of 'align'". However that's not
+         documented on
+
+         (2022-12-21, v12.0), so we shall not use that function. */
+      /*
+         (2022-12-21, v12.0) does not, but
+
+         (2022-12-21, v12.0) does state that the pointer returned by
+         "in-kernel 'malloc()' [...] is guaranteed to be aligned to a
+         16-byte boundary". */
+      if (align > 16)
+        sorry ("unfulfilled %d bytes alignment for frame", align);
+
+      /* We don't need to support 'realloc', so instead of newlib 'malloc'
+         directly use the PTX 'malloc'. */
+      fprintf (file,
+               "\t{\n"
+               "\t .param .u64 %%ptr;\n"
+               "\t .param .u64 %%size;\n"
+               "\t st.param.u64 [%%size], " HOST_WIDE_INT_PRINT_DEC ";\n"
+               "\t call (%%ptr), malloc, (%%size);\n"
+               "\t ld.param.u64 %s, [%%ptr];\n"
+               "\t}\n",
+               size, reg_names[regno]);
+      cfun->machine->has_malloc_frame = true;
+      need_free_malloc_decl = true;
+    }
+  else
+    {
+      if (size)
+        fprintf (file, "\t.local .align %d .b8 %s_ar[" HOST_WIDE_INT_PRINT_DEC "];\n",
+                 align, reg_names[regno], size);
+      fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
+                      : "\tmov.u%d %s, 0;\n"),
+               POINTER_SIZE, reg_names[regno], reg_names[regno]);
+    }
 }
 
 /* Emit soft stack frame setup sequence. */
@@ -1744,12 +1803,22 @@ nvptx_output_set_softstack (unsigned src_regno)
     }
   return "";
 }
+
 /* Output a return instruction. Also copy the return value to its outgoing
    location. */
 
 const char *
 nvptx_output_return (void)
 {
+  if (cfun->machine->has_malloc_frame)
+    fprintf (asm_out_file,
+             "\t{\n"
+             "\t .param .u64 %%ptr;\n"
+             "\t st.param.u64 [%%ptr], %s;\n"
+             "\t call free, (%%ptr);\n"
+             "\t}\n",
+             reg_names[FRAME_POINTER_REGNUM]);
+
   machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
   if (mode != VOIDmode)
@@ -4470,8 +4539,8 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
       rtx_code_label *label = NULL;
 
       empty = false;
-      /* The frame size might not be DImode compatible, but the frame
-         array's declaration will be. So it's ok to round up here. */
+      /* The frame size might not be DImode-compatible, but the actual frame
+         allocated by 'init_frame' will be. So it's ok to round up here. */
       fs = (fs + GET_MODE_SIZE (DImode) - 1) / GET_MODE_SIZE (DImode);
       /* Detect single iteration loop. */
       if (fs == 1)
@@ -5989,6 +6058,21 @@ write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 static void
 nvptx_file_end (void)
 {
+  if (need_free_malloc_decl)
+    {
+      if (!have_free_decl)
+        {
+          write_fn_marker (func_decls, false, true, "free");
+          func_decls << ".extern .func free (.param .b64 %ptr);\n";
+        }
+      if (!have_malloc_decl)
+        {
+          write_fn_marker (func_decls, false, true, "malloc");
+          func_decls
+            << ".extern .func (.param .b64 %ptr) malloc (.param .b64 %size);\n";
+        }
+    }
+
   hash_table::iterator iter;
   tree decl;
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index bc1021a80317..82d695551090 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -214,6 +214,8 @@ struct nvptx_args {
 
 #define TRAMPOLINE_SIZE 32
 #define TRAMPOLINE_ALIGNMENT 256
+
+#define NVPTX_FRAME_MALLOC_THRESHOLD_INIT 257
 
 /* We don't run reload, so this isn't actually used, but it still needs to be
    defined. Showing an argp->fp elimination also stops
@@ -244,6 +246,7 @@ struct GTY(()) machine_function
   bool is_varadic; /* This call is varadic */
   bool has_varadic; /* Current function has a varadic call. */
   bool has_chain; /* Current function has outgoing static chain. */
+  bool has_malloc_frame;
   bool has_softstack; /* Current function has a soft stack frame. */
   bool has_simtreg; /* Current function has an OpenMP SIMD region. */
   int num_args; /* Number of args of current call. */
diff --git a/gcc/config/nvptx/nvptx.opt b/gcc/config/nvptx/nvptx.opt
index 71d3b68510bd..6ccd3defc776 100644
--- a/gcc/config/nvptx/nvptx.opt
+++ b/gcc/config/nvptx/nvptx.opt
@@ -28,6 +28,18 @@ Target RejectNegative Mask(ABI64)
 Ignored, but preserved for backward compatibility. Only 64-bit ABI is
 supported.
 
+mframe-malloc-threshold=
+Target Joined RejectNegative Host_Wide_Int ByteSize Var(nvptx_frame_malloc_threshold) Init(NVPTX_FRAME_MALLOC_THRESHOLD_INIT)
+-mframe-malloc-threshold=<byte-size>	When the frame size exceeds <byte-size>, frame allocation switches from '.local' memory to 'malloc'.
+
+mno-frame-malloc-threshold
+Target Alias(mframe-malloc-threshold=,18446744073709551615EiB,none)
+Always use '.local' memory for frame allocation. Equivalent to -mframe-malloc-threshold=<SIZE_MAX> or larger.
+
+Wframe-malloc-threshold
+Target Warning
+Warn when the threshold is reached where frame allocation switches from '.local' memory to 'malloc'.
+
 mmainkernel
 Target RejectNegative
 Link in code for a __main kernel.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 471309dfacfe..e3b6ea0fe4b8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1179,7 +1179,9 @@ Objective-C and Objective-C++ Dialects}.
 -march=@var{arch} -mbmx -mno-bmx -mcdx -mno-cdx}
 
 @emph{Nvidia PTX Options}
-@gccoptlist{-m64 -mmainkernel -moptimize}
+@gccoptlist{-m64 @gol
+-mframe-malloc-threshold=@var{byte-size} @gol
+-mmainkernel -moptimize}
 
 @emph{OpenRISC Options}
 @gccoptlist{-mboard=@var{name} -mnewlib -mhard-mul -mhard-div @gol
@@ -28367,6 +28369,18 @@ This option sets the values of the preprocessor macros
 for instance, for @samp{3.1} the macros have the values @samp{3} and
 @samp{1}, respectively.
 
+@item -mframe-malloc-threshold=@var{byte-size}
+@opindex mframe-malloc-threshold=
+@opindex mno-frame-malloc-threshold
+TODO
+
+This is not relevant if @code{-msoft-stack} is enabled.
+
+@option{-mframe-malloc-threshold=TODO} is enabled by default.
+This may be disabled either by specifying
+@var{byte-size} of @samp{SIZE_MAX} or more or by
+@option{-mno-frame-malloc-threshold}.
+
 @item -mmainkernel
 @opindex mmainkernel
 Link in code for a __main kernel. This is for stand-alone instead of
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
new file mode 100644
index 000000000000..b16c17bfdf99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-1.c
@@ -0,0 +1,29 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+/* PTX-provided 'free', 'malloc'; cf. 'nvptx_name_replacement'. */
+void ptx_free (void *) __asm__ ("free");
+void *ptx_malloc (__SIZE_TYPE__) __asm__ ("malloc");
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[1234];
+
+  ptx_malloc (5);
+
+  ptx_free (ptx_malloc (1));
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 2 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 3 } }
+*/
+
+/* Of the implicit (via 'need_free_malloc_decl') and explicit declarations of
+   'free', 'malloc', only one is emitted each:
+   { dg-final { scan-assembler-times {(?n)\.extern .* free .*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)\.extern .* malloc .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
new file mode 100644
index 000000000000..2f6a919eb1f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-2.c
@@ -0,0 +1,13 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
new file mode 100644
index 000000000000..7434132b2ad5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-3.c
@@ -0,0 +1,14 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[256];
+}
+
+/* We don't exceed the default '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
new file mode 100644
index 000000000000..c4068ab7ad23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=32 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+/* { dg-warning {using 'malloc' for frame with size of [0-9]+ bytes} {} { target *-*-* } .-1 } */
+{
+  char a[32];
+}
+
+/* We exceed the specified '-mframe-malloc-threshold=[...]'.
+   { dg-final { scan-assembler-not {%frame_ar} } }
+   { dg-final { scan-assembler-times {(?n)call free,.*;} 1 } }
+   { dg-final { scan-assembler-times {(?n)call .*, malloc, .*;} 1 } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
new file mode 100644
index 000000000000..cc262427b03c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-5.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=1249 } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
new file mode 100644
index 000000000000..72017ca2f439
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-6.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mframe-malloc-threshold=2KiB } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We don't exceed the specified '-mframe-malloc-threshold=[...]'.
+/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
diff --git a/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
new file mode 100644
index 000000000000..b2f85a55f050
--- /dev/null
+++ b/gcc/testsuite/gcc.target/nvptx/frame-malloc-threshold-7.c
@@ -0,0 +1,15 @@
+/* { dg-do assemble } */
+/* { dg-options {-save-temps -O0} } */
+/* { dg-additional-options -mno-frame-malloc-threshold } */
+/* { dg-additional-options -Wframe-malloc-threshold } */
+
+int f (void)
+{
+  char a[1234];
+}
+
+/* We'll never exceed the specified unlimited '-mframe-malloc-threshold=[...]'.
+/* { dg-final { scan-assembler-times {(?n)cvta\.local\.u64 %frame, %frame_ar;} 1 } }
+   { dg-final { scan-assembler-not {free} } }
+   { dg-final { scan-assembler-not {malloc} } }
+*/
-- 
2.35.1
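
For illustration only (not part of the patch): a minimal C sketch of what the transformation roughly amounts to when viewed at source level. It assumes the '>=' comparison used in the patched 'init_frame' and the WIP default of 257 from 'NVPTX_FRAME_MALLOC_THRESHOLD_INIT'. The names 'frame_uses_malloc', 'sum_with_heap_frame' and 'self_check' are hypothetical. The real change is made in the PTX emitted by the nvptx back end, and unlike this sketch the WIP patch emits no check of the 'malloc' result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper mirroring the comparison in the patched 'init_frame':
   the frame moves to 'malloc' once its size reaches the threshold
   (note '>=', so a frame equal to the threshold is already affected).  */
bool
frame_uses_malloc (size_t frame_size, size_t threshold)
{
  return frame_size >= threshold;
}

/* Source-level picture of a function whose 1234-byte frame was moved to the
   heap: one 'malloc' at function entry, one 'free' before the return.  */
int
sum_with_heap_frame (void)
{
  enum { FRAME_SIZE = 1234 };

  /* Corresponds to the 'call (%ptr), malloc, (%size)' sequence.  */
  char *frame = malloc (FRAME_SIZE);
  if (frame == NULL)
    return -1;  /* Assumption of this sketch; the WIP patch has no such check.  */

  /* Stand-in for the function body using its frame.  */
  memset (frame, 1, FRAME_SIZE);
  int sum = 0;
  for (size_t i = 0; i < FRAME_SIZE; i++)
    sum += frame[i];

  /* Corresponds to the 'call free, (%ptr)' emitted before the return.  */
  free (frame);
  return sum;
}

/* The cases exercised by frame-malloc-threshold-{2,3,4}.c, expressed with the
   hypothetical helper above (sizes as declared in the tests).  */
void
self_check (void)
{
  assert (frame_uses_malloc (1234, 257));   /* default threshold, large frame */
  assert (!frame_uses_malloc (256, 257));   /* default threshold, small frame */
  assert (frame_uses_malloc (32, 32));      /* equal to the threshold */
}
```

The equal-to-threshold case is worth noting: the option help text says "exceeds", while the comparison in 'init_frame' is '>=', which is what frame-malloc-threshold-4.c relies on.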