c++, dyninit: Optimize C++ dynamic initialization by constants into DECL_INITIAL adjustment [PR102876]

Message ID 20211104094250.GR304296@tucnak
State Under Review
Series c++, dyninit: Optimize C++ dynamic initialization by constants into DECL_INITIAL adjustment [PR102876]

Commit Message

Jakub Jelinek Nov. 4, 2021, 9:42 a.m. UTC
  Hi!

When users don't use constexpr everywhere in the initialization of namespace
scope non-comdat vars and the initializers aren't constant when the FE is
looking at them, the FE performs dynamic initialization of those variables.
But after inlining and some constant propagation, we often end up with
just storing constants into those variables in the _GLOBAL__sub_I_*
constructor.
C++ gives us permission to change some of that dynamic initialization
back into static initialization - https://eel.is/c++draft/basic.start.static#3
For classes that need (dynamic) construction, I believe accessing some var
from other dynamic construction before that var is constructed is UB, but
as the example in the above-mentioned spot of the C++ standard shows:
inline double fd() { return 1.0; }
extern double d1;
double d2 = d1;     // unspecified:
                    // either statically initialized to 0.0 or
                    // dynamically initialized to 0.0 if d1 is
                    // dynamically initialized, or 1.0 otherwise
double d1 = fd();   // either initialized statically or dynamically to 1.0
some vars can be used before they are dynamically initialized and the
implementation can still optimize those into static initialization.

The following patch attempts to optimize some such cases back into
DECL_INITIAL initializers and, where possible (originally const vars without
mutable members), put those vars back into .rodata etc.
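E.g. (a made-up illustration, not from the patch or the testsuite) something
like:
int baz () { return 42; }	// not constexpr
int a = baz ();			// FE emits dynamic init; folds to a = 42
const int b = baz () + 1;	// ditto; b could additionally go to .rodata
is dynamically initialized as far as the FE is concerned, but after inlining
the _GLOBAL__sub_I_* body typically ends up just storing the constants 42
and 43.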

Because we put all dynamic initialization from a single TU into one single
function (well, originally one function per priority, but those are typically
inlined back into one function), we can either take a simpler approach
(from the PR it seems that is what LLVM uses) where either we manage to
optimize all dynamic initializers in the TU into constants, or none of them,
or, by adding some markup - in the form of a pair of internal functions in
this patch - around each dynamic initialization that can be optimized,
we can optimize each dynamic initialization separately.

The patch adds a new pass that is invoked (through a gate check) only on
DECL_ARTIFICIAL DECL_STATIC_CONSTRUCTOR functions, and looks there for
sequences like:
  .DYNAMIC_INIT_START (&b, 0);
  b = 1;
  .DYNAMIC_INIT_END (&b);
or
  .DYNAMIC_INIT_START (&e, 1);
  # DEBUG this => &e.f
  MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
  MEM[(struct S *)&e + 4B].a = 1;
  MEM[(struct S *)&e + 4B].b = 2;
  MEM[(struct S *)&e + 4B].c = 3;
  # DEBUG BEGIN_STMT
  MEM[(struct S *)&e + 4B].d = 6;
  # DEBUG this => NULL
  .DYNAMIC_INIT_END (&e);
(where between the pair of markers everything is either debug stmts or
stores of constants into the variables or their parts).
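For reference, the second sequence corresponds to source along the lines of
the testcase at the end of the patch:
struct S { S () : a (1), b (2), c (3), d (4) { d += 2; } int a, b, c, d; };
struct T { int e; S f; int g; };
const T e = { 5, S (), 6 };	// S::S () is inlined into e's dynamic init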
The pass needs to run late enough so that after IPA all the needed
constant propagation and perhaps loop unrolling has been done; on the other
hand it should run early enough so that, if we can't optimize it, we can
remove those .DYNAMIC_INIT* internal calls that could otherwise prevent some
further optimizations (they have an fnspec such that they pretend to read
the corresponding variable).

Currently the optimization is only able to handle cases where the whole
variable is stored in a single store (typically scalar variables), or it
uses the native_{encode,interpret}* infrastructure to create or update
the CONSTRUCTOR.  This means that except for the first category, we can't
right now handle unions or anything that needs relocations (vars containing
pointers to other vars or references).
I think it would be nice to incrementally add before the native_* fallback
some attempt to just create or update a CONSTRUCTOR if possible.  If we only
see var.a.b.c.d[10].e = const; style of stores, this shouldn't be that hard
as the whole access path is recorded there and we'd just need to decide what
to do with unions if two or more union members are accessed.  And do a deep
copy of the CONSTRUCTOR and try to efficiently update the copy afterwards
(the CONSTRUCTORs should be sorted on increasing offsets of the
members/elements, so doing an ordered vec insertion might not be the best
idea).  But MEM_REFs complicate this: part or all of the access path
is lost.  For non-unions, in most cases we could try to guess which field
it is (do we have some existing function to do that?  I vaguely remember
we've been doing that in some cases in the past in some folding but stopped
doing so) but with unions it will be harder or impossible.
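For the non-union case, what I have in mind is roughly something like this
(just a sketch, not part of the patch; it assumes the FIELD_DECLs and the
constant values have already been collected from the stores between the
markers):

/* Sketch only: build a CONSTRUCTOR for TYPE from collected FIELD_DECLs
   and constant values.  A real version would keep the elements sorted
   by offset and handle nested access paths, arrays and unions.  */
static tree
build_ctor_from_stores (tree type, vec<tree> &fields, vec<tree> &values)
{
  vec<constructor_elt, va_gc> *elts = NULL;
  for (unsigned i = 0; i < fields.length (); i++)
    CONSTRUCTOR_APPEND_ELT (elts, fields[i], values[i]);
  return build_constructor (type, elts);
}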

As the middle-end can't easily differentiate between const variables with
and without mutable members (both will have TREE_READONLY clear on the
var decl because of the dynamic initialization and TYPE_READONLY set on
the type), the patch remembers this in an extra argument to
.DYNAMIC_INIT_START (true if it is ok to set TREE_READONLY back on the var
decl if the var's dynamic initialization could be optimized into DECL_INITIAL).
Thinking more about it, I'm not sure about const vars without mutable
members but with non-trivial destructors: do we register their dtors
dynamically through __cxa_atexit in the ctors (which would mean the
optimization currently punts on them), or not (in which case we could put
them into .rodata even when the dtor will perhaps want to write to them)?
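E.g. (made-up example) for something like:
int val ();			// not usable for constant evaluation here
struct D { int x; ~D (); };	// non-trivial dtor
const D d = { val () };
even if the store of val ()'s result is optimized into DECL_INITIAL, ~D still
has to run at exit and might write to d, so whether .rodata placement is safe
depends on how and where the dtor is registered.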

Anyway, I forgot to do another set of bootstraps that would gather statistics
on how many vars were optimized, so I'm just trying to figure it out from the
sizes of the _GLOBAL__sub_I_* functions:

# Without patch, x86_64-linux cc1plus
$ readelf -Ws obj50/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
13934
# With the patch, x86_64-linux cc1plus
$ readelf -Ws obj52/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
6966
# Without patch, i686-linux cc1plus
$ readelf -Ws obj51/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
24158
# With the patch, i686-linux cc1plus
$ readelf -Ws obj53/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
10536

That seems like a huge improvement, although on a closer look, most of that
saving is from just one TU:
$ readelf -Ws obj50/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
6693
$ readelf -Ws obj52/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
1
$ readelf -Ws obj51/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
13001
$ readelf -Ws obj53/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
1
So, the shrinking on all the dynamic initialization functions except
i386-options.o is:
7241 -> 6965 for 64-bit and
11157 -> 10535 for 32-bit.
I will try to use constexpr for i386-options.c later today.

Another optimization that could be useful, though I'm not sure it can be done
easily: if before expansion the _GLOBAL__sub_I_* functions end up with
nothing in their bodies (those are the 1-byte functions on x86), perhaps
either don't emit those functions at all or at least don't register them in
.init_array etc. so that cycles aren't wasted at runtime:
$ readelf -Ws obj50/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
4
$ readelf -Ws obj52/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
87
$ readelf -Ws obj51/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
4
$ readelf -Ws obj53/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
84

Also, I wonder whether I should add some new -f* option to control the
optimization, or whether doing it always at -O+ with
-fdisable-tree-pass-dyninit as the way to disable it is good enough, and
whether the hardcoded 1024 constant (an upper bound on the optimized size so
that we don't spend huge amounts of compile time trying to optimize
initializers of gigabyte sizes) shouldn't be a param.

Bootstrapped/regtested on x86_64-linux and i686-linux.

2021-11-04  Jakub Jelinek  <jakub@redhat.com>

	PR c++/102876
gcc/
	* internal-fn.def (DYNAMIC_INIT_START, DYNAMIC_INIT_END): New internal
	functions.
	* internal-fn.c (expand_DYNAMIC_INIT_START, expand_DYNAMIC_INIT_END):
	New functions.
	* tree-pass.h (make_pass_dyninit): Declare.
	* passes.def (pass_dyninit): Add after dce4.
	* gimple-ssa-store-merging.c (pass_data_dyninit): New variable.
	(class pass_dyninit): New type.
	(pass_dyninit::execute): New method.
	(make_pass_dyninit): New function.
gcc/cp/
	* decl2.c (one_static_initialization_or_destruction): Emit
	.DYNAMIC_INIT_START and .DYNAMIC_INIT_END internal calls around
	dynamic initialization of variables that don't need a guard.
gcc/testsuite/
	* g++.dg/opt/init3.C: New test.


	Jakub
  

Comments

Richard Biener Nov. 4, 2021, 11:13 a.m. UTC | #1
On Thu, 4 Nov 2021, Jakub Jelinek wrote:

> Hi!
> 
> When users don't use constexpr everywhere in initialization of namespace
> scope non-comdat vars and the initializers aren't constant when FE is
> looking at them, the FE performs dynamic initialization of those variables.
> But after inlining and some constant propagation, we often end up with
> just storing constants into those variables in the _GLOBAL__sub_I_*
> constructor.
> C++ gives us permission to change some of that dynamic initialization
> back into static initialization - https://eel.is/c++draft/basic.start.static#3
> For classes that need (dynamic) construction, I believe access to some var
> from other dynamic construction before that var is constructed is UB, but
> as the example in the above mentioned spot of C++:
> inline double fd() { return 1.0; }
> extern double d1;
> double d2 = d1;     // unspecified:
>                     // either statically initialized to 0.0 or
>                     // dynamically initialized to 0.0 if d1 is
>                     // dynamically initialized, or 1.0 otherwise
> double d1 = fd();   // either initialized statically or dynamically to 1.0
> some vars can be used before they are dynamically initialized and the
> implementation can still optimize those into static initialization.
> 
> The following patch attempts to optimize some such cases back into
> DECL_INITIAL initializers and where possible (originally const vars without
> mutable members) put those vars back to .rodata etc.
> 
> Because we put all dynamic initialization from a single TU into one single
> function (well, originally one function per priority but typically inline
> those back into one function), we can either have a simpler approach
> (from the PR it seems that is what LLVM uses) where either we manage to
> optimize all dynamic initializers into constant in the TU, or nothing,
> or by adding some markup - in the form of a pair of internal functions in
> this patch - around each dynamic initialization that can be optimized,
> we can optimize each dynamic initialization separately.
> 
> The patch adds a new pass that is invoked (through gate check) only on
> DECL_ARTIFICIAL DECL_STATIC_CONSTRUCTOR functions, and looks there for
> sequences like:
>   .DYNAMIC_INIT_START (&b, 0);
>   b = 1;
>   .DYNAMIC_INIT_END (&b);
> or
>   .DYNAMIC_INIT_START (&e, 1);
>   # DEBUG this => &e.f
>   MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
>   MEM[(struct S *)&e + 4B].a = 1;
>   MEM[(struct S *)&e + 4B].b = 2;
>   MEM[(struct S *)&e + 4B].c = 3;
>   # DEBUG BEGIN_STMT
>   MEM[(struct S *)&e + 4B].d = 6;
>   # DEBUG this => NULL
>   .DYNAMIC_INIT_END (&e);
> (where between the pair of markers everything is either debug stmts or
> stores of constants into the variables or their parts).
> The pass needs to be done late enough so that after IPA all the needed
> constant propagation and perhaps loop unrolling is done, on the other
> side should be early enough so that if we can't optimize it, we can
> remove those .DYNAMIC_INIT* internal calls that could prevent some
> further optimizations (they have fnspec such that they pretend to read
> the corresponding variable).
> 
> Currently the optimization is only able to optimize cases where the whole
> variable is stored in a single store (typically scalar variables), or
> uses the native_{encode,interpret}* infrastructure to create or update
> the CONSTRUCTOR.  This means that except for the first category, we can't
> right now handle unions or anything that needs relocations (vars containing
> pointers to other vars or references).
> I think it would be nice to incrementally add before the native_* fallback
> some attempt to just create or update a CONSTRUCTOR if possible.  If we only
> see var.a.b.c.d[10].e = const; style of stores, this shouldn't be that hard
> as the whole access path is recorded there and we'd just need to decide what
> to do with unions if two or more union members are accessed.  And do a deep
> copy of the CONSTRUCTOR and try to efficiently update the copy afterwards
> (the CONSTRUCTORs should be sorted on increasing offsets of the
> members/elements, so doing an ordered vec insertion might not be the best
> idea).  But MEM_REFs complicate this, parts or all of the access path
> is lost.  For non-unions in most cases we could try to guess which field
> it is (do we have some existing function to do that?  I vaguely remember
> we've been doing that in some cases in the past in some folding but stopped
> doing so) but with unions it will be harder or impossible.
> 
> As the middle-end can't easily differentiate between const variables without
> and with mutable members, both of those will have TREE_READONLY on the
> var decl clear (because of dynamic initialization) and TYPE_READONLY set
> on the type, the patch remembers this in an extra argument to
> .DYNAMIC_INIT_START (true if it is ok to set TREE_READONLY on the var decl
> back if the var dynamic initialization could be optimized into DECL_INITIAL).
> Thinking more about it, I'm not sure about const vars without mutable
> members with non-trivial destructors, do we register their dtors dynamically
> through __cxa_atexit in the ctors (that would mean the optimization
> currently punts on them), or not (in that case we could put it into .rodata
> even when the dtor will want to perhaps write to them)?
> 
> Anyway, forgot to do another set of bootstraps with gathering statistics how
> many vars were optimized, so just trying to figure it out from the sizes of
> _GLOBAL__sub_I_* functions:
> 
> # Without patch, x86_64-linux cc1plus
> $ readelf -Ws obj50/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 13934
> # With the patch, x86_64-linux cc1plus
> $ readelf -Ws obj52/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 6966
> # Without patch, i686-linux cc1plus
> $ readelf -Ws obj51/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 24158
> # With the patch, i686-linux cc1plus
> $ readelf -Ws obj53/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 10536
> 
> That seems like a huge improvement, although on a closer look, most of that
> saving is from just one TU:
> $ readelf -Ws obj50/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 6693
> $ readelf -Ws obj52/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 1
> $ readelf -Ws obj51/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 13001
> $ readelf -Ws obj53/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 1
> So, the shrinking on all the dynamic initialization functions except
> i386-options.o is:
> 7241 -> 6965 for 64-bit and
> 11157 -> 10535 for 32-bit.
> Will try to use constexpr for i386-options.c later today.
> 
> Another optimization that could be useful but not sure if it can be easily
> done is if we before expansion of the _GLOBAL__sub_I_* functions end up with
> nothing in their body (that's those 1 byte functions on x86) perhaps either
> not emit those functions at all or at least don't register them in
> .init_array etc. so that cycles aren't wasted at runtime:
> $ readelf -Ws obj50/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 4
> $ readelf -Ws obj52/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 87
> $ readelf -Ws obj51/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 4
> $ readelf -Ws obj53/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 84
> 
> Also, wonder if I should add some new -f* option to control the optimization
> or doing it always at -O+ with -fdisable-tree-pass-dyninit as a way to
> disable it is good enough, and whether the 1024 hardcoded constant
> (upper bound on optimized size so that we don't spend huge amounts of
> compile time trying to optimize initializers of gigabyte sizes) shouldn't be
> a param.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux.

As a general comment I wonder whether doing this fully in the C++
frontend leveraging the constexpr support is a better approach, esp.
before we end up putting all initializers into a single function ...
even partly constexpr evaluating things might help in some cases.

On that note it might be worth experimenting with keeping each
initializer in a separate function until IPA where IPA could
then figure out dependences via IPA REFs (with LTO on the whole
program), a) diagnosing inter-CU undefined behavior, b) "fixing"
things by making sure the initialization happens init-before-use
(when there's no cycle), c) with local analysis do the promotion
to READONLY at IPA time and elide the function.

I think most PRs really ask for more optimistic constexpr
evaluation on the frontend side.

Richard.

> 2021-11-04  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR c++/102876
> gcc/
> 	* internal-fn.def (DYNAMIC_INIT_START, DYNAMIC_INIT_END): New internal
> 	functions.
> 	* internal-fn.c (expand_DYNAMIC_INIT_START, expand_DYNAMIC_INIT_END):
> 	New functions.
> 	* tree-pass.h (make_pass_dyninit): Declare.
> 	* passes.def (pass_dyninit): Add after dce4.
> 	* gimple-ssa-store-merging.c (pass_data_dyninit): New variable.
> 	(class pass_dyninit): New type.
> 	(pass_dyninit::execute): New method.
> 	(make_pass_dyninit): New function.
> gcc/cp/
> 	* decl2.c (one_static_initialization_or_destruction): Emit
> 	.DYNAMIC_INIT_START and .DYNAMIC_INIT_END internal calls around
> 	dynamic initialization of variables that don't need a guard.
> gcc/testsuite/
> 	* g++.dg/opt/init3.C: New test.
> 
> --- gcc/internal-fn.def.jj	2021-11-02 09:05:47.029664211 +0100
> +++ gcc/internal-fn.def	2021-11-02 12:40:38.702436113 +0100
> @@ -367,6 +367,10 @@ DEF_INTERNAL_FN (PHI, 0, NULL)
>     automatic variable.  */
>  DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  
> +/* Mark start and end of dynamic initialization of a variable.  */
> +DEF_INTERNAL_FN (DYNAMIC_INIT_START, ECF_LEAF | ECF_NOTHROW, ". r ")
> +DEF_INTERNAL_FN (DYNAMIC_INIT_END, ECF_LEAF | ECF_NOTHROW, ". r ")
> +
>  /* DIM_SIZE and DIM_POS return the size of a particular compute
>     dimension and the executing thread's position within that
>     dimension.  DIM_POS is pure (and not const) so that it isn't
> --- gcc/internal-fn.c.jj	2021-11-02 09:05:47.029664211 +0100
> +++ gcc/internal-fn.c	2021-11-02 12:40:38.703436099 +0100
> @@ -3485,6 +3485,16 @@ expand_CO_ACTOR (internal_fn, gcall *)
>    gcc_unreachable ();
>  }
>  
> +static void
> +expand_DYNAMIC_INIT_START (internal_fn, gcall *)
> +{
> +}
> +
> +static void
> +expand_DYNAMIC_INIT_END (internal_fn, gcall *)
> +{
> +}
> +
>  /* Expand a call to FN using the operands in STMT.  FN has a single
>     output operand and NARGS input operands.  */
>  
> --- gcc/tree-pass.h.jj	2021-10-28 11:29:01.891721153 +0200
> +++ gcc/tree-pass.h	2021-11-02 14:15:00.139185088 +0100
> @@ -445,6 +445,7 @@ extern gimple_opt_pass *make_pass_cse_re
>  extern gimple_opt_pass *make_pass_cse_sincos (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_optimize_bswap (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_store_merging (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_dyninit (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_optimize_widening_mul (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_warn_function_return (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_warn_function_noreturn (gcc::context *ctxt);
> --- gcc/passes.def.jj	2021-11-01 14:37:06.685853324 +0100
> +++ gcc/passes.def	2021-11-02 14:23:47.836715821 +0100
> @@ -261,6 +261,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_tsan);
>        NEXT_PASS (pass_dse);
>        NEXT_PASS (pass_dce);
> +      NEXT_PASS (pass_dyninit);
>        /* Pass group that runs when 1) enabled, 2) there are loops
>  	 in the function.  Make sure to run pass_fix_loops before
>  	 to discover/remove loops before running the gate function
> --- gcc/gimple-ssa-store-merging.c.jj	2021-09-01 12:06:19.488211919 +0200
> +++ gcc/gimple-ssa-store-merging.c	2021-11-03 18:02:55.190015359 +0100
> @@ -170,6 +170,8 @@
>  #include "optabs-tree.h"
>  #include "dbgcnt.h"
>  #include "selftest.h"
> +#include "cgraph.h"
> +#include "varasm.h"
>  
>  /* The maximum size (in bits) of the stores this pass should generate.  */
>  #define MAX_STORE_BITSIZE (BITS_PER_WORD)
> @@ -5465,6 +5467,334 @@ pass_store_merging::execute (function *f
>    return 0;
>  }
>  
> +/* Pass to optimize C++ dynamic initialization.  */
> +
> +const pass_data pass_data_dyninit = {
> +  GIMPLE_PASS,     /* type */
> +  "dyninit",	   /* name */
> +  OPTGROUP_NONE,   /* optinfo_flags */
> +  TV_GIMPLE_STORE_MERGING,	 /* tv_id */
> +  PROP_ssa,	/* properties_required */
> +  0,		   /* properties_provided */
> +  0,		   /* properties_destroyed */
> +  0,		   /* todo_flags_start */
> +  0,		/* todo_flags_finish */
> +};
> +
> +class pass_dyninit : public gimple_opt_pass
> +{
> +public:
> +  pass_dyninit (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_dyninit, ctxt)
> +  {
> +  }
> +
> +  virtual bool
> +  gate (function *fun)
> +  {
> +    return (DECL_ARTIFICIAL (fun->decl)
> +	    && DECL_STATIC_CONSTRUCTOR (fun->decl)
> +	    && optimize);
> +  }
> +
> +  virtual unsigned int execute (function *);
> +}; // class pass_dyninit
> +
> +unsigned int
> +pass_dyninit::execute (function *fun)
> +{
> +  basic_block bb;
> +  auto_vec<gimple *, 32> ifns;
> +  hash_map<tree, gimple *> *map = NULL;
> +  auto_vec<tree, 32> vars;
> +  gimple **cur = NULL;
> +  bool ssdf_calls = false;
> +
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
> +	   !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  gimple *stmt = gsi_stmt (gsi);
> +	  if (is_gimple_debug (stmt))
> +	    continue;
> +
> +	  /* The C++ FE can wrap dynamic initialization of certain
> +	     variables with a pair of internal function calls, like:
> +	     .DYNAMIC_INIT_START (&b, 0);
> +	     b = 1;
> +	     .DYNAMIC_INIT_END (&b);
> +
> +	     or
> +	     .DYNAMIC_INIT_START (&e, 1);
> +	     # DEBUG this => &e.f
> +	     MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
> +	     MEM[(struct S *)&e + 4B].a = 1;
> +	     MEM[(struct S *)&e + 4B].b = 2;
> +	     MEM[(struct S *)&e + 4B].c = 3;
> +	     # DEBUG BEGIN_STMT
> +	     MEM[(struct S *)&e + 4B].d = 6;
> +	     # DEBUG this => NULL
> +	     .DYNAMIC_INIT_END (&e);
> +
> +	     Verify if there are only stores of constants to the corresponding
> +	     variable or parts of that variable and if so, try to reconstruct
> +	     a static initializer from the existing static initializer (if any) and
> +	     the constant stores into the variable.  This is permitted by
> +	     [basic.start.static]/3.  */
> +	  if (is_gimple_call (stmt))
> +	    {
> +	      if (gimple_call_internal_p (stmt, IFN_DYNAMIC_INIT_START))
> +		{
> +		  ifns.safe_push (stmt);
> +		  if (cur)
> +		    *cur = NULL;
> +		  tree arg = gimple_call_arg (stmt, 0);
> +		  gcc_assert (TREE_CODE (arg) == ADDR_EXPR
> +			      && DECL_P (TREE_OPERAND (arg, 0)));
> +		  tree var = TREE_OPERAND (arg, 0);
> +		  gcc_checking_assert (is_global_var (var));
> +		  varpool_node *node = varpool_node::get (var);
> +		  if (node == NULL
> +		      || node->in_other_partition
> +		      || TREE_ASM_WRITTEN (var)
> +		      || DECL_SIZE_UNIT (var) == NULL_TREE
> +		      || !tree_fits_uhwi_p (DECL_SIZE_UNIT (var))
> +		      || tree_to_uhwi (DECL_SIZE_UNIT (var)) > 1024
> +		      || TYPE_SIZE_UNIT (TREE_TYPE (var)) == NULL_TREE
> +		      || !tree_int_cst_equal (TYPE_SIZE_UNIT (TREE_TYPE (var)),
> +					      DECL_SIZE_UNIT (var)))
> +		    continue;
> +		  if (map == NULL)
> +		    map = new hash_map<tree, gimple *> (61);
> +		  bool existed_p;
> +		  cur = &map->get_or_insert (var, &existed_p);
> +		  if (existed_p)
> +		    {
> +		      /* Punt if we see more than one .DYNAMIC_INIT_START
> +			 internal call for the same variable.  */
> +		      *cur = NULL;
> +		      cur = NULL;
> +		    }
> +		  else
> +		    {
> +		      *cur = stmt;
> +		      vars.safe_push (var);
> +		    }
> +		  continue;
> +		}
> +	      else if (gimple_call_internal_p (stmt, IFN_DYNAMIC_INIT_END))
> +		{
> +		  ifns.safe_push (stmt);
> +		  tree arg = gimple_call_arg (stmt, 0);
> +		  gcc_assert (TREE_CODE (arg) == ADDR_EXPR
> +			      && DECL_P (TREE_OPERAND (arg, 0)));
> +		  tree var = TREE_OPERAND (arg, 0);
> +		  gcc_checking_assert (is_global_var (var));
> +		  if (cur)
> +		    {
> +		      /* Punt if .DYNAMIC_INIT_END call argument doesn't
> +			 pair with .DYNAMIC_INIT_START.  */
> +		      if (vars.last () != var)
> +			*cur = NULL;
> +		      cur = NULL;
> +		    }
> +		  continue;
> +		}
> +
> +	      /* Punt if we see any artificial
> +		 __static_initialization_and_destruction_* calls, e.g. if
> +		 it would be partially inlined, because we wouldn't then see
> +		 all .DYNAMIC_INIT_* calls.  */
> +	      tree fndecl = gimple_call_fndecl (stmt);
> +	      if (fndecl
> +		  && DECL_ARTIFICIAL (fndecl)
> +		  && DECL_NAME (fndecl)
> +		  && startswith (IDENTIFIER_POINTER (DECL_NAME (fndecl)),
> +				 "__static_initialization_and_destruction_"))
> +		ssdf_calls = true;
> +	    }
> +	  if (cur)
> +	    {
> +	      if (store_valid_for_store_merging_p (stmt))
> +		{
> +		  tree lhs = gimple_assign_lhs (stmt);
> +		  tree rhs = gimple_assign_rhs1 (stmt);
> +		  poly_int64 bitsize, bitpos;
> +		  HOST_WIDE_INT ibitsize, ibitpos;
> +		  machine_mode mode;
> +		  int unsignedp, reversep, volatilep = 0;
> +		  tree offset;
> +		  tree var = vars.last ();
> +		  if (rhs_valid_for_store_merging_p (rhs)
> +		      && get_inner_reference (lhs, &bitsize, &bitpos, &offset,
> +					      &mode, &unsignedp, &reversep,
> +					      &volatilep) == var
> +		      && !reversep
> +		      && !volatilep
> +		      && (offset == NULL_TREE || integer_zerop (offset))
> +		      && bitsize.is_constant (&ibitsize)
> +		      && bitpos.is_constant (&ibitpos)
> +		      && ibitpos >= 0
> +		      && ibitsize <= tree_to_shwi (DECL_SIZE (var))
> +		      && ibitsize + ibitpos <= tree_to_shwi (DECL_SIZE (var)))
> +		    continue;
> +		}
> +	      *cur = NULL;
> +	      cur = NULL;
> +	    }
> +	}
> +      if (cur)
> +	{
> +	  *cur = NULL;
> +	  cur = NULL;
> +	}
> +    }
> +  if (map && !ssdf_calls)
> +    {
> +      for (tree var : vars)
> +	{
> +	  gimple *g = *map->get (var);
> +	  if (g == NULL)
> +	    continue;
> +	  varpool_node *node = varpool_node::get (var);
> +	  node->get_constructor ();
> +	  tree init = DECL_INITIAL (var);
> +	  if (init == NULL)
> +	    init = build_zero_cst (TREE_TYPE (var));
> +	  gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +	  unsigned char *buf = NULL;
> +	  unsigned int buf_size = tree_to_uhwi (DECL_SIZE_UNIT (var));
> +	  bool buf_valid = false;
> +	  do
> +	    {
> +	      gsi_next (&gsi);
> +	      gimple *stmt = gsi_stmt (gsi);
> +	      if (is_gimple_debug (stmt))
> +		continue;
> +	      if (is_gimple_call (stmt))
> +		break;
> +	      if (gimple_clobber_p (stmt))
> +		continue;
> +	      tree lhs = gimple_assign_lhs (stmt);
> +	      tree rhs = gimple_assign_rhs1 (stmt);
> +	      if (lhs == var)
> +		{
> +		  /* Simple assignment to the whole variable.
> +		     rhs is the initializer.  */
> +		  buf_valid = false;
> +		  init = rhs;
> +		  continue;
> +		}
> +	      poly_int64 bitsize, bitpos;
> +	      machine_mode mode;
> +	      int unsignedp, reversep, volatilep = 0;
> +	      tree offset;
> +	      get_inner_reference (lhs, &bitsize, &bitpos, &offset,
> +				   &mode, &unsignedp, &reversep, &volatilep);
> +	      HOST_WIDE_INT ibitsize = bitsize.to_constant ();
> +	      HOST_WIDE_INT ibitpos = bitpos.to_constant ();
> +	      if (BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN
> +		  || CHAR_BIT != 8
> +		  || BITS_PER_UNIT != 8)
> +		{
> +		  g = NULL;
> +		  break;
> +		}
> +	      if (!buf_valid)
> +		{
> +		  if (buf == NULL)
> +		    buf = XNEWVEC (unsigned char, buf_size * 2);
> +		  memset (buf, 0, buf_size);
> +		  if (native_encode_initializer (init, buf, buf_size)
> +		      != (int) buf_size)
> +		    {
> +		      g = NULL;
> +		      break;
> +		    }
> +		  buf_valid = true;
> +		}
> +	      /* Otherwise go through byte representation.  */
> +	      if (!encode_tree_to_bitpos (rhs, buf, ibitsize,
> +					  ibitpos, buf_size))
> +		{
> +		  g = NULL;
> +		  break;
> +		}
> +	    }
> +	  while (1);
> +	  if (g == NULL)
> +	    {
> +	      XDELETE (buf);
> +	      continue;
> +	    }
> +	  if (buf_valid)
> +	    {
> +	      init = native_interpret_aggregate (TREE_TYPE (var), buf, 0,
> +						 buf_size);
> +	      if (init)
> +		{
> +		  /* Verify the dynamic initialization doesn't e.g. set
> +		     some padding bits to non-zero by trying to encode
> +		     it again and comparing.  */
> +		  memset (buf + buf_size, 0, buf_size);
> +		  if (native_encode_initializer (init, buf + buf_size,
> +						 buf_size) != (int) buf_size
> +		      || memcmp (buf, buf + buf_size, buf_size) != 0)
> +		    init = NULL_TREE;
> +		}
> +	    }
> +	  XDELETE (buf);
> +	  if (!init || !initializer_constant_valid_p (init, TREE_TYPE (var)))
> +	    continue;
> +	  if (integer_nonzerop (gimple_call_arg (g, 1)))
> +	    TREE_READONLY (var) = 1;
> +	  if (dump_file)
> +	    {
> +	      fprintf (dump_file, "dynamic initialization of ");
> +	      print_generic_stmt (dump_file, var, TDF_SLIM);
> +	      fprintf (dump_file, " optimized into: ");
> +	      print_generic_stmt (dump_file, init, TDF_SLIM);
> +	      if (TREE_READONLY (var))
> +		fprintf (dump_file, " and making it read-only\n");
> +	      fprintf (dump_file, "\n");
> +	    }
> +	  if (initializer_zerop (init))
> +	    DECL_INITIAL (var) = NULL_TREE;
> +	  else
> +	    DECL_INITIAL (var) = init;
> +	  gsi = gsi_for_stmt (g);
> +	  gsi_next (&gsi);
> +	  do
> +	    {
> +	      gimple *stmt = gsi_stmt (gsi);
> +	      if (is_gimple_debug (stmt))
> +		{
> +		  gsi_next (&gsi);
> +		  continue;
> +		}
> +	      if (is_gimple_call (stmt))
> +		break;
> +	      /* Remove now all the stores for the dynamic initialization.  */
> +	      unlink_stmt_vdef (stmt);
> +	      gsi_remove (&gsi, true);
> +	      if (gimple_vdef (stmt))
> +		release_ssa_name (gimple_vdef (stmt));
> +	    }
> +	  while (1);
> +	}
> +    }
> +  delete map;
> +  for (gimple *g : ifns)
> +    {
> +      gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +      unlink_stmt_vdef (g);
> +      gsi_remove (&gsi, true);
> +      if (gimple_vdef (g))
> +	release_ssa_name (gimple_vdef (g));
> +    }
> +  return 0;
> +}
>  } // anon namespace
>  
>  /* Construct and return a store merging pass object.  */
> @@ -5475,6 +5805,14 @@ make_pass_store_merging (gcc::context *c
>    return new pass_store_merging (ctxt);
>  }
>  
> +/* Construct and return a dyninit pass object.  */
> +
> +gimple_opt_pass *
> +make_pass_dyninit (gcc::context *ctxt)
> +{
> +  return new pass_dyninit (ctxt);
> +}
> +
>  #if CHECKING_P
>  
>  namespace selftest {
> --- gcc/cp/decl2.c.jj	2021-11-02 09:05:47.004664566 +0100
> +++ gcc/cp/decl2.c	2021-11-03 17:18:11.395288518 +0100
> @@ -4133,13 +4133,36 @@ one_static_initialization_or_destruction
>      {
>        if (init)
>  	{
> +	  bool sanitize = sanitize_flags_p (SANITIZE_ADDRESS, decl);
> +	  if (optimize && guard == NULL_TREE && !sanitize)
> +	    {
> +	      tree t = build_fold_addr_expr (decl);
> +	      tree type = TREE_TYPE (decl);
> +	      tree is_const
> +		= constant_boolean_node (TYPE_READONLY (type)
> +					 && !cp_has_mutable_p (type),
> +					 boolean_type_node);
> +	      t = build_call_expr_internal_loc (DECL_SOURCE_LOCATION (decl),
> +						IFN_DYNAMIC_INIT_START,
> +						void_type_node, 2, t,
> +						is_const);
> +	      finish_expr_stmt (t);
> +	    }
>  	  finish_expr_stmt (init);
> -	  if (sanitize_flags_p (SANITIZE_ADDRESS, decl))
> +	  if (sanitize)
>  	    {
>  	      varpool_node *vnode = varpool_node::get (decl);
>  	      if (vnode)
>  		vnode->dynamically_initialized = 1;
>  	    }
> +	  else if (optimize && guard == NULL_TREE)
> +	    {
> +	      tree t = build_fold_addr_expr (decl);
> +	      t = build_call_expr_internal_loc (DECL_SOURCE_LOCATION (decl),
> +						IFN_DYNAMIC_INIT_END,
> +						void_type_node, 1, t);
> +	      finish_expr_stmt (t);
> +	    }
>  	}
>  
>        /* If we're using __cxa_atexit, register a function that calls the
> --- gcc/testsuite/g++.dg/opt/init3.C.jj	2021-11-03 17:53:01.872472570 +0100
> +++ gcc/testsuite/g++.dg/opt/init3.C	2021-11-03 17:52:57.484535115 +0100
> @@ -0,0 +1,31 @@
> +// PR c++/102876
> +// { dg-do compile }
> +// { dg-options "-O2 -fdump-tree-dyninit" }
> +// { dg-final { scan-tree-dump "dynamic initialization of b\[\n\r]* optimized into: 1" "dyninit" } }
> +// { dg-final { scan-tree-dump "dynamic initialization of e\[\n\r]* optimized into: {.e=5, .f={.a=1, .b=2, .c=3, .d=6}, .g=6}\[\n\r]* and making it read-only" "dyninit" } }
> +// { dg-final { scan-tree-dump "dynamic initialization of f\[\n\r]* optimized into: {.e=7, .f={.a=1, .b=2, .c=3, .d=6}, .g=1}" "dyninit" } }
> +// { dg-final { scan-tree-dump "dynamic initialization of h\[\n\r]* optimized into: {.h=8, .i={.a=1, .b=2, .c=3, .d=6}, .j=9}" "dyninit" } }
> +// { dg-final { scan-tree-dump-times "dynamic initialization of " 4 "dyninit" } }
> +// { dg-final { scan-tree-dump-times "and making it read-only" 1 "dyninit" } }
> +
> +struct S { S () : a(1), b(2), c(3), d(4) { d += 2; } int a, b, c, d; };
> +struct T { int e; S f; int g; };
> +struct U { int h; mutable S i; int j; };
> +extern int b;
> +int foo (int &);
> +int bar (int &);
> +int baz () { return 1; }
> +int qux () { return b = 2; }
> +// Dynamic initialization of a shouldn't be optimized, foo can't be inlined.
> +int a = foo (b);
> +int b = baz ();
> +// Likewise for c.
> +int c = bar (b);
> +// While qux is inlined, the dynamic initialization modifies another
> +// variable, so punt for d as well.
> +int d = qux ();
> +const T e = { 5, S (), 6 };
> +T f = { 7, S (), baz () };
> +const T &g = e;
> +const U h = { 8, S (), 9 };
> +const U &i = h;
> 
> 	Jakub
> 
>
  
Jakub Jelinek Nov. 4, 2021, 3:35 p.m. UTC | #2
On Thu, Nov 04, 2021 at 12:13:51PM +0100, Richard Biener wrote:
> As a general comment I wonder whether doing this fully in the C++
> frontend leveraging the constexpr support is a better approach, esp.
> before we end up putting all initializers into a single function ...
> even partly constexpr evaluating things might help in some case.

I initially thought that is what we should do, but I agree with Jason
that it isn't either/or: while we should keep investigating the
auto-constexpr handling for inline functions (I'm curious about the details
of that, e.g. should that implicit constexpr be just a different flag
from what we currently use, so that we e.g. ignore them during manifestly
constant evaluation and only handle them when doing optimization-only
constant evaluation?  Do we want to copy their bodies early, before all the
cp_fold work, like we do for real constexpr functions, or can we process
their cp_folded bodies before gimplification (gimplification is destructive,
so after that we obviously couldn't use them)?), that still won't handle
functions not marked inline, functions with bodies defined only after the
variable with dynamic initialization, functions with bodies in different TUs
with LTO, etc.
Or e.g. strict C++ says something (reinterpret_cast, etc.) isn't valid in
constant expressions, but our optimizers handle it fine and we still
optimize it into constant stores.
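E.g. (made-up example) something like:
char buf[16];
// Not a constant expression because of the reinterpret_casts, yet it folds
// to the constant 1, so the pass could turn it into static initialization.
long one = reinterpret_cast<char *> (buf + 1) - reinterpret_cast<char *> (buf);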

> On that note it might be worth experimenting with keeping each
> initializer in a separate function until IPA where IPA could
> then figure out dependences via IPA REFs (with LTO on the whole
> program), a) diagnosing inter-CU undefined behavior, b) "fixing"
> things by making sure the initialization happens init-before-use
> (when there's no cycle), c) with local analysis do the promotion
> to READONLY at IPA time and elide the function.

I thought about separate functions, but it isn't clear to me how those
would actually help.  In order to optimize the dynamic initializers that
couldn't be optimized with the constexpr machinery, we need inlining (and I'm
not really sure we can rely just on early inlining) and then some constant
propagation etc.  But on the other hand, we don't want to call hundreds of
different functions from the _GLOBAL__sub_I_* functions, so even if we used
separate functions, we'd want IPA to inline them.
For the diagnostics of UB, we have -fsanitize=address which should diagnose
incorrect initialization ordering.
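E.g. (made-up two-TU example) for:
// tu1.C
extern int a;
int b = a + 1;		// may read a before a's dynamic init below has run
// tu2.C
int f ();
int a = f ();
building with -fsanitize=address and running with
ASAN_OPTIONS=check_initialization_order=1 should report the
initialization-order problem when b's initializer happens to run first.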

	Jakub
  
Richard Biener Nov. 5, 2021, 10:44 a.m. UTC | #3
On Thu, 4 Nov 2021, Jakub Jelinek wrote:

> On Thu, Nov 04, 2021 at 12:13:51PM +0100, Richard Biener wrote:
> > As a general comment I wonder whether doing this fully in the C++
> > frontend leveraging the constexpr support is a better approach, esp.
> > before we end up putting all initializers into a single function ...
> > even partly constexpr evaluating things might help in some case.
> 
> I initially thought that is what we should do, but I agree with Jason
> that it isn't either/or, while we should keep investigating the
> auto-constexpr handling for inline functions (curious about details for
> that, e.g. should those implicit constexpr be just a different flag
> from what we currently use, so that we e.g. ignore them during manifestly
> constant evaluation and only handle them when doing optimization only
> constant evaluation?  Do we want to copy their bodies early before all
> cp_fold like we do for real constexpr functions, or can we process
> them on their cp_folded bodies before gimplification (gimplification
> is destructive, so after that we couldn't use those obviously)?),
> that still won't handle cases of functions not marked inline, functions
> with bodies defined only after the variable with dynamic initialization,
> functions with bodies in different TUs with LTO, etc.
> Or e.g. strict C++ says something isn't valid in constant expressions,
> reinterpret_cast, etc., but our optimizers handle it fine and we still
> optimize into constant stores.

Agreed that we should attack it from both sides; I just had the
impression that most bug reports complain that clang++ can do it,
and those mostly looked like opportunities that could be handled
by simply const-evaluating the initializer.  So I wonder if we shouldn't
do that first.

> > On that note it might be worth experimenting with keeping each
> > initializer in a separate function until IPA where IPA could
> > then figure out dependences via IPA REFs (with LTO on the whole
> > program), a) diagnosing inter-CU undefined behavior, b) "fixing"
> > things by making sure the initialization happens init-before-use
> > (when there's no cycle), c) with local analysis do the promotion
> > to READONLY at IPA time and elide the function.
> 
> I thought about separate functions, but it isn't clear to me how those
> would actually help.  Because in order to optimize the dynamic initializers
> that weren't possible to optimize with constexpr machinery, we need
> inlining, not really sure if we can rely just on just early inlining, and then
> need some constant propagation etc.  But on the other side, we don't want
> to call hundreds of different functions from the *GLOBAL_*_I_* functions,
> so even if we used separate functions, we want IPA to inline it.

All true, but at least separate functions make it easier to see what
the initializer is without resorting to tricks like the internal functions
you add (just guessing a bit, didn't look at the patch yet).

Say, if the CTOR function has

  a = 2;
  b = foo ();
  c = 0;

coming from

int a = baz (); // returns constant 2
int b = foo (); // not resolvable
int c = bar (); // returns constant 0

then how do we know that foo () does not modify a[] or c[]?
At least modifying c from foo () should be UB?  Modifying a
might be OK.  But with

  a = 2;
  b = foo ();
  c = 0;

we need to prove we can move the inits before any possible clobbers
to make them static inits?  Promoting a is OK I guess since foo ()
will simply re-initialize it.  But promoting c is only OK if
foo modifying it would be UB.

> For the diagnostics of UB, we have -fsanitize=address which should diagnose
> incorrect initialization ordering.

Ah, I see.  Of course that doesn't diagnose things that are UB but
happen to be "corrected" by link order?

Richard.
  
Jakub Jelinek Nov. 5, 2021, 11:29 a.m. UTC | #4
On Fri, Nov 05, 2021 at 11:44:53AM +0100, Richard Biener wrote:
> Agreed that we should attack it from both sides, I just had the
> impression that most bugreports complain that clang++ can do it
> and those mostly looked opportunities that could be leveraged
> by simply const-evaluating the initializer. So I wonder if we shouldn't
> do that first.

Yes, clang++ can do it (apparently in a limited way, they can either
optimize all dynamic initializers in a TU or none, so kind of what
my patch would do without those internal functions), but they
clearly aren't doing it by const-evaluating the initializer;
from -mllvm -print-after-all (which seems to be a quite unreadable variant of
GCC's -fdump-{tree,ipa,rtl}-all-details with everything intermixed
on stdout) it seems to be done in a
Global Variable Optimizer
pass that runs before inlining but after
Interprocedural Sparse Conditional Constant Propagation and
Called Value Propagation.

They do seem to handle e.g.
int foo ();
int a = foo ();
int foo () { return 1; }
int bar (int);
int b = bar (foo ());
int bar (int x) { return x + 7; }
which we won't be able to optimize in the FE even if we wanted to
treat all functions as constexpr rather than only the inlines that Jason
was planning to handle like that; the bodies of
the functions aren't available when we process those variable initializers.

> All true, but at least separate functions make it easier to see what
> the initializer is without resorting to tricks like the internal functions
> you add (just guessing a bit, didn't look at the patch yet).

I think the internal function calls are actually cheaper than separate
functions and can be kept in the IL after IPA until we use them and
remove them.
If wanted, we could actually run the pass twice: once before IPA so that
it can optimize vars where early inlining already turned stuff into constants
(in that first pass we would remove only the ifns wrapping dynamic
initialization of vars that the early pass instance was able to optimize),
and once after IPA and constant propagation, dce etc., which would
handle the rest (and that one would remove all the remaining ifns).

> Say, if the CTOR function has
> 
>   a = 2;
>   b = foo ();
>   c = 0;
> 
> coming from
> 
> int a = baz (); // returns constant 2
> int b = foo (); // not resolvable
> int c = bar (); // returns constant 0
> 
> then how do we know that foo () does not modify a[] or c[]?
> At least modifying c from foo () should be UB?  modifying a

foo certainly can read and modify a no matter what type it has,
and it won't change anything: a has been initialized to 2 either
dynamically or statically and both behave the same.
As for c, if it is not vacuously initialized (i.e. needs construction
with a non-trivial constructor), reading or storing it would, I believe,
be UB.  If it is vacuously initialized, then the
https://eel.is/c++draft/basic.start.static#3
I was referring to applies:
"An implementation is permitted to perform the initialization of a variable
with static or thread storage duration as a static initialization even if
such initialization is not required to be done statically, provided that

- the dynamic version of the initialization does not change the value of any
  other object of static or thread storage duration prior to its
  initialization, and

- the static version of the initialization produces the same value in the
  initialized variable as would be produced by the dynamic initialization if
  all variables not required to be initialized statically were initialized
  dynamically.

[Note 2: As a consequence, if the initialization of an object obj1 refers to
an object obj2 potentially requiring dynamic initialization and defined later
in the same translation unit, it is unspecified whether the value of obj2
used will be the value of the fully initialized obj2 (because obj2 was
statically initialized) or will be the value of obj2 merely zero-initialized.
For example, inline double fd() { return 1.0; }
extern double d1;
double d2 = d1;     // unspecified:
                    // either statically initialized to 0.0 or
                    // dynamically initialized to 0.0 if d1 is
                    // dynamically initialized, or 1.0 otherwise
double d1 = fd();   // either initialized statically or dynamically to 1.0
- end note]"

My reading is that the first bullet talks just about the dynamic
initialization of the particular variable and not e.g. about all the dynamic
initialization of previous objects, so the optimization is fine when it uses
those ifn markers and checks something even stronger (that no other variables
are modified in that particular dynamic initialization).  The example shows
that at least reading c in foo is ok, but one needs to be prepared to see
there either the value that would be there if the optimization didn't happen
or the one where it did.  The example doesn't talk about writing the
variable...

> > For the diagnostics of UB, we have -fsanitize=address which should diagnose
> > incorrect initialization ordering.
> 
> Ah, I see.  Of course that doesn't diagnose things that are UB but
> happen to be "corrected" by link order?

It has been a while since I've looked at it, but I think it works by making
the not yet constructed global vars inaccessible through shadow memory
until they are actually constructed.

	Jakub
  
Martin Sebor Nov. 5, 2021, 5:06 p.m. UTC | #5
On 11/4/21 3:42 AM, Jakub Jelinek via Gcc-patches wrote:
> Hi!
> 
> When users don't use constexpr everywhere in initialization of namespace
> scope non-comdat vars and the initializers aren't constant when FE is
> looking at them, the FE performs dynamic initialization of those variables.
> But after inlining and some constant propagation, we often end up with
> just storing constants into those variables in the _GLOBAL__sub_I_*
> constructor.
> C++ gives us permission to change some of that dynamic initialization
> back into static initialization - https://eel.is/c++draft/basic.start.static#3
> For classes that need (dynamic) construction, I believe access to some var
> from other dynamic construction before that var is constructed is UB, but
> as the example in the above mentioned spot of C++:
> inline double fd() { return 1.0; }
> extern double d1;
> double d2 = d1;     // unspecified:
>                      // either statically initialized to 0.0 or
>                      // dynamically initialized to 0.0 if d1 is
>                      // dynamically initialized, or 1.0 otherwise
> double d1 = fd();   // either initialized statically or dynamically to 1.0
> some vars can be used before they are dynamically initialized and the
> implementation can still optimize those into static initialization.
> 
> The following patch attempts to optimize some such cases back into
> DECL_INITIAL initializers and where possible (originally const vars without
> mutable members) put those vars back to .rodata etc.
> 
> Because we put all dynamic initialization from a single TU into one single
> function (well, originally one function per priority but typically inline
> those back into one function), we can either have a simpler approach
> (from the PR it seems that is what LLVM uses) where either we manage to
> optimize all dynamic initializers into constant in the TU, or nothing,
> or by adding some markup - in the form of a pair of internal functions in
> this patch - around each dynamic initialization that can be optimized,
> we can optimize each dynamic initialization separately.
> 
> The patch adds a new pass that is invoked (through gate check) only on
> DECL_ARTIFICIAL DECL_STATIC_CONSTRUCTOR functions, and looks there for
> sequences like:
>    .DYNAMIC_INIT_START (&b, 0);
>    b = 1;
>    .DYNAMIC_INIT_END (&b);
> or
>    .DYNAMIC_INIT_START (&e, 1);
>    # DEBUG this => &e.f
>    MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
>    MEM[(struct S *)&e + 4B].a = 1;
>    MEM[(struct S *)&e + 4B].b = 2;
>    MEM[(struct S *)&e + 4B].c = 3;
>    # DEBUG BEGIN_STMT
>    MEM[(struct S *)&e + 4B].d = 6;
>    # DEBUG this => NULL
>    .DYNAMIC_INIT_END (&e);
> (where between the pair of markers everything is either debug stmts or
> stores of constants into the variables or their parts).
> The pass needs to be done late enough so that after IPA all the needed
> constant propagation and perhaps loop unrolling is done, on the other
> side should be early enough so that if we can't optimize it, we can
> remove those .DYNAMIC_INIT* internal calls that could prevent some
> further optimizations (they have fnspec such that they pretend to read
> the corresponding variable).

In my work-in-progress patch to diagnose stores into constant
objects (and subobjects) I deal with the same problem.  I had
considered a pair of markers like those above (David Malcolm
suggested a smilar approach as well), but decided to go
a different route, not trusting they could be kept together,
or that they wouldn't be viewed as overly intrusive.  With
it, I have been able to distinguish dynamic initialization
from overwriting stores even at the end of compilation, but
I'd be fine changing that and running the detection earlier.

So if the markers are added for the purpose of optimizing
the dynamic initialization at file scope, could they be added
for those of locals as well?  That way I wouldn't need to add
a separate solution.

> 
> Currently the optimization is only able to optimize cases where the whole
> variable is stored in a single store (typically scalar variables), or
> uses the native_{encode,interpret}* infrastructure to create or update
> the CONSTRUCTOR.  This means that except for the first category, we can't
> right now handle unions or anything that needs relocations (vars containing
> pointers to other vars or references).
> I think it would be nice to incrementally add before the native_* fallback
> some attempt to just create or update a CONSTRUCTOR if possible.  If we only
> see var.a.b.c.d[10].e = const; style of stores, this shouldn't be that hard
> as the whole access path is recorded there and we'd just need to decide what
> to do with unions if two or more union members are accessed.  And do a deep
> copy of the CONSTRUCTOR and try to efficiently update the copy afterwards
> (the CONSTRUCTORs should be sorted on increasing offsets of the
> members/elements, so doing an ordered vec insertion might not be the best
> idea).  But MEM_REFs complicate this, parts or all of the access path
> is lost.  For non-unions in most cases we could try to guess which field
> it is (do we have some existing function to do that?

The sprintf pass (only, for now) uses field_at_offset() for this.

> I vaguely remember
> we've been doing that in some cases in the past in some folding but stopped
> doing so) but with unions it will be harder or impossible.
> 
> As the middle-end can't easily differentiate between const variables without
> and with mutable members,

In the work-in-progress patch I mentioned above I've added two
hooks: one to return if a type has a mutable member:

   extern bool lhd_has_mutable (const_tree);

and another if a decl is mutable:

   extern bool lhd_is_mutable (const_tree);

They let the middle end tell (with some effort) if a subobject
is mutable (they're called from field_at_offset()).  The effort
is in recreating the full path to the member for every MEM_REF.
Since mutable is rare, this is a lot of work for usually no gain.
Rather than doing this for every MEM_REF I've been thinking of
caching this info for every struct type in some persistent
structure.  But for my purposes the mutable bit isn't enough.
To tell if a member is const the search needs to be done top
to bottom to detect even non-const members of const subobjects
of complete objects (const or not).  So the cache I'm thinking
of would need to be some sort of a sparse mapping from every
byte offset into a struct type to its const member.  I would
love it if the same solution could be used both for
the optimization and for warnings.  (Ideally, the cache would
be populated by the front ends as they parse structs, but it
could also be populated lazily by the middle end.)
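Roughly what I have in mind is something like this (a hypothetical sketch,
names made up):

/* Sketch only: for each struct type, cache the byte ranges that are
   const and not reachable through a mutable member, populated lazily,
   so MEM_REF stores can be checked without rebuilding the access path
   each time.  */
struct const_byte_range { unsigned HOST_WIDE_INT start, end; };
static hash_map<tree, vec<const_byte_range> *> *type_const_ranges;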

I don't know if the optimization you're interested in (or one
like it) might also be applicable to const locals.  Detecting
accidental stores into those is much more likely to find subtle
bugs than just globals, simply because there are more cases,
and because many globals are in read-only memory and so writing
into those will crash.

Martin
  
Jason Merrill Nov. 5, 2021, 7:46 p.m. UTC | #6
On 11/5/21 13:06, Martin Sebor wrote:
> On 11/4/21 3:42 AM, Jakub Jelinek via Gcc-patches wrote:
>> Hi!
>>
>> When users don't use constexpr everywhere in initialization of namespace
>> scope non-comdat vars and the initializers aren't constant when FE is
>> looking at them, the FE performs dynamic initialization of those 
>> variables.
>> But after inlining and some constant propagation, we often end up with
>> just storing constants into those variables in the _GLOBAL__sub_I_*
>> constructor.
>> C++ gives us permission to change some of that dynamic initialization
>> back into static initialization - 
>> https://eel.is/c++draft/basic.start.static#3
>> For classes that need (dynamic) construction, I believe access to some 
>> var
>> from other dynamic construction before that var is constructed is UB, but
>> as the example in the above mentioned spot of C++:
>> inline double fd() { return 1.0; }
>> extern double d1;
>> double d2 = d1;     // unspecified:
>>                      // either statically initialized to 0.0 or
>>                      // dynamically initialized to 0.0 if d1 is
>>                      // dynamically initialized, or 1.0 otherwise
>> double d1 = fd();   // either initialized statically or dynamically to 
>> 1.0
>> some vars can be used before they are dynamically initialized and the
>> implementation can still optimize those into static initialization.
>>
>> The following patch attempts to optimize some such cases back into
>> DECL_INITIAL initializers and where possible (originally const vars 
>> without
>> mutable members) put those vars back to .rodata etc.
>>
>> Because we put all dynamic initialization from a single TU into one 
>> single
>> function (well, originally one function per priority but typically inline
>> those back into one function), we can either have a simpler approach
>> (from the PR it seems that is what LLVM uses) where either we manage to
>> optimize all dynamic initializers into constant in the TU, or nothing,
>> or by adding some markup - in the form of a pair of internal functions in
>> this patch - around each dynamic initialization that can be optimized,
>> we can optimize each dynamic initialization separately.
>>
>> The patch adds a new pass that is invoked (through gate check) only on
>> DECL_ARTIFICIAL DECL_STATIC_CONSTRUCTOR functions, and looks there for
>> sequences like:
>>    .DYNAMIC_INIT_START (&b, 0);
>>    b = 1;
>>    .DYNAMIC_INIT_END (&b);
>> or
>>    .DYNAMIC_INIT_START (&e, 1);
>>    # DEBUG this => &e.f
>>    MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
>>    MEM[(struct S *)&e + 4B].a = 1;
>>    MEM[(struct S *)&e + 4B].b = 2;
>>    MEM[(struct S *)&e + 4B].c = 3;
>>    # DEBUG BEGIN_STMT
>>    MEM[(struct S *)&e + 4B].d = 6;
>>    # DEBUG this => NULL
>>    .DYNAMIC_INIT_END (&e);
>> (where between the pair of markers everything is either debug stmts or
>> stores of constants into the variables or their parts).
>> The pass needs to be done late enough so that after IPA all the needed
>> constant propagation and perhaps loop unrolling is done, on the other
>> side should be early enough so that if we can't optimize it, we can
>> remove those .DYNAMIC_INIT* internal calls that could prevent some
>> further optimizations (they have fnspec such that they pretend to read
>> the corresponding variable).
> 
> In my work-in-progress patch to diagnose stores into constant
> objects (and subobjects) I deal with the same problem.  I had
> considered a pair of markers like those above (David Malcolm
> suggested a similar approach as well), but decided to go
> a different route, not trusting they could be kept together,
> or that they wouldn't be viewed as overly intrusive.  With
> it, I have been able to distinguish dynamic initialization
> from overwriting stores even at the end of compilation, but
> I'd be fine changing that and running the detection earlier.
> 
> So if the markers are added for the purpose of optimizing
> the dynamic initialization at file scope, could they be added
> for those of locals as well?  That way I wouldn't need to add
> a separate solution.
> 
>>
>> Currently the optimization is only able to optimize cases where the whole
>> variable is stored in a single store (typically scalar variables), or
>> uses the native_{encode,interpret}* infrastructure to create or update
>> the CONSTRUCTOR.  This means that except for the first category, we can't
>> right now handle unions or anything that needs relocations (vars 
>> containing
>> pointers to other vars or references).
>> I think it would be nice to incrementally add before the native_* 
>> fallback
>> some attempt to just create or update a CONSTRUCTOR if possible.  If 
>> we only
>> see var.a.b.c.d[10].e = const; style of stores, this shouldn't be that 
>> hard
>> as the whole access path is recorded there and we'd just need to 
>> decide what
>> to do with unions if two or more union members are accessed.  And do a 
>> deep
>> copy of the CONSTRUCTOR and try to efficiently update the copy afterwards
>> (the CONSTRUCTORs should be sorted on increasing offsets of the
>> members/elements, so doing an ordered vec insertion might not be the best
>> idea).  But MEM_REFs complicate this, parts or all of the access path
>> is lost.  For non-unions in most cases we could try to guess which field
>> it is (do we have some existing function to do that?
> 
> The sprintf pass (only, for now) uses field_at_offset() for this.
> 
>> I vaguely remember
>> we've been doing that in some cases in the past in some folding but 
>> stopped
>> doing so) but with unions it will be harder or impossible.
>>
>> As the middle-end can't easily differentiate between const variables 
>> without
>> and with mutable members,
> 
> In the work-in-progress patch I mentioned above I've added two
> hooks: one to return if a type has a mutable member:
> 
>    extern bool lhd_has_mutable (const_tree);
> 
> and another if a decl is mutable:
> 
>    extern bool lhd_is_mutable (const_tree);
> 
> They let the middle end tell (with some effort) if a subobject
> is mutable (they're called from field_at_offset()).  The effort
> is in recreating the full path to the member for every MEM_REF.
> Since mutable is rare, this is a lot of work for usually no gain.
> Rather than doing this for every MEM_REF I've been thinking of
> caching this info for every struct type in some persistent
> structure.

In the front end, this is CLASSTYPE_HAS_MUTABLE, or cp_has_mutable_p.
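As a sketch, the C++ FE side of your proposed hook could simply
delegate to that predicate (the wrapper name below is made up):

  /* C++ front-end override of the proposed has-mutable langhook.  */
  static bool
  cxx_has_mutable (const_tree type)
  {
    return cp_has_mutable_p (type);
  }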

Jason
  
Jakub Jelinek Nov. 6, 2021, 10:04 a.m. UTC | #7
On Fri, Nov 05, 2021 at 11:06:44AM -0600, Martin Sebor wrote:
> In my work-in-progress patch to diagnose stores into constant
> objects (and subobjects) I deal with the same problem.  I had
> considered a pair of markers like those above (David Malcolm
> suggested a similar approach as well), but decided to go
> a different route, not trusting they could be kept together,
> or that they wouldn't be viewed as overly intrusive.  With
> it, I have been able to distinguish dynamic initialization
> from overwriting stores even at the end of compilation, but
> I'd be fine changing that and running the detection earlier.
> 
> So if the markers are added for the purpose of optimizing
> the dynamic initialization at file scope, could they be added
> for those of locals as well?  That way I wouldn't need to add
> a separate solution.

I'm afraid not.  The ifns, by pretending to read the corresponding
variables (and maybe they should pretend to read from all namespace-scope
global variables defined in the TU), prevent code motion of stores across
them and inhibit various other optimizations.
I'd hope that is acceptable for the global constructor functions, because
those run just once per process and never appear inside loops etc.
But for automatic or static block-scope variables it would be way too
expensive.  We can't afford that.
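Purely to illustrate the cost (this is not something the patch emits):
wrapping every unguarded local initialization would put the marker pair
into arbitrary code, including hot loops, e.g.

  extern void use (int);

  void
  hot (int n)
  {
    for (int i = 0; i < n; ++i)
      {
        const int k = i * 3;  /* would need .DYNAMIC_INIT_START/_END
                                 around it, pretending to read k and
                                 pessimizing the surrounding stores */
        use (k);
      }
  }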

	Jakub
  
Richard Biener Dec. 2, 2021, 1:35 p.m. UTC | #8
On Thu, 4 Nov 2021, Jakub Jelinek wrote:

> Hi!
> 
> When users don't use constexpr everywhere in initialization of namespace
> scope non-comdat vars and the initializers aren't constant when FE is
> looking at them, the FE performs dynamic initialization of those variables.
> But after inlining and some constant propagation, we often end up with
> just storing constants into those variables in the _GLOBAL__sub_I_*
> constructor.
> C++ gives us permission to change some of that dynamic initialization
> back into static initialization - https://eel.is/c++draft/basic.start.static#3
> For classes that need (dynamic) construction, I believe access to some var
> from other dynamic construction before that var is constructed is UB, but
> as the example in the above mentioned spot of C++:
> inline double fd() { return 1.0; }
> extern double d1;
> double d2 = d1;     // unspecified:
>                     // either statically initialized to 0.0 or
>                     // dynamically initialized to 0.0 if d1 is
>                     // dynamically initialized, or 1.0 otherwise
> double d1 = fd();   // either initialized statically or dynamically to 1.0
> some vars can be used before they are dynamically initialized and the
> implementation can still optimize those into static initialization.
> 
> The following patch attempts to optimize some such cases back into
> DECL_INITIAL initializers and where possible (originally const vars without
> mutable members) put those vars back to .rodata etc.
> 
> Because we put all dynamic initialization from a single TU into one single
> function (well, originally one function per priority but typically inline
> those back into one function), we can either have a simpler approach
> (from the PR it seems that is what LLVM uses) where either we manage to
> optimize all dynamic initializers into constant in the TU, or nothing,
> or by adding some markup - in the form of a pair of internal functions in
> this patch - around each dynamic initialization that can be optimized,
> we can optimize each dynamic initialization separately.
> 
> The patch adds a new pass that is invoked (through gate check) only on
> DECL_ARTIFICIAL DECL_STATIC_CONSTRUCTOR functions, and looks there for
> sequences like:
>   .DYNAMIC_INIT_START (&b, 0);
>   b = 1;
>   .DYNAMIC_INIT_END (&b);
> or
>   .DYNAMIC_INIT_START (&e, 1);
>   # DEBUG this => &e.f
>   MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
>   MEM[(struct S *)&e + 4B].a = 1;
>   MEM[(struct S *)&e + 4B].b = 2;
>   MEM[(struct S *)&e + 4B].c = 3;
>   # DEBUG BEGIN_STMT
>   MEM[(struct S *)&e + 4B].d = 6;
>   # DEBUG this => NULL
>   .DYNAMIC_INIT_END (&e);

So with

+/* Mark start and end of dynamic initialization of a variable.  */
+DEF_INTERNAL_FN (DYNAMIC_INIT_START, ECF_LEAF | ECF_NOTHROW, ". r ")
+DEF_INTERNAL_FN (DYNAMIC_INIT_END, ECF_LEAF | ECF_NOTHROW, ". r ")

there's nothing preventing code motion of unrelated stmts into
the block, but that should be harmless.  What it also does
is make 'e' aliased (because its address is now taken), probably
relevant only for IPA/LTO or for statics.

The setup does not prevent CSEing the inits with uses from another
initializer - probably OK as well (if not then .DYNAMIC_INIT_END
should also be considered writing to 'e').

". r " also means it clobbers and uses all global memory, I think
we'd like to have it const + looping-pure-or-const.  ".cr " would
possibly achieve this, not sure about the looping part.
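As a sketch, that would just mean changing the fnspec strings in
internal-fn.def (whether this alone also conveys the looping part is
exactly the open question):

  /* Mark start and end of dynamic initialization of a variable.  */
  DEF_INTERNAL_FN (DYNAMIC_INIT_START, ECF_LEAF | ECF_NOTHROW, ".cr ")
  DEF_INTERNAL_FN (DYNAMIC_INIT_END, ECF_LEAF | ECF_NOTHROW, ".cr ")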


> (where between the pair of markers everything is either debug stmts or
> stores of constants into the variables or their parts).
> The pass needs to be done late enough so that after IPA all the needed
> constant propagation and perhaps loop unrolling is done, on the other
> side should be early enough so that if we can't optimize it, we can
> remove those .DYNAMIC_INIT* internal calls that could prevent some
> further optimizations (they have fnspec such that they pretend to read
> the corresponding variable).
> 
> Currently the optimization is only able to optimize cases where the whole
> variable is stored in a single store (typically scalar variables), or
> uses the native_{encode,interpret}* infrastructure to create or update
> the CONSTRUCTOR.  This means that except for the first category, we can't
> right now handle unions or anything that needs relocations (vars containing
> pointers to other vars or references).
> I think it would be nice to incrementally add before the native_* fallback
> some attempt to just create or update a CONSTRUCTOR if possible.  If we only
> see var.a.b.c.d[10].e = const; style of stores, this shouldn't be that hard
> as the whole access path is recorded there and we'd just need to decide what
> to do with unions if two or more union members are accessed.  And do a deep
> copy of the CONSTRUCTOR and try to efficiently update the copy afterwards
> (the CONSTRUCTORs should be sorted on increasing offsets of the
> members/elements, so doing an ordered vec insertion might not be the best
> idea).  But MEM_REFs complicate this, parts or all of the access path
> is lost.  For non-unions in most cases we could try to guess which field
> it is (do we have some existing function to do that?  I vaguely remember
> we've been doing that in some cases in the past in some folding but stopped
> doing so) but with unions it will be harder or impossible.

I suppose we could, at least for non-overlapping inits, create a
new aggregate type on the fly to be able to compose the CTOR and then
view-convert it to the decl's type.  Would need to check that
a CTOR wrapped in a V_C_E is handled OK by varasm of course.

An alternative way of recording the initializer (maybe just emit
it right away into asm?) would be another possibility.

I also note that loops are quite common in some initializers so more
aggressively unrolling those for initializations might be a good
idea as well.
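For instance (a made-up example, not taken from the patch or testcase),
an initializer like

  struct Table
  {
    int v[8];
    Table () { for (int i = 0; i < 8; ++i) v[i] = i * i; }
  };
  Table table;  // dynamic initialization of 'table'

only becomes a sequence of constant stores the pass can fold into
DECL_INITIAL once the constructor is inlined and the loop fully unrolled.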

> As the middle-end can't easily differentiate between const variables without
> and with mutable members, both of those will have TREE_READONLY on the
> var decl clear (because of dynamic initialization) and TYPE_READONLY set
> on the type, the patch remembers this in an extra argument to
> .DYNAMIC_INIT_START (true if it is ok to set TREE_READONLY on the var decl
> back if the var dynamic initialization could be optimized into DECL_INITIAL).
> Thinking more about it, I'm not sure about const vars without mutable
> members with non-trivial destructors, do we register their dtors dynamically
> through __cxa_atexit in the ctors (that would mean the optimization
> currently punts on them), or not (in that case we could put it into .rodata
> even when the dtor will want to perhaps write to them)?

I think anything like this asks for doing the whole thing at the IPA level
to see which functions are "initialization" and thus need not be considered
as writing when the initializer is made static.

That said, do we want to record the fact that we guarded init
with .DYNAMIC_INIT_* on the varpool node?  I think we want a flag
in struct function or in the cgraph node to tell whether there's
a .DYNAMIC_INIT_* in it to avoid the whole function walk of
pass_dyninit::execute which for most functions will be a noop.
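A possible sketch (the 'has_dynamic_init' bit is hypothetical and would
have to be set by the FE when it emits the markers):

  virtual bool
  gate (function *fun)
  {
    cgraph_node *node = cgraph_node::get (fun->decl);
    return (optimize
            && DECL_ARTIFICIAL (fun->decl)
            && DECL_STATIC_CONSTRUCTOR (fun->decl)
            && node
            && node->has_dynamic_init);
  }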

Is there any reason you run the pass before pass_store_merging?  It
does seem to rely on it to some extent - in fact it looks like
both might be married somehow?  I don't think that doing it "early"
for the sake of loop optimizations is worth the trouble (doing
it way earlier for the sake of IPA would be another thing though).

> Anyway, forgot to do another set of bootstraps with gathering statistics how
> many vars were optimized, so just trying to figure it out from the sizes of
> _GLOBAL__sub_I_* functions:
> 
> # Without patch, x86_64-linux cc1plus
> $ readelf -Ws obj50/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 13934
> # With the patch, x86_64-linux cc1plus
> $ readelf -Ws obj52/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 6966
> # Without patch, i686-linux cc1plus
> $ readelf -Ws obj51/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 24158
> # With the patch, i686-linux cc1plus
> $ readelf -Ws obj53/gcc/cc1plus | grep ' _GLOBAL__sub_I_' | awk 'BEGIN{I=0}{I=I+$3}END{print I}'
> 10536
> 
> That seems like a huge improvement, although on a closer look, most of that
> saving is from just one TU:
> $ readelf -Ws obj50/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 6693
> $ readelf -Ws obj52/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 1
> $ readelf -Ws obj51/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 13001
> $ readelf -Ws obj53/gcc/i386-options.o | grep ' _GLOBAL__sub_I_' | awk '{print $3}'
> 1
> So, the shrinking on all the dynamic initialization functions except
> i386-options.o is:
> 7241 -> 6965 for 64-bit and
> 11157 -> 10535 for 32-bit.
> Will try to use constexpr for i386-options.c later today.
> 
> Another optimization that could be useful but not sure if it can be easily
> done is if we before expansion of the _GLOBAL__sub_I_* functions end up with
> nothing in their body (that's those 1 byte functions on x86) perhaps either
> not emit those functions at all or at least don't register them in
> .init_array etc. so that cycles aren't wasted at runtime:
> $ readelf -Ws obj50/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 4
> $ readelf -Ws obj52/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 87
> $ readelf -Ws obj51/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 4
> $ readelf -Ws obj53/gcc/{*,*/*}.o | grep ' _GLOBAL__sub_I_' | awk '($3 == 1){print $3}' | wc -l
> 84
> 
> Also, wonder if I should add some new -f* option to control the optimization
> or doing it always at -O+ with -fdisable-tree-pass-dyninit as a way to
> disable it is good enough, and whether the 1024 hardcoded constant
> (upper bound on optimized size so that we don't spend huge amounts of
> compile time trying to optimize initializers of gigabyte sizes) shouldn't be
> a param.

I also see you gate .DYNAMIC_INIT_* creation on 'optimize' but only
schedule the pass in the O1+ pipeline, missing out on -Og.  I suppose
for -Og not creating .DYNAMIC_INIT_* would be reasonable.

Some more comments inline.

> Bootstrapped/regtested on x86_64-linux and i686-linux.
> 
> 2021-11-04  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR c++/102876
> gcc/
> 	* internal-fn.def (DYNAMIC_INIT_START, DYNAMIC_INIT_END): New internal
> 	functions.
> 	* internal-fn.c (expand_DYNAMIC_INIT_START, expand_DYNAMIC_INIT_END):
> 	New functions.
> 	* tree-pass.h (make_pass_dyninit): Declare.
> 	* passes.def (pass_dyninit): Add after dce4.
> 	* gimple-ssa-store-merging.c (pass_data_dyninit): New variable.
> 	(class pass_dyninit): New type.
> 	(pass_dyninit::execute): New method.
> 	(make_pass_dyninit): New function.
> gcc/cp/
> 	* decl2.c (one_static_initialization_or_destruction): Emit
> 	.DYNAMIC_INIT_START and .DYNAMIC_INIT_END internal calls around
> 	dynamic initialization of variables that don't need a guard.
> gcc/testsuite/
> 	* g++.dg/opt/init3.C: New test.
> 
> --- gcc/internal-fn.def.jj	2021-11-02 09:05:47.029664211 +0100
> +++ gcc/internal-fn.def	2021-11-02 12:40:38.702436113 +0100
> @@ -367,6 +367,10 @@ DEF_INTERNAL_FN (PHI, 0, NULL)
>     automatic variable.  */
>  DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  
> +/* Mark start and end of dynamic initialization of a variable.  */
> +DEF_INTERNAL_FN (DYNAMIC_INIT_START, ECF_LEAF | ECF_NOTHROW, ". r ")
> +DEF_INTERNAL_FN (DYNAMIC_INIT_END, ECF_LEAF | ECF_NOTHROW, ". r ")
> +
>  /* DIM_SIZE and DIM_POS return the size of a particular compute
>     dimension and the executing thread's position within that
>     dimension.  DIM_POS is pure (and not const) so that it isn't
> --- gcc/internal-fn.c.jj	2021-11-02 09:05:47.029664211 +0100
> +++ gcc/internal-fn.c	2021-11-02 12:40:38.703436099 +0100
> @@ -3485,6 +3485,16 @@ expand_CO_ACTOR (internal_fn, gcall *)
>    gcc_unreachable ();
>  }
>  
> +static void
> +expand_DYNAMIC_INIT_START (internal_fn, gcall *)
> +{
> +}
> +
> +static void
> +expand_DYNAMIC_INIT_END (internal_fn, gcall *)
> +{
> +}
> +
>  /* Expand a call to FN using the operands in STMT.  FN has a single
>     output operand and NARGS input operands.  */
>  
> --- gcc/tree-pass.h.jj	2021-10-28 11:29:01.891721153 +0200
> +++ gcc/tree-pass.h	2021-11-02 14:15:00.139185088 +0100
> @@ -445,6 +445,7 @@ extern gimple_opt_pass *make_pass_cse_re
>  extern gimple_opt_pass *make_pass_cse_sincos (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_optimize_bswap (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_store_merging (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_dyninit (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_optimize_widening_mul (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_warn_function_return (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_warn_function_noreturn (gcc::context *ctxt);
> --- gcc/passes.def.jj	2021-11-01 14:37:06.685853324 +0100
> +++ gcc/passes.def	2021-11-02 14:23:47.836715821 +0100
> @@ -261,6 +261,7 @@ along with GCC; see the file COPYING3.
>        NEXT_PASS (pass_tsan);
>        NEXT_PASS (pass_dse);
>        NEXT_PASS (pass_dce);
> +      NEXT_PASS (pass_dyninit);
>        /* Pass group that runs when 1) enabled, 2) there are loops
>  	 in the function.  Make sure to run pass_fix_loops before
>  	 to discover/remove loops before running the gate function
> --- gcc/gimple-ssa-store-merging.c.jj	2021-09-01 12:06:19.488211919 +0200
> +++ gcc/gimple-ssa-store-merging.c	2021-11-03 18:02:55.190015359 +0100
> @@ -170,6 +170,8 @@
>  #include "optabs-tree.h"
>  #include "dbgcnt.h"
>  #include "selftest.h"
> +#include "cgraph.h"
> +#include "varasm.h"
>  
>  /* The maximum size (in bits) of the stores this pass should generate.  */
>  #define MAX_STORE_BITSIZE (BITS_PER_WORD)
> @@ -5465,6 +5467,334 @@ pass_store_merging::execute (function *f
>    return 0;
>  }
>  
> +/* Pass to optimize C++ dynamic initialization.  */
> +
> +const pass_data pass_data_dyninit = {
> +  GIMPLE_PASS,     /* type */
> +  "dyninit",	   /* name */
> +  OPTGROUP_NONE,   /* optinfo_flags */
> +  TV_GIMPLE_STORE_MERGING,	 /* tv_id */
> +  PROP_ssa,	/* properties_required */
> +  0,		   /* properties_provided */
> +  0,		   /* properties_destroyed */
> +  0,		   /* todo_flags_start */
> +  0,		/* todo_flags_finish */
> +};
> +
> +class pass_dyninit : public gimple_opt_pass
> +{
> +public:
> +  pass_dyninit (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_dyninit, ctxt)
> +  {
> +  }
> +
> +  virtual bool
> +  gate (function *fun)
> +  {
> +    return (DECL_ARTIFICIAL (fun->decl)
> +	    && DECL_STATIC_CONSTRUCTOR (fun->decl)
> +	    && optimize);
> +  }
> +
> +  virtual unsigned int execute (function *);
> +}; // class pass_dyninit
> +
> +unsigned int
> +pass_dyninit::execute (function *fun)
> +{
> +  basic_block bb;
> +  auto_vec<gimple *, 32> ifns;
> +  hash_map<tree, gimple *> *map = NULL;
> +  auto_vec<tree, 32> vars;
> +  gimple **cur = NULL;
> +  bool ssdf_calls = false;
> +
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
> +	   !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  gimple *stmt = gsi_stmt (gsi);
> +	  if (is_gimple_debug (stmt))
> +	    continue;
> +
> +	  /* The C++ FE can wrap dynamic initialization of certain
> +	     variables with a pair of iternal function calls, like:
> +	     .DYNAMIC_INIT_START (&b, 0);
> +	     b = 1;
> +	     .DYNAMIC_INIT_END (&b);
> +
> +	     or
> +	     .DYNAMIC_INIT_START (&e, 1);
> +	     # DEBUG this => &e.f
> +	     MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
> +	     MEM[(struct S *)&e + 4B].a = 1;
> +	     MEM[(struct S *)&e + 4B].b = 2;
> +	     MEM[(struct S *)&e + 4B].c = 3;
> +	     # DEBUG BEGIN_STMT
> +	     MEM[(struct S *)&e + 4B].d = 6;
> +	     # DEBUG this => NULL
> +	     .DYNAMIC_INIT_END (&e);
> +
> +	     Verify if there are only stores of constants to the corresponding
> +	     variable or parts of that variable and if so, try to reconstruct
> +	     a static initializer from the static initializer if any and
> +	     the constant stores into the variable.  This is permitted by
> +	     [basic.start.static]/3.  */
> +	  if (is_gimple_call (stmt))
> +	    {
> +	      if (gimple_call_internal_p (stmt, IFN_DYNAMIC_INIT_START))

this overload already tests is_gimple_call

> +		{
> +		  ifns.safe_push (stmt);
> +		  if (cur)
> +		    *cur = NULL;
> +		  tree arg = gimple_call_arg (stmt, 0);
> +		  gcc_assert (TREE_CODE (arg) == ADDR_EXPR
> +			      && DECL_P (TREE_OPERAND (arg, 0)));
> +		  tree var = TREE_OPERAND (arg, 0);
> +		  gcc_checking_assert (is_global_var (var));
> +		  varpool_node *node = varpool_node::get (var);
> +		  if (node == NULL
> +		      || node->in_other_partition
> +		      || TREE_ASM_WRITTEN (var)
> +		      || DECL_SIZE_UNIT (var) == NULL_TREE
> +		      || !tree_fits_uhwi_p (DECL_SIZE_UNIT (var))
> +		      || tree_to_uhwi (DECL_SIZE_UNIT (var)) > 1024

this should maybe be a --param?

Did you gather any statistics, on codebases other than GCC, on which of the
various checks prevent eliding the dynamic initialization and which of those
we could mitigate in the future?
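A possible params.opt entry for that (the name and default here are just
made up to match the hardcoded 1024):

  -param=dyninit-max-var-size=
  Common Joined UInteger Var(param_dyninit_max_var_size) Init(1024) Param Optimization
  Maximum size in bytes of a variable whose dynamic initialization may be folded into a static initializer.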

> +		      || TYPE_SIZE_UNIT (TREE_TYPE (var)) == NULL_TREE
> +		      || !tree_int_cst_equal (TYPE_SIZE_UNIT (TREE_TYPE (var)),
> +					      DECL_SIZE_UNIT (var)))
> +		    continue;
> +		  if (map == NULL)
> +		    map = new hash_map<tree, gimple *> (61);
> +		  bool existed_p;
> +		  cur = &map->get_or_insert (var, &existed_p);
> +		  if (existed_p)
> +		    {
> +		      /* Punt if we see more than one .DYNAMIC_INIT_START
> +			 internal call for the same variable.  */

how can this happen?

> +		      *cur = NULL;
> +		      cur = NULL;
> +		    }
> +		  else
> +		    {
> +		      *cur = stmt;
> +		      vars.safe_push (var);
> +		    }
> +		  continue;
> +		}
> +	      else if (gimple_call_internal_p (stmt, IFN_DYNAMIC_INIT_END))
> +		{
> +		  ifns.safe_push (stmt);
> +		  tree arg = gimple_call_arg (stmt, 0);
> +		  gcc_assert (TREE_CODE (arg) == ADDR_EXPR
> +			      && DECL_P (TREE_OPERAND (arg, 0)));
> +		  tree var = TREE_OPERAND (arg, 0);
> +		  gcc_checking_assert (is_global_var (var));
> +		  if (cur)
> +		    {
> +		      /* Punt if .DYNAMIC_INIT_END call argument doesn't
> +			 pair with .DYNAMIC_INIT_START.  */
> +		      if (vars.last () != var)
> +			*cur = NULL;
> +		      cur = NULL;
> +		    }
> +		  continue;
> +		}
> +
> +	      /* Punt if we see any artificial
> +		 __static_initialization_and_destruction_* calls, e.g. if
> +		 it would be partially inlined, because we wouldn't then see
> +		 all .DYNAMIC_INIT_* calls.  */
> +	      tree fndecl = gimple_call_fndecl (stmt);
> +	      if (fndecl
> +		  && DECL_ARTIFICIAL (fndecl)
> +		  && DECL_NAME (fndecl)
> +		  && startswith (IDENTIFIER_POINTER (DECL_NAME (fndecl)),
> +				 "__static_initialization_and_destruction_"))
> +		ssdf_calls = true;

Ugh, that looks unreliable - but how's that a problem if we saw
both START/END ifns for a decl?

> +	    }
> +	  if (cur)
> +	    {
> +	      if (store_valid_for_store_merging_p (stmt))
> +		{
> +		  tree lhs = gimple_assign_lhs (stmt);
> +		  tree rhs = gimple_assign_rhs1 (stmt);
> +		  poly_int64 bitsize, bitpos;
> +		  HOST_WIDE_INT ibitsize, ibitpos;
> +		  machine_mode mode;
> +		  int unsignedp, reversep, volatilep = 0;
> +		  tree offset;
> +		  tree var = vars.last ();
> +		  if (rhs_valid_for_store_merging_p (rhs)
> +		      && get_inner_reference (lhs, &bitsize, &bitpos, &offset,
> +					      &mode, &unsignedp, &reversep,
> +					      &volatilep) == var
> +		      && !reversep
> +		      && !volatilep
> +		      && (offset == NULL_TREE || integer_zerop (offset))
> +		      && bitsize.is_constant (&ibitsize)
> +		      && bitpos.is_constant (&ibitpos)
> +		      && ibitpos >= 0
> +		      && ibitsize <= tree_to_shwi (DECL_SIZE (var))
> +		      && ibitsize + ibitpos <= tree_to_shwi (DECL_SIZE (var)))
> +		    continue;
> +		}
> +	      *cur = NULL;
> +	      cur = NULL;
> +	    }
> +	}
> +      if (cur)
> +	{
> +	  *cur = NULL;
> +	  cur = NULL;
> +	}
> +    }
> +  if (map && !ssdf_calls)
> +    {
> +      for (tree var : vars)
> +	{
> +	  gimple *g = *map->get (var);
> +	  if (g == NULL)
> +	    continue;
> +	  varpool_node *node = varpool_node::get (var);
> +	  node->get_constructor ();
> +	  tree init = DECL_INITIAL (var);
> +	  if (init == NULL)
> +	    init = build_zero_cst (TREE_TYPE (var));
> +	  gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +	  unsigned char *buf = NULL;
> +	  unsigned int buf_size = tree_to_uhwi (DECL_SIZE_UNIT (var));
> +	  bool buf_valid = false;
> +	  do
> +	    {
> +	      gsi_next (&gsi);
> +	      gimple *stmt = gsi_stmt (gsi);
> +	      if (is_gimple_debug (stmt))
> +		continue;
> +	      if (is_gimple_call (stmt))
> +		break;
> +	      if (gimple_clobber_p (stmt))
> +		continue;
> +	      tree lhs = gimple_assign_lhs (stmt);
> +	      tree rhs = gimple_assign_rhs1 (stmt);
> +	      if (lhs == var)
> +		{
> +		  /* Simple assignment to the whole variable.
> +		     rhs is the initializer.  */
> +		  buf_valid = false;
> +		  init = rhs;
> +		  continue;
> +		}
> +	      poly_int64 bitsize, bitpos;
> +	      machine_mode mode;
> +	      int unsignedp, reversep, volatilep = 0;
> +	      tree offset;
> +	      get_inner_reference (lhs, &bitsize, &bitpos, &offset,
> +				   &mode, &unsignedp, &reversep, &volatilep);
> +	      HOST_WIDE_INT ibitsize = bitsize.to_constant ();
> +	      HOST_WIDE_INT ibitpos = bitpos.to_constant ();
> +	      if (BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN
> +		  || CHAR_BIT != 8
> +		  || BITS_PER_UNIT != 8)
> +		{
> +		  g = NULL;
> +		  break;
> +		}
> +	      if (!buf_valid)
> +		{
> +		  if (buf == NULL)
> +		    buf = XNEWVEC (unsigned char, buf_size * 2);
> +		  memset (buf, 0, buf_size);
> +		  if (native_encode_initializer (init, buf, buf_size)
> +		      != (int) buf_size)
> +		    {
> +		      g = NULL;
> +		      break;
> +		    }
> +		  buf_valid = true;
> +		}
> +	      /* Otherwise go through byte representation.  */
> +	      if (!encode_tree_to_bitpos (rhs, buf, ibitsize,
> +					  ibitpos, buf_size))
> +		{
> +		  g = NULL;
> +		  break;
> +		}
> +	    }
> +	  while (1);
> +	  if (g == NULL)
> +	    {
> +	      XDELETE (buf);
> +	      continue;
> +	    }
> +	  if (buf_valid)
> +	    {
> +	      init = native_interpret_aggregate (TREE_TYPE (var), buf, 0,
> +						 buf_size);
> +	      if (init)
> +		{
> +		  /* Verify the dynamic initialization doesn't e.g. set
> +		     some padding bits to non-zero by trying to encode
> +		     it again and comparing.  */
> +		  memset (buf + buf_size, 0, buf_size);
> +		  if (native_encode_initializer (init, buf + buf_size,
> +						 buf_size) != (int) buf_size
> +		      || memcmp (buf, buf + buf_size, buf_size) != 0)
> +		    init = NULL_TREE;
> +		}
> +	    }
> +	  XDELETE (buf);
> +	  if (!init || !initializer_constant_valid_p (init, TREE_TYPE (var)))
> +	    continue;
> +	  if (integer_nonzerop (gimple_call_arg (g, 1)))
> +	    TREE_READONLY (var) = 1;
> +	  if (dump_file)
> +	    {
> +	      fprintf (dump_file, "dynamic initialization of ");
> +	      print_generic_stmt (dump_file, var, TDF_SLIM);
> +	      fprintf (dump_file, " optimized into: ");
> +	      print_generic_stmt (dump_file, init, TDF_SLIM);
> +	      if (TREE_READONLY (var))
> +		fprintf (dump_file, " and making it read-only\n");
> +	      fprintf (dump_file, "\n");
> +	    }
> +	  if (initializer_zerop (init))
> +	    DECL_INITIAL (var) = NULL_TREE;
> +	  else
> +	    DECL_INITIAL (var) = init;
> +	  gsi = gsi_for_stmt (g);
> +	  gsi_next (&gsi);
> +	  do
> +	    {
> +	      gimple *stmt = gsi_stmt (gsi);
> +	      if (is_gimple_debug (stmt))
> +		{
> +		  gsi_next (&gsi);
> +		  continue;
> +		}
> +	      if (is_gimple_call (stmt))
> +		break;
> +	      /* Remove now all the stores for the dynamic initialization.  */
> +	      unlink_stmt_vdef (stmt);
> +	      gsi_remove (&gsi, true);
> +	      if (gimple_vdef (stmt))
> +		release_ssa_name (gimple_vdef (stmt));

release_defs () should do the trick
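I.e. something like (a sketch; release_defs should release the VDEF
together with any other defs):

  unlink_stmt_vdef (stmt);
  gsi_remove (&gsi, true);
  release_defs (stmt);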

> +	    }
> +	  while (1);
> +	}
> +    }
> +  delete map;
> +  for (gimple *g : ifns)
> +    {
> +      gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +      unlink_stmt_vdef (g);
> +      gsi_remove (&gsi, true);
> +      if (gimple_vdef (g))
> +	release_ssa_name (gimple_vdef (g));

likewise.

> +    }
> +  return 0;
> +}
>  } // anon namespace
>  
>  /* Construct and return a store merging pass object.  */
> @@ -5475,6 +5805,14 @@ make_pass_store_merging (gcc::context *c
>    return new pass_store_merging (ctxt);
>  }
>  
> +/* Construct and return a dyninit pass object.  */
> +
> +gimple_opt_pass *
> +make_pass_dyninit (gcc::context *ctxt)
> +{
> +  return new pass_dyninit (ctxt);
> +}
> +
>  #if CHECKING_P
>  
>  namespace selftest {
> --- gcc/cp/decl2.c.jj	2021-11-02 09:05:47.004664566 +0100
> +++ gcc/cp/decl2.c	2021-11-03 17:18:11.395288518 +0100
> @@ -4133,13 +4133,36 @@ one_static_initialization_or_destruction
>      {
>        if (init)
>  	{
> +	  bool sanitize = sanitize_flags_p (SANITIZE_ADDRESS, decl);
> +	  if (optimize && guard == NULL_TREE && !sanitize)
> +	    {
> +	      tree t = build_fold_addr_expr (decl);
> +	      tree type = TREE_TYPE (decl);
> +	      tree is_const
> +		= constant_boolean_node (TYPE_READONLY (type)
> +					 && !cp_has_mutable_p (type),
> +					 boolean_type_node);
> +	      t = build_call_expr_internal_loc (DECL_SOURCE_LOCATION (decl),
> +						IFN_DYNAMIC_INIT_START,
> +						void_type_node, 2, t,
> +						is_const);
> +	      finish_expr_stmt (t);
> +	    }
>  	  finish_expr_stmt (init);
> -	  if (sanitize_flags_p (SANITIZE_ADDRESS, decl))
> +	  if (sanitize)
>  	    {
>  	      varpool_node *vnode = varpool_node::get (decl);
>  	      if (vnode)
>  		vnode->dynamically_initialized = 1;
>  	    }
> +	  else if (optimize && guard == NULL_TREE)
> +	    {
> +	      tree t = build_fold_addr_expr (decl);
> +	      t = build_call_expr_internal_loc (DECL_SOURCE_LOCATION (decl),
> +						IFN_DYNAMIC_INIT_END,
> +						void_type_node, 1, t);
> +	      finish_expr_stmt (t);
> +	    }
>  	}
>  
>        /* If we're using __cxa_atexit, register a function that calls the
> --- gcc/testsuite/g++.dg/opt/init3.C.jj	2021-11-03 17:53:01.872472570 +0100
> +++ gcc/testsuite/g++.dg/opt/init3.C	2021-11-03 17:52:57.484535115 +0100
> @@ -0,0 +1,31 @@
> +// PR c++/102876
> +// { dg-do compile }
> +// { dg-options "-O2 -fdump-tree-dyninit" }
> +// { dg-final { scan-tree-dump "dynamic initialization of b\[\n\r]* optimized into: 1" "dyninit" } }
> +// { dg-final { scan-tree-dump "dynamic initialization of e\[\n\r]* optimized into: {.e=5, .f={.a=1, .b=2, .c=3, .d=6}, .g=6}\[\n\r]* and making it read-only" "dyninit" } }
> +// { dg-final { scan-tree-dump "dynamic initialization of f\[\n\r]* optimized into: {.e=7, .f={.a=1, .b=2, .c=3, .d=6}, .g=1}" "dyninit" } }
> +// { dg-final { scan-tree-dump "dynamic initialization of h\[\n\r]* optimized into: {.h=8, .i={.a=1, .b=2, .c=3, .d=6}, .j=9}" "dyninit" } }
> +// { dg-final { scan-tree-dump-times "dynamic initialization of " 4 "dyninit" } }
> +// { dg-final { scan-tree-dump-times "and making it read-only" 1 "dyninit" } }
> +
> +struct S { S () : a(1), b(2), c(3), d(4) { d += 2; } int a, b, c, d; };
> +struct T { int e; S f; int g; };
> +struct U { int h; mutable S i; int j; };
> +extern int b;
> +int foo (int &);
> +int bar (int &);
> +int baz () { return 1; }
> +int qux () { return b = 2; }
> +// Dynamic initialization of a shouldn't be optimized, foo can't be inlined.
> +int a = foo (b);
> +int b = baz ();
> +// Likewise for c.
> +int c = bar (b);
> +// While qux is inlined, the dynamic initialization modifies another
> +// variable, so punt for d as well.
> +int d = qux ();
> +const T e = { 5, S (), 6 };
> +T f = { 7, S (), baz () };
> +const T &g = e;
> +const U h = { 8, S (), 9 };
> +const U &i = h;
> 
> 	Jakub
> 
>
  

Patch

--- gcc/internal-fn.def.jj	2021-11-02 09:05:47.029664211 +0100
+++ gcc/internal-fn.def	2021-11-02 12:40:38.702436113 +0100
@@ -367,6 +367,10 @@  DEF_INTERNAL_FN (PHI, 0, NULL)
    automatic variable.  */
 DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 
+/* Mark start and end of dynamic initialization of a variable.  */
+DEF_INTERNAL_FN (DYNAMIC_INIT_START, ECF_LEAF | ECF_NOTHROW, ". r ")
+DEF_INTERNAL_FN (DYNAMIC_INIT_END, ECF_LEAF | ECF_NOTHROW, ". r ")
+
 /* DIM_SIZE and DIM_POS return the size of a particular compute
    dimension and the executing thread's position within that
    dimension.  DIM_POS is pure (and not const) so that it isn't
--- gcc/internal-fn.c.jj	2021-11-02 09:05:47.029664211 +0100
+++ gcc/internal-fn.c	2021-11-02 12:40:38.703436099 +0100
@@ -3485,6 +3485,16 @@  expand_CO_ACTOR (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+static void
+expand_DYNAMIC_INIT_START (internal_fn, gcall *)
+{
+}
+
+static void
+expand_DYNAMIC_INIT_END (internal_fn, gcall *)
+{
+}
+
 /* Expand a call to FN using the operands in STMT.  FN has a single
    output operand and NARGS input operands.  */
 
--- gcc/tree-pass.h.jj	2021-10-28 11:29:01.891721153 +0200
+++ gcc/tree-pass.h	2021-11-02 14:15:00.139185088 +0100
@@ -445,6 +445,7 @@  extern gimple_opt_pass *make_pass_cse_re
 extern gimple_opt_pass *make_pass_cse_sincos (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_optimize_bswap (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_store_merging (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_dyninit (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_optimize_widening_mul (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_warn_function_return (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_warn_function_noreturn (gcc::context *ctxt);
--- gcc/passes.def.jj	2021-11-01 14:37:06.685853324 +0100
+++ gcc/passes.def	2021-11-02 14:23:47.836715821 +0100
@@ -261,6 +261,7 @@  along with GCC; see the file COPYING3.
       NEXT_PASS (pass_tsan);
       NEXT_PASS (pass_dse);
       NEXT_PASS (pass_dce);
+      NEXT_PASS (pass_dyninit);
       /* Pass group that runs when 1) enabled, 2) there are loops
 	 in the function.  Make sure to run pass_fix_loops before
 	 to discover/remove loops before running the gate function
--- gcc/gimple-ssa-store-merging.c.jj	2021-09-01 12:06:19.488211919 +0200
+++ gcc/gimple-ssa-store-merging.c	2021-11-03 18:02:55.190015359 +0100
@@ -170,6 +170,8 @@ 
 #include "optabs-tree.h"
 #include "dbgcnt.h"
 #include "selftest.h"
+#include "cgraph.h"
+#include "varasm.h"
 
 /* The maximum size (in bits) of the stores this pass should generate.  */
 #define MAX_STORE_BITSIZE (BITS_PER_WORD)
@@ -5465,6 +5467,334 @@  pass_store_merging::execute (function *f
   return 0;
 }
 
+/* Pass to optimize C++ dynamic initialization.  */
+
+const pass_data pass_data_dyninit = {
+  GIMPLE_PASS,     /* type */
+  "dyninit",	   /* name */
+  OPTGROUP_NONE,   /* optinfo_flags */
+  TV_GIMPLE_STORE_MERGING,	 /* tv_id */
+  PROP_ssa,	/* properties_required */
+  0,		   /* properties_provided */
+  0,		   /* properties_destroyed */
+  0,		   /* todo_flags_start */
+  0,		/* todo_flags_finish */
+};
+
+class pass_dyninit : public gimple_opt_pass
+{
+public:
+  pass_dyninit (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_dyninit, ctxt)
+  {
+  }
+
+  virtual bool
+  gate (function *fun)
+  {
+    return (DECL_ARTIFICIAL (fun->decl)
+	    && DECL_STATIC_CONSTRUCTOR (fun->decl)
+	    && optimize);
+  }
+
+  virtual unsigned int execute (function *);
+}; // class pass_dyninit
+
+unsigned int
+pass_dyninit::execute (function *fun)
+{
+  basic_block bb;
+  auto_vec<gimple *, 32> ifns;
+  hash_map<tree, gimple *> *map = NULL;
+  auto_vec<tree, 32> vars;
+  gimple **cur = NULL;
+  bool ssdf_calls = false;
+
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      for (gimple_stmt_iterator gsi = gsi_after_labels (bb);
+	   !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (is_gimple_debug (stmt))
+	    continue;
+
+	  /* The C++ FE can wrap dynamic initialization of certain
+	     variables with a pair of internal function calls, like:
+	     .DYNAMIC_INIT_START (&b, 0);
+	     b = 1;
+	     .DYNAMIC_INIT_END (&b);
+
+	     or
+	     .DYNAMIC_INIT_START (&e, 1);
+	     # DEBUG this => &e.f
+	     MEM[(struct S *)&e + 4B] ={v} {CLOBBER};
+	     MEM[(struct S *)&e + 4B].a = 1;
+	     MEM[(struct S *)&e + 4B].b = 2;
+	     MEM[(struct S *)&e + 4B].c = 3;
+	     # DEBUG BEGIN_STMT
+	     MEM[(struct S *)&e + 4B].d = 6;
+	     # DEBUG this => NULL
+	     .DYNAMIC_INIT_END (&e);
+
+	     Verify if there are only stores of constants to the corresponding
+	     variable or parts of that variable and if so, try to reconstruct
+	     a static initializer from the static initializer if any and
+	     the constant stores into the variable.  This is permitted by
+	     [basic.start.static]/3.  */
+	  if (is_gimple_call (stmt))
+	    {
+	      if (gimple_call_internal_p (stmt, IFN_DYNAMIC_INIT_START))
+		{
+		  ifns.safe_push (stmt);
+		  if (cur)
+		    *cur = NULL;
+		  tree arg = gimple_call_arg (stmt, 0);
+		  gcc_assert (TREE_CODE (arg) == ADDR_EXPR
+			      && DECL_P (TREE_OPERAND (arg, 0)));
+		  tree var = TREE_OPERAND (arg, 0);
+		  gcc_checking_assert (is_global_var (var));
+		  varpool_node *node = varpool_node::get (var);
+		  if (node == NULL
+		      || node->in_other_partition
+		      || TREE_ASM_WRITTEN (var)
+		      || DECL_SIZE_UNIT (var) == NULL_TREE
+		      || !tree_fits_uhwi_p (DECL_SIZE_UNIT (var))
+		      || tree_to_uhwi (DECL_SIZE_UNIT (var)) > 1024
+		      || TYPE_SIZE_UNIT (TREE_TYPE (var)) == NULL_TREE
+		      || !tree_int_cst_equal (TYPE_SIZE_UNIT (TREE_TYPE (var)),
+					      DECL_SIZE_UNIT (var)))
+		    continue;
+		  if (map == NULL)
+		    map = new hash_map<tree, gimple *> (61);
+		  bool existed_p;
+		  cur = &map->get_or_insert (var, &existed_p);
+		  if (existed_p)
+		    {
+		      /* Punt if we see more than one .DYNAMIC_INIT_START
+			 internal call for the same variable.  */
+		      *cur = NULL;
+		      cur = NULL;
+		    }
+		  else
+		    {
+		      *cur = stmt;
+		      vars.safe_push (var);
+		    }
+		  continue;
+		}
+	      else if (gimple_call_internal_p (stmt, IFN_DYNAMIC_INIT_END))
+		{
+		  ifns.safe_push (stmt);
+		  tree arg = gimple_call_arg (stmt, 0);
+		  gcc_assert (TREE_CODE (arg) == ADDR_EXPR
+			      && DECL_P (TREE_OPERAND (arg, 0)));
+		  tree var = TREE_OPERAND (arg, 0);
+		  gcc_checking_assert (is_global_var (var));
+		  if (cur)
+		    {
+		      /* Punt if .DYNAMIC_INIT_END call argument doesn't
+			 pair with .DYNAMIC_INIT_START.  */
+		      if (vars.last () != var)
+			*cur = NULL;
+		      cur = NULL;
+		    }
+		  continue;
+		}
+
+	      /* Punt if we see any artificial
+		 __static_initialization_and_destruction_* calls, e.g. if
+		 it would be partially inlined, because we wouldn't then see
+		 all .DYNAMIC_INIT_* calls.  */
+	      tree fndecl = gimple_call_fndecl (stmt);
+	      if (fndecl
+		  && DECL_ARTIFICIAL (fndecl)
+		  && DECL_NAME (fndecl)
+		  && startswith (IDENTIFIER_POINTER (DECL_NAME (fndecl)),
+				 "__static_initialization_and_destruction_"))
+		ssdf_calls = true;
+	    }
+	  if (cur)
+	    {
+	      if (store_valid_for_store_merging_p (stmt))
+		{
+		  tree lhs = gimple_assign_lhs (stmt);
+		  tree rhs = gimple_assign_rhs1 (stmt);
+		  poly_int64 bitsize, bitpos;
+		  HOST_WIDE_INT ibitsize, ibitpos;
+		  machine_mode mode;
+		  int unsignedp, reversep, volatilep = 0;
+		  tree offset;
+		  tree var = vars.last ();
+		  if (rhs_valid_for_store_merging_p (rhs)
+		      && get_inner_reference (lhs, &bitsize, &bitpos, &offset,
+					      &mode, &unsignedp, &reversep,
+					      &volatilep) == var
+		      && !reversep
+		      && !volatilep
+		      && (offset == NULL_TREE || integer_zerop (offset))
+		      && bitsize.is_constant (&ibitsize)
+		      && bitpos.is_constant (&ibitpos)
+		      && ibitpos >= 0
+		      && ibitsize <= tree_to_shwi (DECL_SIZE (var))
+		      && ibitsize + ibitpos <= tree_to_shwi (DECL_SIZE (var)))
+		    continue;
+		}
+	      *cur = NULL;
+	      cur = NULL;
+	    }
+	}
+      if (cur)
+	{
+	  *cur = NULL;
+	  cur = NULL;
+	}
+    }
+  if (map && !ssdf_calls)
+    {
+      for (tree var : vars)
+	{
+	  gimple *g = *map->get (var);
+	  if (g == NULL)
+	    continue;
+	  varpool_node *node = varpool_node::get (var);
+	  node->get_constructor ();
+	  tree init = DECL_INITIAL (var);
+	  if (init == NULL)
+	    init = build_zero_cst (TREE_TYPE (var));
+	  gimple_stmt_iterator gsi = gsi_for_stmt (g);
+	  unsigned char *buf = NULL;
+	  unsigned int buf_size = tree_to_uhwi (DECL_SIZE_UNIT (var));
+	  bool buf_valid = false;
+	  do
+	    {
+	      gsi_next (&gsi);
+	      gimple *stmt = gsi_stmt (gsi);
+	      if (is_gimple_debug (stmt))
+		continue;
+	      if (is_gimple_call (stmt))
+		break;
+	      if (gimple_clobber_p (stmt))
+		continue;
+	      tree lhs = gimple_assign_lhs (stmt);
+	      tree rhs = gimple_assign_rhs1 (stmt);
+	      if (lhs == var)
+		{
+		  /* Simple assignment to the whole variable.
+		     rhs is the initializer.  */
+		  buf_valid = false;
+		  init = rhs;
+		  continue;
+		}
+	      poly_int64 bitsize, bitpos;
+	      machine_mode mode;
+	      int unsignedp, reversep, volatilep = 0;
+	      tree offset;
+	      get_inner_reference (lhs, &bitsize, &bitpos, &offset,
+				   &mode, &unsignedp, &reversep, &volatilep);
+	      HOST_WIDE_INT ibitsize = bitsize.to_constant ();
+	      HOST_WIDE_INT ibitpos = bitpos.to_constant ();
+	      if (BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN
+		  || CHAR_BIT != 8
+		  || BITS_PER_UNIT != 8)
+		{
+		  g = NULL;
+		  break;
+		}
+	      if (!buf_valid)
+		{
+		  if (buf == NULL)
+		    buf = XNEWVEC (unsigned char, buf_size * 2);
+		  memset (buf, 0, buf_size);
+		  if (native_encode_initializer (init, buf, buf_size)
+		      != (int) buf_size)
+		    {
+		      g = NULL;
+		      break;
+		    }
+		  buf_valid = true;
+		}
+	      /* Otherwise go through byte representation.  */
+	      if (!encode_tree_to_bitpos (rhs, buf, ibitsize,
+					  ibitpos, buf_size))
+		{
+		  g = NULL;
+		  break;
+		}
+	    }
+	  while (1);
+	  if (g == NULL)
+	    {
+	      XDELETE (buf);
+	      continue;
+	    }
+	  if (buf_valid)
+	    {
+	      init = native_interpret_aggregate (TREE_TYPE (var), buf, 0,
+						 buf_size);
+	      if (init)
+		{
+		  /* Verify the dynamic initialization doesn't e.g. set
+		     some padding bits to non-zero by trying to encode
+		     it again and comparing.  */
+		  memset (buf + buf_size, 0, buf_size);
+		  if (native_encode_initializer (init, buf + buf_size,
+						 buf_size) != (int) buf_size
+		      || memcmp (buf, buf + buf_size, buf_size) != 0)
+		    init = NULL_TREE;
+		}
+	    }
+	  XDELETE (buf);
+	  if (!init || !initializer_constant_valid_p (init, TREE_TYPE (var)))
+	    continue;
+	  if (integer_nonzerop (gimple_call_arg (g, 1)))
+	    TREE_READONLY (var) = 1;
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "dynamic initialization of ");
+	      print_generic_stmt (dump_file, var, TDF_SLIM);
+	      fprintf (dump_file, " optimized into: ");
+	      print_generic_stmt (dump_file, init, TDF_SLIM);
+	      if (TREE_READONLY (var))
+		fprintf (dump_file, " and making it read-only\n");
+	      fprintf (dump_file, "\n");
+	    }
+	  if (initializer_zerop (init))
+	    DECL_INITIAL (var) = NULL_TREE;
+	  else
+	    DECL_INITIAL (var) = init;
+	  gsi = gsi_for_stmt (g);
+	  gsi_next (&gsi);
+	  do
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      if (is_gimple_debug (stmt))
+		{
+		  gsi_next (&gsi);
+		  continue;
+		}
+	      if (is_gimple_call (stmt))
+		break;
+	      /* Remove now all the stores for the dynamic initialization.  */
+	      unlink_stmt_vdef (stmt);
+	      gsi_remove (&gsi, true);
+	      if (gimple_vdef (stmt))
+		release_ssa_name (gimple_vdef (stmt));
+	    }
+	  while (1);
+	}
+    }
+  delete map;
+  for (gimple *g : ifns)
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (g);
+      unlink_stmt_vdef (g);
+      gsi_remove (&gsi, true);
+      if (gimple_vdef (g))
+	release_ssa_name (gimple_vdef (g));
+    }
+  return 0;
+}
 } // anon namespace
 
 /* Construct and return a store merging pass object.  */
@@ -5475,6 +5805,14 @@  make_pass_store_merging (gcc::context *c
   return new pass_store_merging (ctxt);
 }
 
+/* Construct and return a dyninit pass object.  */
+
+gimple_opt_pass *
+make_pass_dyninit (gcc::context *ctxt)
+{
+  return new pass_dyninit (ctxt);
+}
+
 #if CHECKING_P
 
 namespace selftest {
--- gcc/cp/decl2.c.jj	2021-11-02 09:05:47.004664566 +0100
+++ gcc/cp/decl2.c	2021-11-03 17:18:11.395288518 +0100
@@ -4133,13 +4133,36 @@  one_static_initialization_or_destruction
     {
       if (init)
 	{
+	  bool sanitize = sanitize_flags_p (SANITIZE_ADDRESS, decl);
+	  if (optimize && guard == NULL_TREE && !sanitize)
+	    {
+	      tree t = build_fold_addr_expr (decl);
+	      tree type = TREE_TYPE (decl);
+	      tree is_const
+		= constant_boolean_node (TYPE_READONLY (type)
+					 && !cp_has_mutable_p (type),
+					 boolean_type_node);
+	      t = build_call_expr_internal_loc (DECL_SOURCE_LOCATION (decl),
+						IFN_DYNAMIC_INIT_START,
+						void_type_node, 2, t,
+						is_const);
+	      finish_expr_stmt (t);
+	    }
 	  finish_expr_stmt (init);
-	  if (sanitize_flags_p (SANITIZE_ADDRESS, decl))
+	  if (sanitize)
 	    {
 	      varpool_node *vnode = varpool_node::get (decl);
 	      if (vnode)
 		vnode->dynamically_initialized = 1;
 	    }
+	  else if (optimize && guard == NULL_TREE)
+	    {
+	      tree t = build_fold_addr_expr (decl);
+	      t = build_call_expr_internal_loc (DECL_SOURCE_LOCATION (decl),
+						IFN_DYNAMIC_INIT_END,
+						void_type_node, 1, t);
+	      finish_expr_stmt (t);
+	    }
 	}
 
       /* If we're using __cxa_atexit, register a function that calls the
--- gcc/testsuite/g++.dg/opt/init3.C.jj	2021-11-03 17:53:01.872472570 +0100
+++ gcc/testsuite/g++.dg/opt/init3.C	2021-11-03 17:52:57.484535115 +0100
@@ -0,0 +1,31 @@ 
+// PR c++/102876
+// { dg-do compile }
+// { dg-options "-O2 -fdump-tree-dyninit" }
+// { dg-final { scan-tree-dump "dynamic initialization of b\[\n\r]* optimized into: 1" "dyninit" } }
+// { dg-final { scan-tree-dump "dynamic initialization of e\[\n\r]* optimized into: {.e=5, .f={.a=1, .b=2, .c=3, .d=6}, .g=6}\[\n\r]* and making it read-only" "dyninit" } }
+// { dg-final { scan-tree-dump "dynamic initialization of f\[\n\r]* optimized into: {.e=7, .f={.a=1, .b=2, .c=3, .d=6}, .g=1}" "dyninit" } }
+// { dg-final { scan-tree-dump "dynamic initialization of h\[\n\r]* optimized into: {.h=8, .i={.a=1, .b=2, .c=3, .d=6}, .j=9}" "dyninit" } }
+// { dg-final { scan-tree-dump-times "dynamic initialization of " 4 "dyninit" } }
+// { dg-final { scan-tree-dump-times "and making it read-only" 1 "dyninit" } }
+
+struct S { S () : a(1), b(2), c(3), d(4) { d += 2; } int a, b, c, d; };
+struct T { int e; S f; int g; };
+struct U { int h; mutable S i; int j; };
+extern int b;
+int foo (int &);
+int bar (int &);
+int baz () { return 1; }
+int qux () { return b = 2; }
+// Dynamic initialization of a shouldn't be optimized, foo can't be inlined.
+int a = foo (b);
+int b = baz ();
+// Likewise for c.
+int c = bar (b);
+// While qux is inlined, the dynamic initialization modifies another
+// variable, so punt for d as well.
+int d = qux ();
+const T e = { 5, S (), 6 };
+T f = { 7, S (), baz () };
+const T &g = e;
+const U h = { 8, S (), 9 };
+const U &i = h;