[v6,2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader -- Algorithm changes

Message ID 30aa9fbd-fc08-9a6f-d34b-accb47abd8af@codesourcery.com
State Superseded
Delegated to: Adhemerval Zanella Netto
Headers
Series [v6,1/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader -- Testing infrastructure |

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent

Commit Message

Chung-Lin Tang July 20, 2021, 2:25 p.m. UTC
  Hi Florian,
This is v6 of the algorithms part of the patch set.
(there was no v5, I've bumped my version reference to be in sync with the v6 testing patch)

Besides the nits you noted, which should be fixed, the main issues you raised in the
last review round were:

1) Keeping the main program at the first of the sorting results. The raised concerns that may
    interact with this include:
    a) main executable having a soname
    b) constructors for the main executable hardwired to run last in elf/dl-init.c

    I've fixed this by introducing a new bit-field in struct link_map, literally 'l_main_map' that
    is set only for the main executable (I am not aware of another bit that provides the same condition).

    The algorithm adjustments are also quite elegant: the condition for DFS traversal is now
    "l_visited == 0 && l_main_map == 0", which will ensure that the map for the main program will
    always be placed at the start of the sorting results, when it will be at the rpo_head of the
    last call to dfs_traversal.

    I've placed l_main_map together with l_visited inside struct link_map, to enable compiler
    optimizations to extract those bits together and combine the zero-test of both bits.

    I've added a new tst-dso-ordering10 test in the Testing Infrastructure patch, that uses the
    soname feature to test:

       tst-dso-ordering10: {}->a->b->c;soname({})=c   # main program has soname of 'c'
       output: b>a>{}<a<b                             # sort will ignore DSO 'c', but still put main program at start

    Indeed, before the l_main_map changes, the new DFS algorithm segfaulted on this testcase.
    (so your prediction was spot on :) )
    After these l_main_map adjustments, both algorithms behaved the same.

2) The other issue you raised was whether reldeps could be dropped entirely (during fini, which is
    the remaining case we do that).

    It appears that, no we can't do that. It seems that dlclose operations may change the still
    existing reldeps such that the proper destructing order at some later program point needs to
    be recomputed.

    One existing test, elf/order2-cmp, already seems to be testing for this. I have experimented
    with disabling reldeps processing completely for both algorithms, and when under that mode,
    elf/order2-cmp consistently FAILs. This should be an indication that reldeps is still needed.

    Conceptually, this means that we need to (potentially) update any existing in-memory link_map
    order when dlclose'ing, and then maybe we can achieve the sort-less closing you mentioned.
    But that may not be much better than simply doing a sort at close-time, if the sort is fast
    enough. (and the dependency update itself may turn out to be another re-sort-and-replace-current-order,
    which, you know, is kind of like just simply doing close-time sorting)

    (BTW, I also tried a bit to recreate the elf/order2 test program using the scripts/dso-ordering-test.py
     description format, but currently it appears to be not expressable yet. Some new features are needed,
     particularly allowing to describe constructor/destructor functions with calls into other
     objects. Decided to not pursue that one right now)

3) A smaller issue was simply using 'break' instead of the 'goto end' in _dl_sort_maps_dfs, relying on the
    compiler to deduct that the assert was satisfied at that point.

    GCC does appear to do that, at least 7.5 which I have on my system. But glibc appears to support down
    to GCC 6.2, and I did not have that version handy to verify if the assert eliding worked there.
    So I left the goto as is, for sake of always optimizing this.

Not sure if this is in time to get into the upcoming release, but hope this version addresses all known issues.

Thanks,
Chung-Lin

2021-07-20  Chung-Lin Tang  <cltang@codesourcery.com>

	[BZ #17645]
	[BZ #15311]
	[BZ #15310]

	* elf/dl-close.c (_dl_close_worker): Remove used[], done[] char arrays,
	use l_map_used/l_map_done bitfields instead. Adjust call to _dl_sort_maps.
	* elf/dl-deps.c (_dl_map_object_deps): Adjust call to _dl_sort_maps.
	* elf/dl-fini.c (_dl_fini): Likewise.
	* elf/dl-sort_maps.c
	(_dl_sort_maps_original): Adapted from original _dl_sort_maps.
	(dfs_traversal): New static function.
	(_dl_sort_maps_dfs): New Reverse-Postorder (RPO) based implementation.
	(_dl_sort_maps): Select and call either _dl_sort_maps_original or
	_dl_sort_maps_dfs based on glibc.rtld.dynamic_sort.
	* elf/dl-tunables.list (rtld): New namespace with tunable 'dynamic_sort'.
	* elf/tst-rtld-list-tunables.exp (glibc.rtld.dynamic_sort): Add new
	tunable parameters.
	* elf/rtld.c (dl_main): Initialize main_map->l_main_map bit to 1.
	* include/link.h (struct link_map): Add l_main_map:1, l_visited:1,
	l_map_used:1, and l_map_done:1 bitfields.
	* sysdeps/generic/ldsodefs.h (_dl_sort_maps): Adjust declaration.
  

Comments

Adhemerval Zanella Aug. 10, 2021, 9:08 p.m. UTC | #1
On 20/07/2021 11:25, Chung-Lin Tang wrote:
> Hi Florian,
> This is v6 of the algorithms part of the patch set.
> (there was no v5, I've bumped my version reference to be in sync with the v6 testing patch)
> 
> Besides the nits you noted, which should be fixed, the main issues you raised in the
> last review round were:

Ok, so lets move this forward as well.

> 
> 1) Keeping the main program at the first of the sorting results. The raised concerns that may
>    interact with this include:
>    a) main executable having a soname
>    b) constructors for the main executable hardwired to run last in elf/dl-init.c
> 
>    I've fixed this by introducing a new bit-field in struct link_map, literally 'l_main_map' that
>    is set only for the main executable (I am not aware of another bit that provides the same condition).
> 
>    The algorithm adjustments are also quite elegant: the condition for DFS traversal is now
>    "l_visited == 0 && l_main_map == 0", which will ensure that the map for the main program will
>    always be placed at the start of the sorting results, when it will be at the rpo_head of the
>    last call to dfs_traversal.
> 
>    I've placed l_main_map together with l_visited inside struct link_map, to enable compiler
>    optimizations to extract those bits together and combine the zero-test of both bits.
> 
>    I've added a new tst-dso-ordering10 test in the Testing Infrastructure patch, that uses the
>    soname feature to test:
> 
>       tst-dso-ordering10: {}->a->b->c;soname({})=c   # main program has soname of 'c'
>       output: b>a>{}<a<b                             # sort will ignore DSO 'c', but still put main program at start

I think it worth to mention is was done on the previous patch.

> 
>    Indeed, before the l_main_map changes, the new DFS algorithm segfaulted on this testcase.
>    (so your prediction was spot on :) )
>    After these l_main_map adjustments, both algorithms behaved the same.
> 
> 2) The other issue you raised was whether reldeps could be dropped entirely (during fini, which is
>    the remaining case we do that).
> 
>    It appears that, no we can't do that. It seems that dlclose operations may change the still
>    existing reldeps such that the proper destructing order at some later program point needs to
>    be recomputed.
> 
>    One existing test, elf/order2-cmp, already seems to be testing for this. I have experimented
>    with disabling reldeps processing completely for both algorithms, and when under that mode,
>    elf/order2-cmp consistently FAILs. This should be an indication that reldeps is still needed.
> 
>    Conceptually, this means that we need to (potentially) update any existing in-memory link_map
>    order when dlclose'ing, and then maybe we can achieve the sort-less closing you mentioned.
>    But that may not be much better than simply doing a sort at close-time, if the sort is fast
>    enough. (and the dependency update itself may turn out to be another re-sort-and-replace-current-order,
>    which, you know, is kind of like just simply doing close-time sorting)
> 
>    (BTW, I also tried a bit to recreate the elf/order2 test program using the scripts/dso-ordering-test.py
>     description format, but currently it appears to be not expressable yet. Some new features are needed,
>     particularly allowing to describe constructor/destructor functions with calls into other
>     objects. Decided to not pursue that one right now)

Alright, I can reevaluate the reldeps later.

> 
> 3) A smaller issue was simply using 'break' instead of the 'goto end' in _dl_sort_maps_dfs, relying on the
>    compiler to deduct that the assert was satisfied at that point.
> 
>    GCC does appear to do that, at least 7.5 which I have on my system. But glibc appears to support down
>    to GCC 6.2, and I did not have that version handy to verify if the assert eliding worked there.
>    So I left the goto as is, for sake of always optimizing this.
> 
> Not sure if this is in time to get into the upcoming release, but hope this version addresses all known issues.

I would like to have this on the start of 2.35 cycle, at least with the
current sort algorithm as default.  We can set it the newer one as default
eventually before the release, but at least we will keep testing it for
regressions.

I have tested your patch on x86_64-linux-gnu and i686-linux-gnu using
both the current and new sorting algorithm as default.  The output looks
good, but I am still reading all the bug reports and related links and
checking with provided testcases if there are any regression (at least
the BZ#17645 does show a good improvement). I plan to do some more sniff
tests on other architectures as well.

Below there are minor comments, most related to how to initialize the
sort algorithm from the tunables.

I also took the liberty to create a personal branch with your patch
plus the remarks I did on both patches [1].

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/dso-opt

> 
> Thanks,
> Chung-Lin
> 
> 2021-07-20  Chung-Lin Tang  <cltang@codesourcery.com>
> 
>     [BZ #17645]
>     [BZ #15311]
>     [BZ #15310]
> 
>     * elf/dl-close.c (_dl_close_worker): Remove used[], done[] char arrays,
>     use l_map_used/l_map_done bitfields instead. Adjust call to _dl_sort_maps.
>     * elf/dl-deps.c (_dl_map_object_deps): Adjust call to _dl_sort_maps.
>     * elf/dl-fini.c (_dl_fini): Likewise.
>     * elf/dl-sort_maps.c
>     (_dl_sort_maps_original): Adapted from original _dl_sort_maps.
>     (dfs_traversal): New static function.
>     (_dl_sort_maps_dfs): New Reverse-Postorder (RPO) based implementation.
>     (_dl_sort_maps): Select and call either _dl_sort_maps_original or
>     _dl_sort_maps_dfs based on glibc.rtld.dynamic_sort.
>     * elf/dl-tunables.list (rtld): New namespace with tunable 'dynamic_sort'.
>     * elf/tst-rtld-list-tunables.exp (glibc.rtld.dynamic_sort): Add new
>     tunable parameters.
>     * elf/rtld.c (dl_main): Initialize main_map->l_main_map bit to 1.
>     * include/link.h (struct link_map): Add l_main_map:1, l_visited:1,
>     l_map_used:1, and l_map_done:1 bitfields.
>     * sysdeps/generic/ldsodefs.h (_dl_sort_maps): Adjust declaration.

We do not use ChangeLog entries anymore, instead a patch description is
preferable.

> 
> diff --git a/elf/dl-close.c b/elf/dl-close.c
> index f39001cab9..a55b3a00fa 100644
> --- a/elf/dl-close.c
> +++ b/elf/dl-close.c
> @@ -167,8 +167,6 @@ _dl_close_worker (struct link_map *map, bool force)
>  
>    bool any_tls = false;
>    const unsigned int nloaded = ns->_ns_nloaded;
> -  char used[nloaded];
> -  char done[nloaded];
>    struct link_map *maps[nloaded];
>  
>    /* Run over the list and assign indexes to the link maps and enter

Ok.

> @@ -176,24 +174,21 @@ _dl_close_worker (struct link_map *map, bool force)
>    int idx = 0;
>    for (struct link_map *l = ns->_ns_loaded; l != NULL; l = l->l_next)
>      {
> +      l->l_map_used = 0;
> +      l->l_map_done = 0;
>        l->l_idx = idx;
>        maps[idx] = l;
>        ++idx;
> -
>      }
>    assert (idx == nloaded);
>  
> -  /* Prepare the bitmaps.  */
> -  memset (used, '\0', sizeof (used));
> -  memset (done, '\0', sizeof (done));
> -
>    /* Keep track of the lowest index link map we have covered already.  */
>    int done_index = -1;
>    while (++done_index < nloaded)
>      {
>        struct link_map *l = maps[done_index];
>  
> -      if (done[done_index])
> +      if (l->l_map_done)
>  	/* Already handled.  */
>  	continue;
>  

Ok.

> @@ -204,12 +199,12 @@ _dl_close_worker (struct link_map *map, bool force)
>  	  /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
>  	     acquire is sufficient and correct.  */
>  	  && atomic_load_acquire (&l->l_tls_dtor_count) == 0
> -	  && !used[done_index])
> +	  && !l->l_map_used)
>  	continue;
>  
>        /* We need this object and we handle it now.  */
> -      done[done_index] = 1;
> -      used[done_index] = 1;
> +      l->l_map_used = 1;
> +      l->l_map_done = 1;
>        /* Signal the object is still needed.  */
>        l->l_idx = IDX_STILL_USED;
>  

Ok.

> @@ -225,9 +220,9 @@ _dl_close_worker (struct link_map *map, bool force)
>  		{
>  		  assert ((*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded);
>  
> -		  if (!used[(*lp)->l_idx])
> +		  if (!(*lp)->l_map_used)
>  		    {
> -		      used[(*lp)->l_idx] = 1;
> +		      (*lp)->l_map_used = 1;
>  		      /* If we marked a new object as used, and we've
>  			 already processed it, then we need to go back
>  			 and process again from that point forward to

Ok.

> @@ -250,9 +245,9 @@ _dl_close_worker (struct link_map *map, bool force)
>  	      {
>  		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
>  
> -		if (!used[jmap->l_idx])
> +		if (!jmap->l_map_used)
>  		  {
> -		    used[jmap->l_idx] = 1;
> +		    jmap->l_map_used = 1;
>  		    if (jmap->l_idx - 1 < done_index)
>  		      done_index = jmap->l_idx - 1;
>  		  }

Ok.

> @@ -262,8 +257,7 @@ _dl_close_worker (struct link_map *map, bool force)
>  
>    /* Sort the entries.  We can skip looking for the binary itself which is
>       at the front of the search list for the main namespace.  */
> -  _dl_sort_maps (maps + (nsid == LM_ID_BASE), nloaded - (nsid == LM_ID_BASE),
> -		 used + (nsid == LM_ID_BASE), true);
> +  _dl_sort_maps (maps, nloaded, (nsid == LM_ID_BASE), true);
>  
>    /* Call all termination functions at once.  */
>  #ifdef SHARED

Ok, new algorithm need to take in consideration the program link_map as well.

> @@ -280,7 +274,7 @@ _dl_close_worker (struct link_map *map, bool force)
>        /* All elements must be in the same namespace.  */
>        assert (imap->l_ns == nsid);
>  
> -      if (!used[i])
> +      if (!imap->l_map_used)
>  	{
>  	  assert (imap->l_type == lt_loaded && !imap->l_nodelete_active);
>  

Ok.

> @@ -333,7 +327,7 @@ _dl_close_worker (struct link_map *map, bool force)
>  	  if (i < first_loaded)
>  	    first_loaded = i;
>  	}
> -      /* Else used[i].  */
> +      /* Else imap->l_map_used.  */
>        else if (imap->l_type == lt_loaded)
>  	{
>  	  struct r_scope_elem *new_list = NULL;
> @@ -557,7 +551,7 @@ _dl_close_worker (struct link_map *map, bool force)
>    for (unsigned int i = first_loaded; i < nloaded; ++i)
>      {
>        struct link_map *imap = maps[i];
> -      if (!used[i])
> +      if (!imap->l_map_used)
>  	{
>  	  assert (imap->l_type == lt_loaded);
>  

Ok.

> diff --git a/elf/dl-deps.c b/elf/dl-deps.c
> index 087a49b212..237d9636c5 100644
> --- a/elf/dl-deps.c
> +++ b/elf/dl-deps.c
> @@ -613,10 +613,9 @@ Filters not supported with LD_TRACE_PRELINKING"));
>  
>    /* If libc.so.6 is the main map, it participates in the sort, so
>       that the relocation order is correct regarding libc.so.6.  */
> -  if (l_initfini[0] == GL (dl_ns)[l_initfini[0]->l_ns].libc_map)
> -    _dl_sort_maps (l_initfini, nlist, NULL, false);
> -  else
> -    _dl_sort_maps (&l_initfini[1], nlist - 1, NULL, false);
> +  _dl_sort_maps (l_initfini, nlist,
> +		 (l_initfini[0] != GL (dl_ns)[l_initfini[0]->l_ns].libc_map),
> +		 false);
>  
>    /* Terminate the list of dependencies.  */
>    l_initfini[nlist] = NULL;

Ok.

> diff --git a/elf/dl-fini.c b/elf/dl-fini.c
> index 6dbdfe4b3e..c683884c35 100644
> --- a/elf/dl-fini.c
> +++ b/elf/dl-fini.c
> @@ -92,8 +92,7 @@ _dl_fini (void)
>  	  /* Now we have to do the sorting.  We can skip looking for the
>  	     binary itself which is at the front of the search list for
>  	     the main namespace.  */
> -	  _dl_sort_maps (maps + (ns == LM_ID_BASE), nmaps - (ns == LM_ID_BASE),
> -			 NULL, true);
> +	  _dl_sort_maps (maps, nmaps, (ns == LM_ID_BASE), true);
>  
>  	  /* We do not rely on the linked list of loaded object anymore
>  	     from this point on.  We have our own list here (maps).  The

Ok.

> diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
> index d21770267a..91c5009ac4 100644
> --- a/elf/dl-sort-maps.c
> +++ b/elf/dl-sort-maps.c
> @@ -16,16 +16,24 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>  
> +#include <assert.h>
>  #include <ldsodefs.h>
> +#include <elf/dl-tunables.h>
>  
> +/* Note: this is the older, "original" sorting algorithm, being used as
> +   default up to 2.32.

I don't think this comment adds much (and it requires to be updated to recent
glibc version).

>  
> -/* Sort array MAPS according to dependencies of the contained objects.
> -   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
> -   called for finishing an object.  */
> -void
> -_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
> -	       bool for_fini)
> +   Sort array MAPS according to dependencies of the contained objects.
> +   If FOR_FINI is true, this is called for finishing an object.  */
> +static void
> +_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
> +			unsigned int skip, bool for_fini)
>  {
> +  /* Allows caller to do the common optimization of skipping the first map,
> +     usually the main binary.  */
> +  maps += skip;
> +  nmaps -= skip;
> +
>    /* A list of one element need not be sorted.  */
>    if (nmaps <= 1)
>      return;

Ok, adjust the argument due on how is now called.

> @@ -66,14 +74,6 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
>  			   (k - i) * sizeof (maps[0]));
>  		  maps[k] = thisp;
>  
> -		  if (used != NULL)
> -		    {
> -		      char here_used = used[i];
> -		      memmove (&used[i], &used[i + 1],
> -			       (k - i) * sizeof (used[0]));
> -		      used[k] = here_used;
> -		    }
> -
>  		  if (seen[i + 1] > nmaps - i)
>  		    {
>  		      ++i;

Ok, the member is now move to the link_map itself.

> @@ -120,3 +120,183 @@ _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
>      next:;
>      }
>  }
> +
> +#if !HAVE_TUNABLES
> +/* In this case, just default to the original algorithm.  */
> +strong_alias (_dl_sort_maps_original, _dl_sort_maps);
> +#else
> +
> +/* We use a recursive function due to its better clarity and ease of
> +   implementation, as well as faster execution speed. We already use
> +   alloca() for list allocation during the breadth-first search of
> +   dependencies in _dl_map_object_deps(), and this should be on the
> +   same order of worst-case stack usage.

I am not sure about recursive function usage, specially the implications
regarding stack hardening (such as the ones for stack clash).  But I guess
it should not worse than the current alloca or VLA usage we have internally.

> +
> +   Note: the '*rpo' parameter is supposed to point to one past the
> +   last element of the array where we save the sort results, and is
> +   decremented before storing the current map at each level.  */
> +
> +static void
> +dfs_traversal (struct link_map ***rpo, struct link_map *map,
> +	       bool *do_reldeps)
> +{
> +  if (map->l_visited)
> +    return;
> +
> +  map->l_visited = 1;
> +
> +  if (map->l_initfini)
> +    {
> +      for (int i = 0; map->l_initfini[i] != NULL; i++)
> +	{
> +	  struct link_map *dep = map->l_initfini[i];
> +	  if (dep->l_visited == 0
> +	      && dep->l_main_map == 0)
> +	    dfs_traversal (rpo, dep, do_reldeps);
> +	}
> +    }
> +
> +  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
> +    {
> +      /* Indicate that we encountered relocation dependencies during
> +	 traversal.  */
> +      *do_reldeps = true;
> +
> +      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
> +	{
> +	  struct link_map *dep = map->l_reldeps->list[m];
> +	  if (dep->l_visited == 0
> +	      && dep->l_main_map == 0)
> +	    dfs_traversal (rpo, dep, do_reldeps);
> +	}
> +    }
> +
> +  *rpo -= 1;
> +  **rpo = map;
> +}

Ok, the recursive algorithm seems straightforward.

> +
> +/* Topologically sort array MAPS according to dependencies of the contained
> +   objects.  */
> +
> +static void
> +_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
> +		   unsigned int skip __attribute__ ((unused)), bool for_fini)
> +{
> +  for (int i = nmaps - 1; i >= 0; i--)
> +    maps[i]->l_visited = 0;
> +
> +  /* We apply DFS traversal for each of maps[i] until the whole total order
> +     is found and we're at the start of the Reverse-Postorder (RPO) sequence,
> +     which is a topological sort.
> +
> +     We go from maps[nmaps - 1] backwards towards maps[0] at this level.
> +     Due to the breadth-first search (BFS) ordering we receive, going
> +     backwards usually gives a more shallow depth-first recursion depth,
> +     adding more stack usage safety. Also, combined with the natural
> +     processing order of l_initfini[] at each node during DFS, this maintains
> +     an ordering closer to the original link ordering in the sorting results
> +     under most simpler cases.
> +
> +     Another reason we order the top level backwards, it that maps[0] is
> +     usually exactly the main object of which we're in the midst of
> +     _dl_map_object_deps() processing, and maps[0]->l_initfini[] is still
> +     blank. If we start the traversal from maps[0], since having no
> +     dependencies yet filled in, maps[0] will always be immediately
> +     incorrectly placed at the last place in the order (first in reverse).
> +     Adjusting the order so that maps[0] is last traversed naturally avoids
> +     this problem.
> +
> +     Further, the old "optimization" of skipping the main object at maps[0]
> +     from the call-site (i.e. _dl_sort_maps(maps+1,nmaps-1)) is in general
> +     no longer valid, since traversing along object dependency-links
> +     may "find" the main object even when it is not included in the initial
> +     order (e.g. a dlopen()'ed shared object can have circular dependencies
> +     linked back to itself). In such a case, traversing N-1 objects will
> +     create a N-object result, and raise problems.
> +
> +     To summarize, just passing in the full list, and iterating from back
> +     to front makes things much more straightforward.  */
> +
> +  /* Array to hold RPO sorting results, before we copy back to maps[].  */
> +  struct link_map *rpo[nmaps];
> +
> +  /* The 'head' position during each DFS iteration. Note that we start at
> +     one past the last element due to first-decrement-then-store (see the
> +     bottom of above dfs_traversal() routine).  */
> +  struct link_map **rpo_head = &rpo[nmaps];
> +
> +  bool do_reldeps = false;
> +  bool *do_reldeps_ref = (for_fini ? &do_reldeps : NULL);
> +
> +  for (int i = nmaps - 1; i >= 0; i--)
> +    {
> +      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
> +
> +      /* We can break early if all objects are already placed.  */
> +      if (rpo_head == rpo)
> +	goto end;
> +    }
> +  assert (rpo_head == rpo);
> +
> + end:
> +  /* Here we may do a second pass of sorting, using only l_initfini[]
> +     static dependency links. This is avoided if !FOR_FINI or if we didn't
> +     find any reldeps in the first DFS traversal.
> +
> +     The reason we do this is: while it is unspecified how circular
> +     dependencies should be handled, the presumed reasonable behavior is to
> +     have destructors to respect static dependency links as much as possible,
> +     overriding reldeps if needed. And the first sorting pass, which takes
> +     l_initfini/l_reldeps links equally, may not preserve this priority.
> +
> +     Hence we do a 2nd sorting pass, taking only DT_NEEDED links into account
> +     (see how the do_reldeps argument to dfs_traversal() is NULL below).  */
> +  if (do_reldeps)
> +    {
> +      for (int i = nmaps - 1; i >= 0; i--)
> +	rpo[i]->l_visited = 0;
> +
> +      struct link_map **maps_head = &maps[nmaps];
> +      for (int i = nmaps - 1; i >= 0; i--)
> +	{
> +	  dfs_traversal (&maps_head, rpo[i], NULL);
> +
> +	  /* We can break early if all objects are already placed.
> +	     The below memcpy is not needed in the do_reldeps case here,
> +	     since we wrote back to maps[] during DFS traversal.  */
> +	  if (maps_head == maps)
> +	    return;
> +	}
> +      assert (maps_head == maps);
> +      return;
> +    }
> +
> +  memcpy (maps, rpo, sizeof (struct link_map *) * nmaps);
> +}
> +
> +void
> +_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
> +	       unsigned int skip, bool for_fini)
> +{
> +  /* Index code for sorting algorithm currently in use.  */
> +  static int32_t algorithm = 0;
> +  if (__glibc_unlikely (algorithm == 0))
> +    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
> +			     int32_t, NULL);

This is not race free, a better alternative would be to add a new member on
the GLRO struct and a init function to setup the value after tunables
initialization.  Also, but using a enumeration to define to possible values,
there is no need to add the __builtin_unreachable() below.

> +
> +  /* It can be tempting to use a static function pointer to store and call
> +     the current selected sorting algorithm routine, but experimentation
> +     shows that current processors still do not handle indirect branches
> +     that efficiently, plus a static function pointer will involve
> +     PTR_MANGLE/DEMANGLE, further impairing performance of small, common
> +     input cases. A simple if-case with direct function calls appears to
> +     be the fastest.  */
> +  if (__glibc_likely (algorithm == 1))
> +    _dl_sort_maps_original (maps, nmaps, skip, for_fini);
> +  else if (algorithm == 2)
> +    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
> +  else
> +    __builtin_unreachable ();
> +}
> +
> +#endif /* HAVE_TUNABLES.  */
> diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
> index 8ddd4a2314..46ffb23784 100644
> --- a/elf/dl-tunables.list
> +++ b/elf/dl-tunables.list
> @@ -156,4 +156,13 @@ glibc {
>        security_level: SXID_IGNORE
>      }
>    }
> +
> +  rtld {
> +    dynamic_sort {
> +      type: INT_32
> +      minval: 1
> +      maxval: 2
> +      default: 1
> +    }
> +  }
>  }

Ok, although I am not sure about the security implication (my guess it should
be fine).

> diff --git a/elf/rtld.c b/elf/rtld.c
> index d733359eaf..abdc13c13a 100644
> --- a/elf/rtld.c
> +++ b/elf/rtld.c
> @@ -1389,6 +1389,9 @@ dl_main (const ElfW(Phdr) *phdr,
>        main_map->l_name = (char *) "";
>        *user_entry = main_map->l_entry;
>  
> +      /* Set bit indicating this is the main program map.  */
> +      main_map->l_main_map = 1;
> +
>  #ifdef HAVE_AUX_VECTOR
>        /* Adjust the on-stack auxiliary vector so that it looks like the
>  	 binary was executed directly.  */

Ok.

> diff --git a/elf/tst-rtld-list-tunables.exp b/elf/tst-rtld-list-tunables.exp
> index 9f66c52885..9bf572715f 100644
> --- a/elf/tst-rtld-list-tunables.exp
> +++ b/elf/tst-rtld-list-tunables.exp
> @@ -10,5 +10,6 @@ glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0x[f]+)
>  glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0x[f]+)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0x[f]+)
>  glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0x[f]+)
> +glibc.rtld.dynamic_sort: 1 (min: 1, max: 2)
>  glibc.rtld.nns: 0x4 (min: 0x1, max: 0x10)
>  glibc.rtld.optional_static_tls: 0x200 (min: 0x0, max: 0x[f]+)
> diff --git a/include/link.h b/include/link.h
> index 4af16cb596..50f45db243 100644
> --- a/include/link.h
> +++ b/include/link.h
> @@ -181,6 +181,11 @@ struct link_map
>      unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
>      unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
>      unsigned int l_reserved:2;	/* Reserved for internal use.  */
> +    unsigned int l_main_map:1;  /* Nonzero for the map of the main program.  */
> +    unsigned int l_visited:1;   /* Used internally for map dependency
> +				   graph traversal.  */
> +    unsigned int l_map_used:1;  /* These two bits are used during traversal */
> +    unsigned int l_map_done:1;  /* of maps in _dl_close_worker. */
>      unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
>  					to by `l_phdr' is allocated.  */
>      unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in

Ok.

> diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
> index 9c15259236..ad48a0d7da 100644
> --- a/sysdeps/generic/ldsodefs.h
> +++ b/sysdeps/generic/ldsodefs.h
> @@ -1084,7 +1084,7 @@ extern void _dl_fini (void) attribute_hidden;
>  
>  /* Sort array MAPS according to dependencies of the contained objects.  */
>  extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
> -			   char *used, bool for_fini) attribute_hidden;
> +			   unsigned int skip, bool for_fini) attribute_hidden;
>  
>  /* The dynamic linker calls this function before and having changing
>     any shared object mappings.  The `r_state' member of `struct r_debug'

Ok.
  
Chung-Lin Tang Oct. 5, 2021, 2:27 p.m. UTC | #2
Hi Adhemerval,

On 2021/8/11 5:08 AM, Adhemerval Zanella wrote:
>> +void
>> +_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
>> +	       unsigned int skip, bool for_fini)
>> +{
>> +  /* Index code for sorting algorithm currently in use.  */
>> +  static int32_t algorithm = 0;
>> +  if (__glibc_unlikely (algorithm == 0))
>> +    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
>> +			     int32_t, NULL);
> This is not race free, a better alternative would be to add a new member on
> the GLRO struct and a init function to setup the value after tunables
> initialization.  Also, but using a enumeration to define to possible values,
> there is no need to add the __builtin_unreachable() below.

I'm not sure why there can be a race with TUNABLE_GET(), since isn't all tunables
set only once during initialization?

That said, I agree that placing the algorithm code initializing separately in
_dl_sort_maps_init looks better.



The changes to this part (relative to my v6 patch) I see you have on the
azanella/dso-opt branch:

  void
+_dl_sort_maps_init (void)
+{
+  int32_t algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort, int32_t, NULL);
+  GLRO(dl_dso_sort_algo) = algorithm == 1 ? dso_sort_algorithm_original
+                                         : dso_sort_algorithm_dfs;
+}
+
+void
  _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
                unsigned int skip, bool for_fini)
  {
-  /* Index code for sorting algorithm currently in use.  */
-  static int32_t algorithm = 0;
-  if (__glibc_unlikely (algorithm == 0))
-    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
-                            int32_t, NULL);
-
    /* It can be tempting to use a static function pointer to store and call
       the current selected sorting algorithm routine, but experimentation
       shows that current processors still do not handle indirect branches
@@ -291,12 +293,10 @@
       PTR_MANGLE/DEMANGLE, further impairing performance of small, common
       input cases. A simple if-case with direct function calls appears to
       be the fastest.  */
-  if (__glibc_likely (algorithm == 1))
+  if (GLRO(dl_dso_sort_algo) == dso_sort_algorithm_original)
      _dl_sort_maps_original (maps, nmaps, skip, for_fini);
-  else if (algorithm == 2)
-    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
    else
-    __builtin_unreachable ();
+    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
  }


I'm not sure if defining 'enum dso_sort_algorithm' is really needed?
Is __builtin_unreachable() that undesirable?

Also, the final code should probably add __glibc_likely() to the default
algorithm case.

Thanks,
Chung-Lin
  
Adhemerval Zanella Oct. 5, 2021, 5:15 p.m. UTC | #3
On 05/10/2021 11:27, Chung-Lin Tang wrote:
> Hi Adhemerval,
> 
> On 2021/8/11 5:08 AM, Adhemerval Zanella wrote:
>>> +void
>>> +_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
>>> +           unsigned int skip, bool for_fini)
>>> +{
>>> +  /* Index code for sorting algorithm currently in use.  */
>>> +  static int32_t algorithm = 0;
>>> +  if (__glibc_unlikely (algorithm == 0))
>>> +    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
>>> +                 int32_t, NULL);
>> This is not race free, a better alternative would be to add a new member on
>> the GLRO struct and a init function to setup the value after tunables
>> initialization.  Also, but using a enumeration to define to possible values,
>> there is no need to add the __builtin_unreachable() below.
> 
> I'm not sure why there can be a race with TUNABLE_GET(), since isn't all tunables
> set only once during initialization?

We are trying to avoid even 'benign' data races, where a static variable might
be concurrent set from a constant value.  One pattern that we used to have
is to check if a syscall was present and set a static variable, similar to
what you do.

Another way to do it would be use relaxed atomic, such as the way mips64
getdents does [1].  But I still think it would be better organized if we
move the selected dl_maps sorting algorithm to a GLRO variable.

[1] sysdeps/unix/sysv/linux/mips/mips64/getdents64.c

> 
> That said, I agree that placing the algorithm code initializing separately in
> _dl_sort_maps_init looks better.
> 
> 
> 
> The changes to this part (relative to my v6 patch) I see you have on the
> azanella/dso-opt branch:
> 
>  void
> +_dl_sort_maps_init (void)
> +{
> +  int32_t algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort, int32_t, NULL);
> +  GLRO(dl_dso_sort_algo) = algorithm == 1 ? dso_sort_algorithm_original
> +                                         : dso_sort_algorithm_dfs;
> +}
> +
> +void
>  _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
>                unsigned int skip, bool for_fini)
>  {
> -  /* Index code for sorting algorithm currently in use.  */
> -  static int32_t algorithm = 0;
> -  if (__glibc_unlikely (algorithm == 0))
> -    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
> -                            int32_t, NULL);
> -
>    /* It can be tempting to use a static function pointer to store and call
>       the current selected sorting algorithm routine, but experimentation
>       shows that current processors still do not handle indirect branches
> @@ -291,12 +293,10 @@
>       PTR_MANGLE/DEMANGLE, further impairing performance of small, common
>       input cases. A simple if-case with direct function calls appears to
>       be the fastest.  */
> -  if (__glibc_likely (algorithm == 1))
> +  if (GLRO(dl_dso_sort_algo) == dso_sort_algorithm_original)
>      _dl_sort_maps_original (maps, nmaps, skip, for_fini);
> -  else if (algorithm == 2)
> -    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
>    else
> -    __builtin_unreachable ();
> +    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
>  }
> 
> 
> I'm not sure if defining 'enum dso_sort_algorithm' is really needed?
> Is __builtin_unreachable() that undesirable?

The tunables code should already sanitize the input and returns the
possible algorithms as a type, instead of a number.  I think it is
clear to work with predefined type, than hint the compiler explicitly. 

> 
> Also, the final code should probably add __glibc_likely() to the default
> algorithm case.

Sometime I think we abuse the __glibc_{un}likely, specially on short code
like that.  But please reinstate it if you see that it does improve code 
generation.

> 
> Thanks,
> Chung-Lin
  

Patch

diff --git a/elf/dl-close.c b/elf/dl-close.c
index f39001cab9..a55b3a00fa 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -167,8 +167,6 @@  _dl_close_worker (struct link_map *map, bool force)
 
   bool any_tls = false;
   const unsigned int nloaded = ns->_ns_nloaded;
-  char used[nloaded];
-  char done[nloaded];
   struct link_map *maps[nloaded];
 
   /* Run over the list and assign indexes to the link maps and enter
@@ -176,24 +174,21 @@  _dl_close_worker (struct link_map *map, bool force)
   int idx = 0;
   for (struct link_map *l = ns->_ns_loaded; l != NULL; l = l->l_next)
     {
+      l->l_map_used = 0;
+      l->l_map_done = 0;
       l->l_idx = idx;
       maps[idx] = l;
       ++idx;
-
     }
   assert (idx == nloaded);
 
-  /* Prepare the bitmaps.  */
-  memset (used, '\0', sizeof (used));
-  memset (done, '\0', sizeof (done));
-
   /* Keep track of the lowest index link map we have covered already.  */
   int done_index = -1;
   while (++done_index < nloaded)
     {
       struct link_map *l = maps[done_index];
 
-      if (done[done_index])
+      if (l->l_map_done)
 	/* Already handled.  */
 	continue;
 
@@ -204,12 +199,12 @@  _dl_close_worker (struct link_map *map, bool force)
 	  /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
 	     acquire is sufficient and correct.  */
 	  && atomic_load_acquire (&l->l_tls_dtor_count) == 0
-	  && !used[done_index])
+	  && !l->l_map_used)
 	continue;
 
       /* We need this object and we handle it now.  */
-      done[done_index] = 1;
-      used[done_index] = 1;
+      l->l_map_used = 1;
+      l->l_map_done = 1;
       /* Signal the object is still needed.  */
       l->l_idx = IDX_STILL_USED;
 
@@ -225,9 +220,9 @@  _dl_close_worker (struct link_map *map, bool force)
 		{
 		  assert ((*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded);
 
-		  if (!used[(*lp)->l_idx])
+		  if (!(*lp)->l_map_used)
 		    {
-		      used[(*lp)->l_idx] = 1;
+		      (*lp)->l_map_used = 1;
 		      /* If we marked a new object as used, and we've
 			 already processed it, then we need to go back
 			 and process again from that point forward to
@@ -250,9 +245,9 @@  _dl_close_worker (struct link_map *map, bool force)
 	      {
 		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
 
-		if (!used[jmap->l_idx])
+		if (!jmap->l_map_used)
 		  {
-		    used[jmap->l_idx] = 1;
+		    jmap->l_map_used = 1;
 		    if (jmap->l_idx - 1 < done_index)
 		      done_index = jmap->l_idx - 1;
 		  }
@@ -262,8 +257,7 @@  _dl_close_worker (struct link_map *map, bool force)
 
   /* Sort the entries.  We can skip looking for the binary itself which is
      at the front of the search list for the main namespace.  */
-  _dl_sort_maps (maps + (nsid == LM_ID_BASE), nloaded - (nsid == LM_ID_BASE),
-		 used + (nsid == LM_ID_BASE), true);
+  _dl_sort_maps (maps, nloaded, (nsid == LM_ID_BASE), true);
 
   /* Call all termination functions at once.  */
 #ifdef SHARED
@@ -280,7 +274,7 @@  _dl_close_worker (struct link_map *map, bool force)
       /* All elements must be in the same namespace.  */
       assert (imap->l_ns == nsid);
 
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded && !imap->l_nodelete_active);
 
@@ -333,7 +327,7 @@  _dl_close_worker (struct link_map *map, bool force)
 	  if (i < first_loaded)
 	    first_loaded = i;
 	}
-      /* Else used[i].  */
+      /* Else imap->l_map_used.  */
       else if (imap->l_type == lt_loaded)
 	{
 	  struct r_scope_elem *new_list = NULL;
@@ -557,7 +551,7 @@  _dl_close_worker (struct link_map *map, bool force)
   for (unsigned int i = first_loaded; i < nloaded; ++i)
     {
       struct link_map *imap = maps[i];
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded);
 
diff --git a/elf/dl-deps.c b/elf/dl-deps.c
index 087a49b212..237d9636c5 100644
--- a/elf/dl-deps.c
+++ b/elf/dl-deps.c
@@ -613,10 +613,9 @@  Filters not supported with LD_TRACE_PRELINKING"));
 
   /* If libc.so.6 is the main map, it participates in the sort, so
      that the relocation order is correct regarding libc.so.6.  */
-  if (l_initfini[0] == GL (dl_ns)[l_initfini[0]->l_ns].libc_map)
-    _dl_sort_maps (l_initfini, nlist, NULL, false);
-  else
-    _dl_sort_maps (&l_initfini[1], nlist - 1, NULL, false);
+  _dl_sort_maps (l_initfini, nlist,
+		 (l_initfini[0] != GL (dl_ns)[l_initfini[0]->l_ns].libc_map),
+		 false);
 
   /* Terminate the list of dependencies.  */
   l_initfini[nlist] = NULL;
diff --git a/elf/dl-fini.c b/elf/dl-fini.c
index 6dbdfe4b3e..c683884c35 100644
--- a/elf/dl-fini.c
+++ b/elf/dl-fini.c
@@ -92,8 +92,7 @@  _dl_fini (void)
 	  /* Now we have to do the sorting.  We can skip looking for the
 	     binary itself which is at the front of the search list for
 	     the main namespace.  */
-	  _dl_sort_maps (maps + (ns == LM_ID_BASE), nmaps - (ns == LM_ID_BASE),
-			 NULL, true);
+	  _dl_sort_maps (maps, nmaps, (ns == LM_ID_BASE), true);
 
 	  /* We do not rely on the linked list of loaded object anymore
 	     from this point on.  We have our own list here (maps).  The
diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
index d21770267a..91c5009ac4 100644
--- a/elf/dl-sort-maps.c
+++ b/elf/dl-sort-maps.c
@@ -16,16 +16,24 @@ 
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <assert.h>
 #include <ldsodefs.h>
+#include <elf/dl-tunables.h>
 
+/* Note: this is the older, "original" sorting algorithm, being used as
+   default up to 2.32.
 
-/* Sort array MAPS according to dependencies of the contained objects.
-   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
-   called for finishing an object.  */
-void
-_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
-	       bool for_fini)
+   Sort array MAPS according to dependencies of the contained objects.
+   If FOR_FINI is true, this is called for finishing an object.  */
+static void
+_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
+			unsigned int skip, bool for_fini)
 {
+  /* Allows caller to do the common optimization of skipping the first map,
+     usually the main binary.  */
+  maps += skip;
+  nmaps -= skip;
+
   /* A list of one element need not be sorted.  */
   if (nmaps <= 1)
     return;
@@ -66,14 +74,6 @@  _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
 			   (k - i) * sizeof (maps[0]));
 		  maps[k] = thisp;
 
-		  if (used != NULL)
-		    {
-		      char here_used = used[i];
-		      memmove (&used[i], &used[i + 1],
-			       (k - i) * sizeof (used[0]));
-		      used[k] = here_used;
-		    }
-
 		  if (seen[i + 1] > nmaps - i)
 		    {
 		      ++i;
@@ -120,3 +120,183 @@  _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
     next:;
     }
 }
+
+#if !HAVE_TUNABLES
+/* In this case, just default to the original algorithm.  */
+strong_alias (_dl_sort_maps_original, _dl_sort_maps);
+#else
+
+/* We use a recursive function due to its better clarity and ease of
+   implementation, as well as faster execution speed. We already use
+   alloca() for list allocation during the breadth-first search of
+   dependencies in _dl_map_object_deps(), and this should be on the
+   same order of worst-case stack usage.
+
+   Note: the '*rpo' parameter is supposed to point to one past the
+   last element of the array where we save the sort results, and is
+   decremented before storing the current map at each level.  */
+
+static void
+dfs_traversal (struct link_map ***rpo, struct link_map *map,
+	       bool *do_reldeps)
+{
+  if (map->l_visited)
+    return;
+
+  map->l_visited = 1;
+
+  if (map->l_initfini)
+    {
+      for (int i = 0; map->l_initfini[i] != NULL; i++)
+	{
+	  struct link_map *dep = map->l_initfini[i];
+	  if (dep->l_visited == 0
+	      && dep->l_main_map == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
+    {
+      /* Indicate that we encountered relocation dependencies during
+	 traversal.  */
+      *do_reldeps = true;
+
+      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
+	{
+	  struct link_map *dep = map->l_reldeps->list[m];
+	  if (dep->l_visited == 0
+	      && dep->l_main_map == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  *rpo -= 1;
+  **rpo = map;
+}
+
+/* Topologically sort array MAPS according to dependencies of the contained
+   objects.  */
+
+static void
+_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
+		   unsigned int skip __attribute__ ((unused)), bool for_fini)
+{
+  for (int i = nmaps - 1; i >= 0; i--)
+    maps[i]->l_visited = 0;
+
+  /* We apply DFS traversal for each of maps[i] until the whole total order
+     is found and we're at the start of the Reverse-Postorder (RPO) sequence,
+     which is a topological sort.
+
+     We go from maps[nmaps - 1] backwards towards maps[0] at this level.
+     Due to the breadth-first search (BFS) ordering we receive, going
+     backwards usually gives a more shallow depth-first recursion depth,
+     adding more stack usage safety. Also, combined with the natural
+     processing order of l_initfini[] at each node during DFS, this maintains
+     an ordering closer to the original link ordering in the sorting results
+     under most simpler cases.
+
+     Another reason we order the top level backwards, it that maps[0] is
+     usually exactly the main object of which we're in the midst of
+     _dl_map_object_deps() processing, and maps[0]->l_initfini[] is still
+     blank. If we start the traversal from maps[0], since having no
+     dependencies yet filled in, maps[0] will always be immediately
+     incorrectly placed at the last place in the order (first in reverse).
+     Adjusting the order so that maps[0] is last traversed naturally avoids
+     this problem.
+
+     Further, the old "optimization" of skipping the main object at maps[0]
+     from the call-site (i.e. _dl_sort_maps(maps+1,nmaps-1)) is in general
+     no longer valid, since traversing along object dependency-links
+     may "find" the main object even when it is not included in the initial
+     order (e.g. a dlopen()'ed shared object can have circular dependencies
+     linked back to itself). In such a case, traversing N-1 objects will
+     create a N-object result, and raise problems.
+
+     To summarize, just passing in the full list, and iterating from back
+     to front makes things much more straightforward.  */
+
+  /* Array to hold RPO sorting results, before we copy back to maps[].  */
+  struct link_map *rpo[nmaps];
+
+  /* The 'head' position during each DFS iteration. Note that we start at
+     one past the last element due to first-decrement-then-store (see the
+     bottom of above dfs_traversal() routine).  */
+  struct link_map **rpo_head = &rpo[nmaps];
+
+  bool do_reldeps = false;
+  bool *do_reldeps_ref = (for_fini ? &do_reldeps : NULL);
+
+  for (int i = nmaps - 1; i >= 0; i--)
+    {
+      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
+
+      /* We can break early if all objects are already placed.  */
+      if (rpo_head == rpo)
+	goto end;
+    }
+  assert (rpo_head == rpo);
+
+ end:
+  /* Here we may do a second pass of sorting, using only l_initfini[]
+     static dependency links. This is avoided if !FOR_FINI or if we didn't
+     find any reldeps in the first DFS traversal.
+
+     The reason we do this is: while it is unspecified how circular
+     dependencies should be handled, the presumed reasonable behavior is to
+     have destructors to respect static dependency links as much as possible,
+     overriding reldeps if needed. And the first sorting pass, which takes
+     l_initfini/l_reldeps links equally, may not preserve this priority.
+
+     Hence we do a 2nd sorting pass, taking only DT_NEEDED links into account
+     (see how the do_reldeps argument to dfs_traversal() is NULL below).  */
+  if (do_reldeps)
+    {
+      for (int i = nmaps - 1; i >= 0; i--)
+	rpo[i]->l_visited = 0;
+
+      struct link_map **maps_head = &maps[nmaps];
+      for (int i = nmaps - 1; i >= 0; i--)
+	{
+	  dfs_traversal (&maps_head, rpo[i], NULL);
+
+	  /* We can break early if all objects are already placed.
+	     The below memcpy is not needed in the do_reldeps case here,
+	     since we wrote back to maps[] during DFS traversal.  */
+	  if (maps_head == maps)
+	    return;
+	}
+      assert (maps_head == maps);
+      return;
+    }
+
+  memcpy (maps, rpo, sizeof (struct link_map *) * nmaps);
+}
+
+void
+_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
+	       unsigned int skip, bool for_fini)
+{
+  /* Index code for sorting algorithm currently in use.  */
+  static int32_t algorithm = 0;
+  if (__glibc_unlikely (algorithm == 0))
+    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
+			     int32_t, NULL);
+
+  /* It can be tempting to use a static function pointer to store and call
+     the current selected sorting algorithm routine, but experimentation
+     shows that current processors still do not handle indirect branches
+     that efficiently, plus a static function pointer will involve
+     PTR_MANGLE/DEMANGLE, further impairing performance of small, common
+     input cases. A simple if-case with direct function calls appears to
+     be the fastest.  */
+  if (__glibc_likely (algorithm == 1))
+    _dl_sort_maps_original (maps, nmaps, skip, for_fini);
+  else if (algorithm == 2)
+    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
+  else
+    __builtin_unreachable ();
+}
+
+#endif /* HAVE_TUNABLES.  */
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 8ddd4a2314..46ffb23784 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -156,4 +156,13 @@  glibc {
       security_level: SXID_IGNORE
     }
   }
+
+  rtld {
+    dynamic_sort {
+      type: INT_32
+      minval: 1
+      maxval: 2
+      default: 1
+    }
+  }
 }
diff --git a/elf/rtld.c b/elf/rtld.c
index d733359eaf..abdc13c13a 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -1389,6 +1389,9 @@  dl_main (const ElfW(Phdr) *phdr,
       main_map->l_name = (char *) "";
       *user_entry = main_map->l_entry;
 
+      /* Set bit indicating this is the main program map.  */
+      main_map->l_main_map = 1;
+
 #ifdef HAVE_AUX_VECTOR
       /* Adjust the on-stack auxiliary vector so that it looks like the
 	 binary was executed directly.  */
diff --git a/elf/tst-rtld-list-tunables.exp b/elf/tst-rtld-list-tunables.exp
index 9f66c52885..9bf572715f 100644
--- a/elf/tst-rtld-list-tunables.exp
+++ b/elf/tst-rtld-list-tunables.exp
@@ -10,5 +10,6 @@  glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0x[f]+)
 glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0x[f]+)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0x[f]+)
 glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0x[f]+)
+glibc.rtld.dynamic_sort: 1 (min: 1, max: 2)
 glibc.rtld.nns: 0x4 (min: 0x1, max: 0x10)
 glibc.rtld.optional_static_tls: 0x200 (min: 0x0, max: 0x[f]+)
diff --git a/include/link.h b/include/link.h
index 4af16cb596..50f45db243 100644
--- a/include/link.h
+++ b/include/link.h
@@ -181,6 +181,11 @@  struct link_map
     unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
     unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
     unsigned int l_reserved:2;	/* Reserved for internal use.  */
+    unsigned int l_main_map:1;  /* Nonzero for the map of the main program.  */
+    unsigned int l_visited:1;   /* Used internally for map dependency
+				   graph traversal.  */
+    unsigned int l_map_used:1;  /* These two bits are used during traversal */
+    unsigned int l_map_done:1;  /* of maps in _dl_close_worker. */
     unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
 					to by `l_phdr' is allocated.  */
     unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 9c15259236..ad48a0d7da 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1084,7 +1084,7 @@  extern void _dl_fini (void) attribute_hidden;
 
 /* Sort array MAPS according to dependencies of the contained objects.  */
 extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
-			   char *used, bool for_fini) attribute_hidden;
+			   unsigned int skip, bool for_fini) attribute_hidden;
 
 /* The dynamic linker calls this function before and having changing
    any shared object mappings.  The `r_state' member of `struct r_debug'