diff mbox series

[v4,2/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader -- Algorithm changes

Message ID 85ab188b-64c2-d1fd-33fa-d7d2e6fb2d97@codesourcery.com
State Superseded
Headers show
Series [v4,1/2] BZ #17645, fix slow DSO sorting behavior in dynamic loader -- Testing infrastructure | expand

Commit Message

Chung-Lin Tang Jan. 3, 2021, 11:22 a.m. UTC
Hi Carlos,
the sorting algorithm code portions of the patch has not been updated a lot
for this v4 version, mainly a rebase of v3 to current master.

Here was the v3 version for reference (responded to some v2 issues there).
Unfortunately this one never seemed to got a review, though it seems the
main issues were in the testing infrastructure patch.
https://sourceware.org/pipermail/libc-alpha/2020-July/116546.html

Note that in the attached patch, I still have the old sorting algorithm set as default,
the more "conservative" setting. This can be adjusted if deemed appropriate when patch
is approved.

Tested current master on x86_64-linux with no regressions, seeking approval
for getting this into 2.33.

Thanks,
Chung-Lin

2021-01-03  Chung-Lin Tang  <cltang@codesourcery.com>

	[BZ #17645]
	[BZ #15311]
	[BZ #15310]

	* elf/dl-close.c (_dl_close_worker): Remove used[], done[] char arrays,
	use l_map_used/l_map_done bitfields instead. Adjust call to _dl_sort_maps.
	* elf/dl-deps.c (_dl_map_object_deps): Adjust call to _dl_sort_maps.
	* elf/dl-fini.c (_dl_fini): Likewise.
	* elf/dl-sort_maps.c
	(_dl_sort_maps_original): Adapted from original _dl_sort_maps.
	(dfs_traversal): New static function.
	(_dl_sort_maps_dfs): New Reverse-Postorder (RPO) based implementation.
	(_dl_sort_maps): Select and call either _dl_sort_maps_original or
	_dl_sort_maps_dfs based on glibc.rtld.dynamic_sort.
	* elf/dl-tunables.list (rtld): New namespace with tunable 'dynamic_sort'.
	* include/link.h (struct link_map): Add l_visited:1, l_map_used:1, and
	l_map_done:1 bitfields.
	* sysdeps/generic/ldsodefs.h (_dl_sort_maps): Adjust declaration.

Comments

Florian Weimer Jan. 13, 2021, noon UTC | #1
Chung-Lin,

I have not reviewed the actual sorting algorithm change yet.  I will do
so in a follow-up.

This change should be quite safe as posted, given that the core
behavioral change is hidden behind a tunable setting, and the default is
the old behavior.  But ideally, I think the new behavior should be the
default, so that people can actually experience it.  On the other, if
the old behavior is preserved, we might still include this change in
glibc 2.33.

For glibc 2.34, I plan to work on system-wide tunable settings, without
environment variables.  So it would be possible to flip the default
there, while still providing a way to system administrators to restore
the old behavior.

* Chung-Lin Tang:

> diff --git a/elf/dl-close.c b/elf/dl-close.c
> index c51becd06b..84d25a29d0 100644
> --- a/elf/dl-close.c
> +++ b/elf/dl-close.c
> @@ -164,8 +164,6 @@ _dl_close_worker (struct link_map *map, bool force)
>  
>    bool any_tls = false;
>    const unsigned int nloaded = ns->_ns_nloaded;
> -  char used[nloaded];
> -  char done[nloaded];

This change (moving those arrays into struct link_map) can go in
separately if you want to split it out.

There seems to be some further cleanup potential regarding the use of
IDX_STILL_USED and l_map_done.

> @@ -247,9 +242,9 @@ _dl_close_worker (struct link_map *map, bool force)
>  	      {
>  		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
>  
> -		if (!used[jmap->l_idx])
> +		if (!jmap->l_map_used)
>  		  {
> -		    used[jmap->l_idx] = 1;
> +		    jmap->l_map_used = 1;
>  		    if (jmap->l_idx - 1 < done_index)
>  		      done_index = jmap->l_idx - 1;
>  		  }

The above comes under:

      /* And the same for relocation dependencies.  */
      if (l->l_reldeps != NULL)
        for (unsigned int j = 0; j < l->l_reldeps->act; ++j)

I have a conceptual concern here: I think it is a bug to let relocation
dependencies affect finalizer order.  In order to cut down the sorting
algorithm variations, I would like to suggest not to take relocation
dependencies into account for the new sorting mode.

This way, there will be only one user-visible transition: from the old
sorting to the new sorting.  That's going to be much easier to test.

> diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
> index d21770267a..5aa96f4cc1 100644
> --- a/elf/dl-sort-maps.c
> +++ b/elf/dl-sort-maps.c
> @@ -16,16 +16,24 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>  
> +#include <assert.h>
>  #include <ldsodefs.h>
> +#include <elf/dl-tunables.h>
>  
> +/* Note: this is the older, "original" sorting algorithm, being used as
> +   default up to 2.32.
>  
> -/* Sort array MAPS according to dependencies of the contained objects.
> -   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
> -   called for finishing an object.  */
> -void
> -_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
> -	       bool for_fini)
> +   Sort array MAPS according to dependencies of the contained objects.
> +   If FOR_FINI is true, this is called for finishing an object.  */
> +static void
> +_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
> +			unsigned int skip, bool for_fini)
>  {
> +  /* Allows caller to do the common optimization of skipping the first map,
> +     usually the main binary.  */
> +  maps += skip;
> +  nmaps -= skip;

As I mentioned in the test case review, I do not think this is purely an
optimization.  If the main executable has a soname, it participates in
the dependency sort like a shared object, so its constructor might not
come last.  I don't think this is was what programmers expect.

> diff --git a/include/link.h b/include/link.h
> index 4af16cb596..f2dbcbaf77 100644
> --- a/include/link.h
> +++ b/include/link.h
> @@ -181,6 +181,10 @@ struct link_map
>      unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
>      unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
>      unsigned int l_reserved:2;	/* Reserved for internal use.  */
> +    unsigned int l_visited:1;   /* Used internally for map dependency
> +				   graph traversal.  */
> +    unsigned int l_map_used:1;  /* These two bits are used during traversal */
> +    unsigned int l_map_done:1;  /* of maps in _dl_close_worker(). */
>      unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
>  					to by `l_phdr' is allocated.  */
>      unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in

Some style nits: No () after function name, ”.  */” at the end of a
comment.

Thanks,
Florian
Florian Weimer Jan. 18, 2021, 8:28 p.m. UTC | #2
* Chung-Lin Tang:

> +/* We use a recursive function due to its better clarity and ease of
> +   implementation, as well as faster execution speed. We already use
> +   alloca() for list allocation during the breadth-first search of
> +   dependencies in _dl_map_object_deps(), and this should be on the
> +   same order of worst-case stack usage.  */
> +
> +static void
> +dfs_traversal (struct link_map ***rpo, struct link_map *map,
> +	       bool *do_reldeps)
> +{

Could you please add a comment describing the rpo, that it points past
the end of the destination array?  That's not obvious based on the
interface.

Although I'm quite happy how simple the code is to read. 8-)

> +  if (map->l_visited)
> +    return;
> +
> +  map->l_visited = 1;
> +
> +  if (map->l_initfini)
> +    {
> +      for (int i = 0; map->l_initfini[i] != NULL; i++)
> +	{
> +	  struct link_map *dep = map->l_initfini[i];
> +	  if (dep->l_visited == 0)
> +	    dfs_traversal (rpo, dep, do_reldeps);
> +	}
> +    }

First DT_NEEDED dependency is added first …
> +
> +  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
> +    {
> +      /* Indicate that we encountered relocation dependencies during
> +	 traversal.  */
> +      *do_reldeps = true;
> +
> +      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
> +	{
> +	  struct link_map *dep = map->l_reldeps->list[m];
> +	  if (dep->l_visited == 0)
> +	    dfs_traversal (rpo, dep, do_reldeps);
> +	}
> +    }
> +
> +  *rpo -= 1;
> +  **rpo = map;
> +}

… and the depending object is added last.  The l_visited check at the
start (and in the loops) ensures that this doesn't happen before.

> +/* Topologically sort array MAPS according to dependencies of the contained
> +   objects.  */
> +
> +static void
> +_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
> +		   unsigned int skip __attribute__ ((unused)), bool for_fini)
> +{

> +  for (int i = nmaps - 1; i >= 0; i--)
> +    {
> +      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
> +
> +      /* We can break early if all objects are already placed.  */
> +      if (rpo_head == rpo)
> +	goto end;
> +    }
> +  assert (rpo_head == rpo);

Should “goto end” be a “break”?  The compiler elides the assert for that
case, I expect.

> +  /* Here we may do a second pass of sorting, using only l_initfini[]
> +     static dependency links. This is avoided if !FOR_FINI or if we didn't
> +     find any reldeps in the first DFS traversal.
> +
> +     The reason we do this is: while it is unspecified how circular
> +     dependencies should be handled, the presumed reasonable behavior is to
> +     have destructors to respect static dependency links as much as possible,
> +     overriding reldeps if needed. And the first sorting pass, which takes
> +     l_initfini/l_reldeps links equally, may not preserve this priority.
> +
> +     Hence we do a 2nd sorting pass, taking only DT_NEEDED links into account
> +     (see how the do_reldeps argument to dfs_traversal() is NULL below).  */
> +  if (do_reldeps)
> +    {
> +      for (int i = nmaps - 1; i >= 0; i--)
> +	rpo[i]->l_visited = 0;
> +
> +      struct link_map **maps_head = &maps[nmaps];
> +      for (int i = nmaps - 1; i >= 0; i--)
> +	{
> +	  dfs_traversal (&maps_head, rpo[i], NULL);
> +
> +	  /* We can break early if all objects are already placed.
> +	     The below memcpy is not needed in the do_reldeps case here,
> +	     since we wrote back to maps[] during DFS traversal.  */
> +	  if (maps_head == maps)
> +	    return;
> +	}
> +      assert (maps_head == maps);
> +      return;
> +    }

About that second sort: I *suspect* the original goal was to use reldeps
for constructor ordering as well.  There were actually two sorting calls
in the original implementation of shared object loading, but the second
happend *before* relocation processing, so it never encountered any
collected relocations.  They only mattered on dlclose/process
termination, and this is not actually what users want.

At this point, I am quite convinced that we should get rid of reldeps
sorting completely.  If we reorder the link map chaining at load time, I
think we won't need to have to do *any* sorting at closing time.  This
means that we also don't have to allocate any memory for it.

Does anyone have concerns about moving in that direction?

I do believe that we should make one dependency sorting change in one
release, and not do this transition piecewise.

> +void
> +_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
> +	       unsigned int skip, bool for_fini)
> +{
> +  /* Index code for sorting algorithm currently in use.  */
> +  static int32_t algorithm = 0;
> +  if (__glibc_unlikely (algorithm == 0))
> +    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
> +			     int32_t, NULL);
> +
> +  /* It can be tempting to use a static function pointer to store and call
> +     the current selected sorting algorithm routine, but experimentation
> +     shows that current processors still do not handle indirect branches
> +     that efficiently, plus a static function pointer will involve
> +     PTR_MANGLE/DEMANGLE, further impairing performance of small, common
> +     input cases. A simple if-case with direct function calls appears to
> +     be the fastest.  */
> +  if (__glibc_likely (algorithm == 1))
> +    _dl_sort_maps_original (maps, nmaps, skip, for_fini);
> +  else if (algorithm == 2)
> +    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
> +  else
> +    __builtin_unreachable ();
> +}

Some nitpicking: We could declare the function pointer attribute_relro,
and since this function is necessarily called before final relocation,
it will be set before RELRO because active, so writing to it still
permitted.  The flag-based solution is of course fine, so I suggest just
to remove the comment.

Thanks,
Florian
Florian Weimer Jan. 18, 2021, 8:46 p.m. UTC | #3
Sorry, one more thing:

I observed glibc testing time to increase by more than 30 seconds, or
roughly 5% of the total.  I think that's a bit problematic for a single
test.

I asume that this comes from the large secodn test. Maybe we should turn
this test into an xtest?

Thanks,
Florian
diff mbox series

Patch

diff --git a/elf/dl-close.c b/elf/dl-close.c
index c51becd06b..84d25a29d0 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -164,8 +164,6 @@  _dl_close_worker (struct link_map *map, bool force)
 
   bool any_tls = false;
   const unsigned int nloaded = ns->_ns_nloaded;
-  char used[nloaded];
-  char done[nloaded];
   struct link_map *maps[nloaded];
 
   /* Run over the list and assign indexes to the link maps and enter
@@ -173,24 +171,21 @@  _dl_close_worker (struct link_map *map, bool force)
   int idx = 0;
   for (struct link_map *l = ns->_ns_loaded; l != NULL; l = l->l_next)
     {
+      l->l_map_used = 0;
+      l->l_map_done = 0;
       l->l_idx = idx;
       maps[idx] = l;
       ++idx;
-
     }
   assert (idx == nloaded);
 
-  /* Prepare the bitmaps.  */
-  memset (used, '\0', sizeof (used));
-  memset (done, '\0', sizeof (done));
-
   /* Keep track of the lowest index link map we have covered already.  */
   int done_index = -1;
   while (++done_index < nloaded)
     {
       struct link_map *l = maps[done_index];
 
-      if (done[done_index])
+      if (l->l_map_done)
 	/* Already handled.  */
 	continue;
 
@@ -201,12 +196,12 @@  _dl_close_worker (struct link_map *map, bool force)
 	  /* See CONCURRENCY NOTES in cxa_thread_atexit_impl.c to know why
 	     acquire is sufficient and correct.  */
 	  && atomic_load_acquire (&l->l_tls_dtor_count) == 0
-	  && !used[done_index])
+	  && !l->l_map_used)
 	continue;
 
       /* We need this object and we handle it now.  */
-      done[done_index] = 1;
-      used[done_index] = 1;
+      l->l_map_used = 1;
+      l->l_map_done = 1;
       /* Signal the object is still needed.  */
       l->l_idx = IDX_STILL_USED;
 
@@ -222,9 +217,9 @@  _dl_close_worker (struct link_map *map, bool force)
 		{
 		  assert ((*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded);
 
-		  if (!used[(*lp)->l_idx])
+		  if (!(*lp)->l_map_used)
 		    {
-		      used[(*lp)->l_idx] = 1;
+		      (*lp)->l_map_used = 1;
 		      /* If we marked a new object as used, and we've
 			 already processed it, then we need to go back
 			 and process again from that point forward to
@@ -247,9 +242,9 @@  _dl_close_worker (struct link_map *map, bool force)
 	      {
 		assert (jmap->l_idx >= 0 && jmap->l_idx < nloaded);
 
-		if (!used[jmap->l_idx])
+		if (!jmap->l_map_used)
 		  {
-		    used[jmap->l_idx] = 1;
+		    jmap->l_map_used = 1;
 		    if (jmap->l_idx - 1 < done_index)
 		      done_index = jmap->l_idx - 1;
 		  }
@@ -259,8 +254,7 @@  _dl_close_worker (struct link_map *map, bool force)
 
   /* Sort the entries.  We can skip looking for the binary itself which is
      at the front of the search list for the main namespace.  */
-  _dl_sort_maps (maps + (nsid == LM_ID_BASE), nloaded - (nsid == LM_ID_BASE),
-		 used + (nsid == LM_ID_BASE), true);
+  _dl_sort_maps (maps, nloaded, (nsid == LM_ID_BASE), true);
 
   /* Call all termination functions at once.  */
 #ifdef SHARED
@@ -277,7 +271,7 @@  _dl_close_worker (struct link_map *map, bool force)
       /* All elements must be in the same namespace.  */
       assert (imap->l_ns == nsid);
 
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded && !imap->l_nodelete_active);
 
@@ -330,7 +324,7 @@  _dl_close_worker (struct link_map *map, bool force)
 	  if (i < first_loaded)
 	    first_loaded = i;
 	}
-      /* Else used[i].  */
+      /* Else imap->l_map_used.  */
       else if (imap->l_type == lt_loaded)
 	{
 	  struct r_scope_elem *new_list = NULL;
@@ -554,7 +548,7 @@  _dl_close_worker (struct link_map *map, bool force)
   for (unsigned int i = first_loaded; i < nloaded; ++i)
     {
       struct link_map *imap = maps[i];
-      if (!used[i])
+      if (!imap->l_map_used)
 	{
 	  assert (imap->l_type == lt_loaded);
 
diff --git a/elf/dl-deps.c b/elf/dl-deps.c
index 087a49b212..237d9636c5 100644
--- a/elf/dl-deps.c
+++ b/elf/dl-deps.c
@@ -613,10 +613,9 @@  Filters not supported with LD_TRACE_PRELINKING"));
 
   /* If libc.so.6 is the main map, it participates in the sort, so
      that the relocation order is correct regarding libc.so.6.  */
-  if (l_initfini[0] == GL (dl_ns)[l_initfini[0]->l_ns].libc_map)
-    _dl_sort_maps (l_initfini, nlist, NULL, false);
-  else
-    _dl_sort_maps (&l_initfini[1], nlist - 1, NULL, false);
+  _dl_sort_maps (l_initfini, nlist,
+		 (l_initfini[0] != GL (dl_ns)[l_initfini[0]->l_ns].libc_map),
+		 false);
 
   /* Terminate the list of dependencies.  */
   l_initfini[nlist] = NULL;
diff --git a/elf/dl-fini.c b/elf/dl-fini.c
index 6dbdfe4b3e..c683884c35 100644
--- a/elf/dl-fini.c
+++ b/elf/dl-fini.c
@@ -92,8 +92,7 @@  _dl_fini (void)
 	  /* Now we have to do the sorting.  We can skip looking for the
 	     binary itself which is at the front of the search list for
 	     the main namespace.  */
-	  _dl_sort_maps (maps + (ns == LM_ID_BASE), nmaps - (ns == LM_ID_BASE),
-			 NULL, true);
+	  _dl_sort_maps (maps, nmaps, (ns == LM_ID_BASE), true);
 
 	  /* We do not rely on the linked list of loaded object anymore
 	     from this point on.  We have our own list here (maps).  The
diff --git a/elf/dl-sort-maps.c b/elf/dl-sort-maps.c
index d21770267a..5aa96f4cc1 100644
--- a/elf/dl-sort-maps.c
+++ b/elf/dl-sort-maps.c
@@ -16,16 +16,24 @@ 
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
+#include <assert.h>
 #include <ldsodefs.h>
+#include <elf/dl-tunables.h>
 
+/* Note: this is the older, "original" sorting algorithm, being used as
+   default up to 2.32.
 
-/* Sort array MAPS according to dependencies of the contained objects.
-   Array USED, if non-NULL, is permutated along MAPS.  If FOR_FINI this is
-   called for finishing an object.  */
-void
-_dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
-	       bool for_fini)
+   Sort array MAPS according to dependencies of the contained objects.
+   If FOR_FINI is true, this is called for finishing an object.  */
+static void
+_dl_sort_maps_original (struct link_map **maps, unsigned int nmaps,
+			unsigned int skip, bool for_fini)
 {
+  /* Allows caller to do the common optimization of skipping the first map,
+     usually the main binary.  */
+  maps += skip;
+  nmaps -= skip;
+
   /* A list of one element need not be sorted.  */
   if (nmaps <= 1)
     return;
@@ -66,14 +74,6 @@  _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
 			   (k - i) * sizeof (maps[0]));
 		  maps[k] = thisp;
 
-		  if (used != NULL)
-		    {
-		      char here_used = used[i];
-		      memmove (&used[i], &used[i + 1],
-			       (k - i) * sizeof (used[0]));
-		      used[k] = here_used;
-		    }
-
 		  if (seen[i + 1] > nmaps - i)
 		    {
 		      ++i;
@@ -120,3 +120,177 @@  _dl_sort_maps (struct link_map **maps, unsigned int nmaps, char *used,
     next:;
     }
 }
+
+#if !HAVE_TUNABLES
+/* In this case, just default to the original algorithm.  */
+strong_alias (_dl_sort_maps_original, _dl_sort_maps);
+#else
+
+/* We use a recursive function due to its better clarity and ease of
+   implementation, as well as faster execution speed. We already use
+   alloca() for list allocation during the breadth-first search of
+   dependencies in _dl_map_object_deps(), and this should be on the
+   same order of worst-case stack usage.  */
+
+static void
+dfs_traversal (struct link_map ***rpo, struct link_map *map,
+	       bool *do_reldeps)
+{
+  if (map->l_visited)
+    return;
+
+  map->l_visited = 1;
+
+  if (map->l_initfini)
+    {
+      for (int i = 0; map->l_initfini[i] != NULL; i++)
+	{
+	  struct link_map *dep = map->l_initfini[i];
+	  if (dep->l_visited == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  if (__glibc_unlikely (do_reldeps != NULL && map->l_reldeps != NULL))
+    {
+      /* Indicate that we encountered relocation dependencies during
+	 traversal.  */
+      *do_reldeps = true;
+
+      for (int m = map->l_reldeps->act - 1; m >= 0; m--)
+	{
+	  struct link_map *dep = map->l_reldeps->list[m];
+	  if (dep->l_visited == 0)
+	    dfs_traversal (rpo, dep, do_reldeps);
+	}
+    }
+
+  *rpo -= 1;
+  **rpo = map;
+}
+
+/* Topologically sort array MAPS according to dependencies of the contained
+   objects.  */
+
+static void
+_dl_sort_maps_dfs (struct link_map **maps, unsigned int nmaps,
+		   unsigned int skip __attribute__ ((unused)), bool for_fini)
+{
+  for (int i = nmaps - 1; i >= 0; i--)
+    maps[i]->l_visited = 0;
+
+  /* We apply DFS traversal for each of maps[i] until the whole total order
+     is found and we're at the start of the Reverse-Postorder (RPO) sequence,
+     which is a topological sort.
+
+     We go from maps[nmaps - 1] backwards towards maps[0] at this level.
+     Due to the breadth-first search (BFS) ordering we receive, going
+     backwards usually gives a more shallow depth-first recursion depth,
+     adding more stack usage safety. Also, combined with the natural
+     processing order of l_initfini[] at each node during DFS, this maintains
+     an ordering closer to the original link ordering in the sorting results
+     under most simpler cases.
+
+     Another reason we order the top level backwards, it that maps[0] is
+     usually exactly the main object of which we're in the midst of
+     _dl_map_object_deps() processing, and maps[0]->l_initfini[] is still
+     blank. If we start the traversal from maps[0], since having no
+     dependencies yet filled in, maps[0] will always be immediately
+     incorrectly placed at the last place in the order (first in reverse).
+     Adjusting the order so that maps[0] is last traversed naturally avoids
+     this problem.
+
+     Further, the old "optimization" of skipping the main object at maps[0]
+     from the call-site (i.e. _dl_sort_maps(maps+1,nmaps-1)) is in general
+     no longer valid, since traversing along object dependency-links
+     may "find" the main object even when it is not included in the initial
+     order (e.g. a dlopen()'ed shared object can have circular dependencies
+     linked back to itself). In such a case, traversing N-1 objects will
+     create a N-object result, and raise problems.
+
+     To summarize, just passing in the full list, and iterating from back
+     to front makes things much more straightforward.  */
+
+  /* Array to hold RPO sorting results, before we copy back to maps[].  */
+  struct link_map *rpo[nmaps];
+
+  /* The 'head' position during each DFS iteration. Note that we start at
+     one past the last element due to first-decrement-then-store (see the
+     bottom of above dfs_traversal() routine).  */
+  struct link_map **rpo_head = &rpo[nmaps];
+
+  bool do_reldeps = false;
+  bool *do_reldeps_ref = (for_fini ? &do_reldeps : NULL);
+
+  for (int i = nmaps - 1; i >= 0; i--)
+    {
+      dfs_traversal (&rpo_head, maps[i], do_reldeps_ref);
+
+      /* We can break early if all objects are already placed.  */
+      if (rpo_head == rpo)
+	goto end;
+    }
+  assert (rpo_head == rpo);
+
+ end:
+  /* Here we may do a second pass of sorting, using only l_initfini[]
+     static dependency links. This is avoided if !FOR_FINI or if we didn't
+     find any reldeps in the first DFS traversal.
+
+     The reason we do this is: while it is unspecified how circular
+     dependencies should be handled, the presumed reasonable behavior is to
+     have destructors to respect static dependency links as much as possible,
+     overriding reldeps if needed. And the first sorting pass, which takes
+     l_initfini/l_reldeps links equally, may not preserve this priority.
+
+     Hence we do a 2nd sorting pass, taking only DT_NEEDED links into account
+     (see how the do_reldeps argument to dfs_traversal() is NULL below).  */
+  if (do_reldeps)
+    {
+      for (int i = nmaps - 1; i >= 0; i--)
+	rpo[i]->l_visited = 0;
+
+      struct link_map **maps_head = &maps[nmaps];
+      for (int i = nmaps - 1; i >= 0; i--)
+	{
+	  dfs_traversal (&maps_head, rpo[i], NULL);
+
+	  /* We can break early if all objects are already placed.
+	     The below memcpy is not needed in the do_reldeps case here,
+	     since we wrote back to maps[] during DFS traversal.  */
+	  if (maps_head == maps)
+	    return;
+	}
+      assert (maps_head == maps);
+      return;
+    }
+
+  memcpy (maps, rpo, sizeof (struct link_map *) * nmaps);
+}
+
+void
+_dl_sort_maps (struct link_map **maps, unsigned int nmaps,
+	       unsigned int skip, bool for_fini)
+{
+  /* Index code for sorting algorithm currently in use.  */
+  static int32_t algorithm = 0;
+  if (__glibc_unlikely (algorithm == 0))
+    algorithm = TUNABLE_GET (glibc, rtld, dynamic_sort,
+			     int32_t, NULL);
+
+  /* It can be tempting to use a static function pointer to store and call
+     the current selected sorting algorithm routine, but experimentation
+     shows that current processors still do not handle indirect branches
+     that efficiently, plus a static function pointer will involve
+     PTR_MANGLE/DEMANGLE, further impairing performance of small, common
+     input cases. A simple if-case with direct function calls appears to
+     be the fastest.  */
+  if (__glibc_likely (algorithm == 1))
+    _dl_sort_maps_original (maps, nmaps, skip, for_fini);
+  else if (algorithm == 2)
+    _dl_sort_maps_dfs (maps, nmaps, skip, for_fini);
+  else
+    __builtin_unreachable ();
+}
+
+#endif /* HAVE_TUNABLES.  */
diff --git a/elf/dl-tunables.list b/elf/dl-tunables.list
index 3cf0ad83ec..85157040ad 100644
--- a/elf/dl-tunables.list
+++ b/elf/dl-tunables.list
@@ -150,4 +150,13 @@  glibc {
       security_level: SXID_IGNORE
     }
   }
+
+  rtld {
+    dynamic_sort {
+      type: INT_32
+      minval: 1
+      maxval: 2
+      default: 1
+    }
+  }
 }
diff --git a/include/link.h b/include/link.h
index 4af16cb596..f2dbcbaf77 100644
--- a/include/link.h
+++ b/include/link.h
@@ -181,6 +181,10 @@  struct link_map
     unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
     unsigned int l_global:1;	/* Nonzero if object in _dl_global_scope.  */
     unsigned int l_reserved:2;	/* Reserved for internal use.  */
+    unsigned int l_visited:1;   /* Used internally for map dependency
+				   graph traversal.  */
+    unsigned int l_map_used:1;  /* These two bits are used during traversal */
+    unsigned int l_map_done:1;  /* of maps in _dl_close_worker(). */
     unsigned int l_phdr_allocated:1; /* Nonzero if the data structure pointed
 					to by `l_phdr' is allocated.  */
     unsigned int l_soname_added:1; /* Nonzero if the SONAME is for sure in
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index aab7245e93..339e2d4310 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -1040,7 +1040,7 @@  extern void _dl_fini (void) attribute_hidden;
 
 /* Sort array MAPS according to dependencies of the contained objects.  */
 extern void _dl_sort_maps (struct link_map **maps, unsigned int nmaps,
-			   char *used, bool for_fini) attribute_hidden;
+			   unsigned int skip, bool for_fini) attribute_hidden;
 
 /* The dynamic linker calls this function before and having changing
    any shared object mappings.  The `r_state' member of `struct r_debug'