[BZ,#13862] Reuse of cached stack can cause bounds overrun of thread DTV

Message ID 20141127140654.GA31381@gmail.com
State Superseded
Headers

Commit Message

H.J. Lu Nov. 27, 2014, 2:06 p.m. UTC
  On Wed, Nov 26, 2014 at 09:06:32PM -0800, H.J. Lu wrote:
> On Wed, Nov 26, 2014 at 9:39 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Wed, Nov 26, 2014 at 9:07 AM, Joseph Myers <joseph@codesourcery.com> wrote:
> >> On Wed, 26 Nov 2014, H.J. Lu wrote:
> >>
> >>> On Wed, Nov 26, 2014 at 8:23 AM, Joseph Myers <joseph@codesourcery.com> wrote:
> >>> > On Wed, 26 Nov 2014, H.J. Lu wrote:
> >>> >
> >>> >> Is Paul's paperwork ready now?
> >>> >
> >>> > There is an assignment for past and future changes (dated 2014-5-13) and a
> >>> > disclaimer from Procera Networks for past changes (dated 2014-1-21).
> >>> >
> >>>
> >>> Can we move forward with Paul's fix?
> >>
> >> Someone needs to retest and resubmit it.  It's so old it's not in
> >> patchwork (which only tracks patch submissions back to this March).
> >>
> >
> > I will rebase and submit a new patch with a testcase.
> 
> Paul's fix is based on _dl_update_slotinfo in the same file.  However,
> his patch failed to properly convert INSTALL_NEW_DTV to INSTALL_DTV.
> I believe this is the correct fix.  I will submit the complete patch with a
> testcase later.
> 
> 

Here is the promised patch.  OK for trunk?

Thanks.


H.J.
----
From b811f571b5604f84c84d5b96394e6b892e05d14f Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Thu, 27 Nov 2014 05:42:23 -0800
Subject: [PATCH] Reallocate DTV if the current DTV isn't big enough

This patch reallocates DTV if the current DTV isn't big enough, using
the similar approach in _dl_update_slotinfo.  Tested on X86-64, x32
and ia32.

	[BZ #13862]
	* elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
	(_dl_static_dtv, _dl_initial_dtv): Moved before ...
	(_dl_allocate_tls_init): This.  Reallocate DTV if the current
	DTV isn't big enough.
	* nptl/Makefile (tests): Add tst-stack4.
	(modules-names): Add tst-stack4mod.
	($(objpfx)tst-stack4): New.
	(tst-stack4mod.sos): Likewise.
	($(objpfx)tst-stack4.out): Likewise.
	($(tst-stack4mod.sos)): Likewise.
	(clean): Likewise.
	* nptl/tst-stack4.c: New file.
	* nptl/tst-stack4mod.c: Likewise.
---
 ChangeLog            |  17 +++++++
 elf/dl-tls.c         |  53 ++++++++++++++++++---
 nptl/Makefile        |  17 ++++++-
 nptl/tst-stack4.c    | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-stack4mod.c |   9 ++++
 5 files changed, 217 insertions(+), 9 deletions(-)
 create mode 100644 nptl/tst-stack4.c
 create mode 100644 nptl/tst-stack4mod.c
  

Comments

Carlos O'Donell Nov. 27, 2014, 4:28 p.m. UTC | #1
On 11/27/2014 09:06 AM, H.J. Lu wrote:
> Here is the promised patch.  OK for trunk?

Not yet.

My personal opinion is that there has been insufficient detailed
analysis posted for this patch. For this kind of change I want to
see the author understand the code paths in question and how the
call flow gets to the position of failure. Paul has given a hand-waving
example of the failure without much detail. I would like to see
sufficient detail like:
---
When calling pthread_create we call allocate_stack which then reuses
a stack by finding one via get_cached_stack. In get_cached_stack we
select an appropriately sized stack and in that process get the old
dtv from the thread, which could be reused if it was the right size.
Then _dl_allocate_tls_init is called to reinitialize the TLS structures
for the thread. In _dl_allocate_tls_init the expectation that is violated
is that dtv should have been calloc'd to a size of GL(dl_tls_max_dtv_idx) 
+ DTV_SURPLUS; by allocate_dtv. While this is true, it was done with an
old value of dl_tls_max_dtv_idx and that value might have changed since
the stack was last released.
---
This is just an example, you would have to verify that this is actually
the case.

Q: A dlclose() may modify GL(dl_tls_max_dtv_idx) concurrently with another
   thread starting up. Is it safe to read GL(dl_tls_max_dtv_idx) more than
   once in a function since they may differ? Does reading GL(dl_tls_max_dtv_idx)
   require an atomic load? We are trying to be explicit about this in glibc.

Q: If GL(dl_tls_max_dtv_idx) may change at any time during thread startup
   does it make sense to reallocate the dtv? Why not leave it uninitialized
   and allow _dl_update_slotinfo to reallocate it?

Q: If you don't want to delay the dtv reallocation to _dl_update_slotinfo,
   why don't you refactor the reallocation of the dtv into a single function
   to avoid duplication with the same code in _dl_update_slotinfo?

> ----
> From b811f571b5604f84c84d5b96394e6b892e05d14f Mon Sep 17 00:00:00 2001
> From: "H.J. Lu" <hjl.tools@gmail.com>
> Date: Thu, 27 Nov 2014 05:42:23 -0800
> Subject: [PATCH] Reallocate DTV if the current DTV isn't big enough
> 
> This patch reallocates DTV if the current DTV isn't big enough, using
> the similar approach in _dl_update_slotinfo.  Tested on X86-64, x32
> and ia32.
> 
> 	[BZ #13862]
> 	* elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
> 	(_dl_static_dtv, _dl_initial_dtv): Moved before ...
> 	(_dl_allocate_tls_init): This.  Reallocate DTV if the current
> 	DTV isn't big enough.
> 	* nptl/Makefile (tests): Add tst-stack4.
> 	(modules-names): Add tst-stack4mod.
> 	($(objpfx)tst-stack4): New.
> 	(tst-stack4mod.sos): Likewise.
> 	($(objpfx)tst-stack4.out): Likewise.
> 	($(tst-stack4mod.sos)): Likewise.
> 	(clean): Likewise.
> 	* nptl/tst-stack4.c: New file.
> 	* nptl/tst-stack4mod.c: Likewise.
> ---
>  ChangeLog            |  17 +++++++
>  elf/dl-tls.c         |  53 ++++++++++++++++++---
>  nptl/Makefile        |  17 ++++++-
>  nptl/tst-stack4.c    | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  nptl/tst-stack4mod.c |   9 ++++
>  5 files changed, 217 insertions(+), 9 deletions(-)
>  create mode 100644 nptl/tst-stack4.c
>  create mode 100644 nptl/tst-stack4mod.c
> 
> diff --git a/ChangeLog b/ChangeLog
> index 92ffe16..e6474be 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,20 @@
> +2014-11-27  H.J. Lu  <hongjiu.lu@intel.com>
> +
> +	[BZ #13862]
> +	* elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
> +	(_dl_static_dtv, _dl_initial_dtv): Moved before ...
> +	(_dl_allocate_tls_init): This.  Reallocate DTV if the current
> +	DTV isn't big enough.
> +	* nptl/Makefile (tests): Add tst-stack4.
> +	(modules-names): Add tst-stack4mod.
> +	($(objpfx)tst-stack4): New.
> +	(tst-stack4mod.sos): Likewise.
> +	($(objpfx)tst-stack4.out): Likewise.
> +	($(tst-stack4mod.sos)): Likewise.
> +	(clean): Likewise.
> +	* nptl/tst-stack4.c: New file.
> +	* nptl/tst-stack4mod.c: Likewise.
> +
>  2014-11-27  Stefan Liebler  <stli@linux.vnet.ibm.com>
>  
>  	* nscd/connections.c: Include libc-internal.h because of macro
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 5204fda..3142b16 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -34,14 +34,12 @@
>  
>  
>  /* Out-of-memory handler.  */
> -#ifdef SHARED
>  static void
>  __attribute__ ((__noreturn__))
>  oom (void)
>  {
>    _dl_fatal_printf ("cannot allocate memory for thread-local data: ABORT\n");
>  }
> -#endif

OK.

>  
>  size_t
> @@ -397,6 +395,11 @@ _dl_allocate_tls_storage (void)
>  }
>  
>  
> +#ifndef SHARED
> +extern dtv_t _dl_static_dtv[];
> +# define _dl_initial_dtv (&_dl_static_dtv[1])
> +#endif
> +

OK.

>  void *
>  internal_function
>  _dl_allocate_tls_init (void *result)
> @@ -410,6 +413,47 @@ _dl_allocate_tls_init (void *result)
>    size_t total = 0;
>    size_t maxgen = 0;
>  
> +  /* Check if the current dtv is big enough.   */
> +  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
> +    {
> +      /* Reallocate the dtv.  */
> +      dtv_t *newp;
> +      size_t newsize = GL(dl_tls_max_dtv_idx) + DTV_SURPLUS;

IMO, access to GL(dl_tls_max_dtv_idx) needs to be annotated as
an atomic load until someone does the analysis to relax that.
I'm not commenting on the other users in the function that you
didn't touch.

> +      size_t oldsize = dtv[-1].counter;
> +
> +      if (dtv == GL(dl_initial_dtv))
> +	{
> +	  /* This is the initial dtv that was allocated
> +	     during rtld startup using the dl-minimal.c
> +	     malloc instead of the real malloc.  We can't
> +	     free it, we have to abandon the old storage.  */
> +
> +	  newp = malloc ((2 + newsize) * sizeof (dtv_t));
> +	  if (newp == NULL)
> +	    oom ();
> +	  memcpy (newp, &dtv[-1], (2 + oldsize) * sizeof (dtv_t));
> +	}
> +      else
> +	{
> +	  newp = realloc (&dtv[-1],
> +			  (2 + newsize) * sizeof (dtv_t));
> +	  if (newp == NULL)
> +	    oom ();
> +	}
> +
> +      newp[0].counter = newsize;
> +
> +      /* Clear the newly allocated part.  */
> +      memset (newp + 2 + oldsize, '\0',
> +	      (newsize - oldsize) * sizeof (dtv_t));
> +
> +      /* Install this new dtv in the thread data structures.  */
> +      INSTALL_DTV (result, newp);
> +
> +      /* Point dtv to the generation counter.  */
> +      dtv = &newp[1];
> +    }
> +

The code itself is OK, but it's a copy of what _dl_update_slotinfo
does. Please refactor into a static inline function e.g. resize_dtv.

>    /* We have to prepare the dtv for all currently loaded modules using
>       TLS.  For those which are dynamically loaded we add the values
>       indicating deferred allocation.  */
> @@ -492,11 +536,6 @@ _dl_allocate_tls (void *mem)
>  rtld_hidden_def (_dl_allocate_tls)
>  
>  
> -#ifndef SHARED
> -extern dtv_t _dl_static_dtv[];
> -# define _dl_initial_dtv (&_dl_static_dtv[1])
> -#endif
> -
>  void
>  internal_function
>  _dl_deallocate_tls (void *tcb, bool dealloc_tcb)
> diff --git a/nptl/Makefile b/nptl/Makefile
> index 86a1b0c..ac76596 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -254,7 +254,7 @@ tests = tst-typesizes \
>  	tst-exec1 tst-exec2 tst-exec3 tst-exec4 \
>  	tst-exit1 tst-exit2 tst-exit3 \
>  	tst-stdio1 tst-stdio2 \
> -	tst-stack1 tst-stack2 tst-stack3 tst-pthread-getattr \
> +	tst-stack1 tst-stack2 tst-stack3 tst-stack4 tst-pthread-getattr \

OK.

>  	tst-pthread-attr-affinity \
>  	tst-unload \
>  	tst-dlsym1 \
> @@ -311,7 +311,7 @@ endif
>  
>  modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>  		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
> -		tst-tls5modd tst-tls5mode tst-tls5modf \
> +		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \

OK.

>  		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
>  test-extras += $(modules-names) tst-cleanup4aux
> @@ -479,6 +479,19 @@ $(objpfx)tst-stack3-mem.out: $(objpfx)tst-stack3.out
>  	$(evaluate-test)
>  generated += tst-stack3-mem.out tst-stack3.mtrace
>  
> +$(objpfx)tst-stack4: $(libdl) $(shared-thread-library)
> +tst-stack4mod.sos=$(shell for i in 0 1 2 3 4 5 6 7 8 9 10 \
> +				   11 12 13 14 15 16 17 18 19; do \
> +			    for j in 0 1 2 3 4 5 6 7 8 9 10 \
> +				     11 12 13 14 15 16 17 18 19; do \
> +			      echo $(objpfx)tst-stack4mod-$$i-$$j.so; \
> +			    done; done)
> +$(objpfx)tst-stack4.out: $(tst-stack4mod.sos)
> +$(tst-stack4mod.sos): $(objpfx)tst-stack4mod.so
> +	cp -f $< $@
> +clean:
> +	rm -f $(tst-stack4mod.sos)
> +

OK.

>  $(objpfx)tst-cleanup4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
>  $(objpfx)tst-cleanupx4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
>  
> diff --git a/nptl/tst-stack4.c b/nptl/tst-stack4.c
> new file mode 100644
> index 0000000..0a0d2f9
> --- /dev/null
> +++ b/nptl/tst-stack4.c
> @@ -0,0 +1,130 @@

Missing first line explaining test.

Missing license.

> +#include <stdio.h>
> +#include <stdint.h>
> +#include <dlfcn.h>
> +#include <assert.h>
> +#include <pthread.h>
> +
> +#define DSO_SHARED_FILES 20
> +#define DSO_OPEN_THREADS 20
> +#define DSO_EXEC_THREADS 2

Why these particular constants? If you chose them because
they happened to work on a particular system, please comment
that.

> +
> +pthread_mutex_t g_lock;
> +

Describe what the mutex protects in a comment.

> +typedef void (*function) (void);
> +
> +void *
> +dso_invoke(void *dso_fun)
> +{
> +  function *fun_vec = (function *) dso_fun;
> +  int dso;
> +
> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
> +    (*fun_vec[dso]) ();
> +
> +  pthread_exit (NULL);
> +}
> +
> +void *
> +dso_process (void * p)
> +{
> +  void *handle[DSO_SHARED_FILES];
> +  function fun_vec[DSO_SHARED_FILES];
> +  char dso_path[DSO_SHARED_FILES][100];
> +  int dso;
> +  uintptr_t t = (uintptr_t) p;
> +
> +  // Open dso's and get a function.

Use /* */.

> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
> +    {
> +      sprintf (dso_path[dso], "tst-stack4mod-%i-%i.so", t, dso);
> +
> +      pthread_mutex_lock (&g_lock);
> +
> +      handle[dso] = dlopen (dso_path[dso], RTLD_NOW);
> +      assert (handle[dso]);
> +
> +      fun_vec[dso] = (function) dlsym (handle[dso], "function");
> +      assert (fun_vec[dso]);
> +
> +      pthread_mutex_unlock (&g_lock);
> +    }
> +
> +  // Spawn workers

Use /* */

> +  pthread_t thread[DSO_EXEC_THREADS];
> +  int i, ret;
> +  uintptr_t result = 0;
> +  for (i = 0; i < DSO_EXEC_THREADS; i++)
> +    {
> +      pthread_mutex_lock (&g_lock);
> +      ret = pthread_create (&thread[i], NULL, dso_invoke, (void *) fun_vec);
> +      if (ret != 0)
> +	{
> +	  printf ("pthread_create failed: %d\n", ret);
> +	  result = 1;
> +	}
> +      pthread_mutex_unlock (&g_lock);
> +    }
> +
> +  if (!result) 
> +    for (i = 0; i < DSO_EXEC_THREADS; i++)
> +      {
> +	ret = pthread_join (thread[i], NULL);
> +	if (ret != 0)
> +	  {
> +	    printf ("pthread_join failed: %d\n", ret);
> +	    result = 1;
> +	  }
> +      }
> +
> +  // Close all dso's

Use /* */

> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
> +    {
> +      pthread_mutex_lock (&g_lock);
> +      dlclose (handle[dso]);
> +      pthread_mutex_unlock (&g_lock);
> +    }
> +
> +  // Exit
> +  pthread_exit ((void *) result);
> +}
> +
> +

This test needs a verbose, minimum one paragraph, comment explaining 
what the test is testing.

> +int
> +main()
> +{
> +  pthread_t thread[DSO_OPEN_THREADS];
> +  int i,j;
> +  int ret;
> +  int result = 0;
> +
> +  pthread_mutex_init (&g_lock, NULL);
> +
> +  for (j = 0; j < 100; j++)

Why 100 times? Please comment.

> +    {
> +      for (i = 0; i < DSO_OPEN_THREADS; i++)
> +	{
> +	  ret = pthread_create (&thread[i], NULL, dso_process,
> +				(void *) (uintptr_t) i);
> +	  if (ret != 0)
> +	    {
> +	      printf ("pthread_create failed: %d\n", ret);
> +	      result = 1;
> +	    }
> +	}
> +
> +      if (result)
> +	break;
> +
> +      for (i = 0; i < DSO_OPEN_THREADS; i++)
> +	{
> +	  ret = pthread_join (thread[i], NULL);
> +	  if (ret != 0)
> +	    {
> +	      printf ("pthread_join failed: %d\n", ret);
> +	      result = 1;
> +	    }
> +	}
> +    }
> +
> +  return result;
> +}
> diff --git a/nptl/tst-stack4mod.c b/nptl/tst-stack4mod.c
> new file mode 100644
> index 0000000..b8e800f
> --- /dev/null
> +++ b/nptl/tst-stack4mod.c
> @@ -0,0 +1,9 @@

Please give a one line description at the top of this file.

> +__thread int var[256] __attribute__ ((visibility ("hidden"))) = {0};
> +
> +void
> +function (void)
> +{
> +  int i;
> +  for (i = 0; i < 256; i++)

Why 256 times? Please comment.

> +    var[i] = i;
> +}
> 

Cheers,
Carlos.
  
H.J. Lu Nov. 27, 2014, 5:45 p.m. UTC | #2
On Thu, Nov 27, 2014 at 8:28 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 11/27/2014 09:06 AM, H.J. Lu wrote:
>> Here is the promised patch.  OK for trunk?
>
> Not yet.
>
> My personal opinion is that there has been insufficient detailed
> analysis posted for this patch. For this kind of change I want to
> see the author understand the code paths in question and how the
> call flow gets to the position of failure. Paul has given a hand-waving
> example of the failure without much detail. I would like to see
> sufficient detail like:
> ---
> When calling pthread_create we call allocate_stack which then reuses
> a stack by finding one via get_cached_stack. In get_cached_stack we
> select an appropriately sized stack and in that process get the old
> dtv from the thread, which could be reused if it was the right size.
> Then _dl_allocate_tls_init is called to reinitialize the TLS structures
> for the thread. In _dl_allocate_tls_init the expectation that is violated
> is that dtv should have been calloc'd to a size of GL(dl_tls_max_dtv_idx)
> + DTV_SURPLUS; by allocate_dtv. While this is true, it was done with an
> old value of dl_tls_max_dtv_idx and that value might have changed since
> the stack was last released.
> ---
> This is just an example, you would have to verify that this is actually
> the case.

That is the case.

> Q: A dlclose() may modify GL(dl_tls_max_dtv_idx) concurrently with another
>    thread starting up. Is it safe to read GL(dl_tls_max_dtv_idx) more than
>    once in a function since they may differ? Does reading GL(dl_tls_max_dtv_idx)
>    require an atomic load? We are trying to be explicit about this in glibc.

There is a race between DTV access and dlclose:

https://sourceware.org/ml/libc-alpha/2014-11/msg00469.html

> Q: If GL(dl_tls_max_dtv_idx) may change at any time during thread startup
>    does it make sense to reallocate the dtv? Why not leave it uninitialized
>    and allow _dl_update_slotinfo to reallocate it?

Are you suggesting folding  _dl_allocate_tls_init into _dl_update_slotinfo
and remove _dl_allocate_tls_init?  I can give it a try.

> Q: If you don't want to delay the dtv reallocation to _dl_update_slotinfo,
>    why don't you refactor the reallocation of the dtv into a single function
>    to avoid duplication with the same code in _dl_update_slotinfo?

I will do that.

>> ----
>> From b811f571b5604f84c84d5b96394e6b892e05d14f Mon Sep 17 00:00:00 2001
>> From: "H.J. Lu" <hjl.tools@gmail.com>
>> Date: Thu, 27 Nov 2014 05:42:23 -0800
>> Subject: [PATCH] Reallocate DTV if the current DTV isn't big enough
>>
>> This patch reallocates DTV if the current DTV isn't big enough, using
>> the similar approach in _dl_update_slotinfo.  Tested on X86-64, x32
>> and ia32.
>>
>>       [BZ #13862]
>>       * elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
>>       (_dl_static_dtv, _dl_initial_dtv): Moved before ...
>>       (_dl_allocate_tls_init): This.  Reallocate DTV if the current
>>       DTV isn't big enough.
>>       * nptl/Makefile (tests): Add tst-stack4.
>>       (modules-names): Add tst-stack4mod.
>>       ($(objpfx)tst-stack4): New.
>>       (tst-stack4mod.sos): Likewise.
>>       ($(objpfx)tst-stack4.out): Likewise.
>>       ($(tst-stack4mod.sos)): Likewise.
>>       (clean): Likewise.
>>       * nptl/tst-stack4.c: New file.
>>       * nptl/tst-stack4mod.c: Likewise.
>> ---
>>  ChangeLog            |  17 +++++++
>>  elf/dl-tls.c         |  53 ++++++++++++++++++---
>>  nptl/Makefile        |  17 ++++++-
>>  nptl/tst-stack4.c    | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  nptl/tst-stack4mod.c |   9 ++++
>>  5 files changed, 217 insertions(+), 9 deletions(-)
>>  create mode 100644 nptl/tst-stack4.c
>>  create mode 100644 nptl/tst-stack4mod.c
>>
>> diff --git a/ChangeLog b/ChangeLog
>> index 92ffe16..e6474be 100644
>> --- a/ChangeLog
>> +++ b/ChangeLog
>> @@ -1,3 +1,20 @@
>> +2014-11-27  H.J. Lu  <hongjiu.lu@intel.com>
>> +
>> +     [BZ #13862]
>> +     * elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
>> +     (_dl_static_dtv, _dl_initial_dtv): Moved before ...
>> +     (_dl_allocate_tls_init): This.  Reallocate DTV if the current
>> +     DTV isn't big enough.
>> +     * nptl/Makefile (tests): Add tst-stack4.
>> +     (modules-names): Add tst-stack4mod.
>> +     ($(objpfx)tst-stack4): New.
>> +     (tst-stack4mod.sos): Likewise.
>> +     ($(objpfx)tst-stack4.out): Likewise.
>> +     ($(tst-stack4mod.sos)): Likewise.
>> +     (clean): Likewise.
>> +     * nptl/tst-stack4.c: New file.
>> +     * nptl/tst-stack4mod.c: Likewise.
>> +
>>  2014-11-27  Stefan Liebler  <stli@linux.vnet.ibm.com>
>>
>>       * nscd/connections.c: Include libc-internal.h because of macro
>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>> index 5204fda..3142b16 100644
>> --- a/elf/dl-tls.c
>> +++ b/elf/dl-tls.c
>> @@ -34,14 +34,12 @@
>>
>>
>>  /* Out-of-memory handler.  */
>> -#ifdef SHARED
>>  static void
>>  __attribute__ ((__noreturn__))
>>  oom (void)
>>  {
>>    _dl_fatal_printf ("cannot allocate memory for thread-local data: ABORT\n");
>>  }
>> -#endif
>
> OK.
>
>>
>>  size_t
>> @@ -397,6 +395,11 @@ _dl_allocate_tls_storage (void)
>>  }
>>
>>
>> +#ifndef SHARED
>> +extern dtv_t _dl_static_dtv[];
>> +# define _dl_initial_dtv (&_dl_static_dtv[1])
>> +#endif
>> +
>
> OK.
>
>>  void *
>>  internal_function
>>  _dl_allocate_tls_init (void *result)
>> @@ -410,6 +413,47 @@ _dl_allocate_tls_init (void *result)
>>    size_t total = 0;
>>    size_t maxgen = 0;
>>
>> +  /* Check if the current dtv is big enough.   */
>> +  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
>> +    {
>> +      /* Reallocate the dtv.  */
>> +      dtv_t *newp;
>> +      size_t newsize = GL(dl_tls_max_dtv_idx) + DTV_SURPLUS;
>
> IMO, access to GL(dl_tls_max_dtv_idx) needs to be annotated as
> an atomic load until someone does the analysis to relax that.
> I'm not commenting on the other users in the function that you
> didn't touch.

There is a race condition. But it is a separate issue.  They should
be fixed together.

>> +      size_t oldsize = dtv[-1].counter;
>> +
>> +      if (dtv == GL(dl_initial_dtv))
>> +     {
>> +       /* This is the initial dtv that was allocated
>> +          during rtld startup using the dl-minimal.c
>> +          malloc instead of the real malloc.  We can't
>> +          free it, we have to abandon the old storage.  */
>> +
>> +       newp = malloc ((2 + newsize) * sizeof (dtv_t));
>> +       if (newp == NULL)
>> +         oom ();
>> +       memcpy (newp, &dtv[-1], (2 + oldsize) * sizeof (dtv_t));
>> +     }
>> +      else
>> +     {
>> +       newp = realloc (&dtv[-1],
>> +                       (2 + newsize) * sizeof (dtv_t));
>> +       if (newp == NULL)
>> +         oom ();
>> +     }
>> +
>> +      newp[0].counter = newsize;
>> +
>> +      /* Clear the newly allocated part.  */
>> +      memset (newp + 2 + oldsize, '\0',
>> +           (newsize - oldsize) * sizeof (dtv_t));
>> +
>> +      /* Install this new dtv in the thread data structures.  */
>> +      INSTALL_DTV (result, newp);
>> +
>> +      /* Point dtv to the generation counter.  */
>> +      dtv = &newp[1];
>> +    }
>> +
>
> The code itself is OK, but it's a copy of what _dl_update_slotinfo
> does. Please refactor into a static inline function e.g. resize_dtv.
>
>>    /* We have to prepare the dtv for all currently loaded modules using
>>       TLS.  For those which are dynamically loaded we add the values
>>       indicating deferred allocation.  */
>> @@ -492,11 +536,6 @@ _dl_allocate_tls (void *mem)
>>  rtld_hidden_def (_dl_allocate_tls)
>>
>>
>> -#ifndef SHARED
>> -extern dtv_t _dl_static_dtv[];
>> -# define _dl_initial_dtv (&_dl_static_dtv[1])
>> -#endif
>> -
>>  void
>>  internal_function
>>  _dl_deallocate_tls (void *tcb, bool dealloc_tcb)
>> diff --git a/nptl/Makefile b/nptl/Makefile
>> index 86a1b0c..ac76596 100644
>> --- a/nptl/Makefile
>> +++ b/nptl/Makefile
>> @@ -254,7 +254,7 @@ tests = tst-typesizes \
>>       tst-exec1 tst-exec2 tst-exec3 tst-exec4 \
>>       tst-exit1 tst-exit2 tst-exit3 \
>>       tst-stdio1 tst-stdio2 \
>> -     tst-stack1 tst-stack2 tst-stack3 tst-pthread-getattr \
>> +     tst-stack1 tst-stack2 tst-stack3 tst-stack4 tst-pthread-getattr \
>
> OK.
>
>>       tst-pthread-attr-affinity \
>>       tst-unload \
>>       tst-dlsym1 \
>> @@ -311,7 +311,7 @@ endif
>>
>>  modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>>               tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>> -             tst-tls5modd tst-tls5mode tst-tls5modf \
>> +             tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>
> OK.
>
>>               tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
>>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
>>  test-extras += $(modules-names) tst-cleanup4aux
>> @@ -479,6 +479,19 @@ $(objpfx)tst-stack3-mem.out: $(objpfx)tst-stack3.out
>>       $(evaluate-test)
>>  generated += tst-stack3-mem.out tst-stack3.mtrace
>>
>> +$(objpfx)tst-stack4: $(libdl) $(shared-thread-library)
>> +tst-stack4mod.sos=$(shell for i in 0 1 2 3 4 5 6 7 8 9 10 \
>> +                                11 12 13 14 15 16 17 18 19; do \
>> +                         for j in 0 1 2 3 4 5 6 7 8 9 10 \
>> +                                  11 12 13 14 15 16 17 18 19; do \
>> +                           echo $(objpfx)tst-stack4mod-$$i-$$j.so; \
>> +                         done; done)
>> +$(objpfx)tst-stack4.out: $(tst-stack4mod.sos)
>> +$(tst-stack4mod.sos): $(objpfx)tst-stack4mod.so
>> +     cp -f $< $@
>> +clean:
>> +     rm -f $(tst-stack4mod.sos)
>> +
>
> OK.
>
>>  $(objpfx)tst-cleanup4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
>>  $(objpfx)tst-cleanupx4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
>>
>> diff --git a/nptl/tst-stack4.c b/nptl/tst-stack4.c
>> new file mode 100644
>> index 0000000..0a0d2f9
>> --- /dev/null
>> +++ b/nptl/tst-stack4.c
>> @@ -0,0 +1,130 @@
>
> Missing first line explaining test.

I will add it.

> Missing license.

What is the license policy on testcases?

>> +#include <stdio.h>
>> +#include <stdint.h>
>> +#include <dlfcn.h>
>> +#include <assert.h>
>> +#include <pthread.h>
>> +
>> +#define DSO_SHARED_FILES 20
>> +#define DSO_OPEN_THREADS 20
>> +#define DSO_EXEC_THREADS 2
>
> Why these particular constants? If you chose them because
> they happened to work on a particular system, please comment
> that.

They came from:

https://sourceware.org/bugzilla/attachment.cgi?id=6290

>> +
>> +pthread_mutex_t g_lock;
>> +
>
> Describe what the mutex protects in a comment.

I will add a few words.

>> +typedef void (*function) (void);
>> +
>> +void *
>> +dso_invoke(void *dso_fun)
>> +{
>> +  function *fun_vec = (function *) dso_fun;
>> +  int dso;
>> +
>> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
>> +    (*fun_vec[dso]) ();
>> +
>> +  pthread_exit (NULL);
>> +}
>> +
>> +void *
>> +dso_process (void * p)
>> +{
>> +  void *handle[DSO_SHARED_FILES];
>> +  function fun_vec[DSO_SHARED_FILES];
>> +  char dso_path[DSO_SHARED_FILES][100];
>> +  int dso;
>> +  uintptr_t t = (uintptr_t) p;
>> +
>> +  // Open dso's and get a function.
>
> Use /* */.

We have been using // for comments in glibc for a while.  Why not here?

>
>> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
>> +    {
>> +      sprintf (dso_path[dso], "tst-stack4mod-%i-%i.so", t, dso);
>> +
>> +      pthread_mutex_lock (&g_lock);
>> +
>> +      handle[dso] = dlopen (dso_path[dso], RTLD_NOW);
>> +      assert (handle[dso]);
>> +
>> +      fun_vec[dso] = (function) dlsym (handle[dso], "function");
>> +      assert (fun_vec[dso]);
>> +
>> +      pthread_mutex_unlock (&g_lock);
>> +    }
>> +
>> +  // Spawn workers
>
> Use /* */
>
>> +  pthread_t thread[DSO_EXEC_THREADS];
>> +  int i, ret;
>> +  uintptr_t result = 0;
>> +  for (i = 0; i < DSO_EXEC_THREADS; i++)
>> +    {
>> +      pthread_mutex_lock (&g_lock);
>> +      ret = pthread_create (&thread[i], NULL, dso_invoke, (void *) fun_vec);
>> +      if (ret != 0)
>> +     {
>> +       printf ("pthread_create failed: %d\n", ret);
>> +       result = 1;
>> +     }
>> +      pthread_mutex_unlock (&g_lock);
>> +    }
>> +
>> +  if (!result)
>> +    for (i = 0; i < DSO_EXEC_THREADS; i++)
>> +      {
>> +     ret = pthread_join (thread[i], NULL);
>> +     if (ret != 0)
>> +       {
>> +         printf ("pthread_join failed: %d\n", ret);
>> +         result = 1;
>> +       }
>> +      }
>> +
>> +  // Close all dso's
>
> Use /* */
>
>> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
>> +    {
>> +      pthread_mutex_lock (&g_lock);
>> +      dlclose (handle[dso]);
>> +      pthread_mutex_unlock (&g_lock);
>> +    }
>> +
>> +  // Exit
>> +  pthread_exit ((void *) result);
>> +}
>> +
>> +
>
> This test needs a verbose, minimum one paragraph, comment explaining
> what the test is testing.
>
>> +int
>> +main()
>> +{
>> +  pthread_t thread[DSO_OPEN_THREADS];
>> +  int i,j;
>> +  int ret;
>> +  int result = 0;
>> +
>> +  pthread_mutex_init (&g_lock, NULL);
>> +
>> +  for (j = 0; j < 100; j++)
>
> Why 100 times? Please comment.
>
>> +    {
>> +      for (i = 0; i < DSO_OPEN_THREADS; i++)
>> +     {
>> +       ret = pthread_create (&thread[i], NULL, dso_process,
>> +                             (void *) (uintptr_t) i);
>> +       if (ret != 0)
>> +         {
>> +           printf ("pthread_create failed: %d\n", ret);
>> +           result = 1;
>> +         }
>> +     }
>> +
>> +      if (result)
>> +     break;
>> +
>> +      for (i = 0; i < DSO_OPEN_THREADS; i++)
>> +     {
>> +       ret = pthread_join (thread[i], NULL);
>> +       if (ret != 0)
>> +         {
>> +           printf ("pthread_join failed: %d\n", ret);
>> +           result = 1;
>> +         }
>> +     }
>> +    }
>> +
>> +  return result;
>> +}
>> diff --git a/nptl/tst-stack4mod.c b/nptl/tst-stack4mod.c
>> new file mode 100644
>> index 0000000..b8e800f
>> --- /dev/null
>> +++ b/nptl/tst-stack4mod.c
>> @@ -0,0 +1,9 @@
>
> Please give a one line description at the top of this file.
>
>> +__thread int var[256] __attribute__ ((visibility ("hidden"))) = {0};
>> +
>> +void
>> +function (void)
>> +{
>> +  int i;
>> +  for (i = 0; i < 256; i++)
>
> Why 256 times? Please comment.
>
>> +    var[i] = i;
>> +}
>>

I don't have answers for any particular values in this
testcase.
  
Carlos O'Donell Nov. 27, 2014, 9:51 p.m. UTC | #3
On 11/27/2014 12:45 PM, H.J. Lu wrote:
> On Thu, Nov 27, 2014 at 8:28 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 11/27/2014 09:06 AM, H.J. Lu wrote:
>>> Here is the promised patch.  OK for trunk?
>>
>> Not yet.
>>
>> My personal opinion is that there has been insufficient detailed
>> analysis posted for this patch. For this kind of change I want to
>> see the author understand the code paths in question and how the
>> call flow gets to the position of failure. Paul has given a hand-waving
>> example of the failure without much detail. I would like to see
>> sufficient detail like:
>> ---
>> When calling pthread_create we call allocate_stack which then reuses
>> a stack by finding one via get_cached_stack. In get_cached_stack we
>> select an appropriately sized stack and in that process get the old
>> dtv from the thread, which could be reused if it was the right size.
>> Then _dl_allocate_tls_init is called to reinitialize the TLS structures
>> for the thread. In _dl_allocate_tls_init the expectation that is violated
>> is that dtv should have been calloc'd to a size of GL(dl_tls_max_dtv_idx)
>> + DTV_SURPLUS; by allocate_dtv. While this is true, it was done with an
>> old value of dl_tls_max_dtv_idx and that value might have changed since
>> the stack was last released.
>> ---
>> This is just an example, you would have to verify that this is actually
>> the case.
> 
> That is the case.

Good.

>> Q: A dlclose() may modify GL(dl_tls_max_dtv_idx) concurrently with another
>>    thread starting up. Is it safe to read GL(dl_tls_max_dtv_idx) more than
>>    once in a function since they may differ? Does reading GL(dl_tls_max_dtv_idx)
>>    require an atomic load? We are trying to be explicit about this in glibc.
> 
> There is a race between DTV access and dlclose:

That doesn't answer my question.

Alex Oliva and I went through the code and we have come to the conclusion
that it is *not* safe to read GL(dl_tls_max_dtv_idx) more than once if the
code doing the reading expects to get the same value. The same value is
only guaranteed if you hold GL(dl_load_lock). However, it is safe to read
an old value, since _dl_update_slotinfo will resize the DTV again if you need
to access a TLS variable of a dlopen that was in-flight during thread creation.

Therefore, please make the read of GL(dl_tls_max_dtv_idx) an atomic access
to prevent the compiler from reloading this and to document that it is
written to by other threads concurrently.

>> Q: If GL(dl_tls_max_dtv_idx) may change at any time during thread startup
>>    does it make sense to reallocate the dtv? Why not leave it uninitialized
>>    and allow _dl_update_slotinfo to reallocate it?
> 
> Are you suggesting folding  _dl_allocate_tls_init into _dl_update_slotinfo
> and remove _dl_allocate_tls_init?  I can give it a try.

I had not thought of that, but it's a good idea.

I don't know how you'd make it work though.

I think _dl_allocate_tls_init needs to do some book keeping, but the
allocation of the DTV could be deferred to __tls_get_addr time 
(_dl_update_slotinfo).

>> Q: If you don't want to delay the dtv reallocation to _dl_update_slotinfo,
>>    why don't you refactor the reallocation of the dtv into a single function
>>    to avoid duplication with the same code in _dl_update_slotinfo?
> 
> I will do that.

Thanks.

>>> ----
>>> From b811f571b5604f84c84d5b96394e6b892e05d14f Mon Sep 17 00:00:00 2001
>>> From: "H.J. Lu" <hjl.tools@gmail.com>
>>> Date: Thu, 27 Nov 2014 05:42:23 -0800
>>> Subject: [PATCH] Reallocate DTV if the current DTV isn't big enough
>>>
>>> This patch reallocates DTV if the current DTV isn't big enough, using
>>> the similar approach in _dl_update_slotinfo.  Tested on X86-64, x32
>>> and ia32.
>>>
>>>       [BZ #13862]
>>>       * elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
>>>       (_dl_static_dtv, _dl_initial_dtv): Moved before ...
>>>       (_dl_allocate_tls_init): This.  Reallocate DTV if the current
>>>       DTV isn't big enough.
>>>       * nptl/Makefile (tests): Add tst-stack4.
>>>       (modules-names): Add tst-stack4mod.
>>>       ($(objpfx)tst-stack4): New.
>>>       (tst-stack4mod.sos): Likewise.
>>>       ($(objpfx)tst-stack4.out): Likewise.
>>>       ($(tst-stack4mod.sos)): Likewise.
>>>       (clean): Likewise.
>>>       * nptl/tst-stack4.c: New file.
>>>       * nptl/tst-stack4mod.c: Likewise.
>>> ---
>>>  ChangeLog            |  17 +++++++
>>>  elf/dl-tls.c         |  53 ++++++++++++++++++---
>>>  nptl/Makefile        |  17 ++++++-
>>>  nptl/tst-stack4.c    | 130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  nptl/tst-stack4mod.c |   9 ++++
>>>  5 files changed, 217 insertions(+), 9 deletions(-)
>>>  create mode 100644 nptl/tst-stack4.c
>>>  create mode 100644 nptl/tst-stack4mod.c
>>>
>>> diff --git a/ChangeLog b/ChangeLog
>>> index 92ffe16..e6474be 100644
>>> --- a/ChangeLog
>>> +++ b/ChangeLog
>>> @@ -1,3 +1,20 @@
>>> +2014-11-27  H.J. Lu  <hongjiu.lu@intel.com>
>>> +
>>> +     [BZ #13862]
>>> +     * elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
>>> +     (_dl_static_dtv, _dl_initial_dtv): Moved before ...
>>> +     (_dl_allocate_tls_init): This.  Reallocate DTV if the current
>>> +     DTV isn't big enough.
>>> +     * nptl/Makefile (tests): Add tst-stack4.
>>> +     (modules-names): Add tst-stack4mod.
>>> +     ($(objpfx)tst-stack4): New.
>>> +     (tst-stack4mod.sos): Likewise.
>>> +     ($(objpfx)tst-stack4.out): Likewise.
>>> +     ($(tst-stack4mod.sos)): Likewise.
>>> +     (clean): Likewise.
>>> +     * nptl/tst-stack4.c: New file.
>>> +     * nptl/tst-stack4mod.c: Likewise.
>>> +
>>>  2014-11-27  Stefan Liebler  <stli@linux.vnet.ibm.com>
>>>
>>>       * nscd/connections.c: Include libc-internal.h because of macro
>>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>>> index 5204fda..3142b16 100644
>>> --- a/elf/dl-tls.c
>>> +++ b/elf/dl-tls.c
>>> @@ -34,14 +34,12 @@
>>>
>>>
>>>  /* Out-of-memory handler.  */
>>> -#ifdef SHARED
>>>  static void
>>>  __attribute__ ((__noreturn__))
>>>  oom (void)
>>>  {
>>>    _dl_fatal_printf ("cannot allocate memory for thread-local data: ABORT\n");
>>>  }
>>> -#endif
>>
>> OK.
>>
>>>
>>>  size_t
>>> @@ -397,6 +395,11 @@ _dl_allocate_tls_storage (void)
>>>  }
>>>
>>>
>>> +#ifndef SHARED
>>> +extern dtv_t _dl_static_dtv[];
>>> +# define _dl_initial_dtv (&_dl_static_dtv[1])
>>> +#endif
>>> +
>>
>> OK.
>>
>>>  void *
>>>  internal_function
>>>  _dl_allocate_tls_init (void *result)
>>> @@ -410,6 +413,47 @@ _dl_allocate_tls_init (void *result)
>>>    size_t total = 0;
>>>    size_t maxgen = 0;
>>>
>>> +  /* Check if the current dtv is big enough.   */
>>> +  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
>>> +    {
>>> +      /* Reallocate the dtv.  */
>>> +      dtv_t *newp;
>>> +      size_t newsize = GL(dl_tls_max_dtv_idx) + DTV_SURPLUS;
>>
>> IMO, access to GL(dl_tls_max_dtv_idx) needs to be annotated as
>> an atomic load until someone does the analysis to relax that.
>> I'm not commenting on the other users in the function that you
>> didn't touch.
> 
> There is a race condition. But it is a separate issue.  They should
> be fixed together.

You are adding new code. That new code should be correct. This needs
an atomic load.

>>> +      size_t oldsize = dtv[-1].counter;
>>> +
>>> +      if (dtv == GL(dl_initial_dtv))
>>> +     {
>>> +       /* This is the initial dtv that was allocated
>>> +          during rtld startup using the dl-minimal.c
>>> +          malloc instead of the real malloc.  We can't
>>> +          free it, we have to abandon the old storage.  */
>>> +
>>> +       newp = malloc ((2 + newsize) * sizeof (dtv_t));
>>> +       if (newp == NULL)
>>> +         oom ();
>>> +       memcpy (newp, &dtv[-1], (2 + oldsize) * sizeof (dtv_t));
>>> +     }
>>> +      else
>>> +     {
>>> +       newp = realloc (&dtv[-1],
>>> +                       (2 + newsize) * sizeof (dtv_t));
>>> +       if (newp == NULL)
>>> +         oom ();
>>> +     }
>>> +
>>> +      newp[0].counter = newsize;
>>> +
>>> +      /* Clear the newly allocated part.  */
>>> +      memset (newp + 2 + oldsize, '\0',
>>> +           (newsize - oldsize) * sizeof (dtv_t));
>>> +
>>> +      /* Install this new dtv in the thread data structures.  */
>>> +      INSTALL_DTV (result, newp);
>>> +
>>> +      /* Point dtv to the generation counter.  */
>>> +      dtv = &newp[1];
>>> +    }
>>> +
>>
>> The code itself is OK, but it's a copy of what _dl_update_slotinfo
>> does. Please refactor into a static inline function e.g. resize_dtv.
>>
>>>    /* We have to prepare the dtv for all currently loaded modules using
>>>       TLS.  For those which are dynamically loaded we add the values
>>>       indicating deferred allocation.  */
>>> @@ -492,11 +536,6 @@ _dl_allocate_tls (void *mem)
>>>  rtld_hidden_def (_dl_allocate_tls)
>>>
>>>
>>> -#ifndef SHARED
>>> -extern dtv_t _dl_static_dtv[];
>>> -# define _dl_initial_dtv (&_dl_static_dtv[1])
>>> -#endif
>>> -
>>>  void
>>>  internal_function
>>>  _dl_deallocate_tls (void *tcb, bool dealloc_tcb)
>>> diff --git a/nptl/Makefile b/nptl/Makefile
>>> index 86a1b0c..ac76596 100644
>>> --- a/nptl/Makefile
>>> +++ b/nptl/Makefile
>>> @@ -254,7 +254,7 @@ tests = tst-typesizes \
>>>       tst-exec1 tst-exec2 tst-exec3 tst-exec4 \
>>>       tst-exit1 tst-exit2 tst-exit3 \
>>>       tst-stdio1 tst-stdio2 \
>>> -     tst-stack1 tst-stack2 tst-stack3 tst-pthread-getattr \
>>> +     tst-stack1 tst-stack2 tst-stack3 tst-stack4 tst-pthread-getattr \
>>
>> OK.
>>
>>>       tst-pthread-attr-affinity \
>>>       tst-unload \
>>>       tst-dlsym1 \
>>> @@ -311,7 +311,7 @@ endif
>>>
>>>  modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>>>               tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>>> -             tst-tls5modd tst-tls5mode tst-tls5modf \
>>> +             tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>>
>> OK.
>>
>>>               tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
>>>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
>>>  test-extras += $(modules-names) tst-cleanup4aux
>>> @@ -479,6 +479,19 @@ $(objpfx)tst-stack3-mem.out: $(objpfx)tst-stack3.out
>>>       $(evaluate-test)
>>>  generated += tst-stack3-mem.out tst-stack3.mtrace
>>>
>>> +$(objpfx)tst-stack4: $(libdl) $(shared-thread-library)
>>> +tst-stack4mod.sos=$(shell for i in 0 1 2 3 4 5 6 7 8 9 10 \
>>> +                                11 12 13 14 15 16 17 18 19; do \
>>> +                         for j in 0 1 2 3 4 5 6 7 8 9 10 \
>>> +                                  11 12 13 14 15 16 17 18 19; do \
>>> +                           echo $(objpfx)tst-stack4mod-$$i-$$j.so; \
>>> +                         done; done)
>>> +$(objpfx)tst-stack4.out: $(tst-stack4mod.sos)
>>> +$(tst-stack4mod.sos): $(objpfx)tst-stack4mod.so
>>> +     cp -f $< $@
>>> +clean:
>>> +     rm -f $(tst-stack4mod.sos)
>>> +
>>
>> OK.
>>
>>>  $(objpfx)tst-cleanup4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
>>>  $(objpfx)tst-cleanupx4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
>>>
>>> diff --git a/nptl/tst-stack4.c b/nptl/tst-stack4.c
>>> new file mode 100644
>>> index 0000000..0a0d2f9
>>> --- /dev/null
>>> +++ b/nptl/tst-stack4.c
>>> @@ -0,0 +1,130 @@
>>
>> Missing first line explaining test.
> 
> I will add it.

Thanks.

>> Missing license.
> 
> What is the license policy on testcases?

Same license as the project with the appropriate header.

>>> +#include <stdio.h>
>>> +#include <stdint.h>
>>> +#include <dlfcn.h>
>>> +#include <assert.h>
>>> +#include <pthread.h>
>>> +
>>> +#define DSO_SHARED_FILES 20
>>> +#define DSO_OPEN_THREADS 20
>>> +#define DSO_EXEC_THREADS 2
>>
>> Why these particular constants? If you chose them because
>> they happened to work on a particular system, please comment
>> that.
> 
> They came from:
> 
> https://sourceware.org/bugzilla/attachment.cgi?id=6290

You don't answer the question.

If you are submitting the patch you have to be ready to defend it.

If you can't defend it you must at least add a comment saying the
choices are arbitrary.

/* The choices of thread count, and file counts are arbitary.
   The point is simply to run enough threads that an exiting
   thread has it's stack reused by another thread at the same
   time as new libraries have been loaded.  */

>>> +
>>> +pthread_mutex_t g_lock;
>>> +
>>
>> Describe what the mutex protects in a comment.
> 
> I will add a few words.
> 
>>> +typedef void (*function) (void);
>>> +
>>> +void *
>>> +dso_invoke(void *dso_fun)
>>> +{
>>> +  function *fun_vec = (function *) dso_fun;
>>> +  int dso;
>>> +
>>> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
>>> +    (*fun_vec[dso]) ();
>>> +
>>> +  pthread_exit (NULL);
>>> +}
>>> +
>>> +void *
>>> +dso_process (void * p)
>>> +{
>>> +  void *handle[DSO_SHARED_FILES];
>>> +  function fun_vec[DSO_SHARED_FILES];
>>> +  char dso_path[DSO_SHARED_FILES][100];
>>> +  int dso;
>>> +  uintptr_t t = (uintptr_t) p;
>>> +
>>> +  // Open dso's and get a function.
>>
>> Use /* */.
> 
> We have been using // for comments in glibc for a while.  Why not here?

glibc follows the GNU Coding Standard which says to use /* */?

I do not use // in any of my changes to glibc.

If you want to support using // that's fine, but start a new thread,
and ask for consensus on it and I'll add it to the consenus page.

>>
>>> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
>>> +    {
>>> +      sprintf (dso_path[dso], "tst-stack4mod-%i-%i.so", t, dso);
>>> +
>>> +      pthread_mutex_lock (&g_lock);
>>> +
>>> +      handle[dso] = dlopen (dso_path[dso], RTLD_NOW);
>>> +      assert (handle[dso]);
>>> +
>>> +      fun_vec[dso] = (function) dlsym (handle[dso], "function");
>>> +      assert (fun_vec[dso]);
>>> +
>>> +      pthread_mutex_unlock (&g_lock);
>>> +    }
>>> +
>>> +  // Spawn workers
>>
>> Use /* */
>>
>>> +  pthread_t thread[DSO_EXEC_THREADS];
>>> +  int i, ret;
>>> +  uintptr_t result = 0;
>>> +  for (i = 0; i < DSO_EXEC_THREADS; i++)
>>> +    {
>>> +      pthread_mutex_lock (&g_lock);
>>> +      ret = pthread_create (&thread[i], NULL, dso_invoke, (void *) fun_vec);
>>> +      if (ret != 0)
>>> +     {
>>> +       printf ("pthread_create failed: %d\n", ret);
>>> +       result = 1;
>>> +     }
>>> +      pthread_mutex_unlock (&g_lock);
>>> +    }
>>> +
>>> +  if (!result)
>>> +    for (i = 0; i < DSO_EXEC_THREADS; i++)
>>> +      {
>>> +     ret = pthread_join (thread[i], NULL);
>>> +     if (ret != 0)
>>> +       {
>>> +         printf ("pthread_join failed: %d\n", ret);
>>> +         result = 1;
>>> +       }
>>> +      }
>>> +
>>> +  // Close all dso's
>>
>> Use /* */
>>
>>> +  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
>>> +    {
>>> +      pthread_mutex_lock (&g_lock);
>>> +      dlclose (handle[dso]);
>>> +      pthread_mutex_unlock (&g_lock);
>>> +    }
>>> +
>>> +  // Exit
>>> +  pthread_exit ((void *) result);
>>> +}
>>> +
>>> +
>>
>> This test needs a verbose, minimum one paragraph, comment explaining
>> what the test is testing.

No comment?

>>> +int
>>> +main()
>>> +{
>>> +  pthread_t thread[DSO_OPEN_THREADS];
>>> +  int i,j;
>>> +  int ret;
>>> +  int result = 0;
>>> +
>>> +  pthread_mutex_init (&g_lock, NULL);
>>> +
>>> +  for (j = 0; j < 100; j++)
>>
>> Why 100 times? Please comment.

Comment stating this is arbitrary.

>>> +    {
>>> +      for (i = 0; i < DSO_OPEN_THREADS; i++)
>>> +     {
>>> +       ret = pthread_create (&thread[i], NULL, dso_process,
>>> +                             (void *) (uintptr_t) i);
>>> +       if (ret != 0)
>>> +         {
>>> +           printf ("pthread_create failed: %d\n", ret);
>>> +           result = 1;
>>> +         }
>>> +     }
>>> +
>>> +      if (result)
>>> +     break;
>>> +
>>> +      for (i = 0; i < DSO_OPEN_THREADS; i++)
>>> +     {
>>> +       ret = pthread_join (thread[i], NULL);
>>> +       if (ret != 0)
>>> +         {
>>> +           printf ("pthread_join failed: %d\n", ret);
>>> +           result = 1;
>>> +         }
>>> +     }
>>> +    }
>>> +
>>> +  return result;
>>> +}
>>> diff --git a/nptl/tst-stack4mod.c b/nptl/tst-stack4mod.c
>>> new file mode 100644
>>> index 0000000..b8e800f
>>> --- /dev/null
>>> +++ b/nptl/tst-stack4mod.c
>>> @@ -0,0 +1,9 @@
>>
>> Please give a one line description at the top of this file.
>>
>>> +__thread int var[256] __attribute__ ((visibility ("hidden"))) = {0};
>>> +
>>> +void
>>> +function (void)
>>> +{
>>> +  int i;
>>> +  for (i = 0; i < 256; i++)
>>
>> Why 256 times? Please comment.
>>> +    var[i] = i;
>>> +}
>>>
> 
> I don't have answers for any particular values in this
> testcase.

Then add a comment saying they are arbitrary. Such that
future readers know they are not special values.

Cheers,
Carlos.
  

Patch

diff --git a/ChangeLog b/ChangeLog
index 92ffe16..e6474be 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,20 @@ 
+2014-11-27  H.J. Lu  <hongjiu.lu@intel.com>
+
+	[BZ #13862]
+	* elf/dl-tls.c (oom): Remove #ifdef SHARED/#endif.
+	(_dl_static_dtv, _dl_initial_dtv): Moved before ...
+	(_dl_allocate_tls_init): This.  Reallocate DTV if the current
+	DTV isn't big enough.
+	* nptl/Makefile (tests): Add tst-stack4.
+	(modules-names): Add tst-stack4mod.
+	($(objpfx)tst-stack4): New.
+	(tst-stack4mod.sos): Likewise.
+	($(objpfx)tst-stack4.out): Likewise.
+	($(tst-stack4mod.sos)): Likewise.
+	(clean): Likewise.
+	* nptl/tst-stack4.c: New file.
+	* nptl/tst-stack4mod.c: Likewise.
+
 2014-11-27  Stefan Liebler  <stli@linux.vnet.ibm.com>
 
 	* nscd/connections.c: Include libc-internal.h because of macro
diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 5204fda..3142b16 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -34,14 +34,12 @@ 
 
 
 /* Out-of-memory handler.  */
-#ifdef SHARED
 static void
 __attribute__ ((__noreturn__))
 oom (void)
 {
   _dl_fatal_printf ("cannot allocate memory for thread-local data: ABORT\n");
 }
-#endif
 
 
 size_t
@@ -397,6 +395,11 @@  _dl_allocate_tls_storage (void)
 }
 
 
+#ifndef SHARED
+extern dtv_t _dl_static_dtv[];
+# define _dl_initial_dtv (&_dl_static_dtv[1])
+#endif
+
 void *
 internal_function
 _dl_allocate_tls_init (void *result)
@@ -410,6 +413,47 @@  _dl_allocate_tls_init (void *result)
   size_t total = 0;
   size_t maxgen = 0;
 
+  /* Check if the current dtv is big enough.   */
+  if (dtv[-1].counter < GL(dl_tls_max_dtv_idx))
+    {
+      /* Reallocate the dtv.  */
+      dtv_t *newp;
+      size_t newsize = GL(dl_tls_max_dtv_idx) + DTV_SURPLUS;
+      size_t oldsize = dtv[-1].counter;
+
+      if (dtv == GL(dl_initial_dtv))
+	{
+	  /* This is the initial dtv that was allocated
+	     during rtld startup using the dl-minimal.c
+	     malloc instead of the real malloc.  We can't
+	     free it, we have to abandon the old storage.  */
+
+	  newp = malloc ((2 + newsize) * sizeof (dtv_t));
+	  if (newp == NULL)
+	    oom ();
+	  memcpy (newp, &dtv[-1], (2 + oldsize) * sizeof (dtv_t));
+	}
+      else
+	{
+	  newp = realloc (&dtv[-1],
+			  (2 + newsize) * sizeof (dtv_t));
+	  if (newp == NULL)
+	    oom ();
+	}
+
+      newp[0].counter = newsize;
+
+      /* Clear the newly allocated part.  */
+      memset (newp + 2 + oldsize, '\0',
+	      (newsize - oldsize) * sizeof (dtv_t));
+
+      /* Install this new dtv in the thread data structures.  */
+      INSTALL_DTV (result, newp);
+
+      /* Point dtv to the generation counter.  */
+      dtv = &newp[1];
+    }
+
   /* We have to prepare the dtv for all currently loaded modules using
      TLS.  For those which are dynamically loaded we add the values
      indicating deferred allocation.  */
@@ -492,11 +536,6 @@  _dl_allocate_tls (void *mem)
 rtld_hidden_def (_dl_allocate_tls)
 
 
-#ifndef SHARED
-extern dtv_t _dl_static_dtv[];
-# define _dl_initial_dtv (&_dl_static_dtv[1])
-#endif
-
 void
 internal_function
 _dl_deallocate_tls (void *tcb, bool dealloc_tcb)
diff --git a/nptl/Makefile b/nptl/Makefile
index 86a1b0c..ac76596 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -254,7 +254,7 @@  tests = tst-typesizes \
 	tst-exec1 tst-exec2 tst-exec3 tst-exec4 \
 	tst-exit1 tst-exit2 tst-exit3 \
 	tst-stdio1 tst-stdio2 \
-	tst-stack1 tst-stack2 tst-stack3 tst-pthread-getattr \
+	tst-stack1 tst-stack2 tst-stack3 tst-stack4 tst-pthread-getattr \
 	tst-pthread-attr-affinity \
 	tst-unload \
 	tst-dlsym1 \
@@ -311,7 +311,7 @@  endif
 
 modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
-		tst-tls5modd tst-tls5mode tst-tls5modf \
+		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) tst-cleanup4aux.o
 test-extras += $(modules-names) tst-cleanup4aux
@@ -479,6 +479,19 @@  $(objpfx)tst-stack3-mem.out: $(objpfx)tst-stack3.out
 	$(evaluate-test)
 generated += tst-stack3-mem.out tst-stack3.mtrace
 
+$(objpfx)tst-stack4: $(libdl) $(shared-thread-library)
+tst-stack4mod.sos=$(shell for i in 0 1 2 3 4 5 6 7 8 9 10 \
+				   11 12 13 14 15 16 17 18 19; do \
+			    for j in 0 1 2 3 4 5 6 7 8 9 10 \
+				     11 12 13 14 15 16 17 18 19; do \
+			      echo $(objpfx)tst-stack4mod-$$i-$$j.so; \
+			    done; done)
+$(objpfx)tst-stack4.out: $(tst-stack4mod.sos)
+$(tst-stack4mod.sos): $(objpfx)tst-stack4mod.so
+	cp -f $< $@
+clean:
+	rm -f $(tst-stack4mod.sos)
+
 $(objpfx)tst-cleanup4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
 $(objpfx)tst-cleanupx4: $(objpfx)tst-cleanup4aux.o $(shared-thread-library)
 
diff --git a/nptl/tst-stack4.c b/nptl/tst-stack4.c
new file mode 100644
index 0000000..0a0d2f9
--- /dev/null
+++ b/nptl/tst-stack4.c
@@ -0,0 +1,130 @@ 
+#include <stdio.h>
+#include <stdint.h>
+#include <dlfcn.h>
+#include <assert.h>
+#include <pthread.h>
+
+#define DSO_SHARED_FILES 20
+#define DSO_OPEN_THREADS 20
+#define DSO_EXEC_THREADS 2
+
+pthread_mutex_t g_lock;
+
+typedef void (*function) (void);
+
+void *
+dso_invoke(void *dso_fun)
+{
+  function *fun_vec = (function *) dso_fun;
+  int dso;
+
+  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
+    (*fun_vec[dso]) ();
+
+  pthread_exit (NULL);
+}
+
+void *
+dso_process (void * p)
+{
+  void *handle[DSO_SHARED_FILES];
+  function fun_vec[DSO_SHARED_FILES];
+  char dso_path[DSO_SHARED_FILES][100];
+  int dso;
+  uintptr_t t = (uintptr_t) p;
+
+  // Open dso's and get a function.
+  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
+    {
+      sprintf (dso_path[dso], "tst-stack4mod-%i-%i.so", t, dso);
+
+      pthread_mutex_lock (&g_lock);
+
+      handle[dso] = dlopen (dso_path[dso], RTLD_NOW);
+      assert (handle[dso]);
+
+      fun_vec[dso] = (function) dlsym (handle[dso], "function");
+      assert (fun_vec[dso]);
+
+      pthread_mutex_unlock (&g_lock);
+    }
+
+  // Spawn workers
+  pthread_t thread[DSO_EXEC_THREADS];
+  int i, ret;
+  uintptr_t result = 0;
+  for (i = 0; i < DSO_EXEC_THREADS; i++)
+    {
+      pthread_mutex_lock (&g_lock);
+      ret = pthread_create (&thread[i], NULL, dso_invoke, (void *) fun_vec);
+      if (ret != 0)
+	{
+	  printf ("pthread_create failed: %d\n", ret);
+	  result = 1;
+	}
+      pthread_mutex_unlock (&g_lock);
+    }
+
+  if (!result) 
+    for (i = 0; i < DSO_EXEC_THREADS; i++)
+      {
+	ret = pthread_join (thread[i], NULL);
+	if (ret != 0)
+	  {
+	    printf ("pthread_join failed: %d\n", ret);
+	    result = 1;
+	  }
+      }
+
+  // Close all dso's
+  for (dso = 0; dso < DSO_SHARED_FILES; dso++)
+    {
+      pthread_mutex_lock (&g_lock);
+      dlclose (handle[dso]);
+      pthread_mutex_unlock (&g_lock);
+    }
+
+  // Exit
+  pthread_exit ((void *) result);
+}
+
+
+int
+main()
+{
+  pthread_t thread[DSO_OPEN_THREADS];
+  int i,j;
+  int ret;
+  int result = 0;
+
+  pthread_mutex_init (&g_lock, NULL);
+
+  for (j = 0; j < 100; j++)
+    {
+      for (i = 0; i < DSO_OPEN_THREADS; i++)
+	{
+	  ret = pthread_create (&thread[i], NULL, dso_process,
+				(void *) (uintptr_t) i);
+	  if (ret != 0)
+	    {
+	      printf ("pthread_create failed: %d\n", ret);
+	      result = 1;
+	    }
+	}
+
+      if (result)
+	break;
+
+      for (i = 0; i < DSO_OPEN_THREADS; i++)
+	{
+	  ret = pthread_join (thread[i], NULL);
+	  if (ret != 0)
+	    {
+	      printf ("pthread_join failed: %d\n", ret);
+	      result = 1;
+	    }
+	}
+    }
+
+  return result;
+}
diff --git a/nptl/tst-stack4mod.c b/nptl/tst-stack4mod.c
new file mode 100644
index 0000000..b8e800f
--- /dev/null
+++ b/nptl/tst-stack4mod.c
@@ -0,0 +1,9 @@ 
+__thread int var[256] __attribute__ ((visibility ("hidden"))) = {0};
+
+void
+function (void)
+{
+  int i;
+  for (i = 0; i < 256; i++)
+    var[i] = i;
+}