[PATCHv3] Protect _dl_profile_fixup data-dependency order [BZ #23690]

Message ID 20181011025754.23862-1-tuliom@linux.ibm.com
State Superseded
Delegated to: Carlos O'Donell
Headers

Commit Message

Tulio Magno Quites Machado Filho Oct. 11, 2018, 2:57 a.m. UTC
  Florian Weimer <fw@deneb.enyo.de> writes:

> * Tulio Magno Quites Machado Filho:
>
>> I suspect this patch doesn't address all the comments from v1.
>> However, I believe some of the open questions/comments may not be
>> necessary anymore after the latest changes.
>>
>> I've decided to not add the new test to xtests, because it executes in
>> less than 3s in most of my tests.  There is just a single case that
>> takes up to 30s.
>>
>> Changes since v1:
>>
>>  - Fixed the coding style issues.
>>  - Replaced atomic loads/store with memory fences.
>>  - Added a test.
>
> I don't think the fences are correct, they still need to be combined
> with relaxed MO loads and stores.
>
> Does the issue that Carlos mentioned really show up in cross-builds?

Yes, it does fail on hppa and ia64.
But v3 (using thread fences) pass on build-many-glibcs.

Changes since v2:

 - Fixed coding style in nptl/tst-audit-threads-mod1.c.
 - Replaced pthreads.h functions with respective support/xthread.h ones.
 - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
 - Removed bzero().
 - Reduced the amount of functions to 7k in order to fit the relocation
   limit  of some architectures, e.g. m68k, mips.
 - Fixed issues in nptl/Makefile.

Changes since v1:

 - Fixed the coding style issues.
 - Replaced atomic loads/store with memory fences.
 - Added a test.

---- 8< ----

The field reloc_result->addr is used to indicate if the rest of the
fields of reloc_result have already been written, creating a
data-dependency order.
Reading reloc_result->addr to the variable value requires to complete
before reading the rest of the fields of reloc_result.
Likewise, the writes to the other fields of the reloc_result must
complete before reloc_result-addr is updated.

Tested with build-many-glibcs.

2018-10-10  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>

	[BZ #23690]
	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
	modification order when accessing reloc_result->addr.
	* nptl/Makefile (tests): Add tst-audit-threads.
	(modules-names): Add tst-audit-threads-mod1 and
	tst-audit-threads-mod2.
	Add rules to build tst-audit-threads.
	* nptl/tst-audit-threads-mod1.c: New file.
	* nptl/tst-audit-threads-mod2.c: Likewise.
	* nptl/tst-audit-threads.c: Likewise.
	* nptl/tst-audit-threads.h: Likewise.

Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
---
 elf/dl-runtime.c              | 19 ++++++++-
 nptl/Makefile                 | 10 ++++-
 nptl/tst-audit-threads-mod1.c | 38 ++++++++++++++++++
 nptl/tst-audit-threads-mod2.c | 22 +++++++++++
 nptl/tst-audit-threads.c      | 91 +++++++++++++++++++++++++++++++++++++++++++
 nptl/tst-audit-threads.h      | 84 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 260 insertions(+), 4 deletions(-)
 create mode 100644 nptl/tst-audit-threads-mod1.c
 create mode 100644 nptl/tst-audit-threads-mod2.c
 create mode 100644 nptl/tst-audit-threads.c
 create mode 100644 nptl/tst-audit-threads.h
  

Comments

Carlos O'Donell Oct. 12, 2018, 1:03 a.m. UTC | #1
On 10/10/18 10:57 PM, Tulio Magno Quites Machado Filho wrote:
> Florian Weimer <fw@deneb.enyo.de> writes:
> 
>> * Tulio Magno Quites Machado Filho:
>>
>>> I suspect this patch doesn't address all the comments from v1.
>>> However, I believe some of the open questions/comments may not be
>>> necessary anymore after the latest changes.
>>>
>>> I've decided to not add the new test to xtests, because it executes in
>>> less than 3s in most of my tests.  There is just a single case that
>>> takes up to 30s.
>>>
>>> Changes since v1:
>>>
>>>  - Fixed the coding style issues.
>>>  - Replaced atomic loads/store with memory fences.
>>>  - Added a test.
>>
>> I don't think the fences are correct, they still need to be combined
>> with relaxed MO loads and stores.
>>
>> Does the issue that Carlos mentioned really show up in cross-builds?
> 
> Yes, it does fail on hppa and ia64.
> But v3 (using thread fences) pass on build-many-glibcs.

We will need a v4. Please review (1), (2) and (3) carefully, feel free to
ignore (4).

(1) I added a bunch of comments.

Comments added inline.

(2) -Wl,-z,now worries.

Added some things for you to check.

(3) Fence-to-fence sync.

For fence-to-fence synchronization to work we need an acquire and release
fence, and we have that.

We are missing the atomic read and write of the guard. Please review below.
Florian mentioned this in his review. He is correct.

And all the problems are back again because you can't do atomic loads of
the large guards because they are actually the function descriptor structures.
However, this is just laziness, we used the addr because it was convenient.
It is no longer convenient. Just add a 'init' field to reloc_result and use
that as the guard to synchronize the threads against for initialization of
the results. This should solve the reloc_result problem (ignorning the issues
hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).

(4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.	

I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
see if there was any other case of this problem, particularly where there
might be a case where a write happens on one thread that might not be
seen in another.

I also looked at _dl_relocate_object and the initialization of all 
l_reloc_result via calloc, and that is also covered because the
atomic_thread_fence_acquire ensures any secondary thread sees the
initialization.

So just _dl_fixup for hppa and ia64 (the case not related to this issue)
still have potential ordering issues if the compiler writes ip before gp.

Nothing for you to worry about.

> Changes since v2:
> 
>  - Fixed coding style in nptl/tst-audit-threads-mod1.c.
>  - Replaced pthreads.h functions with respective support/xthread.h ones.
>  - Replaced malloc() with xcalloc() in nptl/tst-audit-threads.c.
>  - Removed bzero().
>  - Reduced the amount of functions to 7k in order to fit the relocation
>    limit  of some architectures, e.g. m68k, mips.
>  - Fixed issues in nptl/Makefile.
> 
> Changes since v1:
> 
>  - Fixed the coding style issues.
>  - Replaced atomic loads/store with memory fences.
>  - Added a test.
> 
> ---- 8< ----
> 
> The field reloc_result->addr is used to indicate if the rest of the
> fields of reloc_result have already been written, creating a
> data-dependency order.
> Reading reloc_result->addr to the variable value requires to complete
> before reading the rest of the fields of reloc_result.
> Likewise, the writes to the other fields of the reloc_result must
> complete before reloc_result-addr is updated.
> 
> Tested with build-many-glibcs.
> 
> 2018-10-10  Tulio Magno Quites Machado Filho  <tuliom@linux.ibm.com>
> 
> 	[BZ #23690]
> 	* elf/dl-runtime.c (_dl_profile_fixup): Guarantee memory
> 	modification order when accessing reloc_result->addr.
> 	* nptl/Makefile (tests): Add tst-audit-threads.
> 	(modules-names): Add tst-audit-threads-mod1 and
> 	tst-audit-threads-mod2.
> 	Add rules to build tst-audit-threads.
> 	* nptl/tst-audit-threads-mod1.c: New file.
> 	* nptl/tst-audit-threads-mod2.c: Likewise.
> 	* nptl/tst-audit-threads.c: Likewise.
> 	* nptl/tst-audit-threads.h: Likewise.
> 
> Signed-off-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>

Please send v4.

> ---
>  elf/dl-runtime.c              | 19 ++++++++-
>  nptl/Makefile                 | 10 ++++-
>  nptl/tst-audit-threads-mod1.c | 38 ++++++++++++++++++
>  nptl/tst-audit-threads-mod2.c | 22 +++++++++++
>  nptl/tst-audit-threads.c      | 91 +++++++++++++++++++++++++++++++++++++++++++
>  nptl/tst-audit-threads.h      | 84 +++++++++++++++++++++++++++++++++++++++
>  6 files changed, 260 insertions(+), 4 deletions(-)
>  create mode 100644 nptl/tst-audit-threads-mod1.c
>  create mode 100644 nptl/tst-audit-threads-mod2.c
>  create mode 100644 nptl/tst-audit-threads.c
>  create mode 100644 nptl/tst-audit-threads.h
> 
> diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
> index 63bbc89776..c1ba372bd7 100644
> --- a/elf/dl-runtime.c
> +++ b/elf/dl-runtime.c
> @@ -183,9 +183,18 @@ _dl_profile_fixup (
>    /* This is the address in the array where we store the result of previous
>       relocations.  */
>    struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
> -  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
>  
> +  /* CONCURRENCY NOTES:
> +

Suggest adding:

Multiple threads may be calling the same PLT sequence and with LD_AUDIT enabled
they will be calling into _dl_profile_fixup to update the reloc_result with the
result of the lazy resolution. The reloc_result guard variable is addr, and we
use relaxed MO loads and store to it along with an atomic_thread_acquire and
atomic_thread_release fence to ensure that the results of the structure are
consistent with the loaded value of the guard.

> +     The following code uses DL_FIXUP_VALUE_CODE_ADDR to access a potential
> +     member of reloc_result->addr to indicate if it is the first time this
> +     object is being relocated.
> +     Reading/Writing from/to reloc_result->addr must not happen before previous
> +     writes to reloc_result complete as they could end-up with an incomplete
> +     struct.  */

OK.

> +  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;

OK.

>    DL_FIXUP_VALUE_TYPE value = *resultp;

Not OK. This is a guard. You read it here, and write to it below.
That's a data race. Both need to be atomic accesses with any MO you want.
On hppa this will require a new enough compile to get a 64-bit atomic load.
On ia64 I don't know if there is a usable 128-bit atomic.

The key problem here is that addr is being overloaded as a guard here because
it was convenient. It's non-zero when the symbol is initialized, otherwhise it's
zero when it's not. However, for arches with function descriptors you've found
out that using it is causing problems because it's too big for traditional atomic
operations.

What you really need is a new "init" field in reloc_result, make it a word,
and then use word-sized atomics on that with relaxed MO, and keep the fences.

> +  atomic_thread_fence_acquire ();

OK, this acquire ensures all previous writes on threads are visible.

>    if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)

OK, either this is zero, and we redo the initialization, or it's not
and we see all the results of the previous writes because of the
atomic_thread_fence_acquire.

>      {
>        /* This is the first time we have to relocate this object.  */
> @@ -346,7 +355,13 @@ _dl_profile_fixup (
>  
>        /* Store the result for later runs.  */
>        if (__glibc_likely (! GLRO(dl_bind_not)))
> -	*resultp = value;

OK.

> +	{
> +	  /* Guarantee all previous writes complete before
> +	     resultp (aka. reloc_result->addr) is updated.  See CONCURRENCY
> +	     NOTES earlier  */
> +	  atomic_thread_fence_release ();

OK, this ensures that any write done by the auditors, if any, are seen by
subsequent threads attempting a resolution of the same function, and this
sequences-before all the writes with the earlier acquire.

> +	  *resultp = value;

Not OK, see above, this needs to be an atomic relaxed-MO store to 'init'
or something smaller than value.

You need a guard small enough that arches will have an atomic load/store
to the size.

> +	}
>      }
>  
>    /* By default we do not call the pltexit function.  */
> diff --git a/nptl/Makefile b/nptl/Makefile
> index be8066524c..48aba579c0 100644
> --- a/nptl/Makefile
> +++ b/nptl/Makefile
> @@ -382,7 +382,8 @@ tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
>  	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
>  	 tst-oncex3 tst-oncex4
>  ifeq ($(build-shared),yes)
> -tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
> +tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
> +	 tst-audit-threads

OK.

>  tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
>  tests-nolibpthread += tst-fini1
>  ifeq ($(have-z-execstack),yes)
> @@ -394,7 +395,8 @@ modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
>  		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
>  		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
>  		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
> -		tst-join7mod tst-compat-forwarder-mod
> +		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
> +		tst-audit-threads-mod2

OK.

>  extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
>  		   tst-cleanup4aux.o tst-cleanupx4aux.o
>  test-extras += tst-cleanup4aux tst-cleanupx4aux
> @@ -709,6 +711,10 @@ endif
>  
>  $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
>  
> +$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
> +$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
> +tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so

Do we need to add -Wl,-z,lazy?

Users might have -Wl,-z,now as the default for their build?

With BIND_NOW the test doesn't test what we want.

> +
>  # The tests here better do not run in parallel
>  ifneq ($(filter %tests,$(MAKECMDGOALS)),)
>  .NOTPARALLEL:
> diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
> new file mode 100644
> index 0000000000..194c65a6bb
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod1.c
> @@ -0,0 +1,38 @@
> +/* Dummy audit library for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <elf.h>
> +#include <link.h>
> +#include <stdio.h>
> +#include <assert.h>
> +#include <string.h>
> +

Suggest:

/* We must use a dummy LD_AUDIT module to force the dynamic loader to
   *not* update the real PLT, and instead use a cached value for the
   lazy resolution result. It is the update of that cached value that
   we are testing for correctness by doing this.  */

> +volatile int count = 0;
> +
> +unsigned int
> +la_version (unsigned int ver)
> +{
> +  return 1;
> +}
> +
> +unsigned int
> +la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
> +{
> +  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
> +}

I'm worried binutils will optimize away the PLT entries and this test will
pass without failing but the lazy resolution will not be tested.

Can we just *count* the number of PLT resolutions and see if they match?

Counting the PLT resolutions and using -Wl,-z,lazy (above) will mean we have
done our best to test what we intended to test.

> diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
> new file mode 100644
> index 0000000000..6ceedb0196
> --- /dev/null
> +++ b/nptl/tst-audit-threads-mod2.c
> @@ -0,0 +1,22 @@
> +/* Shared object with a huge number of functions for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +

Suggest:

/* Define all the retNumN functions in a library.  */

Just to be clear that this must be distinct from the executable.

> +/* Define all the retNumN functions.  */
> +#define definenum
> +#include "tst-audit-threads.h"
> diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
> new file mode 100644
> index 0000000000..0c81edc762
> --- /dev/null
> +++ b/nptl/tst-audit-threads.c
> @@ -0,0 +1,91 @@
> +/* Test multi-threading using LD_AUDIT.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +

Suggest:

/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
   library with a huge number of functions in order to validate lazy symbol
   binding with an audit library.  We use one thread per CPU to test that
   concurrent lazy resolution does not have any defects which would cause
   the process to fail.  We use an LD_AUDIT library to force the testing of
   the relocation resolution caching code in the dynamic loader i.e. 
   _dl_runtime_profile and _dl_profile_fixup.  */

> +/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
> +   library with a huge number of functions in order to validate lazy symbol
> +   binding with an audit library.  */
> +
> +#include <support/xthread.h>
> +#include <strings.h>
> +#include <stdlib.h>
> +#include <sys/sysinfo.h>
> +
> +static int do_test (void);
> +
> +/* This test usually takes less than 3s to run.  However, there are cases that
> +   take up to 30s.  */
> +#define TIMEOUT 60
> +#define TEST_FUNCTION do_test ()
> +#include "../test-skeleton.c"
> +

Suggest:

/* Declare the functions we are going to call.  */

> +#define externnum
> +#include "tst-audit-threads.h"
> +#undef externnum
> +
> +int num_threads;
> +pthread_barrier_t barrier;
> +
> +void
> +sync_all (int num)
> +{
> +  pthread_barrier_wait (&barrier);
> +}
> +
> +void
> +call_all_ret_nums (void)
> +{

Suggest:

/* Call each function one at a time from all threads.  */

> +#define callnum
> +#include "tst-audit-threads.h"
> +#undef callnum
> +}
> +
> +void *
> +thread_main (void *unused)
> +{
> +  call_all_ret_nums ();
> +  return NULL;
> +}
> +
> +#define STR2(X) #X
> +#define STR(X) STR2(X)
> +
> +static int
> +do_test (void)
> +{
> +  int i;
> +  pthread_t *threads;
> +
> +  num_threads = get_nprocs ();
> +  if (num_threads <= 1)
> +    num_threads = 2;

OK.

> +
> +  /* Used to synchronize all the threads after calling each retNumN.  */
> +  xpthread_barrier_init (&barrier, NULL, num_threads);

OK.

> +
> +  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
> +  for (i = 0; i < num_threads; i++)
> +    threads[i] = xpthread_create(NULL, thread_main, NULL);
> +
> +  for (i = 0; i < num_threads; i++)
> +    xpthread_join(threads[i]);
> +
> +  free (threads);
> +
> +  return 0;

OK.

> +}
> diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
> new file mode 100644
> index 0000000000..cb17645f4b
> --- /dev/null
> +++ b/nptl/tst-audit-threads.h
> @@ -0,0 +1,84 @@
> +/* Helper header for test-audit-threads.
> +
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +

Suggest adding:

/* We use this helper to create a large number of functions, all of
   which will be resolved lazily and thus have their PLT updated.
   This is done to provide enough functions that we can statistically
   observe a thread vs. PLT resolution failure if one exists.  */

> +#define CONCAT(a, b) a ## b
> +#define NUM(x, y) CONCAT (x, y)
> +
> +#define FUNC10(x)	\
> +  FUNC (NUM (x, 0));	\
> +  FUNC (NUM (x, 1));	\
> +  FUNC (NUM (x, 2));	\
> +  FUNC (NUM (x, 3));	\
> +  FUNC (NUM (x, 4));	\
> +  FUNC (NUM (x, 5));	\
> +  FUNC (NUM (x, 6));	\
> +  FUNC (NUM (x, 7));	\
> +  FUNC (NUM (x, 8));	\
> +  FUNC (NUM (x, 9))
> +
> +#define FUNC100(x)	\
> +  FUNC10 (NUM (x, 0));	\
> +  FUNC10 (NUM (x, 1));	\
> +  FUNC10 (NUM (x, 2));	\
> +  FUNC10 (NUM (x, 3));	\
> +  FUNC10 (NUM (x, 4));	\
> +  FUNC10 (NUM (x, 5));	\
> +  FUNC10 (NUM (x, 6));	\
> +  FUNC10 (NUM (x, 7));	\
> +  FUNC10 (NUM (x, 8));	\
> +  FUNC10 (NUM (x, 9))
> +
> +#define FUNC1000(x)		\
> +  FUNC100 (NUM (x, 0));		\
> +  FUNC100 (NUM (x, 1));		\
> +  FUNC100 (NUM (x, 2));		\
> +  FUNC100 (NUM (x, 3));		\
> +  FUNC100 (NUM (x, 4));		\
> +  FUNC100 (NUM (x, 5));		\
> +  FUNC100 (NUM (x, 6));		\
> +  FUNC100 (NUM (x, 7));		\
> +  FUNC100 (NUM (x, 8));		\
> +  FUNC100 (NUM (x, 9))
> +
> +#define FUNC7000()	\
> +  FUNC1000 (1);		\
> +  FUNC1000 (2);		\
> +  FUNC1000 (3);		\
> +  FUNC1000 (4);		\
> +  FUNC1000 (5);		\
> +  FUNC1000 (6);		\
> +  FUNC1000 (7);
> +
> +#ifdef FUNC
> +# undef FUNC
> +#endif
> +
> +#ifdef externnum
> +# define FUNC(x) extern int CONCAT (retNum, x) (void)
> +#endif

OK.

> +
> +#ifdef definenum
> +# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
> +#endif

OK.

> +
> +#ifdef callnum
> +# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
> +#endif

OK.

> +
> +FUNC7000 ();
> 

OK, 7000 functions to test, all of which need resolution.
  
Florian Weimer Oct. 15, 2018, 12:57 p.m. UTC | #2
* Carlos O'Donell:

> (3) Fence-to-fence sync.
>
> For fence-to-fence synchronization to work we need an acquire and release
> fence, and we have that.
>
> We are missing the atomic read and write of the guard. Please review below.
> Florian mentioned this in his review. He is correct.
>
> And all the problems are back again because you can't do atomic loads of
> the large guards because they are actually the function descriptor structures.
> However, this is just laziness, we used the addr because it was convenient.
> It is no longer convenient. Just add a 'init' field to reloc_result and use
> that as the guard to synchronize the threads against for initialization of
> the results. This should solve the reloc_result problem (ignorning the issues
> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).

I think due to various external factors, we should go with the
fence-based solution for now, and change it later to something which
uses an acquire/release on the code address later, using proper atomics.

I don't want to see this bug fix blocked by ia64 and hppa.  The proper
fix needs some reshuffling of the macros here, or maybe use an unused
bit in the flags field as an indicator for initialization.

> (4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.	
> 
> I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
> see if there was any other case of this problem, particularly where there
> might be a case where a write happens on one thread that might not be
> seen in another.
> 
> I also looked at _dl_relocate_object and the initialization of all 
> l_reloc_result via calloc, and that is also covered because the
> atomic_thread_fence_acquire ensures any secondary thread sees the
> initialization.

I don't think the analysis is correct.  It's up to the application to
ensure that the dlopen (or at least the call to an ELF constructor in
the new DSO) happens before a call to any function in the DSO, and this
is why there is no need to synchronize the calloc with the profiling
code.

Thanks,
Florian
  
Carlos O'Donell Oct. 15, 2018, 1:53 p.m. UTC | #3
On 10/15/18 8:57 AM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> (3) Fence-to-fence sync.
>>
>> For fence-to-fence synchronization to work we need an acquire and release
>> fence, and we have that.
>>
>> We are missing the atomic read and write of the guard. Please review below.
>> Florian mentioned this in his review. He is correct.
>>
>> And all the problems are back again because you can't do atomic loads of
>> the large guards because they are actually the function descriptor structures.
>> However, this is just laziness, we used the addr because it was convenient.
>> It is no longer convenient. Just add a 'init' field to reloc_result and use
>> that as the guard to synchronize the threads against for initialization of
>> the results. This should solve the reloc_result problem (ignorning the issues
>> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).
> 
> I think due to various external factors, we should go with the
> fence-based solution for now, and change it later to something which
> uses an acquire/release on the code address later, using proper atomics.

Let me clarify.

The fence fix as proposed in v3 is wrong for all architectures.

We are emulating C/C++ 11 atomics within glibc, and a fence-to-fence sync
*requires* an atomic load / store of the guard, you can't use a non-atomic
access. The point of the atomic load/store is to ensure you don't have a
data race.

> I don't want to see this bug fix blocked by ia64 and hppa.  The proper
> fix needs some reshuffling of the macros here, or maybe use an unused
> bit in the flags field as an indicator for initialization.

The fix for this is straight forward.

Add a new initializer field to the reloc_result, it's an internal data
structure. It can be as big as we want and we can optimize it later.

You don't need to do any big cleanups, but we *do* have to get the
synchronization correct.

>> (4) Review of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE.	
>>
>> I reviewed the uses of elf_machine_fixup_plt, and DL_FIXUP_MAKE_VALUE to
>> see if there was any other case of this problem, particularly where there
>> might be a case where a write happens on one thread that might not be
>> seen in another.
>>
>> I also looked at _dl_relocate_object and the initialization of all 
>> l_reloc_result via calloc, and that is also covered because the
>> atomic_thread_fence_acquire ensures any secondary thread sees the
>> initialization.
> 
> I don't think the analysis is correct.  It's up to the application to
> ensure that the dlopen (or at least the call to an ELF constructor in
> the new DSO) happens before a call to any function in the DSO, and this
> is why there is no need to synchronize the calloc with the profiling
> code.

I agree, you would need some inter-thread synchronization to ensure all
other threads new the dlopen was complete, and that would ensure that
the writes would be seen.
  
Florian Weimer Oct. 17, 2018, 8:12 p.m. UTC | #4
* Carlos O'Donell:

> On 10/15/18 8:57 AM, Florian Weimer wrote:
>> * Carlos O'Donell:
>> 
>>> (3) Fence-to-fence sync.
>>>
>>> For fence-to-fence synchronization to work we need an acquire and release
>>> fence, and we have that.
>>>
>>> We are missing the atomic read and write of the guard. Please review below.
>>> Florian mentioned this in his review. He is correct.
>>>
>>> And all the problems are back again because you can't do atomic loads of
>>> the large guards because they are actually the function descriptor structures.
>>> However, this is just laziness, we used the addr because it was convenient.
>>> It is no longer convenient. Just add a 'init' field to reloc_result and use
>>> that as the guard to synchronize the threads against for initialization of
>>> the results. This should solve the reloc_result problem (ignorning the issues
>>> hppa and ia64 have with the fdesc updates across multiple threads in _dl_fixup).
>> 
>> I think due to various external factors, we should go with the
>> fence-based solution for now, and change it later to something which
>> uses an acquire/release on the code address later, using proper atomics.
>
> Let me clarify.
>
> The fence fix as proposed in v3 is wrong for all architectures.
>
> We are emulating C/C++ 11 atomics within glibc, and a fence-to-fence sync
> *requires* an atomic load / store of the guard, you can't use a non-atomic
> access. The point of the atomic load/store is to ensure you don't have a
> data race.

Carlos, I'm sorry, but I think your position is logically inconsistent.

Formally, you cannot follow the memory model here without a substantial
rewrite of the code, breaking up the struct fdesc abstraction.  The
reason is that without blocking synchronization, you still end up with
two non-atomic writes to the same object, which is a data race, and
undefined, even if both threads write the same value.

As far as I can see, POWER is !USE_ATOMIC_COMPILER_BUILTINS, so our
relaxed MO store is just a regular store, without a compiler barrier.
That means after all that rewriting, we basically end up with the same
code and the same formal data race that we would have when we just used
fences.

This is different for USE_ATOMIC_COMPILER_BUILTINS architectures, where
we do use actual atomic stores.  But for !USE_ATOMIC_COMPILER_BUILTINS,
the fence-based approach is as good as we can get, with or without
breaking the abstractions.

So as I said, given the constraints we are working under, we should go
with the solution based on fences, and have that tested on Aarch64 as
well.

>> I don't want to see this bug fix blocked by ia64 and hppa.  The proper
>> fix needs some reshuffling of the macros here, or maybe use an unused
>> bit in the flags field as an indicator for initialization.
>
> The fix for this is straight forward.
>
> Add a new initializer field to the reloc_result, it's an internal data
> structure. It can be as big as we want and we can optimize it later.
>
> You don't need to do any big cleanups, but we *do* have to get the
> synchronization correct.

See above; I don't think we can get the synchronization formally
correct, even with any level of cleanups.  In the data race case, we
would have

  atomic acquire MO load of initializer field
  non-atomic writes to various struct fields
  atomic release MO store to initializer field

in each thread.  That's still undefined behavior due to the blocking
stores in the middle.

Let me reiterate: Just because you say our atomics are C11, it doesn't
make them so.  They are syntactically different, and they are not
presented to the compiler as atomics for !USE_ATOMIC_COMPILER_BUILTINS.
I know that you and Torvald didn't consider this a problem in the past,
but maybe you can reconsider your position?

Thanks,
Florian
  

Patch

diff --git a/elf/dl-runtime.c b/elf/dl-runtime.c
index 63bbc89776..c1ba372bd7 100644
--- a/elf/dl-runtime.c
+++ b/elf/dl-runtime.c
@@ -183,9 +183,18 @@  _dl_profile_fixup (
   /* This is the address in the array where we store the result of previous
      relocations.  */
   struct reloc_result *reloc_result = &l->l_reloc_result[reloc_index];
-  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
 
+  /* CONCURRENCY NOTES:
+
+     The following code uses DL_FIXUP_VALUE_CODE_ADDR to access a potential
+     member of reloc_result->addr to indicate if it is the first time this
+     object is being relocated.
+     Reading/Writing from/to reloc_result->addr must not happen before previous
+     writes to reloc_result complete as they could end-up with an incomplete
+     struct.  */
+  DL_FIXUP_VALUE_TYPE *resultp = &reloc_result->addr;
   DL_FIXUP_VALUE_TYPE value = *resultp;
+  atomic_thread_fence_acquire ();
   if (DL_FIXUP_VALUE_CODE_ADDR (value) == 0)
     {
       /* This is the first time we have to relocate this object.  */
@@ -346,7 +355,13 @@  _dl_profile_fixup (
 
       /* Store the result for later runs.  */
       if (__glibc_likely (! GLRO(dl_bind_not)))
-	*resultp = value;
+	{
+	  /* Guarantee all previous writes complete before
+	     resultp (aka. reloc_result->addr) is updated.  See CONCURRENCY
+	     NOTES earlier  */
+	  atomic_thread_fence_release ();
+	  *resultp = value;
+	}
     }
 
   /* By default we do not call the pltexit function.  */
diff --git a/nptl/Makefile b/nptl/Makefile
index be8066524c..48aba579c0 100644
--- a/nptl/Makefile
+++ b/nptl/Makefile
@@ -382,7 +382,8 @@  tests += tst-cancelx2 tst-cancelx3 tst-cancelx4 tst-cancelx5 \
 	 tst-cleanupx0 tst-cleanupx1 tst-cleanupx2 tst-cleanupx3 tst-cleanupx4 \
 	 tst-oncex3 tst-oncex4
 ifeq ($(build-shared),yes)
-tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder
+tests += tst-atfork2 tst-tls4 tst-_res1 tst-fini1 tst-compat-forwarder \
+	 tst-audit-threads
 tests-internal += tst-tls3 tst-tls3-malloc tst-tls5 tst-stackguard1
 tests-nolibpthread += tst-fini1
 ifeq ($(have-z-execstack),yes)
@@ -394,7 +395,8 @@  modules-names = tst-atfork2mod tst-tls3mod tst-tls4moda tst-tls4modb \
 		tst-tls5mod tst-tls5moda tst-tls5modb tst-tls5modc \
 		tst-tls5modd tst-tls5mode tst-tls5modf tst-stack4mod \
 		tst-_res1mod1 tst-_res1mod2 tst-execstack-mod tst-fini1mod \
-		tst-join7mod tst-compat-forwarder-mod
+		tst-join7mod tst-compat-forwarder-mod tst-audit-threads-mod1 \
+		tst-audit-threads-mod2
 extra-test-objs += $(addsuffix .os,$(strip $(modules-names))) \
 		   tst-cleanup4aux.o tst-cleanupx4aux.o
 test-extras += tst-cleanup4aux tst-cleanupx4aux
@@ -709,6 +711,10 @@  endif
 
 $(objpfx)tst-compat-forwarder: $(objpfx)tst-compat-forwarder-mod.so
 
+$(objpfx)tst-audit-threads: $(objpfx)tst-audit-threads-mod2.so
+$(objpfx)tst-audit-threads.out: $(objpfx)tst-audit-threads-mod1.so
+tst-audit-threads-ENV = LD_AUDIT=$(objpfx)tst-audit-threads-mod1.so
+
 # The tests here better do not run in parallel
 ifneq ($(filter %tests,$(MAKECMDGOALS)),)
 .NOTPARALLEL:
diff --git a/nptl/tst-audit-threads-mod1.c b/nptl/tst-audit-threads-mod1.c
new file mode 100644
index 0000000000..194c65a6bb
--- /dev/null
+++ b/nptl/tst-audit-threads-mod1.c
@@ -0,0 +1,38 @@ 
+/* Dummy audit library for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <elf.h>
+#include <link.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+
+volatile int count = 0;
+
+unsigned int
+la_version (unsigned int ver)
+{
+  return 1;
+}
+
+unsigned int
+la_objopen (struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
+{
+  return LA_FLG_BINDTO | LA_FLG_BINDFROM;
+}
diff --git a/nptl/tst-audit-threads-mod2.c b/nptl/tst-audit-threads-mod2.c
new file mode 100644
index 0000000000..6ceedb0196
--- /dev/null
+++ b/nptl/tst-audit-threads-mod2.c
@@ -0,0 +1,22 @@ 
+/* Shared object with a huge number of functions for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* Define all the retNumN functions.  */
+#define definenum
+#include "tst-audit-threads.h"
diff --git a/nptl/tst-audit-threads.c b/nptl/tst-audit-threads.c
new file mode 100644
index 0000000000..0c81edc762
--- /dev/null
+++ b/nptl/tst-audit-threads.c
@@ -0,0 +1,91 @@ 
+/* Test multi-threading using LD_AUDIT.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This test uses a dummy LD_AUDIT library (test-audit-threads-mod1) and a
+   library with a huge number of functions in order to validate lazy symbol
+   binding with an audit library.  */
+
+#include <support/xthread.h>
+#include <strings.h>
+#include <stdlib.h>
+#include <sys/sysinfo.h>
+
+static int do_test (void);
+
+/* This test usually takes less than 3s to run.  However, there are cases that
+   take up to 30s.  */
+#define TIMEOUT 60
+#define TEST_FUNCTION do_test ()
+#include "../test-skeleton.c"
+
+#define externnum
+#include "tst-audit-threads.h"
+#undef externnum
+
+int num_threads;
+pthread_barrier_t barrier;
+
+void
+sync_all (int num)
+{
+  pthread_barrier_wait (&barrier);
+}
+
+void
+call_all_ret_nums (void)
+{
+#define callnum
+#include "tst-audit-threads.h"
+#undef callnum
+}
+
+void *
+thread_main (void *unused)
+{
+  call_all_ret_nums ();
+  return NULL;
+}
+
+#define STR2(X) #X
+#define STR(X) STR2(X)
+
+static int
+do_test (void)
+{
+  int i;
+  pthread_t *threads;
+
+  num_threads = get_nprocs ();
+  if (num_threads <= 1)
+    num_threads = 2;
+
+  /* Used to synchronize all the threads after calling each retNumN.  */
+  xpthread_barrier_init (&barrier, NULL, num_threads);
+
+  threads = (pthread_t *) xcalloc (num_threads, sizeof(pthread_t));
+  for (i = 0; i < num_threads; i++)
+    threads[i] = xpthread_create(NULL, thread_main, NULL);
+
+  for (i = 0; i < num_threads; i++)
+    xpthread_join(threads[i]);
+
+  free (threads);
+
+  return 0;
+}
diff --git a/nptl/tst-audit-threads.h b/nptl/tst-audit-threads.h
new file mode 100644
index 0000000000..cb17645f4b
--- /dev/null
+++ b/nptl/tst-audit-threads.h
@@ -0,0 +1,84 @@ 
+/* Helper header for test-audit-threads.
+
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define CONCAT(a, b) a ## b
+#define NUM(x, y) CONCAT (x, y)
+
+#define FUNC10(x)	\
+  FUNC (NUM (x, 0));	\
+  FUNC (NUM (x, 1));	\
+  FUNC (NUM (x, 2));	\
+  FUNC (NUM (x, 3));	\
+  FUNC (NUM (x, 4));	\
+  FUNC (NUM (x, 5));	\
+  FUNC (NUM (x, 6));	\
+  FUNC (NUM (x, 7));	\
+  FUNC (NUM (x, 8));	\
+  FUNC (NUM (x, 9))
+
+#define FUNC100(x)	\
+  FUNC10 (NUM (x, 0));	\
+  FUNC10 (NUM (x, 1));	\
+  FUNC10 (NUM (x, 2));	\
+  FUNC10 (NUM (x, 3));	\
+  FUNC10 (NUM (x, 4));	\
+  FUNC10 (NUM (x, 5));	\
+  FUNC10 (NUM (x, 6));	\
+  FUNC10 (NUM (x, 7));	\
+  FUNC10 (NUM (x, 8));	\
+  FUNC10 (NUM (x, 9))
+
+#define FUNC1000(x)		\
+  FUNC100 (NUM (x, 0));		\
+  FUNC100 (NUM (x, 1));		\
+  FUNC100 (NUM (x, 2));		\
+  FUNC100 (NUM (x, 3));		\
+  FUNC100 (NUM (x, 4));		\
+  FUNC100 (NUM (x, 5));		\
+  FUNC100 (NUM (x, 6));		\
+  FUNC100 (NUM (x, 7));		\
+  FUNC100 (NUM (x, 8));		\
+  FUNC100 (NUM (x, 9))
+
+#define FUNC7000()	\
+  FUNC1000 (1);		\
+  FUNC1000 (2);		\
+  FUNC1000 (3);		\
+  FUNC1000 (4);		\
+  FUNC1000 (5);		\
+  FUNC1000 (6);		\
+  FUNC1000 (7);
+
+#ifdef FUNC
+# undef FUNC
+#endif
+
+#ifdef externnum
+# define FUNC(x) extern int CONCAT (retNum, x) (void)
+#endif
+
+#ifdef definenum
+# define FUNC(x) int CONCAT (retNum, x) (void) { return x; }
+#endif
+
+#ifdef callnum
+# define FUNC(x) CONCAT (retNum, x) (); sync_all (x)
+#endif
+
+FUNC7000 ();