[v6,1/2] elf: Properly align PT_LOAD segments [BZ #28676]

Message ID 20211213025103.48472-2-rongwei.wang@linux.alibaba.com
State Committed
Series fix p_align on PT_LOAD segment in DSO isn't honored

Checks

Context Check Description
dj/TryBot-apply_patch fail Patch failed to apply to master at the time it was sent

Commit Message

Rongwei Wang Dec. 13, 2021, 2:51 a.m. UTC
  When the PT_LOAD segment alignment is larger than the page size,
allocate enough space to ensure that the segment can be properly
aligned.

This change also makes it simple to use huge pages for code
segments.

This fixes [BZ #28676].

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
---
 elf/dl-load.c         |  2 ++
 elf/dl-load.h         |  3 ++-
 elf/dl-map-segments.h | 50 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 50 insertions(+), 5 deletions(-)
  

Comments

Szabolcs Nagy Dec. 13, 2021, 11:05 a.m. UTC | #1
The 12/13/2021 10:51, Rongwei Wang via Libc-alpha wrote:
> When PT_LOAD segment alignment > the page size, allocate
> enough space to ensure that the segment can be properly
> aligned.
> 
> And this change helps code segments use huge pages become
> simple and available.
> 
> This fixes [BZ #28676].
> 
> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>


since this commit nptl/tst-stack4 consistently
fails in my 32bit arm test environment with

pthread_create failed: 11
pthread_create failed: 11
pthread_create failed: 11
pthread_create failed: 11
pthread_create failed: 11
pthread_create failed: 11
pthread_create failed: 11
tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
Didn't expect signal from child: got `Aborted'

i suspect it simply runs out of memory.

if this change pushes memory usage of the test
above 2G then i think either the test or the
patch have to be changed so 32bit targets can
still run it reliably.
  
Florian Weimer Dec. 13, 2021, 11:17 a.m. UTC | #2
* Szabolcs Nagy:

> The 12/13/2021 10:51, Rongwei Wang via Libc-alpha wrote:
>> When PT_LOAD segment alignment > the page size, allocate
>> enough space to ensure that the segment can be properly
>> aligned.
>> 
>> And this change helps code segments use huge pages become
>> simple and available.
>> 
>> This fixes [BZ #28676].
>> 
>> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
>> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
>
>
> since this commit nptl/tst-stack4 consistently
> fails in my 32bit arm test environment with
>
> pthread_create failed: 11
> pthread_create failed: 11
> pthread_create failed: 11
> pthread_create failed: 11
> pthread_create failed: 11
> pthread_create failed: 11
> pthread_create failed: 11
> tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
> Didn't expect signal from child: got `Aborted'
>
> i suspect it simply runs out of memory.
>
> if this change pushes memory usage of the test
> above 2G then i think either the test or the
> patch have to be changed so 32bit targets can
> still run it reliably.

What are the p_align values of the involved objects?  I would not expect
any changes on 32-bit Arm because p_align and the run-time page size
should match there.

Thanks,
Florian
  
Szabolcs Nagy Dec. 13, 2021, 11:35 a.m. UTC | #3
The 12/13/2021 12:17, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
> > The 12/13/2021 10:51, Rongwei Wang via Libc-alpha wrote:
> >> When PT_LOAD segment alignment > the page size, allocate
> >> enough space to ensure that the segment can be properly
> >> aligned.
> >> 
> >> And this change helps code segments use huge pages become
> >> simple and available.
> >> 
> >> This fixes [BZ #28676].
> >> 
> >> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
> >> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
> >
> >
> > since this commit nptl/tst-stack4 consistently
> > fails in my 32bit arm test environment with
> >
> > pthread_create failed: 11
> > pthread_create failed: 11
> > pthread_create failed: 11
> > pthread_create failed: 11
> > pthread_create failed: 11
> > pthread_create failed: 11
> > pthread_create failed: 11
> > tst-stack4: tst-stack4.c:69: dso_process: Assertion `handle[dso]' failed.
> > Didn't expect signal from child: got `Aborted'
> >
> > i suspect it simply runs out of memory.
> >
> > if this change pushes memory usage of the test
> > above 2G then i think either the test or the
> > patch have to be changed so 32bit targets can
> > still run it reliably.
> 
> What are the p_align values of the involved objects?  I would not expect
> any changes on 32-bit Arm because p_align and the run-time page size
> should match there.

p_align is 64K

and i see a lot of close to 64K PROT_NONE mappings
left behind after many dlclose which creates a lot of
vm fragmentation when dlopen/dlclose is called in a loop.

(i think the mapping is at the beginning or end of
the lib as some kind of padding and left behind after
dlclose, but haven't confirmed this yet)
  
Andreas Schwab Dec. 13, 2021, 11:46 a.m. UTC | #4
On Dez 13 2021, Rongwei Wang via Libc-alpha wrote:

> +  else
> +    {
> +      /* Unmap the unused regions.  */
> +      ElfW(Addr) delta = map_start_aligned - map_start;
> +      if (delta)
> +	__munmap ((void *) map_start, delta);
> +      ElfW(Addr) map_end = map_start_aligned + maplength;
> +      delta = map_start + maplen - map_end;
> +      if (delta)
> +	__munmap ((void *) map_end, delta);

I don't think map_end is guaranteed to be page-aligned.
  
Szabolcs Nagy Dec. 13, 2021, 11:52 a.m. UTC | #5
The 12/13/2021 12:46, Andreas Schwab wrote:
> On Dez 13 2021, Rongwei Wang via Libc-alpha wrote:
> 
> > +  else
> > +    {
> > +      /* Unmap the unused regions.  */
> > +      ElfW(Addr) delta = map_start_aligned - map_start;
> > +      if (delta)
> > +	__munmap ((void *) map_start, delta);
> > +      ElfW(Addr) map_end = map_start_aligned + maplength;
> > +      delta = map_start + maplen - map_end;
> > +      if (delta)
> > +	__munmap ((void *) map_end, delta);
> 
> I don't think map_end is guaranteed to be page-aligned.

indeed i see failing munmap syscalls in strace

..
3161105 munmap(0xf7973040, 57344)       = -1 EINVAL (Invalid argument)
3161105 munmap(0xf79591d4, 24576)       = -1 EINVAL (Invalid argument)
3161107 munmap(0xf6031038, 45056)       = -1 EINVAL (Invalid argument)
3161108 munmap(0xf56f1038, 53248)       = -1 EINVAL (Invalid argument)
...
  
Florian Weimer Dec. 13, 2021, 11:59 a.m. UTC | #6
* Szabolcs Nagy:

>> What are the p_align values of the involved objects?  I would not expect
>> any changes on 32-bit Arm because p_align and the run-time page size
>> should match there.
>
> p_align is 64K
>
> and i see a lot of close to 64K PROT_NONE mappings
> left behind after many dlclose which creates a lot of
> vm fragmentation when dlopen/dlclose is called in a loop.
>
> (i think the mapping is at the beginning or end of
> the lib as some kind of padding and left behind after
> dlclose, but haven't confirmed this yet)

Oh.  So why there is a real bug here, I think we need to discuss what
the change means for 64K p_align binaries on 4K kernels.  Do the
additional munmap calls matter for startup performance?

(I understand this is very much a correctness fix, but startup
performance matters as well.)

Thanks,
Florian
  
H.J. Lu Dec. 13, 2021, 1:20 p.m. UTC | #7
On Mon, Dec 13, 2021 at 3:59 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Szabolcs Nagy:
>
> >> What are the p_align values of the involved objects?  I would not expect
> >> any changes on 32-bit Arm because p_align and the run-time page size
> >> should match there.
> >
> > p_align is 64K
> >
> > and i see a lot of close to 64K PROT_NONE mappings
> > left behind after many dlclose which creates a lot of
> > vm fragmentation when dlopen/dlclose is called in a loop.
> >
> > (i think the mapping is at the beginning or end of
> > the lib as some kind of padding and left behind after
> > dlclose, but haven't confirmed this yet)
>
> Oh.  So why there is a real bug here, I think we need to discuss what
> the change means for 64K p_align binaries on 4K kernels.  Do the
> additional munmap calls matter for startup performance?

We should align munmap arguments.

> (I understand this is very much a correctness fix, but startup
> performance matters as well.)
>

The kernel loader doesn't call munmap in this case.  Should
we be concerned about the unused pages?
  
Florian Weimer Dec. 13, 2021, 1:26 p.m. UTC | #8
* H. J. Lu:

> On Mon, Dec 13, 2021 at 3:59 AM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Szabolcs Nagy:
>>
>> >> What are the p_align values of the involved objects?  I would not expect
>> >> any changes on 32-bit Arm because p_align and the run-time page size
>> >> should match there.
>> >
>> > p_align is 64K
>> >
>> > and i see a lot of close to 64K PROT_NONE mappings
>> > left behind after many dlclose which creates a lot of
>> > vm fragmentation when dlopen/dlclose is called in a loop.
>> >
>> > (i think the mapping is at the beginning or end of
>> > the lib as some kind of padding and left behind after
>> > dlclose, but haven't confirmed this yet)
>>
>> Oh.  So why there is a real bug here, I think we need to discuss what
>> the change means for 64K p_align binaries on 4K kernels.  Do the
>> additional munmap calls matter for startup performance?
>
> We should align munmap arguments.

Sorry, I meant “while there is a real bug here”, that is: even if we fix
that, the additional munmap calls could hurt startup performance.

>> (I understand this is very much a correctness fix, but startup
>> performance matters as well.)
>>
>
> The kernel loader doesn't call munmap in this case.  Should
> we be concerned about the unused pages?

Which approach leads to fewer mappings that count against the map limit
(64K by default I believe)?

Thanks,
Florian
  
H.J. Lu Dec. 13, 2021, 1:34 p.m. UTC | #9
On Mon, Dec 13, 2021 at 5:26 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * H. J. Lu:
>
> > On Mon, Dec 13, 2021 at 3:59 AM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Szabolcs Nagy:
> >>
> >> >> What are the p_align values of the involved objects?  I would not expect
> >> >> any changes on 32-bit Arm because p_align and the run-time page size
> >> >> should match there.
> >> >
> >> > p_align is 64K
> >> >
> >> > and i see a lot of close to 64K PROT_NONE mappings
> >> > left behind after many dlclose which creates a lot of
> >> > vm fragmentation when dlopen/dlclose is called in a loop.
> >> >
> >> > (i think the mapping is at the beginning or end of
> >> > the lib as some kind of padding and left behind after
> >> > dlclose, but haven't confirmed this yet)
> >>
> >> Oh.  So why there is a real bug here, I think we need to discuss what
> >> the change means for 64K p_align binaries on 4K kernels.  Do the
> >> additional munmap calls matter for startup performance?
> >
> > We should align munmap arguments.
>
> Sorry, I meant “while there is a real bug here”, that is: even if we fix
> that, the additional munmap calls could hurt startup performance.

True.

> >> (I understand this is very much a correctness fix, but startup
> >> performance matters as well.)
> >>
> >
> > The kernel loader doesn't call munmap in this case.  Should
> > we be concerned about the unused pages?
>
> Which approach leads to fewer mappings that count against the map limit
> (64K by default I believe)?

Without munmap, some pages will be wasted.

Should we change the linker to add a new DT_XXX entry to request
this new behavior?  The linker would set it only when it makes
p_align > page size, due to variable alignment or a command-line
option.
  
Rongwei Wang Dec. 13, 2021, 2:51 p.m. UTC | #10
On 12/13/21 7:52 PM, Szabolcs Nagy via Libc-alpha wrote:
> The 12/13/2021 12:46, Andreas Schwab wrote:
>> On Dez 13 2021, Rongwei Wang via Libc-alpha wrote:
>>
>>> +  else
>>> +    {
>>> +      /* Unmap the unused regions.  */
>>> +      ElfW(Addr) delta = map_start_aligned - map_start;
>>> +      if (delta)
>>> +	__munmap ((void *) map_start, delta);
>>> +      ElfW(Addr) map_end = map_start_aligned + maplength;
>>> +      delta = map_start + maplen - map_end;
>>> +      if (delta)
>>> +	__munmap ((void *) map_end, delta);
>>
>> I don't think map_end is guaranteed to be page-aligned.
> 
> indeed i see failing munmap syscalls in strace
Hi, Szabolcs

Thanks for your test!  I have no arm32 environment, so I didn't catch this failure.

It seems 'map_end' needs to be page-aligned before calling munmap.
The following change updates only the first line to fix this bug:

+      ElfW(Addr) map_end = ALIGN_UP(map_start_aligned + maplength, GLRO(dl_pagesize));
+      delta = map_start + maplen - map_end;
+      if (delta)
+	__munmap ((void *) map_end, delta);

Could you help me test this new code again, if possible?

Thanks.

> 
> ..
> 3161105 munmap(0xf7973040, 57344)       = -1 EINVAL (Invalid argument)
> 3161105 munmap(0xf79591d4, 24576)       = -1 EINVAL (Invalid argument)
> 3161107 munmap(0xf6031038, 45056)       = -1 EINVAL (Invalid argument)
> 3161108 munmap(0xf56f1038, 53248)       = -1 EINVAL (Invalid argument)
> ...
>
  
Szabolcs Nagy Dec. 13, 2021, 5:37 p.m. UTC | #11
The 12/13/2021 22:51, Rongwei Wang wrote:
> On 12/13/21 7:52 PM, Szabolcs Nagy via Libc-alpha wrote:
> > The 12/13/2021 12:46, Andreas Schwab wrote:
> > > On Dez 13 2021, Rongwei Wang via Libc-alpha wrote:
> > > 
> > > > +  else
> > > > +    {
> > > > +      /* Unmap the unused regions.  */
> > > > +      ElfW(Addr) delta = map_start_aligned - map_start;
> > > > +      if (delta)
> > > > +	__munmap ((void *) map_start, delta);
> > > > +      ElfW(Addr) map_end = map_start_aligned + maplength;
> > > > +      delta = map_start + maplen - map_end;
> > > > +      if (delta)
> > > > +	__munmap ((void *) map_end, delta);
> > > 
> > > I don't think map_end is guaranteed to be page-aligned.
> > 
> > indeed i see failing munmap syscalls in strace
> Hi, Szabolcs
> 
> Thanks for your test! I have no arm32 environment, and ignoring this test.
> 
> It seems the 'map_end' need to be page-aligned before calling munmap.
> The following code only update the first line to fix this bug:
> 
> +      ElfW(Addr) map_end = ALIGN_UP(map_start_aligned + maplength,
> GLRO(dl_pagesize));
> +      delta = map_start + maplen - map_end;
> +      if (delta)
> +	__munmap ((void *) map_end, delta);
> 
> Can you help me test this new code again if available?

yes, the ALIGN_UP works.

note that the issue is observable on aarch64 too (with 4k pagesize)
and likely x86_64 too, it just does not cause enough vm fragmentation
there to run out of memory.

you can verify it by using strace -e munmap before and after.

thanks.
  
Florian Weimer Dec. 13, 2021, 5:50 p.m. UTC | #12
* Szabolcs Nagy:

> The 12/13/2021 22:51, Rongwei Wang wrote:
>> On 12/13/21 7:52 PM, Szabolcs Nagy via Libc-alpha wrote:
>> > The 12/13/2021 12:46, Andreas Schwab wrote:
>> > > On Dez 13 2021, Rongwei Wang via Libc-alpha wrote:
>> > > 
>> > > > +  else
>> > > > +    {
>> > > > +      /* Unmap the unused regions.  */
>> > > > +      ElfW(Addr) delta = map_start_aligned - map_start;
>> > > > +      if (delta)
>> > > > +	__munmap ((void *) map_start, delta);
>> > > > +      ElfW(Addr) map_end = map_start_aligned + maplength;
>> > > > +      delta = map_start + maplen - map_end;
>> > > > +      if (delta)
>> > > > +	__munmap ((void *) map_end, delta);
>> > > 
>> > > I don't think map_end is guaranteed to be page-aligned.
>> > 
>> > indeed i see failing munmap syscalls in strace
>> Hi, Szabolcs
>> 
>> Thanks for your test! I have no arm32 environment, and ignoring this test.
>> 
>> It seems the 'map_end' need to be page-aligned before calling munmap.
>> The following code only update the first line to fix this bug:
>> 
>> +      ElfW(Addr) map_end = ALIGN_UP(map_start_aligned + maplength,
>> GLRO(dl_pagesize));
>> +      delta = map_start + maplen - map_end;
>> +      if (delta)
>> +	__munmap ((void *) map_end, delta);
>> 
>> Can you help me test this new code again if available?
>
> yes, the ALIGN_UP works.
>
> note that the issue is observable on aarch64 too (with 4k pagesize)
> and likely x86_64 too, it just does not cause enough vm fragmentation
> there to run out of memory.
>
> you can verify it by using strace -e munmap before and after.

We should check munmap failure though and rollback everything if
necessary.  It's possible we can undo the initial PROT_NONE mapping even
if future munmap calls fail because unmapping the first mapping does not
need to split a mapping.

Thanks,
Florian
  

Patch

diff --git a/elf/dl-load.c b/elf/dl-load.c
index bf8957e73c..721593135e 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -1,5 +1,6 @@ 
 /* Map in a shared object's segments from the file.
    Copyright (C) 1995-2021 Free Software Foundation, Inc.
+   Copyright The GNU Toolchain Authors.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -1150,6 +1151,7 @@  _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	  c->mapend = ALIGN_UP (ph->p_vaddr + ph->p_filesz, GLRO(dl_pagesize));
 	  c->dataend = ph->p_vaddr + ph->p_filesz;
 	  c->allocend = ph->p_vaddr + ph->p_memsz;
+	  c->mapalign = ph->p_align;
 	  c->mapoff = ALIGN_DOWN (ph->p_offset, GLRO(dl_pagesize));
 
 	  /* Determine whether there is a gap between the last segment
diff --git a/elf/dl-load.h b/elf/dl-load.h
index e329d49a81..e6dabcb336 100644
--- a/elf/dl-load.h
+++ b/elf/dl-load.h
@@ -1,5 +1,6 @@ 
 /* Map in a shared object's segments from the file.
    Copyright (C) 1995-2021 Free Software Foundation, Inc.
+   Copyright The GNU Toolchain Authors.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -74,7 +75,7 @@  ELF_PREFERRED_ADDRESS_DATA;
    Its details have been expanded out and converted.  */
 struct loadcmd
 {
-  ElfW(Addr) mapstart, mapend, dataend, allocend;
+  ElfW(Addr) mapstart, mapend, dataend, allocend, mapalign;
   ElfW(Off) mapoff;
   int prot;                             /* PROT_* bits.  */
 };
diff --git a/elf/dl-map-segments.h b/elf/dl-map-segments.h
index f9fb110ee3..70a4c40695 100644
--- a/elf/dl-map-segments.h
+++ b/elf/dl-map-segments.h
@@ -1,5 +1,6 @@ 
 /* Map in a shared object's segments.  Generic version.
    Copyright (C) 1995-2021 Free Software Foundation, Inc.
+   Copyright The GNU Toolchain Authors.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -18,6 +19,50 @@ 
 
 #include <dl-load.h>
 
+/* Map a segment and align it properly.  */
+
+static __always_inline ElfW(Addr)
+_dl_map_segment (const struct loadcmd *c, ElfW(Addr) mappref,
+		 const size_t maplength, int fd)
+{
+  if (__glibc_likely (c->mapalign <= GLRO(dl_pagesize)))
+    return (ElfW(Addr)) __mmap ((void *) mappref, maplength, c->prot,
+				MAP_COPY|MAP_FILE, fd, c->mapoff);
+
+  /* If the segment alignment > the page size, allocate enough space to
+     ensure that the segment can be properly aligned.  */
+  ElfW(Addr) maplen = (maplength >= c->mapalign
+		       ? (maplength + c->mapalign)
+		       : (2 * c->mapalign));
+  ElfW(Addr) map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplen,
+					      PROT_NONE,
+					      MAP_ANONYMOUS|MAP_PRIVATE,
+					      -1, 0);
+  if (__glibc_unlikely ((void *) map_start == MAP_FAILED))
+    return map_start;
+
+  ElfW(Addr) map_start_aligned = ALIGN_UP (map_start, c->mapalign);
+  map_start_aligned = (ElfW(Addr)) __mmap ((void *) map_start_aligned,
+					   maplength, c->prot,
+					   MAP_COPY|MAP_FILE|MAP_FIXED,
+					   fd, c->mapoff);
+  if (__glibc_unlikely ((void *) map_start_aligned == MAP_FAILED))
+    __munmap ((void *) map_start, maplen);
+  else
+    {
+      /* Unmap the unused regions.  */
+      ElfW(Addr) delta = map_start_aligned - map_start;
+      if (delta)
+	__munmap ((void *) map_start, delta);
+      ElfW(Addr) map_end = map_start_aligned + maplength;
+      delta = map_start + maplen - map_end;
+      if (delta)
+	__munmap ((void *) map_end, delta);
+    }
+
+  return map_start_aligned;
+}
+
 /* This implementation assumes (as does the corresponding implementation
    of _dl_unmap_segments, in dl-unmap-segments.h) that shared objects
    are always laid out with all segments contiguous (or with gaps
@@ -53,10 +98,7 @@  _dl_map_segments (struct link_map *l, int fd,
            - MAP_BASE_ADDR (l));
 
       /* Remember which part of the address space this object uses.  */
-      l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
-                                            c->prot,
-                                            MAP_COPY|MAP_FILE,
-                                            fd, c->mapoff);
+      l->l_map_start = _dl_map_segment (c, mappref, maplength, fd);
       if (__glibc_unlikely ((void *) l->l_map_start == MAP_FAILED))
         return DL_MAP_SEGMENTS_ERROR_MAP_SEGMENT;