[v2,0/4] malloc: Improve Huge Page support

Message ID 20210818142000.128752-1-adhemerval.zanella@linaro.org

Adhemerval Zanella Netto Aug. 18, 2021, 2:19 p.m. UTC
Linux currently supports two ways to use Huge Pages: either by passing
specific flags directly to the syscall (MAP_HUGETLB for mmap(), or
SHM_HUGETLB for shmget()), or by using Transparent Huge Pages (THP),
where the kernel tries to move allocated anonymous pages to Huge Page
blocks transparently to the application.

Also, THP currently supports three different modes [1]: 'never', 'madvise',
and 'always'.  'never' is self-explanatory and 'always' enables
THP for all anonymous memory.  However, 'madvise' is still the default
on some systems, and in that case THP is used only if the memory
range is explicitly advised by the program through a
madvise(MADV_HUGEPAGE) call.
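
To illustrate what 'madvise' mode expects from a program, here is a minimal
sketch, not the code from this patchset.  The helper name map_with_thp_hint
is hypothetical; mmap(), madvise() and MADV_HUGEPAGE are the standard Linux
interfaces.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Map `size` bytes of anonymous memory and ask the kernel to back the
// range with transparent huge pages.  Returns nullptr if the mapping
// itself fails.  The madvise() result is deliberately ignored: the call
// is only a hint and fails on kernels built without THP support.
// (Hypothetical helper, for illustration only.)
void *map_with_thp_hint(std::size_t size)
{
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return nullptr;
#ifdef MADV_HUGEPAGE
  (void) madvise(p, size, MADV_HUGEPAGE);
#endif
  return p;
}
```

The returned region is released with munmap(p, size) as usual; whether huge
pages are actually used is up to the kernel.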

This patchset adds two new tunables to improve malloc() support for
Huge Pages:

  - glibc.malloc.thp_madvise: instructs the system allocator to issue
    a madvise(MADV_HUGEPAGE) call after each mmap() call for sizes larger
    than the default huge page size.  It is disabled by default, and if
    the system does not support THP, setting the tunable does not enable
    the madvise() call.

  - glibc.malloc.mmap_hugetlb: instructs the system allocator to round
    allocations up to the huge page size and to pass the required flags
    (MAP_HUGETLB on Linux).  If the memory allocation fails, the
    default system page size is used instead.  It is disabled by
    default; a value of 1 uses the default system huge page size.
    A positive value larger than 1 selects a specific huge page
    size, which is matched against the ones supported by the system.
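
The round-up and fallback behavior described for mmap_hugetlb can be
sketched roughly as below.  This is an illustration under assumptions, not
the patch itself: the names round_up and map_hugetlb_or_fallback are
hypothetical, hugepage_size is assumed to be a power of two matching the
system default huge page size, and selecting a non-default huge page size
is not shown.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Round `n` up to a multiple of `align`, which must be a power of two.
constexpr std::size_t round_up(std::size_t n, std::size_t align)
{
  return (n + align - 1) & ~(align - 1);
}

// Try a huge page mapping first; if that fails, fall back to a normal
// mapping that uses the default system page size.  `*mapped` receives
// the length that must later be passed to munmap().  Returns nullptr if
// both attempts fail.  (Hypothetical helper, for illustration only.)
void *map_hugetlb_or_fallback(std::size_t size, std::size_t hugepage_size,
                              std::size_t *mapped)
{
  const int prot = PROT_READ | PROT_WRITE;
  const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
  const std::size_t hsize = round_up(size, hugepage_size);
  void *hp = mmap(nullptr, hsize, prot, flags | MAP_HUGETLB, -1, 0);
  if (hp != MAP_FAILED)
    {
      *mapped = hsize;
      return hp;
    }
#endif
  void *p = mmap(nullptr, size, prot, flags, -1, 0);
  if (p == MAP_FAILED)
    return nullptr;
  *mapped = size;
  return p;
}
```

On a system with no huge pages reserved the first mmap() fails and the
plain mapping is returned, which is the fallback case.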

The 'thp_madvise' tunable also changes the sbrk() usage by malloc
on main arenas, where the increment is now aligned to the huge page
size, instead of default page size.

The 'mmap_hugetlb' tunable aims to replace the 'morecore' callback,
removed in 2.34, for libhugetlbfs (where the library tries to leverage
huge pages rather than provide a system allocator).  By
implementing the support directly in the mmap() code path there is
no need to emulate the morecore()/sbrk() semantics, which simplifies
the code and makes the memory shrink logic more straightforward.

The performance improvements depend heavily on the workload
and the platform; however, a simple testcase shows the possible
improvements:

$ cat hugepages.cc
#include <unordered_map>

int
main (int argc, char *argv[])
{
  std::size_t iters = 10000000;
  std::unordered_map <std::size_t, std::size_t> ht;
  ht.reserve (iters);
  for (std::size_t i = 0; i < iters; ++i)
    ht.try_emplace (i, i);

  return 0;
}
$ g++ -std=c++17 -O2 hugepages.cc -o hugepages

On an x86_64 (Ryzen 9 5900X):

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=0 ./testrun.sh ./hugepages':

            98,874      faults
           717,059      dTLB-loads
           411,701      dTLB-load-misses          #   57.42% of all dTLB cache accesses
         3,754,927      cache-misses              #    8.479 % of all cache refs
        44,287,580      cache-references

       0.315278378 seconds time elapsed

       0.238635000 seconds user
       0.076714000 seconds sys

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=1 ./testrun.sh ./hugepages':

             1,871      faults
           120,035      dTLB-loads
            19,882      dTLB-load-misses          #   16.56% of all dTLB cache accesses
         4,182,942      cache-misses              #    7.452 % of all cache refs
        56,128,995      cache-references

       0.262620733 seconds time elapsed

       0.222233000 seconds user
       0.040333000 seconds sys
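
As a quick reading aid for the two x86_64 runs above, the relative
reductions can be computed as in the sketch below.  The struct and names
are hypothetical; the numbers are copied from the perf output.  Faults drop
by roughly 98%, dTLB load misses by roughly 95%, and elapsed time by
roughly 17%.

```cpp
// Counters copied from the two x86_64 perf runs above.
struct Run
{
  double faults;
  double dtlb_load_misses;
  double elapsed_seconds;
};

constexpr Run x86_64_off = { 98874, 411701, 0.315278378 };  // thp_madvise=0
constexpr Run x86_64_on  = { 1871, 19882, 0.262620733 };    // thp_madvise=1

// Fractional reduction going from `before` to `after`,
// for example 0.25 for a 25% drop.
constexpr double reduction(double before, double after)
{
  return (before - after) / before;
}
```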


On an AArch64 (Cortex-A72):

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=0 ./testrun.sh ./hugepages':

             98835      faults
        2007234756      dTLB-loads
           4613669      dTLB-load-misses          #    0.23% of all dTLB cache accesses
           8831801      cache-misses              #    0.504 % of all cache refs
        1751391405      cache-references

       0.616782575 seconds time elapsed

       0.460946000 seconds user
       0.154309000 seconds sys

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=1 ./testrun.sh ./hugepages':

               955      faults
        1787401880      dTLB-loads
            224034      dTLB-load-misses          #    0.01% of all dTLB cache accesses
           5480917      cache-misses              #    0.337 % of all cache refs
        1625937858      cache-references

       0.487773443 seconds time elapsed

       0.440894000 seconds user
       0.046465000 seconds sys


And on a powerpc64 (POWER8):

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=0 ./testrun.sh ./hugepages':

              5453      faults
              9940      dTLB-load-misses
           1338152      cache-misses              #    0.101 % of all cache refs
        1326037487      cache-references

       1.056355887 seconds time elapsed

       1.014633000 seconds user
       0.041805000 seconds sys

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=1 ./testrun.sh ./hugepages':

              1016      faults
              1746      dTLB-load-misses
            399052      cache-misses              #    0.030 % of all cache refs
        1316059877      cache-references

       1.057810501 seconds time elapsed

       1.012175000 seconds user
       0.045624000 seconds sys

It is worth noting that the powerpc64 machine has 'always' set
in '/sys/kernel/mm/transparent_hugepage/enabled'.
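
The active mode is the bracketed word in that file (for example
"always [madvise] never").  A minimal sketch for checking it before
comparing results follows; the function names are hypothetical, only the
sysfs path comes from the text above.

```cpp
#include <fstream>
#include <string>

// Extract the active mode, i.e. the word in square brackets, from a line
// such as "always [madvise] never".  Returns an empty string if the line
// has no bracketed word.
std::string active_thp_mode(const std::string &line)
{
  const std::string::size_type open = line.find('[');
  if (open == std::string::npos)
    return "";
  const std::string::size_type close = line.find(']', open);
  if (close == std::string::npos)
    return "";
  return line.substr(open + 1, close - open - 1);
}

// Read the system-wide THP mode.  Returns an empty string if the file
// is missing or unreadable, for instance on a kernel without THP.
std::string read_thp_mode()
{
  std::ifstream f("/sys/kernel/mm/transparent_hugepage/enabled");
  std::string line;
  if (!std::getline(f, line))
    return "";
  return active_thp_mode(line);
}
```

With 'always' active the kernel already uses THP for anonymous memory, so
little difference between the two runs is expected on such a machine.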

Norbert Manthey's paper has more information, with a more thorough
performance analysis.

For testing, I ran make check on x86_64-linux-gnu with thp_pagesize=1
(set directly in ptmalloc_init() after tunable initialization) and
with mmap_hugetlb=1 (also set directly in ptmalloc_init()), with about
10 large pages (so the fallback mmap() call is used) and with
1024 large pages (so all mmap(MAP_HUGETLB) calls succeed).
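
When reproducing the two configurations, the size of the huge page pool can
be checked in /proc/meminfo (fields HugePages_Total and HugePages_Free).
Here is a small sketch for reading one field; the function name
meminfo_value is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Return the numeric value of `key` (for example "HugePages_Free") from
// text in /proc/meminfo format, or -1 if the key is not present.
long meminfo_value(std::istream &in, const std::string &key)
{
  std::string line;
  while (std::getline(in, line))
    {
      // Match "<key>:" at the start of the line.
      if (line.size() > key.size()
          && line.compare(0, key.size(), key) == 0
          && line[key.size()] == ':')
        {
          std::istringstream fields(line.substr(key.size() + 1));
          long value;
          if (fields >> value)
            return value;
        }
    }
  return -1;
}
```

For the real file, open it with std::ifstream f("/proc/meminfo") and call
meminfo_value(f, "HugePages_Free").  On Linux the pool size itself is
configured through /proc/sys/vm/nr_hugepages.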

--

Changes from previous version:

  - Renamed thp_pagesize to thp_madvise and made it a boolean state.
  - Added MAP_HUGETLB support for mmap().
  - Removed system specific hooks for THP huge page size in favor of
    a generic Linux implementation.
  - Initial program segments need to be page aligned for the
    first madvise call.

Adhemerval Zanella (4):
  malloc: Add madvise support for Transparent Huge Pages
  malloc: Add THP/madvise support for sbrk
  malloc: Move mmap logic to its own function
  malloc: Add Huge Page support for sysmalloc

 NEWS                                       |   9 +-
 elf/dl-tunables.list                       |   9 +
 elf/tst-rtld-list-tunables.exp             |   2 +
 include/libc-pointer-arith.h               |  10 +
 malloc/arena.c                             |   7 +
 malloc/malloc-internal.h                   |   1 +
 malloc/malloc.c                            | 263 +++++++++++++++------
 manual/tunables.texi                       |  23 ++
 sysdeps/generic/Makefile                   |   8 +
 sysdeps/generic/malloc-hugepages.c         |  37 +++
 sysdeps/generic/malloc-hugepages.h         |  49 ++++
 sysdeps/unix/sysv/linux/malloc-hugepages.c | 201 ++++++++++++++++
 12 files changed, 542 insertions(+), 77 deletions(-)
 create mode 100644 sysdeps/generic/malloc-hugepages.c
 create mode 100644 sysdeps/generic/malloc-hugepages.h
 create mode 100644 sysdeps/unix/sysv/linux/malloc-hugepages.c