Message ID | 20210818142000.128752-1-adhemerval.zanella@linaro.org |
---|---|
Series | malloc: Improve Huge Page support | |
Message
Adhemerval Zanella Netto
Aug. 18, 2021, 2:19 p.m. UTC
Linux currently supports two ways to use Huge Pages: either by using specific flags directly with the syscall (MAP_HUGETLB for mmap(), or SHM_HUGETLB for shmget()), or by using Transparent Huge Pages (THP), where the kernel will try to move allocated anonymous pages to Huge Page blocks transparently to the application.

THP currently supports three different modes [1]: 'never', 'madvise', and 'always'. The 'never' mode is self-explanatory and 'always' enables THP for all anonymous memory. However, 'madvise' is still the default on some systems, and in that case THP is only used if the memory range is explicitly advised by the program through a madvise(MADV_HUGEPAGE) call.

This patchset adds two new tunables to improve malloc() support for Huge Pages:

- glibc.malloc.thp_madvise: instructs the system allocator to issue
  a madvise(MADV_HUGEPAGE) call after each mmap() call for sizes larger
  than the default huge page size. The default behavior is disabled,
  and if the system does not support THP the tunable does not enable
  the madvise() call.

- glibc.malloc.mmap_hugetlb: instructs the system allocator to round
  allocations up to huge page sizes along with the required flags
  (MAP_HUGETLB on Linux). If the memory allocation fails, the default
  system page size is used instead. The default behavior is disabled;
  a value of 1 uses the default system huge page size, and a value
  larger than 1 selects a specific huge page size, which is matched
  against the ones supported by the system.

The 'thp_madvise' tunable also changes the sbrk() usage by malloc on main arenas, where the increment is now aligned to the huge page size instead of the default page size.

The 'mmap_hugetlb' tunable aims to replace the 'morecore' callback removed in 2.34 for libhugetlbfs (where the library tries to leverage huge pages by providing a system allocator instead).
By implementing the support directly on the mmap() code path there is no need to emulate the morecore()/sbrk() semantics, which simplifies the code and makes the memory shrink logic more straightforward.

The performance improvements are highly dependent on the workload and the platform; however, a simple testcase shows the possible improvements:

$ cat hugepages.cc
#include <unordered_map>

int
main (int argc, char *argv[])
{
  std::size_t iters = 10000000;
  std::unordered_map<std::size_t, std::size_t> ht;
  ht.reserve (iters);
  for (std::size_t i = 0; i < iters; ++i)
    ht.try_emplace (i, i);

  return 0;
}
$ g++ -std=c++17 -O2 hugepages.cc -o hugepages

On x86_64 (Ryzen 9 5900X):

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=0 ./testrun.sh ./hugepages':

             98,874      faults
            717,059      dTLB-loads
            411,701      dTLB-load-misses   # 57.42% of all dTLB cache accesses
          3,754,927      cache-misses       # 8.479 % of all cache refs
         44,287,580      cache-references

        0.315278378 seconds time elapsed
        0.238635000 seconds user
        0.076714000 seconds sys

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=1 ./testrun.sh ./hugepages':

              1,871      faults
            120,035      dTLB-loads
             19,882      dTLB-load-misses   # 16.56% of all dTLB cache accesses
          4,182,942      cache-misses       # 7.452 % of all cache refs
         56,128,995      cache-references

        0.262620733 seconds time elapsed
        0.222233000 seconds user
        0.040333000 seconds sys

On AArch64 (Cortex-A72):

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=0 ./testrun.sh ./hugepages':

              98835      faults
         2007234756      dTLB-loads
            4613669      dTLB-load-misses   # 0.23% of all dTLB cache accesses
            8831801      cache-misses       # 0.504 % of all cache refs
         1751391405      cache-references

        0.616782575 seconds time elapsed
        0.460946000 seconds user
        0.154309000 seconds sys

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=1 ./testrun.sh ./hugepages':

                955      faults
         1787401880      dTLB-loads
             224034      dTLB-load-misses   # 0.01% of all dTLB cache accesses
            5480917      cache-misses       # 0.337 % of all cache refs
         1625937858      cache-references

        0.487773443 seconds time elapsed
        0.440894000 seconds user
        0.046465000 seconds sys

And on powerpc64 (POWER8):

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=0 ./testrun.sh ./hugepages':

               5453      faults
               9940      dTLB-load-misses
            1338152      cache-misses       # 0.101 % of all cache refs
         1326037487      cache-references

        1.056355887 seconds time elapsed
        1.014633000 seconds user
        0.041805000 seconds sys

 Performance counter stats for 'env GLIBC_TUNABLES=glibc.malloc.thp_madvise=1 ./testrun.sh ./hugepages':

               1016      faults
               1746      dTLB-load-misses
             399052      cache-misses       # 0.030 % of all cache refs
         1316059877      cache-references

        1.057810501 seconds time elapsed
        1.012175000 seconds user
        0.045624000 seconds sys

It is worth noting that the powerpc64 machine has 'always' set in '/sys/kernel/mm/transparent_hugepage/enabled'.

Norbert Manthey's paper has more information with a more thorough performance analysis.

For testing, run make check on x86_64-linux-gnu with thp_pagesize=1 (set directly in ptmalloc_init() after tunable initialization) and with mmap_hugetlb=1 (also directly in ptmalloc_init()), both with about 10 large pages (so the fallback mmap() call is used) and with 1024 large pages (so all mmap(MAP_HUGETLB) calls are successful).

--

Changes from previous version:

- Renamed thp_pagesize to thp_madvise and made it a boolean state.
- Added MAP_HUGETLB support for mmap().
- Removed system-specific hooks for the THP huge page size in favor of
  a Linux generic implementation.
- Initial program segments need to be page aligned for the first
  madvise call.
Adhemerval Zanella (4):
  malloc: Add madvise support for Transparent Huge Pages
  malloc: Add THP/madvise support for sbrk
  malloc: Move mmap logic to its own function
  malloc: Add Huge Page support for sysmalloc

 NEWS                                       |   9 +-
 elf/dl-tunables.list                       |   9 +
 elf/tst-rtld-list-tunables.exp             |   2 +
 include/libc-pointer-arith.h               |  10 +
 malloc/arena.c                             |   7 +
 malloc/malloc-internal.h                   |   1 +
 malloc/malloc.c                            | 263 +++++++++++++++------
 manual/tunables.texi                       |  23 ++
 sysdeps/generic/Makefile                   |   8 +
 sysdeps/generic/malloc-hugepages.c         |  37 +++
 sysdeps/generic/malloc-hugepages.h         |  49 ++++
 sysdeps/unix/sysv/linux/malloc-hugepages.c | 201 ++++++++++++++++
 12 files changed, 542 insertions(+), 77 deletions(-)
 create mode 100644 sysdeps/generic/malloc-hugepages.c
 create mode 100644 sysdeps/generic/malloc-hugepages.h
 create mode 100644 sysdeps/unix/sysv/linux/malloc-hugepages.c
Comments
On 8/18/21 7:49 PM, Adhemerval Zanella wrote:
> Linux currently supports two ways to use Huge Pages: either by using
> specific flags directly with the syscall (MAP_HUGETLB for mmap(), or
> SHM_HUGETLB for shmget()), or by using Transparent Huge Pages (THP)
> where the kernel will try to move allocated anonymous pages to Huge
> Pages blocks transparent to application.
>
> Also, THP current support three different modes [1]: 'never', 'madvise',
> and 'always'. [...]
>
> This patchset adds a two new tunables to improve malloc() support with
> Huge Page:

I wonder if this could be done with just the one tunable,
glibc.malloc.hugepages where:

0: Disabled (default)
1: Transparent, where we emulate "always" behaviour of THP
2: HugeTLB enabled with default hugepage size
<size>: HugeTLB enabled with the specified page size

When using HugeTLB, we don't really need to bother with THP so they seem
mutually exclusive.

> - glibc.malloc.thp_madvise: instruct the system allocator to issue
>   a madvise(MADV_HUGEPAGE) call after a mmap() one for sizes larger
>   than the default huge page size. The default behavior is to
>   disable it and if the system does not support THP the tunable also
>   does not enable the madvise() call.
>
> - glibc.malloc.mmap_hugetlb: instruct the system allocator to round
>   allocation to huge page sizes along with the required flags
>   (MAP_HUGETLB for Linux). If the memory allocation fails, the
>   default system page size is used instead. The default behavior is
>   to disable and a value of 1 uses the default system huge page size.
>   A positive value larger than 1 means to use a specific huge page
>   size, which is matched against the supported ones by the system.
>
> [sbrk()/morecore discussion snipped]
>
> The performance improvements are really dependent of the workload
> and the platform, however a simple testcase might show the possible
> improvements:

A simple test like below in benchtests would be very useful to at least
get an initial understanding of the behaviour differences with different
tunable values. Later those who care can add more relevant workloads.

> [testcase and performance counter results snipped]
>
> For testing run make check on x86_64-linux-gnu with thp_pagesize=1
> (directly on ptmalloc_init() after tunable initialiazation) and
> with mmap_hugetlb=1 (also directly on ptmalloc_init()) with about
> 10 large pages (so the fallback mmap() call is used) and with
> 1024 large pages (so all mmap(MAP_HUGETLB) are successful).

You could add tests similar to mcheck and malloc-check, i.e. add
$(tests-hugepages) to run all malloc tests again with the various
tunable values. See tests-mcheck for example.

> [changelog and diffstat snipped]
On 18/08/2021 15:11, Siddhesh Poyarekar wrote:
> On 8/18/21 7:49 PM, Adhemerval Zanella wrote:
>> [cover letter snipped]
>
> I wonder if this could be done with just the one tunable,
> glibc.malloc.hugepages where:
>
> 0: Disabled (default)
> 1: Transparent, where we emulate "always" behaviour of THP
> 2: HugeTLB enabled with default hugepage size
> <size>: HugeTLB enabled with the specified page size

I thought about it, and decided to use two tunables: although for mmap()
system allocation the two tunables are mutually exclusive (since it does
not make sense to madvise() a mmap(MAP_HUGETLB) region), we still use
sbrk() on the main arena. The way I did it for sbrk() is to align the
increment to the THP page size advertised by the kernel, so using the
tunable does change the behavior slightly (it is not as 'transparent' as
the madvise call).

So using only one tunable would require either dropping the sbrk()
madvise when MAP_HUGETLB is used, moving it to another tunable (say
'3: HugeTLB enabled with default hugepage size and madvise() on sbrk()'),
or assuming it whenever huge pages should be used. (And how do we handle
sbrk() with an explicit size?)

If one tunable is preferable I think it would be something like:

0: Disabled (default)
1: Transparent, where we emulate the "always" behaviour of THP;
   sbrk() is also aligned to the huge page size and madvise() is issued
2: HugeTLB enabled with the default hugepage size; sbrk() is handled
   as in 1
<size>: HugeTLB enabled with the specified page size; sbrk() is handled
   as in 1

Forcing the sbrk() alignment and madvise() for all tunable values sets
the expectation that huge pages are used on all possible occasions.

> When using HugeTLB, we don't really need to bother with THP so they
> seem mutually exclusive.
>
> [tunable descriptions snipped]
>> The performance improvements are really dependent of the workload
>> and the platform, however a simple testcase might show the possible
>> improvements:
>
> A simple test like below in benchtests would be very useful to at
> least get an initial understanding of the behaviour differences with
> different tunable values. Later those who care can add more relevant
> workloads.

Yeah, I am open to suggestions on how to properly test it. The issue is
that we need a specific system configuration, either proper kernel
support (THP) or reserved large pages, to actually test it.

For THP the issue is that it really is 'transparent' to the user, which
means we would need to poke at specific Linux sysfs information to check
whether huge pages are being used. And we might not get the expected
answer depending on the system load and memory utilization (the advised
pages might not be moved to large pages if there is not sufficient
memory).

>> [testcase and performance counter results snipped]
>>
>> For testing run make check on x86_64-linux-gnu with thp_pagesize=1
>> (directly on ptmalloc_init() after tunable initialiazation) and
>> with mmap_hugetlb=1 (also directly on ptmalloc_init()) with about
>> 10 large pages (so the fallback mmap() call is used) and with
>> 1024 large pages (so all mmap(MAP_HUGETLB) are successful).
>
> You could add tests similar to mcheck and malloc-check, i.e. add
> $(tests-hugepages) to run all malloc tests again with the various
> tunable values. See tests-mcheck for example.

Ok, I can work with this. This might not add much if the system is not
configured with either THP or with some huge page pool, but at least it
adds some coverage.

> [changelog and diffstat snipped]
On 8/19/21 4:56 PM, Adhemerval Zanella wrote:
> I though about it, and decided to use two tunables because although
> for mmap() system allocation both tunable are mutually exclusive
> (since it does not make sense to madvise() a mmap(MAP_HUGETLB)
> we still use sbrk() on main arena. [...]
>
> If one tunable is preferable I think it would be something like:
>
> 0: Disabled (default)
> 1: Transparent, where we emulate "always" behaviour of THP
>    sbrk() is also aligned to huge page size and issued madvise()
> 2: HugeTLB enabled with default hugepage size and sbrk() as
>    handled are 1
> <size>: HugeTLB enabled with the specified page size and sbrk()
>    are handled as 1
>
> By forcing the sbrk() and madvise() on all tunables value make
> the expectation to use huge pages in all possible occasions.

What do you think about using mmap instead of sbrk for (2) and <size> if
hugetlb is requested? It kinda emulates what libhugetlbfs does and makes
the behaviour more consistent with what is advertised by the tunables.

> Yeah, I am open to suggestions on how to properly test it. The issue
> is we need to have specific system configuration either by proper
> kernel support (THP) or with reserved large pages to actually test
> it.
>
> [THP observability discussion snipped]

For benchmarking we can make a minimal assumption that the user will set
the system up to appropriately isolate the benchmarks. As for the sysfs
setup, we can always test and bail if unsupported.

> Ok, I can work with this. This might not add much if the system is
> not configured with either THP or with some huge page pool but at
> least adds some coverage.

Yeah the main intent is to simply ensure that there are no differences
in behaviour with hugepages.

Siddhesh
On 19/08/2021 08:48, Siddhesh Poyarekar wrote:
> On 8/19/21 4:56 PM, Adhemerval Zanella wrote:
>> [single-tunable discussion snipped]
>
> What do you think about using mmap instead of sbrk for (2) and <size>
> if hugetlb is requested? It kinda emulates what libhugetlbfs does and
> makes the behaviour more consistent with what is advertised by the
> tunables.

I think this would be an additional tunable; we still need to handle the
case where mmap() fails in the default path (due to the maximum number
of mmap() calls per process enforced by the kernel, or when the pool is
exhausted for MAP_HUGETLB).

So for the sbrk() call, should we align the increment to the huge page
size and issue the madvise() if the tunable is set to use huge pages?
>>> A simple test like below in benchtests would be very useful to at
>>> least get an initial understanding of the behaviour differences
>>> with different tunable values.  Later those who care can add more
>>> relevant workloads.
>>
>> Yeah, I am open to suggestions on how to properly test it.  The
>> issue is we need a specific system configuration, either proper
>> kernel support (THP) or reserved large pages, to actually test it.
>>
>> For THP the issue is really 'transparent' for the user, which means
>> that we will need to poke at specific Linux sysfs information to
>> check if huge pages are being used.  And we might not get the
>> expected answer depending on the system load and memory utilization
>> (the advised pages might not be moved to large pages if there is
>> not sufficient memory).
>
> For benchmarking we can make a minimal assumption that the user will
> set the system up to appropriately isolate the benchmarks.  As for
> the sysfs setup, we can always test and bail out if unsupported.
>
>>> You could add tests similar to mcheck and malloc-check, i.e. add
>>> $(tests-hugepages) to run all malloc tests again with the various
>>> tunable values.  See tests-mcheck for example.
>>
>> Ok, I can work with this.  This might not add much if the system is
>> not configured with either THP or with some huge page pool, but at
>> least it adds some coverage.
>
> Yeah, the main intent is simply to ensure that there are no
> differences in behaviour with hugepages.

Alright, I will add some tunable usage then.
On 8/19/21 5:34 PM, Adhemerval Zanella wrote:
> I think this would be an additional tunable; we still need to handle
> the case where mmap() fails in the default path (due to the kernel's
> limit on the number of mmap() mappings per process, or when the pool
> is exhausted for MAP_HUGETLB).
>
> So for the sbrk() call, should we align the increment to the huge
> page size and issue the madvise() if the tunable is set to use huge
> pages?

Yeah, it's a reasonable compromise.  I've been thinking about getting
rid of max_mmaps too; I don't see much use for it anymore.

Siddhesh
On 19/08/2021 09:26, Siddhesh Poyarekar wrote:
> On 8/19/21 5:34 PM, Adhemerval Zanella wrote:
>> I think this would be an additional tunable; we still need to handle
>> the case where mmap() fails in the default path (due to the kernel's
>> limit on the number of mmap() mappings per process, or when the pool
>> is exhausted for MAP_HUGETLB).
>>
>> So for the sbrk() call, should we align the increment to the huge
>> page size and issue the madvise() if the tunable is set to use huge
>> pages?
>
> Yeah, it's a reasonable compromise.  I've been thinking about getting
> rid of max_mmaps too; I don't see much use for it anymore.

I think it made sense when mmap() was much more costly, especially on
32-bit architectures.  On Linux the mapping count is still controlled
by a tunable, /proc/sys/vm/max_map_count, so there might still be
cases where one wants to avoid the overhead of the mmap() failure and
fall back to sbrk() directly.

But I agree that for the usual case where mmap() is used it does not
make much sense to use the tunable, since for cases like threaded
programs sbrk() does not help much.
Hi Adhemerval,

On 18 Aug 11:19, Adhemerval Zanella wrote:
> Linux currently supports two ways to use Huge Pages: either by using
> specific flags directly with the syscall (MAP_HUGETLB for mmap(), or
> SHM_HUGETLB for shmget()), or by using Transparent Huge Pages (THP),
> where the kernel will try to move allocated anonymous pages to Huge
> Pages blocks transparently to the application.

This approach looks good to me!  This is much appreciated.

Are you planning on tackling using the same tunables to allocate
additional heaps (in arena.c)?

It's a little more subtle because of the calls to mprotect(), which
need to be properly aligned for hugetlbfs, and probably for THP as
well (to avoid unnecessary page splitting).

One additional thing to address is the case where mmap() fails with
MAP_HUGETLB because HP allocation fails.  Reverting to the default
pages would match what libhugetlbfs does (i.e. just call mmap() again
without MAP_HUGETLB).  But I see that Siddhesh and you have already
been discussing this case.

Guillaume.
On 19/08/2021 13:42, Guillaume Morin wrote:
> Hi Adhemerval,
>
> On 18 Aug 11:19, Adhemerval Zanella wrote:
>> Linux currently supports two ways to use Huge Pages: either by using
>> specific flags directly with the syscall (MAP_HUGETLB for mmap(), or
>> SHM_HUGETLB for shmget()), or by using Transparent Huge Pages (THP),
>> where the kernel will try to move allocated anonymous pages to Huge
>> Pages blocks transparently to the application.
>
> This approach looks good to me!  This is much appreciated.
>
> Are you planning on tackling using the same tunables to allocate
> additional heaps (in arena.c)?
>
> It's a little more subtle because of the calls to mprotect(), which
> need to be properly aligned for hugetlbfs, and probably for THP as
> well (to avoid unnecessary page splitting).

What do you mean by additional heaps in this case?

> One additional thing to address is the case where mmap() fails with
> MAP_HUGETLB because HP allocation fails.  Reverting to the default
> pages would match what libhugetlbfs does (i.e. just call mmap() again
> without MAP_HUGETLB).  But I see that Siddhesh and you have already
> been discussing this case.

This is what I did in my patch; it follows the current default
allocation path.
On 19 Aug 13:55, Adhemerval Zanella wrote:
> On 19/08/2021 13:42, Guillaume Morin wrote:
>> Are you planning on tackling using the same tunables to allocate
>> additional heaps (in arena.c)?
>>
>> It's a little more subtle because of the calls to mprotect(), which
>> need to be properly aligned for hugetlbfs, and probably for THP as
>> well (to avoid unnecessary page splitting).
>
> What do you mean by additional heaps in this case?

I mean what is done in new_heap() in arena.c.

>> One additional thing to address is the case where mmap() fails with
>> MAP_HUGETLB because HP allocation fails.  Reverting to the default
>> pages would match what libhugetlbfs does (i.e. just call mmap()
>> again without MAP_HUGETLB).  But I see that Siddhesh and you have
>> already been discussing this case.
>
> This is what I did in my patch; it follows the current default
> allocation path.

Yes, you are right.  I misread.  You've been discussing adding a
tunable to decide whether that should fail or not.  My 2 cents as a
user: it's hard for me to imagine that users would like malloc() to
fail in this case.  Even if the admin allows surplus pages (i.e.
creating new HPs on the fly), this is far from guaranteed to succeed.

Guillaume.
On 19/08/2021 14:17, Guillaume Morin wrote:
> On 19 Aug 13:55, Adhemerval Zanella wrote:
>> On 19/08/2021 13:42, Guillaume Morin wrote:
>>> Are you planning on tackling using the same tunables to allocate
>>> additional heaps (in arena.c)?
>>>
>>> It's a little more subtle because of the calls to mprotect(), which
>>> need to be properly aligned for hugetlbfs, and probably for THP as
>>> well (to avoid unnecessary page splitting).
>>
>> What do you mean by additional heaps in this case?
>
> I mean what is done in new_heap() in arena.c.

Good catch, I hadn't taken the new_heap() code into consideration.  I
think we should use the same tunable to drive the hugepage usage in
this case as well.

>>> One additional thing to address is the case where mmap() fails with
>>> MAP_HUGETLB because HP allocation fails.  Reverting to the default
>>> pages would match what libhugetlbfs does (i.e. just call mmap()
>>> again without MAP_HUGETLB).  But I see that Siddhesh and you have
>>> already been discussing this case.
>>
>> This is what I did in my patch; it follows the current default
>> allocation path.
>
> Yes, you are right.  I misread.  You've been discussing adding a
> tunable to decide whether that should fail or not.  My 2 cents as a
> user: it's hard for me to imagine that users would like malloc() to
> fail in this case.  Even if the admin allows surplus pages (i.e.
> creating new HPs on the fly), this is far from guaranteed to succeed.
>
> Guillaume.