Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)

Message ID 87cz69tyla.fsf@dem-tschwing-1.ger.mentorg.com
State Superseded
Series Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)

Commit Message

Thomas Schwinge Feb. 16, 2023, 3:32 p.m. UTC
Hi!

On 2022-06-09T11:38:22+0200, I wrote:
> On 2022-06-07T13:28:33+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>> On 07/06/2022 13:10, Jakub Jelinek wrote:
>>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>>>> Following some feedback from users of the OG11 branch I think I need to
>>>> withdraw this patch, for now.
>>>>
>>>> The memory pinned via the mlock call does not give the expected performance
>>>> boost. I had not expected that it would do much in my test setup, given that
>>>> the machine has a lot of RAM and my benchmarks are small, but others have
>>>> tried more and on varying machines and architectures.
>>>
>>> I don't understand why there should be any expected performance boost (at
>>> least not unless the machine starts swapping out pages),
>>> { omp_atk_pinned, true } is solely about the requirement that the memory
>>> can't be swapped out.
>>
>> It seems like it takes a faster path through the NVidia drivers. This is
>> a black box, for me, but that seems like a plausible explanation. The
>> results are different on x86_64 and powerpc hosts (such as the Summit
>> supercomputer).
>
> For example, it's documented that 'cuMemHostAlloc',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> "Allocates page-locked host memory".  The crucial thing, though, what
> makes this different from 'malloc' plus 'mlock' is, that "The driver
> tracks the virtual memory ranges allocated with this function and
> automatically accelerates calls to functions such as cuMemcpyHtoD().
> Since the memory can be accessed directly by the device, it can be read
> or written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".
>
> Similar, for example, for 'cuMemAllocHost',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
>
> This, to me, would explain why "the mlock call does not give the expected
> performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
> with 'mlock' you're missing the "tracks the virtual memory ranges"
> aspect.
>
> Also, since the Nvidia Driver itself allocates the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?  I get this impression because the documentation goes on
> to state that "Allocating excessive amounts of memory with
> cuMemAllocHost() may degrade system performance, since it reduces the
> amount of memory available to the system for paging.  As a result, this
> function is best used sparingly to allocate staging areas for data
> exchange between host and device".
>
>>>> It seems that it isn't enough for the memory to be pinned, it has to be
>>>> pinned using the Cuda API to get the performance boost.
>>>
>>> For performance boost of what kind of code?
>>> I don't understand how Cuda API could be useful (or can be used at all) if
>>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
>>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
>>> in any way related to NVPTX offloading (unless it is in NVPTX target region
>>> obviously, but then mlock isn't available, so sure, if there is something
>>> CUDA can provide for that case, nice).
>>
>> This is specifically for NVPTX offload, of course, but then that's what
>> our customer is paying for.
>>
>> The expectation, from users, is that memory pinning will give the
>> benefits specific to the active device. We can certainly make that
>> happen when there is only one (flavour of) offload device present. I had
>> hoped it could be one way for all, but it looks like not.
>
> Aren't there CUDA Driver interfaces for that?  That is:
>
>>>> I had not done this
>>>> because it was difficult to resolve the code abstraction
>>>> difficulties and anyway the implementation was supposed to be device
>>>> independent, but it seems we need a specific pinning mechanism for each
>>>> device.
>
> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing is involved in that case either; presumably this
> interface likewise circumvents any "annoying" 'ulimit' limitations?)
>
> Such a *register* abstraction can then be implemented by all the libgomp
> offloading plugins: they just call the respective
> CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> memory.
>
> ..., but maybe I'm missing some crucial "detail" here?

Indeed this does appear to work; see attached
"[WIP] Attempt to register OpenMP pinned memory using a device instead of 'mlock'".
Any comments (aside from the TODOs that I'm still working on)?
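
For reference, the register-based approach boils down to something like
the following minimal sketch (CUDA Driver API; 'cuInit' and context
setup are assumed to have happened already, error handling is minimal,
and the 'mlock' fallback comment describes the intended policy rather
than what the attached patch literally does):

    #include <stdlib.h>
    #include <cuda.h>

    /* Pin an ordinary heap allocation by *registering* it with the
       CUDA Driver, instead of allocating it via 'cuMemHostAlloc'.  */
    static void *
    alloc_pinned (size_t size)
    {
      void *ptr = malloc (size);
      if (ptr != NULL
          && cuMemHostRegister (ptr, size, 0) != CUDA_SUCCESS)
        {
          /* Registration failed; a fallback to plain 'mlock' would
             go here.  */
          free (ptr);
          return NULL;
        }
      return ptr;
    }

    static void
    free_pinned (void *ptr)
    {
      if (ptr != NULL)
        cuMemHostUnregister (ptr);
      free (ptr);
    }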


Regards
 Thomas



Comments

Li, Pan2 via Gcc-patches Feb. 16, 2023, 4:17 p.m. UTC | #1
> -----Original Message-----
> From: Thomas Schwinge <thomas@codesourcery.com>
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs <ams@codesourcery.com>; Jakub Jelinek <jakub@redhat.com>;
> Tobias Burnus <tobias@codesourcery.com>; gcc-patches@gcc.gnu.org
> Subject: Attempt to register OpenMP pinned memory using a device instead of
> 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
> 
> [...]
> 
> Indeed this does appear to work; see attached
> "[WIP] Attempt to register OpenMP pinned memory using a device instead of
> 'mlock'".
> Any comments (aside from the TODOs that I'm still working on)?

The mmap implementation was not optimized for large numbers of small allocations, and I can't see that issue changing here, so I don't know whether this can serve as an mlockall replacement.

I had assumed that using the Cuda allocator would fix that limitation.

Andrew
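
(One conceivable mitigation, sketched here purely for illustration and
not implemented by either patch: register one large slab with the
driver once and carve small pinned allocations out of it, amortizing
the per-registration cost.  All names below are hypothetical.)

    #include <stdlib.h>
    #include <cuda.h>

    static char *slab_base;
    static size_t slab_used, slab_size;

    /* Register one big block up front...  */
    static int
    slab_init (size_t size)
    {
      slab_base = malloc (size);
      if (slab_base == NULL)
        return 0;
      if (cuMemHostRegister (slab_base, size, 0) != CUDA_SUCCESS)
        {
          free (slab_base);
          slab_base = NULL;
          return 0;
        }
      slab_size = size;
      return 1;
    }

    /* ...then hand out small pinned pieces without further driver
       calls.  (Bump allocation only, no reuse, to keep the sketch
       short.)  */
    static void *
    slab_alloc (size_t size)
    {
      size = (size + 63) & ~(size_t) 63;  /* 64-byte alignment.  */
      if (slab_used + size > slab_size)
        return NULL;
      void *p = slab_base + slab_used;
      slab_used += size;
      return p;
    }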
  

Patch

From 97707db8602430e57b9f1c9c34da6a54ad9e2da9 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 16 Feb 2023 15:57:37 +0100
Subject: [PATCH] [WIP] Attempt to register OpenMP pinned memory using a device
 instead of 'mlock'

Implemented for nvptx offloading via 'cuMemHostRegister'.

This re-works og12 commit ab7520b3b4cd9fdabfd63652badde478955bd3b5
"libgomp: pinned memory".
---
 include/cuda/cuda.h                          |   3 +
 libgomp/config/linux/allocator.c             |  74 +++++++++-
 libgomp/libgomp-plugin.h                     |   2 +
 libgomp/libgomp.h                            |   4 +
 libgomp/plugin/cuda-lib.def                  |   3 +
 libgomp/plugin/plugin-nvptx.c                |  48 +++++++
 libgomp/target.c                             | 137 +++++++++++++++++++
 libgomp/testsuite/libgomp.c/alloc-pinned-1.c |  25 ++++
 libgomp/testsuite/libgomp.c/alloc-pinned-2.c |  25 ++++
 libgomp/testsuite/libgomp.c/alloc-pinned-3.c |  43 +++++-
 libgomp/testsuite/libgomp.c/alloc-pinned-4.c |  43 +++++-
 libgomp/testsuite/libgomp.c/alloc-pinned-5.c |  25 ++++
 libgomp/testsuite/libgomp.c/alloc-pinned-6.c |  34 ++++-
 13 files changed, 447 insertions(+), 19 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 062d394b95f..b0c7636d318 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -183,6 +183,9 @@  CUresult cuMemAlloc (CUdeviceptr *, size_t);
 CUresult cuMemAllocHost (void **, size_t);
 CUresult cuMemAllocManaged(CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemHostAlloc (void **, size_t, unsigned int);
+#define cuMemHostRegister cuMemHostRegister_v2
+CUresult cuMemHostRegister(void *, size_t, unsigned int);
+CUresult cuMemHostUnregister(void *);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index f278e5cdf14..81e64b268e9 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -24,6 +24,10 @@ 
 
 /* Implement malloc routines that can handle pinned memory on Linux.
 
+   Given that pinned memory is typically used to help host <-> device memory
+   transfers, we attempt to register such using a device (really: libgomp
+   plugin), but fall back to mlock if no suitable device is available.
+
    It's possible to use mlock on any heap memory, but using munlock is
    problematic if there are multiple pinned allocations on the same page.
    Tracking all that manually would be possible, but adds overhead. This may
@@ -37,6 +41,7 @@ 
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <string.h>
+#include <assert.h>
 #include "libgomp.h"
 
 static bool always_pinned_mode = false;
@@ -53,9 +58,15 @@  GOMP_enable_pinned_mode ()
     always_pinned_mode = true;
 }
 
+static int using_device_for_register_page_locked
+  = /* uninitialized */ -1;
+
 static void *
 linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
@@ -71,11 +82,32 @@  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
       if (addr == MAP_FAILED)
 	return NULL;
 
-      if (mlock (addr, size))
+      int using_device
+	= __atomic_load_n (&using_device_for_register_page_locked,
+			   MEMMODEL_RELAXED);
+      gomp_debug (0, "  using_device=%d\n",
+		  using_device);
+      if (using_device != 0)
+	{
+	  using_device = gomp_register_page_locked (addr, size);
+	  int using_device_old
+	    = __atomic_exchange_n (&using_device_for_register_page_locked,
+				   using_device, MEMMODEL_RELAXED);
+	  gomp_debug (0, "  using_device=%d, using_device_old=%d\n",
+		      using_device, using_device_old);
+	  assert (using_device_old == -1
+		  /* We shouldn't have concurrently changed our mind.  */
+		  || using_device_old == using_device);
+	}
+      if (using_device == 0)
 	{
-	  gomp_debug (0, "libgomp: failed to pin memory (ulimit too low?)\n");
-	  munmap (addr, size);
-	  return NULL;
+	  gomp_debug (0, "  mlock\n");
+	  if (mlock (addr, size))
+	    {
+	      gomp_debug (0, "libgomp: failed to pin memory (ulimit too low?)\n");
+	      munmap (addr, size);
+	      return NULL;
+	    }
 	}
 
       return addr;
@@ -87,6 +119,9 @@  linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
 static void *
 linux_memspace_calloc (omp_memspace_handle_t memspace, size_t size, int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, (unsigned long long) size, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
@@ -107,13 +142,28 @@  static void
 linux_memspace_free (omp_memspace_handle_t memspace, void *addr, size_t size,
 		     int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, addr=%p, size=%llu, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) size, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
   if (memspace == ompx_unified_shared_mem_space)
     gomp_usm_free (addr, GOMP_DEVICE_ICV);
   else if (pin)
-    munmap (addr, size);
+    {
+      int using_device
+	= __atomic_load_n (&using_device_for_register_page_locked,
+			   MEMMODEL_RELAXED);
+      gomp_debug (0, "  using_device=%d\n",
+		  using_device);
+      if (using_device == 1)
+	gomp_unregister_page_locked (addr, size);
+      else
+	/* 'munlock'ing is implicit with following 'munmap'.  */
+	;
+      munmap (addr, size);
+    }
   else
     free (addr);
 }
@@ -122,6 +172,9 @@  static void *
 linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 			size_t oldsize, size_t size, int oldpin, int pin)
 {
+  gomp_debug (0, "%s: memspace=%llu, addr=%p, oldsize=%llu, size=%llu, oldpin=%d, pin=%d\n",
+	      __FUNCTION__, (unsigned long long) memspace, addr, (unsigned long long) oldsize, (unsigned long long) size, oldpin, pin);
+
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
@@ -129,6 +182,17 @@  linux_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
     goto manual_realloc;
   else if (oldpin && pin)
     {
+      /* We can only expect to be able to just 'mremap' if not using a device
+	 for registering page-locked memory.  */
+      int using_device
+	= __atomic_load_n (&using_device_for_register_page_locked,
+		       MEMMODEL_RELAXED);
+      gomp_debug (0, "  using_device=%d\n",
+		  using_device);
+      if (using_device != 0)
+	goto manual_realloc;
+
+      gomp_debug (0, "  mremap\n");
       void *newaddr = mremap (addr, oldsize, size, MREMAP_MAYMOVE);
       if (newaddr == MAP_FAILED)
 	return NULL;
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index bb79ef8d9d7..345fc62d4f5 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -144,6 +144,8 @@  extern bool GOMP_OFFLOAD_free (int, void *);
 extern void *GOMP_OFFLOAD_usm_alloc (int, size_t);
 extern bool GOMP_OFFLOAD_usm_free (int, void *);
 extern bool GOMP_OFFLOAD_is_usm_ptr (void *);
+extern bool GOMP_OFFLOAD_register_page_locked (void *, size_t);
+extern bool GOMP_OFFLOAD_unregister_page_locked (void *, size_t);
 extern bool GOMP_OFFLOAD_dev2host (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_host2dev (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index f6fab788519..f8cf04746ac 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1136,6 +1136,8 @@  extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
 			     void *);
 extern void * gomp_usm_alloc (size_t size, int device_num);
 extern void gomp_usm_free (void *device_ptr, int device_num);
+extern bool gomp_register_page_locked (void *, size_t);
+extern void gomp_unregister_page_locked (void *, size_t);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
@@ -1395,6 +1397,8 @@  struct gomp_device_descr
   __typeof (GOMP_OFFLOAD_usm_alloc) *usm_alloc_func;
   __typeof (GOMP_OFFLOAD_usm_free) *usm_free_func;
   __typeof (GOMP_OFFLOAD_is_usm_ptr) *is_usm_ptr_func;
+  __typeof (GOMP_OFFLOAD_register_page_locked) *register_page_locked_func;
+  __typeof (GOMP_OFFLOAD_unregister_page_locked) *unregister_page_locked_func;
   __typeof (GOMP_OFFLOAD_dev2host) *dev2host_func;
   __typeof (GOMP_OFFLOAD_host2dev) *host2dev_func;
   __typeof (GOMP_OFFLOAD_dev2dev) *dev2dev_func;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index 9b786c9f2f6..8dbaadf848e 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -31,6 +31,9 @@  CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
 CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemHostAlloc)
+CUDA_ONE_CALL_MAYBE_NULL (cuMemHostRegister_v2)
+CUDA_ONE_CALL (cuMemHostRegister)
+CUDA_ONE_CALL (cuMemHostUnregister)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 2ebf17728fa..cbdf466dd05 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -77,11 +77,14 @@  extern CUresult cuGetErrorString (CUresult, const char **);
 CUresult cuLinkAddData (CUlinkState, CUjitInputType, void *, size_t,
 			const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate (unsigned, CUjit_option *, void **, CUlinkState *);
+#undef cuMemHostRegister
+CUresult cuMemHostRegister (void *, size_t, unsigned int);
 #else
 typedef size_t (*CUoccupancyB2DSize)(int);
 CUresult cuLinkAddData_v2 (CUlinkState, CUjitInputType, void *, size_t,
 			   const char *, unsigned, CUjit_option *, void **);
 CUresult cuLinkCreate_v2 (unsigned, CUjit_option *, void **, CUlinkState *);
+CUresult cuMemHostRegister_v2 (void *, size_t, unsigned int);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
 					  CUoccupancyB2DSize, size_t, int);
 #endif
@@ -361,6 +364,9 @@  nvptx_thread (void)
 static bool
 nvptx_init (void)
 {
+  GOMP_PLUGIN_debug (0, "%s\n",
+		     __FUNCTION__);
+
   int ndevs;
 
   if (instantiated_devices != 0)
@@ -614,6 +620,9 @@  nvptx_close_device (struct ptx_device *ptx_dev)
 static int
 nvptx_get_num_devices (void)
 {
+  GOMP_PLUGIN_debug (0, "%s\n",
+		     __FUNCTION__);
+
   int n;
 
   /* This function will be called before the plugin has been initialized in
@@ -1704,6 +1713,45 @@  GOMP_OFFLOAD_is_usm_ptr (void *ptr)
   return managed;
 }
 
+bool
+GOMP_OFFLOAD_register_page_locked (void *ptr, size_t size)
+{
+  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu\n",
+		     __FUNCTION__, ptr, (unsigned long long) size);
+
+  // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223
+  // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c
+
+  /* 'cuMemHostRegister' "page-locks the memory range specified".  */
+
+  unsigned int flags = /*TODO*/ 0;
+#if 0
+  //TODO
+#define CU_MEMHOSTREGISTER_PORTABLE 0x01
+  flags |= CU_MEMHOSTREGISTER_PORTABLE;
+#endif
+  //TODO Do we need some more elaborate error management instead of this 'return false' for '!CUDA_SUCCESS'?
+  if (CUDA_CALL_EXISTS (cuMemHostRegister_v2))
+    CUDA_CALL (cuMemHostRegister_v2, ptr, size, flags);
+  else
+    CUDA_CALL (cuMemHostRegister, ptr, size, flags);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_unregister_page_locked (void *ptr, size_t size)
+{
+  GOMP_PLUGIN_debug (0, "nvptx %s: ptr=%p, size=%llu\n",
+		     __FUNCTION__, ptr, (unsigned long long) size);
+
+  // https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g63f450c8125359be87b7623b1c0b2a14
+  // https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g81fd4101862bbefdb42a62d60e515eea
+
+  //TODO Do we need some more elaborate error management instead of this 'return false' for '!CUDA_SUCCESS'?
+  CUDA_CALL (cuMemHostUnregister, ptr);
+  return true;
+}
+
 void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 			   void **hostaddrs, void **devaddrs,
diff --git a/libgomp/target.c b/libgomp/target.c
index 1b911c9bdb9..e7285188d1e 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4584,6 +4584,141 @@  gomp_usm_free (void *device_ptr, int device_num)
   gomp_mutex_unlock (&devicep->lock);
 }
 
+
+/* Device (really: libgomp plugin) for registering paged-locked memory.  We
+   assume there is either none or exactly one such device for the lifetime of
+   the process.  */
+
+static struct gomp_device_descr *device_for_register_page_locked
+  = /* uninitialized */ (void *) -1;
+
+static struct gomp_device_descr *
+get_device_for_register_page_locked (void)
+{
+  gomp_debug (0, "%s\n",
+	      __FUNCTION__);
+
+  struct gomp_device_descr *device;
+#ifdef HAVE_SYNC_BUILTINS
+  device
+    = __atomic_load_n (&device_for_register_page_locked, MEMMODEL_RELAXED);
+  if (device == (void *) -1)
+    {
+      gomp_debug (0, "  init\n");
+
+      gomp_init_targets_once ();
+
+      device = NULL;
+      for (int i = 0; i < num_devices; ++i)
+	{
+	  gomp_debug (0, "  i=%d, target_id=%d\n",
+		      i, devices[i].target_id);
+
+	  /* We consider only the first device of potentially several of the
+	     same type as this functionality is not specific to an individual
+	     offloading device, but instead relates to the host-side
+	     implementation of the respective offloading implementation.  */
+	  if (devices[i].target_id != 0)
+	    continue;
+
+	  if (!devices[i].register_page_locked_func)
+	    continue;
+
+	  gomp_debug (0, "  found device: %p (%s)\n",
+		      &devices[i], devices[i].name);
+	  if (device)
+	    gomp_fatal ("Unclear how %s and %s libgomp plugins may"
+			" simultaneously provide functionality"
+			" to register page-locked memory",
+			device->name, devices[i].name);
+	  else
+	    device = &devices[i];
+	}
+
+      struct gomp_device_descr *device_old
+	= __atomic_exchange_n (&device_for_register_page_locked, device,
+			       MEMMODEL_RELAXED);
+      gomp_debug (0, "  old device_for_register_page_locked: %p\n",
+		  device_old);
+      assert (device_old == (void *) -1
+	      /* We shouldn't have concurrently found a different or no
+		 device.  */
+	      || device_old == device);
+    }
+#else /* !HAVE_SYNC_BUILTINS */
+  gomp_debug (0, "  not implemented for '!HAVE_SYNC_BUILTINS'\n");
+  (void) &device_for_register_page_locked;
+  device = NULL;
+#endif /* HAVE_SYNC_BUILTINS */
+
+  gomp_debug (0, "  -> device=%p (%s)\n",
+	      device, device ? device->name : "[none]");
+  return device;
+}
+
+/* Register page-locked memory region.
+   Returns whether we have a device capable of that.  */
+
+attribute_hidden bool
+gomp_register_page_locked (void *ptr, size_t size)
+{
+  gomp_debug (0, "%s: ptr=%p, size=%llu\n",
+	      __FUNCTION__, ptr, (unsigned long long) size);
+
+  struct gomp_device_descr *device = get_device_for_register_page_locked ();
+  gomp_debug (0, "  device=%p (%s)\n",
+	      device, device ? device->name : "[none]");
+  if (device)
+    {
+      gomp_mutex_lock (&device->lock);
+      if (device->state == GOMP_DEVICE_UNINITIALIZED)
+	gomp_init_device (device);
+      else if (device->state == GOMP_DEVICE_FINALIZED)
+	{
+	  gomp_mutex_unlock (&device->lock);
+	  gomp_fatal ("Device %s for registering page-locked memory"
+		      " is finalized", device->name);
+	}
+      gomp_mutex_unlock (&device->lock);
+
+      if (!device->register_page_locked_func (ptr, size))
+	gomp_fatal ("Failed to register page-locked memory"
+		    " via %s libgomp plugin",
+		    device->name);
+    }
+  return device != NULL;
+}
+
+/* Unregister page-locked memory region.
+   This must only be called if 'gomp_register_page_locked' returned 'true'.  */
+
+attribute_hidden void
+gomp_unregister_page_locked (void *ptr, size_t size)
+{
+  gomp_debug (0, "%s: ptr=%p\n",
+	      __FUNCTION__, ptr);
+
+  struct gomp_device_descr *device = get_device_for_register_page_locked ();
+  gomp_debug (0, "  device=%p (%s)\n",
+	      device, device ? device->name : "[none]");
+  assert (device);
+
+  gomp_mutex_lock (&device->lock);
+  assert (device->state != GOMP_DEVICE_UNINITIALIZED);
+  if (device->state == GOMP_DEVICE_FINALIZED)
+    {
+      gomp_mutex_unlock (&device->lock);
+      return;
+    }
+  gomp_mutex_unlock (&device->lock);
+
+  if (!device->unregister_page_locked_func (ptr, size))
+    gomp_fatal ("Failed to unregister page-locked memory"
+		" via %s libgomp plugin",
+		device->name);
+}
+
+
 int
 omp_target_is_present (const void *ptr, int device_num)
 {
@@ -5268,6 +5403,8 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM_OPT (usm_alloc, usm_alloc);
   DLSYM_OPT (usm_free, usm_free);
   DLSYM_OPT (is_usm_ptr, is_usm_ptr);
+  DLSYM_OPT (register_page_locked, register_page_locked);
+  DLSYM_OPT (unregister_page_locked, unregister_page_locked);
   DLSYM (dev2host);
   DLSYM (host2dev);
   DLSYM (evaluate_device);
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
index fb7ac8b0080..bd71e22b003 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-1.c
@@ -2,6 +2,8 @@ 
 
 /* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory works.  */
 
 #include <stdio.h>
@@ -67,9 +69,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE;
   CHECK_SIZE (SIZE*3);
+#endif
 
   const omp_alloctrait_t traits[] = {
       { omp_atk_pinned, 1 }
@@ -85,19 +92,37 @@  main ()
     abort ();
 
   int amount = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount != 0)
+    abort ();
+#else
   if (amount == 0)
     abort ();
+#endif
 
   p = omp_realloc (p, SIZE*2, allocator, allocator);
 
   int amount2 = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount2 != 0)
+    abort ();
+#else
   if (amount2 <= amount)
     abort ();
+#endif
 
   p = omp_calloc (1, SIZE, allocator);
 
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (get_pinned_mem () != 0)
+    abort ();
+#else
   if (get_pinned_mem () <= amount2)
     abort ();
+#endif
 
   verify0 (p, SIZE);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
index 651b89fb42f..c71248b046d 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-2.c
@@ -2,6 +2,8 @@ 
 
 /* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory works (pool_size code path).  */
 
 #include <stdio.h>
@@ -67,9 +69,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE;
   CHECK_SIZE (SIZE*3);
+#endif
 
   const omp_alloctrait_t traits[] = {
       { omp_atk_pinned, 1 },
@@ -87,23 +94,41 @@  main ()
     abort ();
 
   int amount = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount != 0)
+    abort ();
+#else
   if (amount == 0)
     abort ();
+#endif
 
   p = omp_realloc (p, SIZE*2, allocator, allocator);
   if (!p)
     abort ();
 
   int amount2 = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount2 != 0)
+    abort ();
+#else
   if (amount2 <= amount)
     abort ();
+#endif
 
   p = omp_calloc (1, SIZE, allocator);
   if (!p)
     abort ();
 
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (get_pinned_mem () != 0)
+    abort ();
+#else
   if (get_pinned_mem () <= amount2)
     abort ();
+#endif
 
   verify0 (p, SIZE);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
index f41797881ef..26b0c352d85 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-3.c
@@ -1,5 +1,7 @@ 
 /* { dg-do run } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory fails correctly.  */
 
 #include <stdio.h>
@@ -74,8 +76,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* This needs to be large enough to cover multiple pages.  */
   const int SIZE = PAGE_SIZE*4;
+#endif
+  const int PIN_LIMIT = PAGE_SIZE*2;
 
   /* Pinned memory, no fallback.  */
   const omp_alloctrait_t traits1[] = {
@@ -92,21 +100,33 @@  main ()
   omp_allocator_handle_t allocator2 = omp_init_allocator (omp_default_mem_space, 2, traits2);
 
   /* Ensure that the limit is smaller than the allocation.  */
-  set_pin_limit (SIZE/2);
+  set_pin_limit (PIN_LIMIT);
 
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
 
-  // Should fail
   void *p = omp_alloc (SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail
   p = omp_calloc (1, SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
   // Should fall back
   p = omp_alloc (SIZE, allocator2);
@@ -119,16 +139,29 @@  main ()
     abort ();
   verify0 (p, SIZE);
 
-  // Should fail to realloc
   void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
   p = omp_realloc (notpinned, SIZE, allocator1, omp_default_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'; does reallocate.
+  if (!notpinned || !p || p == notpinned)
+    abort ();
+#else
+  // Should fail to realloc
   if (!notpinned || p)
     abort ();
+#endif
 
-  // Should fall back to no realloc needed
+#ifdef OFFLOAD_DEVICE_NVPTX
+  void *p_ = omp_realloc (p, SIZE, allocator2, allocator1);
+  // Does reallocate.
+  if (p_ == p)
+    abort ();
+#else
   p = omp_realloc (notpinned, SIZE, allocator2, omp_default_mem_alloc);
+  // Should fall back to no realloc needed
   if (p != notpinned)
     abort ();
+#endif
 
   // No memory should have been pinned
   int amount = get_pinned_mem ();
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
index a878da8c558..0bd6a552d94 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-4.c
@@ -1,5 +1,7 @@ 
 /* { dg-do run } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that pinned memory fails correctly, pool_size code path.  */
 
 #include <stdio.h>
@@ -74,8 +76,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* This needs to be large enough to cover multiple pages.  */
   const int SIZE = PAGE_SIZE*4;
+#endif
+  const int PIN_LIMIT = PAGE_SIZE*2;
 
   /* Pinned memory, no fallback.  */
   const omp_alloctrait_t traits1[] = {
@@ -94,21 +102,33 @@  main ()
   omp_allocator_handle_t allocator2 = omp_init_allocator (omp_default_mem_space, 3, traits2);
 
   /* Ensure that the limit is smaller than the allocation.  */
-  set_pin_limit (SIZE/2);
+  set_pin_limit (PIN_LIMIT);
 
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
 
-  // Should fail
   void *p = omp_alloc (SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail
   p = omp_calloc (1, SIZE, allocator1);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
   // Should fall back
   p = omp_alloc (SIZE, allocator2);
@@ -121,16 +141,29 @@  main ()
     abort ();
   verify0 (p, SIZE);
 
-  // Should fail to realloc
   void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
   p = omp_realloc (notpinned, SIZE, allocator1, omp_default_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'; does reallocate.
+  if (!notpinned || !p || p == notpinned)
+    abort ();
+#else
+  // Should fail to realloc
   if (!notpinned || p)
     abort ();
+#endif
 
-  // Should fall back to no realloc needed
+#ifdef OFFLOAD_DEVICE_NVPTX
+  void *p_ = omp_realloc (p, SIZE, allocator2, allocator1);
+  // Does reallocate.
+  if (p_ == p)
+    abort ();
+#else
   p = omp_realloc (notpinned, SIZE, allocator2, omp_default_mem_alloc);
+  // Should fall back to no realloc needed
   if (p != notpinned)
     abort ();
+#endif
 
   // No memory should have been pinned
   int amount = get_pinned_mem ();
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
index 65983b3d03d..623c96a78e3 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-5.c
@@ -2,6 +2,8 @@ 
 
 /* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that ompx_pinned_mem_alloc works.  */
 
 #include <stdio.h>
@@ -67,9 +69,14 @@  verify0 (char *p, size_t s)
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE;
   CHECK_SIZE (SIZE*3);
+#endif
 
   // Sanity check
   if (get_pinned_mem () != 0)
@@ -80,19 +87,37 @@  main ()
     abort ();
 
   int amount = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount != 0)
+    abort ();
+#else
   if (amount == 0)
     abort ();
+#endif
 
   p = omp_realloc (p, SIZE*2, ompx_pinned_mem_alloc, ompx_pinned_mem_alloc);
 
   int amount2 = get_pinned_mem ();
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (amount2 != 0)
+    abort ();
+#else
   if (amount2 <= amount)
     abort ();
+#endif
 
   p = omp_calloc (1, SIZE, ompx_pinned_mem_alloc);
 
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* This doesn't show up as process 'VmLck'ed memory.  */
+  if (get_pinned_mem () != 0)
+    abort ();
+#else
   if (get_pinned_mem () <= amount2)
     abort ();
+#endif
 
   verify0 (p, SIZE);
 
diff --git a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
index bbe20c04875..c0f8b260e37 100644
--- a/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
+++ b/libgomp/testsuite/libgomp.c/alloc-pinned-6.c
@@ -1,5 +1,7 @@ 
 /* { dg-do run } */
 
+/* { dg-additional-options -DOFFLOAD_DEVICE_NVPTX { target offload_device_nvptx } } */
+
 /* Test that ompx_pinned_mem_alloc fails correctly.  */
 
 #include <stdio.h>
@@ -66,31 +68,55 @@  set_pin_limit ()
 int
 main ()
 {
+#ifdef OFFLOAD_DEVICE_NVPTX
+  /* Go big or go home.  */
+  const int SIZE = 40 * 1024 * 1024;
+#else
   /* Allocate at least a page each time, but stay within the ulimit.  */
   const int SIZE = PAGE_SIZE*4;
+#endif
+  const int PIN_LIMIT = PAGE_SIZE*2;
 
   /* Ensure that the limit is smaller than the allocation.  */
-  set_pin_limit (SIZE/2);
+  set_pin_limit (PIN_LIMIT);
 
   // Sanity check
   if (get_pinned_mem () != 0)
     abort ();
 
-  // Should fail
   void *p = omp_alloc (SIZE, ompx_pinned_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail
   p = omp_calloc (1, SIZE, ompx_pinned_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'.
+  if (!p)
+    abort ();
+#else
+  // Should fail
   if (p)
     abort ();
+#endif
 
-  // Should fail to realloc
   void *notpinned = omp_alloc (SIZE, omp_default_mem_alloc);
   p = omp_realloc (notpinned, SIZE, ompx_pinned_mem_alloc, omp_default_mem_alloc);
+#ifdef OFFLOAD_DEVICE_NVPTX
+  // Doesn't care about 'set_pin_limit'; does reallocate.
+  if (!notpinned || !p || p == notpinned)
+    abort ();
+#else
+  // Should fail to realloc
   if (!notpinned || p)
     abort ();
+#endif
 
   // No memory should have been pinned
   int amount = get_pinned_mem ();
-- 
2.25.1