nvptx: 'cuDeviceGetCount' failure is fatal (was: [Patch] OpenMP: Move omp requires checks to libgomp)
Checks
Context |
Check |
Description |
linaro-tcwg-bot/tcwg_gcc_build--master-arm |
success
|
Testing passed
|
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 |
success
|
Testing passed
|
linaro-tcwg-bot/tcwg_gcc_check--master-arm |
success
|
Testing passed
|
linaro-tcwg-bot/tcwg_gcc_check--master-aarch64 |
success
|
Testing passed
|
Commit Message
Hi!
On 2022-06-08T05:56:02+0200, Tobias Burnus <Tobias_Burnus@mentor.com> wrote:
> [...] On the libgomp side: The devices which do not fulfill the requirements are
> now filtered out. [...]
> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c
> /* Return the number of GCN devices on the system. */
>
> int
> -GOMP_OFFLOAD_get_num_devices (void)
> +GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
> {
> if (!init_hsa_context ())
> return 0;
> + /* Return -1 if no omp_requires_mask cannot be fulfilled but
> + devices were present. */
> + if (hsa_context.agent_count > 0 && omp_requires_mask != 0)
> + return -1;
> return hsa_context.agent_count;
> }
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> int
> -GOMP_OFFLOAD_get_num_devices (void)
> +GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
> {
> - return nvptx_get_num_devices ();
> + int num_devices = nvptx_get_num_devices ();
> + /* Return -1 if no omp_requires_mask cannot be fulfilled but
> + devices were present. */
> + if (num_devices > 0 && omp_requires_mask != 0)
> + return -1;
> + return num_devices;
> }
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -4132,8 +4183,19 @@ gomp_target_init (void)
>
> if (gomp_load_plugin_for_device (¤t_device, plugin_name))
> {
> - new_num_devs = current_device.get_num_devices_func ();
> - if (new_num_devs >= 1)
> + new_num_devs = current_device.get_num_devices_func (requires_mask);
> + if (new_num_devs < 0)
> + {
> + [...]
> + }
> + else if (new_num_devs >= 1)
> {
> /* Augment DEVICES and NUM_DEVICES. */
OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"?
Grüße
Thomas
Comments
Hi Thomas,
Thomas Schwinge wrote:
>> /* Return the number of GCN devices on the system. */
>> int
>> -GOMP_OFFLOAD_get_num_devices (void)
>> +GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
>> {
>> if (!init_hsa_context ())
>> return 0;
>> + /* Return -1 if no omp_requires_mask cannot be fulfilled but
>> + devices were present. */
>> + if (hsa_context.agent_count > 0 && omp_requires_mask != 0)
>> + return -1;
>> return hsa_context.agent_count;
>> }
...
> OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"?
I think the real question is: what does a 'cuDeviceGetCount' fail mean?
Does it mean a serious error – or could it just be a permissions issue
such that the user has no device access but otherwise is fine?
Because if it is, e.g., a permission problem – just returning '0' (no
devices) would seem to be the proper solution.
But if it is expected to be always something serious, well, then a fatal
error makes more sense.
The possible exit codes are:
CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE
which does not really help.
My impression is that 0 is usually returned if something goes wrong
(e.g. with permissions) such that an error is a real exception. But all
three choices seem to make about equally sense: either host fallback
(with 0 or -1) or a fatal error.
Tobias
Hi Tobias!
On 2024-03-07T15:28:21+0100, Tobias Burnus <tburnus@baylibre.com> wrote:
> Thomas Schwinge wrote:
>> OK to push the attached "nvptx: 'cuDeviceGetCount' failure is fatal"?
>
> I think the real question is: what does a 'cuDeviceGetCount' fail mean?
Internally to the CUDA stack: the error codes that you've cited below.
Per the state we're in when calling 'cuDeviceGetCount', we only expect
'CUDA_SUCCESS'. Therefore, in our actual use: anything else means a
fatal condition that we don't attempt to recover from, like for most of
all other device access failures.
> Does it mean a serious error – or could it just be a permissions issue
> such that the user has no device access but otherwise is fine?
As you can see, we've done a 'cuInit' right before, so in case there was
any permission issue (or similar), that's already settled (in whichever
way) by the time we do the 'cuDeviceGetCount'.
> Because if it is, e.g., a permission problem – just returning '0' (no
> devices) would seem to be the proper solution.
>
> But if it is expected to be always something serious, well, then a fatal
> error makes more sense.
ACK; pushed in commit 37078f241a22c45db6380c5e9a79b4d08054bb3d.
Grüße
Thomas
> The possible exit codes are:
>
> CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,
> CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE
>
> which does not really help.
>
> My impression is that 0 is usually returned if something goes wrong
> (e.g. with permissions) such that an error is a real exception. But all
> three choices seem to make about equally sense: either host fallback
> (with 0 or -1) or a fatal error.
>
> Tobias
From 8090da93cb00e4aa47a8b21b6548d739b2cebc49 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <tschwinge@baylibre.com>
Date: Thu, 7 Mar 2024 13:18:23 +0100
Subject: [PATCH] nvptx: 'cuDeviceGetCount' failure is fatal
Per commit 683f11843974f0bdf42f79cdcbb0c2b43c7b81b0
"OpenMP: Move omp requires checks to libgomp", we're now using 'return -1'
from 'GOMP_OFFLOAD_get_num_devices' for 'omp_requires_mask' purposes. This
missed that via 'nvptx_get_num_devices', we could also 'return -1' for
'cuDeviceGetCount' failure. Before, this meant (in 'gomp_target_init') to
silently ignore the plugin/device -- which also has been doubtful behavior.
Let's instead turn 'cuDeviceGetCount' failure into a fatal error, similar to
other errors during device initialization.
libgomp/
* plugin/plugin-nvptx.c (nvptx_get_num_devices):
'cuDeviceGetCount' failure is fatal.
---
libgomp/plugin/plugin-nvptx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
@@ -630,7 +630,7 @@ nvptx_get_num_devices (void)
}
}
- CUDA_CALL_ERET (-1, cuDeviceGetCount, &n);
+ CUDA_CALL_ASSERT (cuDeviceGetCount, &n);
return n;
}
--
2.34.1