[RFC,0/3] nptl: Introduce and use FUTEX_LOCK_PI2

Message ID 20210621111650.1164689-1-kurt@linutronix.de
Headers
Series nptl: Introduce and use FUTEX_LOCK_PI2 |

Message

Kurt Kanzenbach June 21, 2021, 11:16 a.m. UTC
  Hi,

Linux real time application often make use of mutexes with priority inheritance
enabled due to problems such as unbounded priority inversion. In addition, some
applications make use of timeouts e.g., by utilizing pthread_mutex_clocklock().

However, the combination of priority inheritance enabled mutexes with timeouts
based on CLOCK_MONOTONIC doesn't work. That is because the underlying Linux
kernel futex interface doesn't support it, yet. Using CLOCK_REALTIME is
possible, but it is subject to adjustments (NTP, PTP, ...). Therefore, Thomas
proposed to add a new futex operation called FUTEX_LOCK_PI2 with support for
selectable clocks [1]. That patch series is not merged, yet.

Here is a proof of concept patch set adding FUTEX_LOCK_PI2 in order to support
pthread_mutex_clocklock(MONOTONIC)/PI. The idea is to use futex_lock_pi2() by
default, and fallback to futex_lock_pi() in case the kernel is too old. I think,
you get the idea.

There's also a bugreport for glibc with a test case:

 https://sourceware.org/bugzilla/show_bug.cgi?id=26801

Thoughts?

Thanks,
Kurt

[1] - https://lkml.kernel.org/lkml/20210422194417.866740847@linutronix.de/

Kurt Kanzenbach (3):
  nptl: Introduce futex_lock_pi2()
  nptl: Use futex_lock_pi2()
  nptl: Include CLOCK_MONOTONIC in mutex tests

 nptl/pthread_mutex_timedlock.c            | 24 ++++--
 nptl/tst-mutexpi10.c                      |  8 ++
 sysdeps/nptl/futex-internal.h             | 94 +++++++++++++++++++++++
 sysdeps/nptl/lowlevellock-futex.h         |  1 +
 sysdeps/pthread/tst-mutex5.c              |  3 +-
 sysdeps/pthread/tst-mutex9.c              |  3 +-
 sysdeps/unix/sysv/linux/kernel-features.h |  8 ++
 7 files changed, 133 insertions(+), 8 deletions(-)
  

Comments

Adhemerval Zanella June 21, 2021, 9:32 p.m. UTC | #1
On 21/06/2021 08:16, Kurt Kanzenbach wrote:
> Hi,
> 
> Linux real time application often make use of mutexes with priority inheritance
> enabled due to problems such as unbounded priority inversion. In addition, some
> applications make use of timeouts e.g., by utilizing pthread_mutex_clocklock().
> 
> However, the combination of priority inheritance enabled mutexes with timeouts
> based on CLOCK_MONOTONIC doesn't work. That is because the underlying Linux
> kernel futex interface doesn't support it, yet. Using CLOCK_REALTIME is
> possible, but it is subject to adjustments (NTP, PTP, ...). Therefore, Thomas
> proposed to add a new futex operation called FUTEX_LOCK_PI2 with support for
> selectable clocks [1]. That patch series is not merged, yet.
> 
> Here is a proof of concept patch set adding FUTEX_LOCK_PI2 in order to support
> pthread_mutex_clocklock(MONOTONIC)/PI. The idea is to use futex_lock_pi2() by
> default, and fallback to futex_lock_pi() in case the kernel is too old. I think,
> you get the idea.
> 
> There's also a bugreport for glibc with a test case:
> 
>  https://sourceware.org/bugzilla/show_bug.cgi?id=26801
> 
> Thoughts?
> 
> Thanks,
> Kurt

Currently we check for PI mutex support on pthread_mutex_init which basically
check for futex_cmpxchg_enabled within kernel (so it fails only on a handful
configurations).

For FUTEX_LOCK_PI2 I think we will need to rework it, since we are moving
the futex PI failure from pthread_mutex_init to pthread_mutex_{timed}lock.
It means that we will need to remove the prio_inherit_missing() test on
pthread_mutex_init and make the pthread_mutex_{timed}lock fail instead
(not sure if we should use ENOTSUP or keep with current EINVAL).

The proposed futex_lockpi2_supported() incurs in an extra syscall on every
mutex slow path, we should avoid it.  I would like also to avoid the CRIU 
issue as well, so I think it would be better to avoid any caching (as done 
by prio_inherit_missing()), and optimize the FUTEX_LOCK_PI2 to be used only 
when the timeout for clock different than CLOCK_REALTIME is required.  

So instead it would be better to move the logic solely on futex_lock_pi() 
(I am assuming FUTEX_LOCK_PI2 would be only added for futex_time64):

  static __always_inline int
  futex_lock_pi2_64 (int *futex_word, const struct __timespec64 *abtime,
                     int private)
  {
  # if __ASSUME_FUTEX_LOCK_PI2
    return INTERNAL_SYSCALL_CALL (futex_time64, futex_word,
                                  __lll_private_flag (FUTEX_LOCK_PI2, private), 0,
                                  abstime);
  # else
    if (abstime != NULL && clockid != CLOCK_REALTIME)
      return INTERNAL_SYSCALL_CALL (futex_time64, futex_word,
		  		    __lll_private_flag (FUTEX_LOCK_PI2, private), 0,
				    abstime);
    else
      {
        int err =  INTERNAL_SYSCALL_CALL (futex_time64, futex_word,
                                          __lll_private_flag (FUTEX_LOCK_PI, private), 0,
                                          abstime);
        if (err == -ENOSYS)
          err = -EOVERFLOW;
      }
  # endif /* __ASSUME_FUTEX_LOCK_PI2  */
  }

  static __always_inline int
  futex_lock_pi64 (int *futex_word, const struct __timespec64 *abstime,
                   int private)
  {
    int err;
  #ifdef __ASSUME_TIME64_SYSCALLS
    err = futex_lock_pi2_64 (futex_word, abstime, private);
  #else /* __ASSUME_TIME64_SYSCALLS  */
    bool need_time64 = abstime != NULL && !in_time_t_range (abstime->tv_sec)
    if (need_time64)
      {
        err = futex_lock_pi2_64 (futex_word, abstime, private);
      }
    else
      {
        struct timespec ts32;
        if (abstime != NULL)
          ts32 = valid_timespec64_to_timespec (*abstime);

        err = INTERNAL_SYSCALL_CALL (futex, futex_word, __lll_private_flag
                                     (FUTEX_LOCK_PI, private), 0,
                                     abstime != NULL ? &ts32 : NULL);
      }
  #endif /* __ASSUME_TIME64_SYSCALLS  */
   [...]
  }

It would make the changes on pthread_mutex code minimal, it would be only to
remove the extra check for clockid and adjust the comment.

Also, as Joseph has noted the __ASSUME_FUTEX_LOCK_PI2 should be the first
patch.  It also does not make sense to add the __ASSUME_FUTEX_LOCK_PI2 on
tests, they need to be kernel agnostic so you will need to handle a possible
EINVAL/ENOSUP failure instead.

> 
> [1] - https://lkml.kernel.org/lkml/20210422194417.866740847@linutronix.de/
> 
> Kurt Kanzenbach (3):
>   nptl: Introduce futex_lock_pi2()
>   nptl: Use futex_lock_pi2()
>   nptl: Include CLOCK_MONOTONIC in mutex tests
> 
>  nptl/pthread_mutex_timedlock.c            | 24 ++++--
>  nptl/tst-mutexpi10.c                      |  8 ++
>  sysdeps/nptl/futex-internal.h             | 94 +++++++++++++++++++++++
>  sysdeps/nptl/lowlevellock-futex.h         |  1 +
>  sysdeps/pthread/tst-mutex5.c              |  3 +-
>  sysdeps/pthread/tst-mutex9.c              |  3 +-
>  sysdeps/unix/sysv/linux/kernel-features.h |  8 ++
>  7 files changed, 133 insertions(+), 8 deletions(-)
>
  
Kurt Kanzenbach June 22, 2021, 7:26 a.m. UTC | #2
On Mon Jun 21 2021, Adhemerval Zanella wrote:
> Currently we check for PI mutex support on pthread_mutex_init which basically
> check for futex_cmpxchg_enabled within kernel (so it fails only on a handful
> configurations).
>
> For FUTEX_LOCK_PI2 I think we will need to rework it, since we are moving
> the futex PI failure from pthread_mutex_init to pthread_mutex_{timed}lock.
> It means that we will need to remove the prio_inherit_missing() test on
> pthread_mutex_init and make the pthread_mutex_{timed}lock fail instead
> (not sure if we should use ENOTSUP or keep with current EINVAL).
>
> The proposed futex_lockpi2_supported() incurs in an extra syscall on every
> mutex slow path, we should avoid it.

Yes, sure.

> I would like also to avoid the CRIU issue as well, so I think it would
> be better to avoid any caching (as done by prio_inherit_missing()),
> and optimize the FUTEX_LOCK_PI2 to be used only when the timeout for
> clock different than CLOCK_REALTIME is required.

OK.

>
> So instead it would be better to move the logic solely on futex_lock_pi() 
> (I am assuming FUTEX_LOCK_PI2 would be only added for futex_time64):

The kernel also adds FUTEX_LOCK_PI2 for the old system call interface.

>
>   static __always_inline int
>   futex_lock_pi2_64 (int *futex_word, const struct __timespec64 *abtime,
>                      int private)
>   {
>   # if __ASSUME_FUTEX_LOCK_PI2
>     return INTERNAL_SYSCALL_CALL (futex_time64, futex_word,
>                                   __lll_private_flag (FUTEX_LOCK_PI2, private), 0,
>                                   abstime);
>   # else
>     if (abstime != NULL && clockid != CLOCK_REALTIME)
>       return INTERNAL_SYSCALL_CALL (futex_time64, futex_word,
> 		  		    __lll_private_flag (FUTEX_LOCK_PI2, private), 0,
> 				    abstime);

At this point __ASSUME_FUTEX_LOCK_PI2 is false meaning the kernel does
not have FUTEX_LOCK_PI2 support. But, it calls FUTEX_LOCK_PI2 for
clockid monotonic. This will result in ENOSYS unless it's an old kernel
which is patched. Is that intended?

>     else
>       {
>         int err =  INTERNAL_SYSCALL_CALL (futex_time64, futex_word,
>                                           __lll_private_flag (FUTEX_LOCK_PI, private), 0,
>                                           abstime);
>         if (err == -ENOSYS)
>           err = -EOVERFLOW;
>       }
>   # endif /* __ASSUME_FUTEX_LOCK_PI2  */
>   }
>
>   static __always_inline int
>   futex_lock_pi64 (int *futex_word, const struct __timespec64 *abstime,
>                    int private)
>   {
>     int err;
>   #ifdef __ASSUME_TIME64_SYSCALLS
>     err = futex_lock_pi2_64 (futex_word, abstime, private);

Makes sense.

>   #else /* __ASSUME_TIME64_SYSCALLS  */
>     bool need_time64 = abstime != NULL && !in_time_t_range (abstime->tv_sec)
>     if (need_time64)
>       {
>         err = futex_lock_pi2_64 (futex_word, abstime, private);
>       }
>     else
>       {
>         struct timespec ts32;
>         if (abstime != NULL)
>           ts32 = valid_timespec64_to_timespec (*abstime);
>
>         err = INTERNAL_SYSCALL_CALL (futex, futex_word, __lll_private_flag
>                                      (FUTEX_LOCK_PI, private), 0,
>                                      abstime != NULL ? &ts32 : NULL);
>       }
>   #endif /* __ASSUME_TIME64_SYSCALLS  */
>    [...]
>   }
>
> It would make the changes on pthread_mutex code minimal, it would be only to
> remove the extra check for clockid and adjust the comment.

Well, that's an interesting point. I think the current check has to
stay, because there are two locking paths. Only the slow path calls
futex_lock_pi_64() which may result in ENOSYS for clock monotonic. But,
the fast path which doesn't call futex_lock_pi64() would succeed if the
check is removed.

Maybe something like this:

sysdeps/nptl/futex-internal.h:
|static __always_inline bool
|futex_lockpi2_supported (void)
|{
|  return __ASSUME_FUTEX_LOCK_PI2;
|}

nptl/pthread_mutex_timedlock.c:
|	if (__glibc_unlikely (! futex_lockpi2_supported () &&
|			      clockid != CLOCK_REALTIME))
|	  return EINVAL;

Or did I get something wrong?

>
> Also, as Joseph has noted the __ASSUME_FUTEX_LOCK_PI2 should be the first
> patch.

OK.

> It also does not make sense to add the __ASSUME_FUTEX_LOCK_PI2 on
> tests, they need to be kernel agnostic so you will need to handle a
> possible EINVAL/ENOSUP failure instead.

Agreed.

Thanks for your feedback. I'll rework it accordingly.

Thanks,
Kurt
  
Florian Weimer June 22, 2021, 7:29 a.m. UTC | #3
* Kurt Kanzenbach:

> At this point __ASSUME_FUTEX_LOCK_PI2 is false meaning the kernel does
> not have FUTEX_LOCK_PI2 support. But, it calls FUTEX_LOCK_PI2 for
> clockid monotonic. This will result in ENOSYS unless it's an old kernel
> which is patched. Is that intended?

That's not quite how the __ASSUME_* macros work.  If not defined, it
just means that glibc won't assume that the kernel feature is there.  It
can still use it (the run-time kernel might have it after all), but
glibc has to check for the feature and has to compile in some sort of
fallback code.

Thanks,
Florian
  
Kurt Kanzenbach June 22, 2021, 8:55 a.m. UTC | #4
On Tue Jun 22 2021, Florian Weimer wrote:
> * Kurt Kanzenbach:
>
>> At this point __ASSUME_FUTEX_LOCK_PI2 is false meaning the kernel does
>> not have FUTEX_LOCK_PI2 support. But, it calls FUTEX_LOCK_PI2 for
>> clockid monotonic. This will result in ENOSYS unless it's an old kernel
>> which is patched. Is that intended?
>
> That's not quite how the __ASSUME_* macros work.  If not defined, it
> just means that glibc won't assume that the kernel feature is there.  It
> can still use it (the run-time kernel might have it after all), but
> glibc has to check for the feature and has to compile in some sort of
> fallback code.

OK, makes sense. Thanks for the explanation.

Thanks,
Kurt
  
Adhemerval Zanella June 22, 2021, 12:30 p.m. UTC | #5
On 22/06/2021 04:26, Kurt Kanzenbach wrote:
>>   #else /* __ASSUME_TIME64_SYSCALLS  */
>>     bool need_time64 = abstime != NULL && !in_time_t_range (abstime->tv_sec)
>>     if (need_time64)
>>       {
>>         err = futex_lock_pi2_64 (futex_word, abstime, private);
>>       }
>>     else
>>       {
>>         struct timespec ts32;
>>         if (abstime != NULL)
>>           ts32 = valid_timespec64_to_timespec (*abstime);
>>
>>         err = INTERNAL_SYSCALL_CALL (futex, futex_word, __lll_private_flag
>>                                      (FUTEX_LOCK_PI, private), 0,
>>                                      abstime != NULL ? &ts32 : NULL);
>>       }
>>   #endif /* __ASSUME_TIME64_SYSCALLS  */
>>    [...]
>>   }
>>
>> It would make the changes on pthread_mutex code minimal, it would be only to
>> remove the extra check for clockid and adjust the comment.
> 
> Well, that's an interesting point. I think the current check has to
> stay, because there are two locking paths. Only the slow path calls
> futex_lock_pi_64() which may result in ENOSYS for clock monotonic. But,
> the fast path which doesn't call futex_lock_pi64() would succeed if the
> check is removed.
> 
> Maybe something like this:
> 
> sysdeps/nptl/futex-internal.h:
> |static __always_inline bool
> |futex_lockpi2_supported (void)
> |{
> |  return __ASSUME_FUTEX_LOCK_PI2;
> |}
> 
> nptl/pthread_mutex_timedlock.c:
> |	if (__glibc_unlikely (! futex_lockpi2_supported () &&
> |			      clockid != CLOCK_REALTIME))
> |	  return EINVAL;
> 
> Or did I get something wrong?

Besides what Florian has explained about __ASSUME_* macros, another issue we
have static initialization for pthread_mutex.  So the syscall probe only
works for dynamic initialization (where caller issue a pthread_mutex_init).

I view this as an inconsistent behavior and I don't have a straightforward
solution that won't result in a performance penalization for common cases
(it would require to probe for FUTEX_LOCK_PI2 *and* FUTEX_LOCK_PI).

Instead I think we should move the possible error on the slow path and let
the kernel advertise it any missing support.
  
Kurt Kanzenbach June 22, 2021, 2:25 p.m. UTC | #6
On Tue Jun 22 2021, Adhemerval Zanella wrote:
> Besides what Florian has explained about __ASSUME_* macros, another issue we
> have static initialization for pthread_mutex.  So the syscall probe only
> works for dynamic initialization (where caller issue a pthread_mutex_init).
>
> I view this as an inconsistent behavior and I don't have a straightforward
> solution that won't result in a performance penalization for common cases
> (it would require to probe for FUTEX_LOCK_PI2 *and* FUTEX_LOCK_PI).
>
> Instead I think we should move the possible error on the slow path and let
> the kernel advertise it any missing support.

OK, good. That makes the timedlock implementation straight forward. I
implemented it and started testing now with different Linux kernels. A
new version of patch set should be ready next week or so.

Thanks,
Kurt