Patchwork [v2,4/6] i386: Remove bogus THREAD_ATOMIC_* macros

login
register
mail settings
Submitter Adhemerval Zanella Netto
Date Dec. 28, 2018, 1:02 a.m.
Message ID <20181228010255.21406-5-adhemerval.zanella@linaro.org>
Download mbox | patch
Permalink /patch/30875/
State New
Headers show

Comments

Adhemerval Zanella Netto - Dec. 28, 2018, 1:02 a.m.
The x86 defines optimized THREAD_ATOMIC_* macros where reference always
the current thread instead of the one indicated by input 'descr' argument.
It work as long the input is the self thread pointer, however it generates
wrong code is the semantic is to set a bit atomicialy from another thread.

This is not an issue for current GLIBC usage, however the new cancellation
code expects that some synchronization code to atomically set bits from
different threads.

If some usage indeed proves to be a hotspot we can add an extra macro
with a more descriptive name (THREAD_ATOMIC_BIT_SET_SELF for instance)
where i386 might optimize it.

Checked on i686-linux-gnu.

	* sysdeps/i686/nptl/tls.h (THREAD_ATOMIC_CMPXCHG_VAL,
	THREAD_ATOMIC_AND, THREAD_ATOMIC_BIT_SET): Remove macros.
---
 ChangeLog               |  3 +++
 sysdeps/i386/nptl/tls.h | 37 -------------------------------------
 2 files changed, 3 insertions(+), 37 deletions(-)
Siddhesh Poyarekar - Dec. 29, 2018, 2:13 a.m.
On 28/12/18 6:32 AM, Adhemerval Zanella wrote:
> The x86 defines optimized THREAD_ATOMIC_* macros where reference always
> the current thread instead of the one indicated by input 'descr' argument.
> It work as long the input is the self thread pointer, however it generates
> wrong code is the semantic is to set a bit atomicialy from another thread.

"if the semantic is to set a bit atomically"

> 
> This is not an issue for current GLIBC usage, however the new cancellation
> code expects that some synchronization code to atomically set bits from
> different threads.
> 
> If some usage indeed proves to be a hotspot we can add an extra macro
> with a more descriptive name (THREAD_ATOMIC_BIT_SET_SELF for instance)
> where i386 might optimize it.
> 
> Checked on i686-linux-gnu.
> 
> 	* sysdeps/i686/nptl/tls.h (THREAD_ATOMIC_CMPXCHG_VAL,
> 	THREAD_ATOMIC_AND, THREAD_ATOMIC_BIT_SET): Remove macros.

OK with that nit.

Siddhesh
Florian Weimer - Jan. 25, 2019, 3:16 p.m.
* Adhemerval Zanella:

> The x86 defines optimized THREAD_ATOMIC_* macros where reference always
> the current thread instead of the one indicated by input 'descr' argument.
> It work as long the input is the self thread pointer, however it generates
> wrong code is the semantic is to set a bit atomicialy from another thread.

The entire set of THREAD_* macros is poorly documented, but this is part
of the interface: On some targets, it is expensive to obtain the thread
pointer, so it makes sense to cache it.  On other targets, including
x86-64 or i386, access to the thread pointer is implicit (or can be
cached by the compiler), so an explicit cached descriptor is
unnecessary.

I think we can improve this a lot: We could use a new type for the
cache, to make clear it is a cache, and make the macros type-safe.  Or
we could try to have the compiler cache the value for us, like we do
with __errno_location.

Thanks,
Florian
Adhemerval Zanella Netto - Jan. 29, 2019, 11:32 a.m.
On 25/01/2019 13:16, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> The x86 defines optimized THREAD_ATOMIC_* macros where reference always
>> the current thread instead of the one indicated by input 'descr' argument.
>> It work as long the input is the self thread pointer, however it generates
>> wrong code is the semantic is to set a bit atomicialy from another thread.
> 
> The entire set of THREAD_* macros is poorly documented, but this is part
> of the interface: On some targets, it is expensive to obtain the thread
> pointer, so it makes sense to cache it.  On other targets, including
> x86-64 or i386, access to the thread pointer is implicit (or can be
> cached by the compiler), so an explicit cached descriptor is
> unnecessary.
> 
> I think we can improve this a lot: We could use a new type for the
> cache, to make clear it is a cache, and make the macros type-safe.  Or
> we could try to have the compiler cache the value for us, like we do
> with __errno_location.
> 
> Thanks,
> Florian
> 

It seems that besides the attribute const annotation for __errno_location,
gcc does some conditional dead call elimination (gcc/tree-call-cdce.c) to
optimize errno handling. Are you suggesting some similar? 

It also seems that some targets support __builtin_thread_pointer, maybe
this might be better option (it might simplify compiler analysis). And 
I am not sure how compiler will handle when it is based on kernel ABI.

Anyway, I agree that adding some type-safety, document the THREAD_* macros,
evaluate if arch-specific do yield better code-gen, and check for
inconsistency is indeed a gain.
Florian Weimer - July 8, 2019, 1:44 p.m.
* Adhemerval Zanella:

> On 25/01/2019 13:16, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> The x86 defines optimized THREAD_ATOMIC_* macros where reference always
>>> the current thread instead of the one indicated by input 'descr' argument.
>>> It work as long the input is the self thread pointer, however it generates
>>> wrong code is the semantic is to set a bit atomicialy from another thread.
>> 
>> The entire set of THREAD_* macros is poorly documented, but this is part
>> of the interface: On some targets, it is expensive to obtain the thread
>> pointer, so it makes sense to cache it.  On other targets, including
>> x86-64 or i386, access to the thread pointer is implicit (or can be
>> cached by the compiler), so an explicit cached descriptor is
>> unnecessary.
>> 
>> I think we can improve this a lot: We could use a new type for the
>> cache, to make clear it is a cache, and make the macros type-safe.  Or
>> we could try to have the compiler cache the value for us, like we do
>> with __errno_location.
>> 
>> Thanks,
>> Florian
>> 
>
> It seems that besides the attribute const annotation for __errno_location,
> gcc does some conditional dead call elimination (gcc/tree-call-cdce.c) to
> optimize errno handling. Are you suggesting some similar?

Yes, that was my idea.  But if GCC hard-codes the name of
__errno_location, that's probably not possible to do.

> It also seems that some targets support __builtin_thread_pointer, maybe
> this might be better option (it might simplify compiler analysis). And 
> I am not sure how compiler will handle when it is based on kernel ABI.

It would actually need Richard Henderson's namespace patches because
some architectures use something close to that (x86 for instance).
Getting the thread pointer on these architectures requires a load of a
TLS variable.

Thanks,
Florian

Patch

diff --git a/sysdeps/i386/nptl/tls.h b/sysdeps/i386/nptl/tls.h
index 12285d3217..22ebf3d741 100644
--- a/sysdeps/i386/nptl/tls.h
+++ b/sysdeps/i386/nptl/tls.h
@@ -362,43 +362,6 @@  tls_fill_user_desc (union user_desc_init *desc,
        }})
 
 
-/* Atomic compare and exchange on TLS, returning old value.  */
-#define THREAD_ATOMIC_CMPXCHG_VAL(descr, member, newval, oldval) \
-  ({ __typeof (descr->member) __ret;					      \
-     __typeof (oldval) __old = (oldval);				      \
-     if (sizeof (descr->member) == 4)					      \
-       asm volatile (LOCK_PREFIX "cmpxchgl %2, %%gs:%P3"		      \
-		     : "=a" (__ret)					      \
-		     : "0" (__old), "r" (newval),			      \
-		       "i" (offsetof (struct pthread, member)));	      \
-     else								      \
-       /* Not necessary for other sizes in the moment.  */		      \
-       abort ();							      \
-     __ret; })
-
-
-/* Atomic logical and.  */
-#define THREAD_ATOMIC_AND(descr, member, val) \
-  (void) ({ if (sizeof ((descr)->member) == 4)				      \
-	      asm volatile (LOCK_PREFIX "andl %1, %%gs:%P0"		      \
-			    :: "i" (offsetof (struct pthread, member)),	      \
-			       "ir" (val));				      \
-	    else							      \
-	      /* Not necessary for other sizes in the moment.  */	      \
-	      abort (); })
-
-
-/* Atomic set bit.  */
-#define THREAD_ATOMIC_BIT_SET(descr, member, bit) \
-  (void) ({ if (sizeof ((descr)->member) == 4)				      \
-	      asm volatile (LOCK_PREFIX "orl %1, %%gs:%P0"		      \
-			    :: "i" (offsetof (struct pthread, member)),	      \
-			       "ir" (1 << (bit)));			      \
-	    else							      \
-	      /* Not necessary for other sizes in the moment.  */	      \
-	      abort (); })
-
-
 /* Set the stack guard field in TCB head.  */
 #define THREAD_SET_STACK_GUARD(value) \
   THREAD_SETMEM (THREAD_SELF, header.stack_guard, value)