[1/2] Optimize generic spinlock code and use C11 like atomic macros.

  On 02/20/2017 02:51 PM, Torvald Riegel wrote:
> On Mon, 2017-02-20 at 13:15 +0100, Stefan Liebler wrote:
>>>>> A while ago we tried hard to remove all code that would fail silently
>>>>> when a macro had a typo in the name, for example.  We still have a few
>>>>> of them left (e.g., in the atomics), but I think this here doesn't
>>>>> warrant an exception.
>>>>>
>>>> Okay. You're right.
>>>>
>>>> In comment of patch 2/2, you've mentioned a header where an architecture
>>>> shall define those parameters.
>>>> Thus I've added the file nptl/pthread_spin_parameters.h.
>>>
>>> I think the stub version should be in sysdeps/ but the wiki is missing
>>> this information (default ENOTSUP location):
>>> https://sourceware.org/glibc/wiki/Style_and_Conventions#Proper_sysdeps_Location
>> In case we need a new header, shall we move it to sysdeps/nptl/ ?
>
> I would guess that this would be the right place for a stub / generic
> variant of such a header, but I'm not quite sure.
>
>>>
>>> I'm still a bit undecided about how to deal with other arch's
>>> pthread_spin_lock.c files that just set SPIN_LOCK_READS_BETWEEN_CMPXCHG
>>> and then include the generic pthread_spin_lock.c.  Maybe it's best if we
>>> just remove them in this patch of yours.
>>>
>> I think the archs currently using the generic implementation have just
>> copied the default value to get rid of the warning "machine-dependent
>> file should define SPIN_LOCK_READS_BETWEEN_CMPXCHG". But this is only an
>> assumption.
>> In general removing those "wrapper"-pthread_spin_lock.c files and using
>> information from a header like the proposed pthread_spin_parameters.h or
>> atomic-machine.h is a good approach.
>
> Okay.  So let's do that.
>
Done. The wrapper-pthread_spin_lock.c files in sysdeps subfolders are 
removed and the option SPIN_LOCK_READS_BETWEEN_CMPXCHG is not available 
anymore.
Instead there is only a loop of plain reads until we observe an not 
acquired lock.
See comment in nptl/pthread_spin_lock.c.

>>> I agree that the there are architecture-specific properties of atomic
>>> operations that have a significant effect on performance.  For s390, you
>>> list the following points:
>>> * atomic exchange (I suppose all atomic read-modify-write ops?) are
>>> implemented through CAS.
>> atomic exchange is implemented by a CAS loop due to lack of an exchange
>> instruction. For exchanging to a "0", s390 can use the load-and-and
>> instruction instead of a CAS loop (See my atomic-machine.h patch; For
>> gcc, we plan to emit such a load-and-and instruction for builtin
>> __atomic_exchange_n in future;). This also saves one register usage
>> compared to the CAS loop.
>> Exchanging to "0" is e.g. used for lll_unlock macro or in
>> malloc_consolidate.
>
> I guess this may then be yet another property atomic-machine.h could
> expose: Whether an atomic fetch-and-and exists and is not implemented
> throough a CAS loop.  Something like lll_unlock would then do an
> exchange or fetch-and-and, and if that fails, a CAS loop.
>
>> Starting with z196 zarch CPUs, the following instructions which are used
>> by the appropriate __atomic_fetch_xyz builtins instead of a CAS loop:
>> load-and-add, load-and-and, load-and-exclusive-or, load-and-or.
>> As information:
>> I will update my atomic-machine.h patch to use at least some of the C11
>> atomic builtins or all depending on gcc version.
>
> Thanks.
>
>>> These properties are specific to the architecture, but they are not
>>> specific to the synchronization code (eg, to spinlocks).  Thus, I think
>>> such settings should be in the atomics headers (i.e., in
>>> atomic-machine.h).
>>> This should probably be a separate patch.  I would propose the following
>>> macros (both are bool):
>>> /* Nonzero iff all atomic read-modify-write operations (e.g., atomic
>>> exchange) are implemented using a CAS loop.  */
>>> #define ATOMIC_RMW_USES_CAS 1
>> Is one macro enough to describe all read-modify-write operations in
>> include/atomic.h?
>> Please also see my comment above.
>
> Yes, it may not be enough (e.g., you say that s390 may have fetch_and
> but not an exchange, but other archs may just have an exchange but no
> custom fetch_and).
> So maybe we should add ATOMIC_EXCHANGE_USES_CAS and
> ATOMIC_FETCH_AND_USES_CAS for now, and add further ones on demand?
>
>>> /* Nonzero iff CAS always assumes it will store, and thus has cache
>>> performance effects similar to a store (e.g., it always puts the
>>> respective cacheline into an exclusive state, even if the comparison
>>> against the expected value fails).  */
>>> ATOMIC_CAS_ALWAYS_PREPARES_FOR_STORE 1
>>>
>>> I'm not sure whether there are architectures for which the second macro
>>> would not be true.  It would be good to investigate this, maybe we don't
>>> need to add this at all.
>>>
>> We plan that in future gcc will emit e.g. a load-and-test instruction in
>> front of the CAS instruction if the old-value is a constant zero.
>
> That can be useful, but what I outlined would be about a more generic
> case.  If you are going to solve this for your arch in gcc, you might
> not need to address this in this patch.
>
>>>
>>>
>>> For the spin lock specifically, we have the following possibilities:
>>>
>>> 1) If we assume that trylock will most likely succeed in practice:
>>> * We just do an exchange.  The properties above don't matter.
>>>
>>> 2) If we want to bias towards cases where trylock succeeds, but don't
>>> rule out contention:
>>> * If exchange not a CAS loop, and exchange is faster than CAS, do an
>>> exchange.
>>> * If exchange is a CAS loop, use a weak CAS and not an exchange so we
>>> bail out after the first failed attempt to change the state.
>>>
>>> 3) If we expect contention to be likely:
>>> * If CAS always brings the cache line into an exclusive state, then load
>>> first.  Then do 2).
>>>
>>> Do we have consensus yet on whether we assume 2 or 3 (I don't think 1
>>> buys us anything over 2)?
>>>
>> Yes. Sounds good.
>
> I read this as stating that you agree that 1) doesn't need to be
> considered.  But we still to choose between 2) and 3)  :)
>
> I'm not quite sure what to favor because some of the code out there that
> uses trylock just uses it to prevent deadlock (e.g., when trying to
> acquire multiple locks at once), whereas other code uses spin_trylock to
> implement their own back-off and bounded spinning.
> I lean towards assuming that lock acquisitons, including the former use
> case for trylock, are supposed to succeed in the common case.  That is,
> I would pick 2), but I have no data to back this up.
>
> Either way, whatever we choose, we should document polish the cases 1-3
> above and the reasoning behind our choice for one of them, and add it as
> a comment to the spinlock code.
>

Okay. I've attached an updated patch. It is now using case 2).
This choice applies to pthread_spin_trylock.c and the first attempt to 
acquire the lock in pthread_spin_lock.c.
Therefore I've introduced ATOMIC_EXCHANGE_USES_CAS for all architectures 
in atomic-machine.h files. There is a check in include/atomic.h which 
ensures that it is defined to either 0 or 1. Can you please review the 
setting of 0 or 1?

Bye Stefan

[1/2] Optimize generic spinlock code and use C11 like atomic macros.

Commit Message

Comments

Patch