[BZ,#19490] Add unwind descriptors to pthread_spin_init, etc. on i386

  On Sun, 2016-01-31 at 15:09 -0800, Paul Pluzhnikov wrote:
> On Mon, Jan 25, 2016 at 5:06 AM, Torvald Riegel <triegel@redhat.com> wrote:
> 
> > For the spinlocks, I'd really prefer if we could just remove the asm
> > files.  The generic implementation should basically produce the same
> > code; if not, we should try to fix that instead of keeping the asm
> > files.
> 
> Using gcc-4.8.4 (4.8.4-2ubuntu1~14.04):
> 
> $ objdump -d nptl/pthread_spin_unlock.o
> 
> nptl/pthread_spin_unlock.o:     file format elf32-i386
> 
> 
> Disassembly of section .text:
> 
> 00000000 <pthread_spin_unlock>:
>    0: f0 83 0c 24 00       lock orl $0x0,(%esp)
>    5: 8b 44 24 04           mov    0x4(%esp),%eax
>    9: c7 00 00 00 00 00     movl   $0x0,(%eax)
>    f: 31 c0                 xor    %eax,%eax
>   11: c3                   ret
> 
> This isn't quite the same as sysdeps/i386/nptl/pthread_spin_unlock.S

This is because nptl/pthread_spin_unlock.c still issues a full barrier.
If this is changed to an atomic_store_release, one gets this on x86_64:

0000000000000000 <pthread_spin_unlock>:
   0:	c7 07 00 00 00 00    	movl   $0x0,(%rdi)
   6:	31 c0                	xor    %eax,%eax
   8:	c3

Perhaps now is a good time to finally get this done.  Most archs are
already using acquire semantics, IIRC.  I think aarch64 and arm are the
only major ones that happen to use the current generic unlock with full
barrier -- but they only use acquire MO on unlock, so there's really no
consistent pattern that would justify this.

Note that there's an ongoing debate about whether POSIX requires
pthread_spin_unlock to be a full barrier, whether it should or should
not do that, and whether that makes any difference for all "sane"
programs.  But given that we never implemented full barriers on almost
all of the major archs and nobody complained about it, I think we should
continue to not slow down spinlocks just to make weird use cases work
(and the ones that are indeed correct under POSIX are pretty complex
pieces of code).

> For pthread_spin_lock it's much worse:
> 
> $ objdump -d nptl/pthread_spin_lock.o
> 
> nptl/pthread_spin_lock.o:     file format elf32-i386
> 
> 
> Disassembly of section .text:
> 
> 00000000 <pthread_spin_lock>:
>    0: 57                   push   %edi
>    1: b8 01 00 00 00       mov    $0x1,%eax
>    6: 56                   push   %esi
>    7: 53                   push   %ebx
>    8: 83 ec 10             sub    $0x10,%esp
>    b: 8b 5c 24 20           mov    0x20(%esp),%ebx
>    f: 87 03                 xchg   %eax,(%ebx)
>   11: 89 44 24 0c           mov    %eax,0xc(%esp)
>   15: 8b 44 24 0c           mov    0xc(%esp),%eax
>   19: 31 ff                 xor    %edi,%edi
>   1b: be 01 00 00 00       mov    $0x1,%esi
>   20: 85 c0                 test   %eax,%eax
>   22: 74 29                 je     4d <pthread_spin_lock+0x4d>
>   24: 8d 74 26 00           lea    0x0(%esi,%eiz,1),%esi
>   28: 8b 03                 mov    (%ebx),%eax
>   2a: 85 c0                 test   %eax,%eax
>   2c: 74 15                 je     43 <pthread_spin_lock+0x43>
>   2e: ba e8 03 00 00       mov    $0x3e8,%edx
>   33: eb 08                 jmp    3d <pthread_spin_lock+0x3d>
>   35: 8d 76 00             lea    0x0(%esi),%esi
>   38: 83 ea 01             sub    $0x1,%edx
>   3b: 74 06                 je     43 <pthread_spin_lock+0x43>
>   3d: 8b 0b                 mov    (%ebx),%ecx
>   3f: 85 c9                 test   %ecx,%ecx
>   41: 75 f5                 jne    38 <pthread_spin_lock+0x38>
>   43: 89 f8                 mov    %edi,%eax
>   45: f0 0f b1 33           lock cmpxchg %esi,(%ebx)
>   49: 85 c0                 test   %eax,%eax
>   4b: 75 db                 jne    28 <pthread_spin_lock+0x28>
>   4d: 83 c4 10             add    $0x10,%esp
>   50: 31 c0                 xor    %eax,%eax
>   52: 5b                   pop    %ebx
>   53: 5e                   pop    %esi
>   54: 5f                   pop    %edi
>   55: c3                   ret

I wouldn't say it's worse.  It's mostly different, and the uncontended
path may be a little worse.  In the generic version, we added spinning.
This isn't really well-tuned yet, but it's something we want to do
eventually.  If we assume uncontended, the initial xchg should be fast;
maybe we need to add a glibc_likely here or such, to make GCC do what we
expect; outlining the contended path (ie, the spinning and cmpxchg)
could also help work around GCC codegen deficiencies.
However, on x86_64 I get the following (adding __glibc_likely to the
atomic_exchange_acq only moves the return up):

0000000000000000 <pthread_spin_lock>:
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   87 07                   xchg   %eax,(%rdi)
   7:   89 44 24 fc             mov    %eax,-0x4(%rsp)
   b:   8b 44 24 fc             mov    -0x4(%rsp),%eax
   f:   85 c0                   test   %eax,%eax
  11:   75 03                   jne    16 <pthread_spin_lock+0x16>
  13:   31 c0                   xor    %eax,%eax
  15:   c3                      retq   
  16:   45 31 c0                xor    %r8d,%r8d
  19:   be 01 00 00 00          mov    $0x1,%esi
  1e:   8b 17                   mov    (%rdi),%edx
  20:   85 d2                   test   %edx,%edx
  22:   74 17                   je     3b <pthread_spin_lock+0x3b>
  24:   ba e8 03 00 00          mov    $0x3e8,%edx
  29:   eb 0a                   jmp    35 <pthread_spin_lock+0x35>
  2b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  30:   83 ea 01                sub    $0x1,%edx
  33:   74 06                   je     3b <pthread_spin_lock+0x3b>
  35:   8b 0f                   mov    (%rdi),%ecx
  37:   85 c9                   test   %ecx,%ecx
  39:   75 f5                   jne    30 <pthread_spin_lock+0x30>
  3b:   44 89 c0                mov    %r8d,%eax
  3e:   f0 0f b1 37             lock cmpxchg %esi,(%rdi)
  42:   85 c0                   test   %eax,%eax
  44:   75 d8                   jne    1e <pthread_spin_lock+0x1e>
  46:   eb cb                   jmp    13 <pthread_spin_lock+0x13>

The fastpath of this doesn't look bad to me (except at 7: and b:, for
which I don't see a reason).

See attached untested patch for what I played with.

[BZ,#19490] Add unwind descriptors to pthread_spin_init, etc. on i386

Commit Message

Comments

Patch