From: Roland McGrath
To: "GNU C. Library" <libc-alpha@sourceware.org>
Subject: [PATCH roland/arm] ARM: Use movw/movt more when available
Message-Id: <20141016223356.845062C24CE@topped-with-meat.com>
Date: Thu, 16 Oct 2014 15:33:56 -0700 (PDT)

Note you'll need an unreleased binutils version (trunk, 2.25, or the tip
of the 2.24 branch) for the new configure check to pass.

I've tested these configurations on arm-linux-gnueabihf:

* configure check fails; arm-linux-gnueabihf-gcc (4.8.2) defaults to -mthumb

  The only changes in disassembly are in the setjmp/longjmp code.  In the
  shared library versions, all that changes is the order of some literal
  pool items.  The rtld versions actually get two instructions shorter,
  because they were previously using the GOT unnecessarily (though
  alignment nop-pads them out to the same text length anyway).  The
  static versions actually use movw/movt, since the linker was never
  broken for the absolute flavors of those relocs.

* configure check fails; arm-linux-gnueabihf-gcc -marm

  The only changes in disassembly are in the setjmp/longjmp code.  The
  differences are very much like the Thumb case above.

* configure check passes; arm-linux-gnueabihf-gcc (4.8.2) defaults to -mthumb

* configure check passes; arm-linux-gnueabihf-gcc -marm

  In both of these, the shared library setjmp/longjmp code changes more.
  Also, every rtld syscall stub changes because it's now using movw/movt
  to materialize the address of rtld_errno, and every cancellable syscall
  stub changes because it's now using movw/movt to materialize the
  address of __libc_multiple_threads.

* configure check passes; arm-linux-gnueabihf-gcc (4.8.2) defaults to
  -mthumb; ARM_NO_INDEX_REGISTER temporarily hacked into arm-features.h

* configure check passes; arm-linux-gnueabihf-gcc -marm;
  ARM_NO_INDEX_REGISTER temporarily hacked into arm-features.h

I added the static copy of the basic setjmp test, since with this change
the assembly diverges more between static and shared and there is no
existing test I noticed that exercises static setjmp/longjmp.  In all
those configurations, all the setjmp/ tests (still) pass.

We've talked before about the performance issues of movw/movt (two
words, two instructions with a data dependency) vs. literal pools (one
load instruction that in the best case is a one-cycle cache hit, plus a
non-instruction word that in some cases--but none here--might be shared
between multiple uses).  ARM chip experts I've talked to tell me they
think movw/movt overall is probably a wash at worst.  The two cycles and
data dependency trade off against the double cache pollution.  (The I
cache sees literal pool words that will never be executed, because they
are in the same cache line as executed code; the D cache sees
instruction words that will never be fetched as data, because they are
in the same cache line as literal pool words.)
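To make the comparison concrete, the two flavors look like this.  (An
illustrative sketch only, not code from the patch; some_global is a
made-up symbol name.)

	@ Literal-pool flavor: one load, plus a non-instruction word
	@ in .text that shares a cache line with code.
	ldr	r0, 1f
	...
1:	.word	some_global

	@ movw/movt flavor: two instructions, two words; the movt
	@ depends on the movw's result, but .text holds only code.
	movw	r0, #:lower16:some_global
	movt	r0, #:upper16:some_global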
The rtld and static versions of setjmp and longjmp already had
low-hanging fruit (unnecessary GOT use and unnecessary PIC-friendliness,
respectively) left on the vine, suggesting that maximal performance
there was not a priority.  The syscall paths might matter more, though
as I said above I'm skeptical that movw/movt is really a loser.  I'm not
inclined to try to get empirical performance measurements, since this
level of stuff is so likely to vary substantially across chip variants
and workloads.

Still, if you really think preserving the status quo for Linux
configurations is the right thing to do, I can add an arm-features.h
macro for disallowing noninstructions in text (i.e. literal pools) and
use the new methods only in that case; the macro would be set for
arm-nacl and not for arm-linux*.
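As a sketch of what I mean (the macro name here is hypothetical, just
to show the shape of it):

	/* arm-features.h, in the arm-nacl version only:
	   non-instruction words (literal pools) must not appear
	   in the text section.  */
	#define ARM_NO_POOLS_IN_TEXT	1

The movw/movt definitions of LDST_PCREL and LDR_GLOBAL in sysdep.h would
then be used only when that macro is defined, instead of whenever
ARM_PCREL_MOVW_OK allows them.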
Thanks,
Roland


2014-10-16  Roland McGrath

	* sysdeps/arm/__longjmp.S [NEED_HWCAP] [IS_IN_rtld]: Use LDST_PCREL
	macro to get at the _rtld_local_ro field.
	[NEED_HWCAP] [!IS_IN_rtld]: Use LDR_GLOBAL to get at _rtld_global_ro
	([PIC] case) or _dl_hwcap ([!PIC] case).
	* sysdeps/arm/setjmp.S: Likewise.
	* config.h.in (ARM_PCREL_MOVW_OK): New macro.
	* sysdeps/arm/configure.ac: New check to define it.
	* sysdeps/arm/configure: Regenerated.
	* sysdeps/arm/sysdep.h [__ASSEMBLER__]: Include <arm-features.h>.
	(LDST_INDEXED_NOINDEX, LDST_INDEXED_INDEX): New macros.
	(LDST_INDEXED, LDST_PC_INDEXED): New macros, differing definitions
	depending on [ARM_NO_INDEX_REGISTER] and [__thumb2__].
	(LDST_PCREL) [!__thumb2__ && ARCH_HAS_T2 && ARM_PCREL_MOVW_OK]:
	Use movw/movt pair instead of a load.
	(LDST_GLOBAL): Macro removed.
	(LDR_GLOBAL): New macro replaces it.
	(LDR_HIDDEN): New macro.
	(PTR_MANGLE_LOAD): Use LDR_GLOBAL rather than LDST_GLOBAL.
	Use LDR_HIDDEN instead for __pointer_chk_guard_local.
	* setjmp/tst-setjmp-static.c: New file.
	* setjmp/Makefile (tests): Add it.
	(tests-static): New variable.

--- a/config.h.in
+++ b/config.h.in
@@ -243,6 +243,9 @@
 /* The ARM hard-float ABI is being used.  */
 #undef HAVE_ARM_PCS_VFP
 
+/* The ARM movw/movt instructions using PC-relative relocs work right.  */
+#define ARM_PCREL_MOVW_OK 0
+
 /* The pt_chown binary is being built and used by grantpt.  */
 #define HAVE_PT_CHOWN 0
 
--- a/setjmp/Makefile
+++ b/setjmp/Makefile
@@ -28,7 +28,8 @@
 routines	:= setjmp sigjmp bsd-setjmp bsd-_setjmp \
 		   longjmp __longjmp jmp-unwind
 tests		:= tst-setjmp jmpbug bug269-setjmp tst-setjmp-fp \
-		   tst-sigsetjmp
+		   tst-sigsetjmp tst-setjmp-static
+tests-static := tst-setjmp-static
 
 include ../Rules
 
--- /dev/null
+++ b/setjmp/tst-setjmp-static.c
@@ -0,0 +1 @@
+#include "tst-setjmp.c"
--- a/sysdeps/arm/__longjmp.S
+++ b/sysdeps/arm/__longjmp.S
@@ -77,21 +77,15 @@ ENTRY (__longjmp)
 
 #ifdef NEED_HWCAP
 # ifdef IS_IN_rtld
-	ldr	a4, 1f
-	ldr	a3, .Lrtld_local_ro
-0:	add	a4, pc, a4
-	add	a4, a4, a3
-	ldr	a4, [a4, #RTLD_GLOBAL_RO_DL_HWCAP_OFFSET]
+	LDST_PCREL (ldr, a4, a3, \
+		    C_SYMBOL_NAME(_rtld_local_ro) \
+		    + RTLD_GLOBAL_RO_DL_HWCAP_OFFSET)
 # else
 #  ifdef PIC
-	ldr	a4, 1f
-	ldr	a3, .Lrtld_global_ro
-0:	add	a4, pc, a4
-	ldr	a4, [a4, a3]
-	ldr	a4, [a4, #RTLD_GLOBAL_RO_DL_HWCAP_OFFSET]
+	LDR_GLOBAL (a4, a3, C_SYMBOL_NAME(_rtld_global_ro), \
+		    RTLD_GLOBAL_RO_DL_HWCAP_OFFSET)
 #  else
-	ldr	a4, .Lhwcap
-	ldr	a4, [a4, #0]
+	LDR_GLOBAL (a4, a3, C_SYMBOL_NAME(_dl_hwcap), 0)
 #  endif
 # endif
 #endif
@@ -138,21 +132,4 @@ ENTRY (__longjmp)
 
 	DO_RET(lr)
 
-#ifdef NEED_HWCAP
-# ifdef IS_IN_rtld
-1:	.long	_GLOBAL_OFFSET_TABLE_ - 0b - PC_OFS
-.Lrtld_local_ro:
-	.long	C_SYMBOL_NAME(_rtld_local_ro)(GOTOFF)
-# else
-#  ifdef PIC
-1:	.long	_GLOBAL_OFFSET_TABLE_ - 0b - PC_OFS
-.Lrtld_global_ro:
-	.long	C_SYMBOL_NAME(_rtld_global_ro)(GOT)
-#  else
-.Lhwcap:
-	.long	C_SYMBOL_NAME(_dl_hwcap)
-#  endif
-# endif
-#endif
-
 END (__longjmp)
--- a/sysdeps/arm/configure
+++ b/sysdeps/arm/configure
@@ -150,8 +150,8 @@ else
 cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
 #ifdef __ARM_PCS_VFP
-                   yes
-                   #endif
+	yes
+	#endif
 
 _ACEOF
 if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
@@ -211,6 +211,54 @@ else
 have-arm-tls-desc = no"
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether PC-relative relocs in movw/movt work properly" >&5
+$as_echo_n "checking whether PC-relative relocs in movw/movt work properly... " >&6; }
+if ${libc_cv_arm_pcrel_movw+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+
+cat > conftest.s <<\EOF
+	.syntax unified
+	.arm
+	.arch armv7-a
+
+	.text
+	.globl foo
+	.type foo,%function
+foo:	movw r0, #:lower16:symbol - 1f - 8
+	movt r0, #:upper16:symbol - 1f - 8
+1:	add r0, pc
+	@ And now a case with a local symbol.
+	movw r0, #:lower16:3f - 2f - 8
+	movt r0, #:upper16:3f - 2f - 8
+2:	add r0, pc
+	bx lr
+
+.data
+	.globl symbol
+	.hidden symbol
+symbol:	.long 23
+3:	.long 17
+EOF
+libc_cv_arm_pcrel_movw=no
+${CC-cc} $CFLAGS $CPPFLAGS $LDFLAGS \
+	 -nostartfiles -nostdlib -shared \
+	 -o conftest.so conftest.s 1>&5 2>&5 &&
+LC_ALL=C $READELF -dr conftest.so > conftest.dr 2>&5 &&
+{
+  cat conftest.dr 1>&5
+  fgrep 'TEXTREL
+R_ARM_NONE' conftest.dr > /dev/null || libc_cv_arm_pcrel_movw=yes
+}
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_arm_pcrel_movw" >&5
+$as_echo "$libc_cv_arm_pcrel_movw" >&6; }
+if test $libc_cv_arm_pcrel_movw = yes; then
+  $as_echo "#define ARM_PCREL_MOVW_OK 1" >>confdefs.h
+
+fi
+
 libc_cv_gcc_unwind_find_fde=no
 
 # Remove -fno-unwind-tables that was added in sysdeps/arm/preconfigure.ac.
--- a/sysdeps/arm/configure.ac
+++ b/sysdeps/arm/configure.ac
@@ -17,8 +17,8 @@ dnl it.  Until we do, don't define it.
 AC_CACHE_CHECK([whether the compiler is using the ARM hard-float ABI],
   [libc_cv_arm_pcs_vfp],
   [AC_EGREP_CPP(yes,[#ifdef __ARM_PCS_VFP
-                   yes
-                   #endif
+	yes
+	#endif
 ], libc_cv_arm_pcs_vfp=yes, libc_cv_arm_pcs_vfp=no)])
 if test $libc_cv_arm_pcs_vfp = yes; then
   AC_DEFINE(HAVE_ARM_PCS_VFP)
@@ -40,6 +40,46 @@ else
 LIBC_CONFIG_VAR([have-arm-tls-desc], [no])
 fi
 
+AC_CACHE_CHECK([whether PC-relative relocs in movw/movt work properly],
+	       libc_cv_arm_pcrel_movw, [
+cat > conftest.s <<\EOF
+	.syntax unified
+	.arm
+	.arch armv7-a
+
+	.text
+	.globl foo
+	.type foo,%function
+foo:	movw r0, #:lower16:symbol - 1f - 8
+	movt r0, #:upper16:symbol - 1f - 8
+1:	add r0, pc
+	@ And now a case with a local symbol.
+	movw r0, #:lower16:3f - 2f - 8
+	movt r0, #:upper16:3f - 2f - 8
+2:	add r0, pc
+	bx lr
+
+.data
+	.globl symbol
+	.hidden symbol
+symbol:	.long 23
+3:	.long 17
+EOF
+libc_cv_arm_pcrel_movw=no
+${CC-cc} $CFLAGS $CPPFLAGS $LDFLAGS \
+	 -nostartfiles -nostdlib -shared \
+	 -o conftest.so conftest.s 1>&AS_MESSAGE_LOG_FD 2>&AS_MESSAGE_LOG_FD &&
+LC_ALL=C $READELF -dr conftest.so > conftest.dr 2>&AS_MESSAGE_LOG_FD &&
+{
+  cat conftest.dr 1>&AS_MESSAGE_LOG_FD
+  fgrep 'TEXTREL
+R_ARM_NONE' conftest.dr > /dev/null || libc_cv_arm_pcrel_movw=yes
+}
+rm -f conftest*])
+if test $libc_cv_arm_pcrel_movw = yes; then
+  AC_DEFINE([ARM_PCREL_MOVW_OK])
+fi
+
 libc_cv_gcc_unwind_find_fde=no
 
 # Remove -fno-unwind-tables that was added in sysdeps/arm/preconfigure.ac.
--- a/sysdeps/arm/setjmp.S
+++ b/sysdeps/arm/setjmp.S
@@ -58,21 +58,15 @@ ENTRY (__sigsetjmp)
 #ifdef NEED_HWCAP
 	/* Check if we have a VFP unit.  */
 # ifdef IS_IN_rtld
-	ldr	a3, 1f
-	ldr	a4, .Lrtld_local_ro
-0:	add	a3, pc, a3
-	add	a3, a3, a4
-	ldr	a3, [a3, #RTLD_GLOBAL_RO_DL_HWCAP_OFFSET]
+	LDST_PCREL (ldr, a3, a4, \
+		    C_SYMBOL_NAME(_rtld_local_ro) \
+		    + RTLD_GLOBAL_RO_DL_HWCAP_OFFSET)
 # else
 #  ifdef PIC
-	ldr	a3, 1f
-	ldr	a4, .Lrtld_global_ro
-0:	add	a3, pc, a3
-	ldr	a3, [a3, a4]
-	ldr	a3, [a3, #RTLD_GLOBAL_RO_DL_HWCAP_OFFSET]
+	LDR_GLOBAL (a3, a4, C_SYMBOL_NAME(_rtld_global_ro), \
+		    RTLD_GLOBAL_RO_DL_HWCAP_OFFSET)
 #  else
-	ldr	a3, .Lhwcap
-	ldr	a3, [a3, #0]
+	LDR_GLOBAL (a3, a4, C_SYMBOL_NAME(_dl_hwcap), 0)
 #  endif
 # endif
 #endif
@@ -114,23 +108,6 @@ ENTRY (__sigsetjmp)
 	/* Make a tail call to __sigjmp_save; it takes the same args.  */
 	B	PLTJMP(C_SYMBOL_NAME(__sigjmp_save))
 
-#ifdef NEED_HWCAP
-# ifdef IS_IN_rtld
-1:	.long	_GLOBAL_OFFSET_TABLE_ - 0b - PC_OFS
-.Lrtld_local_ro:
-	.long	C_SYMBOL_NAME(_rtld_local_ro)(GOTOFF)
-# else
-#  ifdef PIC
-1:	.long	_GLOBAL_OFFSET_TABLE_ - 0b - PC_OFS
-.Lrtld_global_ro:
-	.long	C_SYMBOL_NAME(_rtld_global_ro)(GOT)
-#  else
-.Lhwcap:
-	.long	C_SYMBOL_NAME(_dl_hwcap)
-#  endif
-# endif
-#endif
-
 END (__sigsetjmp)
 
 hidden_def (__sigsetjmp)
--- a/sysdeps/arm/sysdep.h
+++ b/sysdeps/arm/sysdep.h
@@ -21,6 +21,8 @@
 
 #ifndef __ASSEMBLER__
 # include <stdint.h>
+#else
+# include <arm-features.h>
 #endif
 
 /* The __ARM_ARCH define is provided by gcc 4.8.  Construct it otherwise.  */
@@ -157,6 +159,32 @@
 	.arm
 # endif
 
+/* Load or store to/from address X + Y into/from R, (maybe) using T.
+   X or Y can use T freely; T can be R if OP is a load.  The first
+   version eschews the two-register addressing mode, while the
+   second version uses it.  */
+# define LDST_INDEXED_NOINDEX(OP, R, T, X, Y) \
+	add	T, X, Y; \
+	sfi_breg T, \
+	OP	R, [T]
+# define LDST_INDEXED_INDEX(OP, R, X, Y) \
+	OP	R, [X, Y]
+
+# ifdef ARM_NO_INDEX_REGISTER
+/* We're never using the two-register addressing mode, so this
+   always uses an intermediate add.  */
+#  define LDST_INDEXED(OP, R, T, X, Y)	LDST_INDEXED_NOINDEX (OP, R, T, X, Y)
+#  define LDST_PC_INDEXED(OP, R, T, X)	LDST_INDEXED_NOINDEX (OP, R, T, pc, X)
+# else
+/* The two-register addressing mode is OK, except on Thumb with pc.  */
+#  define LDST_INDEXED(OP, R, T, X, Y)	LDST_INDEXED_INDEX (OP, R, X, Y)
+#  ifdef __thumb2__
+#   define LDST_PC_INDEXED(OP, R, T, X)	LDST_INDEXED_NOINDEX (OP, R, T, pc, X)
+#  else
+#   define LDST_PC_INDEXED(OP, R, T, X)	LDST_INDEXED_INDEX (OP, R, pc, X)
+#  endif
+# endif
+
 /* Load or store to/from a pc-relative EXPR into/from R, using T.  */
 # ifdef __thumb2__
 #  define LDST_PCREL(OP, R, T, EXPR) \
@@ -166,6 +194,11 @@
 	.previous; \
 99:	add	T, T, pc; \
 	OP	R, [T]
+# elif defined (ARCH_HAS_T2) && ARM_PCREL_MOVW_OK
+#  define LDST_PCREL(OP, R, T, EXPR) \
+	movw	T, #:lower16:EXPR - 99f - PC_OFS; \
+	movt	T, #:upper16:EXPR - 99f - PC_OFS; \
+99:	LDST_PC_INDEXED (OP, R, T, T)
 # else
 #  define LDST_PCREL(OP, R, T, EXPR) \
 	ldr	T, 98f; \
@@ -175,17 +208,50 @@
 99:	OP	R, [pc, T]
 # endif
 
-/* Load or store to/from a global EXPR into/from R, using T.  */
-# define LDST_GLOBAL(OP, R, T, EXPR) \
+/* Load from a global SYMBOL + CONSTANT into R, using T.  */
+# if defined (ARCH_HAS_T2) && !defined (PIC)
+#  define LDR_GLOBAL(R, T, SYMBOL, CONSTANT) \
+	movw	T, #:lower16:SYMBOL; \
+	movt	T, #:upper16:SYMBOL; \
+	ldr	R, [T, $CONSTANT]
+# elif defined (ARCH_HAS_T2) && defined (PIC) && ARM_PCREL_MOVW_OK
+#  define LDR_GLOBAL(R, T, SYMBOL, CONSTANT) \
+	movw	R, #:lower16:_GLOBAL_OFFSET_TABLE_ - 97f - PC_OFS; \
+	movw	T, #:lower16:99f - 98f - PC_OFS; \
+	movt	R, #:upper16:_GLOBAL_OFFSET_TABLE_ - 97f - PC_OFS; \
+	movt	T, #:upper16:99f - 98f - PC_OFS; \
+	.pushsection .rodata.cst4, "aM", %progbits, 4; \
+	.balign	4; \
+99:	.word	SYMBOL##(GOT); \
+	.popsection; \
+97:	add	R, R, pc; \
+98:	LDST_PC_INDEXED (ldr, T, T, T); \
+	LDST_INDEXED (ldr, R, T, R, T); \
+	ldr	R, [R, $CONSTANT]
+# else
+#  define LDR_GLOBAL(R, T, SYMBOL, CONSTANT) \
 	ldr	T, 99f; \
 	ldr	R, 100f; \
 98:	add	T, T, pc; \
 	ldr	T, [T, R]; \
 	.subsection 2; \
 99:	.word	_GLOBAL_OFFSET_TABLE_ - 98b - PC_OFS; \
-100:	.word	EXPR##(GOT); \
+100:	.word	SYMBOL##(GOT); \
 	.previous; \
-	OP	R, [T]
+	ldr	R, [T, $CONSTANT]
+# endif
+
+/* This is the same as LDR_GLOBAL, but for a SYMBOL that is known to
+   be in the same linked object (as for one with hidden visibility).
+   We can avoid the GOT indirection in the PIC case.  For the pure
+   static case, LDR_GLOBAL is already optimal.  */
+# ifdef PIC
+#  define LDR_HIDDEN(R, T, SYMBOL, CONSTANT) \
+	LDST_PCREL (ldr, R, T, SYMBOL + CONSTANT)
+# else
+#  define LDR_HIDDEN(R, T, SYMBOL, CONSTANT) \
+	LDR_GLOBAL (R, T, SYMBOL, CONSTANT)
+# endif
 
 /* Cope with negative memory offsets, which thumb can't encode.
    Use NEGOFF_ADJ_BASE to (conditionally) alter the base register,
@@ -296,7 +362,7 @@
      (!defined SHARED && (!defined NOT_IN_libc || defined IS_IN_libpthread)))
 # ifdef __ASSEMBLER__
 #  define PTR_MANGLE_LOAD(guard, tmp) \
-	LDST_PCREL(ldr, guard, tmp, C_SYMBOL_NAME(__pointer_chk_guard_local));
+	LDR_HIDDEN (guard, tmp, C_SYMBOL_NAME(__pointer_chk_guard_local), 0)
 #  define PTR_MANGLE(dst, src, guard, tmp) \
 	PTR_MANGLE_LOAD(guard, tmp); \
 	PTR_MANGLE2(dst, src, guard)
@@ -316,7 +382,7 @@ extern uintptr_t __pointer_chk_guard_local attribute_relro attribute_hidden;
 #else
 # ifdef __ASSEMBLER__
 #  define PTR_MANGLE_LOAD(guard, tmp) \
-	LDST_GLOBAL(ldr, guard, tmp, C_SYMBOL_NAME(__pointer_chk_guard));
+	LDR_GLOBAL (guard, tmp, C_SYMBOL_NAME(__pointer_chk_guard), 0);
 #  define PTR_MANGLE(dst, src, guard, tmp) \
 	PTR_MANGLE_LOAD(guard, tmp); \
 	PTR_MANGLE2(dst, src, guard)
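P.S.  The PIC movw/movt flavor of LDR_GLOBAL is the least obvious piece
of this, so here is its expansion written out by hand (not assembler
output) for the ARM-mode, index-register-allowed case, using the
__longjmp.S call site LDR_GLOBAL (a4, a3,
C_SYMBOL_NAME(_rtld_global_ro), RTLD_GLOBAL_RO_DL_HWCAP_OFFSET);
PC_OFS is 8 in ARM mode:

	movw	a4, #:lower16:_GLOBAL_OFFSET_TABLE_ - 97f - PC_OFS
	movw	a3, #:lower16:99f - 98f - PC_OFS
	movt	a4, #:upper16:_GLOBAL_OFFSET_TABLE_ - 97f - PC_OFS
	movt	a3, #:upper16:99f - 98f - PC_OFS
	.pushsection .rodata.cst4, "aM", %progbits, 4
	.balign	4
99:	.word	_rtld_global_ro(GOT)	@ GOT slot offset; note it lives
	.popsection			@ in .rodata, not in .text
97:	add	a4, a4, pc		@ a4 = GOT base
98:	ldr	a3, [pc, a3]		@ a3 = the .word above
	ldr	a4, [a4, a3]		@ a4 = &_rtld_global_ro
	ldr	a4, [a4, #RTLD_GLOBAL_RO_DL_HWCAP_OFFSET]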