[MPX] MPX-specific changes in dl_runtime routines

Message ID CAMe9rOo81zoKpt+QmmVYjfjV2=KwLkYiqeADv3kMyeouM+9uug@mail.gmail.com

Commit Message

H.J. Lu July 8, 2015, 6:49 p.m. UTC
  On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> Fixed in the attached patch
>

I fixed some typos and updated sysdeps/i386/configure for
HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT
and without on i386 and x86-64.
  

Comments

Zamyatin, Igor July 9, 2015, 1:37 p.m. UTC | #1
> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> wrote:
> > Fixed in the attached patch
> >
>
> I fixed some typos and updated sysdeps/i386/configure for
> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> without on i386 and x86-64.

Done, all works fine

Thanks,
Igor

>
> --
> H.J.
  
H.J. Lu July 9, 2015, 2:12 p.m. UTC | #2
On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
>> wrote:
>> > Fixed in the attached patch
>> >
>>
>> I fixed some typos and updated sysdeps/i386/configure for
>> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
>> without on i386 and x86-64.
>
> Done, all works fine
>

I checked it in for you.

Thanks.
  
Ondrej Bilka July 9, 2015, 2:28 p.m. UTC | #3
On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> >> wrote:
> >> > Fixed in the attached patch
> >> >
> >>
> >> I fixed some typos and updated sysdeps/i386/configure for
> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> >> without on i386 and x86-64.
> >
> > Done, all works fine
> >
> 
> I checked it in for you.
> 
These are nice, but you could have the same problem with lazy TLS allocation.
I wrote a patch to merge the trampolines, which now conflicts. Could you write
a similar patch to solve that? The original purpose was to always save the xmm
registers so that we could use SSE2 routines, which speeds up lookup time.
  
H.J. Lu July 9, 2015, 4:07 p.m. UTC | #4
On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
>> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
>> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
>> >> wrote:
>> >> > Fixed in the attached patch
>> >> >
>> >>
>> >> I fixed some typos and updated sysdeps/i386/configure for
>> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
>> >> without on i386 and x86-64.
>> >
>> > Done, all works fine
>> >
>>
>> I checked it in for you.
>>
> These are nice but you could have same problem with lazy tls allocation.
> I wrote patch to merge trampolines, which now conflicts. Could you write
> similar patch to solve that? Original purpose was to always save xmm
> registers so we could use sse2 routines which speeds up lookup time.

So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve?  How
much gain will it give us?
  
Ondrej Bilka July 11, 2015, 10:46 a.m. UTC | #5
On Thu, Jul 09, 2015 at 09:07:24AM -0700, H.J. Lu wrote:
> On Thu, Jul 9, 2015 at 7:28 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Thu, Jul 09, 2015 at 07:12:24AM -0700, H.J. Lu wrote:
> >> On Thu, Jul 9, 2015 at 6:37 AM, Zamyatin, Igor <igor.zamyatin@intel.com> wrote:
> >> >> On Wed, Jul 8, 2015 at 8:56 AM, Zamyatin, Igor <igor.zamyatin@intel.com>
> >> >> wrote:
> >> >> > Fixed in the attached patch
> >> >> >
> >> >>
> >> >> I fixed some typos and updated sysdeps/i386/configure for
> >> >> HAVE_MPX_SUPPORT.  Please verify both with HAVE_MPX_SUPPORT and
> >> >> without on i386 and x86-64.
> >> >
> >> > Done, all works fine
> >> >
> >>
> >> I checked it in for you.
> >>
> > These are nice but you could have same problem with lazy tls allocation.
> > I wrote patch to merge trampolines, which now conflicts. Could you write
> > similar patch to solve that? Original purpose was to always save xmm
> > registers so we could use sse2 routines which speeds up lookup time.
> 
> So we will preserve only xmm0 to xmm7 in _dl_runtime_resolve? How
> much gain it will give us?
>
I couldn't measure that without the patch. The gain now would be big, as we
currently use a byte-by-byte loop to check the symbol name, which is slow,
especially with C++ name mangling. Would the following benchmark be good for
measuring the speedup, or do I need to measure startup time, which is a bit harder?

#define _GNU_SOURCE
#include <string.h>
#include <dlfcn.h>
#include <stdint.h>
int main (void)
{
  /* Accumulate the looked-up addresses so the calls cannot be
     optimized away; intptr_t avoids truncating the pointer.  */
  intptr_t p = 0;
  for (int i = 0; i < 10000000; i++)
    p += (intptr_t) dlsym (RTLD_DEFAULT, "strlen");
  return p != 0;
}
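
I build and time it roughly like this (the file name is arbitrary, and -ldl
is needed since dlsym lives in libdl); LD_DEBUG=statistics already prints
relocation and startup numbers if measuring startup time turns out to matter:

gcc -O2 -std=gnu99 dlsym-bench.c -o dlsym-bench -ldl
time ./dlsym-bench
# Startup/relocation statistics from the dynamic loader, for comparison.
LD_DEBUG=statistics /bin/true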
  

Patch

From bed2c05d4d05462c155baada8402516f103a79c4 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 8 Jul 2015 11:08:44 -0700
Subject: [PATCH] Preserve bound registers for pointer pass/return

We need to save/restore bound registers and add a BND prefix before
branches in _dl_runtime_profile so that the bound registers used for
pointer passing and returning are preserved when LD_AUDIT is used.

	[BZ #18134]
	* sysdeps/i386/configure.ac: Set HAVE_MPX_SUPPORT.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/i386/dl-trampoline.S (PRESERVE_BND_REGS_PREFIX): New.
	(_dl_runtime_profile): Save and restore Intel MPX return bound
	registers when calling _dl_call_pltexit.  Add
	PRESERVE_BND_REGS_PREFIX before return.
	* sysdeps/i386/link-defines.sym (LRV_BND0_OFFSET): New.
	(LRV_BND1_OFFSET): Likewise.
	* sysdeps/x86/bits/link.h (La_i86_retval): Add lrv_bnd0 and
	lrv_bnd1.
	* sysdeps/x86_64/dl-trampoline.S (_dl_runtime_profile): Fix
	typo in bndmov encoding.
	* sysdeps/x86_64/dl-trampoline.h: Properly save and restore
	Intel MPX bound registers.  Add PRESERVE_BND_REGS_PREFIX before
	branch instructions to preserve bounds.
---
 sysdeps/i386/configure         | 27 +++++++++++++++++++++++++++
 sysdeps/i386/configure.ac      | 15 +++++++++++++++
 sysdeps/i386/dl-trampoline.S   | 21 +++++++++++++++++++++
 sysdeps/i386/link-defines.sym  |  2 ++
 sysdeps/x86/bits/link.h        |  2 ++
 sysdeps/x86_64/dl-trampoline.S |  4 ++--
 sysdeps/x86_64/dl-trampoline.h | 41 +++++++++++++++++++++++------------------
 7 files changed, 92 insertions(+), 20 deletions(-)

diff --git a/sysdeps/i386/configure b/sysdeps/i386/configure
index 6e89b59..ab66c08 100644
--- a/sysdeps/i386/configure
+++ b/sysdeps/i386/configure
@@ -240,6 +240,33 @@  $as_echo "$libc_cv_cc_novzeroupper" >&6; }
 config_vars="$config_vars
 config-cflags-novzeroupper = $libc_cv_cc_novzeroupper"
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for Intel MPX support" >&5
+$as_echo_n "checking for Intel MPX support... " >&6; }
+if ${libc_cv_asm_mpx+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat > conftest.s <<\EOF
+        bndmov %bnd0,(%esp)
+EOF
+if { ac_try='${CC-cc} -c $ASFLAGS conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  libc_cv_asm_mpx=yes
+else
+  libc_cv_asm_mpx=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_mpx" >&5
+$as_echo "$libc_cv_asm_mpx" >&6; }
+if test $libc_cv_asm_mpx == yes; then
+  $as_echo "#define HAVE_MPX_SUPPORT 1" >>confdefs.h
+
+fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for AVX2 support" >&5
 $as_echo_n "checking for AVX2 support... " >&6; }
 if ${libc_cv_cc_avx2+:} false; then :
diff --git a/sysdeps/i386/configure.ac b/sysdeps/i386/configure.ac
index 35c4522..a3f3067 100644
--- a/sysdeps/i386/configure.ac
+++ b/sysdeps/i386/configure.ac
@@ -88,6 +88,21 @@  LIBC_TRY_CC_OPTION([-mno-vzeroupper],
 ])
 LIBC_CONFIG_VAR([config-cflags-novzeroupper], [$libc_cv_cc_novzeroupper])
 
+dnl Check whether asm supports Intel MPX
+AC_CACHE_CHECK(for Intel MPX support, libc_cv_asm_mpx, [dnl
+cat > conftest.s <<\EOF
+        bndmov %bnd0,(%esp)
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c $ASFLAGS conftest.s 1>&AS_MESSAGE_LOG_FD); then
+  libc_cv_asm_mpx=yes
+else
+  libc_cv_asm_mpx=no
+fi
+rm -f conftest*])
+if test $libc_cv_asm_mpx == yes; then
+  AC_DEFINE(HAVE_MPX_SUPPORT)
+fi
+
 dnl Check if -mavx2 works.
 AC_CACHE_CHECK(for AVX2 support, libc_cv_cc_avx2, [dnl
 LIBC_TRY_CC_OPTION([-mavx2], [libc_cv_cc_avx2=yes], [libc_cv_cc_avx2=no])
diff --git a/sysdeps/i386/dl-trampoline.S b/sysdeps/i386/dl-trampoline.S
index 7c72b03..8a2fd8d 100644
--- a/sysdeps/i386/dl-trampoline.S
+++ b/sysdeps/i386/dl-trampoline.S
@@ -19,6 +19,12 @@ 
 #include <sysdep.h>
 #include <link-defines.h>
 
+#ifdef HAVE_MPX_SUPPORT
+# define PRESERVE_BND_REGS_PREFIX bnd
+#else
+# define PRESERVE_BND_REGS_PREFIX .byte 0xf2
+#endif
+
 	.text
 	.globl _dl_runtime_resolve
 	.type _dl_runtime_resolve, @function
@@ -172,6 +178,13 @@  _dl_runtime_profile:
 	movl %edx, LRV_EDX_OFFSET(%esp)
 	fstpt LRV_ST0_OFFSET(%esp)
 	fstpt LRV_ST1_OFFSET(%esp)
+#ifdef HAVE_MPX_SUPPORT
+	bndmov %bnd0, LRV_BND0_OFFSET(%esp)
+	bndmov %bnd1, LRV_BND1_OFFSET(%esp)
+#else
+	.byte 0x66,0x0f,0x1b,0x44,0x24,LRV_BND0_OFFSET
+	.byte 0x66,0x0f,0x1b,0x4c,0x24,LRV_BND1_OFFSET
+#endif
 	pushl %esp
 	cfi_adjust_cfa_offset (4)
 	# Address of La_i86_regs area.
@@ -185,9 +198,17 @@  _dl_runtime_profile:
 	movl LRV_EDX_OFFSET(%esp), %edx
 	fldt LRV_ST1_OFFSET(%esp)
 	fldt LRV_ST0_OFFSET(%esp)
+#ifdef HAVE_MPX_SUPPORT
+	bndmov LRV_BND0_OFFSET(%esp), %bnd0
+	bndmov LRV_BND1_OFFSET(%esp), %bnd1
+#else
+	.byte 0x66,0x0f,0x1a,0x44,0x24,LRV_BND0_OFFSET
+	.byte 0x66,0x0f,0x1a,0x4c,0x24,LRV_BND1_OFFSET
+#endif
 	# Restore stack before return.
 	addl $(LRV_SIZE + 4 + LR_SIZE + 4), %esp
 	cfi_adjust_cfa_offset (-(LRV_SIZE + 4 + LR_SIZE + 4))
+	PRESERVE_BND_REGS_PREFIX
 	ret
 	cfi_endproc
 	.size _dl_runtime_profile, .-_dl_runtime_profile
diff --git a/sysdeps/i386/link-defines.sym b/sysdeps/i386/link-defines.sym
index a63dcb9..0995adb 100644
--- a/sysdeps/i386/link-defines.sym
+++ b/sysdeps/i386/link-defines.sym
@@ -16,3 +16,5 @@  LRV_EAX_OFFSET		offsetof (struct La_i86_retval, lrv_eax)
 LRV_EDX_OFFSET		offsetof (struct La_i86_retval, lrv_edx)
 LRV_ST0_OFFSET		offsetof (struct La_i86_retval, lrv_st0)
 LRV_ST1_OFFSET		offsetof (struct La_i86_retval, lrv_st1)
+LRV_BND0_OFFSET		offsetof (struct La_i86_retval, lrv_bnd0)
+LRV_BND1_OFFSET		offsetof (struct La_i86_retval, lrv_bnd1)
diff --git a/sysdeps/x86/bits/link.h b/sysdeps/x86/bits/link.h
index 3f559c9..0bf9b9a 100644
--- a/sysdeps/x86/bits/link.h
+++ b/sysdeps/x86/bits/link.h
@@ -38,6 +38,8 @@  typedef struct La_i86_retval
   uint32_t lrv_edx;
   long double lrv_st0;
   long double lrv_st1;
+  uint64_t lrv_bnd0;
+  uint64_t lrv_bnd1;
 } La_i86_retval;
 
 
diff --git a/sysdeps/x86_64/dl-trampoline.S b/sysdeps/x86_64/dl-trampoline.S
index 5f9b35d..b151d35 100644
--- a/sysdeps/x86_64/dl-trampoline.S
+++ b/sysdeps/x86_64/dl-trampoline.S
@@ -206,8 +206,8 @@  _dl_runtime_profile:
 #  else
 	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET)
 	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
-	.byte 0x66,0x0f,0x1b,0x84,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
-	.byte 0x66,0x0f,0x1b,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
+	.byte 0x66,0x0f,0x1b,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
+	.byte 0x66,0x0f,0x1b,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
 #  endif
 # endif
 
diff --git a/sysdeps/x86_64/dl-trampoline.h b/sysdeps/x86_64/dl-trampoline.h
index 0e5a6fb..d542428 100644
--- a/sysdeps/x86_64/dl-trampoline.h
+++ b/sysdeps/x86_64/dl-trampoline.h
@@ -63,20 +63,6 @@ 
 	movaps (LR_XMM_OFFSET + XMM_SIZE*6)(%rsp), %xmm6
 	movaps (LR_XMM_OFFSET + XMM_SIZE*7)(%rsp), %xmm7
 
-#ifndef __ILP32__
-# ifdef HAVE_MPX_SUPPORT
-	bndmov 		    (LR_BND_OFFSET)(%rsp), %bnd0  # Restore bound
-	bndmov (LR_BND_OFFSET +   BND_SIZE)(%rsp), %bnd1  # registers.
-	bndmov (LR_BND_OFFSET + BND_SIZE*2)(%rsp), %bnd2
-	bndmov (LR_BND_OFFSET + BND_SIZE*3)(%rsp), %bnd3
-# else
-	.byte 0x66,0x0f,0x1a,0x84,0x24;.long (LR_BND_OFFSET)
-	.byte 0x66,0x0f,0x1a,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
-	.byte 0x66,0x0f,0x1a,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
-	.byte 0x66,0x0f,0x1a,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
-# endif
-#endif
-
 #ifdef RESTORE_AVX
 	/* Check if any xmm0-xmm7 registers are changed by audit
 	   module.  */
@@ -154,8 +140,24 @@ 
 
 1:
 #endif
+
+#ifndef __ILP32__
+# ifdef HAVE_MPX_SUPPORT
+	bndmov              (LR_BND_OFFSET)(%rsp), %bnd0  # Restore bound
+	bndmov (LR_BND_OFFSET +   BND_SIZE)(%rsp), %bnd1  # registers.
+	bndmov (LR_BND_OFFSET + BND_SIZE*2)(%rsp), %bnd2
+	bndmov (LR_BND_OFFSET + BND_SIZE*3)(%rsp), %bnd3
+# else
+	.byte 0x66,0x0f,0x1a,0x84,0x24;.long (LR_BND_OFFSET)
+	.byte 0x66,0x0f,0x1a,0x8c,0x24;.long (LR_BND_OFFSET + BND_SIZE)
+	.byte 0x66,0x0f,0x1a,0x94,0x24;.long (LR_BND_OFFSET + BND_SIZE*2)
+	.byte 0x66,0x0f,0x1a,0x9c,0x24;.long (LR_BND_OFFSET + BND_SIZE*3)
+# endif
+#endif
+
 	mov  16(%rbx), %R10_LP	# Anything in framesize?
 	test %R10_LP, %R10_LP
+	PRESERVE_BND_REGS_PREFIX
 	jns 3f
 
 	/* There's nothing in the frame size, so there
@@ -174,6 +176,7 @@ 
 	addq $48, %rsp		# Adjust the stack to the return value
 				# (eats the reloc index and link_map)
 	cfi_adjust_cfa_offset(-48)
+	PRESERVE_BND_REGS_PREFIX
 	jmp *%r11		# Jump to function address.
 
 3:
@@ -200,6 +203,7 @@ 
 	movq 32(%rdi), %rsi
 	movq 40(%rdi), %rdi
 
+	PRESERVE_BND_REGS_PREFIX
 	call *%r11
 
 	mov 24(%rbx), %rsp	# Drop the copied stack content
@@ -280,11 +284,11 @@ 
 
 #ifndef __ILP32__
 # ifdef HAVE_MPX_SUPPORT
-	bndmov LRV_BND0_OFFSET(%rcx), %bnd0  # Restore bound registers.
-	bndmov LRV_BND1_OFFSET(%rcx), %bnd1
+	bndmov LRV_BND0_OFFSET(%rsp), %bnd0  # Restore bound registers.
+	bndmov LRV_BND1_OFFSET(%rsp), %bnd1
 # else
-	.byte  0x66,0x0f,0x1a,0x81;.long (LRV_BND0_OFFSET)
-	.byte  0x66,0x0f,0x1a,0x89;.long (LRV_BND1_OFFSET)
+	.byte  0x66,0x0f,0x1a,0x84,0x24;.long (LRV_BND0_OFFSET)
+	.byte  0x66,0x0f,0x1a,0x8c,0x24;.long (LRV_BND1_OFFSET)
 # endif
 #endif
 
@@ -299,6 +303,7 @@ 
 	addq $48, %rsp		# Adjust the stack to the return value
 				# (eats the reloc index and link_map)
 	cfi_adjust_cfa_offset(-48)
+	PRESERVE_BND_REGS_PREFIX
 	retq
 
 #ifdef MORE_CODE
-- 
2.4.3