Don't write beyond destination in __mempcpy_avx512_no_vzeroupper (bug 23196)

Message ID mvm1se4romh.fsf@suse.de
State Committed
Commit 9aaaab7c6e4176e61c59b0a63c6ba906d875dc0e
Headers

Commit Message

Andreas Schwab May 22, 2018, 10:06 a.m. UTC
  [BZ #23196]
	CVE-2018-11237
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
	(L(preloop_large)): Save initial destination pointer in %r11 and
	use it instead of %rax after the loop.
	* string/test-mempcpy.c (MIN_PAGE_SIZE): Define.
---
 string/test-mempcpy.c                                   | 1 +
 sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)
  

Comments

H.J. Lu May 22, 2018, 11:58 a.m. UTC | #1
On Tue, May 22, 2018 at 3:06 AM, Andreas Schwab <schwab@suse.de> wrote:
>         [BZ #23196]
>         CVE-2018-11237
>         * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
>         (L(preloop_large)): Save initial destination pointer in %r11 and
>         use it instead of %rax after the loop.
>         * string/test-mempcpy.c (MIN_PAGE_SIZE): Define.

Please include your analysis in commit message.

>  string/test-mempcpy.c                                   | 1 +
>  sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S | 5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/string/test-mempcpy.c b/string/test-mempcpy.c
> index c08fba895e..d98ecdd2d9 100644
> --- a/string/test-mempcpy.c
> +++ b/string/test-mempcpy.c
> @@ -18,6 +18,7 @@
>     <http://www.gnu.org/licenses/>.  */
>
>  #define MEMCPY_RESULT(dst, len) (dst) + (len)
> +#define MIN_PAGE_SIZE 131072
>  #define TEST_MAIN
>  #define TEST_NAME "mempcpy"
>  #include "test-string.h"

The modified test does't fail on Skylake server with unchanged
memmove-avx512-no-vzeroupper.S.   Can you modify the test
so that it fails with the original memmove-avx512-no-vzeroupper.S
on Skylake server?

> diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
> index 23c0f7a9ed..effc3ac2de 100644
> --- a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
> +++ b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
> @@ -336,6 +336,7 @@ L(preloop_large):
>         vmovups (%rsi), %zmm4
>         vmovups 0x40(%rsi), %zmm5
>
> +       mov     %rdi, %r11
>  /* Align destination for access with non-temporal stores in the loop.  */
>         mov     %rdi, %r8
>         and     $-0x80, %rdi
> @@ -366,8 +367,8 @@ L(gobble_256bytes_nt_loop):
>         cmp     $256, %rdx
>         ja      L(gobble_256bytes_nt_loop)
>         sfence
> -       vmovups %zmm4, (%rax)
> -       vmovups %zmm5, 0x40(%rax)
> +       vmovups %zmm4, (%r11)
> +       vmovups %zmm5, 0x40(%r11)
>         jmp     L(check)
>
>  L(preloop_large_bkw):

memmove-vec-unaligned-erms.S supports no vzeroupper:

#ifndef VZEROUPPER
# if VEC_SIZE > 16
#  define VZEROUPPER vzeroupper
# else
#  define VZEROUPPER
# endif
#endif

Should it be used instead?
  
Andreas Schwab May 22, 2018, 12:17 p.m. UTC | #2
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:

> The modified test does't fail on Skylake server with unchanged
> memmove-avx512-no-vzeroupper.S.   Can you modify the test
> so that it fails with the original memmove-avx512-no-vzeroupper.S
> on Skylake server?

How to do that?  I have tested it on a Xeon Phi (whateve that means)
where it failed without the fix.

> memmove-vec-unaligned-erms.S supports no vzeroupper:
>
> #ifndef VZEROUPPER
> # if VEC_SIZE > 16
> #  define VZEROUPPER vzeroupper
> # else
> #  define VZEROUPPER
> # endif
> #endif
>
> Should it be used instead?

I don't understand what that has to do with the bug in question.

Andreas.
  
H.J. Lu May 22, 2018, 12:25 p.m. UTC | #3
On Tue, May 22, 2018 at 5:17 AM, Andreas Schwab <schwab@suse.de> wrote:
> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:
>
>> The modified test does't fail on Skylake server with unchanged
>> memmove-avx512-no-vzeroupper.S.   Can you modify the test
>> so that it fails with the original memmove-avx512-no-vzeroupper.S
>> on Skylake server?
>
> How to do that?  I have tested it on a Xeon Phi (whateve that means)
> where it failed without the fix.

__mempcpy_avx512_no_vzeroupper is also tested on Skylake server:

[hjl@gnu-skx-1 build-x86_64-linux]$ ./string/test-mempcpy
                        simple_mempcpy __mempcpy_avx512_no_vzeroupper
__mempcpy_avx512_unaligned __mempcpy_avx512_unaligned_erms
__mempcpy_avx_unaligned__mempcpy_avx_unaligned_erms
__mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned
__mempcpy_sse2_unaligned_erms __mempcpy_erms
[hjl@gnu-skx-1 build-x86_64-linux]$

It should also fail without the fix.

>> memmove-vec-unaligned-erms.S supports no vzeroupper:
>>
>> #ifndef VZEROUPPER
>> # if VEC_SIZE > 16
>> #  define VZEROUPPER vzeroupper
>> # else
>> #  define VZEROUPPER
>> # endif
>> #endif
>>
>> Should it be used instead?
>
> I don't understand what that has to do with the bug in question.

__mempcpy_avx512_no_vzeroupper can be implemented with
memmove-vec-unaligned-erms.S.  It doesn't have the issue in
memmove-avx512-no-vzeroupper.S.
  
Andreas Schwab May 22, 2018, 12:38 p.m. UTC | #4
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:

> It should also fail without the fix.

Again, tell me how to do that.

> __mempcpy_avx512_no_vzeroupper can be implemented with
> memmove-vec-unaligned-erms.S.

In which way?  There is no definition of __mempcpy_avx512_no_vzeroupper
in memmove-vec-unaligned-erms.S.

Andreas.
  
Florian Weimer May 22, 2018, 12:47 p.m. UTC | #5
On 05/22/2018 02:17 PM, Andreas Schwab wrote:
> On Mai 22 2018, "H.J. Lu"<hjl.tools@gmail.com>  wrote:
> 
>> The modified test does't fail on Skylake server with unchanged
>> memmove-avx512-no-vzeroupper.S.   Can you modify the test
>> so that it fails with the original memmove-avx512-no-vzeroupper.S
>> on Skylake server?

> How to do that?  I have tested it on a Xeon Phi (whateve that means)
> where it failed without the fix.

It has to do with copy sizes and relation to the cache size.  I had to 
adjust your reproducer to make it fail on KNL, basically increasing the 
copy size by a factor of 10.

Thanks,
Florian
  
H.J. Lu May 22, 2018, 12:55 p.m. UTC | #6
On Tue, May 22, 2018 at 5:38 AM, Andreas Schwab <schwab@suse.de> wrote:
> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:
>
>> It should also fail without the fix.
>
> Again, tell me how to do that.

You need to trigger this condition on all AVX512 machines.

>> __mempcpy_avx512_no_vzeroupper can be implemented with
>> memmove-vec-unaligned-erms.S.
>
> In which way?  There is no definition of __mempcpy_avx512_no_vzeroupper
> in memmove-vec-unaligned-erms.S.
>

memmove-vec-unaligned-erms.S is a template.  It can be used to
implement AVX512 memmove without vzeroupper.
  
Andreas Schwab May 22, 2018, 1:06 p.m. UTC | #7
On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:

> On Tue, May 22, 2018 at 5:38 AM, Andreas Schwab <schwab@suse.de> wrote:
>> On Mai 22 2018, "H.J. Lu" <hjl.tools@gmail.com> wrote:
>>
>>> It should also fail without the fix.
>>
>> Again, tell me how to do that.
>
> You need to trigger this condition on all AVX512 machines.

Again, tell me how to do that.

> memmove-vec-unaligned-erms.S is a template.  It can be used to
> implement AVX512 memmove without vzeroupper.

How is that relevant?

Andreas.
  
Andreas Schwab May 22, 2018, 1:21 p.m. UTC | #8
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> It has to do with copy sizes and relation to the cache size.  I had to
> adjust your reproducer to make it fail on KNL, basically increasing the
> copy size by a factor of 10.

What is the architectural maximum cache size?

Andreas.
  
Florian Weimer May 22, 2018, 1:53 p.m. UTC | #9
On 05/22/2018 03:21 PM, Andreas Schwab wrote:
> On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> It has to do with copy sizes and relation to the cache size.  I had to
>> adjust your reproducer to make it fail on KNL, basically increasing the
>> copy size by a factor of 10.
> 
> What is the architectural maximum cache size?

My test machine had an L2 cache size of 1 MiB.  I think the cut-over 
happens at half of that (at least for our backported version), which is why

#define N 976999

in your reproducer word for me (along with -O0 -fno-builtin).
Thanks,
Florian
  
Andreas Schwab May 22, 2018, 2:12 p.m. UTC | #10
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> My test machine had an L2 cache size of 1 MiB.

Then you should be safe, a 64K mempcpy is enough to trigger it.

Andreas.
  
Florian Weimer May 22, 2018, 2:36 p.m. UTC | #11
On 05/22/2018 04:12 PM, Andreas Schwab wrote:
> On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> My test machine had an L2 cache size of 1 MiB.
> 
> Then you should be safe, a 64K mempcpy is enough to trigger it.

Our memcpy has:


L(512bytesormore):
#ifdef SHARED_CACHE_SIZE_HALF
         mov     $SHARED_CACHE_SIZE_HALF, %r8
#else
         mov     __x86_64_shared_cache_size_half(%rip), %r8
#endif
         cmp     %r8, %rdx
         jae     L(preloop_large)
         cmp     $1024, %rdx
         ja      L(1024bytesormore)

And with the original reproducer, we wouldn't hit preloop_large, where 
the bug is.

Thanks,
Florian
  
Andreas Schwab May 22, 2018, 3:06 p.m. UTC | #12
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> And with the original reproducer, we wouldn't hit preloop_large, where the
> bug is.

Worksforme.

$ string/test-mempcpy 
                        simple_mempcpy  __mempcpy_avx512_no_vzeroupper  __mempcpy_avx512_unaligned      __mempcpy_avx512_unaligned_erms __mempcpy_avx_unaligned__mempcpy_avx_unaligned_erms     __mempcpy_ssse3_back    __mempcpy_ssse3 __mempcpy_sse2_unaligned        __mempcpy_sse2_unaligned_erms   __mempcpy_erms
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6000 "" src 0x7f7cf7a36000 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6000 "" src 0x7f7cf7a36010 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6010 "" src 0x7f7cf7a36000 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536
/mnt/src/libc/test/string/test-mempcpy: Wrong result in function __mempcpy_avx512_no_vzeroupper dst 0x7f7cf79f6010 "" src 0x7f7cf7a36010 "▒/F]t������,CZq������)@Wn������&=Tk������
                   #:Qh�����     7Ne|�����4Kby�����▒1H_v�����" len 65536

Andreas.
  
Florian Weimer May 22, 2018, 3:35 p.m. UTC | #13
On 05/22/2018 05:06 PM, Andreas Schwab wrote:
> On Mai 22 2018, Florian Weimer<fweimer@redhat.com>  wrote:
> 
>> And with the original reproducer, we wouldn't hit preloop_large, where the
>> bug is.
> Worksforme.
> 
> $ string/test-mempcpy

I don't doubt that.  Our CPUs probably have different cache sizes, or 
there are differences in how different glibc versions compute the 
effective cache size.

Thanks,
Florian
  
Andreas Schwab May 22, 2018, 3:38 p.m. UTC | #14
On Mai 22 2018, Florian Weimer <fweimer@redhat.com> wrote:

> On 05/22/2018 05:06 PM, Andreas Schwab wrote:
>> On Mai 22 2018, Florian Weimer<fweimer@redhat.com>  wrote:
>>
>>> And with the original reproducer, we wouldn't hit preloop_large, where the
>>> bug is.
>> Worksforme.
>>
>> $ string/test-mempcpy
>
> I don't doubt that.  Our CPUs probably have different cache sizes,

Same cache size.

Andreas.
  

Patch

diff --git a/string/test-mempcpy.c b/string/test-mempcpy.c
index c08fba895e..d98ecdd2d9 100644
--- a/string/test-mempcpy.c
+++ b/string/test-mempcpy.c
@@ -18,6 +18,7 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #define MEMCPY_RESULT(dst, len) (dst) + (len)
+#define MIN_PAGE_SIZE 131072
 #define TEST_MAIN
 #define TEST_NAME "mempcpy"
 #include "test-string.h"
diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
index 23c0f7a9ed..effc3ac2de 100644
--- a/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
+++ b/sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S
@@ -336,6 +336,7 @@  L(preloop_large):
 	vmovups	(%rsi), %zmm4
 	vmovups	0x40(%rsi), %zmm5
 
+	mov	%rdi, %r11
 /* Align destination for access with non-temporal stores in the loop.  */
 	mov	%rdi, %r8
 	and	$-0x80, %rdi
@@ -366,8 +367,8 @@  L(gobble_256bytes_nt_loop):
 	cmp	$256, %rdx
 	ja	L(gobble_256bytes_nt_loop)
 	sfence
-	vmovups	%zmm4, (%rax)
-	vmovups	%zmm5, 0x40(%rax)
+	vmovups	%zmm4, (%r11)
+	vmovups	%zmm5, 0x40(%r11)
 	jmp	L(check)
 
 L(preloop_large_bkw):