[v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]

Message ID 20221214073633.2876766-1-goldstein.w.n@gmail.com
State New
Headers
Series [v3] x86: Prevent SIGSEGV in memcmp-sse2 when data is concurrently modified [BZ #29863]

Checks

Context Check Description
dj/TryBot-apply_patch success Patch applied to master at the time it was sent
dj/TryBot-32bit success Build for i686
redhat-pt-bot/TryBot-still_applies warning Patch no longer applies to master

Commit Message

Noah Goldstein Dec. 14, 2022, 7:36 a.m. UTC
  In the case of INCORRECT usage of `memcmp(a, b, N)` where `a` and `b`
are concurrently modified as `memcmp` runs, there can be a SIGSEGV
in `L(ret_nonzero_vec_end_0)` because the sequential logic
assumes that `(rdx - 32 + rax)` is a non-negative 32-bit integer.

To be clear, this change does not mean such usage of `memcmp` is
supported.  The program behaviour is undefined (UB) in the
presence of data races, and `memcmp` is incorrect when the values
of `a` and/or `b` are modified concurrently (data race).  This UB
may manifest itself as a SIGSEGV.  That being said, if we can
allow idiomatic use cases, like those in yottadb with
opportunistic concurrency control (OCC), to execute without a
SIGSEGV, at no cost to regular use cases, then we can
minimize harm to those existing users.
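
For illustration only, here is a rough C sketch of the OCC-style
pattern in question (hypothetical names, not yottadb's actual code):
a reader snapshots a version counter, compares a shared buffer that a
writer may be rewriting, and only trusts the result if the counter is
unchanged afterwards.  The data race is still UB; the sketch only
shows why a stray SIGSEGV defeats the retry loop.

```c
/* Illustrative OCC-style reader; hypothetical names, not yottadb code.
   The racy memcmp is UB per the C standard -- the version re-check only
   makes the *value* of a torn compare harmless, not the race itself.  */
#include <stdatomic.h>
#include <string.h>

extern _Atomic unsigned int version;  /* bumped by writers around updates */
extern char shared[32];               /* may be modified concurrently */

int occ_equal (const char *priv, size_t len)
{
  for (;;)
    {
      unsigned int v0 = atomic_load (&version);
      int eq = memcmp (shared, priv, len) == 0;  /* racy read */
      if (atomic_load (&version) == v0)
        return eq;   /* no writer interfered; result is usable */
      /* Otherwise a writer changed the data mid-compare: retry.  */
    }
}
```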

The fix replaces a 32-bit `addl %edx, %eax` with the 64-bit variant
`addq %rdx, %rax`.  The one extra byte of code size from using the
64-bit instruction doesn't contribute to overall code size, as the
next target is aligned and has multiple bytes of `nop` padding
before it.  As well, all the logic between the add and `ret` still
fits in the same fetch block, so the cost of this change is
basically zero.

The relevant sequential logic can be seen in the following
pseudo-code:
```
    /*
     * rsi = a
     * rdi = b
     * rdx = len - 32
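     * ecx = 0xffff (all-bytes-equal compare mask; inferred from the
     *               subl/jnz checks below)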
     */
    /* cmp a[0:15] and b[0:15]. Since the length is known to be in
    [17, 32] in this case, this check is also assumed to cover the
    overlapping bytes of the second (tail) load, i.e. indices
    [0:(31 - len)] of a[len - 16:len - 1] and b[len - 16:len - 1].  */
    movups  (%rsi), %xmm0
    movups  (%rdi), %xmm1
    PCMPEQ  %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    subl    %ecx, %eax
    jnz L(END_NEQ)

    /* cmp a[len-16:len-1] and b[len-16:len-1].  */
    movups  16(%rsi, %rdx), %xmm0
    movups  16(%rdi, %rdx), %xmm1
    PCMPEQ  %xmm0, %xmm1
    pmovmskb %xmm1, %eax
    subl    %ecx, %eax
    jnz L(END_NEQ2)
    ret

L(END_NEQ2):
    /* Position of the first mismatch within the tail load.  */
    bsfl    %eax, %eax

    /* The sequential version is able to assume this value is a
    non-negative 32-bit value because the first check already covered
    the overlapping tail-load bytes at indices [0:(31 - len)], so
    `eax` must be greater than `31 - len` and the minimum value of
    `edx + eax` is `(len - 32) + (32 - len) >= 0`. In the concurrent
    case, however, `a` or `b` could have been changed, so a mismatch
    with `eax` less than or equal to `(31 - len)` is possible (the new
    lower bound is `(16 - len)`). This can result in a negative 32-bit
    signed integer which, when zero-extended to 64 bits, is a large
    value that is out of bounds.  */
    addl %edx, %eax

    /* Crash here because the 32-bit negative number in `eax` zero
    extends to an out-of-bounds 64-bit offset.  */
    movzbl  16(%rdi, %rax), %ecx
    movzbl  16(%rsi, %rax), %eax
```
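
To make the failure concrete, the following standalone C sketch (with
made-up values, not taken from the patch) mirrors the address
arithmetic above: with a 32-bit add the negative `edx + eax` wraps and
is then zero-extended into a huge 64-bit index, while a 64-bit add
leaves a small negative value that the fixed `+16` displacement keeps
in bounds.

```c
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  int64_t len = 17;        /* any length in [17, 32] */
  int64_t rdx = len - 32;  /* -15 */
  int64_t rax = 0;         /* worst-case mismatch index under a race */

  /* Old code: the 32-bit add wraps to 0xfffffff1, which zero-extends
     to a ~4 GiB index when used in the 64-bit addressing mode.  */
  uint64_t bad = (uint32_t) (rdx + rax);

  /* Fixed code: the 64-bit add keeps the value at -15; adding the
     fixed +16 displacement gives an in-bounds offset of 1.  */
  int64_t good = rdx + rax;

  printf ("32-bit add index: %#llx\n", (unsigned long long) bad);
  printf ("64-bit add index: %lld -> final offset %lld\n",
          (long long) good, (long long) (16 + good));
  return 0;
}
```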

The fix is quite simple: just make the `addl %edx, %eax` 64-bit
(i.e. `addq %rdx, %rax`). This prevents the 32-bit zero extension,
and since `eax` still has a lower bound of `16 - len`, `rdx + rax`
is bounded below by `(len - 32) + (16 - len) = -16`. Since we have a
fixed offset of `16` in the memory access, this must be in bounds.
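
As a sanity check of that bound (a standalone sketch, not part of the
patch), the following loop enumerates every length in [17, 32] and
every possible mismatch index of the 16-byte tail load and confirms
the final byte offset always lands inside the buffer:

```c
#include <assert.h>

int main (void)
{
  for (long len = 17; len <= 32; len++)
    for (long eax = 0; eax <= 15; eax++)  /* bsf of a 16-bit mask */
      {
        long rdx = len - 32;
        long off = 16 + rdx + eax;  /* offset used by the movzbl */
        assert (off >= 0 && off < len);  /* always in bounds */
      }
  return 0;
}
```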
---
 sysdeps/x86_64/multiarch/memcmp-sse2.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

H.J. Lu Dec. 14, 2022, 4:09 p.m. UTC | #1
On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>         setg    %dl
>         leal    -1(%rdx, %rdx), %eax
>  #  else
> -       addl    %edx, %eax
> +       addq    %rdx, %rax

Please add some comments here and a testcase.

>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
>         movzbl  (VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
>         subl    %ecx, %eax
  
Carlos O'Donell Dec. 14, 2022, 4:57 p.m. UTC | #2
On 12/14/22 02:36, Noah Goldstein via Libc-alpha wrote:

v3 LGTM.

I commented downthread that I do not suggest a test case for this because
such a test case would be testing for a specific behaviour in the UB case.
As of today we do not have consensus that under UB we want the string or
memory operations to have a specific kind of behaviour.  My opinion is that
requiring a specific behaviour, e.g. no SIGSEGV, would limit the algorithmic
choices available, and that is too heavy a cost to pay without further
strong justification.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

  
Carlos O'Donell Dec. 14, 2022, 4:57 p.m. UTC | #3
On 12/14/22 11:09, H.J. Lu via Libc-alpha wrote:
> On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
>>         setg    %dl
>>         leal    -1(%rdx, %rdx), %eax
>>  #  else
>> -       addl    %edx, %eax
>> +       addq    %rdx, %rax
> 
> Please add some comments here and a testcase.

I do not recommend a test case.

I would accept Noah's v3 as-is.

Even with a two- or three-thread test you'd have to run for long enough to statistically
trigger the change at the right time, and that means spending developer CPU resources
to test that a specific IFUNC works under UB conditions. This slows down development
cycle time for a use case we don't actually support.

I support Noah making this change because it doesn't have a performance impact and
I'm empathetic to fixing a developer-reported issue. I wouldn't go any further than that.

My strong suggestion to yottadb is that they need to write their own hand-crafted memcmp
to ensure the entire algorithm operates under the *must-only-read-bytes-once* conditions.
If you read bytes *twice* due to overlapped loads you may have problems if you read
inconsistent values where you didn't expect them. The algorithm limitations are quite
severe IMO, and any mix of C and asm here could result in the compiler exposing visible
UB to the application author.
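
Roughly, what I have in mind is something like the following sketch
(illustrative only, not something glibc provides or recommends; the
hypothetical `memcmp_read_once` name and the use of relaxed atomics
are assumptions yottadb would have to validate against their own
protocol):

```c
/* Sketch of a read-bytes-once comparison: every byte of each buffer is
   loaded exactly once, so a concurrent writer can make the result
   stale but can never cause the routine to re-read a byte and observe
   two different values.  Casting plain storage to _Atomic is itself an
   assumption about the implementation; a production version might use
   volatile accesses or compiler builtins instead.  */
#include <stdatomic.h>
#include <stddef.h>

int memcmp_read_once (const void *a, const void *b, size_t n)
{
  const unsigned char *pa = a, *pb = b;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char ca = atomic_load_explicit
        ((const _Atomic unsigned char *) &pa[i], memory_order_relaxed);
      unsigned char cb = atomic_load_explicit
        ((const _Atomic unsigned char *) &pb[i], memory_order_relaxed);
      if (ca != cb)
        return ca - cb;
    }
  return 0;
}
```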
 
  
Noah Goldstein Dec. 14, 2022, 5:18 p.m. UTC | #4
On Wed, Dec 14, 2022 at 8:57 AM Carlos O'Donell <carlos@redhat.com> wrote:
>
> On 12/14/22 11:09, H.J. Lu via Libc-alpha wrote:
> > On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >> @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
> >>         setg    %dl
> >>         leal    -1(%rdx, %rdx), %eax
> >>  #  else
> >> -       addl    %edx, %eax
> >> +       addq    %rdx, %rax
> >
> > Please add some comments here and a testcase.
>
> I do not recommend a test case.
>
> I would accept Noah's v3 as-is.


I can add a comment around the `addq`, but I agree a test case
doesn't make sense because at the moment its failure/success is
pretty arbitrary.
  
Noah Goldstein Dec. 14, 2022, 6:52 p.m. UTC | #5
On Wed, Dec 14, 2022 at 8:10 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 11:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > @@ -308,7 +308,7 @@ L(ret_nonzero_vec_end_0):
> >         setg    %dl
> >         leal    -1(%rdx, %rdx), %eax
> >  #  else
> > -       addl    %edx, %eax
> > +       addq    %rdx, %rax
>
> Please add some comments here and a testcase.

Added comments.
  

Patch

diff --git a/sysdeps/x86_64/multiarch/memcmp-sse2.S b/sysdeps/x86_64/multiarch/memcmp-sse2.S
index afd450d020..34e60e567d 100644
--- a/sysdeps/x86_64/multiarch/memcmp-sse2.S
+++ b/sysdeps/x86_64/multiarch/memcmp-sse2.S
@@ -308,7 +308,7 @@  L(ret_nonzero_vec_end_0):
 	setg	%dl
 	leal	-1(%rdx, %rdx), %eax
 #  else
-	addl	%edx, %eax
+	addq	%rdx, %rax
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rsi, %rax), %ecx
 	movzbl	(VEC_SIZE * -1 + SIZE_OFFSET)(%rdi, %rax), %eax
 	subl	%ecx, %eax