[3/6] newlib: mem[p]cpy/memmove improve performance for optimized versions

Message ID 085f3ec3cbe41e0a377b1d26089a871f04ffd5d6.camel@espressif.com
State New
Series Refactor and optimize string/memory functions

Commit Message

Alexey Lapshin Jan. 27, 2025, 10:45 a.m. UTC
  This change improves performance on memory blocks with sizes in the
range [4..15]. Performance measurements were made on a RISC-V machine (memset):

size  4, CPU cycles change: 50 -> 37
size  5, CPU cycles change: 57 -> 40
size  6, CPU cycles change: 64 -> 47
size  7, CPU cycles change: 71 -> 54
size  8, CPU cycles change: 78 -> 44
size  9, CPU cycles change: 85 -> 47
size 10, CPU cycles change: 92 -> 54
size 11, CPU cycles change: 99 -> 61
size 12, CPU cycles change: 106 -> 51
size 13, CPU cycles change: 113 -> 54
size 14, CPU cycles change: 120 -> 61
size 15, CPU cycles change: 127 -> 68
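
For reference, cycle counts of this sort could be gathered with a small
harness along the following lines (an illustrative sketch only, not the
harness actually used; the rdcycle-based timing, buffer sizes, and helper
names are assumptions):

/* Hypothetical measurement harness (not part of this patch): reads the
   RISC-V cycle CSR around a single call through the library routine.  */
#include <stdio.h>
#include <string.h>

static unsigned long
rdcycle (void)
{
  unsigned long c;
  __asm__ volatile ("rdcycle %0" : "=r" (c));
  return c;
}

int
main (void)
{
  static char src[16], dst[16];
  /* Call through a volatile pointer so the compiler cannot inline-expand
     the call and defeat the measurement.  */
  void *(*volatile copy) (void *, const void *, size_t) = memcpy;

  for (size_t n = 4; n <= 15; n++)
    {
      unsigned long start = rdcycle ();
      copy (dst, src, n);
      unsigned long end = rdcycle ();
      printf ("size %2u, CPU cycles: %lu\n", (unsigned) n, end - start);
    }
  return 0;
}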
---
 newlib/libc/string/memcpy.c  | 2 +-
 newlib/libc/string/memmove.c | 2 +-
 newlib/libc/string/mempcpy.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

-- 
2.43.0
  

Comments

Corinna Vinschen Jan. 28, 2025, 4:11 p.m. UTC | #1
On Jan 27 10:45, Alexey Lapshin wrote:
> This change improves performance on memory blocks with sizes in range
> [4..15]. Performance measurements made for RISCV machine (memset):
> 
> size  4, CPU cycles change: 50 -> 37
> size  5, CPU cycles change: 57 -> 40
> size  6, CPU cycles change: 64 -> 47
> size  7, CPU cycles change: 71 -> 54
> size  8, CPU cycles change: 78 -> 44
> size  9, CPU cycles change: 85 -> 47
> size 10, CPU cycles change: 92 -> 54
> size 11, CPU cycles change: 99 -> 61
> size 12, CPU cycles change: 106 -> 51
> size 13, CPU cycles change: 113 -> 54
> size 14, CPU cycles change: 120 -> 61
> size 15, CPU cycles change: 127 -> 68

But is that generally true for other architectures as well?


Corinna
  
Richard Earnshaw (lists) Jan. 28, 2025, 4:33 p.m. UTC | #2
On 28/01/2025 16:11, Corinna Vinschen wrote:
> On Jan 27 10:45, Alexey Lapshin wrote:
>> This change improves performance on memory blocks with sizes in range
>> [4..15]. Performance measurements made for RISCV machine (memset):
>>
>> size  4, CPU cycles change: 50 -> 37
>> size  5, CPU cycles change: 57 -> 40
>> size  6, CPU cycles change: 64 -> 47
>> size  7, CPU cycles change: 71 -> 54
>> size  8, CPU cycles change: 78 -> 44
>> size  9, CPU cycles change: 85 -> 47
>> size 10, CPU cycles change: 92 -> 54
>> size 11, CPU cycles change: 99 -> 61
>> size 12, CPU cycles change: 106 -> 51
>> size 13, CPU cycles change: 113 -> 54
>> size 14, CPU cycles change: 120 -> 61
>> size 15, CPU cycles change: 127 -> 68
> 
> But is that generally true for other architectures as well?
> 

No, it can be very dependent on the microarchitecture.  I know of Arm implementations where it would be better and implementations where it would be (much) worse.  The other variable is that for misaligned copies there's a choice of bringing either the source data or the target data to alignment (you really don't want to do a large copy with both misaligned).  That can also vary by microarchitecture.

But we have custom assembler versions for Arm, so it probably doesn't matter for us, except at -Os, and there I wouldn't expect us to want large expanded chunks of code for all the cases that misaligned copies might involve.

R.
  
Corinna Vinschen Jan. 29, 2025, 11:26 a.m. UTC | #3
On Jan 28 16:33, Richard Earnshaw (lists) wrote:
> On 28/01/2025 16:11, Corinna Vinschen wrote:
> > On Jan 27 10:45, Alexey Lapshin wrote:
> >> This change improves performance on memory blocks with sizes in range
> >> [4..15]. Performance measurements made for RISCV machine (memset):
> >>
> >> size  4, CPU cycles change: 50 -> 37
> >> size  5, CPU cycles change: 57 -> 40
> >> size  6, CPU cycles change: 64 -> 47
> >> size  7, CPU cycles change: 71 -> 54
> >> size  8, CPU cycles change: 78 -> 44
> >> size  9, CPU cycles change: 85 -> 47
> >> size 10, CPU cycles change: 92 -> 54
> >> size 11, CPU cycles change: 99 -> 61
> >> size 12, CPU cycles change: 106 -> 51
> >> size 13, CPU cycles change: 113 -> 54
> >> size 14, CPU cycles change: 120 -> 61
> >> size 15, CPU cycles change: 127 -> 68
> > 
> > But is that generally true for other architectures as well?
> > 
> 
> No, it can be very dependent on the microarchitecture.  I know of Arm
> implementations where it would be better and implementations where it
> would be (much) worse.

Ok, we're talking about the case where memcpy runs the optimization
based on the size of the block to copy being at least sizeof(long)
vs. at least sizeof(long)*4, while the check for being aligned is based
on sizeof(long) alone.  So assuming sizeof(long) is 4, the optimization
doesn't kick in for blocks < 16 bytes right now, while Alexey's change
allows the optimization to run even for 4 byte blocks.
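
For readers without the rest of the series at hand, the thresholds under
discussion look roughly like this (an illustrative sketch; the exact
macro definitions live in the refactored shared helpers earlier in the
series):

/* Illustrative sketch of the thresholds being discussed; not the
   literal definitions from the series.  */
#define LITTLE_BLOCK_SIZE  (sizeof (long))        /* one word   */
#define BIG_BLOCK_SIZE     (sizeof (long) << 2)   /* four words */

#define TOO_SMALL_LITTLE_BLOCK(LEN)  ((LEN) < LITTLE_BLOCK_SIZE)
#define TOO_SMALL_BIG_BLOCK(LEN)     ((LEN) < BIG_BLOCK_SIZE)

/* Alignment is checked against the word size in either case.  */
#define UNALIGNED_X_Y(X, Y) \
  (((long)(X) & (sizeof (long) - 1)) | ((long)(Y) & (sizeof (long) - 1)))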

As I understand it, the additional length checks in the optimizing code
*may* have a bigger performance hit than the time saved by copying 4
bytes at once rather than bytewise.

Alexey's tests above show that even for a 4 byte copy, the optimization
still gives a performance boost compared to a bytewise copy on RISC-V.

This part is interesting.  Do we really have a supported architecture
where one additional `while (len0 >= BIGBLOCKSIZE)' check has such an
impact that running the optimizing code is worse than a byte copy for
small, but aligned, blocks?
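
For context, the word-copy core of the generic fallback is structured
roughly as below (a simplified sketch, with the macro names from the
sketch above; the leading `while' test is the extra check in question):

#include <stddef.h>

#define LITTLE_BLOCK_SIZE  (sizeof (long))
#define BIG_BLOCK_SIZE     (sizeof (long) << 2)

/* Simplified sketch of the word-copy core of newlib's generic memcpy.c.
   dst/src are assumed word aligned; returns the number of tail bytes
   left over for the byte-copy loop.  */
static size_t
copy_words (long *aligned_dst, const long *aligned_src, size_t len0)
{
  while (len0 >= BIG_BLOCK_SIZE)        /* unrolled: four words per pass */
    {
      *aligned_dst++ = *aligned_src++;
      *aligned_dst++ = *aligned_src++;
      *aligned_dst++ = *aligned_src++;
      *aligned_dst++ = *aligned_src++;
      len0 -= BIG_BLOCK_SIZE;
    }

  while (len0 >= LITTLE_BLOCK_SIZE)     /* one word per pass */
    {
      *aligned_dst++ = *aligned_src++;
      len0 -= LITTLE_BLOCK_SIZE;
    }

  return len0;                          /* tail bytes, copied bytewise */
}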

> The other variable is that for misaligned
> copies there's a choice of bringing the source data to alignment or
> the target data (you really don't want to do a large copy with both
> misaligned).  That can also vary by micro-architecture.

Yeah, but our simple fallback memcpy doesn't try to align; it only
runs the optimizing code block if both buffers are already aligned
on input.  Alexey's patch doesn't change this.

> But we have custom assembler versions for Arm, so it probably doesn't
> matter for us, except at -Os

-Os isn't affected because it runs the PREFER_SIZE_OVER_SPEED code,
which only does a byte copy anyway.
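
That size-optimized branch is essentially just a plain byte loop,
roughly like this sketch:

#include <stddef.h>

/* Roughly the PREFER_SIZE_OVER_SPEED / __OPTIMIZE_SIZE__ branch of
   newlib's memcpy.c: a plain byte loop, so no block-size threshold is
   involved at all.  */
void *
memcpy (void *__restrict dst0, const void *__restrict src0, size_t len0)
{
  char *dst = (char *) dst0;
  const char *src = (const char *) src0;

  while (len0--)
    *dst++ = *src++;

  return dst0;
}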


Corinna
  

Patch

diff --git a/newlib/libc/string/memcpy.c b/newlib/libc/string/memcpy.c
index 1bbd4e0bf..e680c444d 100644
--- a/newlib/libc/string/memcpy.c
+++ b/newlib/libc/string/memcpy.c
@@ -57,7 +57,7 @@  memcpy (void *__restrict dst0,
 
   /* If the size is small, or either SRC or DST is unaligned,
      then punt into the byte copy loop.  This should be rare.  */
-  if (!TOO_SMALL_BIG_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
+  if (!TOO_SMALL_LITTLE_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
     {
       aligned_dst = (long*)dst;
       aligned_src = (long*)src;
diff --git a/newlib/libc/string/memmove.c b/newlib/libc/string/memmove.c
index a82744c7d..4c5ec6f83 100644
--- a/newlib/libc/string/memmove.c
+++ b/newlib/libc/string/memmove.c
@@ -85,7 +85,7 @@  memmove (void *dst_void,
       /* Use optimizing algorithm for a non-destructive copy to closely 
          match memcpy. If the size is small or either SRC or DST is unaligned,
          then punt into the byte copy loop.  This should be rare.  */
-      if (!TOO_SMALL_BIG_BLOCK(length) && !UNALIGNED_X_Y(src, dst))
+      if (!TOO_SMALL_LITTLE_BLOCK(length) && !UNALIGNED_X_Y(src, dst))
         {
           aligned_dst = (long*)dst;
           aligned_src = (long*)src;
diff --git a/newlib/libc/string/mempcpy.c b/newlib/libc/string/mempcpy.c
index 06e97de85..561892199 100644
--- a/newlib/libc/string/mempcpy.c
+++ b/newlib/libc/string/mempcpy.c
@@ -53,7 +53,7 @@  mempcpy (void *dst0,
 
   /* If the size is small, or either SRC or DST is unaligned,
      then punt into the byte copy loop.  This should be rare.  */
-  if (!TOO_SMALL_BIG_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
+  if (!TOO_SMALL_LITTLE_BLOCK(len0) && !UNALIGNED_X_Y(src, dst))
     {
       aligned_dst = (long*)dst;
       aligned_src = (long*)src;