[1/4] LoongArch: Add ifunc support for strcpy{aligned, unaligned, lsx, lasx}

Message ID 20230908093357.3119822-2-dengjianbo@loongson.cn
State New
Series LoongArch: Add ifunc support for str{cpy, rchr}

Checks

Context Check Description
redhat-pt-bot/TryBot-apply_patch success Patch applied to master at the time it was sent
linaro-tcwg-bot/tcwg_glibc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_glibc_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_glibc_check--master-aarch64 success Testing passed
redhat-pt-bot/TryBot-still_applies warning Patch no longer applies to master

Commit Message

dengjianbo Sept. 8, 2023, 9:33 a.m. UTC
  According to the glibc strcpy microbenchmark results (changed to use
generic_strcpy instead of strlen + memcpy), compared with generic_strcpy,
this implementation reduces the runtime as follows:

Name              Percent of runtime reduced
strcpy-aligned    10%-45%
strcpy-unaligned  10%-49%; compared with the aligned version, the
                  unaligned version performs better when src and dest
                  cannot both be 8-byte aligned
strcpy-lsx        20%-80%
strcpy-lasx       15%-86%
---
 sysdeps/loongarch/lp64/multiarch/Makefile     |   4 +
 .../lp64/multiarch/ifunc-impl-list.c          |   9 +
 .../loongarch/lp64/multiarch/strcpy-aligned.S | 185 ++++++++++++++++
 .../loongarch/lp64/multiarch/strcpy-lasx.S    | 208 ++++++++++++++++++
 sysdeps/loongarch/lp64/multiarch/strcpy-lsx.S | 197 +++++++++++++++++
 .../lp64/multiarch/strcpy-unaligned.S         | 131 +++++++++++
 sysdeps/loongarch/lp64/multiarch/strcpy.c     |  35 +++
 7 files changed, 769 insertions(+)
 create mode 100644 sysdeps/loongarch/lp64/multiarch/strcpy-aligned.S
 create mode 100644 sysdeps/loongarch/lp64/multiarch/strcpy-lasx.S
 create mode 100644 sysdeps/loongarch/lp64/multiarch/strcpy-lsx.S
 create mode 100644 sysdeps/loongarch/lp64/multiarch/strcpy-unaligned.S
 create mode 100644 sysdeps/loongarch/lp64/multiarch/strcpy.c
  

Comments

Xi Ruoyao Sept. 8, 2023, 2:22 p.m. UTC | #1
On Fri, 2023-09-08 at 17:33 +0800, dengjianbo wrote:
> According to the glibc strcpy microbenchmark results (changed to use
> generic_strcpy instead of strlen + memcpy), compared with generic_strcpy,
> this implementation reduces the runtime as follows:
> 
> Name              Percent of runtime reduced
> strcpy-aligned    10%-45%
> strcpy-unaligned  10%-49%; compared with the aligned version, the
>                   unaligned version performs better when src and dest
>                   cannot both be 8-byte aligned
> strcpy-lsx        20%-80%
> strcpy-lasx       15%-86%

Generic strcpy calls stpcpy, so if we've optimized stpcpy maybe it's not
necessary to duplicate everything in strcpy.  Is there a benchmark
result comparing the timing with and without this patch, but both with
the second patch (optimized stpcpy)?
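
(For context, the stpcpy-based generic strcpy being referred to is roughly
the following shape; this is a minimal C sketch, not the exact glibc
source:

    char *
    strcpy (char *dest, const char *src)
    {
      __stpcpy (dest, src);   /* copy the string; stpcpy returns a pointer
                                 to the terminating NUL in dest.  */
      return dest;            /* strcpy must return the original dest.  */
    }

so once __stpcpy is ifunc-dispatched to an optimized variant, a generic
strcpy built on top of it already inherits most of that speedup.)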
  
dengjianbo Sept. 11, 2023, 9:53 a.m. UTC | #2
We tested strcpy-lasx against strcpy calling stpcpy-lasx; the
difference between the two timings is 0.28, with strcpy-lasx taking
less time. When the data length is less than 32, it reduces the
runtime by more than 30%.

See:
https://github.com/jiadengx/glibc_test/blob/main/bench/strcpy_lasx_compare_generic_strcpy.out

There is some code duplicated between strcpy and stpcpy, since the
main part is almost the same. Maybe we can use a single source file
with the macro USE_AS_STPCPY to distinguish strcpy from stpcpy, like
x86_64 does? That could avoid the performance degradation.
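
(As an illustration of the x86_64-style arrangement mentioned above: one
shared assembly source keys the small return-value differences off
USE_AS_STPCPY, and each stpcpy variant is only a tiny wrapper. A
hypothetical LoongArch wrapper, with example file and symbol names, would
look like:

    /* stpcpy-lasx.S: reuse the strcpy-lasx.S body, selecting the
       stpcpy return-value handling via USE_AS_STPCPY.  */
    #define USE_AS_STPCPY
    #define STRCPY __stpcpy_lasx
    #include "strcpy-lasx.S"

The shared source then guards the few differing instructions with
#ifdef USE_AS_STPCPY, so both functions are assembled from one copy of
the main loop.)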

On 2023-09-08 22:22, Xi Ruoyao wrote:
> On Fri, 2023-09-08 at 17:33 +0800, dengjianbo wrote:
>> According to the glibc strcpy microbenchmark results (changed to use
>> generic_strcpy instead of strlen + memcpy), compared with generic_strcpy,
>> this implementation reduces the runtime as follows:
>>
>> Name              Percent of runtime reduced
>> strcpy-aligned    10%-45%
>> strcpy-unaligned  10%-49%; compared with the aligned version, the
>>                   unaligned version performs better when src and dest
>>                   cannot both be 8-byte aligned
>> strcpy-lsx        20%-80%
>> strcpy-lasx       15%-86%
> Generic strcpy calls stpcpy, so if we've optimized stpcpy maybe it's not
> necessary to duplicate everything in strcpy.  Is there a benchmark
> result comparing the timing with and without this patch, but both with
> the second patch (optimized stpcpy)?
>
  
dengjianbo Sept. 13, 2023, 7:47 a.m. UTC | #3
We have changed strcpy to include both the strcpy and stpcpy
implementations and use USE_AS_STPCPY to distinguish the two functions;
the stpcpy file defines the related macros and includes the strcpy
source code.

See patch v2:
https://sourceware.org/pipermail/libc-alpha/2023-September/151531.html

On 2023-09-11 17:53, dengjianbo wrote:
> We tested strcpy-lasx against strcpy calling stpcpy-lasx; the
> difference between the two timings is 0.28, with strcpy-lasx taking
> less time. When the data length is less than 32, it reduces the
> runtime by more than 30%.
>
> See:
> https://github.com/jiadengx/glibc_test/blob/main/bench/strcpy_lasx_compare_generic_strcpy.out
>
> There is some code duplicated between strcpy and stpcpy, since the
> main part is almost the same. Maybe we can use a single source file
> with the macro USE_AS_STPCPY to distinguish strcpy from stpcpy, like
> x86_64 does? That could avoid the performance degradation.
>
> On 2023-09-08 22:22, Xi Ruoyao wrote:
>> On Fri, 2023-09-08 at 17:33 +0800, dengjianbo wrote:
>>> According to the glibc strcpy microbenchmark results (changed to use
>>> generic_strcpy instead of strlen + memcpy), compared with generic_strcpy,
>>> this implementation reduces the runtime as follows:
>>>
>>> Name              Percent of runtime reduced
>>> strcpy-aligned    10%-45%
>>> strcpy-unaligned  10%-49%; compared with the aligned version, the
>>>                   unaligned version performs better when src and dest
>>>                   cannot both be 8-byte aligned
>>> strcpy-lsx        20%-80%
>>> strcpy-lasx       15%-86%
>> Generic strcpy calls stpcpy, so if we've optimized stpcpy maybe it's not
>> necessary to duplicate everything in strcpy.  Is there a benchmark
>> result comparing the timing with and without this patch, but both with
>> the second patch (optimized stpcpy)?
>>
  

Patch

diff --git a/sysdeps/loongarch/lp64/multiarch/Makefile b/sysdeps/loongarch/lp64/multiarch/Makefile
index 360a6718c0..f05685ceec 100644
--- a/sysdeps/loongarch/lp64/multiarch/Makefile
+++ b/sysdeps/loongarch/lp64/multiarch/Makefile
@@ -16,6 +16,10 @@  sysdep_routines += \
   strcmp-lsx \
   strncmp-aligned \
   strncmp-lsx \
+  strcpy-aligned \
+  strcpy-unaligned \
+  strcpy-lsx \
+  strcpy-lasx \
   memcpy-aligned \
   memcpy-unaligned \
   memmove-unaligned \
diff --git a/sysdeps/loongarch/lp64/multiarch/ifunc-impl-list.c b/sysdeps/loongarch/lp64/multiarch/ifunc-impl-list.c
index e397d58c9d..b556bacbd1 100644
--- a/sysdeps/loongarch/lp64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/loongarch/lp64/multiarch/ifunc-impl-list.c
@@ -76,6 +76,15 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_aligned)
 	      )
 
+  IFUNC_IMPL (i, name, strcpy,
+#if !defined __loongarch_soft_float
+	      IFUNC_IMPL_ADD (array, i, strcpy, SUPPORT_LASX, __strcpy_lasx)
+	      IFUNC_IMPL_ADD (array, i, strcpy, SUPPORT_LSX, __strcpy_lsx)
+#endif
+	      IFUNC_IMPL_ADD (array, i, strcpy, SUPPORT_UAL, __strcpy_unaligned)
+	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_aligned)
+	      )
+
   IFUNC_IMPL (i, name, memcpy,
 #if !defined __loongarch_soft_float
               IFUNC_IMPL_ADD (array, i, memcpy, SUPPORT_LASX, __memcpy_lasx)
diff --git a/sysdeps/loongarch/lp64/multiarch/strcpy-aligned.S b/sysdeps/loongarch/lp64/multiarch/strcpy-aligned.S
new file mode 100644
index 0000000000..d5926e5e11
--- /dev/null
+++ b/sysdeps/loongarch/lp64/multiarch/strcpy-aligned.S
@@ -0,0 +1,185 @@ 
+/* Optimized strcpy aligned implementation using basic LoongArch instructions.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/regdef.h>
+#include <sys/asm.h>
+
+#if IS_IN (libc)
+# define STRCPY __strcpy_aligned
+#else
+# define STRCPY strcpy
+#endif
+
+LEAF(STRCPY, 6)
+    andi        a3, a0, 0x7
+    move        a2, a0
+    beqz        a3, L(dest_align)
+    sub.d       a5, a1, a3
+    addi.d      a5, a5, 8
+
+L(make_dest_align):
+    ld.b        t0, a1, 0
+    addi.d      a1, a1, 1
+    st.b        t0, a2, 0
+    beqz        t0, L(al_out)
+
+    addi.d      a2, a2, 1
+    bne         a1, a5, L(make_dest_align)
+
+L(dest_align):
+    andi        a4, a1, 7
+    bstrins.d   a1, zero, 2, 0
+
+    lu12i.w     t5, 0x1010
+    ld.d        t0, a1, 0
+    ori         t5, t5, 0x101
+    bstrins.d   t5, t5, 63, 32
+
+    slli.d      t6, t5, 0x7
+    bnez        a4, L(unalign)
+    sub.d       t1, t0, t5
+    andn        t2, t6, t0
+
+    and         t3, t1, t2
+    bnez        t3, L(al_end)
+
+L(al_loop):
+    st.d        t0, a2, 0
+    ld.d        t0, a1, 8
+
+    addi.d      a1, a1, 8
+    addi.d      a2, a2, 8
+    sub.d       t1, t0, t5
+    andn        t2, t6, t0
+
+    and         t3, t1, t2
+    beqz        t3, L(al_loop)
+
+L(al_end):
+    ctz.d       t1, t3
+    srli.d      t1, t1, 3
+    addi.d      t1, t1, 1
+
+    andi        a3, t1, 8
+    andi        a4, t1, 4
+    andi        a5, t1, 2
+    andi        a6, t1, 1
+
+L(al_end_8):
+    beqz        a3, L(al_end_4)
+    st.d        t0, a2, 0
+    jr          ra
+L(al_end_4):
+    beqz        a4, L(al_end_2)
+    st.w        t0, a2, 0
+    addi.d      a2, a2, 4
+    srli.d      t0, t0, 32
+L(al_end_2):
+    beqz        a5, L(al_end_1)
+    st.h        t0, a2, 0
+    addi.d      a2, a2, 2
+    srli.d      t0, t0, 16
+L(al_end_1):
+    beqz        a6, L(al_out)
+    st.b        t0, a2, 0
+L(al_out):
+    jr          ra
+
+L(unalign):
+    slli.d      a5, a4, 3
+    li.d        t1, -1
+    sub.d       a6, zero, a5
+
+    srl.d       a7, t0, a5
+    sll.d       t7, t1, a6
+
+    or          t0, a7, t7
+    sub.d       t1, t0, t5
+    andn        t2, t6, t0
+    and         t3, t1, t2
+
+    bnez        t3, L(un_end)
+
+    ld.d        t4, a1, 8
+
+    sub.d       t1, t4, t5
+    andn        t2, t6, t4
+    sll.d       t0, t4, a6
+    and         t3, t1, t2
+
+    or          t0, t0, a7
+    bnez        t3, L(un_end_with_remaining)
+
+L(un_loop):
+    srl.d       a7, t4, a5
+
+    ld.d        t4, a1, 16
+    addi.d      a1, a1, 8
+
+    st.d        t0, a2, 0
+    addi.d      a2, a2, 8
+
+    sub.d       t1, t4, t5
+    andn        t2, t6, t4
+    sll.d       t0, t4, a6
+    and         t3, t1, t2
+
+    or          t0, t0, a7
+    beqz        t3, L(un_loop)
+
+L(un_end_with_remaining):
+    ctz.d       t1, t3
+    srli.d      t1, t1, 3
+    addi.d      t1, t1, 1
+    sub.d       t1, t1, a4
+
+    blt         t1, zero, L(un_end_less_8)
+    st.d        t0, a2, 0
+    addi.d      a2, a2, 8
+    beqz        t1, L(un_out)
+    srl.d       t0, t4, a5
+    b           L(un_end_less_8)
+
+L(un_end):
+    ctz.d       t1, t3
+    srli.d      t1, t1, 3
+    addi.d      t1, t1, 1
+
+L(un_end_less_8):
+    andi        a4, t1, 4
+    andi        a5, t1, 2
+    andi        a6, t1, 1
+L(un_end_4):
+    beqz        a4, L(un_end_2)
+    st.w        t0, a2, 0
+    addi.d      a2, a2, 4
+    srli.d      t0, t0, 32
+L(un_end_2):
+    beqz        a5, L(un_end_1)
+    st.h        t0, a2, 0
+    addi.d      a2, a2, 2
+    srli.d      t0, t0, 16
+L(un_end_1):
+    beqz        a6, L(un_out)
+    st.b        t0, a2, 0
+L(un_out):
+    jr          ra
+END(STRCPY)
+
+libc_hidden_builtin_def (STRCPY)
diff --git a/sysdeps/loongarch/lp64/multiarch/strcpy-lasx.S b/sysdeps/loongarch/lp64/multiarch/strcpy-lasx.S
new file mode 100644
index 0000000000..d928db5b91
--- /dev/null
+++ b/sysdeps/loongarch/lp64/multiarch/strcpy-lasx.S
@@ -0,0 +1,208 @@ 
+/* Optimized strcpy implementation using LoongArch LASX instructions.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/regdef.h>
+#include <sys/asm.h>
+
+#if IS_IN (libc) && !defined __loongarch_soft_float
+
+#define STRCPY __strcpy_lasx
+
+LEAF(STRCPY, 6)
+    ori             t8, zero, 0xfe0
+    andi            t0, a1, 0xfff
+    li.d            t7, -1
+    move            a2, a0
+
+    bltu            t8, t0, L(page_cross_start)
+L(start_entry):
+    xvld            xr0, a1, 0
+    li.d            t0, 32
+    andi            t1, a2, 0x1f
+
+    xvsetanyeqz.b   fcc0, xr0
+    sub.d           t0, t0, t1
+    bcnez           fcc0, L(end)
+    add.d           a1, a1, t0
+
+    xvst            xr0, a2, 0
+    andi            a3, a1, 0x1f
+    add.d           a2, a2, t0
+    bnez            a3, L(unaligned)
+
+
+    xvld            xr0, a1, 0
+    xvsetanyeqz.b   fcc0, xr0
+    bcnez           fcc0, L(al_end)
+L(al_loop):
+    xvst            xr0, a2, 0
+
+    xvld            xr0, a1, 32
+    addi.d          a2, a2, 32
+    addi.d          a1, a1, 32
+    xvsetanyeqz.b   fcc0, xr0
+
+    bceqz           fcc0, L(al_loop)
+L(al_end):
+    xvmsknz.b       xr0, xr0
+    xvpickve.w      xr1, xr0, 4
+    vilvl.h         vr0, vr1, vr0
+
+    movfr2gr.s      t0, fa0
+    cto.w           t0, t0
+    add.d           a1, a1, t0
+    xvld            xr0, a1, -31
+
+
+    add.d           a2, a2, t0
+    xvst            xr0, a2, -31
+    jr              ra
+    nop
+
+L(page_cross_start):
+    move            a4, a1
+    bstrins.d       a4, zero, 4, 0
+    xvld            xr0, a4, 0
+    xvmsknz.b       xr0, xr0
+
+    xvpickve.w      xr1, xr0, 4
+    vilvl.h         vr0, vr1, vr0
+    movfr2gr.s      t0, fa0
+    sra.w           t0, t0, a1
+
+    beq             t0, t7, L(start_entry)
+    b               L(tail)
+L(unaligned):
+    andi            t0, a1, 0xfff
+    bltu            t8, t0, L(un_page_cross)
+
+
+L(un_start_entry):
+    xvld            xr0, a1, 0
+    xvsetanyeqz.b   fcc0, xr0
+    bcnez           fcc0, L(un_end)
+    addi.d          a1, a1, 32
+
+L(un_loop):
+    xvst            xr0, a2, 0
+    andi            t0, a1, 0xfff
+    addi.d          a2, a2, 32
+    bltu            t8, t0, L(page_cross_loop)
+
+L(un_loop_entry):
+    xvld            xr0, a1, 0
+    addi.d          a1, a1, 32
+    xvsetanyeqz.b   fcc0, xr0
+    bceqz           fcc0, L(un_loop)
+
+    addi.d          a1, a1, -32
+L(un_end):
+    xvmsknz.b       xr0, xr0
+    xvpickve.w      xr1, xr0, 4
+    vilvl.h         vr0, vr1, vr0
+
+
+    movfr2gr.s      t0, fa0
+L(un_tail):
+    cto.w           t0, t0
+    add.d           a1, a1, t0
+    xvld            xr0, a1, -31
+
+    add.d           a2, a2, t0
+    xvst            xr0, a2, -31
+    jr              ra
+L(un_page_cross):
+    sub.d           a4, a1, a3
+
+    xvld            xr0, a4, 0
+    xvmsknz.b       xr0, xr0
+    xvpickve.w      xr1, xr0, 4
+    vilvl.h         vr0, vr1, vr0
+
+    movfr2gr.s      t0, fa0
+    sra.w           t0, t0, a1
+    beq             t0, t7, L(un_start_entry)
+    b               L(un_tail)
+
+
+L(page_cross_loop):
+    sub.d           a4, a1, a3
+    xvld            xr0, a4, 0
+    xvmsknz.b       xr0, xr0
+    xvpickve.w      xr1, xr0, 4
+
+    vilvl.h         vr0, vr1, vr0
+    movfr2gr.s      t0, fa0
+    sra.w           t0, t0, a1
+    beq             t0, t7, L(un_loop_entry)
+
+    b               L(un_tail)
+L(end):
+    xvmsknz.b       xr0, xr0
+    xvpickve.w      xr1, xr0, 4
+    vilvl.h         vr0, vr1, vr0
+
+    movfr2gr.s      t0, fa0
+L(tail):
+    cto.w           t0, t0
+    add.d           a4, a2, t0
+    add.d           a5, a1, t0
+
+L(less_32):
+    srli.d          t1, t0, 4
+    beqz            t1, L(less_16)
+    vld             vr0, a1, 0
+    vld             vr1, a5, -15
+
+    vst             vr0, a2, 0
+    vst             vr1, a4, -15
+    jr              ra
+L(less_16):
+    srli.d          t1, t0, 3
+
+    beqz            t1, L(less_8)
+    ld.d            t2, a1, 0
+    ld.d            t3, a5, -7
+    st.d            t2, a2, 0
+
+    st.d            t3, a4, -7
+    jr              ra
+L(less_8):
+    li.d            t1, 3
+    bltu            t0, t1, L(less_4)
+
+    ld.w            t2, a1, 0
+    ld.w            t3, a5, -3
+    st.w            t2, a2, 0
+    st.w            t3, a4, -3
+
+    jr              ra
+L(less_4):
+    srli.d          t1, t0, 2
+    bgeu            t1, t0, L(zero_byte)
+    ld.h            t2, a1, 0
+
+    st.h            t2, a2, 0
+L(zero_byte):
+    st.b            zero, a4, 0
+    jr              ra
+END(STRCPY)
+
+libc_hidden_builtin_def (STRCPY)
+#endif
diff --git a/sysdeps/loongarch/lp64/multiarch/strcpy-lsx.S b/sysdeps/loongarch/lp64/multiarch/strcpy-lsx.S
new file mode 100644
index 0000000000..7a17af12a3
--- /dev/null
+++ b/sysdeps/loongarch/lp64/multiarch/strcpy-lsx.S
@@ -0,0 +1,197 @@ 
+/* Optimized strcpy implementation using LoongArch LSX instructions.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/regdef.h>
+#include <sys/asm.h>
+
+#if IS_IN (libc) && !defined __loongarch_soft_float
+
+# define STRCPY __strcpy_lsx
+
+LEAF(STRCPY, 6)
+    pcalau12i       t0, %pc_hi20(L(INDEX))
+    andi            a4, a1, 0xf
+    vld             vr1, t0, %pc_lo12(L(INDEX))
+    move            a2, a0
+
+    beqz            a4, L(load_start)
+    xor             t0, a1, a4
+    vld             vr0, t0, 0
+    vreplgr2vr.b    vr2, a4
+
+    vadd.b          vr2, vr2, vr1
+    vshuf.b         vr0, vr2, vr0, vr2
+    vsetanyeqz.b    fcc0, vr0
+    bcnez           fcc0, L(end)
+
+L(load_start):
+    vld             vr0, a1, 0
+    li.d            t1, 16
+    andi            a3, a2, 0xf
+    vsetanyeqz.b    fcc0, vr0
+
+
+    sub.d           t0, t1, a3
+    bcnez           fcc0, L(end)
+    add.d           a1, a1, t0
+    vst             vr0, a2, 0
+
+    andi            a3, a1, 0xf
+    add.d           a2, a2, t0
+    bnez            a3, L(unaligned)
+    vld             vr0, a1, 0
+
+    vsetanyeqz.b    fcc0, vr0
+    bcnez           fcc0, L(al_end)
+L(al_loop):
+    vst             vr0, a2, 0
+    vld             vr0, a1, 16
+
+    addi.d          a2, a2, 16
+    addi.d          a1, a1, 16
+    vsetanyeqz.b    fcc0, vr0
+    bceqz           fcc0, L(al_loop)
+
+
+L(al_end):
+    vmsknz.b        vr1, vr0
+    movfr2gr.s      t0, fa1
+    cto.w           t0, t0
+    add.d           a1, a1, t0
+
+    vld             vr0, a1, -15
+    add.d           a2, a2, t0
+    vst             vr0, a2, -15
+    jr              ra
+
+L(end):
+    vmsknz.b        vr1, vr0
+    movfr2gr.s      t0, fa1
+    cto.w           t0, t0
+    addi.d          t0, t0, 1
+
+L(end_16):
+    andi            t1, t0, 16
+    beqz            t1, L(end_8)
+    vst             vr0, a2, 0
+    jr              ra
+
+
+L(end_8):
+    andi            t2, t0, 8
+    andi            t3, t0, 4
+    andi            t4, t0, 2
+    andi            t5, t0, 1
+
+    beqz            t2, L(end_4)
+    vstelm.d        vr0, a2, 0, 0
+    addi.d          a2, a2, 8
+    vbsrl.v         vr0, vr0, 8
+
+L(end_4):
+    beqz            t3, L(end_2)
+    vstelm.w        vr0, a2, 0, 0
+    addi.d          a2, a2, 4
+    vbsrl.v         vr0, vr0, 4
+
+L(end_2):
+    beqz            t4, L(end_1)
+    vstelm.h        vr0, a2, 0, 0
+    addi.d          a2, a2, 2
+    vbsrl.v         vr0, vr0, 2
+
+
+L(end_1):
+    beqz            t5, L(out)
+    vstelm.b        vr0, a2, 0, 0
+L(out):
+    jr              ra
+    nop
+
+L(unaligned):
+    bstrins.d      a1, zero, 3, 0
+    vld            vr2, a1, 0
+    vreplgr2vr.b   vr3, a3
+    vslt.b         vr4, vr1, vr3
+
+    vor.v          vr0, vr2, vr4
+    vsetanyeqz.b   fcc0, vr0
+    bcnez          fcc0, L(un_first_end)
+    vld            vr0, a1, 16
+
+    vadd.b         vr3, vr3, vr1
+    vshuf.b        vr4, vr0, vr2, vr3
+    vsetanyeqz.b   fcc0, vr0
+    bcnez          fcc0, L(un_end)
+
+
+    vor.v          vr2, vr0, vr0
+    addi.d         a1, a1, 16
+L(un_loop):
+    vld            vr0, a1, 16
+    vst            vr4, a2, 0
+
+    addi.d         a2, a2, 16
+    vshuf.b        vr4, vr0, vr2, vr3
+    vsetanyeqz.b   fcc0, vr0
+    bcnez          fcc0, L(un_end)
+
+    vld            vr2, a1, 32
+    vst            vr4, a2, 0
+    addi.d         a1, a1, 32
+    addi.d         a2, a2, 16
+
+    vshuf.b        vr4, vr2, vr0, vr3
+    vsetanyeqz.b   fcc0, vr2
+    bceqz          fcc0, L(un_loop)
+    vor.v          vr0, vr2, vr2
+
+
+    addi.d         a1, a1, -16
+L(un_end):
+    vsetanyeqz.b    fcc0, vr4
+    bcnez           fcc0, 1f
+    vst             vr4, a2, 0
+
+1:
+    vmsknz.b        vr1, vr0
+    movfr2gr.s      t0, fa1
+    cto.w           t0, t0
+    add.d           a1, a1, t0
+
+    vld             vr0, a1, 1
+    add.d           a2, a2, t0
+    sub.d           a2, a2, a3
+    vst             vr0, a2, 1
+
+    jr              ra
+L(un_first_end):
+    addi.d          a2, a2, -16
+    addi.d          a1, a1, -16
+    b               1b
+END(STRCPY)
+
+    .section        .rodata.cst16,"M",@progbits,16
+    .align          4
+L(INDEX):
+    .dword          0x0706050403020100
+    .dword          0x0f0e0d0c0b0a0908
+
+libc_hidden_builtin_def (STRCPY)
+#endif
diff --git a/sysdeps/loongarch/lp64/multiarch/strcpy-unaligned.S b/sysdeps/loongarch/lp64/multiarch/strcpy-unaligned.S
new file mode 100644
index 0000000000..12e79f2ac0
--- /dev/null
+++ b/sysdeps/loongarch/lp64/multiarch/strcpy-unaligned.S
@@ -0,0 +1,131 @@ 
+/* Optimized strcpy unaligned implementation using basic LoongArch instructions.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/regdef.h>
+#include <sys/asm.h>
+
+#if IS_IN (libc)
+
+# define STRCPY __strcpy_unaligned
+
+LEAF(STRCPY, 4)
+    move        t8, a0
+    lu12i.w     t5, 0x01010
+    lu12i.w     t6, 0x7f7f7
+    ori         t5, t5, 0x101
+
+    ori         t6, t6, 0xf7f
+    bstrins.d   t5, t5, 63, 32
+    bstrins.d   t6, t6, 63, 32
+    andi        a3, a1, 0x7
+
+    beqz        a3, L(strcpy_loop_aligned_1)
+    b           L(strcpy_mutual_align)
+L(strcpy_loop_aligned):
+    st.d        t0, a0, 0
+    addi.d      a0, a0, 8
+
+L(strcpy_loop_aligned_1):
+    ld.d        t0, a1, 0
+    addi.d      a1, a1, 8
+L(strcpy_start_realigned):
+    sub.d       a4, t0, t5
+    or          a5, t0, t6
+
+    andn        t2, a4, a5
+    beqz        t2, L(strcpy_loop_aligned)
+L(strcpy_end):
+    ctz.d       t7, t2
+    srli.d      t7, t7, 3
+    addi.d      t7, t7, 1
+
+L(strcpy_end_8):
+    andi        a4, t7, 0x8
+    beqz        a4, L(strcpy_end_4)
+    st.d        t0, a0, 0
+    move        a0, t8
+    jr          ra
+
+L(strcpy_end_4):
+    andi        a4, t7, 0x4
+    beqz        a4, L(strcpy_end_2)
+    st.w        t0, a0, 0
+    srli.d      t0, t0, 32
+    addi.d      a0, a0, 4
+
+L(strcpy_end_2):
+    andi        a4, t7, 0x2
+    beqz        a4, L(strcpy_end_1)
+    st.h        t0, a0, 0
+    srli.d      t0, t0, 16
+    addi.d      a0, a0, 2
+
+L(strcpy_end_1):
+    andi        a4, t7, 0x1
+    beqz        a4, L(strcpy_end_ret)
+    st.b        t0, a0, 0
+
+L(strcpy_end_ret):
+    move        a0, t8
+    jr          ra
+
+
+L(strcpy_mutual_align):
+    li.w        a5, 0xff8
+    andi        a4, a1, 0xff8
+    beq         a4, a5, L(strcpy_page_cross)
+
+L(strcpy_page_cross_ok):
+    ld.d        t0, a1, 0
+    sub.d       a4, t0, t5
+    or          a5, t0, t6
+    andn        t2, a4, a5
+    bnez        t2, L(strcpy_end)
+
+L(strcpy_mutual_align_finish):
+    li.w        a4, 8
+    st.d        t0, a0, 0
+    sub.d       a4, a4, a3
+    add.d       a1,  a1,  a4
+    add.d       a0, a0, a4
+
+    b           L(strcpy_loop_aligned_1)
+
+L(strcpy_page_cross):
+    li.w        a4, 0x7
+    andn        a6, a1,  a4
+    ld.d        t0, a6, 0
+    li.w        a7, -1
+
+    slli.d      a5, a3, 3
+    srl.d       a7, a7, a5
+    srl.d       t0, t0, a5
+    nor         a7, a7, zero
+
+    or          t0, t0, a7
+    sub.d       a4, t0, t5
+    or          a5, t0, t6
+    andn        t2, a4, a5
+    beqz        t2, L(strcpy_page_cross_ok)
+
+    b           L(strcpy_end)
+END(STRCPY)
+
+libc_hidden_builtin_def (STRCPY)
+#endif
diff --git a/sysdeps/loongarch/lp64/multiarch/strcpy.c b/sysdeps/loongarch/lp64/multiarch/strcpy.c
new file mode 100644
index 0000000000..46afd068f9
--- /dev/null
+++ b/sysdeps/loongarch/lp64/multiarch/strcpy.c
@@ -0,0 +1,35 @@ 
+/* Multiple versions of strcpy.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+#if IS_IN (libc)
+# define strcpy __redirect_strcpy
+# include <string.h>
+# undef strcpy
+
+# define SYMBOL_NAME strcpy
+# include "ifunc-lasx.h"
+
+libc_ifunc_redirected (__redirect_strcpy, strcpy, IFUNC_SELECTOR ());
+
+# ifdef SHARED
+__hidden_ver1 (strcpy, __GI_strcpy, __redirect_strcpy)
+  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (strcpy);
+# endif
+#endif