From patchwork Mon Jan 10 21:35:34 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 49812
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v3 1/7] x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Date: Mon, 10 Jan 2022 15:35:34 -0600
Message-Id: <20220110213540.1258344-1-goldstein.w.n@gmail.com>
In-Reply-To: <20220109122946.2754917-1-goldstein.w.n@gmail.com>
References: <20220109122946.2754917-1-goldstein.w.n@gmail.com>

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein
Reviewed-by: H.J. Lu
---
 sysdeps/x86_64/multiarch/strcmp-avx2.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index a45f9d2749..9c73b5899d 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -87,6 +87,16 @@ ENTRY (STRCMP)
 	je	L(char0)
 	jb	L(zero)
 # ifdef USE_AS_WCSCMP
+# ifndef __ILP32__
+	movq	%rdx, %rcx
+	/* Check if length could overflow when multiplied by
+	   sizeof(wchar_t). Checking top 8 bits will cover all potential
+	   overflow cases as well as redirect cases where its impossible to
+	   length to bound a valid memory region. In these cases just use
+	   'wcscmp'. */
+	shrq	$56, %rcx
+	jnz	__wcscmp_avx2
+# endif
 	/* Convert units: from wide to byte char.  */
 	shl	$2, %RDX_LP
 # endif
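For reference, here is a minimal C sketch (not part of the patch) of the
guard the added assembly implements. The function name wcsncmp_with_guard
is hypothetical; the point is only that if any of the top eight bits of
the length are set, scaling it by sizeof (wchar_t) could overflow, and no
such length can bound a real object on x86_64, so the bound may simply be
dropped and plain wcscmp used:

#include <wchar.h>

/* Mirrors the 'shrq $56, %rcx; jnz __wcscmp_avx2' check above.  */
int
wcsncmp_with_guard (const wchar_t *s1, const wchar_t *s2, size_t n)
{
  if (n >> 56)                  /* top 8 bits set: n cannot bound memory */
    return wcscmp (s1, s2);     /* so the length can safely be ignored */
  return wcsncmp (s1, s2, n);   /* normal bounded comparison */
}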
From patchwork Mon Jan 10 21:35:35 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 49813
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v3 2/7] x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
Date: Mon, 10 Jan 2022 15:35:35 -0600
Message-Id: <20220110213540.1258344-2-goldstein.w.n@gmail.com>
In-Reply-To: <20220110213540.1258344-1-goldstein.w.n@gmail.com>
References: <20220109122946.2754917-1-goldstein.w.n@gmail.com> <20220110213540.1258344-1-goldstein.w.n@gmail.com>

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_evex. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein
Reviewed-by: H.J. Lu
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 1d971f3889..0cd939d5af 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -104,6 +104,16 @@ ENTRY (STRCMP)
 	je	L(char0)
 	jb	L(zero)
 # ifdef USE_AS_WCSCMP
+# ifndef __ILP32__
+	movq	%rdx, %rcx
+	/* Check if length could overflow when multiplied by
+	   sizeof(wchar_t). Checking top 8 bits will cover all potential
+	   overflow cases as well as redirect cases where its impossible to
+	   length to bound a valid memory region. In these cases just use
+	   'wcscmp'. */
+	shrq	$56, %rcx
+	jnz	__wcscmp_evex
+# endif
 	/* Convert units: from wide to byte char.  */
 	shl	$2, %RDX_LP
 # endif
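As a usage-level sanity check of the fixed behaviour (a sketch, not part
of the patch or of the glibc test suite): once lengths >= 2^56 are
redirected, a bound far larger than any possible object must give the
same answer as wcscmp, whereas an implementation that scales such a
length to bytes can wrap and wrongly report equality:

#include <assert.h>
#include <stddef.h>
#include <wchar.h>

int
main (void)
{
  size_t huge = (size_t) 1 << 62;       /* huge * sizeof (wchar_t) wraps */
  assert (wcsncmp (L"abc", L"abd", huge) < 0);   /* same sign as wcscmp */
  assert (wcsncmp (L"abc", L"abc", huge) == 0);
  return 0;
}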
From patchwork Mon Jan 10 21:35:36 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 49814
To: libc-alpha@sourceware.org
Subject: [PATCH v3 3/7] string/test-str*cmp: remove stupid_[strcmp, strncmp, wcscmp, wcsncmp].
Date: Mon, 10 Jan 2022 15:35:36 -0600 Message-Id: <20220110213540.1258344-3-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220110213540.1258344-1-goldstein.w.n@gmail.com> References: <20220109122946.2754917-1-goldstein.w.n@gmail.com> <20220110213540.1258344-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" These implementations just add to test duration. Since we have simple_* implementations we already have a safe reference implementation. Signed-off-by: Noah Goldstein --- string/test-strcmp.c | 35 ----------------------------------- string/test-strncmp.c | 34 ---------------------------------- 2 files changed, 69 deletions(-) diff --git a/string/test-strcmp.c b/string/test-strcmp.c index 3c75076fb8..97d7bf5043 100644 --- a/string/test-strcmp.c +++ b/string/test-strcmp.c @@ -34,7 +34,6 @@ # define STRLEN wcslen # define MEMCPY wmemcpy # define SIMPLE_STRCMP simple_wcscmp -# define STUPID_STRCMP stupid_wcscmp # define CHAR wchar_t # define UCHAR wchar_t # define CHARBYTES 4 @@ -64,25 +63,6 @@ simple_wcscmp (const wchar_t *s1, const wchar_t *s2) return c1 < c2 ? -1 : 1; } -int -stupid_wcscmp (const wchar_t *s1, const wchar_t *s2) -{ - size_t ns1 = wcslen (s1) + 1; - size_t ns2 = wcslen (s2) + 1; - size_t n = ns1 < ns2 ? ns1 : ns2; - int ret = 0; - - wchar_t c1, c2; - - while (n--) { - c1 = *s1++; - c2 = *s2++; - if ((ret = c1 < c2 ? -1 : c1 == c2 ? 0 : 1) != 0) - break; - } - return ret; -} - #else # include @@ -92,7 +72,6 @@ stupid_wcscmp (const wchar_t *s1, const wchar_t *s2) # define STRLEN strlen # define MEMCPY memcpy # define SIMPLE_STRCMP simple_strcmp -# define STUPID_STRCMP stupid_strcmp # define CHAR char # define UCHAR unsigned char # define CHARBYTES 1 @@ -113,24 +92,10 @@ simple_strcmp (const char *s1, const char *s2) return ret; } -int -stupid_strcmp (const char *s1, const char *s2) -{ - size_t ns1 = strlen (s1) + 1; - size_t ns2 = strlen (s2) + 1; - size_t n = ns1 < ns2 ? ns1 : ns2; - int ret = 0; - - while (n--) - if ((ret = *(unsigned char *) s1++ - *(unsigned char *) s2++) != 0) - break; - return ret; -} #endif typedef int (*proto_t) (const CHAR *, const CHAR *); -IMPL (STUPID_STRCMP, 1) IMPL (SIMPLE_STRCMP, 1) IMPL (STRCMP, 1) diff --git a/string/test-strncmp.c b/string/test-strncmp.c index e7d5edea39..61a283a0af 100644 --- a/string/test-strncmp.c +++ b/string/test-strncmp.c @@ -33,7 +33,6 @@ # define STRDUP wcsdup # define MEMCPY wmemcpy # define SIMPLE_STRNCMP simple_wcsncmp -# define STUPID_STRNCMP stupid_wcsncmp # define CHAR wchar_t # define UCHAR wchar_t # define CHARBYTES 4 @@ -57,25 +56,6 @@ simple_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n) return 0; } -int -stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n) -{ - wchar_t c1, c2; - size_t ns1 = wcsnlen (s1, n) + 1, ns2 = wcsnlen (s2, n) + 1; - - n = ns1 < n ? ns1 : n; - n = ns2 < n ? 
ns2 : n; - - while (n--) - { - c1 = *s1++; - c2 = *s2++; - if (c1 != c2) - return c1 > c2 ? 1 : -1; - } - return 0; -} - #else # define L(str) str # define STRNCMP strncmp @@ -83,7 +63,6 @@ stupid_wcsncmp (const CHAR *s1, const CHAR *s2, size_t n) # define STRDUP strdup # define MEMCPY memcpy # define SIMPLE_STRNCMP simple_strncmp -# define STUPID_STRNCMP stupid_strncmp # define CHAR char # define UCHAR unsigned char # define CHARBYTES 1 @@ -101,23 +80,10 @@ simple_strncmp (const char *s1, const char *s2, size_t n) return ret; } -int -stupid_strncmp (const char *s1, const char *s2, size_t n) -{ - size_t ns1 = strnlen (s1, n) + 1, ns2 = strnlen (s2, n) + 1; - int ret = 0; - - n = ns1 < n ? ns1 : n; - n = ns2 < n ? ns2 : n; - while (n-- && (ret = *(unsigned char *) s1++ - * (unsigned char *) s2++) == 0); - return ret; -} - #endif typedef int (*proto_t) (const CHAR *, const CHAR *, size_t); -IMPL (STUPID_STRNCMP, 0) IMPL (SIMPLE_STRNCMP, 0) IMPL (STRNCMP, 1) From patchwork Mon Jan 10 21:35:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 49815 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id C45FA3835C1E for ; Mon, 10 Jan 2022 21:38:15 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org C45FA3835C1E DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1641850695; bh=eyiK3snG1Frg3QnGaDFXmoH/gnzPo/NhvC6GKIjq55s=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=Fd9UWFJQcFHB6JvzaUWKG1t/n+OG1dpjzhZyHLUp+/JXI7NOlO6mySZLkGPCCc+hr bgFXW9pbNwYVuQb3d+57s+Ty3LP5fi5fsxYcQUpp2fQWaChMLWsKZZYMIi50BZ/fzU KdHI701kSq9qNLYlbpnYIUUou9jVocNTcL7zNN48= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-pl1-x634.google.com (mail-pl1-x634.google.com [IPv6:2607:f8b0:4864:20::634]) by sourceware.org (Postfix) with ESMTPS id 5CA993858C3A for ; Mon, 10 Jan 2022 21:35:51 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 5CA993858C3A Received: by mail-pl1-x634.google.com with SMTP id x15so14042608plg.1 for ; Mon, 10 Jan 2022 13:35:51 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=eyiK3snG1Frg3QnGaDFXmoH/gnzPo/NhvC6GKIjq55s=; b=KwuE1Vlhg0KTj6ro+nvQxrJ3tCoxKb1ETN3J0M6JHFArhbY3ZRZKM/+8T4JYAn1viS C9cxw1I96xT1LcslmqeX89NQbnUFSz2aYcgyShsHjRYpHBrEMvug/IX+fhG31n5rGCD7 Sm7Bai6uB15g7/ITgIKoK8YKD/LW0Q8NsHLwFpIMXh9TWrihBeYMrf41v8VVK2At/rej DfEAaTOzYTJNTJddvbpexFsOU19gjv47XeFXTg8JW0Qm4M8eAfjoCnixLvakrj2RqM3G UONuWIx35Rbr5yGM1vajCXyiUZ/us/23J7B66KelidQHTY4bKdj/F5dpciSdjJyPUkJU 28vA== X-Gm-Message-State: AOAM533y5c33h4E90L28rONdI4ZCKhrtFDpMYo7rxrl5/vIKRX3fmAwX Ut3o9Lk0vnF1IkW/vABW1RVVKbunAuA= X-Google-Smtp-Source: ABdhPJzHJxxwoWLohVtiOq1YL0gI9MZPlx9uSaMAvj2frMDMIFCei9/DpOUadNxkHG9UVyOPy6Sc8w== X-Received: by 2002:a17:90a:c401:: with SMTP id i1mr1773241pjt.180.1641850550198; Mon, 10 Jan 2022 13:35:50 -0800 (PST) Received: from noah-tigerlake.webpass.net (136-24-166-223.cab.webpass.net. 
[136.24.166.223]) by smtp.googlemail.com with ESMTPSA id f12sm7996515pfe.127.2022.01.10.13.35.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jan 2022 13:35:49 -0800 (PST) To: libc-alpha@sourceware.org Subject: [PATCH v3 4/7] string: Improve coverage in test-strcmp.c and test-strncmp.c Date: Mon, 10 Jan 2022 15:35:37 -0600 Message-Id: <20220110213540.1258344-4-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220110213540.1258344-1-goldstein.w.n@gmail.com> References: <20220109122946.2754917-1-goldstein.w.n@gmail.com> <20220110213540.1258344-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" Add additional test cases for small / medium sizes. Add tests in test-strncmp.c where `n` is near ULONG_MAX or LONG_MIN to test for overflow bugs in length handling. Signed-off-by: Noah Goldstein --- string/test-strcmp.c | 70 ++++++++++-- string/test-strncmp.c | 257 +++++++++++++++++++++++++++++++++++++++--- 2 files changed, 306 insertions(+), 21 deletions(-) diff --git a/string/test-strcmp.c b/string/test-strcmp.c index 97d7bf5043..eacbdc8857 100644 --- a/string/test-strcmp.c +++ b/string/test-strcmp.c @@ -16,6 +16,9 @@ License along with the GNU C Library; if not, see . */ +#define TEST_LEN (4096 * 3) +#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ()) + #define TEST_MAIN #ifdef WIDE # define TEST_NAME "wcscmp" @@ -129,7 +132,7 @@ do_one_test (impl_t *impl, static void do_test (size_t align1, size_t align2, size_t len, int max_char, - int exp_result) + int exp_result) { size_t i; @@ -138,19 +141,22 @@ do_test (size_t align1, size_t align2, size_t len, int max_char, if (len == 0) return; - align1 &= 63; + align1 &= ~(CHARBYTES - 1); + align2 &= ~(CHARBYTES - 1); + + align1 &= getpagesize () - 1; if (align1 + (len + 1) * CHARBYTES >= page_size) return; - align2 &= 63; + align2 &= getpagesize () - 1; if (align2 + (len + 1) * CHARBYTES >= page_size) return; /* Put them close to the end of page. 
*/ i = align1 + CHARBYTES * (len + 2); - s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1); + s1 = (CHAR *)(buf1 + ((page_size - i) / 16 * 16) + align1); i = align2 + CHARBYTES * (len + 2); - s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16) + align2); + s2 = (CHAR *)(buf2 + ((page_size - i) / 16 * 16) + align2); for (i = 0; i < len; i++) s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char; @@ -161,9 +167,10 @@ do_test (size_t align1, size_t align2, size_t len, int max_char, s2[len - 1] -= exp_result; FOR_EACH_IMPL (impl, 0) - do_one_test (impl, s1, s2, exp_result); + do_one_test (impl, s1, s2, exp_result); } + static void do_random_tests (void) { @@ -385,7 +392,7 @@ check3 (void) int test_main (void) { - size_t i; + size_t i, j; test_init (); check(); @@ -426,6 +433,55 @@ test_main (void) do_test (2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1); } + for (j = 0; j < 160; ++j) + { + for (i = 0; i < TEST_LEN;) + { + do_test (getpagesize () - j - 1, 0, i, 127, 0); + do_test (getpagesize () - j - 1, 0, i, 127, 1); + do_test (getpagesize () - j - 1, 0, i, 127, -1); + + do_test (getpagesize () - j - 1, j, i, 127, 0); + do_test (getpagesize () - j - 1, j, i, 127, 1); + do_test (getpagesize () - j - 1, j, i, 127, -1); + + do_test (0, getpagesize () - j - 1, i, 127, 0); + do_test (0, getpagesize () - j - 1, i, 127, 1); + do_test (0, getpagesize () - j - 1, i, 127, -1); + + do_test (j, getpagesize () - j - 1, i, 127, 0); + do_test (j, getpagesize () - j - 1, i, 127, 1); + do_test (j, getpagesize () - j - 1, i, 127, -1); + + if (i < 32) + { + i += 1; + } + else if (i < 161) + { + i += 7; + } + else if (i + 161 < TEST_LEN) + { + i += 31; + i *= 17; + i /= 16; + if (i + 161 > TEST_LEN) + { + i = TEST_LEN - 160; + } + } + else if (i + 32 < TEST_LEN) + { + i += 7; + } + else + { + i += 1; + } + } + } + do_random_tests (); return ret; } diff --git a/string/test-strncmp.c b/string/test-strncmp.c index 61a283a0af..1a3cee1792 100644 --- a/string/test-strncmp.c +++ b/string/test-strncmp.c @@ -16,6 +16,9 @@ License along with the GNU C Library; if not, see . */ +#define TEST_LEN (4096 * 3) +#define MIN_PAGE_SIZE (TEST_LEN + 2 * getpagesize ()) + #define TEST_MAIN #ifdef WIDE # define TEST_NAME "wcsncmp" @@ -166,11 +169,11 @@ do_test_limit (size_t align1, size_t align2, size_t len, size_t n, int max_char, } static void -do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char, - int exp_result) +do_test_n (size_t align1, size_t align2, size_t len, size_t n, int n_in_bounds, + int max_char, int exp_result) { - size_t i; - CHAR *s1, *s2; + size_t i, buf_bound; + CHAR *s1, *s2, *s1_end, *s2_end; align1 &= ~(CHARBYTES - 1); align2 &= ~(CHARBYTES - 1); @@ -178,22 +181,28 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char, if (n == 0) return; - align1 &= 63; - if (align1 + (n + 1) * CHARBYTES >= page_size) + buf_bound = n_in_bounds ? 
n : len; + + align1 &= getpagesize () - 1; + if (align1 + (buf_bound + 2) * CHARBYTES >= page_size) return; - align2 &= 63; - if (align2 + (n + 1) * CHARBYTES >= page_size) + align2 &= getpagesize () - 1; + if (align2 + (buf_bound + 2) * CHARBYTES >= page_size) return; - s1 = (CHAR *) (buf1 + align1); - s2 = (CHAR *) (buf2 + align2); + s1 = (CHAR *)(buf1 + align1); + s2 = (CHAR *)(buf2 + align2); - for (i = 0; i < n; i++) + if (n_in_bounds) + { + s1[n] = 24 + exp_result; + s2[n] = 23; + } + + for (i = 0; i < buf_bound; i++) s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char; - s1[n] = 24 + exp_result; - s2[n] = 23; s1[len] = 0; s2[len] = 0; if (exp_result < 0) @@ -203,10 +212,24 @@ do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char, if (len >= n) s2[n - 1] -= exp_result; + /* Ensure that both s1 and s2 are valid null terminated strings. This is + * required by the standard. */ + s1_end = (CHAR *)(buf1 + MIN_PAGE_SIZE - CHARBYTES); + s2_end = (CHAR *)(buf2 + MIN_PAGE_SIZE - CHARBYTES); + *s1_end = 0; + *s2_end = 0; + FOR_EACH_IMPL (impl, 0) do_one_test (impl, s1, s2, n, exp_result); } +static void +do_test (size_t align1, size_t align2, size_t len, size_t n, int max_char, + int exp_result) +{ + do_test_n (align1, align2, len, n, 1, max_char, exp_result); +} + static void do_page_test (size_t offset1, size_t offset2, CHAR *s2) { @@ -400,10 +423,123 @@ check3 (void) } } +static void +check_overflow (void) +{ + size_t i, j, of_mask, of_idx; + const size_t of_masks[] + = { ULONG_MAX, LONG_MIN, ULONG_MAX - (ULONG_MAX >> 2), + ((size_t)LONG_MAX) >> 1 }; + + for (of_idx = 0; of_idx < sizeof (of_masks) / sizeof (of_masks[0]); ++of_idx) + { + of_mask = of_masks[of_idx]; + for (j = 0; j < 160; ++j) + { + for (i = 1; i <= 161; i += (32 / sizeof (CHAR))) + { + do_test_n (j, 0, i, of_mask, 0, 127, 0); + do_test_n (j, 0, i, of_mask, 0, 127, 1); + do_test_n (j, 0, i, of_mask, 0, 127, -1); + + do_test_n (j, 0, i, of_mask - j / 2, 0, 127, 0); + do_test_n (j, 0, i, of_mask - j * 2, 0, 127, 1); + do_test_n (j, 0, i, of_mask - j, 0, 127, -1); + + do_test_n (j / 2, j, i, of_mask, 0, 127, 0); + do_test_n (j / 2, j, i, of_mask, 0, 127, 1); + do_test_n (j / 2, j, i, of_mask, 0, 127, -1); + + do_test_n (j / 2, j, i, of_mask - j, 0, 127, 0); + do_test_n (j / 2, j, i, of_mask - j / 2, 0, 127, 1); + do_test_n (j / 2, j, i, of_mask - j * 2, 0, 127, -1); + + do_test_n (0, j, i, of_mask - j * 2, 0, 127, 0); + do_test_n (0, j, i, of_mask - j, 0, 127, 1); + do_test_n (0, j, i, of_mask - j / 2, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 0); + do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, 1); + do_test_n (getpagesize () - j - 1, 0, i, of_mask, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, 0, i, of_mask - j / 2, 0, 127, + 0); + do_test_n (getpagesize () - j - 1, 0, i, of_mask - j * 2, 0, 127, + 1); + do_test_n (getpagesize () - j - 1, 0, i, of_mask - j, 0, 127, + -1); + + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i, + of_mask, 0, 127, 0); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i, + of_mask, 0, 127, 1); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i, + of_mask, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i, + of_mask - j, 0, 127, 0); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i, + of_mask - j / 2, 0, 127, 1); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, i, + of_mask - j * 2, 0, 127, -1); + } + + for 
(i = 1; i < TEST_LEN; i += i) + { + do_test_n (j, 0, i - 1, of_mask, 0, 127, 0); + do_test_n (j, 0, i - 1, of_mask, 0, 127, 1); + do_test_n (j, 0, i - 1, of_mask, 0, 127, -1); + + do_test_n (j, 0, i - 1, of_mask - j / 2, 0, 127, 0); + do_test_n (j, 0, i - 1, of_mask - j * 2, 0, 127, 1); + do_test_n (j, 0, i - 1, of_mask - j, 0, 127, -1); + + do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 0); + do_test_n (j / 2, j, i - 1, of_mask, 0, 127, 1); + do_test_n (j / 2, j, i - 1, of_mask, 0, 127, -1); + + do_test_n (j / 2, j, i - 1, of_mask - j, 0, 127, 0); + do_test_n (j / 2, j, i - 1, of_mask - j / 2, 0, 127, 1); + do_test_n (j / 2, j, i - 1, of_mask - j * 2, 0, 127, -1); + + do_test_n (0, j, i - 1, of_mask - j * 2, 0, 127, 0); + do_test_n (0, j, i - 1, of_mask - j, 0, 127, 1); + do_test_n (0, j, i - 1, of_mask - j / 2, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 0); + do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, 1); + do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask, 0, 127, + -1); + + do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j / 2, 0, + 127, 0); + do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j * 2, 0, + 127, 1); + do_test_n (getpagesize () - j - 1, 0, i - 1, of_mask - j, 0, 127, + -1); + + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, + i - 1, of_mask, 0, 127, 0); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, + i - 1, of_mask, 0, 127, 1); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, + i - 1, of_mask, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, + i - 1, of_mask - j, 0, 127, 0); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, + i - 1, of_mask - j / 2, 0, 127, 1); + do_test_n (getpagesize () - j - 1, getpagesize () - 2 * j - 1, + i - 1, of_mask - j * 2, 0, 127, -1); + } + } + } +} + int test_main (void) { - size_t i; + size_t i, j; test_init (); @@ -470,6 +606,99 @@ test_main (void) do_test_limit (0, 0, 15 - i, 16 - i, 255, -1); } + for (j = 0; j < 160; ++j) + { + for (i = 0; i < TEST_LEN;) + { + do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 0); + do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, 1); + do_test_n (getpagesize () - j - 1, 0, i, i + 1, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, 0, i, i, 0, 127, 0); + do_test_n (getpagesize () - j - 1, 0, i, i - 1, 0, 127, 0); + + do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 0); + do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, 1); + do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 0); + do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, 1); + do_test_n (getpagesize () - j - 1, 0, i, ULONG_MAX - i, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 0); + do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, 1); + do_test_n (getpagesize () - j - 1, j, i, i + 1, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, j, i, i, 0, 127, 0); + do_test_n (getpagesize () - j - 1, j, i, i - 1, 0, 127, 0); + + do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 0); + do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, 1); + do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX, 0, 127, -1); + + do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 0); + do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, 1); + do_test_n (getpagesize () - j - 1, j, i, ULONG_MAX - i, 0, 127, -1); 
+
+ do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+ do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+ do_test_n (0, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+ do_test_n (0, getpagesize () - j - 1, i, i, 0, 127, 0);
+ do_test_n (0, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+ do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+ do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+ do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+ do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
+ do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
+ do_test_n (0, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
+
+ do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 0);
+ do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, 1);
+ do_test_n (j, getpagesize () - j - 1, i, i + 1, 0, 127, -1);
+
+ do_test_n (j, getpagesize () - j - 1, i, i, 0, 127, 0);
+ do_test_n (j, getpagesize () - j - 1, i, i - 1, 0, 127, 0);
+
+ do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 0);
+ do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, 1);
+ do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX, 0, 127, -1);
+
+ do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 0);
+ do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, 1);
+ do_test_n (j, getpagesize () - j - 1, i, ULONG_MAX - i, 0, 127, -1);
+ if (i < 32)
+ {
+ i += 1;
+ }
+ else if (i < 161)
+ {
+ i += 7;
+ }
+ else if (i + 161 < TEST_LEN)
+ {
+ i += 31;
+ i *= 17;
+ i /= 16;
+ if (i + 161 > TEST_LEN)
+ {
+ i = TEST_LEN - 160;
+ }
+ }
+ else if (i + 32 < TEST_LEN)
+ {
+ i += 7;
+ }
+ else
+ {
+ i += 1;
+ }
+ }
+ }
+
+ check_overflow ();
 do_random_tests ();
 return ret;
 }
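To make the intent of the new ULONG_MAX / LONG_MIN cases concrete, here
is a deliberately buggy reference (a sketch, not glibc code) showing the
class of length-handling overflow these tests are aimed at: converting
the element count to a byte count up front wraps for huge n, silently
shrinking the bound to zero and making different strings compare equal:

#include <stdio.h>
#include <stddef.h>
#include <wchar.h>

static int
buggy_wcsncmp (const wchar_t *s1, const wchar_t *s2, size_t n)
{
  size_t nbytes = n * sizeof (wchar_t); /* wraps when n >= 2^62 on LP64 */
  for (size_t i = 0; i * sizeof (wchar_t) < nbytes; i++)
    {
      if (s1[i] != s2[i])
        return s1[i] < s2[i] ? -1 : 1;
      if (s1[i] == L'\0')
        return 0;
    }
  return 0;                             /* wrong once nbytes wrapped */
}

int
main (void)
{
  size_t n = (size_t) 1 << 62;
  /* The buggy version prints 0 ("equal"); a correct wcsncmp is negative. */
  printf ("buggy: %d, libc: %d\n",
          buggy_wcsncmp (L"abc", L"abd", n),
          wcsncmp (L"abc", L"abd", n));
  return 0;
}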
From patchwork Mon Jan 10 21:35:38 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 49817
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v3 5/7] x86: Optimize strcmp-avx2.S
Date: Mon, 10 Jan 2022 15:35:38 -0600
Message-Id: <20220110213540.1258344-5-goldstein.w.n@gmail.com>
In-Reply-To: <20220110213540.1258344-1-goldstein.w.n@gmail.com>
References: <20220109122946.2754917-1-goldstein.w.n@gmail.com> <20220110213540.1258344-1-goldstein.w.n@gmail.com>

Optimizations are primarily to the loop logic and to how the page-cross
logic interacts with the loop. The page-cross logic is at times more
expensive for short strings near the end of a page that do not actually
cross it. This is done to retest the page-cross conditions with a
non-faulting check and to improve the logic for entering the loop
afterwards. This affects only particular cases, however, and is
generally made up for by more than 10x improvements on the transition
from the page-cross case to the loop.

The non-page-cross cases improve most for smaller sizes [0, 128] and are
about even for (128, 4096]. The loop's page-cross logic is also
improved, so a more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein
---
 sysdeps/x86_64/multiarch/strcmp-avx2.S | 1590 ++++++++++++++----------
 1 file changed, 939 insertions(+), 651 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
index 9c73b5899d..28d6a0025a 100644
--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
@@ -26,35 +26,57 @@
 # define PAGE_SIZE 4096
-/* VEC_SIZE = Number of bytes in a ymm register */
+	/* VEC_SIZE = Number of bytes in a ymm register. */
 # define VEC_SIZE 32
-/* Shift for dividing by (VEC_SIZE * 4).
*/ -# define DIVIDE_BY_VEC_4_SHIFT 7 -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT) -# error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT) -# endif +# define VMOVU vmovdqu +# define VMOVA vmovdqa # ifdef USE_AS_WCSCMP -/* Compare packed dwords. */ + /* Compare packed dwords. */ # define VPCMPEQ vpcmpeqd -/* Compare packed dwords and store minimum. */ + /* Compare packed dwords and store minimum. */ # define VPMINU vpminud -/* 1 dword char == 4 bytes. */ + /* 1 dword char == 4 bytes. */ # define SIZE_OF_CHAR 4 # else -/* Compare packed bytes. */ + /* Compare packed bytes. */ # define VPCMPEQ vpcmpeqb -/* Compare packed bytes and store minimum. */ + /* Compare packed bytes and store minimum. */ # define VPMINU vpminub -/* 1 byte char == 1 byte. */ + /* 1 byte char == 1 byte. */ # define SIZE_OF_CHAR 1 # endif +# ifdef USE_AS_STRNCMP +# define LOOP_REG r9d +# define LOOP_REG64 r9 + +# define OFFSET_REG8 r9b +# define OFFSET_REG r9d +# define OFFSET_REG64 r9 +# else +# define LOOP_REG edx +# define LOOP_REG64 rdx + +# define OFFSET_REG8 dl +# define OFFSET_REG edx +# define OFFSET_REG64 rdx +# endif + # ifndef VZEROUPPER # define VZEROUPPER vzeroupper # endif +# if defined USE_AS_STRNCMP +# define VEC_OFFSET 0 +# else +# define VEC_OFFSET (-VEC_SIZE) +# endif + +# define xmmZERO xmm15 +# define ymmZERO ymm15 + # ifndef SECTION # define SECTION(p) p##.avx # endif @@ -79,783 +101,1049 @@ the maximum offset is reached before a difference is found, zero is returned. */ - .section SECTION(.text),"ax",@progbits -ENTRY (STRCMP) + .section SECTION(.text), "ax", @progbits +ENTRY(STRCMP) # ifdef USE_AS_STRNCMP - /* Check for simple cases (0 or 1) in offset. */ +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %rdx +# endif cmp $1, %RDX_LP - je L(char0) - jb L(zero) + /* Signed comparison intentional. We use this branch to also + test cases where length >= 2^63. These very large sizes can be + handled with strcmp as there is no way for that length to + actually bound the buffer. */ + jle L(one_or_less) # ifdef USE_AS_WCSCMP -# ifndef __ILP32__ movq %rdx, %rcx - /* Check if length could overflow when multiplied by - sizeof(wchar_t). Checking top 8 bits will cover all potential - overflow cases as well as redirect cases where its impossible to - length to bound a valid memory region. In these cases just use - 'wcscmp'. */ + + /* Multiplying length by sizeof(wchar_t) can result in overflow. + Check if that is possible. All cases where overflow are possible + are cases where length is large enough that it can never be a + bound on valid memory so just use wcscmp. */ shrq $56, %rcx jnz __wcscmp_avx2 + + leaq (, %rdx, 4), %rdx # endif - /* Convert units: from wide to byte char. */ - shl $2, %RDX_LP -# endif - /* Register %r11 tracks the maximum offset. */ - mov %RDX_LP, %R11_LP # endif + vpxor %xmmZERO, %xmmZERO, %xmmZERO movl %edi, %eax - xorl %edx, %edx - /* Make %xmm7 (%ymm7) all zeros in this function. */ - vpxor %xmm7, %xmm7, %xmm7 orl %esi, %eax - andl $(PAGE_SIZE - 1), %eax - cmpl $(PAGE_SIZE - (VEC_SIZE * 4)), %eax - jg L(cross_page) - /* Start comparing 4 vectors. */ - vmovdqu (%rdi), %ymm1 - VPCMPEQ (%rsi), %ymm1, %ymm0 - VPMINU %ymm1, %ymm0, %ymm0 - VPCMPEQ %ymm7, %ymm0, %ymm0 - vpmovmskb %ymm0, %ecx - testl %ecx, %ecx - je L(next_3_vectors) - tzcntl %ecx, %edx + sall $20, %eax + /* Check if s1 or s2 may cross a page in next 4x VEC loads. */ + cmpl $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax + ja L(page_cross) + +L(no_page_cross): + /* Safe to compare 4x vectors. 
*/ + VMOVU (%rdi), %ymm0 + /* 1s where s1 and s2 equal. */ + VPCMPEQ (%rsi), %ymm0, %ymm1 + /* 1s at null CHAR. */ + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + /* 1s where s1 and s2 equal AND not null CHAR. */ + vpandn %ymm1, %ymm2, %ymm1 + + /* All 1s -> keep going, any 0s -> return. */ + vpmovmskb %ymm1, %ecx # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx) is after the maximum - offset (%r11). */ - cmpq %r11, %rdx - jae L(zero) + cmpq $VEC_SIZE, %rdx + jbe L(vec_0_test_len) # endif + + /* All 1s represents all equals. incl will overflow to zero in + all equals case. Otherwise 1s will carry until position of first + mismatch. */ + incl %ecx + jz L(more_3x_vec) + + .p2align 4,, 4 +L(return_vec_0): + tzcntl %ecx, %ecx # ifdef USE_AS_WCSCMP + movl (%rdi, %rcx), %edx xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - je L(return) -L(wcscmp_return): + cmpl (%rsi, %rcx), %edx + je L(ret0) setl %al negl %eax orl $1, %eax -L(return): # else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax + movzbl (%rdi, %rcx), %eax + movzbl (%rsi, %rcx), %ecx + subl %ecx, %eax # endif +L(ret0): L(return_vzeroupper): ZERO_UPPER_VEC_REGISTERS_RETURN - .p2align 4 -L(return_vec_size): - tzcntl %ecx, %edx # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after - the maximum offset (%r11). */ - addq $VEC_SIZE, %rdx - cmpq %r11, %rdx - jae L(zero) -# ifdef USE_AS_WCSCMP + .p2align 4,, 8 +L(vec_0_test_len): + notl %ecx + bzhil %edx, %ecx, %eax + jnz L(return_vec_0) + /* Align if will cross fetch block. */ + .p2align 4,, 2 +L(ret_zero): xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax -# endif -# else + VZEROUPPER_RETURN + + .p2align 4,, 5 +L(one_or_less): + jb L(ret_zero) # ifdef USE_AS_WCSCMP + /* 'nbe' covers the case where length is negative (large + unsigned). */ + jnbe __wcscmp_avx2 + movl (%rdi), %edx xorl %eax, %eax - movl VEC_SIZE(%rdi, %rdx), %ecx - cmpl VEC_SIZE(%rsi, %rdx), %ecx - jne L(wcscmp_return) + cmpl (%rsi), %edx + je L(ret1) + setl %al + negl %eax + orl $1, %eax # else - movzbl VEC_SIZE(%rdi, %rdx), %eax - movzbl VEC_SIZE(%rsi, %rdx), %edx - subl %edx, %eax + /* 'nbe' covers the case where length is negative (large + unsigned). */ + + jnbe __strcmp_avx2 + movzbl (%rdi), %eax + movzbl (%rsi), %ecx + subl %ecx, %eax # endif +L(ret1): + ret # endif - VZEROUPPER_RETURN - .p2align 4 -L(return_2_vec_size): - tzcntl %ecx, %edx + .p2align 4,, 10 +L(return_vec_1): + tzcntl %ecx, %ecx # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is - after the maximum offset (%r11). */ - addq $(VEC_SIZE * 2), %rdx - cmpq %r11, %rdx - jae L(zero) -# ifdef USE_AS_WCSCMP + /* rdx must be > CHAR_PER_VEC so save to subtract w.o fear of + overflow. 
*/ + addq $-VEC_SIZE, %rdx + cmpq %rcx, %rdx + jbe L(ret_zero) +# endif +# ifdef USE_AS_WCSCMP + movl VEC_SIZE(%rdi, %rcx), %edx xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax -# endif + cmpl VEC_SIZE(%rsi, %rcx), %edx + je L(ret2) + setl %al + negl %eax + orl $1, %eax # else -# ifdef USE_AS_WCSCMP - xorl %eax, %eax - movl (VEC_SIZE * 2)(%rdi, %rdx), %ecx - cmpl (VEC_SIZE * 2)(%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 2)(%rdi, %rdx), %eax - movzbl (VEC_SIZE * 2)(%rsi, %rdx), %edx - subl %edx, %eax -# endif + movzbl VEC_SIZE(%rdi, %rcx), %eax + movzbl VEC_SIZE(%rsi, %rcx), %ecx + subl %ecx, %eax # endif +L(ret2): VZEROUPPER_RETURN - .p2align 4 -L(return_3_vec_size): - tzcntl %ecx, %edx + .p2align 4,, 10 # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is - after the maximum offset (%r11). */ - addq $(VEC_SIZE * 3), %rdx - cmpq %r11, %rdx - jae L(zero) -# ifdef USE_AS_WCSCMP +L(return_vec_3): + salq $32, %rcx +# endif + +L(return_vec_2): +# ifndef USE_AS_STRNCMP + tzcntl %ecx, %ecx +# else + tzcntq %rcx, %rcx + cmpq %rcx, %rdx + jbe L(ret_zero) +# endif + +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2)(%rdi, %rcx), %edx xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax -# endif + cmpl (VEC_SIZE * 2)(%rsi, %rcx), %edx + je L(ret3) + setl %al + negl %eax + orl $1, %eax # else + movzbl (VEC_SIZE * 2)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 2)(%rsi, %rcx), %ecx + subl %ecx, %eax +# endif +L(ret3): + VZEROUPPER_RETURN + +# ifndef USE_AS_STRNCMP + .p2align 4,, 10 +L(return_vec_3): + tzcntl %ecx, %ecx # ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 3)(%rdi, %rcx), %edx xorl %eax, %eax - movl (VEC_SIZE * 3)(%rdi, %rdx), %ecx - cmpl (VEC_SIZE * 3)(%rsi, %rdx), %ecx - jne L(wcscmp_return) + cmpl (VEC_SIZE * 3)(%rsi, %rcx), %edx + je L(ret4) + setl %al + negl %eax + orl $1, %eax # else - movzbl (VEC_SIZE * 3)(%rdi, %rdx), %eax - movzbl (VEC_SIZE * 3)(%rsi, %rdx), %edx - subl %edx, %eax + movzbl (VEC_SIZE * 3)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 3)(%rsi, %rcx), %ecx + subl %ecx, %eax # endif -# endif +L(ret4): VZEROUPPER_RETURN +# endif + + .p2align 4,, 10 +L(more_3x_vec): + /* Safe to compare 4x vectors. 
*/ + VMOVU VEC_SIZE(%rdi), %ymm0 + VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_1) + +# ifdef USE_AS_STRNCMP + subq $(VEC_SIZE * 2), %rdx + jbe L(ret_zero) +# endif + + VMOVU (VEC_SIZE * 2)(%rdi), %ymm0 + VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_2) + + VMOVU (VEC_SIZE * 3)(%rdi), %ymm0 + VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_3) - .p2align 4 -L(next_3_vectors): - vmovdqu VEC_SIZE(%rdi), %ymm6 - VPCMPEQ VEC_SIZE(%rsi), %ymm6, %ymm3 - VPMINU %ymm6, %ymm3, %ymm3 - VPCMPEQ %ymm7, %ymm3, %ymm3 - vpmovmskb %ymm3, %ecx - testl %ecx, %ecx - jne L(return_vec_size) - vmovdqu (VEC_SIZE * 2)(%rdi), %ymm5 - vmovdqu (VEC_SIZE * 3)(%rdi), %ymm4 - vmovdqu (VEC_SIZE * 3)(%rsi), %ymm0 - VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm5, %ymm2 - VPMINU %ymm5, %ymm2, %ymm2 - VPCMPEQ %ymm4, %ymm0, %ymm0 - VPCMPEQ %ymm7, %ymm2, %ymm2 - vpmovmskb %ymm2, %ecx - testl %ecx, %ecx - jne L(return_2_vec_size) - VPMINU %ymm4, %ymm0, %ymm0 - VPCMPEQ %ymm7, %ymm0, %ymm0 - vpmovmskb %ymm0, %ecx - testl %ecx, %ecx - jne L(return_3_vec_size) -L(main_loop_header): - leaq (VEC_SIZE * 4)(%rdi), %rdx - movl $PAGE_SIZE, %ecx - /* Align load via RAX. */ - andq $-(VEC_SIZE * 4), %rdx - subq %rdi, %rdx - leaq (%rdi, %rdx), %rax # ifdef USE_AS_STRNCMP - /* Starting from this point, the maximum offset, or simply the - 'offset', DECREASES by the same amount when base pointers are - moved forward. Return 0 when: - 1) On match: offset <= the matched vector index. - 2) On mistmach, offset is before the mistmatched index. + cmpq $(VEC_SIZE * 2), %rdx + jbe L(ret_zero) +# endif + +# ifdef USE_AS_WCSCMP + /* any non-zero positive value that doesn't inference with 0x1. */ - subq %rdx, %r11 - jbe L(zero) -# endif - addq %rsi, %rdx - movq %rdx, %rsi - andl $(PAGE_SIZE - 1), %esi - /* Number of bytes before page crossing. */ - subq %rsi, %rcx - /* Number of VEC_SIZE * 4 blocks before page crossing. */ - shrq $DIVIDE_BY_VEC_4_SHIFT, %rcx - /* ESI: Number of VEC_SIZE * 4 blocks before page crossing. */ - movl %ecx, %esi - jmp L(loop_start) + movl $2, %r8d +# else + xorl %r8d, %r8d +# endif + + /* The prepare labels are various entry points from the page + cross logic. */ +L(prepare_loop): + +# ifdef USE_AS_STRNCMP + /* Store N + (VEC_SIZE * 4) and place check at the begining of + the loop. */ + leaq (VEC_SIZE * 2)(%rdi, %rdx), %rdx +# endif +L(prepare_loop_no_len): + + /* Align s1 and adjust s2 accordingly. */ + subq %rdi, %rsi + andq $-(VEC_SIZE * 4), %rdi + addq %rdi, %rsi + +# ifdef USE_AS_STRNCMP + subq %rdi, %rdx +# endif + +L(prepare_loop_aligned): + /* eax stores distance from rsi to next page cross. These cases + need to be handled specially as the 4x loop could potentially + read memory past the length of s1 or s2 and across a page + boundary. */ + movl $-(VEC_SIZE * 4), %eax + subl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + + /* Loop 4x comparisons at a time. */ .p2align 4 L(loop): + + /* End condition for strncmp. */ # ifdef USE_AS_STRNCMP - /* Base pointers are moved forward by 4 * VEC_SIZE. Decrease - the maximum offset (%r11) by the same amount. 
*/ - subq $(VEC_SIZE * 4), %r11 - jbe L(zero) -# endif - addq $(VEC_SIZE * 4), %rax - addq $(VEC_SIZE * 4), %rdx -L(loop_start): - testl %esi, %esi - leal -1(%esi), %esi - je L(loop_cross_page) -L(back_to_loop): - /* Main loop, comparing 4 vectors are a time. */ - vmovdqa (%rax), %ymm0 - vmovdqa VEC_SIZE(%rax), %ymm3 - VPCMPEQ (%rdx), %ymm0, %ymm4 - VPCMPEQ VEC_SIZE(%rdx), %ymm3, %ymm1 - VPMINU %ymm0, %ymm4, %ymm4 - VPMINU %ymm3, %ymm1, %ymm1 - vmovdqa (VEC_SIZE * 2)(%rax), %ymm2 - VPMINU %ymm1, %ymm4, %ymm0 - vmovdqa (VEC_SIZE * 3)(%rax), %ymm3 - VPCMPEQ (VEC_SIZE * 2)(%rdx), %ymm2, %ymm5 - VPCMPEQ (VEC_SIZE * 3)(%rdx), %ymm3, %ymm6 - VPMINU %ymm2, %ymm5, %ymm5 - VPMINU %ymm3, %ymm6, %ymm6 - VPMINU %ymm5, %ymm0, %ymm0 - VPMINU %ymm6, %ymm0, %ymm0 - VPCMPEQ %ymm7, %ymm0, %ymm0 - - /* Test each mask (32 bits) individually because for VEC_SIZE - == 32 is not possible to OR the four masks and keep all bits - in a 64-bit integer register, differing from SSE2 strcmp - where ORing is possible. */ - vpmovmskb %ymm0, %ecx + subq $(VEC_SIZE * 4), %rdx + jbe L(ret_zero) +# endif + + subq $-(VEC_SIZE * 4), %rdi + subq $-(VEC_SIZE * 4), %rsi + + /* Check if rsi loads will cross a page boundary. */ + addl $-(VEC_SIZE * 4), %eax + jnb L(page_cross_during_loop) + + /* Loop entry after handling page cross during loop. */ +L(loop_skip_page_cross_check): + VMOVA (VEC_SIZE * 0)(%rdi), %ymm0 + VMOVA (VEC_SIZE * 1)(%rdi), %ymm2 + VMOVA (VEC_SIZE * 2)(%rdi), %ymm4 + VMOVA (VEC_SIZE * 3)(%rdi), %ymm6 + + /* ymm1 all 1s where s1 and s2 equal. All 0s otherwise. */ + VPCMPEQ (VEC_SIZE * 0)(%rsi), %ymm0, %ymm1 + + VPCMPEQ (VEC_SIZE * 1)(%rsi), %ymm2, %ymm3 + VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5 + VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7 + + + /* If any mismatches or null CHAR then 0 CHAR, otherwise non- + zero. */ + vpand %ymm0, %ymm1, %ymm1 + + + vpand %ymm2, %ymm3, %ymm3 + vpand %ymm4, %ymm5, %ymm5 + vpand %ymm6, %ymm7, %ymm7 + + VPMINU %ymm1, %ymm3, %ymm3 + VPMINU %ymm5, %ymm7, %ymm7 + + /* Reduce all 0 CHARs for the 4x VEC into ymm7. */ + VPMINU %ymm3, %ymm7, %ymm7 + + /* If any 0 CHAR then done. */ + VPCMPEQ %ymm7, %ymmZERO, %ymm7 + vpmovmskb %ymm7, %LOOP_REG + testl %LOOP_REG, %LOOP_REG + jz L(loop) + + /* Find which VEC has the mismatch of end of string. */ + VPCMPEQ %ymm1, %ymmZERO, %ymm1 + vpmovmskb %ymm1, %ecx testl %ecx, %ecx - je L(loop) - VPCMPEQ %ymm7, %ymm4, %ymm0 - vpmovmskb %ymm0, %edi - testl %edi, %edi - je L(test_vec) - tzcntl %edi, %ecx + jnz L(return_vec_0_end) + + + VPCMPEQ %ymm3, %ymmZERO, %ymm3 + vpmovmskb %ymm3, %ecx + testl %ecx, %ecx + jnz L(return_vec_1_end) + +L(return_vec_2_3_end): # ifdef USE_AS_STRNCMP - cmpq %rcx, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + subq $(VEC_SIZE * 2), %rdx + jbe L(ret_zero_end) +# endif + + VPCMPEQ %ymm5, %ymmZERO, %ymm5 + vpmovmskb %ymm5, %ecx + testl %ecx, %ecx + jnz L(return_vec_2_end) + + /* LOOP_REG contains matches for null/mismatch from the loop. If + VEC 0,1,and 2 all have no null and no mismatches then mismatch + must entirely be from VEC 3 which is fully represented by + LOOP_REG. 
*/ + tzcntl %LOOP_REG, %LOOP_REG + +# ifdef USE_AS_STRNCMP + subl $-(VEC_SIZE), %LOOP_REG + cmpq %LOOP_REG64, %rdx + jbe L(ret_zero_end) +# endif + +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %ecx xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif + cmpl (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx + je L(ret5) + setl %al + negl %eax + xorl %r8d, %eax # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif + movzbl (VEC_SIZE * 2 - VEC_OFFSET)(%rdi, %LOOP_REG64), %eax + movzbl (VEC_SIZE * 2 - VEC_OFFSET)(%rsi, %LOOP_REG64), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret5): VZEROUPPER_RETURN - .p2align 4 -L(test_vec): # ifdef USE_AS_STRNCMP - /* The first vector matched. Return 0 if the maximum offset - (%r11) <= VEC_SIZE. */ - cmpq $VEC_SIZE, %r11 - jbe L(zero) + .p2align 4,, 2 +L(ret_zero_end): + xorl %eax, %eax + VZEROUPPER_RETURN # endif - VPCMPEQ %ymm7, %ymm1, %ymm1 - vpmovmskb %ymm1, %ecx - testl %ecx, %ecx - je L(test_2_vec) - tzcntl %ecx, %edi + + + /* The L(return_vec_N_end) differ from L(return_vec_N) in that + they use the value of `r8` to negate the return value. This is + because the page cross logic can swap `rdi` and `rsi`. */ + .p2align 4,, 10 # ifdef USE_AS_STRNCMP - addq $VEC_SIZE, %rdi - cmpq %rdi, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi +L(return_vec_1_end): + salq $32, %rcx +# endif +L(return_vec_0_end): +# ifndef USE_AS_STRNCMP + tzcntl %ecx, %ecx +# else + tzcntq %rcx, %rcx + cmpq %rcx, %rdx + jbe L(ret_zero_end) +# endif + +# ifdef USE_AS_WCSCMP + movl (%rdi, %rcx), %edx xorl %eax, %eax - movl (%rsi, %rdi), %ecx - cmpl (%rdx, %rdi), %ecx - jne L(wcscmp_return) -# else - movzbl (%rax, %rdi), %eax - movzbl (%rdx, %rdi), %edx - subl %edx, %eax -# endif + cmpl (%rsi, %rcx), %edx + je L(ret6) + setl %al + negl %eax + xorl %r8d, %eax # else + movzbl (%rdi, %rcx), %eax + movzbl (%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax +# endif +L(ret6): + VZEROUPPER_RETURN + +# ifndef USE_AS_STRNCMP + .p2align 4,, 10 +L(return_vec_1_end): + tzcntl %ecx, %ecx # ifdef USE_AS_WCSCMP - movq %rax, %rsi + movl VEC_SIZE(%rdi, %rcx), %edx xorl %eax, %eax - movl VEC_SIZE(%rsi, %rdi), %ecx - cmpl VEC_SIZE(%rdx, %rdi), %ecx - jne L(wcscmp_return) + cmpl VEC_SIZE(%rsi, %rcx), %edx + je L(ret7) + setl %al + negl %eax + xorl %r8d, %eax # else - movzbl VEC_SIZE(%rax, %rdi), %eax - movzbl VEC_SIZE(%rdx, %rdi), %edx - subl %edx, %eax + movzbl VEC_SIZE(%rdi, %rcx), %eax + movzbl VEC_SIZE(%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif -# endif +L(ret7): VZEROUPPER_RETURN +# endif - .p2align 4 -L(test_2_vec): + .p2align 4,, 10 +L(return_vec_2_end): + tzcntl %ecx, %ecx # ifdef USE_AS_STRNCMP - /* The first 2 vectors matched. Return 0 if the maximum offset - (%r11) <= 2 * VEC_SIZE. 
*/ - cmpq $(VEC_SIZE * 2), %r11 - jbe L(zero) + cmpq %rcx, %rdx + jbe L(ret_zero_page_cross) # endif - VPCMPEQ %ymm7, %ymm5, %ymm5 - vpmovmskb %ymm5, %ecx - testl %ecx, %ecx - je L(test_3_vec) - tzcntl %ecx, %edi -# ifdef USE_AS_STRNCMP - addq $(VEC_SIZE * 2), %rdi - cmpq %rdi, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2)(%rdi, %rcx), %edx xorl %eax, %eax - movl (%rsi, %rdi), %ecx - cmpl (%rdx, %rdi), %ecx - jne L(wcscmp_return) -# else - movzbl (%rax, %rdi), %eax - movzbl (%rdx, %rdi), %edx - subl %edx, %eax -# endif + cmpl (VEC_SIZE * 2)(%rsi, %rcx), %edx + je L(ret11) + setl %al + negl %eax + xorl %r8d, %eax # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (VEC_SIZE * 2)(%rsi, %rdi), %ecx - cmpl (VEC_SIZE * 2)(%rdx, %rdi), %ecx - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 2)(%rax, %rdi), %eax - movzbl (VEC_SIZE * 2)(%rdx, %rdi), %edx - subl %edx, %eax -# endif + movzbl (VEC_SIZE * 2)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 2)(%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret11): VZEROUPPER_RETURN - .p2align 4 -L(test_3_vec): + + /* Page cross in rsi in next 4x VEC. */ + + /* TODO: Improve logic here. */ + .p2align 4,, 10 +L(page_cross_during_loop): + /* eax contains [distance_from_page - (VEC_SIZE * 4)]. */ + + /* Optimistically rsi and rdi and both aligned inwhich case we + don't need any logic here. */ + cmpl $-(VEC_SIZE * 4), %eax + /* Don't adjust eax before jumping back to loop and we will + never hit page cross case again. */ + je L(loop_skip_page_cross_check) + + /* Check if we can safely load a VEC. */ + cmpl $-(VEC_SIZE * 3), %eax + jle L(less_1x_vec_till_page_cross) + + VMOVA (%rdi), %ymm0 + VPCMPEQ (%rsi), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_0_end) + + /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2). */ + cmpl $-(VEC_SIZE * 2), %eax + jg L(more_2x_vec_till_page_cross) + + .p2align 4,, 4 +L(less_1x_vec_till_page_cross): + subl $-(VEC_SIZE * 4), %eax + /* Guranteed safe to read from rdi - VEC_SIZE here. The only + concerning case is first iteration if incoming s1 was near start + of a page and s2 near end. If s1 was near the start of the page + we already aligned up to nearest VEC_SIZE * 4 so gurnateed safe + to read back -VEC_SIZE. If rdi is truly at the start of a page + here, it means the previous page (rdi - VEC_SIZE) has already + been loaded earlier so must be valid. */ + VMOVU -VEC_SIZE(%rdi, %rax), %ymm0 + VPCMPEQ -VEC_SIZE(%rsi, %rax), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + + /* Mask of potentially valid bits. The lower bits can be out of + range comparisons (but safe regarding page crosses). */ + movl $-1, %r10d + shlxl %esi, %r10d, %r10d + notl %ecx + # ifdef USE_AS_STRNCMP - /* The first 3 vectors matched. Return 0 if the maximum offset - (%r11) <= 3 * VEC_SIZE. 
*/ - cmpq $(VEC_SIZE * 3), %r11 - jbe L(zero) -# endif - VPCMPEQ %ymm7, %ymm6, %ymm6 - vpmovmskb %ymm6, %esi - tzcntl %esi, %ecx + cmpq %rax, %rdx + jbe L(return_page_cross_end_check) +# endif + movl %eax, %OFFSET_REG + addl $(PAGE_SIZE - VEC_SIZE * 4), %eax + + andl %r10d, %ecx + jz L(loop_skip_page_cross_check) + + .p2align 4,, 3 +L(return_page_cross_end): + tzcntl %ecx, %ecx + # ifdef USE_AS_STRNCMP - addq $(VEC_SIZE * 3), %rcx - cmpq %rcx, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rcx), %esi - cmpl (%rdx, %rcx), %esi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif + leal -VEC_SIZE(%OFFSET_REG64, %rcx), %ecx +L(return_page_cross_cmp_mem): # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + addl %OFFSET_REG, %ecx +# endif +# ifdef USE_AS_WCSCMP + movl VEC_OFFSET(%rdi, %rcx), %edx xorl %eax, %eax - movl (VEC_SIZE * 3)(%rsi, %rcx), %esi - cmpl (VEC_SIZE * 3)(%rdx, %rcx), %esi - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 3)(%rax, %rcx), %eax - movzbl (VEC_SIZE * 3)(%rdx, %rcx), %edx - subl %edx, %eax -# endif + cmpl VEC_OFFSET(%rsi, %rcx), %edx + je L(ret8) + setl %al + negl %eax + xorl %r8d, %eax +# else + movzbl VEC_OFFSET(%rdi, %rcx), %eax + movzbl VEC_OFFSET(%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret8): VZEROUPPER_RETURN - .p2align 4 -L(loop_cross_page): - xorl %r10d, %r10d - movq %rdx, %rcx - /* Align load via RDX. We load the extra ECX bytes which should - be ignored. */ - andl $((VEC_SIZE * 4) - 1), %ecx - /* R10 is -RCX. */ - subq %rcx, %r10 - - /* This works only if VEC_SIZE * 2 == 64. */ -# if (VEC_SIZE * 2) != 64 -# error (VEC_SIZE * 2) != 64 -# endif - - /* Check if the first VEC_SIZE * 2 bytes should be ignored. */ - cmpl $(VEC_SIZE * 2), %ecx - jge L(loop_cross_page_2_vec) - - vmovdqu (%rax, %r10), %ymm2 - vmovdqu VEC_SIZE(%rax, %r10), %ymm3 - VPCMPEQ (%rdx, %r10), %ymm2, %ymm0 - VPCMPEQ VEC_SIZE(%rdx, %r10), %ymm3, %ymm1 - VPMINU %ymm2, %ymm0, %ymm0 - VPMINU %ymm3, %ymm1, %ymm1 - VPCMPEQ %ymm7, %ymm0, %ymm0 - VPCMPEQ %ymm7, %ymm1, %ymm1 - - vpmovmskb %ymm0, %edi - vpmovmskb %ymm1, %esi - - salq $32, %rsi - xorq %rsi, %rdi - - /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes. */ - shrq %cl, %rdi - - testq %rdi, %rdi - je L(loop_cross_page_2_vec) - tzcntq %rdi, %rcx # ifdef USE_AS_STRNCMP - cmpq %rcx, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + .p2align 4,, 10 +L(return_page_cross_end_check): + tzcntl %ecx, %ecx + leal -VEC_SIZE(%rax, %rcx), %ecx + cmpl %ecx, %edx + ja L(return_page_cross_cmp_mem) xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif -# else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif -# endif VZEROUPPER_RETURN +# endif - .p2align 4 -L(loop_cross_page_2_vec): - /* The first VEC_SIZE * 2 bytes match or are ignored. 
*/ - vmovdqu (VEC_SIZE * 2)(%rax, %r10), %ymm2 - vmovdqu (VEC_SIZE * 3)(%rax, %r10), %ymm3 - VPCMPEQ (VEC_SIZE * 2)(%rdx, %r10), %ymm2, %ymm5 - VPMINU %ymm2, %ymm5, %ymm5 - VPCMPEQ (VEC_SIZE * 3)(%rdx, %r10), %ymm3, %ymm6 - VPCMPEQ %ymm7, %ymm5, %ymm5 - VPMINU %ymm3, %ymm6, %ymm6 - VPCMPEQ %ymm7, %ymm6, %ymm6 - - vpmovmskb %ymm5, %edi - vpmovmskb %ymm6, %esi - - salq $32, %rsi - xorq %rsi, %rdi - xorl %r8d, %r8d - /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes. */ - subl $(VEC_SIZE * 2), %ecx - jle 1f - /* Skip ECX bytes. */ - shrq %cl, %rdi - /* R8 has number of bytes skipped. */ - movl %ecx, %r8d -1: - /* Before jumping back to the loop, set ESI to the number of - VEC_SIZE * 4 blocks before page crossing. */ - movl $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi - - testq %rdi, %rdi + .p2align 4,, 10 +L(more_2x_vec_till_page_cross): + /* If more 2x vec till cross we will complete a full loop + iteration here. */ + + VMOVU VEC_SIZE(%rdi), %ymm0 + VPCMPEQ VEC_SIZE(%rsi), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_1_end) + # ifdef USE_AS_STRNCMP - /* At this point, if %rdi value is 0, it already tested - VEC_SIZE*4+%r10 byte starting from %rax. This label - checks whether strncmp maximum offset reached or not. */ - je L(string_nbyte_offset_check) -# else - je L(back_to_loop) + cmpq $(VEC_SIZE * 2), %rdx + jbe L(ret_zero_in_loop_page_cross) # endif - tzcntq %rdi, %rcx - addq %r10, %rcx - /* Adjust for number of bytes skipped. */ - addq %r8, %rcx + + subl $-(VEC_SIZE * 4), %eax + + /* Safe to include comparisons from lower bytes. */ + VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %ymm0 + VPCMPEQ -(VEC_SIZE * 2)(%rsi, %rax), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_page_cross_0) + + VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %ymm0 + VPCMPEQ -(VEC_SIZE * 1)(%rsi, %rax), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + jnz L(return_vec_page_cross_1) + # ifdef USE_AS_STRNCMP - addq $(VEC_SIZE * 2), %rcx - subq %rcx, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + /* Must check length here as length might proclude reading next + page. */ + cmpq %rax, %rdx + jbe L(ret_zero_in_loop_page_cross) +# endif + + /* Finish the loop. */ + VMOVA (VEC_SIZE * 2)(%rdi), %ymm4 + VMOVA (VEC_SIZE * 3)(%rdi), %ymm6 + + VPCMPEQ (VEC_SIZE * 2)(%rsi), %ymm4, %ymm5 + VPCMPEQ (VEC_SIZE * 3)(%rsi), %ymm6, %ymm7 + vpand %ymm4, %ymm5, %ymm5 + vpand %ymm6, %ymm7, %ymm7 + VPMINU %ymm5, %ymm7, %ymm7 + VPCMPEQ %ymm7, %ymmZERO, %ymm7 + vpmovmskb %ymm7, %LOOP_REG + testl %LOOP_REG, %LOOP_REG + jnz L(return_vec_2_3_end) + + /* Best for code size to include ucond-jmp here. Would be faster + if this case is hot to duplicate the L(return_vec_2_3_end) code + as fall-through and have jump back to loop on mismatch + comparison. 
*/ + subq $-(VEC_SIZE * 4), %rdi + subq $-(VEC_SIZE * 4), %rsi + addl $(PAGE_SIZE - VEC_SIZE * 8), %eax +# ifdef USE_AS_STRNCMP + subq $(VEC_SIZE * 4), %rdx + ja L(loop_skip_page_cross_check) +L(ret_zero_in_loop_page_cross): xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif + VZEROUPPER_RETURN # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (VEC_SIZE * 2)(%rsi, %rcx), %edi - cmpl (VEC_SIZE * 2)(%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 2)(%rax, %rcx), %eax - movzbl (VEC_SIZE * 2)(%rdx, %rcx), %edx - subl %edx, %eax -# endif + jmp L(loop_skip_page_cross_check) # endif - VZEROUPPER_RETURN + + .p2align 4,, 10 +L(return_vec_page_cross_0): + addl $-VEC_SIZE, %eax +L(return_vec_page_cross_1): + tzcntl %ecx, %ecx # ifdef USE_AS_STRNCMP -L(string_nbyte_offset_check): - leaq (VEC_SIZE * 4)(%r10), %r10 - cmpq %r10, %r11 - jbe L(zero) - jmp L(back_to_loop) + leal -VEC_SIZE(%rax, %rcx), %ecx + cmpq %rcx, %rdx + jbe L(ret_zero_in_loop_page_cross) +# else + addl %eax, %ecx # endif - .p2align 4 -L(cross_page_loop): - /* Check one byte/dword at a time. */ # ifdef USE_AS_WCSCMP - cmpl %ecx, %eax + movl VEC_OFFSET(%rdi, %rcx), %edx + xorl %eax, %eax + cmpl VEC_OFFSET(%rsi, %rcx), %edx + je L(ret9) + setl %al + negl %eax + xorl %r8d, %eax # else + movzbl VEC_OFFSET(%rdi, %rcx), %eax + movzbl VEC_OFFSET(%rsi, %rcx), %ecx subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif - jne L(different) - addl $SIZE_OF_CHAR, %edx - cmpl $(VEC_SIZE * 4), %edx - je L(main_loop_header) -# ifdef USE_AS_STRNCMP - cmpq %r11, %rdx - jae L(zero) +L(ret9): + VZEROUPPER_RETURN + + + .p2align 4,, 10 +L(page_cross): +# ifndef USE_AS_STRNCMP + /* If both are VEC aligned we don't need any special logic here. + Only valid for strcmp where stop condition is guranteed to be + reachable by just reading memory. */ + testl $((VEC_SIZE - 1) << 20), %eax + jz L(no_page_cross) # endif + + movl %edi, %eax + movl %esi, %ecx + andl $(PAGE_SIZE - 1), %eax + andl $(PAGE_SIZE - 1), %ecx + + xorl %OFFSET_REG, %OFFSET_REG + + /* Check which is closer to page cross, s1 or s2. */ + cmpl %eax, %ecx + jg L(page_cross_s2) + + /* The previous page cross check has false positives. Check for + true positive as page cross logic is very expensive. */ + subl $(PAGE_SIZE - VEC_SIZE * 4), %eax + jbe L(no_page_cross) + + /* Set r8 to not interfere with normal return value (rdi and rsi + did not swap). */ # ifdef USE_AS_WCSCMP - movl (%rdi, %rdx), %eax - movl (%rsi, %rdx), %ecx + /* any non-zero positive value that doesn't inference with 0x1. + */ + movl $2, %r8d # else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %ecx + xorl %r8d, %r8d # endif - /* Check null char. */ - testl %eax, %eax - jne L(cross_page_loop) - /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED - comparisons. */ - subl %ecx, %eax -# ifndef USE_AS_WCSCMP -L(different): + + /* Check if less than 1x VEC till page cross. */ + subl $(VEC_SIZE * 3), %eax + jg L(less_1x_vec_till_page) + + /* If more than 1x VEC till page cross, loop throuh safely + loadable memory until within 1x VEC of page cross. 
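
A sketch of the per-pointer recheck in the page-cross entry above, assuming 4 KiB pages and 32-byte vectors as in the surrounding code (the helper name is illustrative). The earlier combined filter ORs the two page offsets together and so can report false positives; this is the true-positive test it falls back on before taking the expensive path:

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE  32

    /* True if a 4x VEC (128 byte) load starting at p would run past the
       end of p's page and therefore needs the slow page-cross path.  */
    int crosses_page_in_4x_vec(uintptr_t p)
    {
        return (p & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE * 4;
    }
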
*/ + + .p2align 4,, 10 +L(page_cross_loop): + + VMOVU (%rdi, %OFFSET_REG64), %ymm0 + VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + incl %ecx + + jnz L(check_ret_vec_page_cross) + addl $VEC_SIZE, %OFFSET_REG +# ifdef USE_AS_STRNCMP + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross) # endif - VZEROUPPER_RETURN + addl $VEC_SIZE, %eax + jl L(page_cross_loop) + + subl %eax, %OFFSET_REG + /* OFFSET_REG has distance to page cross - VEC_SIZE. Guranteed + to not cross page so is safe to load. Since we have already + loaded at least 1 VEC from rsi it is also guranteed to be safe. + */ + + VMOVU (%rdi, %OFFSET_REG64), %ymm0 + VPCMPEQ (%rsi, %OFFSET_REG64), %ymm0, %ymm1 + VPCMPEQ %ymm0, %ymmZERO, %ymm2 + vpandn %ymm1, %ymm2, %ymm1 + vpmovmskb %ymm1, %ecx + +# ifdef USE_AS_STRNCMP + leal VEC_SIZE(%OFFSET_REG64), %eax + cmpq %rax, %rdx + jbe L(check_ret_vec_page_cross2) + addq %rdi, %rdx +# endif + incl %ecx + jz L(prepare_loop_no_len) + .p2align 4,, 4 +L(ret_vec_page_cross): +# ifndef USE_AS_STRNCMP +L(check_ret_vec_page_cross): +# endif + tzcntl %ecx, %ecx + addl %OFFSET_REG, %ecx +L(ret_vec_page_cross_cont): # ifdef USE_AS_WCSCMP - .p2align 4 -L(different): - /* Use movl to avoid modifying EFLAGS. */ - movl $0, %eax + movl (%rdi, %rcx), %edx + xorl %eax, %eax + cmpl (%rsi, %rcx), %edx + je L(ret12) setl %al negl %eax - orl $1, %eax - VZEROUPPER_RETURN + xorl %r8d, %eax +# else + movzbl (%rdi, %rcx), %eax + movzbl (%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret12): + VZEROUPPER_RETURN # ifdef USE_AS_STRNCMP - .p2align 4 -L(zero): + .p2align 4,, 10 +L(check_ret_vec_page_cross2): + incl %ecx +L(check_ret_vec_page_cross): + tzcntl %ecx, %ecx + addl %OFFSET_REG, %ecx + cmpq %rcx, %rdx + ja L(ret_vec_page_cross_cont) + .p2align 4,, 2 +L(ret_zero_page_cross): xorl %eax, %eax VZEROUPPER_RETURN +# endif - .p2align 4 -L(char0): -# ifdef USE_AS_WCSCMP - xorl %eax, %eax - movl (%rdi), %ecx - cmpl (%rsi), %ecx - jne L(wcscmp_return) -# else - movzbl (%rsi), %ecx - movzbl (%rdi), %eax - subl %ecx, %eax -# endif - VZEROUPPER_RETURN + .p2align 4,, 4 +L(page_cross_s2): + /* Ensure this is a true page cross. */ + subl $(PAGE_SIZE - VEC_SIZE * 4), %ecx + jbe L(no_page_cross) + + + movl %ecx, %eax + movq %rdi, %rcx + movq %rsi, %rdi + movq %rcx, %rsi + + /* set r8 to negate return value as rdi and rsi swapped. */ +# ifdef USE_AS_WCSCMP + movl $-4, %r8d +# else + movl $-1, %r8d # endif + xorl %OFFSET_REG, %OFFSET_REG - .p2align 4 -L(last_vector): - addq %rdx, %rdi - addq %rdx, %rsi + /* Check if more than 1x VEC till page cross. */ + subl $(VEC_SIZE * 3), %eax + jle L(page_cross_loop) + + .p2align 4,, 6 +L(less_1x_vec_till_page): + /* Find largest load size we can use. */ + cmpl $16, %eax + ja L(less_16_till_page) + + VMOVU (%rdi), %xmm0 + VPCMPEQ (%rsi), %xmm0, %xmm1 + VPCMPEQ %xmm0, %xmmZERO, %xmm2 + vpandn %xmm1, %xmm2, %xmm1 + vpmovmskb %ymm1, %ecx + incw %cx + jnz L(check_ret_vec_page_cross) + movl $16, %OFFSET_REG # ifdef USE_AS_STRNCMP - subq %rdx, %r11 + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case0) + subl %eax, %OFFSET_REG +# else + /* Explicit check for 16 byte alignment. 
*/ + subl %eax, %OFFSET_REG + jz L(prepare_loop) # endif - tzcntl %ecx, %edx + + VMOVU (%rdi, %OFFSET_REG64), %xmm0 + VPCMPEQ (%rsi, %OFFSET_REG64), %xmm0, %xmm1 + VPCMPEQ %xmm0, %xmmZERO, %xmm2 + vpandn %xmm1, %xmm2, %xmm1 + vpmovmskb %ymm1, %ecx + incw %cx + jnz L(check_ret_vec_page_cross) + # ifdef USE_AS_STRNCMP - cmpq %r11, %rdx - jae L(zero) + addl $16, %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case0) + subq $-(VEC_SIZE * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi +# else + leaq (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi + leaq (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi # endif -# ifdef USE_AS_WCSCMP + jmp L(prepare_loop_aligned) + +# ifdef USE_AS_STRNCMP + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case0): xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax + ret # endif - VZEROUPPER_RETURN - /* Comparing on page boundary region requires special treatment: - It must done one vector at the time, starting with the wider - ymm vector if possible, if not, with xmm. If fetching 16 bytes - (xmm) still passes the boundary, byte comparison must be done. - */ - .p2align 4 -L(cross_page): - /* Try one ymm vector at a time. */ - cmpl $(PAGE_SIZE - VEC_SIZE), %eax - jg L(cross_page_1_vector) -L(loop_1_vector): - vmovdqu (%rdi, %rdx), %ymm1 - VPCMPEQ (%rsi, %rdx), %ymm1, %ymm0 - VPMINU %ymm1, %ymm0, %ymm0 - VPCMPEQ %ymm7, %ymm0, %ymm0 - vpmovmskb %ymm0, %ecx - testl %ecx, %ecx - jne L(last_vector) - addl $VEC_SIZE, %edx + .p2align 4,, 10 +L(less_16_till_page): + /* Find largest load size we can use. */ + cmpl $24, %eax + ja L(less_8_till_page) - addl $VEC_SIZE, %eax -# ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) -# endif - cmpl $(PAGE_SIZE - VEC_SIZE), %eax - jle L(loop_1_vector) -L(cross_page_1_vector): - /* Less than 32 bytes to check, try one xmm vector. */ - cmpl $(PAGE_SIZE - 16), %eax - jg L(cross_page_1_xmm) - vmovdqu (%rdi, %rdx), %xmm1 - VPCMPEQ (%rsi, %rdx), %xmm1, %xmm0 - VPMINU %xmm1, %xmm0, %xmm0 - VPCMPEQ %xmm7, %xmm0, %xmm0 - vpmovmskb %xmm0, %ecx - testl %ecx, %ecx - jne L(last_vector) + vmovq (%rdi), %xmm0 + vmovq (%rsi), %xmm1 + VPCMPEQ %xmm0, %xmmZERO, %xmm2 + VPCMPEQ %xmm1, %xmm0, %xmm1 + vpandn %xmm1, %xmm2, %xmm1 + vpmovmskb %ymm1, %ecx + incb %cl + jnz L(check_ret_vec_page_cross) - addl $16, %edx -# ifndef USE_AS_WCSCMP - addl $16, %eax + +# ifdef USE_AS_STRNCMP + cmpq $8, %rdx + jbe L(ret_zero_page_cross_slow_case0) # endif + movl $24, %OFFSET_REG + /* Explicit check for 16 byte alignment. */ + subl %eax, %OFFSET_REG + + + + vmovq (%rdi, %OFFSET_REG64), %xmm0 + vmovq (%rsi, %OFFSET_REG64), %xmm1 + VPCMPEQ %xmm0, %xmmZERO, %xmm2 + VPCMPEQ %xmm1, %xmm0, %xmm1 + vpandn %xmm1, %xmm2, %xmm1 + vpmovmskb %ymm1, %ecx + incb %cl + jnz L(check_ret_vec_page_cross) + # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) -# endif - -L(cross_page_1_xmm): -# ifndef USE_AS_WCSCMP - /* Less than 16 bytes to check, try 8 byte vector. NB: No need - for wcscmp nor wcsncmp since wide char is 4 bytes. 
*/ - cmpl $(PAGE_SIZE - 8), %eax - jg L(cross_page_8bytes) - vmovq (%rdi, %rdx), %xmm1 - vmovq (%rsi, %rdx), %xmm0 - VPCMPEQ %xmm0, %xmm1, %xmm0 - VPMINU %xmm1, %xmm0, %xmm0 - VPCMPEQ %xmm7, %xmm0, %xmm0 - vpmovmskb %xmm0, %ecx - /* Only last 8 bits are valid. */ - andl $0xff, %ecx - testl %ecx, %ecx - jne L(last_vector) + addl $8, %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case0) + subq $-(VEC_SIZE * 4), %rdx - addl $8, %edx - addl $8, %eax + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi +# else + leaq (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi + leaq (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi +# endif + jmp L(prepare_loop_aligned) + + + .p2align 4,, 10 +L(less_8_till_page): +# ifdef USE_AS_WCSCMP + /* If using wchar then this is the only check before we reach + the page boundary. */ + movl (%rdi), %eax + movl (%rsi), %ecx + cmpl %ecx, %eax + jnz L(ret_less_8_wcs) # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) + addq %rdi, %rdx + /* We already checked for len <= 1 so cannot hit that case here. + */ # endif + testl %eax, %eax + jnz L(prepare_loop_no_len) + ret -L(cross_page_8bytes): - /* Less than 8 bytes to check, try 4 byte vector. */ - cmpl $(PAGE_SIZE - 4), %eax - jg L(cross_page_4bytes) - vmovd (%rdi, %rdx), %xmm1 - vmovd (%rsi, %rdx), %xmm0 - VPCMPEQ %xmm0, %xmm1, %xmm0 - VPMINU %xmm1, %xmm0, %xmm0 - VPCMPEQ %xmm7, %xmm0, %xmm0 - vpmovmskb %xmm0, %ecx - /* Only last 4 bits are valid. */ - andl $0xf, %ecx - testl %ecx, %ecx - jne L(last_vector) + .p2align 4,, 8 +L(ret_less_8_wcs): + setl %OFFSET_REG8 + negl %OFFSET_REG + movl %OFFSET_REG, %eax + xorl %r8d, %eax + ret + +# else + + /* Find largest load size we can use. */ + cmpl $28, %eax + ja L(less_4_till_page) + + vmovd (%rdi), %xmm0 + vmovd (%rsi), %xmm1 + VPCMPEQ %xmm0, %xmmZERO, %xmm2 + VPCMPEQ %xmm1, %xmm0, %xmm1 + vpandn %xmm1, %xmm2, %xmm1 + vpmovmskb %ymm1, %ecx + subl $0xf, %ecx + jnz L(check_ret_vec_page_cross) - addl $4, %edx # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) + cmpq $4, %rdx + jbe L(ret_zero_page_cross_slow_case1) # endif + movl $28, %OFFSET_REG + /* Explicit check for 16 byte alignment. */ + subl %eax, %OFFSET_REG -L(cross_page_4bytes): -# endif - /* Less than 4 bytes to check, try one byte/dword at a time. 
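
The 16/8/4-byte fallbacks above form a "largest safe load" cascade once fewer than VEC_SIZE bytes remain before the page boundary. A hedged C model of that selection, assuming 4 KiB pages; the thresholds mirror the cmpl $16/$24/$28 checks and the function name is illustrative:

    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Widest load (in bytes) that cannot fault before the page end.  */
    unsigned largest_safe_load(uintptr_t p)
    {
        uintptr_t to_page_end = PAGE_SIZE - (p & (PAGE_SIZE - 1));
        if (to_page_end >= 32) return 32;   /* full ymm VEC */
        if (to_page_end >= 16) return 16;   /* xmm load */
        if (to_page_end >= 8)  return 8;    /* vmovq */
        if (to_page_end >= 4)  return 4;    /* vmovd */
        return 1;                           /* byte loop up to the boundary */
    }
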
*/ -# ifdef USE_AS_STRNCMP - cmpq %r11, %rdx - jae L(zero) -# endif -# ifdef USE_AS_WCSCMP - movl (%rdi, %rdx), %eax - movl (%rsi, %rdx), %ecx -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %ecx -# endif - testl %eax, %eax - jne L(cross_page_loop) + + + vmovd (%rdi, %OFFSET_REG64), %xmm0 + vmovd (%rsi, %OFFSET_REG64), %xmm1 + VPCMPEQ %xmm0, %xmmZERO, %xmm2 + VPCMPEQ %xmm1, %xmm0, %xmm1 + vpandn %xmm1, %xmm2, %xmm1 + vpmovmskb %ymm1, %ecx + subl $0xf, %ecx + jnz L(check_ret_vec_page_cross) + +# ifdef USE_AS_STRNCMP + addl $4, %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case1) + subq $-(VEC_SIZE * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi +# else + leaq (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64), %rdi + leaq (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64), %rsi +# endif + jmp L(prepare_loop_aligned) + +# ifdef USE_AS_STRNCMP + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case1): + xorl %eax, %eax + ret +# endif + + .p2align 4,, 10 +L(less_4_till_page): + subq %rdi, %rsi + /* Extremely slow byte comparison loop. */ +L(less_4_loop): + movzbl (%rdi), %eax + movzbl (%rsi, %rdi), %ecx subl %ecx, %eax - VZEROUPPER_RETURN -END (STRCMP) + jnz L(ret_less_4_loop) + testl %ecx, %ecx + jz L(ret_zero_4_loop) +# ifdef USE_AS_STRNCMP + decq %rdx + jz L(ret_zero_4_loop) +# endif + incq %rdi + /* end condition is reach page boundary (rdi is aligned). */ + testl $31, %edi + jnz L(less_4_loop) + leaq -(VEC_SIZE * 4)(%rdi, %rsi), %rsi + addq $-(VEC_SIZE * 4), %rdi +# ifdef USE_AS_STRNCMP + subq $-(VEC_SIZE * 4), %rdx +# endif + jmp L(prepare_loop_aligned) + +L(ret_zero_4_loop): + xorl %eax, %eax + ret +L(ret_less_4_loop): + xorl %r8d, %eax + subl %r8d, %eax + ret +# endif +END(STRCMP) #endif From patchwork Mon Jan 10 21:35:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 49818 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 3585E385841C for ; Mon, 10 Jan 2022 21:40:34 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3585E385841C DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1641850834; bh=hSBlPWM5JbjHJstxX5CzX94m2RLBM5iqvoDpVxp5RSM=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=cB/0m/DVb+Wcs5d9cSC/5IEHiDCKnf4n6sfi8ElKBGV+6zZpaf9ncnEflPy0JLoal iPF7XvurUUN/OtST6BhgKHyPAiReFNsIrnoX4GClCSeOKbmsFH1uvsuv413+FxX5Vx BzmXvHT4iZtsPcesFoOYqY+gm3npQNSbpVyTkhaA= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com [IPv6:2607:f8b0:4864:20::62d]) by sourceware.org (Postfix) with ESMTPS id 4C2B53890436 for ; Mon, 10 Jan 2022 21:35:56 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4C2B53890436 Received: by mail-pl1-x62d.google.com with SMTP id w7so13998108plp.13 for ; Mon, 10 Jan 2022 13:35:56 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=hSBlPWM5JbjHJstxX5CzX94m2RLBM5iqvoDpVxp5RSM=; b=lnzn9XZlZqYsjEMlm7mRLushpagiIeGMF3rc49Zx0L3ekvgieMVv0Smv6yXkI8T+km 
QjSaYkutbx9n/dRdLZcKS2dOsXun1L+kR5zd1gsS4FIvNkufLADazNW54DgVlrrWhpwA S1aiEiIoaxpY8QetJKiue8FbBqLEq0Fwhlmr8qKeH8tLXmPdtmH27v91S6ykKGufAzli uvtSGmxF14HNvcgySxgLvCCRcqnKZRkPwWpXtAqQWBwLG8uSy33g1QwsOOGSYzoB0r2g Mh1vPyalBaX3GQPtv8Ap4YB89Ifp0GJ84lDrYrVBUjKPAxCfnKlXZdIdKMCLD7MIE75L tYhA== X-Gm-Message-State: AOAM533+bImTR4oCfVT3EWPQzedBRffYtzv8qLTlqzbFe/UZDaqPg0FH H7H8vT15WB6FxkOcnyJbGKQAL+FXHc0= X-Google-Smtp-Source: ABdhPJxKRqV99Z3TF2xCyK2Puwspmc/ddtOjZJc2FWXB0rlQkcniV+1/AeHT5g5F8eBCPFAJwIDl1Q== X-Received: by 2002:a17:90a:7a88:: with SMTP id q8mr1765643pjf.90.1641850553561; Mon, 10 Jan 2022 13:35:53 -0800 (PST) Received: from noah-tigerlake.webpass.net (136-24-166-223.cab.webpass.net. [136.24.166.223]) by smtp.googlemail.com with ESMTPSA id f12sm7996515pfe.127.2022.01.10.13.35.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jan 2022 13:35:53 -0800 (PST) To: libc-alpha@sourceware.org Subject: [PATCH v3 6/7] x86: Optimize strcmp-evex.S Date: Mon, 10 Jan 2022 15:35:39 -0600 Message-Id: <20220110213540.1258344-6-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220110213540.1258344-1-goldstein.w.n@gmail.com> References: <20220109122946.2754917-1-goldstein.w.n@gmail.com> <20220110213540.1258344-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SCC_10_SHORT_WORD_LINES, SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" Optimization are primarily to the loop logic and how the page cross logic interacts with the loop. The page cross logic is at times more expensive for short strings near the end of a page but not crossing the page. This is done to retest the page cross conditions with a non-faulty check and to improve the logic for entering the loop afterwards. This is only particular cases, however, and is general made up for by more than 10x improvements on the transition from the page cross -> loop case. The non-page cross cases as well are nearly universally improved. test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass. Signed-off-by: Noah Goldstein --- sysdeps/x86_64/multiarch/strcmp-evex.S | 1712 +++++++++++++----------- 1 file changed, 919 insertions(+), 793 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S index 0cd939d5af..e5070f3d53 100644 --- a/sysdeps/x86_64/multiarch/strcmp-evex.S +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S @@ -26,54 +26,69 @@ # define PAGE_SIZE 4096 -/* VEC_SIZE = Number of bytes in a ymm register */ + /* VEC_SIZE = Number of bytes in a ymm register. */ # define VEC_SIZE 32 +# define CHAR_PER_VEC (VEC_SIZE / SIZE_OF_CHAR) -/* Shift for dividing by (VEC_SIZE * 4). 
*/ -# define DIVIDE_BY_VEC_4_SHIFT 7 -# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT) -# error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT) -# endif - -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# define VMOVU vmovdqu64 +# define VMOVA vmovdqa64 # ifdef USE_AS_WCSCMP -/* Compare packed dwords. */ -# define VPCMP vpcmpd +# define TESTEQ subl $0xff, + /* Compare packed dwords. */ +# define VPCMP vpcmpd # define VPMINU vpminud # define VPTESTM vptestmd -# define SHIFT_REG32 r8d -# define SHIFT_REG64 r8 -/* 1 dword char == 4 bytes. */ + /* 1 dword char == 4 bytes. */ # define SIZE_OF_CHAR 4 # else -/* Compare packed bytes. */ -# define VPCMP vpcmpb +# define TESTEQ incl + /* Compare packed bytes. */ +# define VPCMP vpcmpb # define VPMINU vpminub # define VPTESTM vptestmb -# define SHIFT_REG32 ecx -# define SHIFT_REG64 rcx -/* 1 byte char == 1 byte. */ + /* 1 byte char == 1 byte. */ # define SIZE_OF_CHAR 1 # endif +# ifdef USE_AS_STRNCMP +# define LOOP_REG r9d +# define LOOP_REG64 r9 + +# define OFFSET_REG8 r9b +# define OFFSET_REG r9d +# define OFFSET_REG64 r9 +# else +# define LOOP_REG edx +# define LOOP_REG64 rdx + +# define OFFSET_REG8 dl +# define OFFSET_REG edx +# define OFFSET_REG64 rdx +# endif + +# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP +# define VEC_OFFSET 0 +# else +# define VEC_OFFSET (-VEC_SIZE) +# endif + # define XMMZERO xmm16 -# define XMM0 xmm17 -# define XMM1 xmm18 +# define XMM0 xmm17 +# define XMM1 xmm18 # define YMMZERO ymm16 -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 -# define YMM9 ymm26 -# define YMM10 ymm27 +# define YMM0 ymm17 +# define YMM1 ymm18 +# define YMM2 ymm19 +# define YMM3 ymm20 +# define YMM4 ymm21 +# define YMM5 ymm22 +# define YMM6 ymm23 +# define YMM7 ymm24 +# define YMM8 ymm25 +# define YMM9 ymm26 +# define YMM10 ymm27 /* Warning! wcscmp/wcsncmp have to use SIGNED comparison for elements. @@ -96,985 +111,1096 @@ the maximum offset is reached before a difference is found, zero is returned. */ - .section .text.evex,"ax",@progbits -ENTRY (STRCMP) + .section .text.evex, "ax", @progbits +ENTRY(STRCMP) # ifdef USE_AS_STRNCMP - /* Check for simple cases (0 or 1) in offset. */ - cmp $1, %RDX_LP - je L(char0) - jb L(zero) -# ifdef USE_AS_WCSCMP -# ifndef __ILP32__ - movq %rdx, %rcx - /* Check if length could overflow when multiplied by - sizeof(wchar_t). Checking top 8 bits will cover all potential - overflow cases as well as redirect cases where its impossible to - length to bound a valid memory region. In these cases just use - 'wcscmp'. */ - shrq $56, %rcx - jnz __wcscmp_evex -# endif - /* Convert units: from wide to byte char. */ - shl $2, %RDX_LP +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %edx, %rdx # endif - /* Register %r11 tracks the maximum offset. */ - mov %RDX_LP, %R11_LP + cmp $1, %RDX_LP + /* Signed comparison intentional. We use this branch to also + test cases where length >= 2^63. These very large sizes can be + handled with strcmp as there is no way for that length to + actually bound the buffer. */ + jle L(one_or_less) # endif movl %edi, %eax - xorl %edx, %edx - /* Make %XMMZERO (%YMMZERO) all zeros in this function. */ - vpxorq %XMMZERO, %XMMZERO, %XMMZERO orl %esi, %eax - andl $(PAGE_SIZE - 1), %eax - cmpl $(PAGE_SIZE - (VEC_SIZE * 4)), %eax - jg L(cross_page) - /* Start comparing 4 vectors. */ + /* Shift out the bits irrelivant to page boundary ([63:12]). 
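
The shift-based filter that follows can be read as a plain mask-and-compare on the page offsets. A small C model, assuming 4 KiB pages and 32-byte vectors; because the two offsets are ORed together before the test, the check may report false positives, which the page-cross path re-verifies later:

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE  32

    /* Model of: orl %esi, %eax; sall $20, %eax; cmpl ...; ja L(page_cross).
       Shifting left by 20 discards bits 12 and up, so the unsigned compare
       is equivalent to ((edi | esi) & 0xfff) > PAGE_SIZE - 4 * VEC_SIZE.  */
    int may_cross_page(uint32_t edi, uint32_t esi)
    {
        uint32_t eax = (edi | esi) << 20;
        return eax > ((uint32_t)(PAGE_SIZE - VEC_SIZE * 4) << 20);
    }
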
*/ + sall $20, %eax + /* Check if s1 or s2 may cross a page in next 4x VEC loads. */ + cmpl $((PAGE_SIZE -(VEC_SIZE * 4)) << 20), %eax + ja L(page_cross) + +L(no_page_cross): + /* Safe to compare 4x vectors. */ VMOVU (%rdi), %YMM0 - - /* Each bit set in K2 represents a non-null CHAR in YMM0. */ VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi). */ VPCMP $0, (%rsi), %YMM0, %k1{%k2} - kmovd %k1, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx -# else - incl %ecx -# endif - je L(next_3_vectors) - tzcntl %ecx, %edx -# ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %edx -# endif # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx) is after the maximum - offset (%r11). */ - cmpq %r11, %rdx - jae L(zero) + cmpq $CHAR_PER_VEC, %rdx + jbe L(vec_0_test_len) # endif + + /* TESTEQ is `incl` for strcmp/strncmp and `subl $0xff` for + wcscmp/wcsncmp. */ + + /* All 1s represents all equals. TESTEQ will overflow to zero in + all equals case. Otherwise 1s will carry until position of first + mismatch. */ + TESTEQ %ecx + jz L(more_3x_vec) + + .p2align 4,, 4 +L(return_vec_0): + tzcntl %ecx, %ecx # ifdef USE_AS_WCSCMP + movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - je L(return) -L(wcscmp_return): + cmpl (%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret0) setl %al negl %eax orl $1, %eax -L(return): # else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax + movzbl (%rdi, %rcx), %eax + movzbl (%rsi, %rcx), %ecx + subl %ecx, %eax # endif +L(ret0): ret -L(return_vec_size): - tzcntl %ecx, %edx -# ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %edx -# endif # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after - the maximum offset (%r11). */ - addq $VEC_SIZE, %rdx - cmpq %r11, %rdx - jae L(zero) -# ifdef USE_AS_WCSCMP + .p2align 4,, 4 +L(vec_0_test_len): + notl %ecx + bzhil %edx, %ecx, %eax + jnz L(return_vec_0) + /* Align if will cross fetch block. */ + .p2align 4,, 2 +L(ret_zero): xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax -# endif -# else + ret + + .p2align 4,, 5 +L(one_or_less): + jb L(ret_zero) # ifdef USE_AS_WCSCMP + /* 'nbe' covers the case where length is negative (large + unsigned). */ + jnbe __wcscmp_evex + movl (%rdi), %edx xorl %eax, %eax - movl VEC_SIZE(%rdi, %rdx), %ecx - cmpl VEC_SIZE(%rsi, %rdx), %ecx - jne L(wcscmp_return) + cmpl (%rsi), %edx + je L(ret1) + setl %al + negl %eax + orl $1, %eax # else - movzbl VEC_SIZE(%rdi, %rdx), %eax - movzbl VEC_SIZE(%rsi, %rdx), %edx - subl %edx, %eax + /* 'nbe' covers the case where length is negative (large + unsigned). */ + jnbe __strcmp_evex + movzbl (%rdi), %eax + movzbl (%rsi), %ecx + subl %ecx, %eax # endif -# endif +L(ret1): ret +# endif -L(return_2_vec_size): - tzcntl %ecx, %edx + .p2align 4,, 10 +L(return_vec_1): + tzcntl %ecx, %ecx +# ifdef USE_AS_STRNCMP + /* rdx must be > CHAR_PER_VEC so its safe to subtract without + worrying about underflow. */ + addq $-CHAR_PER_VEC, %rdx + cmpq %rcx, %rdx + jbe L(ret_zero) +# endif # ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. 
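
A C sketch of the TESTEQ trick above for the byte-string case, assuming the kmask has a 1 for every character that compared equal and was not null (function and parameter names are illustrative). Adding 1 overflows to zero exactly when all 32 bits are set, and otherwise the carry stops at the first clear bit, so tzcnt of the incremented mask is the index of the first mismatch or null; the wcscmp variant uses subl $0xff because only 8 dword lanes are valid:

    #include <stdint.h>

    /* Returns -1 if all 32 characters were equal and non-null, otherwise
       the index of the first mismatch or null character.  */
    int first_stop_index(uint32_t eq_mask)
    {
        uint32_t m = eq_mask + 1;        /* TESTEQ: incl */
        if (m == 0)
            return -1;                   /* jz: fall through to the next VEC */
        return __builtin_ctz(m);         /* tzcntl on the incremented mask */
    }
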
*/ - sall $2, %edx + movl VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx + xorl %eax, %eax + cmpl VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret2) + setl %al + negl %eax + orl $1, %eax +# else + movzbl VEC_SIZE(%rdi, %rcx), %eax + movzbl VEC_SIZE(%rsi, %rcx), %ecx + subl %ecx, %eax # endif +L(ret2): + ret + + .p2align 4,, 10 # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is - after the maximum offset (%r11). */ - addq $(VEC_SIZE * 2), %rdx - cmpq %r11, %rdx - jae L(zero) -# ifdef USE_AS_WCSCMP - xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) +L(return_vec_3): +# if CHAR_PER_VEC <= 16 + sall $CHAR_PER_VEC, %ecx # else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax + salq $CHAR_PER_VEC, %rcx # endif +# endif +L(return_vec_2): +# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) + tzcntl %ecx, %ecx # else -# ifdef USE_AS_WCSCMP - xorl %eax, %eax - movl (VEC_SIZE * 2)(%rdi, %rdx), %ecx - cmpl (VEC_SIZE * 2)(%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 2)(%rdi, %rdx), %eax - movzbl (VEC_SIZE * 2)(%rsi, %rdx), %edx - subl %edx, %eax -# endif + tzcntq %rcx, %rcx # endif - ret -L(return_3_vec_size): - tzcntl %ecx, %edx -# ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %edx -# endif # ifdef USE_AS_STRNCMP - /* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is - after the maximum offset (%r11). */ - addq $(VEC_SIZE * 3), %rdx - cmpq %r11, %rdx - jae L(zero) -# ifdef USE_AS_WCSCMP + cmpq %rcx, %rdx + jbe L(ret_zero) +# endif + +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax -# endif + cmpl (VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret3) + setl %al + negl %eax + orl $1, %eax # else + movzbl (VEC_SIZE * 2)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 2)(%rsi, %rcx), %ecx + subl %ecx, %eax +# endif +L(ret3): + ret + +# ifndef USE_AS_STRNCMP + .p2align 4,, 10 +L(return_vec_3): + tzcntl %ecx, %ecx # ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax - movl (VEC_SIZE * 3)(%rdi, %rdx), %ecx - cmpl (VEC_SIZE * 3)(%rsi, %rdx), %ecx - jne L(wcscmp_return) + cmpl (VEC_SIZE * 3)(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret4) + setl %al + negl %eax + orl $1, %eax # else - movzbl (VEC_SIZE * 3)(%rdi, %rdx), %eax - movzbl (VEC_SIZE * 3)(%rsi, %rdx), %edx - subl %edx, %eax + movzbl (VEC_SIZE * 3)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 3)(%rsi, %rcx), %ecx + subl %ecx, %eax # endif -# endif +L(ret4): ret +# endif - .p2align 4 -L(next_3_vectors): - VMOVU VEC_SIZE(%rdi), %YMM0 - /* Each bit set in K2 represents a non-null CHAR in YMM0. */ + /* 32 byte align here ensures the main loop is ideally aligned + for DSB. */ + .p2align 5 +L(more_3x_vec): + /* Safe to compare 4x vectors. */ + VMOVU (VEC_SIZE)(%rdi), %YMM0 VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in YMM0 and 32 bytes at VEC_SIZE(%rsi). 
*/ - VPCMP $0, VEC_SIZE(%rsi), %YMM0, %k1{%k2} + VPCMP $0, (VEC_SIZE)(%rsi), %YMM0, %k1{%k2} kmovd %k1, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx -# else - incl %ecx + TESTEQ %ecx + jnz L(return_vec_1) + +# ifdef USE_AS_STRNCMP + subq $(CHAR_PER_VEC * 2), %rdx + jbe L(ret_zero) # endif - jne L(return_vec_size) VMOVU (VEC_SIZE * 2)(%rdi), %YMM0 - /* Each bit set in K2 represents a non-null CHAR in YMM0. */ VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi). */ VPCMP $0, (VEC_SIZE * 2)(%rsi), %YMM0, %k1{%k2} kmovd %k1, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx -# else - incl %ecx -# endif - jne L(return_2_vec_size) + TESTEQ %ecx + jnz L(return_vec_2) VMOVU (VEC_SIZE * 3)(%rdi), %YMM0 - /* Each bit set in K2 represents a non-null CHAR in YMM0. */ VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rsi). */ VPCMP $0, (VEC_SIZE * 3)(%rsi), %YMM0, %k1{%k2} kmovd %k1, %ecx + TESTEQ %ecx + jnz L(return_vec_3) + +# ifdef USE_AS_STRNCMP + cmpq $(CHAR_PER_VEC * 2), %rdx + jbe L(ret_zero) +# endif + + # ifdef USE_AS_WCSCMP - subl $0xff, %ecx + /* any non-zero positive value that doesn't inference with 0x1. + */ + movl $2, %r8d + # else - incl %ecx + xorl %r8d, %r8d # endif - jne L(return_3_vec_size) -L(main_loop_header): - leaq (VEC_SIZE * 4)(%rdi), %rdx - movl $PAGE_SIZE, %ecx - /* Align load via RAX. */ - andq $-(VEC_SIZE * 4), %rdx - subq %rdi, %rdx - leaq (%rdi, %rdx), %rax + + /* The prepare labels are various entry points from the page + cross logic. */ +L(prepare_loop): + # ifdef USE_AS_STRNCMP - /* Starting from this point, the maximum offset, or simply the - 'offset', DECREASES by the same amount when base pointers are - moved forward. Return 0 when: - 1) On match: offset <= the matched vector index. - 2) On mistmach, offset is before the mistmatched index. - */ - subq %rdx, %r11 - jbe L(zero) +# ifdef USE_AS_WCSCMP +L(prepare_loop_no_len): + movl %edi, %ecx + andl $(VEC_SIZE * 4 - 1), %ecx + shrl $2, %ecx + leaq (CHAR_PER_VEC * 2)(%rdx, %rcx), %rdx +# else + /* Store N + (VEC_SIZE * 4) and place check at the begining of + the loop. */ + leaq (VEC_SIZE * 2)(%rdi, %rdx), %rdx +L(prepare_loop_no_len): +# endif +# else +L(prepare_loop_no_len): # endif - addq %rsi, %rdx - movq %rdx, %rsi - andl $(PAGE_SIZE - 1), %esi - /* Number of bytes before page crossing. */ - subq %rsi, %rcx - /* Number of VEC_SIZE * 4 blocks before page crossing. */ - shrq $DIVIDE_BY_VEC_4_SHIFT, %rcx - /* ESI: Number of VEC_SIZE * 4 blocks before page crossing. */ - movl %ecx, %esi - jmp L(loop_start) + /* Align s1 and adjust s2 accordingly. */ + subq %rdi, %rsi + andq $-(VEC_SIZE * 4), %rdi +L(prepare_loop_readj): + addq %rdi, %rsi +# if (defined USE_AS_STRNCMP) && !(defined USE_AS_WCSCMP) + subq %rdi, %rdx +# endif + +L(prepare_loop_aligned): + /* eax stores distance from rsi to next page cross. These cases + need to be handled specially as the 4x loop could potentially + read memory past the length of s1 or s2 and across a page + boundary. */ + movl $-(VEC_SIZE * 4), %eax + subl %esi, %eax + andl $(PAGE_SIZE - 1), %eax + + vpxorq %YMMZERO, %YMMZERO, %YMMZERO + + /* Loop 4x comparisons at a time. */ .p2align 4 L(loop): + + /* End condition for strncmp. */ # ifdef USE_AS_STRNCMP - /* Base pointers are moved forward by 4 * VEC_SIZE. Decrease - the maximum offset (%r11) by the same amount. 
*/ - subq $(VEC_SIZE * 4), %r11 - jbe L(zero) + subq $(CHAR_PER_VEC * 4), %rdx + jbe L(ret_zero) # endif - addq $(VEC_SIZE * 4), %rax - addq $(VEC_SIZE * 4), %rdx -L(loop_start): - testl %esi, %esi - leal -1(%esi), %esi - je L(loop_cross_page) -L(back_to_loop): - /* Main loop, comparing 4 vectors are a time. */ - VMOVA (%rax), %YMM0 - VMOVA VEC_SIZE(%rax), %YMM2 - VMOVA (VEC_SIZE * 2)(%rax), %YMM4 - VMOVA (VEC_SIZE * 3)(%rax), %YMM6 + + subq $-(VEC_SIZE * 4), %rdi + subq $-(VEC_SIZE * 4), %rsi + + /* Check if rsi loads will cross a page boundary. */ + addl $-(VEC_SIZE * 4), %eax + jnb L(page_cross_during_loop) + + /* Loop entry after handling page cross during loop. */ +L(loop_skip_page_cross_check): + VMOVA (VEC_SIZE * 0)(%rdi), %YMM0 + VMOVA (VEC_SIZE * 1)(%rdi), %YMM2 + VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 + VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 VPMINU %YMM0, %YMM2, %YMM8 VPMINU %YMM4, %YMM6, %YMM9 - /* A zero CHAR in YMM8 means that there is a null CHAR. */ - VPMINU %YMM8, %YMM9, %YMM8 + /* A zero CHAR in YMM9 means that there is a null CHAR. */ + VPMINU %YMM8, %YMM9, %YMM9 /* Each bit set in K1 represents a non-null CHAR in YMM8. */ - VPTESTM %YMM8, %YMM8, %k1 + VPTESTM %YMM9, %YMM9, %k1 - /* (YMM ^ YMM): A non-zero CHAR represents a mismatch. */ - vpxorq (%rdx), %YMM0, %YMM1 - vpxorq VEC_SIZE(%rdx), %YMM2, %YMM3 - vpxorq (VEC_SIZE * 2)(%rdx), %YMM4, %YMM5 - vpxorq (VEC_SIZE * 3)(%rdx), %YMM6, %YMM7 + vpxorq (VEC_SIZE * 0)(%rsi), %YMM0, %YMM1 + vpxorq (VEC_SIZE * 1)(%rsi), %YMM2, %YMM3 + vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + /* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while + oring with YMM1. Result is stored in YMM6. */ + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6 - vporq %YMM1, %YMM3, %YMM9 - vporq %YMM5, %YMM7, %YMM10 + /* Or together YMM3, YMM5, and YMM6. */ + vpternlogd $0xfe, %YMM3, %YMM5, %YMM6 - /* A non-zero CHAR in YMM9 represents a mismatch. */ - vporq %YMM9, %YMM10, %YMM9 - /* Each bit cleared in K0 represents a mismatch or a null CHAR. */ - VPCMP $0, %YMMZERO, %YMM9, %k0{%k1} - kmovd %k0, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx -# else - incl %ecx -# endif - je L(loop) + /* A non-zero CHAR in YMM6 represents a mismatch. */ + VPCMP $0, %YMMZERO, %YMM6, %k0{%k1} + kmovd %k0, %LOOP_REG - /* Each bit set in K1 represents a non-null CHAR in YMM0. */ + TESTEQ %LOOP_REG + jz L(loop) + + + /* Find which VEC has the mismatch of end of string. */ VPTESTM %YMM0, %YMM0, %k1 - /* Each bit cleared in K0 represents a mismatch or a null CHAR - in YMM0 and (%rdx). */ VPCMP $0, %YMMZERO, %YMM1, %k0{%k1} kmovd %k0, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx -# else - incl %ecx -# endif - je L(test_vec) - tzcntl %ecx, %ecx -# ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %ecx -# endif -# ifdef USE_AS_STRNCMP - cmpq %rcx, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif -# else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif -# endif - ret + TESTEQ %ecx + jnz L(return_vec_0_end) - .p2align 4 -L(test_vec): -# ifdef USE_AS_STRNCMP - /* The first vector matched. Return 0 if the maximum offset - (%r11) <= VEC_SIZE. 
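
The vpternlogd immediates used in the loop above encode arbitrary three-input boolean functions: for each bit position the bits of (dest, src1, src2) select one bit of imm8, with the destination providing the most significant index bit. A small self-checking sketch (purely illustrative) confirming that 0xde computes src1 | (src2 ^ dest), i.e. YMM6 = YMM1 | (mem ^ YMM6) as the comment states, and that 0xfe is a three-way OR:

    #include <assert.h>
    #include <stdint.h>

    /* Builds the imm8 for a given 3-input boolean function, destination
       bit as the most significant index bit.  */
    uint8_t ternlog_imm(uint8_t (*f)(uint8_t, uint8_t, uint8_t))
    {
        uint8_t imm = 0;
        for (unsigned i = 0; i < 8; i++) {
            uint8_t d = (i >> 2) & 1, s1 = (i >> 1) & 1, s2 = i & 1;
            imm |= (uint8_t)((f(d, s1, s2) & 1) << i);
        }
        return imm;
    }

    static uint8_t f_de(uint8_t d, uint8_t s1, uint8_t s2) { return s1 | (s2 ^ d); }
    static uint8_t f_fe(uint8_t d, uint8_t s1, uint8_t s2) { return d | s1 | s2; }

    int main(void)
    {
        assert(ternlog_imm(f_de) == 0xde);
        assert(ternlog_imm(f_fe) == 0xfe);
        return 0;
    }
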
*/ - cmpq $VEC_SIZE, %r11 - jbe L(zero) -# endif - /* Each bit set in K1 represents a non-null CHAR in YMM2. */ VPTESTM %YMM2, %YMM2, %k1 - /* Each bit cleared in K0 represents a mismatch or a null CHAR - in YMM2 and VEC_SIZE(%rdx). */ VPCMP $0, %YMMZERO, %YMM3, %k0{%k1} kmovd %k0, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx -# else - incl %ecx -# endif - je L(test_2_vec) - tzcntl %ecx, %edi -# ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %edi -# endif -# ifdef USE_AS_STRNCMP - addq $VEC_SIZE, %rdi - cmpq %rdi, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rdi), %ecx - cmpl (%rdx, %rdi), %ecx - jne L(wcscmp_return) -# else - movzbl (%rax, %rdi), %eax - movzbl (%rdx, %rdi), %edx - subl %edx, %eax -# endif -# else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl VEC_SIZE(%rsi, %rdi), %ecx - cmpl VEC_SIZE(%rdx, %rdi), %ecx - jne L(wcscmp_return) -# else - movzbl VEC_SIZE(%rax, %rdi), %eax - movzbl VEC_SIZE(%rdx, %rdi), %edx - subl %edx, %eax -# endif -# endif - ret + TESTEQ %ecx + jnz L(return_vec_1_end) - .p2align 4 -L(test_2_vec): + + /* Handle VEC 2 and 3 without branches. */ +L(return_vec_2_3_end): # ifdef USE_AS_STRNCMP - /* The first 2 vectors matched. Return 0 if the maximum offset - (%r11) <= 2 * VEC_SIZE. */ - cmpq $(VEC_SIZE * 2), %r11 - jbe L(zero) + subq $(CHAR_PER_VEC * 2), %rdx + jbe L(ret_zero_end) # endif - /* Each bit set in K1 represents a non-null CHAR in YMM4. */ + VPTESTM %YMM4, %YMM4, %k1 - /* Each bit cleared in K0 represents a mismatch or a null CHAR - in YMM4 and (VEC_SIZE * 2)(%rdx). */ VPCMP $0, %YMMZERO, %YMM5, %k0{%k1} kmovd %k0, %ecx -# ifdef USE_AS_WCSCMP - subl $0xff, %ecx + TESTEQ %ecx +# if CHAR_PER_VEC <= 16 + sall $CHAR_PER_VEC, %LOOP_REG + orl %ecx, %LOOP_REG # else - incl %ecx + salq $CHAR_PER_VEC, %LOOP_REG64 + orq %rcx, %LOOP_REG64 +# endif +L(return_vec_3_end): + /* LOOP_REG contains matches for null/mismatch from the loop. If + VEC 0,1,and 2 all have no null and no mismatches then mismatch + must entirely be from VEC 3 which is fully represented by + LOOP_REG. */ +# if CHAR_PER_VEC <= 16 + tzcntl %LOOP_REG, %LOOP_REG +# else + tzcntq %LOOP_REG64, %LOOP_REG64 +# endif +# ifdef USE_AS_STRNCMP + cmpq %LOOP_REG64, %rdx + jbe L(ret_zero_end) # endif - je L(test_3_vec) - tzcntl %ecx, %edi + # ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %edi + movl (VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + xorl %eax, %eax + cmpl (VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + je L(ret5) + setl %al + negl %eax + xorl %r8d, %eax +# else + movzbl (VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax + movzbl (VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret5): + ret + # ifdef USE_AS_STRNCMP - addq $(VEC_SIZE * 2), %rdi - cmpq %rdi, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + .p2align 4,, 2 +L(ret_zero_end): xorl %eax, %eax - movl (%rsi, %rdi), %ecx - cmpl (%rdx, %rdi), %ecx - jne L(wcscmp_return) + ret +# endif + + + /* The L(return_vec_N_end) differ from L(return_vec_N) in that + they use the value of `r8` to negate the return value. This is + because the page cross logic can swap `rdi` and `rsi`. 
*/ + .p2align 4,, 10 +# ifdef USE_AS_STRNCMP +L(return_vec_1_end): +# if CHAR_PER_VEC <= 16 + sall $CHAR_PER_VEC, %ecx # else - movzbl (%rax, %rdi), %eax - movzbl (%rdx, %rdi), %edx - subl %edx, %eax + salq $CHAR_PER_VEC, %rcx # endif +# endif +L(return_vec_0_end): +# if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) + tzcntl %ecx, %ecx # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (VEC_SIZE * 2)(%rsi, %rdi), %ecx - cmpl (VEC_SIZE * 2)(%rdx, %rdi), %ecx - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 2)(%rax, %rdi), %eax - movzbl (VEC_SIZE * 2)(%rdx, %rdi), %edx - subl %edx, %eax -# endif + tzcntq %rcx, %rcx # endif - ret - .p2align 4 -L(test_3_vec): # ifdef USE_AS_STRNCMP - /* The first 3 vectors matched. Return 0 if the maximum offset - (%r11) <= 3 * VEC_SIZE. */ - cmpq $(VEC_SIZE * 3), %r11 - jbe L(zero) + cmpq %rcx, %rdx + jbe L(ret_zero_end) # endif - /* Each bit set in K1 represents a non-null CHAR in YMM6. */ - VPTESTM %YMM6, %YMM6, %k1 - /* Each bit cleared in K0 represents a mismatch or a null CHAR - in YMM6 and (VEC_SIZE * 3)(%rdx). */ - VPCMP $0, %YMMZERO, %YMM7, %k0{%k1} - kmovd %k0, %ecx + # ifdef USE_AS_WCSCMP - subl $0xff, %ecx + movl (%rdi, %rcx, SIZE_OF_CHAR), %edx + xorl %eax, %eax + cmpl (%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret6) + setl %al + negl %eax + /* This is the non-zero case for `eax` so just xorl with `r8d` + flip is `rdi` and `rsi` where swapped. */ + xorl %r8d, %eax # else - incl %ecx + movzbl (%rdi, %rcx), %eax + movzbl (%rsi, %rcx), %ecx + subl %ecx, %eax + /* Flip `eax` if `rdi` and `rsi` where swapped in page cross + logic. Subtract `r8d` after xor for zero case. */ + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret6): + ret + +# ifndef USE_AS_STRNCMP + .p2align 4,, 10 +L(return_vec_1_end): tzcntl %ecx, %ecx -# ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %ecx -# endif -# ifdef USE_AS_STRNCMP - addq $(VEC_SIZE * 3), %rcx - cmpq %rcx, %r11 - jbe L(zero) # ifdef USE_AS_WCSCMP - movq %rax, %rsi + movl VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax - movl (%rsi, %rcx), %esi - cmpl (%rdx, %rcx), %esi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif -# else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (VEC_SIZE * 3)(%rsi, %rcx), %esi - cmpl (VEC_SIZE * 3)(%rdx, %rcx), %esi - jne L(wcscmp_return) + cmpl VEC_SIZE(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret7) + setl %al + negl %eax + xorl %r8d, %eax # else - movzbl (VEC_SIZE * 3)(%rax, %rcx), %eax - movzbl (VEC_SIZE * 3)(%rdx, %rcx), %edx - subl %edx, %eax + movzbl VEC_SIZE(%rdi, %rcx), %eax + movzbl VEC_SIZE(%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif -# endif +L(ret7): ret - - .p2align 4 -L(loop_cross_page): - xorl %r10d, %r10d - movq %rdx, %rcx - /* Align load via RDX. We load the extra ECX bytes which should - be ignored. */ - andl $((VEC_SIZE * 4) - 1), %ecx - /* R10 is -RCX. */ - subq %rcx, %r10 - - /* This works only if VEC_SIZE * 2 == 64. */ -# if (VEC_SIZE * 2) != 64 -# error (VEC_SIZE * 2) != 64 # endif - /* Check if the first VEC_SIZE * 2 bytes should be ignored. */ - cmpl $(VEC_SIZE * 2), %ecx - jge L(loop_cross_page_2_vec) - VMOVU (%rax, %r10), %YMM2 - VMOVU VEC_SIZE(%rax, %r10), %YMM3 + /* Page cross in rsi in next 4x VEC. */ - /* Each bit set in K2 represents a non-null CHAR in YMM2. 
*/ - VPTESTM %YMM2, %YMM2, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in YMM2 and 32 bytes at (%rdx, %r10). */ - VPCMP $0, (%rdx, %r10), %YMM2, %k1{%k2} - kmovd %k1, %r9d - /* Don't use subl since it is the lower 16/32 bits of RDI - below. */ - notl %r9d -# ifdef USE_AS_WCSCMP - /* Only last 8 bits are valid. */ - andl $0xff, %r9d -# endif + /* TODO: Improve logic here. */ + .p2align 4,, 10 +L(page_cross_during_loop): + /* eax contains [distance_from_page - (VEC_SIZE * 4)]. */ - /* Each bit set in K4 represents a non-null CHAR in YMM3. */ - VPTESTM %YMM3, %YMM3, %k4 - /* Each bit cleared in K3 represents a mismatch or a null CHAR - in YMM3 and 32 bytes at VEC_SIZE(%rdx, %r10). */ - VPCMP $0, VEC_SIZE(%rdx, %r10), %YMM3, %k3{%k4} - kmovd %k3, %edi - /* Must use notl %edi here as lower bits are for CHAR - comparisons potentially out of range thus can be 0 without - indicating mismatch. */ - notl %edi -# ifdef USE_AS_WCSCMP - /* Don't use subl since it is the upper 8 bits of EDI below. */ - andl $0xff, %edi -# endif + /* Optimistically rsi and rdi and both aligned in which case we + don't need any logic here. */ + cmpl $-(VEC_SIZE * 4), %eax + /* Don't adjust eax before jumping back to loop and we will + never hit page cross case again. */ + je L(loop_skip_page_cross_check) -# ifdef USE_AS_WCSCMP - /* NB: Each bit in EDI/R9D represents 4-byte element. */ - sall $8, %edi - /* NB: Divide shift count by 4 since each bit in K1 represent 4 - bytes. */ - movl %ecx, %SHIFT_REG32 - sarl $2, %SHIFT_REG32 - - /* Each bit in EDI represents a null CHAR or a mismatch. */ - orl %r9d, %edi -# else - salq $32, %rdi + /* Check if we can safely load a VEC. */ + cmpl $-(VEC_SIZE * 3), %eax + jle L(less_1x_vec_till_page_cross) - /* Each bit in RDI represents a null CHAR or a mismatch. */ - orq %r9, %rdi -# endif + VMOVA (%rdi), %YMM0 + VPTESTM %YMM0, %YMM0, %k2 + VPCMP $0, (%rsi), %YMM0, %k1{%k2} + kmovd %k1, %ecx + TESTEQ %ecx + jnz L(return_vec_0_end) + + /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2). */ + cmpl $-(VEC_SIZE * 2), %eax + jg L(more_2x_vec_till_page_cross) + + .p2align 4,, 4 +L(less_1x_vec_till_page_cross): + subl $-(VEC_SIZE * 4), %eax + /* Guranteed safe to read from rdi - VEC_SIZE here. The only + concerning case is first iteration if incoming s1 was near start + of a page and s2 near end. If s1 was near the start of the page + we already aligned up to nearest VEC_SIZE * 4 so gurnateed safe + to read back -VEC_SIZE. If rdi is truly at the start of a page + here, it means the previous page (rdi - VEC_SIZE) has already + been loaded earlier so must be valid. */ + VMOVU -VEC_SIZE(%rdi, %rax), %YMM0 + VPTESTM %YMM0, %YMM0, %k2 + VPCMP $0, -VEC_SIZE(%rsi, %rax), %YMM0, %k1{%k2} + + /* Mask of potentially valid bits. The lower bits can be out of + range comparisons (but safe regarding page crosses). */ - /* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes. */ - shrxq %SHIFT_REG64, %rdi, %rdi - testq %rdi, %rdi - je L(loop_cross_page_2_vec) - tzcntq %rdi, %rcx # ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. 
*/ - sall $2, %ecx + movl $-1, %r10d + movl %esi, %ecx + andl $(VEC_SIZE - 1), %ecx + shrl $2, %ecx + shlxl %ecx, %r10d, %ecx + movzbl %cl, %r10d +# else + movl $-1, %ecx + shlxl %esi, %ecx, %r10d # endif + + kmovd %k1, %ecx + notl %ecx + + # ifdef USE_AS_STRNCMP - cmpq %rcx, %r11 - jbe L(zero) # ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) + movl %eax, %r11d + shrl $2, %r11d + cmpq %r11, %rdx # else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax + cmpq %rax, %rdx # endif + jbe L(return_page_cross_end_check) +# endif + movl %eax, %OFFSET_REG + + /* Readjust eax before potentially returning to the loop. */ + addl $(PAGE_SIZE - VEC_SIZE * 4), %eax + + andl %r10d, %ecx + jz L(loop_skip_page_cross_check) + + .p2align 4,, 3 +L(return_page_cross_end): + tzcntl %ecx, %ecx + +# if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP) + leal -VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx +L(return_page_cross_cmp_mem): # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + addl %OFFSET_REG, %ecx +# endif +# ifdef USE_AS_WCSCMP + movl VEC_OFFSET(%rdi, %rcx), %edx xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif + cmpl VEC_OFFSET(%rsi, %rcx), %edx + je L(ret8) + setl %al + negl %eax + xorl %r8d, %eax +# else + movzbl VEC_OFFSET(%rdi, %rcx), %eax + movzbl VEC_OFFSET(%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret8): ret - .p2align 4 -L(loop_cross_page_2_vec): - /* The first VEC_SIZE * 2 bytes match or are ignored. */ - VMOVU (VEC_SIZE * 2)(%rax, %r10), %YMM0 - VMOVU (VEC_SIZE * 3)(%rax, %r10), %YMM1 +# ifdef USE_AS_STRNCMP + .p2align 4,, 10 +L(return_page_cross_end_check): + tzcntl %ecx, %ecx + leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx +# ifdef USE_AS_WCSCMP + sall $2, %edx +# endif + cmpl %ecx, %edx + ja L(return_page_cross_cmp_mem) + xorl %eax, %eax + ret +# endif + + .p2align 4,, 10 +L(more_2x_vec_till_page_cross): + /* If more 2x vec till cross we will complete a full loop + iteration here. */ + + VMOVA VEC_SIZE(%rdi), %YMM0 VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in YMM0 and 32 bytes at (VEC_SIZE * 2)(%rdx, %r10). */ - VPCMP $0, (VEC_SIZE * 2)(%rdx, %r10), %YMM0, %k1{%k2} - kmovd %k1, %r9d - /* Don't use subl since it is the lower 16/32 bits of RDI - below. */ - notl %r9d -# ifdef USE_AS_WCSCMP - /* Only last 8 bits are valid. */ - andl $0xff, %r9d -# endif + VPCMP $0, VEC_SIZE(%rsi), %YMM0, %k1{%k2} + kmovd %k1, %ecx + TESTEQ %ecx + jnz L(return_vec_1_end) - VPTESTM %YMM1, %YMM1, %k4 - /* Each bit cleared in K3 represents a mismatch or a null CHAR - in YMM1 and 32 bytes at (VEC_SIZE * 3)(%rdx, %r10). */ - VPCMP $0, (VEC_SIZE * 3)(%rdx, %r10), %YMM1, %k3{%k4} - kmovd %k3, %edi - /* Must use notl %edi here as lower bits are for CHAR - comparisons potentially out of range thus can be 0 without - indicating mismatch. */ - notl %edi -# ifdef USE_AS_WCSCMP - /* Don't use subl since it is the upper 8 bits of EDI below. */ - andl $0xff, %edi +# ifdef USE_AS_STRNCMP + cmpq $(CHAR_PER_VEC * 2), %rdx + jbe L(ret_zero_in_loop_page_cross) # endif -# ifdef USE_AS_WCSCMP - /* NB: Each bit in EDI/R9D represents 4-byte element. */ - sall $8, %edi + subl $-(VEC_SIZE * 4), %eax - /* Each bit in EDI represents a null CHAR or a mismatch. 
*/ - orl %r9d, %edi -# else - salq $32, %rdi + /* Safe to include comparisons from lower bytes. */ + VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %YMM0 + VPTESTM %YMM0, %YMM0, %k2 + VPCMP $0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM0, %k1{%k2} + kmovd %k1, %ecx + TESTEQ %ecx + jnz L(return_vec_page_cross_0) + + VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %YMM0 + VPTESTM %YMM0, %YMM0, %k2 + VPCMP $0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM0, %k1{%k2} + kmovd %k1, %ecx + TESTEQ %ecx + jnz L(return_vec_page_cross_1) - /* Each bit in RDI represents a null CHAR or a mismatch. */ - orq %r9, %rdi +# ifdef USE_AS_STRNCMP + /* Must check length here as length might proclude reading next + page. */ +# ifdef USE_AS_WCSCMP + movl %eax, %r11d + shrl $2, %r11d + cmpq %r11, %rdx +# else + cmpq %rax, %rdx +# endif + jbe L(ret_zero_in_loop_page_cross) # endif - xorl %r8d, %r8d - /* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes. */ - subl $(VEC_SIZE * 2), %ecx - jle 1f - /* R8 has number of bytes skipped. */ - movl %ecx, %r8d -# ifdef USE_AS_WCSCMP - /* NB: Divide shift count by 4 since each bit in RDI represent 4 - bytes. */ - sarl $2, %ecx - /* Skip ECX bytes. */ - shrl %cl, %edi + /* Finish the loop. */ + VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 + VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 + VPMINU %YMM4, %YMM6, %YMM9 + VPTESTM %YMM9, %YMM9, %k1 + + vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + /* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6). */ + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6 + + VPCMP $0, %YMMZERO, %YMM6, %k0{%k1} + kmovd %k0, %LOOP_REG + TESTEQ %LOOP_REG + jnz L(return_vec_2_3_end) + + /* Best for code size to include ucond-jmp here. Would be faster + if this case is hot to duplicate the L(return_vec_2_3_end) code + as fall-through and have jump back to loop on mismatch + comparison. */ + subq $-(VEC_SIZE * 4), %rdi + subq $-(VEC_SIZE * 4), %rsi + addl $(PAGE_SIZE - VEC_SIZE * 8), %eax +# ifdef USE_AS_STRNCMP + subq $(CHAR_PER_VEC * 4), %rdx + ja L(loop_skip_page_cross_check) +L(ret_zero_in_loop_page_cross): + xorl %eax, %eax + ret # else - /* Skip ECX bytes. */ - shrq %cl, %rdi + jmp L(loop_skip_page_cross_check) # endif -1: - /* Before jumping back to the loop, set ESI to the number of - VEC_SIZE * 4 blocks before page crossing. */ - movl $(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi - testq %rdi, %rdi -# ifdef USE_AS_STRNCMP - /* At this point, if %rdi value is 0, it already tested - VEC_SIZE*4+%r10 byte starting from %rax. This label - checks whether strncmp maximum offset reached or not. */ - je L(string_nbyte_offset_check) + + .p2align 4,, 10 +L(return_vec_page_cross_0): + addl $-VEC_SIZE, %eax +L(return_vec_page_cross_1): + tzcntl %ecx, %ecx +# if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP + leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx +# ifdef USE_AS_STRNCMP +# ifdef USE_AS_WCSCMP + /* Must divide ecx instead of multiply rdx due to overflow. */ + movl %ecx, %eax + shrl $2, %eax + cmpq %rax, %rdx +# else + cmpq %rcx, %rdx +# endif + jbe L(ret_zero_in_loop_page_cross) +# endif # else - je L(back_to_loop) + addl %eax, %ecx # endif - tzcntq %rdi, %rcx + # ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %ecx -# endif - addq %r10, %rcx - /* Adjust for number of bytes skipped. 
*/ - addq %r8, %rcx -# ifdef USE_AS_STRNCMP - addq $(VEC_SIZE * 2), %rcx - subq %rcx, %r11 - jbe L(zero) -# ifdef USE_AS_WCSCMP - movq %rax, %rsi + movl VEC_OFFSET(%rdi, %rcx), %edx xorl %eax, %eax - movl (%rsi, %rcx), %edi - cmpl (%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (%rax, %rcx), %eax - movzbl (%rdx, %rcx), %edx - subl %edx, %eax -# endif + cmpl VEC_OFFSET(%rsi, %rcx), %edx + je L(ret9) + setl %al + negl %eax + xorl %r8d, %eax # else -# ifdef USE_AS_WCSCMP - movq %rax, %rsi - xorl %eax, %eax - movl (VEC_SIZE * 2)(%rsi, %rcx), %edi - cmpl (VEC_SIZE * 2)(%rdx, %rcx), %edi - jne L(wcscmp_return) -# else - movzbl (VEC_SIZE * 2)(%rax, %rcx), %eax - movzbl (VEC_SIZE * 2)(%rdx, %rcx), %edx - subl %edx, %eax -# endif + movzbl VEC_OFFSET(%rdi, %rcx), %eax + movzbl VEC_OFFSET(%rsi, %rcx), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret9): ret -# ifdef USE_AS_STRNCMP -L(string_nbyte_offset_check): - leaq (VEC_SIZE * 4)(%r10), %r10 - cmpq %r10, %r11 - jbe L(zero) - jmp L(back_to_loop) + + .p2align 4,, 10 +L(page_cross): +# ifndef USE_AS_STRNCMP + /* If both are VEC aligned we don't need any special logic here. + Only valid for strcmp where stop condition is guranteed to be + reachable by just reading memory. */ + testl $((VEC_SIZE - 1) << 20), %eax + jz L(no_page_cross) # endif - .p2align 4 -L(cross_page_loop): - /* Check one byte/dword at a time. */ + movl %edi, %eax + movl %esi, %ecx + andl $(PAGE_SIZE - 1), %eax + andl $(PAGE_SIZE - 1), %ecx + + xorl %OFFSET_REG, %OFFSET_REG + + /* Check which is closer to page cross, s1 or s2. */ + cmpl %eax, %ecx + jg L(page_cross_s2) + + /* The previous page cross check has false positives. Check for + true positive as page cross logic is very expensive. */ + subl $(PAGE_SIZE - VEC_SIZE * 4), %eax + jbe L(no_page_cross) + + + /* Set r8 to not interfere with normal return value (rdi and rsi + did not swap). */ # ifdef USE_AS_WCSCMP - cmpl %ecx, %eax + /* any non-zero positive value that doesn't inference with 0x1. + */ + movl $2, %r8d # else - subl %ecx, %eax + xorl %r8d, %r8d # endif - jne L(different) - addl $SIZE_OF_CHAR, %edx - cmpl $(VEC_SIZE * 4), %edx - je L(main_loop_header) + + /* Check if less than 1x VEC till page cross. */ + subl $(VEC_SIZE * 3), %eax + jg L(less_1x_vec_till_page) + + + /* If more than 1x VEC till page cross, loop throuh safely + loadable memory until within 1x VEC of page cross. */ + .p2align 4,, 8 +L(page_cross_loop): + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0 + VPTESTM %YMM0, %YMM0, %k2 + VPCMP $0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2} + kmovd %k1, %ecx + TESTEQ %ecx + jnz L(check_ret_vec_page_cross) + addl $CHAR_PER_VEC, %OFFSET_REG # ifdef USE_AS_STRNCMP - cmpq %r11, %rdx - jae L(zero) + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross) # endif + addl $VEC_SIZE, %eax + jl L(page_cross_loop) + # ifdef USE_AS_WCSCMP - movl (%rdi, %rdx), %eax - movl (%rsi, %rdx), %ecx -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %ecx + shrl $2, %eax # endif - /* Check null CHAR. */ - testl %eax, %eax - jne L(cross_page_loop) - /* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED - comparisons. */ - subl %ecx, %eax -# ifndef USE_AS_WCSCMP -L(different): + + + subl %eax, %OFFSET_REG + /* OFFSET_REG has distance to page cross - VEC_SIZE. Guranteed + to not cross page so is safe to load. Since we have already + loaded at least 1 VEC from rsi it is also guranteed to be safe. 
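
In C terms, the page-cross entry above boils down to two small ideas: a check for whether either string is close enough to the end of its page that a full 4 * VEC_SIZE read could touch the next page, and a flag in r8 recording whether s1 and s2 were swapped (so that the string nearer the boundary is the one the page-cross code reads), with the final difference negated on return. A minimal sketch of both, for the byte-string case (constants and helper names are mine, not the patch's):

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE  32

    /* A 4 * VEC_SIZE read starting at P stays inside P's page only if
       P's offset within the page is at most PAGE_SIZE - 4 * VEC_SIZE.  */
    static inline int
    page_cross_possible (const void *p)
    {
      return ((uintptr_t) p & (PAGE_SIZE - 1)) > PAGE_SIZE - 4 * VEC_SIZE;
    }

    /* NEG is 0 when s1/s2 kept their order and -1 after a swap;
       (d ^ NEG) - NEG is then d or -d, matching the xor/sub pair used on
       the return paths above.  */
    static inline int
    signed_result (int d, int neg)
    {
      return (d ^ neg) - neg;
    }

The wide-character build uses different r8 constants and a setl/negl sequence instead, but the intent is the same: undo the pointer swap in the sign of the result.
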
+ */ + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0 + VPTESTM %YMM0, %YMM0, %k2 + VPCMP $0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0, %k1{%k2} + + kmovd %k1, %ecx +# ifdef USE_AS_STRNCMP + leal CHAR_PER_VEC(%OFFSET_REG64), %eax + cmpq %rax, %rdx + jbe L(check_ret_vec_page_cross2) +# ifdef USE_AS_WCSCMP + addq $-(CHAR_PER_VEC * 2), %rdx +# else + addq %rdi, %rdx +# endif # endif - ret + TESTEQ %ecx + jz L(prepare_loop_no_len) + .p2align 4,, 4 +L(ret_vec_page_cross): +# ifndef USE_AS_STRNCMP +L(check_ret_vec_page_cross): +# endif + tzcntl %ecx, %ecx + addl %OFFSET_REG, %ecx +L(ret_vec_page_cross_cont): # ifdef USE_AS_WCSCMP - .p2align 4 -L(different): - /* Use movl to avoid modifying EFLAGS. */ - movl $0, %eax + movl (%rdi, %rcx, SIZE_OF_CHAR), %edx + xorl %eax, %eax + cmpl (%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret12) setl %al negl %eax - orl $1, %eax - ret + xorl %r8d, %eax +# else + movzbl (%rdi, %rcx, SIZE_OF_CHAR), %eax + movzbl (%rsi, %rcx, SIZE_OF_CHAR), %ecx + subl %ecx, %eax + xorl %r8d, %eax + subl %r8d, %eax # endif +L(ret12): + ret + # ifdef USE_AS_STRNCMP - .p2align 4 -L(zero): + .p2align 4,, 10 +L(check_ret_vec_page_cross2): + TESTEQ %ecx +L(check_ret_vec_page_cross): + tzcntl %ecx, %ecx + addl %OFFSET_REG, %ecx + cmpq %rcx, %rdx + ja L(ret_vec_page_cross_cont) + .p2align 4,, 2 +L(ret_zero_page_cross): xorl %eax, %eax ret +# endif - .p2align 4 -L(char0): -# ifdef USE_AS_WCSCMP - xorl %eax, %eax - movl (%rdi), %ecx - cmpl (%rsi), %ecx - jne L(wcscmp_return) -# else - movzbl (%rsi), %ecx - movzbl (%rdi), %eax - subl %ecx, %eax -# endif - ret + .p2align 4,, 4 +L(page_cross_s2): + /* Ensure this is a true page cross. */ + subl $(PAGE_SIZE - VEC_SIZE * 4), %ecx + jbe L(no_page_cross) + + + movl %ecx, %eax + movq %rdi, %rcx + movq %rsi, %rdi + movq %rcx, %rsi + + /* set r8 to negate return value as rdi and rsi swapped. */ +# ifdef USE_AS_WCSCMP + movl $-4, %r8d +# else + movl $-1, %r8d # endif + xorl %OFFSET_REG, %OFFSET_REG - .p2align 4 -L(last_vector): - addq %rdx, %rdi - addq %rdx, %rsi -# ifdef USE_AS_STRNCMP - subq %rdx, %r11 + /* Check if more than 1x VEC till page cross. */ + subl $(VEC_SIZE * 3), %eax + jle L(page_cross_loop) + + .p2align 4,, 6 +L(less_1x_vec_till_page): +# ifdef USE_AS_WCSCMP + shrl $2, %eax # endif - tzcntl %ecx, %edx + /* Find largest load size we can use. */ + cmpl $(16 / SIZE_OF_CHAR), %eax + ja L(less_16_till_page) + + /* Use 16 byte comparison. */ + vmovdqu (%rdi), %xmm0 + VPTESTM %xmm0, %xmm0, %k2 + VPCMP $0, (%rsi), %xmm0, %k1{%k2} + kmovd %k1, %ecx # ifdef USE_AS_WCSCMP - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - sall $2, %edx + subl $0xf, %ecx +# else + incw %cx # endif + jnz L(check_ret_vec_page_cross) + movl $(16 / SIZE_OF_CHAR), %OFFSET_REG # ifdef USE_AS_STRNCMP - cmpq %r11, %rdx - jae L(zero) + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case0) + subl %eax, %OFFSET_REG +# else + /* Explicit check for 16 byte alignment. 
*/ + subl %eax, %OFFSET_REG + jz L(prepare_loop) # endif + vmovdqu (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 + VPTESTM %xmm0, %xmm0, %k2 + VPCMP $0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0, %k1{%k2} + kmovd %k1, %ecx # ifdef USE_AS_WCSCMP - xorl %eax, %eax - movl (%rdi, %rdx), %ecx - cmpl (%rsi, %rdx), %ecx - jne L(wcscmp_return) + subl $0xf, %ecx # else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %edx - subl %edx, %eax + incw %cx # endif + jnz L(check_ret_vec_page_cross) +# ifdef USE_AS_STRNCMP + addl $(16 / SIZE_OF_CHAR), %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case0) + subq $-(CHAR_PER_VEC * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# else + leaq (16 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq (16 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# endif + jmp L(prepare_loop_aligned) + +# ifdef USE_AS_STRNCMP + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case0): + xorl %eax, %eax ret +# endif - /* Comparing on page boundary region requires special treatment: - It must done one vector at the time, starting with the wider - ymm vector if possible, if not, with xmm. If fetching 16 bytes - (xmm) still passes the boundary, byte comparison must be done. - */ - .p2align 4 -L(cross_page): - /* Try one ymm vector at a time. */ - cmpl $(PAGE_SIZE - VEC_SIZE), %eax - jg L(cross_page_1_vector) -L(loop_1_vector): - VMOVU (%rdi, %rdx), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in YMM0 and 32 bytes at (%rsi, %rdx). */ - VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2} + .p2align 4,, 10 +L(less_16_till_page): + cmpl $(24 / SIZE_OF_CHAR), %eax + ja L(less_8_till_page) + + /* Use 8 byte comparison. */ + vmovq (%rdi), %xmm0 + vmovq (%rsi), %xmm1 + VPTESTM %xmm0, %xmm0, %k2 + VPCMP $0, %xmm1, %xmm0, %k1{%k2} kmovd %k1, %ecx # ifdef USE_AS_WCSCMP - subl $0xff, %ecx + subl $0x3, %ecx # else - incl %ecx + incb %cl # endif - jne L(last_vector) + jnz L(check_ret_vec_page_cross) - addl $VEC_SIZE, %edx - addl $VEC_SIZE, %eax # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) + cmpq $(8 / SIZE_OF_CHAR), %rdx + jbe L(ret_zero_page_cross_slow_case0) # endif - cmpl $(PAGE_SIZE - VEC_SIZE), %eax - jle L(loop_1_vector) -L(cross_page_1_vector): - /* Less than 32 bytes to check, try one xmm vector. */ - cmpl $(PAGE_SIZE - 16), %eax - jg L(cross_page_1_xmm) - VMOVU (%rdi, %rdx), %XMM0 + movl $(24 / SIZE_OF_CHAR), %OFFSET_REG + subl %eax, %OFFSET_REG - VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in XMM0 and 16 bytes at (%rsi, %rdx). */ - VPCMP $0, (%rsi, %rdx), %XMM0, %k1{%k2} + vmovq (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 + vmovq (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1 + VPTESTM %xmm0, %xmm0, %k2 + VPCMP $0, %xmm1, %xmm0, %k1{%k2} kmovd %k1, %ecx # ifdef USE_AS_WCSCMP - subl $0xf, %ecx + subl $0x3, %ecx # else - subl $0xffff, %ecx + incb %cl # endif - jne L(last_vector) + jnz L(check_ret_vec_page_cross) + - addl $16, %edx -# ifndef USE_AS_WCSCMP - addl $16, %eax -# endif # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). 
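
The incw/incb and subl $0xf/$0x3 instructions after each kmovd above are just an "all lanes clean?" test: the k-mask has one bit per probed character, set when that character is non-NUL in s1 and equal in s2, so a fully set mask becomes zero after adding one (or after subtracting the all-ones constant). Roughly, in C (helper and parameter names are illustrative):

    /* LANES is the number of characters covered by the probe (at most 16
       here).  A return value of 1 means no NUL and no mismatch, so the
       scan can continue; anything else sends control to the
       L(check_ret_vec_page_cross) return path.  */
    static inline int
    probe_clean (unsigned int kmask, unsigned int lanes)
    {
      return kmask == (1u << lanes) - 1;
    }
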
*/ - cmpq %r11, %rdx - jae L(zero) + addl $(8 / SIZE_OF_CHAR), %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case0) + subq $-(CHAR_PER_VEC * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# else + leaq (8 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq (8 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi # endif + jmp L(prepare_loop_aligned) -L(cross_page_1_xmm): -# ifndef USE_AS_WCSCMP - /* Less than 16 bytes to check, try 8 byte vector. NB: No need - for wcscmp nor wcsncmp since wide char is 4 bytes. */ - cmpl $(PAGE_SIZE - 8), %eax - jg L(cross_page_8bytes) - vmovq (%rdi, %rdx), %XMM0 - vmovq (%rsi, %rdx), %XMM1 - VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in XMM0 and XMM1. */ - VPCMP $0, %XMM1, %XMM0, %k1{%k2} - kmovb %k1, %ecx + + + .p2align 4,, 10 +L(less_8_till_page): # ifdef USE_AS_WCSCMP - subl $0x3, %ecx + /* If using wchar then this is the only check before we reach + the page boundary. */ + movl (%rdi), %eax + movl (%rsi), %ecx + cmpl %ecx, %eax + jnz L(ret_less_8_wcs) +# ifdef USE_AS_STRNCMP + addq $-(CHAR_PER_VEC * 2), %rdx + /* We already checked for len <= 1 so cannot hit that case here. + */ +# endif + testl %eax, %eax + jnz L(prepare_loop) + ret + + .p2align 4,, 8 +L(ret_less_8_wcs): + setl %OFFSET_REG8 + negl %OFFSET_REG + movl %OFFSET_REG, %eax + xorl %r8d, %eax + ret + # else - subl $0xff, %ecx -# endif - jne L(last_vector) + cmpl $28, %eax + ja L(less_4_till_page) + + vmovd (%rdi), %xmm0 + vmovd (%rsi), %xmm1 + VPTESTM %xmm0, %xmm0, %k2 + VPCMP $0, %xmm1, %xmm0, %k1{%k2} + kmovd %k1, %ecx + subl $0xf, %ecx + jnz L(check_ret_vec_page_cross) - addl $8, %edx - addl $8, %eax # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) + cmpq $4, %rdx + jbe L(ret_zero_page_cross_slow_case1) # endif + movl $(28 / SIZE_OF_CHAR), %OFFSET_REG + subl %eax, %OFFSET_REG -L(cross_page_8bytes): - /* Less than 8 bytes to check, try 4 byte vector. */ - cmpl $(PAGE_SIZE - 4), %eax - jg L(cross_page_4bytes) - vmovd (%rdi, %rdx), %XMM0 - vmovd (%rsi, %rdx), %XMM1 - - VPTESTM %YMM0, %YMM0, %k2 - /* Each bit cleared in K1 represents a mismatch or a null CHAR - in XMM0 and XMM1. */ - VPCMP $0, %XMM1, %XMM0, %k1{%k2} + vmovd (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 + vmovd (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm1 + VPTESTM %xmm0, %xmm0, %k2 + VPCMP $0, %xmm1, %xmm0, %k1{%k2} kmovd %k1, %ecx -# ifdef USE_AS_WCSCMP - subl $0x1, %ecx -# else subl $0xf, %ecx -# endif - jne L(last_vector) + jnz L(check_ret_vec_page_cross) +# ifdef USE_AS_STRNCMP + addl $(4 / SIZE_OF_CHAR), %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case1) + subq $-(CHAR_PER_VEC * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# else + leaq (4 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq (4 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# endif + jmp L(prepare_loop_aligned) + - addl $4, %edx # ifdef USE_AS_STRNCMP - /* Return 0 if the current offset (%rdx) >= the maximum offset - (%r11). */ - cmpq %r11, %rdx - jae L(zero) + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case1): + xorl %eax, %eax + ret # endif -L(cross_page_4bytes): -# endif - /* Less than 4 bytes to check, try one byte/dword at a time. 
*/ -# ifdef USE_AS_STRNCMP - cmpq %r11, %rdx - jae L(zero) -# endif -# ifdef USE_AS_WCSCMP - movl (%rdi, %rdx), %eax - movl (%rsi, %rdx), %ecx -# else - movzbl (%rdi, %rdx), %eax - movzbl (%rsi, %rdx), %ecx -# endif - testl %eax, %eax - jne L(cross_page_loop) + .p2align 4,, 10 +L(less_4_till_page): + subq %rdi, %rsi + /* Extremely slow byte comparison loop. */ +L(less_4_loop): + movzbl (%rdi), %eax + movzbl (%rsi, %rdi), %ecx subl %ecx, %eax + jnz L(ret_less_4_loop) + testl %ecx, %ecx + jz L(ret_zero_4_loop) +# ifdef USE_AS_STRNCMP + decq %rdx + jz L(ret_zero_4_loop) +# endif + incq %rdi + /* end condition is reach page boundary (rdi is aligned). */ + testl $31, %edi + jnz L(less_4_loop) + leaq -(VEC_SIZE * 4)(%rdi, %rsi), %rsi + addq $-(VEC_SIZE * 4), %rdi +# ifdef USE_AS_STRNCMP + subq $-(CHAR_PER_VEC * 4), %rdx +# endif + jmp L(prepare_loop_aligned) + +L(ret_zero_4_loop): + xorl %eax, %eax + ret +L(ret_less_4_loop): + xorl %r8d, %eax + subl %r8d, %eax ret -END (STRCMP) +# endif +END(STRCMP) #endif From patchwork Mon Jan 10 21:35:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 49816 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 2B697388881B for ; Mon, 10 Jan 2022 21:39:03 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2B697388881B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1641850743; bh=bY2VZ0Y5BMIhyq+ySmKAWBqkWf2JOc4fO3B1Z6wOFUU=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=fe8gE2Pn0uSnIHpjEthXorKaRn16DbGNncxCvh7OsRtN3k9chULrqMskxMX0hYfw/ QSXkBfkscfskxsiLIfnXVy+8BvhXf0Jl4RSUvH7gvp0dlnKx4JV+KnT4Qh3XDL2Cd9 zVPBZG3LPrP1ha9P2IGQ1ot04QzDUD0cVWZ6biGs= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-pl1-x634.google.com (mail-pl1-x634.google.com [IPv6:2607:f8b0:4864:20::634]) by sourceware.org (Postfix) with ESMTPS id 5F5E1388A024 for ; Mon, 10 Jan 2022 21:35:55 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 5F5E1388A024 Received: by mail-pl1-x634.google.com with SMTP id i6so14078432pla.0 for ; Mon, 10 Jan 2022 13:35:55 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=bY2VZ0Y5BMIhyq+ySmKAWBqkWf2JOc4fO3B1Z6wOFUU=; b=QL8AvPZlfvWBCdPpDRnBqEhPb8GoI85mg7Xv/Xv8xrrp/ew2aXY8HQMVYQpC4Y3xV9 VItt8jGijmEraYT7DAOp0IF/Qu/7RkXLREGUhgsfoxIT7htqJbn/gUBj2sxrHAsgvnX5 Bygl/AkkLgodrngqK3HMPwPSjTuvhM/kqYUzFt84UAxFbEhjSeeoNuK9smzBlTzu7UIe bVcj+C+igNnxsIAbhzFeOmgm7ioqoNT4nOAj6rqZFnmSGY6ngNCYlTuLuERDp1pay7Sl AKZLkAEiz8JP6kTvGJ0/kSbcJZRo9MFTpamztcCO6YOOkoFxQTWFT6RxjgZndFdC6xKZ iF6A== X-Gm-Message-State: AOAM530IyIH8tyAHnAlWZCBpAvEzfUOOog0Xs3x8XXsxDQ9sGelzoXTS NC3A4SRrK/RZnf+oaagGP1FVbpMVE1A= X-Google-Smtp-Source: ABdhPJyqQG61KUF60MNorfOnaDH71eCJqJGqxsHiyKI/pLtJUSl/y/LC0hMf0Sei+RFvjPgx5UAceQ== X-Received: by 2002:a63:d2:: with SMTP id 201mr1469640pga.56.1641850554182; Mon, 10 Jan 2022 13:35:54 -0800 (PST) Received: from noah-tigerlake.webpass.net (136-24-166-223.cab.webpass.net. 
[136.24.166.223]) by smtp.googlemail.com with ESMTPSA id f12sm7996515pfe.127.2022.01.10.13.35.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Jan 2022 13:35:53 -0800 (PST) To: libc-alpha@sourceware.org Subject: [PATCH v3 7/7] benchtests: Add more coverage for strcmp and strncmp benchmarks Date: Mon, 10 Jan 2022 15:35:40 -0600 Message-Id: <20220110213540.1258344-7-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220110213540.1258344-1-goldstein.w.n@gmail.com> References: <20220109122946.2754917-1-goldstein.w.n@gmail.com> <20220110213540.1258344-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" Add more small and medium sized tests for strcmp and strncmp. As well for strcmp add option for more direct control of alignment. Previously alignment was being pushed to the end of the page. While this is the most difficult case to implement, it is far from the common case and so shouldn't be the only benchmark. Signed-off-by: Noah Goldstein --- benchtests/bench-strcmp.c | 142 ++++++++++++++++++++++++++----------- benchtests/bench-strncmp.c | 110 ++++++++++++++++++++-------- 2 files changed, 183 insertions(+), 69 deletions(-) diff --git a/benchtests/bench-strcmp.c b/benchtests/bench-strcmp.c index 387e76fcfb..3a60edfb15 100644 --- a/benchtests/bench-strcmp.c +++ b/benchtests/bench-strcmp.c @@ -99,8 +99,8 @@ do_one_test (json_ctx_t *json_ctx, impl_t *impl, } static void -do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int - max_char, int exp_result) +do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, + int max_char, int exp_result, int at_end) { size_t i; @@ -109,19 +109,28 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int if (len == 0) return; - align1 &= 63; + align1 &= ~(CHARBYTES - 1); + align2 &= ~(CHARBYTES - 1); + + align1 &= (getpagesize () - 1); if (align1 + (len + 1) * CHARBYTES >= page_size) return; - align2 &= 63; + align2 &= (getpagesize () - 1); if (align2 + (len + 1) * CHARBYTES >= page_size) return; /* Put them close to the end of page. 
*/ - i = align1 + CHARBYTES * (len + 2); - s1 = (CHAR *) (buf1 + ((page_size - i) / 16 * 16) + align1); - i = align2 + CHARBYTES * (len + 2); - s2 = (CHAR *) (buf2 + ((page_size - i) / 16 * 16) + align2); + if (at_end) + { + i = align1 + CHARBYTES * (len + 2); + align1 = ((page_size - i) / 16 * 16) + align1; + i = align2 + CHARBYTES * (len + 2); + align2 = ((page_size - i) / 16 * 16) + align2; + } + + s1 = (CHAR *)(buf1 + align1); + s2 = (CHAR *)(buf2 + align2); for (i = 0; i < len; i++) s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char; @@ -132,9 +141,9 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, int s2[len - 1] -= exp_result; json_element_object_begin (json_ctx); - json_attr_uint (json_ctx, "length", (double) len); - json_attr_uint (json_ctx, "align1", (double) align1); - json_attr_uint (json_ctx, "align2", (double) align2); + json_attr_uint (json_ctx, "length", (double)len); + json_attr_uint (json_ctx, "align1", (double)align1); + json_attr_uint (json_ctx, "align2", (double)align2); json_array_begin (json_ctx, "timings"); FOR_EACH_IMPL (impl, 0) @@ -202,7 +211,8 @@ int test_main (void) { json_ctx_t json_ctx; - size_t i; + size_t i, j, k; + size_t pg_sz = getpagesize (); test_init (); @@ -221,36 +231,88 @@ test_main (void) json_array_end (&json_ctx); json_array_begin (&json_ctx, "results"); - - for (i = 1; i < 32; ++i) - { - do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0); - do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1); - do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1); - } - - for (i = 1; i < 10 + CHARBYTESLOG; ++i) + for (k = 0; k < 2; ++k) { - do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0); - do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0); - do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1); - do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1); - do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1); - do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1); - do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1); - do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i, LARGECHAR, 1); + for (i = 1; i < 32; ++i) + { + do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 0, k); + do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, 1, k); + do_test (&json_ctx, CHARBYTES * i, CHARBYTES * i, i, MIDCHAR, -1, k); + } + + for (i = 1; i <= 8192;) + { + /* No page crosses. */ + do_test (&json_ctx, 0, 0, i, MIDCHAR, 0, k); + do_test (&json_ctx, i * CHARBYTES, 0, i, MIDCHAR, 0, k); + do_test (&json_ctx, 0, i * CHARBYTES, i, MIDCHAR, 0, k); + + /* False page crosses. */ + do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, MIDCHAR, 0, + k); + do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, MIDCHAR, 0, + k); + + do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, MIDCHAR, 0, k); + do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, MIDCHAR, 0, k); + + /* Real page cross. 
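
A note on the new coverage classes: the "no page cross" and "real page cross" calls are self-explanatory, while the "false page cross" pair (pg_sz / 2 and pg_sz / 2 - CHARBYTES) targets the cheap entry check these implementations run before taking the expensive page-cross path, which looks at the two pointers' page offsets together and can therefore fire even when neither read actually crosses. A small illustrative sketch of that kind of check (not code from the patch):

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_SIZE  32

    /* Cheap combined test: treat the pair as a possible page cross when
       the OR of the two page offsets lands within 4 * VEC_SIZE of a page
       end.  */
    static inline int
    maybe_page_cross (uintptr_t s1, uintptr_t s2)
    {
      return ((s1 | s2) & (PAGE_SIZE - 1)) > PAGE_SIZE - 4 * VEC_SIZE;
    }

With the byte-string offsets used above, 2048 | 2047 == 4095, so the test fires even though neither string is within 128 bytes of a page end; that is the false-positive path the benchmark now times, while the pg_sz - j cases exercise genuine crosses.
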
*/ + for (j = 16; j < 128; j += 16) + { + do_test (&json_ctx, pg_sz - j, 0, i, MIDCHAR, 0, k); + do_test (&json_ctx, 0, pg_sz - j, i, MIDCHAR, 0, k); + + do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, MIDCHAR, 0, k); + do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, MIDCHAR, 0, k); + } + + if (i < 32) + { + ++i; + } + else if (i < 160) + { + i += 8; + } + else if (i < 512) + { + i += 32; + } + else + { + i *= 2; + } + } + + for (i = 1; i < 10 + CHARBYTESLOG; ++i) + { + do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 0, k); + do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 0, k); + do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, 1, k); + do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, 1, k); + do_test (&json_ctx, 0, 0, 2 << i, MIDCHAR, -1, k); + do_test (&json_ctx, 0, 0, 2 << i, LARGECHAR, -1, k); + do_test (&json_ctx, 0, CHARBYTES * i, 2 << i, MIDCHAR, 1, k); + do_test (&json_ctx, CHARBYTES * i, CHARBYTES * (i + 1), 2 << i, + LARGECHAR, 1, k); + } + + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, + MIDCHAR, 0, k); + do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, + LARGECHAR, 0, k); + do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, + MIDCHAR, 1, k); + do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, + LARGECHAR, 1, k); + do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, + MIDCHAR, -1, k); + do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, + LARGECHAR, -1, k); + } } - - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 0); - do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 0); - do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, 1); - do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, 1); - do_test (&json_ctx, CHARBYTES * i, 2 * CHARBYTES * i, 8 << i, MIDCHAR, -1); - do_test (&json_ctx, 2 * CHARBYTES * i, CHARBYTES * i, 8 << i, LARGECHAR, -1); - } - do_test_page_boundary (&json_ctx); json_array_end (&json_ctx); diff --git a/benchtests/bench-strncmp.c b/benchtests/bench-strncmp.c index b7a01fde64..6673a53521 100644 --- a/benchtests/bench-strncmp.c +++ b/benchtests/bench-strncmp.c @@ -150,43 +150,43 @@ do_test (json_ctx_t *json_ctx, size_t align1, size_t align2, size_t len, size_t if (n == 0) return; - align1 &= 63; + align1 &= getpagesize () - 1; if (align1 + (n + 1) * CHARBYTES >= page_size) return; - align2 &= 7; + align2 &= getpagesize () - 1; if (align2 + (n + 1) * CHARBYTES >= page_size) return; json_element_object_begin (json_ctx); - json_attr_uint (json_ctx, "strlen", (double) len); - json_attr_uint (json_ctx, "len", (double) n); - json_attr_uint (json_ctx, "align1", (double) align1); - json_attr_uint (json_ctx, "align2", (double) align2); + json_attr_uint (json_ctx, "strlen", (double)len); + json_attr_uint (json_ctx, "len", (double)n); + json_attr_uint (json_ctx, "align1", (double)align1); + json_attr_uint (json_ctx, "align2", (double)align2); json_array_begin (json_ctx, "timings"); FOR_EACH_IMPL (impl, 0) - { - alloc_bufs (); - s1 = (CHAR *) (buf1 + align1); - s2 = (CHAR *) (buf2 + align2); - - for (i = 0; i < n; i++) - s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char; - - s1[n] = 24 + exp_result; - s2[n] = 23; - s1[len] = 0; - s2[len] = 0; - if (exp_result < 0) - s2[len] = 32; - else if (exp_result > 0) - s1[len] = 64; - if (len >= n) - s2[n - 1] -= exp_result; + { + alloc_bufs (); + s1 = (CHAR *)(buf1 + align1); + s2 = (CHAR *)(buf2 + align2); + + for 
(i = 0; i < n; i++) + s1[i] = s2[i] = 1 + (23 << ((CHARBYTES - 1) * 8)) * i % max_char; + + s1[n] = 24 + exp_result; + s2[n] = 23; + s1[len] = 0; + s2[len] = 0; + if (exp_result < 0) + s2[len] = 32; + else if (exp_result > 0) + s1[len] = 64; + if (len >= n) + s2[n - 1] -= exp_result; - do_one_test (json_ctx, impl, s1, s2, n, exp_result); - } + do_one_test (json_ctx, impl, s1, s2, n, exp_result); + } json_array_end (json_ctx); json_element_object_end (json_ctx); @@ -319,7 +319,8 @@ int test_main (void) { json_ctx_t json_ctx; - size_t i; + size_t i, j, len; + size_t pg_sz = getpagesize (); test_init (); @@ -334,12 +335,12 @@ test_main (void) json_array_begin (&json_ctx, "ifuncs"); FOR_EACH_IMPL (impl, 0) - json_element_string (&json_ctx, impl->name); + json_element_string (&json_ctx, impl->name); json_array_end (&json_ctx); json_array_begin (&json_ctx, "results"); - for (i =0; i < 16; ++i) + for (i = 0; i < 16; ++i) { do_test (&json_ctx, 0, 0, 8, i, 127, 0); do_test (&json_ctx, 0, 0, 8, i, 127, -1); @@ -361,6 +362,57 @@ test_main (void) do_test (&json_ctx, i, 3 * i, 8, i, 255, -1); } + for (len = 0; len <= 128; len += 64) + { + for (i = 1; i <= 8192;) + { + /* No page crosses. */ + do_test (&json_ctx, 0, 0, i, i + len, 127, 0); + do_test (&json_ctx, i * CHARBYTES, 0, i, i + len, 127, 0); + do_test (&json_ctx, 0, i * CHARBYTES, i, i + len, 127, 0); + + /* False page crosses. */ + do_test (&json_ctx, pg_sz / 2, pg_sz / 2 - CHARBYTES, i, i + len, + 127, 0); + do_test (&json_ctx, pg_sz / 2 - CHARBYTES, pg_sz / 2, i, i + len, + 127, 0); + + do_test (&json_ctx, pg_sz - (i * CHARBYTES), 0, i, i + len, 127, + 0); + do_test (&json_ctx, 0, pg_sz - (i * CHARBYTES), i, i + len, 127, + 0); + + /* Real page cross. */ + for (j = 16; j < 128; j += 16) + { + do_test (&json_ctx, pg_sz - j, 0, i, i + len, 127, 0); + do_test (&json_ctx, 0, pg_sz - j, i, i + len, 127, 0); + + do_test (&json_ctx, pg_sz - j, pg_sz - j / 2, i, i + len, + 127, 0); + do_test (&json_ctx, pg_sz - j / 2, pg_sz - j, i, i + len, + 127, 0); + } + + if (i < 32) + { + ++i; + } + else if (i < 160) + { + i += 8; + } + else if (i < 256) + { + i += 32; + } + else + { + i *= 2; + } + } + } + for (i = 1; i < 8; ++i) { do_test (&json_ctx, 0, 0, 8 << i, 16 << i, 127, 0);