From patchwork Tue Jul 18 14:15:38 2023
X-Patchwork-Submitter: Adhemerval Zanella Netto
X-Patchwork-Id: 72881
To: libc-alpha@sourceware.org, Paul Eggert, Alexander Monakov
Subject: [PATCH v6 1/6] stdlib: Optimization qsort{_r} swap implementation
Date: Tue, 18 Jul 2023 11:15:38 -0300
Message-Id: <20230718141543.3925254-2-adhemerval.zanella@linaro.org>
In-Reply-To: <20230718141543.3925254-1-adhemerval.zanella@linaro.org>
References: <20230718141543.3925254-1-adhemerval.zanella@linaro.org>
From: Adhemerval Zanella Netto
Reply-To: Adhemerval Zanella

The optimization takes into consideration that the most common element
sizes are either 32 or 64 bits and that inputs are aligned to the word
boundary.  This is similar to what msort does.  For larger buffers, the
swap operation uses memcpy/mempcpy with a small fixed-size buffer (so the
compiler might inline the operations).

Checked on x86_64-linux-gnu.
---
 stdlib/qsort.c | 116 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 98 insertions(+), 18 deletions(-)

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 728a0ed370..5bcc287c79 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -23,20 +23,92 @@
 #include
 #include
 #include
+#include
 
-/* Byte-wise swap two items of size SIZE.  */
-#define SWAP(a, b, size)						      \
-  do									      \
-    {									      \
-      size_t __size = (size);						      \
-      char *__a = (a), *__b = (b);					      \
-      do								      \
-	{								      \
-	  char __tmp = *__a;						      \
-	  *__a++ = *__b;						      \
-	  *__b++ = __tmp;						      \
-	} while (--__size > 0);						      \
-    } while (0)
+/* Swap SIZE bytes between addresses A and B.  These helpers are provided
+   alongside the generic one as an optimization.  */
+
+enum swap_type_t
+  {
+    SWAP_WORDS_64,
+    SWAP_WORDS_32,
+    SWAP_BYTES
+  };
+
+/* If this function returns true, elements can be safely copied using word
+   loads and stores.  Otherwise, it might not be safe.  BASE (as an integer)
+   must be a multiple of the word alignment.  SIZE must be a multiple of
+   WORDSIZE.  Since WORDSIZE must be a multiple of the word alignment, and
+   WORDSIZE is a power of two on all supported platforms, this function for
+   speed merely checks that BASE and SIZE are both multiples of the word
+   size.  */
+static inline bool
+is_aligned (const void *base, size_t size, size_t wordsize)
+{
+  return (((uintptr_t) base | size) & (wordsize - 1)) == 0;
+}
+
+static inline void
+swap_words_64 (void * restrict a, void * restrict b, size_t n)
+{
+  typedef uint64_t __attribute__ ((__may_alias__)) u64_alias_t;
+  do
+    {
+      n -= 8;
+      u64_alias_t t = *(u64_alias_t *)(a + n);
+      *(u64_alias_t *)(a + n) = *(u64_alias_t *)(b + n);
+      *(u64_alias_t *)(b + n) = t;
+    } while (n);
+}
+
+static inline void
+swap_words_32 (void * restrict a, void * restrict b, size_t n)
+{
+  typedef uint32_t __attribute__ ((__may_alias__)) u32_alias_t;
+  do
+    {
+      n -= 4;
+      u32_alias_t t = *(u32_alias_t *)(a + n);
+      *(u32_alias_t *)(a + n) = *(u32_alias_t *)(b + n);
+      *(u32_alias_t *)(b + n) = t;
+    } while (n);
+}
+
+static inline void
+swap_bytes (void * restrict a, void * restrict b, size_t n)
+{
+  /* Use multiple small memcpys with constant size to enable inlining
+     on most targets.  */
+  enum { SWAP_GENERIC_SIZE = 32 };
+  unsigned char tmp[SWAP_GENERIC_SIZE];
+  while (n > SWAP_GENERIC_SIZE)
+    {
+      memcpy (tmp, a, SWAP_GENERIC_SIZE);
+      a = __mempcpy (a, b, SWAP_GENERIC_SIZE);
+      b = __mempcpy (b, tmp, SWAP_GENERIC_SIZE);
+      n -= SWAP_GENERIC_SIZE;
+    }
+  while (n > 0)
+    {
+      unsigned char t = ((unsigned char *)a)[--n];
+      ((unsigned char *)a)[n] = ((unsigned char *)b)[n];
+      ((unsigned char *)b)[n] = t;
+    }
+}
+
+/* Replace the indirect call with a series of if statements.  It should help
+   the branch predictor.  */
+static void
+do_swap (void * restrict a, void * restrict b, size_t size,
+	 enum swap_type_t swap_func)
+{
+  if (swap_func == SWAP_WORDS_64)
+    swap_words_64 (a, b, size);
+  else if (swap_func == SWAP_WORDS_32)
+    swap_words_32 (a, b, size);
+  else
+    swap_bytes (a, b, size);
+}
 
 /* Discontinue quicksort algorithm when partition gets below this size.
    This particular magic number was chosen to work best on a Sun 4/260.  */
@@ -96,6 +168,14 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
     /* Avoid lossage with unsigned arithmetic below.  */
     return;
 
+  enum swap_type_t swap_type;
+  if (is_aligned (pbase, size, 8))
+    swap_type = SWAP_WORDS_64;
+  else if (is_aligned (pbase, size, 4))
+    swap_type = SWAP_WORDS_32;
+  else
+    swap_type = SWAP_BYTES;
+
   if (total_elems > MAX_THRESH)
     {
       char *lo = base_ptr;
@@ -119,13 +199,13 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 	  char *mid = lo + size * ((hi - lo) / size >> 1);
 
 	  if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-	    SWAP (mid, lo, size);
+	    do_swap (mid, lo, size, swap_type);
 	  if ((*cmp) ((void *) hi, (void *) mid, arg) < 0)
-	    SWAP (mid, hi, size);
+	    do_swap (mid, hi, size, swap_type);
 	  else
 	    goto jump_over;
 	  if ((*cmp) ((void *) mid, (void *) lo, arg) < 0)
-	    SWAP (mid, lo, size);
+	    do_swap (mid, lo, size, swap_type);
 	jump_over:;
 
 	  left_ptr  = lo + size;
@@ -144,7 +224,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
 
 	      if (left_ptr < right_ptr)
 		{
-		  SWAP (left_ptr, right_ptr, size);
+		  do_swap (left_ptr, right_ptr, size, swap_type);
 		  if (mid == left_ptr)
 		    mid = right_ptr;
 		  else if (mid == right_ptr)
@@ -216,7 +296,7 @@ _quicksort (void *const pbase, size_t total_elems, size_t size,
       tmp_ptr = run_ptr;
 
     if (tmp_ptr != base_ptr)
-      SWAP (tmp_ptr, base_ptr, size);
+      do_swap (tmp_ptr, base_ptr, size, swap_type);
 
     /* Insertion sort, running from left-hand-side
        up to right-hand-side.  */
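
As a rough standalone illustration of the idea described in the commit
message (this is not the glibc code; the helper names and the small test in
main are made up for the example), the swap-type selection boils down to
checking whether the base pointer and the element size are both multiples of
the word size, and falling back to a buffered byte-wise swap otherwise:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static bool
is_word_swappable (const void *base, size_t size, size_t wordsize)
{
  /* WORDSIZE is assumed to be a power of two, so this checks that both
     BASE and SIZE are multiples of it.  */
  return (((uintptr_t) base | size) & (wordsize - 1)) == 0;
}

static void
swap_bytes_buffered (void *a, void *b, size_t n)
{
  /* Swap N bytes through a small fixed-size buffer so every memcpy has a
     constant size and can be inlined by the compiler.  */
  enum { CHUNK = 32 };
  unsigned char tmp[CHUNK];
  unsigned char *pa = a, *pb = b;
  while (n >= CHUNK)
    {
      memcpy (tmp, pa, CHUNK);
      memcpy (pa, pb, CHUNK);
      memcpy (pb, tmp, CHUNK);
      pa += CHUNK;
      pb += CHUNK;
      n -= CHUNK;
    }
  while (n > 0)
    {
      unsigned char t = pa[--n];
      pa[n] = pb[n];
      pb[n] = t;
    }
}

int
main (void)
{
  uint64_t v[2] = { 1, 2 };
  /* 8-byte elements at an 8-byte-aligned base: a sort over V could take
     the 64-bit word-swap route.  */
  printf ("word swap usable: %d\n", is_word_swappable (v, sizeof v[0], 8));

  /* 13-byte elements can only use the generic byte-wise fallback.  */
  char s[2][13] = { "hello, world", "dlrow ,olleh" };
  swap_bytes_buffered (s[0], s[1], sizeof s[0]);
  printf ("%s | %s\n", s[0], s[1]);
  return 0;
}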