From patchwork Fri Oct 22 17:28:49 2021
X-Patchwork-Submitter: "Paul A. Clarke"
X-Patchwork-Id: 46534
To: segher@kernel.crashing.org, gcc-patches@gcc.gnu.org
Subject: [PATCH] rs6000: Add optimizations for _mm_sad_epu8
Date: Fri, 22 Oct 2021 12:28:49 -0500
Message-Id: <20211022172849.499625-1-pc@us.ibm.com>
From: "Paul A. Clarke" <pc@us.ibm.com>
Cc: wschmidt@linux.ibm.com

The Power9 ISA added the `vabsdub` instruction, which is realized in the
`vec_absd` intrinsic.  Use `vec_absd` for the `_mm_sad_epu8` compatibility
intrinsic when `_ARCH_PWR9` is defined.

Also, the realization of `vec_sum2s` on little-endian includes two shifts
in order to position the input and output to match the semantics of
`vec_sum2s`:
- Shift the second input vector left 12 bytes.  In the current usage, that
  vector is `{0}`, so this shift is unnecessary, but it is currently not
  eliminated under optimization.
- Shift the vector produced by the `vsum2sws` instruction left 4 bytes.
  The two words within each doubleword of this (shifted) result must then
  be explicitly swapped to match the semantics of `_mm_sad_epu8`,
  effectively reversing this shift.  So, this shift (and a subsequent
  swap) are unnecessary, but not currently removed under optimization.

Using `__builtin_altivec_vsum2sws` retains both shifts, so it is not an
option for removing the shifts.

For little-endian, use the `vsum2sws` instruction directly, and eliminate
the explicit shift (swap).

2021-10-22  Paul A. Clarke  <pc@us.ibm.com>

gcc
	* config/rs6000/emmintrin.h (_mm_sad_epu8): Use vec_absd when
	_ARCH_PWR9, optimize vec_sum2s when LE.
---
Tested on powerpc64le-linux on Power9, with and without `-mcpu=power9`,
and on powerpc/powerpc64-linux on Power8.  OK for trunk?

 gcc/config/rs6000/emmintrin.h | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/gcc/config/rs6000/emmintrin.h b/gcc/config/rs6000/emmintrin.h
index ab16c13c379e..c4758be0e777 100644
--- a/gcc/config/rs6000/emmintrin.h
+++ b/gcc/config/rs6000/emmintrin.h
@@ -2197,27 +2197,37 @@ extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __arti
 _mm_sad_epu8 (__m128i __A, __m128i __B)
 {
   __v16qu a, b;
-  __v16qu vmin, vmax, vabsdiff;
+  __v16qu vabsdiff;
   __v4si vsum;
   const __v4su zero = { 0, 0, 0, 0 };
   __v4si result;
 
   a = (__v16qu) __A;
   b = (__v16qu) __B;
-  vmin = vec_min (a, b);
-  vmax = vec_max (a, b);
+#ifndef _ARCH_PWR9
+  __v16qu vmin = vec_min (a, b);
+  __v16qu vmax = vec_max (a, b);
   vabsdiff = vec_sub (vmax, vmin);
+#else
+  vabsdiff = vec_absd (a, b);
+#endif
   /* Sum four groups of bytes into integers.  */
   vsum = (__vector signed int) vec_sum4s (vabsdiff, zero);
+#ifdef __LITTLE_ENDIAN__
+  /* Sum across four integers with two integer results.  */
+  asm ("vsum2sws %0,%1,%2" : "=v" (result) : "v" (vsum), "v" (zero));
+  /* Note: vec_sum2s could be used here, but on little-endian, vector
+     shifts are added that are not needed for this use-case.
+     A vector shift to correctly position the 32-bit integer results
+     (currently at [0] and [2]) to [1] and [3] would then need to be
+     swapped back again since the desired results are two 64-bit
+     integers ([1]|[0] and [3]|[2]).  Thus, no shift is performed.  */
+#else
   /* Sum across four integers with two integer results.  */
   result = vec_sum2s (vsum, (__vector signed int) zero);
   /* Rotate the sums into the correct position.  */
-#ifdef __LITTLE_ENDIAN__
-  result = vec_sld (result, result, 4);
-#else
   result = vec_sld (result, result, 6);
 #endif
-  /* Rotate the sums into the correct position.  */
   return (__m128i) result;
 }
 
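
As a side note for review: below is a minimal sketch isolating just the
absolute-difference step, contrasting the pre-Power9 min/max/subtract
sequence with the single-instruction Power9 form.  The function names
(`absdiff_p8`, `absdiff_p9`) are invented here for illustration, and it
assumes compilation with `-maltivec` (plus `-mcpu=power9` for the
`vec_absd` path); it is not part of the patch.

#include <altivec.h>

__vector unsigned char
absdiff_p8 (__vector unsigned char a, __vector unsigned char b)
{
  /* For unsigned elements, |a - b| == max (a, b) - min (a, b).  */
  return vec_sub (vec_max (a, b), vec_min (a, b));
}

#ifdef _ARCH_PWR9
__vector unsigned char
absdiff_p9 (__vector unsigned char a, __vector unsigned char b)
{
  /* vec_absd realizes the Power9 vabsdub instruction, computing the
     same result in one step.  */
  return vec_absd (a, b);
}
#endif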
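
Similarly, a small self-contained check of `_mm_sad_epu8` against a
scalar reference may be handy when testing.  This is a sketch, not a
proposed testcase: the file name and compile flags are illustrative, the
input values are arbitrary, and it assumes powerpc64le so that copying
the result out with `memcpy` puts the SAD of the first eight bytes in
`got[0]`.

/* Compile on powerpc64le with something like:
     gcc -O2 -DNO_WARN_X86_INTRINSICS sad-check.c  */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  uint8_t a[16], b[16];
  for (int i = 0; i < 16; i++)
    {
      a[i] = (uint8_t) (i * 17 + 3);   /* Arbitrary test values.  */
      b[i] = (uint8_t) (250 - i * 9);
    }

  __m128i va, vb;
  memcpy (&va, a, 16);
  memcpy (&vb, b, 16);

  uint64_t got[2];
  __m128i vr = _mm_sad_epu8 (va, vb);
  memcpy (got, &vr, 16);

  /* Scalar reference: sum of absolute differences over each group of
     eight bytes, one 64-bit result per group.  */
  uint64_t want[2] = { 0, 0 };
  for (int i = 0; i < 16; i++)
    want[i / 8] += (uint64_t) (a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);

  if (got[0] == want[0] && got[1] == want[1])
    puts ("ok");
  else
    printf ("mismatch: got {%llu, %llu}, want {%llu, %llu}\n",
	    (unsigned long long) got[0], (unsigned long long) got[1],
	    (unsigned long long) want[0], (unsigned long long) want[1]);
  return 0;
}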