From patchwork Thu May 25 17:43:46 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Uros Bizjak
X-Patchwork-Id: 70097
Date: Thu, 25 May 2023 19:43:46 +0200
Subject: [COMMITTED] i386: Use 2x-wider modes when emulating QImode vector
 instructions
To: "gcc-patches@gcc.gnu.org"
From: Uros Bizjak
List-Id: Gcc-patches mailing list

Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI ->
V16HImode) instructions when available.
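
As an illustration, the kind of source that exercises this expander can
be written with GCC vector extensions.  The testcase below is
hypothetical (it is not part of the patch or the testsuite; the typedef
and function name are made up):

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

v16qi
mulv16qi (v16qi a, v16qi b)
{
  /* A V16QImode multiply; there is no native byte-multiply insn,
     so this is expanded through ix86_expand_vecop_qihi.  */
  return a * b;
}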
Currently, the compiler generates the following assembly for V16QImode
multiplication (-mavx2):

	vpunpcklbw	%xmm0, %xmm0, %xmm3
	vpunpcklbw	%xmm1, %xmm1, %xmm2
	vpunpckhbw	%xmm0, %xmm0, %xmm0
	movl	$255, %eax
	vpunpckhbw	%xmm1, %xmm1, %xmm1
	vpmullw	%xmm3, %xmm2, %xmm2
	vmovd	%eax, %xmm3
	vpmullw	%xmm0, %xmm1, %xmm1
	vpbroadcastw	%xmm3, %xmm3
	vpand	%xmm2, %xmm3, %xmm0
	vpand	%xmm1, %xmm3, %xmm3
	vpackuswb	%xmm3, %xmm0, %xmm0

and only with -mavx512bw -mavx512vl does it generate:

	vpmovzxbw	%xmm1, %ymm1
	vpmovzxbw	%xmm0, %ymm0
	vpmullw	%ymm1, %ymm0, %ymm0
	vpmovwb	%ymm0, %xmm0

The patched compiler generates better code, performing the
multiplication in the 2x-wider mode even in cases where the missing
truncate instruction has to be emulated with a permutation (-mavx2):

	vpmovzxbw	%xmm0, %ymm0
	vpmovzxbw	%xmm1, %ymm1
	movl	$255, %eax
	vpmullw	%ymm1, %ymm0, %ymm1
	vmovd	%eax, %xmm0
	vpbroadcastw	%xmm0, %ymm0
	vpand	%ymm1, %ymm0, %ymm0
	vpackuswb	%ymm0, %ymm0, %ymm0
	vpermq	$216, %ymm0, %ymm0

The patch also adjusts the cost calculation of V*QImode emulations to
account for the generation of 2x-wider mode instructions.

gcc/ChangeLog:

	* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Rewrite
	to expand to 2x-wider (e.g. V16QI -> V16HImode) instructions
	when available.  Emulate truncation via
	ix86_expand_vec_perm_const_1 when native truncate insn is not
	available.
	(ix86_expand_vecop_qihi_partial): Use pmovzx when available.
	Trivially rename some variables.
	(ix86_expand_vecop_qihi): Unconditionally call
	ix86_expand_vecop_qihi2.
	* config/i386/i386.cc (ix86_multiplication_cost): Rewrite cost
	calculation of V*QImode emulations to account for generation of
	2x-wider mode instructions.
	(ix86_shift_rotate_cost): Update cost calculation of V*QImode
	emulations to account for generation of 2x-wider mode
	instructions.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/avx512vl-pr95488-1.c: Revert 2023-05-18 change.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 5a57be82e98..0d8953b8c75 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -23106,68 +23106,6 @@ ix86_expand_vec_interleave (rtx targ, rtx op0, rtx op1, bool high_p)
   gcc_assert (ok);
 }
 
-/* This function is similar as ix86_expand_vecop_qihi,
-   but optimized under AVX512BW by using vpmovwb.
-   For example, optimize vector MUL generation like
-
-   vpmovzxbw ymm2, xmm0
-   vpmovzxbw ymm3, xmm1
-   vpmullw ymm4, ymm2, ymm3
-   vpmovwb xmm0, ymm4
-
-   it would take less instructions than ix86_expand_vecop_qihi.
-   Return true if success.  */
-
-static bool
-ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest, rtx op1, rtx op2)
-{
-  machine_mode himode, qimode = GET_MODE (dest);
-  rtx hop1, hop2, hdest;
-  rtx (*gen_truncate)(rtx, rtx);
-  bool uns_p = (code == ASHIFTRT) ? false : true;
-
-  /* There are no V64HImode instructions.  */
-  if (qimode == V64QImode)
-    return false;
-
-  /* vpmovwb only available under AVX512BW.  */
-  if (!TARGET_AVX512BW)
-    return false;
-
-  if (qimode == V16QImode && !TARGET_AVX512VL)
-    return false;
-
-  /* Do not generate ymm/zmm instructions when
-     target prefers 128/256 bit vector width.  */
-  if ((qimode == V16QImode && TARGET_PREFER_AVX128)
-      || (qimode == V32QImode && TARGET_PREFER_AVX256))
-    return false;
-
-  switch (qimode)
-    {
-    case E_V16QImode:
-      himode = V16HImode;
-      gen_truncate = gen_truncv16hiv16qi2;
-      break;
-    case E_V32QImode:
-      himode = V32HImode;
-      gen_truncate = gen_truncv32hiv32qi2;
-      break;
-    default:
-      gcc_unreachable ();
-    }
-
-  hop1 = gen_reg_rtx (himode);
-  hop2 = gen_reg_rtx (himode);
-  hdest = gen_reg_rtx (himode);
-  emit_insn (gen_extend_insn (hop1, op1, himode, qimode, uns_p));
-  emit_insn (gen_extend_insn (hop2, op2, himode, qimode, uns_p));
-  emit_insn (gen_rtx_SET (hdest, simplify_gen_binary (code, himode,
-						      hop1, hop2)));
-  emit_insn (gen_truncate (dest, hdest));
-  return true;
-}
-
 /* Expand a vector operation shift by constant
    for a V*QImode in terms of the same operation
    on V*HImode.  Return true if success.  */
 static bool
@@ -23272,9 +23210,9 @@ void
 ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2)
 {
   machine_mode qimode = GET_MODE (dest);
-  rtx qop1, qop2, hop1, hop2, qdest, hres;
+  rtx qop1, qop2, hop1, hop2, qdest, hdest;
   bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT;
-  bool uns_p = true;
+  bool uns_p = code != ASHIFTRT;
 
   switch (qimode)
     {
@@ -23306,24 +23244,25 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2)
     {
     case MULT:
       gcc_assert (op2vec);
-      /* Unpack data such that we've got a source byte in each low byte of
-	 each word.  We don't care what goes into the high byte of each word.
-	 Rather than trying to get zero in there, most convenient is to let
-	 it be a copy of the low byte.  */
-      hop1 = copy_to_reg (qop1);
-      hop2 = copy_to_reg (qop2);
-      emit_insn (gen_vec_interleave_lowv16qi (hop1, hop1, hop1));
-      emit_insn (gen_vec_interleave_lowv16qi (hop2, hop2, hop2));
-      break;
-
-    case ASHIFTRT:
-      uns_p = false;
+      if (!TARGET_SSE4_1)
+	{
+	  /* Unpack data such that we've got a source byte in each low byte
+	     of each word.  We don't care what goes into the high byte of
+	     each word.  Rather than trying to get zero in there, most
+	     convenient is to let it be a copy of the low byte.  */
+	  hop1 = copy_to_reg (qop1);
+	  hop2 = copy_to_reg (qop2);
+	  emit_insn (gen_vec_interleave_lowv16qi (hop1, hop1, hop1));
+	  emit_insn (gen_vec_interleave_lowv16qi (hop2, hop2, hop2));
+	  break;
+	}
       /* FALLTHRU */
     case ASHIFT:
+    case ASHIFTRT:
     case LSHIFTRT:
       hop1 = gen_reg_rtx (V8HImode);
       ix86_expand_sse_unpack (hop1, qop1, uns_p, false);
-      /* vashr/vlshr/vashl  */
+      /* mult/vashr/vlshr/vashl  */
       if (op2vec)
 	{
 	  hop2 = gen_reg_rtx (V8HImode);
@@ -23340,14 +23279,14 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2)
   if (code != MULT && op2vec)
     {
       /* Expand vashr/vlshr/vashl.  */
-      hres = gen_reg_rtx (V8HImode);
-      emit_insn (gen_rtx_SET (hres,
+      hdest = gen_reg_rtx (V8HImode);
+      emit_insn (gen_rtx_SET (hdest,
 			      simplify_gen_binary (code, V8HImode,
 						   hop1, hop2)));
     }
   else
     /* Expand mult/ashr/lshr/ashl.  */
-    hres = expand_simple_binop (V8HImode, code, hop1, hop2,
+    hdest = expand_simple_binop (V8HImode, code, hop1, hop2,
 				NULL_RTX, 1, OPTAB_DIRECT);
 
   if (TARGET_AVX512BW && TARGET_AVX512VL)
@@ -23357,19 +23296,18 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2)
       else
 	qdest = gen_reg_rtx (V8QImode);
 
-      emit_insn (gen_truncv8hiv8qi2 (qdest, hres));
+      emit_insn (gen_truncv8hiv8qi2 (qdest, hdest));
     }
   else
     {
       struct expand_vec_perm_d d;
-      rtx qres = gen_lowpart (V16QImode, hres);
+      rtx qres = gen_lowpart (V16QImode, hdest);
       bool ok;
       int i;
 
       /* Merge the data back into the right place.  */
       d.target = qdest;
-      d.op0 = qres;
-      d.op1 = qres;
+      d.op0 = d.op1 = qres;
       d.vmode = V16QImode;
       d.nelt = 16;
       d.one_operand_p = false;
@@ -23386,6 +23324,116 @@ ix86_expand_vecop_qihi_partial (enum rtx_code code, rtx dest, rtx op1, rtx op2)
   emit_move_insn (dest, gen_lowpart (qimode, qdest));
 }
 
+/* Emit instruction in 2x wider mode.  For example, optimize
+   vector MUL generation like
+
+   vpmovzxbw ymm2, xmm0
+   vpmovzxbw ymm3, xmm1
+   vpmullw ymm4, ymm2, ymm3
+   vpmovwb xmm0, ymm4
+
+   it would take less instructions than ix86_expand_vecop_qihi.
+   Return true if success.  */
+
+static bool
+ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest, rtx op1, rtx op2)
+{
+  machine_mode himode, qimode = GET_MODE (dest);
+  machine_mode wqimode;
+  rtx qop1, qop2, hop1, hop2, hdest;
+  rtx (*gen_truncate)(rtx, rtx) = NULL;
+  bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT;
+  bool uns_p = code != ASHIFTRT;
+
+  if ((qimode == V16QImode && !TARGET_AVX2)
+      || (qimode == V32QImode && !TARGET_AVX512BW)
+      /* There are no V64HImode instructions.  */
+      || qimode == V64QImode)
+    return false;
+
+  /* Do not generate ymm/zmm instructions when
+     target prefers 128/256 bit vector width.  */
+  if ((qimode == V16QImode && TARGET_PREFER_AVX128)
+      || (qimode == V32QImode && TARGET_PREFER_AVX256))
+    return false;
+
+  switch (qimode)
+    {
+    case E_V16QImode:
+      himode = V16HImode;
+      if (TARGET_AVX512VL)
+	gen_truncate = gen_truncv16hiv16qi2;
+      break;
+    case E_V32QImode:
+      himode = V32HImode;
+      gen_truncate = gen_truncv32hiv32qi2;
+      break;
+    default:
+      gcc_unreachable ();
+    }
+
+  wqimode = GET_MODE_2XWIDER_MODE (qimode).require ();
+  qop1 = lowpart_subreg (wqimode, force_reg (qimode, op1), qimode);
+
+  if (op2vec)
+    qop2 = lowpart_subreg (wqimode, force_reg (qimode, op2), qimode);
+  else
+    qop2 = op2;
+
+  hop1 = gen_reg_rtx (himode);
+  ix86_expand_sse_unpack (hop1, qop1, uns_p, false);
+
+  if (op2vec)
+    {
+      hop2 = gen_reg_rtx (himode);
+      ix86_expand_sse_unpack (hop2, qop2, uns_p, false);
+    }
+  else
+    hop2 = qop2;
+
+  if (code != MULT && op2vec)
+    {
+      /* Expand vashr/vlshr/vashl.  */
+      hdest = gen_reg_rtx (himode);
+      emit_insn (gen_rtx_SET (hdest,
+			      simplify_gen_binary (code, himode,
+						   hop1, hop2)));
+    }
+  else
+    /* Expand mult/ashr/lshr/ashl.  */
+    hdest = expand_simple_binop (himode, code, hop1, hop2,
+				 NULL_RTX, 1, OPTAB_DIRECT);
+
+  if (gen_truncate)
+    emit_insn (gen_truncate (dest, hdest));
+  else
+    {
+      struct expand_vec_perm_d d;
+      rtx wqdest = gen_reg_rtx (wqimode);
+      rtx wqres = gen_lowpart (wqimode, hdest);
+      bool ok;
+      int i;
+
+      /* Merge the data back into the right place.  */
+      d.target = wqdest;
+      d.op0 = d.op1 = wqres;
+      d.vmode = wqimode;
+      d.nelt = GET_MODE_NUNITS (wqimode);
+      d.one_operand_p = false;
+      d.testing_p = false;
+
+      for (i = 0; i < d.nelt; ++i)
+	d.perm[i] = i * 2;
+
+      ok = ix86_expand_vec_perm_const_1 (&d);
+      gcc_assert (ok);
+
+      emit_move_insn (dest, gen_lowpart (qimode, wqdest));
+    }
+
+  return true;
+}
+
 /* Expand a vector operation CODE for a V*QImode in
    terms of the same operation on V*HImode.  */
 
@@ -23400,7 +23448,7 @@ ix86_expand_vecop_qihi (enum rtx_code code, rtx dest, rtx op1, rtx op2)
   bool op2vec = GET_MODE_CLASS (GET_MODE (op2)) == MODE_VECTOR_INT;
   struct expand_vec_perm_d d;
   bool full_interleave = true;
-  bool uns_p = true;
+  bool uns_p = code != ASHIFTRT;
   bool ok;
   int i;
 
@@ -23409,9 +23457,7 @@ ix86_expand_vecop_qihi (enum rtx_code code, rtx dest, rtx op1, rtx op2)
       && ix86_expand_vec_shift_qihi_constant (code, dest, op1, op2))
     return;
 
-  if (TARGET_AVX512BW
-      && VECTOR_MODE_P (GET_MODE (op2))
-      && ix86_expand_vecop_qihi2 (code, dest, op1, op2))
+  if (ix86_expand_vecop_qihi2 (code, dest, op1, op2))
     return;
 
   switch (qimode)
@@ -23468,10 +23514,8 @@ ix86_expand_vecop_qihi (enum rtx_code code, rtx dest, rtx op1, rtx op2)
       emit_insn (gen_ih (op1_h, op1, op1));
       break;
 
-    case ASHIFTRT:
-      uns_p = false;
-      /* FALLTHRU */
     case ASHIFT:
+    case ASHIFTRT:
     case LSHIFTRT:
       op1_l = gen_reg_rtx (himode);
       op1_h = gen_reg_rtx (himode);
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 202abf0b39c..e548d658828 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20497,75 +20497,109 @@ ix86_multiplication_cost (const struct processor_costs *cost,
     return ix86_vec_cost (mode,
 			  inner_mode == DFmode ? cost->mulsd : cost->mulss);
   else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
-    switch (mode)
-      {
-      case V4QImode:
-      case V8QImode:
-	/* Partial V*QImode is emulated with 4-6 insns.  */
-	if (TARGET_AVX512BW && TARGET_AVX512VL)
-	  return ix86_vec_cost (mode, cost->mulss + cost->sse_op * 3);
-	else if (TARGET_AVX2)
-	  return ix86_vec_cost (mode, cost->mulss + cost->sse_op * 5);
-	else if (TARGET_XOP)
-	  return (ix86_vec_cost (mode, cost->mulss + cost->sse_op * 3)
-		  + cost->sse_load[2]);
-	else
-	  return (ix86_vec_cost (mode, cost->mulss + cost->sse_op * 4)
-		  + cost->sse_load[2]);
-
-      case V16QImode:
-	/* V*QImode is emulated with 4-11 insns.  */
-	if (TARGET_AVX512BW && TARGET_AVX512VL)
-	  return ix86_vec_cost (mode, cost->mulss + cost->sse_op * 3);
-	else if (TARGET_AVX2)
-	  return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 8);
-	else if (TARGET_XOP)
-	  return (ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5)
-		  + cost->sse_load[2]);
-	else
-	  return (ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 7)
-		  + cost->sse_load[2]);
+    {
+      int nmults, nops;
+      /* Cost of reading the memory.  */
+      int extra;
 
-      case V32QImode:
-	if (TARGET_AVX512BW)
-	  return ix86_vec_cost (mode, cost->mulss + cost->sse_op * 3);
-	else
-	  return (ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 7)
-		  + cost->sse_load[3] * 2);
-
-      case V64QImode:
-	return (ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 9)
-		+ cost->sse_load[3] * 2
-		+ cost->sse_load[4] * 2);
-
-      case V4SImode:
-	/* pmulld is used in this case. No emulation is needed.  */
-	if (TARGET_SSE4_1)
-	  goto do_native;
-	/* V4SImode is emulated with 7 insns.  */
-	else
-	  return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5);
-
-      case V2DImode:
-      case V4DImode:
-	/* vpmullq is used in this case. No emulation is needed.  */
-	if (TARGET_AVX512DQ && TARGET_AVX512VL)
-	  goto do_native;
-	/* V*DImode is emulated with 6-8 insns.  */
-	else if (TARGET_XOP && mode == V2DImode)
-	  return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 4);
-	/* FALLTHRU */
-      case V8DImode:
-	/* vpmullq is used in this case. No emulation is needed.  */
-	if (TARGET_AVX512DQ && mode == V8DImode)
-	  goto do_native;
-	else
-	  return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
+      switch (mode)
+	{
+	case V4QImode:
+	case V8QImode:
+	  /* Partial V*QImode is emulated with 4-6 insns.  */
+	  nmults = 1;
+	  nops = 3;
+	  extra = 0;
 
-      default:
-      do_native:
-	return ix86_vec_cost (mode, cost->mulss);
-      }
+	  if (TARGET_AVX512BW && TARGET_AVX512VL)
+	    ;
+	  else if (TARGET_AVX2)
+	    nops += 2;
+	  else if (TARGET_XOP)
+	    extra += cost->sse_load[2];
+	  else
+	    {
+	      nops += 1;
+	      extra += cost->sse_load[2];
+	    }
+	  goto do_qimode;
+
+	case V16QImode:
+	  /* V*QImode is emulated with 4-11 insns.  */
+	  nmults = 1;
+	  nops = 3;
+	  extra = 0;
+
+	  if (TARGET_AVX2 && !TARGET_PREFER_AVX128)
+	    {
+	      if (!(TARGET_AVX512BW && TARGET_AVX512VL))
+		nops += 3;
+	    }
+	  else if (TARGET_XOP)
+	    {
+	      nmults += 1;
+	      nops += 2;
+	      extra += cost->sse_load[2];
+	    }
+	  else
+	    {
+	      nmults += 1;
+	      nops += 4;
+	      extra += cost->sse_load[2];
+	    }
+	  goto do_qimode;
+
+	case V32QImode:
+	  nmults = 1;
+	  nops = 3;
+	  extra = 0;
+
+	  if (!TARGET_AVX512BW || TARGET_PREFER_AVX256)
+	    {
+	      nmults += 1;
+	      nops += 4;
+	      extra += cost->sse_load[3] * 2;
+	    }
+	  goto do_qimode;
+
+	case V64QImode:
+	  nmults = 2;
+	  nops = 9;
+	  extra = cost->sse_load[3] * 2 + cost->sse_load[4] * 2;
+
+	do_qimode:
+	  return ix86_vec_cost (mode, cost->mulss * nmults
+				+ cost->sse_op * nops) + extra;
+
+	case V4SImode:
+	  /* pmulld is used in this case. No emulation is needed.  */
+	  if (TARGET_SSE4_1)
+	    goto do_native;
+	  /* V4SImode is emulated with 7 insns.  */
+	  else
+	    return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5);
+
+	case V2DImode:
+	case V4DImode:
+	  /* vpmullq is used in this case. No emulation is needed.  */
+	  if (TARGET_AVX512DQ && TARGET_AVX512VL)
+	    goto do_native;
+	  /* V*DImode is emulated with 6-8 insns.  */
+	  else if (TARGET_XOP && mode == V2DImode)
+	    return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 4);
+	  /* FALLTHRU */
+	case V8DImode:
+	  /* vpmullq is used in this case. No emulation is needed.  */
+	  if (TARGET_AVX512DQ && mode == V8DImode)
+	    goto do_native;
+	  else
+	    return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
+
+	default:
+	do_native:
+	  return ix86_vec_cost (mode, cost->mulss);
+	}
+    }
   else
     return (cost->mult_init[MODE_INDEX (mode)]
 	    + cost->mult_bit * 7);
 }
@@ -20637,16 +20671,13 @@ ix86_shift_rotate_cost (const struct processor_costs *cost,
 	      count = 2;
 	    }
 	  else if (TARGET_AVX512BW && TARGET_AVX512VL)
-	    {
-	      count = 3;
-	      return ix86_vec_cost (mode, cost->sse_op * count);
-	    }
+	    return ix86_vec_cost (mode, cost->sse_op * 4);
 	  else if (TARGET_SSE4_1)
-	    count = 4;
-	  else if (code == ASHIFTRT)
 	    count = 5;
+	  else if (code == ASHIFTRT)
+	    count = 6;
 	  else
-	    count = 4;
+	    count = 5;
 
 	  return ix86_vec_cost (mode, cost->sse_op * count) + extra;
 
 	case V16QImode:
@@ -20663,7 +20694,7 @@ ix86_shift_rotate_cost (const struct processor_costs *cost,
 	    }
 	  else
 	    {
-	      count = (code == ASHIFT) ? 2 : 3;
+	      count = (code == ASHIFT) ? 3 : 4;
 	      return ix86_vec_cost (mode, cost->sse_op * count);
 	    }
 	}
@@ -20685,12 +20716,20 @@ ix86_shift_rotate_cost (const struct processor_costs *cost,
 	      else
 		count = 2;
 	    }
+	  else if (TARGET_AVX512BW
+		   && ((mode == V32QImode && !TARGET_PREFER_AVX256)
+		       || (mode == V16QImode && TARGET_AVX512VL
+			   && !TARGET_PREFER_AVX128)))
+	    return ix86_vec_cost (mode, cost->sse_op * 4);
+	  else if (TARGET_AVX2
+		   && mode == V16QImode && !TARGET_PREFER_AVX128)
+	    count = 6;
 	  else if (TARGET_SSE4_1)
-	    count = 8;
-	  else if (code == ASHIFTRT)
 	    count = 9;
+	  else if (code == ASHIFTRT)
+	    count = 10;
 	  else
-	    count = 8;
+	    count = 9;
 
 	  return ix86_vec_cost (mode, cost->sse_op * count) + extra;
 
 	case V2DImode:
@@ -20704,6 +20743,8 @@ ix86_shift_rotate_cost (const struct processor_costs *cost,
 	    count = TARGET_SSE4_2 ? 1 : 2;
 	  else if (TARGET_XOP)
 	    count = 2;
+	  else if (TARGET_SSE4_1)
+	    count = 3;
 	  else
 	    count = 4;
 	}
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-pr95488-1.c b/gcc/testsuite/gcc.target/i386/avx512vl-pr95488-1.c
index 5e9f4f2805c..dc684a167c8 100644
--- a/gcc/testsuite/gcc.target/i386/avx512vl-pr95488-1.c
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-pr95488-1.c
@@ -1,8 +1,7 @@
 /* PR target/pr95488 */
 /* { dg-do compile } */
 /* { dg-options "-O2 -mavx512bw -mavx512vl" } */
-/* { dg-final { scan-assembler-times "vpmovzxbw" 4 { target { ! ia32 } } } } */
-/* { dg-final { scan-assembler-times "vpunpcklbw" 4 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpmovzxbw" 8 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vpmullw\[^\n\]*ymm" 2 } } */
 /* { dg-final { scan-assembler-times "vpmullw\[^\n\]*xmm" 2 { target { ! ia32 } } } } */
 /* { dg-final { scan-assembler-times "vpmovwb" 4 { target { ! ia32 } } } } */
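
As an aside, besides MULT the rewritten ix86_expand_vecop_qihi2 also
handles the shift codes (ASHIFT, ASHIFTRT, LSHIFTRT).  A hypothetical
illustration (this example is not from the patch or the testsuite, and
assumes the variable-shift patterns are enabled for the selected ISA)
of a vector-by-vector arithmetic right shift, which takes the
sign-extending path since uns_p is false for ASHIFTRT:

typedef signed char v16qs __attribute__ ((vector_size (16)));

v16qs
ashrv16qs (v16qs a, v16qs b)
{
  /* op2 is a vector, so op2vec is true and the operation is emitted
     via simplify_gen_binary in the 2x-wider V16HImode.  */
  return a >> b;
}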