[3/3] libcpp: add AVX2 helper

Message ID 20240806161850.18839-3-amonakov@ispras.ru
State New
Headers
Series libcpp: improve x86 vectorized helpers |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm fail Build failed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 fail Build failed

Commit Message

Alexander Monakov Aug. 6, 2024, 4:18 p.m. UTC
  Use the same PSHUFB-based matching as in the SSSE3 helper, just 2x
wider.

Directly use the new helper if __AVX2__ is defined. It makes the other
helpers unused, so mark them inline to prevent warnings.

Rewrite and simplify init_vectorized_lexer.

libcpp/ChangeLog:

	* files.cc (read_file_guts): Bump padding to 32 if HAVE_AVX2.
	* lex.cc (search_line_acc_char): Mark inline, not "unused".
	(search_line_sse2): Mark inline.
	(search_line_ssse3): Ditto.
	(search_line_avx2): New function.
	(init_vectorized_lexer): Reimplement.
---
 libcpp/files.cc |  15 +++----
 libcpp/lex.cc   | 111 ++++++++++++++++++++++++++++++++++++------------
 2 files changed, 92 insertions(+), 34 deletions(-)
  

Comments

Alexander Monakov Aug. 7, 2024, 5:40 a.m. UTC | #1
On Tue, 6 Aug 2024, Alexander Monakov wrote:

> --- a/libcpp/files.cc
> +++ b/libcpp/files.cc
[...]
> +  pad = HAVE_AVX2 ? 32 : 16;

This should have been

#ifdef HAVE_AVX2
  pad = 32;
#else
  pad = 16;
#endif

Alexander
  
Richard Biener Aug. 7, 2024, 7:50 a.m. UTC | #2
On Wed, Aug 7, 2024 at 7:41 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Tue, 6 Aug 2024, Alexander Monakov wrote:
>
> > --- a/libcpp/files.cc
> > +++ b/libcpp/files.cc
> [...]
> > +  pad = HAVE_AVX2 ? 32 : 16;
>
> This should have been
>
> #ifdef HAVE_AVX2
>   pad = 32;
> #else
>   pad = 16;
> #endif

OK with that change.

Did you think about a AVX512 version (possibly with 32 byte vectors)?
In case there's a more efficient variant of pshufb/pmovmskb available
there - possibly
the load on the branch unit could be lessened with using masking.

Thanks,
Richard.

> Alexander
  
Alexander Monakov Aug. 7, 2024, 8:08 a.m. UTC | #3
On Wed, 7 Aug 2024, Richard Biener wrote:

> OK with that change.
> 
> Did you think about a AVX512 version (possibly with 32 byte vectors)?
> In case there's a more efficient variant of pshufb/pmovmskb available
> there - possibly
> the load on the branch unit could be lessened with using masking.

Thanks for the idea; unfortunately I don't see any possible improvement.
It would trade pmovmskb-(test+jcc,fused) for ktest-jcc, so unless the
latencies are shorter it seems to be a wash. The only way to use fewer
branches seems to be employing longer vectors.

(in any case I don't have access to a capable CPU to see for myself)

Alexander
  
Alexander Monakov Aug. 8, 2024, 5:44 p.m. UTC | #4
> On Wed, 7 Aug 2024, Richard Biener wrote:
> 
> > OK with that change.
> > 
> > Did you think about a AVX512 version (possibly with 32 byte vectors)?
> > In case there's a more efficient variant of pshufb/pmovmskb available
> > there - possibly
> > the load on the branch unit could be lessened with using masking.
> 
> Thanks for the idea; unfortunately I don't see any possible improvement.
> It would trade pmovmskb-(test+jcc,fused) for ktest-jcc, so unless the
> latencies are shorter it seems to be a wash. The only way to use fewer
> branches seems to be employing longer vectors.

A better answer is that of course we can reduce branching without AVX2
by using two SSE vectors in place of one 32-byte AVX vector. In fact,
with a bit of TLC this idea works out really well, and the result
closely matches the AVX2 code in performance. To put that in numbers,
on the testcase from my microbenchmark archive, unmodified GCC shows

 Performance counter stats for 'gcc/cc1plus.orig -fsyntax-only -quiet t-rawstr.cc' (9 runs):

             39.13 msec task-clock:u                     #    1.101 CPUs utilized               ( +-  0.30% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
             2,206      page-faults:u                    #   56.374 K/sec                       ( +-  2.58% )
        61,502,159      cycles:u                         #    1.572 GHz                         ( +-  0.08% )
           749,841      stalled-cycles-frontend:u        #    1.22% frontend cycles idle        ( +-  0.20% )
         6,831,862      stalled-cycles-backend:u         #   11.11% backend cycles idle         ( +-  0.63% )
       141,972,604      instructions:u                   #    2.31  insn per cycle
                                                  #    0.05  stalled cycles per insn     ( +-  0.00% )
        46,054,279      branches:u                       #    1.177 G/sec                       ( +-  0.00% )
           325,134      branch-misses:u                  #    0.71% of all branches             ( +-  0.11% )

          0.035550 +- 0.000373 seconds time elapsed  ( +-  1.05% )


then with the AVX2 helper from my patchset we have

 Performance counter stats for 'gcc/cc1plus.avx -fsyntax-only -quiet t-rawstr.cc' (9 runs):

             36.39 msec task-clock:u                     #    1.112 CPUs utilized               ( +-  0.27% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
             2,208      page-faults:u                    #   60.677 K/sec                       ( +-  2.55% )
        56,527,349      cycles:u                         #    1.553 GHz                         ( +-  0.09% )
           728,417      stalled-cycles-frontend:u        #    1.29% frontend cycles idle        ( +-  0.38% )
         6,221,761      stalled-cycles-backend:u         #   11.01% backend cycles idle         ( +-  1.58% )
       141,296,340      instructions:u                   #    2.50  insn per cycle
                                                  #    0.04  stalled cycles per insn     ( +-  0.00% )
        45,758,162      branches:u                       #    1.257 G/sec                       ( +-  0.00% )
           295,042      branch-misses:u                  #    0.64% of all branches             ( +-  0.12% )

          0.032736 +- 0.000460 seconds time elapsed  ( +-  1.41% )


and with the revised patch that uses SSSE3 more cleverly

 Performance counter stats for 'gcc/cc1plus -fsyntax-only -quiet t-rawstr.cc' (9 runs):

             36.89 msec task-clock:u                     #    1.110 CPUs utilized               ( +-  0.29% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
             2,374      page-faults:u                    #   64.349 K/sec                       ( +-  3.77% )
        56,556,237      cycles:u                         #    1.533 GHz                         ( +-  0.11% )
           733,192      stalled-cycles-frontend:u        #    1.30% frontend cycles idle        ( +-  1.08% )
         6,271,987      stalled-cycles-backend:u         #   11.09% backend cycles idle         ( +-  1.89% )
       142,743,102      instructions:u                   #    2.52  insn per cycle
                                                  #    0.04  stalled cycles per insn     ( +-  0.00% )
        45,646,829      branches:u                       #    1.237 G/sec                       ( +-  0.00% )
           295,155      branch-misses:u                  #    0.65% of all branches             ( +-  0.11% )

          0.033242 +- 0.000418 seconds time elapsed  ( +-  1.26% )


Is the revised patch below still ok? I've rolled the configury changes into it,
and dropped the (now unnecessary) AVX2 helper (and cpuid/xgetbv detection).

===8<===

From: Alexander Monakov <amonakov@ispras.ru>
Subject: [PATCH v2] libcpp: replace SSE4.2 helper with an SSSE3 one

Since the characters we are searching for (CR, LF, '\', '?') all have
distinct ASCII codes mod 16, PSHUFB can help match them all at once.

Directly use the new helper if __SSSE3__ is defined. It makes the other
helpers unused, so mark them inline to prevent warnings.

Rewrite and simplify init_vectorized_lexer.

libcpp/ChangeLog:

	* config.in: Regenerate.
	* configure: Regenerate.
	* configure.ac: Check for SSSE3 instead of SSE4.2.
	* files.cc (read_file_guts): Bump padding to 64 if HAVE_SSSE3.
	* lex.cc (search_line_acc_char): Mark inline, not "unused".
	(search_line_sse2): Mark inline.
	(search_line_sse42): Replace with...
	(search_line_ssse3): ... this new function.  Adjust the use...
	(init_vectorized_lexer): ... here.  Simplify.
---
 libcpp/config.in    |   4 +-
 libcpp/configure    |   4 +-
 libcpp/configure.ac |   6 +-
 libcpp/files.cc     |  19 +++---
 libcpp/lex.cc       | 150 ++++++++++++++++----------------------------
 5 files changed, 73 insertions(+), 110 deletions(-)

diff --git a/libcpp/config.in b/libcpp/config.in
index 253ef03a3d..b2e2f4e842 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -210,8 +210,8 @@
 /* Define to 1 if you have the `putc_unlocked' function. */
 #undef HAVE_PUTC_UNLOCKED
 
-/* Define to 1 if you can assemble SSE4 insns. */
-#undef HAVE_SSE4
+/* Define to 1 if you can assemble SSSE3 insns. */
+#undef HAVE_SSSE3
 
 /* Define to 1 if you have the <stddef.h> header file. */
 #undef HAVE_STDDEF_H
diff --git a/libcpp/configure b/libcpp/configure
index 32d6aaa306..1391081ba0 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -9140,14 +9140,14 @@ case $target in
 int
 main ()
 {
-asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))
+asm ("pshufb %xmm0, %xmm1")
   ;
   return 0;
 }
 _ACEOF
 if ac_fn_c_try_compile "$LINENO"; then :
 
-$as_echo "#define HAVE_SSE4 1" >>confdefs.h
+$as_echo "#define HAVE_SSSE3 1" >>confdefs.h
 
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
diff --git a/libcpp/configure.ac b/libcpp/configure.ac
index b883fec776..981f97c4ab 100644
--- a/libcpp/configure.ac
+++ b/libcpp/configure.ac
@@ -197,9 +197,9 @@ fi
 
 case $target in
   i?86-* | x86_64-*)
-    AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
-      [AC_DEFINE([HAVE_SSE4], [1],
-		 [Define to 1 if you can assemble SSE4 insns.])])
+    AC_TRY_COMPILE([], [asm ("pshufb %xmm0, %xmm1")],
+      [AC_DEFINE([HAVE_SSSE3], [1],
+		 [Define to 1 if you can assemble SSSE3 insns.])])
 esac
 
 # Enable --enable-host-shared.
diff --git a/libcpp/files.cc b/libcpp/files.cc
index 78f56e30bd..3775091d25 100644
--- a/libcpp/files.cc
+++ b/libcpp/files.cc
@@ -693,7 +693,7 @@ static bool
 read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
 		const char *input_charset)
 {
-  ssize_t size, total, count;
+  ssize_t size, pad, total, count;
   uchar *buf;
   bool regular;
 
@@ -732,11 +732,14 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
        the majority of C source files.  */
     size = 8 * 1024;
 
-  /* The + 16 here is space for the final '\n' and 15 bytes of padding,
-     used to quiet warnings from valgrind or Address Sanitizer, when the
-     optimized lexer accesses aligned 16-byte memory chunks, including
-     the bytes after the malloced, area, and stops lexing on '\n'.  */
-  buf = XNEWVEC (uchar, size + 16);
+#ifdef HAVE_SSSE3
+  pad = 64;
+#else
+  pad = 16;
+#endif
+  /* The '+ PAD' here is space for the final '\n' and PAD-1 bytes of padding,
+     allowing search_line_fast to use (possibly misaligned) vector loads.  */
+  buf = XNEWVEC (uchar, size + pad);
   total = 0;
   while ((count = read (file->fd, buf + total, size - total)) > 0)
     {
@@ -747,7 +750,7 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
 	  if (regular)
 	    break;
 	  size *= 2;
-	  buf = XRESIZEVEC (uchar, buf, size + 16);
+	  buf = XRESIZEVEC (uchar, buf, size + pad);
 	}
     }
 
@@ -765,7 +768,7 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
 
   file->buffer = _cpp_convert_input (pfile,
 				     input_charset,
-				     buf, size + 16, total,
+				     buf, size + pad, total,
 				     &file->buffer_start,
 				     &file->st.st_size);
   file->buffer_valid = file->buffer;
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 1591dcdf15..daf2c770bc 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -225,10 +225,7 @@ acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
    and branches without increasing the number of arithmetic operations.
    It's almost certainly going to be a win with 64-bit word size.  */
 
-static const uchar * search_line_acc_char (const uchar *, const uchar *)
-  ATTRIBUTE_UNUSED;
-
-static const uchar *
+static inline const uchar *
 search_line_acc_char (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
 {
   const word_type repl_nl = acc_char_replicate ('\n');
@@ -293,7 +290,7 @@ static const char repl_chars[4][16] __attribute__((aligned(16))) = {
 
 /* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
 
-static const uchar *
+static inline const uchar *
 #ifndef __SSE2__
 __attribute__((__target__("sse2")))
 #endif
@@ -344,120 +341,83 @@ search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
   return (const uchar *)p + found;
 }
 
-#ifdef HAVE_SSE4
-/* A version of the fast scanner using SSE 4.2 vectorized string insns.  */
+#ifdef HAVE_SSSE3
+/* A version of the fast scanner using SSSE3 shuffle (PSHUFB) insns.  */
 
-static const uchar *
-#ifndef __SSE4_2__
-__attribute__((__target__("sse4.2")))
+static inline const uchar *
+#ifndef __SSSE3__
+__attribute__((__target__("ssse3")))
 #endif
-search_line_sse42 (const uchar *s, const uchar *end)
+search_line_ssse3 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
 {
   typedef char v16qi __attribute__ ((__vector_size__ (16)));
-  static const v16qi search = { '\n', '\r', '?', '\\' };
-
-  uintptr_t si = (uintptr_t)s;
-  uintptr_t index;
-
-  /* Check for unaligned input.  */
-  if (si & 15)
-    {
-      v16qi sv;
-
-      if (__builtin_expect (end - s < 16, 0)
-	  && __builtin_expect ((si & 0xfff) > 0xff0, 0))
-	{
-	  /* There are less than 16 bytes left in the buffer, and less
-	     than 16 bytes left on the page.  Reading 16 bytes at this
-	     point might generate a spurious page fault.  Defer to the
-	     SSE2 implementation, which already handles alignment.  */
-	  return search_line_sse2 (s, end);
-	}
-
-      /* ??? The builtin doesn't understand that the PCMPESTRI read from
-	 memory need not be aligned.  */
-      sv = __builtin_ia32_loaddqu ((const char *) s);
-      index = __builtin_ia32_pcmpestri128 (search, 4, sv, 16, 0);
-
-      if (__builtin_expect (index < 16, 0))
-	goto found;
-
-      /* Advance the pointer to an aligned address.  We will re-scan a
-	 few bytes, but we no longer need care for reading past the
-	 end of a page, since we're guaranteed a match.  */
-      s = (const uchar *)((si + 15) & -16);
-    }
-
-  /* Main loop, processing 16 bytes at a time.  */
-#ifdef __GCC_ASM_FLAG_OUTPUTS__
-  while (1)
+  typedef v16qi v16qi_u __attribute__ ((__aligned__ (1)));
+  /* Helper vector for pshufb-based matching:
+     each character C we're searching for is at position (C % 16).  */
+  v16qi lut = { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' };
+  static_assert ('\n' == 10 && '\r' == 13 && '\\' == 92 && '?' == 63);
+
+  v16qi d1, d2, t1, t2;
+  /* Unaligned loads.  Reading beyond the final newline is safe,
+     since files.cc:read_file_guts pads the allocation.  */
+  d1 = *(const v16qi_u *)s;
+  d2 = *(const v16qi_u *)(s + 16);
+  unsigned m1, m2, found;
+  /* Process two 16-byte chunks per iteration.  */
+  do
     {
-      char f;
-
-      /* By using inline assembly instead of the builtin,
-	 we can use the result, as well as the flags set.  */
-      __asm ("%vpcmpestri\t$0, %2, %3"
-	     : "=c"(index), "=@ccc"(f)
-	     : "m"(*s), "x"(search), "a"(4), "d"(16));
-      if (f)
-	break;
-      
-      s += 16;
+      t1 = __builtin_ia32_pshufb128 (lut, d1);
+      t2 = __builtin_ia32_pshufb128 (lut, d2);
+      m1 = __builtin_ia32_pmovmskb128 (t1 == d1);
+      m2 = __builtin_ia32_pmovmskb128 (t2 == d2);
+      s += 32;
+      d1 = *(const v16qi_u *)s;
+      d2 = *(const v16qi_u *)(s + 16);
+      found = m1 + (m2 << 16);
     }
-#else
-  s -= 16;
-  /* By doing the whole loop in inline assembly,
-     we can make proper use of the flags set.  */
-  __asm (      ".balign 16\n"
-	"0:	add $16, %1\n"
-	"	%vpcmpestri\t$0, (%1), %2\n"
-	"	jnc 0b"
-	: "=&c"(index), "+r"(s)
-	: "x"(search), "a"(4), "d"(16));
-#endif
-
- found:
-  return s + index;
+  while (!found);
+  /* Prefer to compute 's - 32' here, not spend an extra instruction
+     to make a copy of the previous value of 's' in the loop.  */
+  __asm__ ("" : "+r"(s));
+  return s - 32 + __builtin_ctz (found);
 }
 
 #else
-/* Work around out-dated assemblers without sse4 support.  */
-#define search_line_sse42 search_line_sse2
+/* Work around out-dated assemblers without SSSE3 support.  */
+#define search_line_ssse3 search_line_sse2
 #endif
 
+#ifdef __SSSE3__
+/* No need for CPU probing, just use the best available variant.  */
+#define search_line_fast search_line_ssse3
+#else
 /* Check the CPU capabilities.  */
 
 #include "../gcc/config/i386/cpuid.h"
 
 typedef const uchar * (*search_line_fast_type) (const uchar *, const uchar *);
-static search_line_fast_type search_line_fast;
+static search_line_fast_type search_line_fast
+#if defined(__SSE2__)
+ = search_line_sse2;
+#else
+ = search_line_acc_char;
+#endif
 
 #define HAVE_init_vectorized_lexer 1
 static inline void
 init_vectorized_lexer (void)
 {
-  unsigned dummy, ecx = 0, edx = 0;
-  search_line_fast_type impl = search_line_acc_char;
-  int minimum = 0;
-
-#if defined(__SSE4_2__)
-  minimum = 3;
-#elif defined(__SSE2__)
-  minimum = 2;
-#endif
+  unsigned ax, bx, cx, dx;
 
-  if (minimum == 3)
-    impl = search_line_sse42;
-  else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
-    {
-      if (minimum == 3 || (ecx & bit_SSE4_2))
-        impl = search_line_sse42;
-      else if (minimum == 2 || (edx & bit_SSE2))
-	impl = search_line_sse2;
-    }
+  if (!__get_cpuid (1, &ax, &bx, &cx, &dx))
+    return;
 
-  search_line_fast = impl;
+  if (cx & bit_SSSE3)
+    search_line_fast = search_line_ssse3;
+  else if (dx & bit_SSE2)
+    search_line_fast = search_line_sse2;
 }
+#endif
 
 #elif (GCC_VERSION >= 4005) && defined(_ARCH_PWR8) && defined(__ALTIVEC__)
  
Richard Biener Aug. 20, 2024, 7:56 a.m. UTC | #5
On Thu, Aug 8, 2024 at 7:44 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> > On Wed, 7 Aug 2024, Richard Biener wrote:
> >
> > > OK with that change.
> > >
> > > Did you think about a AVX512 version (possibly with 32 byte vectors)?
> > > In case there's a more efficient variant of pshufb/pmovmskb available
> > > there - possibly
> > > the load on the branch unit could be lessened with using masking.
> >
> > Thanks for the idea; unfortunately I don't see any possible improvement.
> > It would trade pmovmskb-(test+jcc,fused) for ktest-jcc, so unless the
> > latencies are shorter it seems to be a wash. The only way to use fewer
> > branches seems to be employing longer vectors.
>
> A better answer is that of course we can reduce branching without AVX2
> by using two SSE vectors in place of one 32-byte AVX vector. In fact,
> with a bit of TLC this idea works out really well, and the result
> closely matches the AVX2 code in performance. To put that in numbers,
> on the testcase from my microbenchmark archive, unmodified GCC shows
>
>  Performance counter stats for 'gcc/cc1plus.orig -fsyntax-only -quiet t-rawstr.cc' (9 runs):
>
>              39.13 msec task-clock:u                     #    1.101 CPUs utilized               ( +-  0.30% )
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>              2,206      page-faults:u                    #   56.374 K/sec                       ( +-  2.58% )
>         61,502,159      cycles:u                         #    1.572 GHz                         ( +-  0.08% )
>            749,841      stalled-cycles-frontend:u        #    1.22% frontend cycles idle        ( +-  0.20% )
>          6,831,862      stalled-cycles-backend:u         #   11.11% backend cycles idle         ( +-  0.63% )
>        141,972,604      instructions:u                   #    2.31  insn per cycle
>                                                   #    0.05  stalled cycles per insn     ( +-  0.00% )
>         46,054,279      branches:u                       #    1.177 G/sec                       ( +-  0.00% )
>            325,134      branch-misses:u                  #    0.71% of all branches             ( +-  0.11% )
>
>           0.035550 +- 0.000373 seconds time elapsed  ( +-  1.05% )
>
>
> then with the AVX2 helper from my patchset we have
>
>  Performance counter stats for 'gcc/cc1plus.avx -fsyntax-only -quiet t-rawstr.cc' (9 runs):
>
>              36.39 msec task-clock:u                     #    1.112 CPUs utilized               ( +-  0.27% )
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>              2,208      page-faults:u                    #   60.677 K/sec                       ( +-  2.55% )
>         56,527,349      cycles:u                         #    1.553 GHz                         ( +-  0.09% )
>            728,417      stalled-cycles-frontend:u        #    1.29% frontend cycles idle        ( +-  0.38% )
>          6,221,761      stalled-cycles-backend:u         #   11.01% backend cycles idle         ( +-  1.58% )
>        141,296,340      instructions:u                   #    2.50  insn per cycle
>                                                   #    0.04  stalled cycles per insn     ( +-  0.00% )
>         45,758,162      branches:u                       #    1.257 G/sec                       ( +-  0.00% )
>            295,042      branch-misses:u                  #    0.64% of all branches             ( +-  0.12% )
>
>           0.032736 +- 0.000460 seconds time elapsed  ( +-  1.41% )
>
>
> and with the revised patch that uses SSSE3 more cleverly
>
>  Performance counter stats for 'gcc/cc1plus -fsyntax-only -quiet t-rawstr.cc' (9 runs):
>
>              36.89 msec task-clock:u                     #    1.110 CPUs utilized               ( +-  0.29% )
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>              2,374      page-faults:u                    #   64.349 K/sec                       ( +-  3.77% )
>         56,556,237      cycles:u                         #    1.533 GHz                         ( +-  0.11% )
>            733,192      stalled-cycles-frontend:u        #    1.30% frontend cycles idle        ( +-  1.08% )
>          6,271,987      stalled-cycles-backend:u         #   11.09% backend cycles idle         ( +-  1.89% )
>        142,743,102      instructions:u                   #    2.52  insn per cycle
>                                                   #    0.04  stalled cycles per insn     ( +-  0.00% )
>         45,646,829      branches:u                       #    1.237 G/sec                       ( +-  0.00% )
>            295,155      branch-misses:u                  #    0.65% of all branches             ( +-  0.11% )
>
>           0.033242 +- 0.000418 seconds time elapsed  ( +-  1.26% )
>
>
> Is the revised patch below still ok? I've rolled the configury changes into it,
> and dropped the (now unnecessary) AVX2 helper (and cpuid/xgetbv detection).

Yes - thanks.  Less code to maintain is always good.

Richard.

> ===8<===
>
> From: Alexander Monakov <amonakov@ispras.ru>
> Subject: [PATCH v2] libcpp: replace SSE4.2 helper with an SSSE3 one
>
> Since the characters we are searching for (CR, LF, '\', '?') all have
> distinct ASCII codes mod 16, PSHUFB can help match them all at once.
>
> Directly use the new helper if __SSSE3__ is defined. It makes the other
> helpers unused, so mark them inline to prevent warnings.
>
> Rewrite and simplify init_vectorized_lexer.
>
> libcpp/ChangeLog:
>
>         * config.in: Regenerate.
>         * configure: Regenerate.
>         * configure.ac: Check for SSSE3 instead of SSE4.2.
>         * files.cc (read_file_guts): Bump padding to 64 if HAVE_SSSE3.
>         * lex.cc (search_line_acc_char): Mark inline, not "unused".
>         (search_line_sse2): Mark inline.
>         (search_line_sse42): Replace with...
>         (search_line_ssse3): ... this new function.  Adjust the use...
>         (init_vectorized_lexer): ... here.  Simplify.
> ---
>  libcpp/config.in    |   4 +-
>  libcpp/configure    |   4 +-
>  libcpp/configure.ac |   6 +-
>  libcpp/files.cc     |  19 +++---
>  libcpp/lex.cc       | 150 ++++++++++++++++----------------------------
>  5 files changed, 73 insertions(+), 110 deletions(-)
>
> diff --git a/libcpp/config.in b/libcpp/config.in
> index 253ef03a3d..b2e2f4e842 100644
> --- a/libcpp/config.in
> +++ b/libcpp/config.in
> @@ -210,8 +210,8 @@
>  /* Define to 1 if you have the `putc_unlocked' function. */
>  #undef HAVE_PUTC_UNLOCKED
>
> -/* Define to 1 if you can assemble SSE4 insns. */
> -#undef HAVE_SSE4
> +/* Define to 1 if you can assemble SSSE3 insns. */
> +#undef HAVE_SSSE3
>
>  /* Define to 1 if you have the <stddef.h> header file. */
>  #undef HAVE_STDDEF_H
> diff --git a/libcpp/configure b/libcpp/configure
> index 32d6aaa306..1391081ba0 100755
> --- a/libcpp/configure
> +++ b/libcpp/configure
> @@ -9140,14 +9140,14 @@ case $target in
>  int
>  main ()
>  {
> -asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))
> +asm ("pshufb %xmm0, %xmm1")
>    ;
>    return 0;
>  }
>  _ACEOF
>  if ac_fn_c_try_compile "$LINENO"; then :
>
> -$as_echo "#define HAVE_SSE4 1" >>confdefs.h
> +$as_echo "#define HAVE_SSSE3 1" >>confdefs.h
>
>  fi
>  rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
> diff --git a/libcpp/configure.ac b/libcpp/configure.ac
> index b883fec776..981f97c4ab 100644
> --- a/libcpp/configure.ac
> +++ b/libcpp/configure.ac
> @@ -197,9 +197,9 @@ fi
>
>  case $target in
>    i?86-* | x86_64-*)
> -    AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
> -      [AC_DEFINE([HAVE_SSE4], [1],
> -                [Define to 1 if you can assemble SSE4 insns.])])
> +    AC_TRY_COMPILE([], [asm ("pshufb %xmm0, %xmm1")],
> +      [AC_DEFINE([HAVE_SSSE3], [1],
> +                [Define to 1 if you can assemble SSSE3 insns.])])
>  esac
>
>  # Enable --enable-host-shared.
> diff --git a/libcpp/files.cc b/libcpp/files.cc
> index 78f56e30bd..3775091d25 100644
> --- a/libcpp/files.cc
> +++ b/libcpp/files.cc
> @@ -693,7 +693,7 @@ static bool
>  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>                 const char *input_charset)
>  {
> -  ssize_t size, total, count;
> +  ssize_t size, pad, total, count;
>    uchar *buf;
>    bool regular;
>
> @@ -732,11 +732,14 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>         the majority of C source files.  */
>      size = 8 * 1024;
>
> -  /* The + 16 here is space for the final '\n' and 15 bytes of padding,
> -     used to quiet warnings from valgrind or Address Sanitizer, when the
> -     optimized lexer accesses aligned 16-byte memory chunks, including
> -     the bytes after the malloced, area, and stops lexing on '\n'.  */
> -  buf = XNEWVEC (uchar, size + 16);
> +#ifdef HAVE_SSSE3
> +  pad = 64;
> +#else
> +  pad = 16;
> +#endif
> +  /* The '+ PAD' here is space for the final '\n' and PAD-1 bytes of padding,
> +     allowing search_line_fast to use (possibly misaligned) vector loads.  */
> +  buf = XNEWVEC (uchar, size + pad);
>    total = 0;
>    while ((count = read (file->fd, buf + total, size - total)) > 0)
>      {
> @@ -747,7 +750,7 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>           if (regular)
>             break;
>           size *= 2;
> -         buf = XRESIZEVEC (uchar, buf, size + 16);
> +         buf = XRESIZEVEC (uchar, buf, size + pad);
>         }
>      }
>
> @@ -765,7 +768,7 @@ read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
>
>    file->buffer = _cpp_convert_input (pfile,
>                                      input_charset,
> -                                    buf, size + 16, total,
> +                                    buf, size + pad, total,
>                                      &file->buffer_start,
>                                      &file->st.st_size);
>    file->buffer_valid = file->buffer;
> diff --git a/libcpp/lex.cc b/libcpp/lex.cc
> index 1591dcdf15..daf2c770bc 100644
> --- a/libcpp/lex.cc
> +++ b/libcpp/lex.cc
> @@ -225,10 +225,7 @@ acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
>     and branches without increasing the number of arithmetic operations.
>     It's almost certainly going to be a win with 64-bit word size.  */
>
> -static const uchar * search_line_acc_char (const uchar *, const uchar *)
> -  ATTRIBUTE_UNUSED;
> -
> -static const uchar *
> +static inline const uchar *
>  search_line_acc_char (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
>  {
>    const word_type repl_nl = acc_char_replicate ('\n');
> @@ -293,7 +290,7 @@ static const char repl_chars[4][16] __attribute__((aligned(16))) = {
>
>  /* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
>
> -static const uchar *
> +static inline const uchar *
>  #ifndef __SSE2__
>  __attribute__((__target__("sse2")))
>  #endif
> @@ -344,120 +341,83 @@ search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
>    return (const uchar *)p + found;
>  }
>
> -#ifdef HAVE_SSE4
> -/* A version of the fast scanner using SSE 4.2 vectorized string insns.  */
> +#ifdef HAVE_SSSE3
> +/* A version of the fast scanner using SSSE3 shuffle (PSHUFB) insns.  */
>
> -static const uchar *
> -#ifndef __SSE4_2__
> -__attribute__((__target__("sse4.2")))
> +static inline const uchar *
> +#ifndef __SSSE3__
> +__attribute__((__target__("ssse3")))
>  #endif
> -search_line_sse42 (const uchar *s, const uchar *end)
> +search_line_ssse3 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
>  {
>    typedef char v16qi __attribute__ ((__vector_size__ (16)));
> -  static const v16qi search = { '\n', '\r', '?', '\\' };
> -
> -  uintptr_t si = (uintptr_t)s;
> -  uintptr_t index;
> -
> -  /* Check for unaligned input.  */
> -  if (si & 15)
> -    {
> -      v16qi sv;
> -
> -      if (__builtin_expect (end - s < 16, 0)
> -         && __builtin_expect ((si & 0xfff) > 0xff0, 0))
> -       {
> -         /* There are less than 16 bytes left in the buffer, and less
> -            than 16 bytes left on the page.  Reading 16 bytes at this
> -            point might generate a spurious page fault.  Defer to the
> -            SSE2 implementation, which already handles alignment.  */
> -         return search_line_sse2 (s, end);
> -       }
> -
> -      /* ??? The builtin doesn't understand that the PCMPESTRI read from
> -        memory need not be aligned.  */
> -      sv = __builtin_ia32_loaddqu ((const char *) s);
> -      index = __builtin_ia32_pcmpestri128 (search, 4, sv, 16, 0);
> -
> -      if (__builtin_expect (index < 16, 0))
> -       goto found;
> -
> -      /* Advance the pointer to an aligned address.  We will re-scan a
> -        few bytes, but we no longer need care for reading past the
> -        end of a page, since we're guaranteed a match.  */
> -      s = (const uchar *)((si + 15) & -16);
> -    }
> -
> -  /* Main loop, processing 16 bytes at a time.  */
> -#ifdef __GCC_ASM_FLAG_OUTPUTS__
> -  while (1)
> +  typedef v16qi v16qi_u __attribute__ ((__aligned__ (1)));
> +  /* Helper vector for pshufb-based matching:
> +     each character C we're searching for is at position (C % 16).  */
> +  v16qi lut = { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' };
> +  static_assert ('\n' == 10 && '\r' == 13 && '\\' == 92 && '?' == 63);
> +
> +  v16qi d1, d2, t1, t2;
> +  /* Unaligned loads.  Reading beyond the final newline is safe,
> +     since files.cc:read_file_guts pads the allocation.  */
> +  d1 = *(const v16qi_u *)s;
> +  d2 = *(const v16qi_u *)(s + 16);
> +  unsigned m1, m2, found;
> +  /* Process two 16-byte chunks per iteration.  */
> +  do
>      {
> -      char f;
> -
> -      /* By using inline assembly instead of the builtin,
> -        we can use the result, as well as the flags set.  */
> -      __asm ("%vpcmpestri\t$0, %2, %3"
> -            : "=c"(index), "=@ccc"(f)
> -            : "m"(*s), "x"(search), "a"(4), "d"(16));
> -      if (f)
> -       break;
> -
> -      s += 16;
> +      t1 = __builtin_ia32_pshufb128 (lut, d1);
> +      t2 = __builtin_ia32_pshufb128 (lut, d2);
> +      m1 = __builtin_ia32_pmovmskb128 (t1 == d1);
> +      m2 = __builtin_ia32_pmovmskb128 (t2 == d2);
> +      s += 32;
> +      d1 = *(const v16qi_u *)s;
> +      d2 = *(const v16qi_u *)(s + 16);
> +      found = m1 + (m2 << 16);
>      }
> -#else
> -  s -= 16;
> -  /* By doing the whole loop in inline assembly,
> -     we can make proper use of the flags set.  */
> -  __asm (      ".balign 16\n"
> -       "0:     add $16, %1\n"
> -       "       %vpcmpestri\t$0, (%1), %2\n"
> -       "       jnc 0b"
> -       : "=&c"(index), "+r"(s)
> -       : "x"(search), "a"(4), "d"(16));
> -#endif
> -
> - found:
> -  return s + index;
> +  while (!found);
> +  /* Prefer to compute 's - 32' here, not spend an extra instruction
> +     to make a copy of the previous value of 's' in the loop.  */
> +  __asm__ ("" : "+r"(s));
> +  return s - 32 + __builtin_ctz (found);
>  }
>
>  #else
> -/* Work around out-dated assemblers without sse4 support.  */
> -#define search_line_sse42 search_line_sse2
> +/* Work around out-dated assemblers without SSSE3 support.  */
> +#define search_line_ssse3 search_line_sse2
>  #endif
>
> +#ifdef __SSSE3__
> +/* No need for CPU probing, just use the best available variant.  */
> +#define search_line_fast search_line_ssse3
> +#else
>  /* Check the CPU capabilities.  */
>
>  #include "../gcc/config/i386/cpuid.h"
>
>  typedef const uchar * (*search_line_fast_type) (const uchar *, const uchar *);
> -static search_line_fast_type search_line_fast;
> +static search_line_fast_type search_line_fast
> +#if defined(__SSE2__)
> + = search_line_sse2;
> +#else
> + = search_line_acc_char;
> +#endif
>
>  #define HAVE_init_vectorized_lexer 1
>  static inline void
>  init_vectorized_lexer (void)
>  {
> -  unsigned dummy, ecx = 0, edx = 0;
> -  search_line_fast_type impl = search_line_acc_char;
> -  int minimum = 0;
> -
> -#if defined(__SSE4_2__)
> -  minimum = 3;
> -#elif defined(__SSE2__)
> -  minimum = 2;
> -#endif
> +  unsigned ax, bx, cx, dx;
>
> -  if (minimum == 3)
> -    impl = search_line_sse42;
> -  else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
> -    {
> -      if (minimum == 3 || (ecx & bit_SSE4_2))
> -        impl = search_line_sse42;
> -      else if (minimum == 2 || (edx & bit_SSE2))
> -       impl = search_line_sse2;
> -    }
> +  if (!__get_cpuid (1, &ax, &bx, &cx, &dx))
> +    return;
>
> -  search_line_fast = impl;
> +  if (cx & bit_SSSE3)
> +    search_line_fast = search_line_ssse3;
> +  else if (dx & bit_SSE2)
> +    search_line_fast = search_line_sse2;
>  }
> +#endif
>
>  #elif (GCC_VERSION >= 4005) && defined(_ARCH_PWR8) && defined(__ALTIVEC__)
>
> --
> 2.44.0
>
  

Patch

diff --git a/libcpp/files.cc b/libcpp/files.cc
index 78f56e30bd..3df070d035 100644
--- a/libcpp/files.cc
+++ b/libcpp/files.cc
@@ -693,7 +693,7 @@  static bool
 read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
 		const char *input_charset)
 {
-  ssize_t size, total, count;
+  ssize_t size, pad, total, count;
   uchar *buf;
   bool regular;
 
@@ -732,11 +732,10 @@  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
        the majority of C source files.  */
     size = 8 * 1024;
 
-  /* The + 16 here is space for the final '\n' and 15 bytes of padding,
-     used to quiet warnings from valgrind or Address Sanitizer, when the
-     optimized lexer accesses aligned 16-byte memory chunks, including
-     the bytes after the malloced, area, and stops lexing on '\n'.  */
-  buf = XNEWVEC (uchar, size + 16);
+  pad = HAVE_AVX2 ? 32 : 16;
+  /* The '+ PAD' here is space for the final '\n' and PAD-1 bytes of padding,
+     allowing search_line_fast to use (possibly misaligned) vector loads.  */
+  buf = XNEWVEC (uchar, size + pad);
   total = 0;
   while ((count = read (file->fd, buf + total, size - total)) > 0)
     {
@@ -747,7 +746,7 @@  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
 	  if (regular)
 	    break;
 	  size *= 2;
-	  buf = XRESIZEVEC (uchar, buf, size + 16);
+	  buf = XRESIZEVEC (uchar, buf, size + pad);
 	}
     }
 
@@ -765,7 +764,7 @@  read_file_guts (cpp_reader *pfile, _cpp_file *file, location_t loc,
 
   file->buffer = _cpp_convert_input (pfile,
 				     input_charset,
-				     buf, size + 16, total,
+				     buf, size + pad, total,
 				     &file->buffer_start,
 				     &file->st.st_size);
   file->buffer_valid = file->buffer;
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 815b8abd29..c336281658 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -225,10 +225,7 @@  acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
    and branches without increasing the number of arithmetic operations.
    It's almost certainly going to be a win with 64-bit word size.  */
 
-static const uchar * search_line_acc_char (const uchar *, const uchar *)
-  ATTRIBUTE_UNUSED;
-
-static const uchar *
+static inline const uchar *
 search_line_acc_char (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
 {
   const word_type repl_nl = acc_char_replicate ('\n');
@@ -293,7 +290,7 @@  static const char repl_chars[4][16] __attribute__((aligned(16))) = {
 
 /* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
 
-static const uchar *
+static inline const uchar *
 #ifndef __SSE2__
 __attribute__((__target__("sse2")))
 #endif
@@ -345,9 +342,9 @@  search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
 }
 
 #ifdef HAVE_AVX2
-/* A version of the fast scanner using SSSE3 shuffle (PSHUFB) insns.  */
+/* Variants of the fast scanner using SSSE3 shuffle (PSHUFB) insns.  */
 
-static const uchar *
+static inline const uchar *
 #ifndef __SSSE3__
 __attribute__((__target__("ssse3")))
 #endif
@@ -394,44 +391,106 @@  done:
   return s + __builtin_ctz (found);
 }
 
+static inline const uchar *
+#ifndef __AVX2__
+__attribute__((__target__("avx2")))
+#endif
+search_line_avx2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
+{
+  typedef char v32qi __attribute__ ((__vector_size__ (32)));
+  typedef v32qi v32qi_u __attribute__ ((__aligned__ (1)));
+  v32qi lut = {
+    1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?',
+    1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?'
+  };
+
+  int found;
+  /* Process three 32-byte chunks per iteration.  */
+  for (; ; s += 96)
+    {
+      v32qi data, t;
+      data = *(const v32qi_u *)s;
+      __asm__ ("" : "+x" (data));
+      t = __builtin_ia32_pshufb256 (lut, data);
+      if ((found = __builtin_ia32_pmovmskb256 (t == data)))
+	goto done;
+      /* Second chunk.  */
+      data = *(const v32qi_u *)(s + 32);
+      __asm__ ("" : "+x" (data));
+      t = __builtin_ia32_pshufb256 (lut, data);
+      if ((found = __builtin_ia32_pmovmskb256 (t == data)))
+	goto add_32;
+      /* Third chunk.  */
+      data = *(const v32qi_u *)(s + 64);
+      __asm__ ("" : "+x" (data));
+      t = __builtin_ia32_pshufb256 (lut, data);
+      if ((found = __builtin_ia32_pmovmskb256 (t == data)))
+	goto add_64;
+    }
+add_64:
+  s += 32;
+add_32:
+  s += 32;
+done:
+  return s + __builtin_ctz (found);
+}
+
 #else
-/* Work around out-dated assemblers without SSSE3 support.  */
+/* Work around out-dated assemblers without AVX2 support.  */
 #define search_line_ssse3 search_line_sse2
+#define search_line_avx2 search_line_sse2
 #endif
 
+#ifdef __AVX2__
+/* No need for CPU probing, just use the best available variant.  */
+#define search_line_fast search_line_avx2
+#else
 /* Check the CPU capabilities.  */
 
 #include "../gcc/config/i386/cpuid.h"
 
 typedef const uchar * (*search_line_fast_type) (const uchar *, const uchar *);
-static search_line_fast_type search_line_fast;
+static search_line_fast_type search_line_fast
+#if defined(__SSE2__)
+ = search_line_sse2;
+#else
+ = search_line_acc_char;
+#endif
 
 #define HAVE_init_vectorized_lexer 1
 static inline void
 init_vectorized_lexer (void)
 {
-  unsigned dummy, ecx = 0, edx = 0;
-  search_line_fast_type impl = search_line_acc_char;
-  int minimum = 0;
-
-#if defined(__SSSE3__)
-  minimum = 3;
-#elif defined(__SSE2__)
-  minimum = 2;
-#endif
+  unsigned a1, b1, c1, d1;
+
+  if (!__get_cpuid (1, &a1, &b1, &c1, &d1))
+    return;
 
-  if (minimum == 3)
-    impl = search_line_ssse3;
-  else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
+  if (c1 & bit_OSXSAVE)
     {
-      if (minimum == 3 || (ecx & bit_SSSE3))
-	impl = search_line_ssse3;
-      else if (minimum == 2 || (edx & bit_SSE2))
-	impl = search_line_sse2;
+      /* Check leaf 7 subleaf 0 for AVX2 ISA support.  */
+      unsigned a7, b7, c7, d7;
+      if (__get_cpuid_count (7, 0, &a7, &b7, &c7, &d7)
+	  && (b7 & bit_AVX2))
+	{
+	  /* Check XCR0 for YMM state support in the OS.  */
+	  unsigned xcr0h, xcr0l;
+	  __asm__ volatile (".byte 0x0f, 0x01, 0xd0" // xgetbv
+			    : "=d" (xcr0h), "=a" (xcr0l) : "c" (0));
+	  if ((xcr0l & 6) == 6)
+	    {
+	      search_line_fast = search_line_avx2;
+	      return;
+	    }
+	}
     }
 
-  search_line_fast = impl;
+  if (c1 & bit_SSSE3)
+    search_line_fast = search_line_ssse3;
+  else if (d1 & bit_SSE2)
+    search_line_fast = search_line_sse2;
 }
+#endif
 
 #elif (GCC_VERSION >= 4005) && defined(_ARCH_PWR8) && defined(__ALTIVEC__)