[2/3] libcpp: replace SSE4.2 helper with an SSSE3 one
Checks

Context                                        | Check   | Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm     | success | Build passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 | success | Build passed
Commit Message
Since the characters we are searching for (CR, LF, '\', '?') all have
distinct ASCII codes mod 16, PSHUFB can help match them all at once.
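(For illustration only, not part of the commit message: a scalar sketch of the
same mod-16 table lookup that PSHUFB performs for 16 bytes at once, mirroring
the lut used in the patch below.)

/* Each character we search for sits at index (C % 16) of the table;
   slot 0 holds 1 so that NUL bytes never match.  */
static const unsigned char lut[16] =
  { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' };

static int
is_special (unsigned char c)
{
  /* True exactly for LF (10), '\\' (92 -> slot 12), CR (13) and '?' (63 -> slot 15).  */
  return lut[c & 15] == c;
}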
libcpp/ChangeLog:
* lex.cc (search_line_sse42): Replace with...
(search_line_ssse3): ... this new function. Adjust the use...
(init_vectorized_lexer): ... here.
---
libcpp/lex.cc | 118 ++++++++++++++++++++------------------------------
1 file changed, 46 insertions(+), 72 deletions(-)
Comments
> - s += 16;
> + v16qi data, t;
> + /* Unaligned load. Reading beyond the final newline is safe, since
> + files.cc:read_file_guts pads the allocation. */
You need to change that function to use 32 byte padding as Jakub
pointed out (I forgot that too)
> + data = *(const v16qi_u *)s;
> + /* Prevent propagation into pshufb and pcmp as memory operand. */
> + __asm__ ("" : "+x" (data));
It would probably make sense to file a PR on this separately,
to eventually fix the compiler to not need such workarounds.
Not sure how much difference it makes however.
-Andi
On Tue, Aug 06, 2024 at 11:50:00AM -0700, Andi Kleen wrote:
> > - s += 16;
> > + v16qi data, t;
> > + /* Unaligned load. Reading beyond the final newline is safe, since
> > + files.cc:read_file_guts pads the allocation. */
>
> You need to change that function to use 32 byte padding as Jakub
> pointed out (I forgot that too)
Never mind, it's in the next patch.
On Tue, Aug 6, 2024 at 8:50 PM Andi Kleen <andi@firstfloor.org> wrote:
>
> > - s += 16;
> > + v16qi data, t;
> > + /* Unaligned load. Reading beyond the final newline is safe, since
> > + files.cc:read_file_guts pads the allocation. */
>
> You need to change that function to use 32 byte padding as Jakub
> pointed out (I forgot that too)
>
> > + data = *(const v16qi_u *)s;
> > + /* Prevent propagation into pshufb and pcmp as memory operand. */
> > + __asm__ ("" : "+x" (data));
>
> > It would probably make sense to file a PR on this separately,
> > to eventually fix the compiler to not need such workarounds.
> Not sure how much difference it makes however.
This is probably to work around bugs in older compiler versions? If
not I agree.
Otherwise the patch is OK.
Thanks,
Richard.
> -Andi
On Wed, 7 Aug 2024, Richard Biener wrote:
> > > + data = *(const v16qi_u *)s;
> > > + /* Prevent propagation into pshufb and pcmp as memory operand. */
> > > + __asm__ ("" : "+x" (data));
> >
> > It would probably make sense to file a PR on this separately,
> > to eventually fix the compiler to not need such workarounds.
> > Not sure how much difference it makes however.
>
> This is probably to work around bugs in older compiler versions? If
> not I agree.
This is deliberate hand-tuning to avoid a subtle issue: pshufb with a memory
operand is not micro-fused on Intel, so with the propagation it is two uops
early in the CPU front-end.
The "propagation" actually falls out of IRA/LRA decisions, and stopped
happening in gcc-14. I'm not sure if there were relevant RA changes.
In any case, this can potentially flip-flop in the future again.
Considering the trunk gets this right, I think the next move is to
add a testcase for this, not a PR, correct?
> Otherwise the patch is OK.
Still OK with the asms, or would you prefer them be taken out?
Thanks.
Alexander
On Wed, Aug 7, 2024 at 11:08 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Wed, 7 Aug 2024, Richard Biener wrote:
>
> > > > + data = *(const v16qi_u *)s;
> > > > + /* Prevent propagation into pshufb and pcmp as memory operand. */
> > > > + __asm__ ("" : "+x" (data));
> > >
> > > It would probably make sense to file a PR on this separately,
> > > to eventually fix the compiler to not need such workarounds.
> > > Not sure how much difference it makes however.
> >
> > This is probably to work around bugs in older compiler versions? If
> > not I agree.
>
> This is deliberate hand-tuning to avoid a subtle issue: pshufb with a memory
> operand is not micro-fused on Intel, so with the propagation it is two uops
> early in the CPU front-end.
>
> The "propagation" actually falls out of IRA/LRA decisions, and stopped
> happening in gcc-14. I'm not sure if there were relevant RA changes.
> In any case, this can potentially flip-flop in the future again.
>
> Considering the trunk gets this right, I think the next move is to
> add a testcase for this, not a PR, correct?
Well, merging the memory operand into the pshufb would be wrong - embedded
memory ops are always considered aligned, no?
> > Otherwise the patch is OK.
>
> Still OK with the asms, or would you prefer them be taken out?
I think it's OK with the asms.
Richard.
> Thanks.
>
> Alexander
On Wed, Aug 07, 2024 at 01:16:20PM +0200, Richard Biener wrote:
> Well, merging the memory operand into the pshufb would be wrong - embedded
> memory ops are always considered aligned, no?
Depends. For the VEX/EVEX encodings it can be unaligned; for the pre-AVX
encoding it must be aligned, except in the explicitly unaligned instructions.
Jakub
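(Editorial aside, not part of the thread: a minimal GNU C sketch of the
distinction discussed above; the v16qi/v16qi_u names match the patch, the
function names are made up.)

typedef char  v16qi   __attribute__ ((__vector_size__ (16)));
typedef v16qi v16qi_u __attribute__ ((__aligned__ (1)));

/* Plain vector deref: the compiler may assume 16-byte alignment and emit
   movdqa, or fold the load into a legacy-encoded SSE instruction whose
   memory operand must be aligned.  */
v16qi load_assumed_aligned (const char *p) { return *(const v16qi *) p; }

/* Deref through the aligned(1) typedef: an explicitly unaligned access
   (movdqu, or vmovdqu under -mavx), which is what the patch relies on.  */
v16qi load_unaligned (const char *p) { return *(const v16qi_u *) p; }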
On Wed, 7 Aug 2024, Richard Biener wrote:
> > > This is probably to work around bugs in older compiler versions? If
> > > not I agree.
> >
> > This is deliberate hand-tuning to avoid a subtle issue: pshufb with a memory
> > operand is not micro-fused on Intel, so with the propagation it is two uops
> > early in the CPU front-end.
> >
> > The "propagation" actually falls out of IRA/LRA decisions, and stopped
> > happening in gcc-14. I'm not sure if there were relevant RA changes.
> > In any case, this can potentially flip-flop in the future again.
> >
> > Considering the trunk gets this right, I think the next move is to
> > add a testcase for this, not a PR, correct?
>
> Well, merging the memory operand into the pshufb would be wrong - embedded
> memory ops are always considered aligned, no?
In SSE yes, in AVX no. For search_line_ssse3 the asms help if it is compiled
with e.g. -march=sandybridge (i.e. for a CPU that has AVX but lacks AVX2):
then VEX-encoded SSE instructions accept misaligned memory, and we want to
prevent that here.
Alexander
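(Editorial aside, not part of the thread: a minimal sketch of the empty-asm
barrier idiom under discussion, using the same types as the patch; the helper
name is made up.)

typedef char  v16qi   __attribute__ ((__vector_size__ (16)));
typedef v16qi v16qi_u __attribute__ ((__aligned__ (1)));

static inline v16qi
load16_keep_in_register (const unsigned char *p)
{
  v16qi data = *(const v16qi_u *) p;  /* explicit unaligned load (movdqu)  */
  /* The empty asm with an "+x" (SSE register) constraint makes DATA opaque
     to the optimizer, so the load cannot be propagated into a following
     pshufb / pcmpeqb as a memory operand.  */
  __asm__ ("" : "+x" (data));
  return data;
}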
On Wed, Aug 7, 2024 at 1:37 PM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Wed, 7 Aug 2024, Richard Biener wrote:
>
> > > > This is probably to work around bugs in older compiler versions? If
> > > > not I agree.
> > >
> > > This is deliberate hand-tuning to avoid a subtle issue: pshufb with a memory
> > > operand is not micro-fused on Intel, so with the propagation it is two uops
> > > early in the CPU front-end.
> > >
> > > The "propagation" actually falls out of IRA/LRA decisions, and stopped
> > > happening in gcc-14. I'm not sure if there were relevant RA changes.
> > > In any case, this can potentially flip-flop in the future again.
> > >
> > > Considering the trunk gets this right, I think the next move is to
> > > add a testcase for this, not a PR, correct?
> >
> > Well, merging the memory operand into the pshufb would be wrong - embedded
> > memory ops are always considered aligned, no?
>
> In SSE yes, in AVX no. For search_line_ssse3 the asms help if it is compiled
> with e.g. -march=sandybridge (i.e. for a CPU that has AVX but lacks AVX2):
> then VEX-encoded SSE instructions accept misaligned memory, and we want to
> prevent that here.
Ah, yeah - I think there are even existing bug reports that we're too happy to
duplicate a memory operand even into multiple insns.
Richard.
>
> Alexander
@@ -345,84 +345,58 @@ search_line_sse2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
}
#ifdef HAVE_AVX2
-/* A version of the fast scanner using SSE 4.2 vectorized string insns. */
+/* A version of the fast scanner using SSSE3 shuffle (PSHUFB) insns. */
static const uchar *
-#ifndef __SSE4_2__
-__attribute__((__target__("sse4.2")))
+#ifndef __SSSE3__
+__attribute__((__target__("ssse3")))
#endif
-search_line_sse42 (const uchar *s, const uchar *end)
+search_line_ssse3 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
{
typedef char v16qi __attribute__ ((__vector_size__ (16)));
- static const v16qi search = { '\n', '\r', '?', '\\' };
-
- uintptr_t si = (uintptr_t)s;
- uintptr_t index;
-
- /* Check for unaligned input. */
- if (si & 15)
- {
- v16qi sv;
-
- if (__builtin_expect (end - s < 16, 0)
- && __builtin_expect ((si & 0xfff) > 0xff0, 0))
- {
- /* There are less than 16 bytes left in the buffer, and less
- than 16 bytes left on the page. Reading 16 bytes at this
- point might generate a spurious page fault. Defer to the
- SSE2 implementation, which already handles alignment. */
- return search_line_sse2 (s, end);
- }
-
- /* ??? The builtin doesn't understand that the PCMPESTRI read from
- memory need not be aligned. */
- sv = __builtin_ia32_loaddqu ((const char *) s);
- index = __builtin_ia32_pcmpestri128 (search, 4, sv, 16, 0);
-
- if (__builtin_expect (index < 16, 0))
- goto found;
-
- /* Advance the pointer to an aligned address. We will re-scan a
- few bytes, but we no longer need care for reading past the
- end of a page, since we're guaranteed a match. */
- s = (const uchar *)((si + 15) & -16);
- }
-
- /* Main loop, processing 16 bytes at a time. */
-#ifdef __GCC_ASM_FLAG_OUTPUTS__
- while (1)
+ typedef v16qi v16qi_u __attribute__ ((__aligned__ (1)));
+ /* Helper vector for pshufb-based matching:
+ each character C we're searching for is at position (C % 16). */
+ v16qi lut = { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' };
+ static_assert ('\n' == 10 && '\r' == 13 && '\\' == 92 && '?' == 63);
+
+ int found;
+ /* Process three 16-byte chunks per iteration. */
+ for (; ; s += 48)
{
- char f;
-
- /* By using inline assembly instead of the builtin,
- we can use the result, as well as the flags set. */
- __asm ("%vpcmpestri\t$0, %2, %3"
- : "=c"(index), "=@ccc"(f)
- : "m"(*s), "x"(search), "a"(4), "d"(16));
- if (f)
- break;
-
- s += 16;
+ v16qi data, t;
+ /* Unaligned load. Reading beyond the final newline is safe, since
+ files.cc:read_file_guts pads the allocation. */
+ data = *(const v16qi_u *)s;
+ /* Prevent propagation into pshufb and pcmp as memory operand. */
+ __asm__ ("" : "+x" (data));
+ t = __builtin_ia32_pshufb128 (lut, data);
+ if ((found = __builtin_ia32_pmovmskb128 (t == data)))
+ goto done;
+ /* Second chunk. */
+ data = *(const v16qi_u *)(s + 16);
+ __asm__ ("" : "+x" (data));
+ t = __builtin_ia32_pshufb128 (lut, data);
+ if ((found = __builtin_ia32_pmovmskb128 (t == data)))
+ goto add_16;
+ /* Third chunk. */
+ data = *(const v16qi_u *)(s + 32);
+ __asm__ ("" : "+x" (data));
+ t = __builtin_ia32_pshufb128 (lut, data);
+ if ((found = __builtin_ia32_pmovmskb128 (t == data)))
+ goto add_32;
}
-#else
- s -= 16;
- /* By doing the whole loop in inline assembly,
- we can make proper use of the flags set. */
- __asm ( ".balign 16\n"
- "0: add $16, %1\n"
- " %vpcmpestri\t$0, (%1), %2\n"
- " jnc 0b"
- : "=&c"(index), "+r"(s)
- : "x"(search), "a"(4), "d"(16));
-#endif
-
- found:
- return s + index;
+add_32:
+ s += 16;
+add_16:
+ s += 16;
+done:
+ return s + __builtin_ctz (found);
}
#else
-/* Work around out-dated assemblers without sse4 support. */
-#define search_line_sse42 search_line_sse2
+/* Work around out-dated assemblers without SSSE3 support. */
+#define search_line_ssse3 search_line_sse2
#endif
/* Check the CPU capabilities. */
@@ -440,18 +414,18 @@ init_vectorized_lexer (void)
search_line_fast_type impl = search_line_acc_char;
int minimum = 0;
-#if defined(__SSE4_2__)
+#if defined(__SSSE3__)
minimum = 3;
#elif defined(__SSE2__)
minimum = 2;
#endif
if (minimum == 3)
- impl = search_line_sse42;
+ impl = search_line_ssse3;
else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
{
- if (minimum == 3 || (ecx & bit_SSE4_2))
- impl = search_line_sse42;
+ if (minimum == 3 || (ecx & bit_SSSE3))
+ impl = search_line_ssse3;
else if (minimum == 2 || (edx & bit_SSE2))
impl = search_line_sse2;
}