[0/7] AArch64 Optimize truncation, shifts and bitmask comparisons

Message ID patch-14899-tamar@arm.com

Tamar Christina Sept. 29, 2021, 4:19 p.m. UTC
  Hi All,

This patch series optimizes AArch64 codegen for narrowing operations,
shift-and-narrow combinations, and some comparisons with bitmasks.
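As a rough illustration (these loops are not the actual testcases from the
series), the kinds of scalar patterns being targeted look something like:

#include <stdint.h>

/* Illustrative only: a shift-and-narrow that can use a narrowing shift
   such as shrn, and a top-bit test that can be done without a separate
   and + compare.  */
void
shift_narrow (uint8_t *restrict out, uint16_t *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] >> 8;
}

void
top_bit_set (uint8_t *restrict out, uint8_t *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = (in[i] & 0x80) != 0;
}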

There are more to come but this is the first batch.

This series shows a 2% performance gain and a 0.05% code size reduction on
x264 in SPECCPU2017, and a 5-10% performance gain on various real-world
libraries that are optimized with intrinsics.

One part that is missing and needs additional work is being able to combine
stores into sequential locations.  Consider:

#include <arm_neon.h>
#define SIZE 8
#define SIZE2 8 * 8 * 8

extern void pop (uint8_t*);

void foo (int16x8_t row0, int16x8_t row1, int16x8_t row2, int16x8_t row3,
          int16x8_t row4, int16x8_t row5, int16x8_t row6, int16x8_t row7) {
    uint8_t block_nbits[SIZE2];

    uint8x8_t row0_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row0)));
    uint8x8_t row1_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row1)));
    uint8x8_t row2_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row2)));
    uint8x8_t row3_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row3)));
    uint8x8_t row4_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row4)));
    uint8x8_t row5_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row5)));
    uint8x8_t row6_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row6)));
    uint8x8_t row7_nbits = vsub_u8(vdup_n_u8(16),
                                   vmovn_u16(vreinterpretq_u16_s16(row7)));

    vst1_u8(block_nbits + 0 * SIZE, row0_nbits);
    vst1_u8(block_nbits + 1 * SIZE, row1_nbits);
    vst1_u8(block_nbits + 2 * SIZE, row2_nbits);
    vst1_u8(block_nbits + 3 * SIZE, row3_nbits);
    vst1_u8(block_nbits + 4 * SIZE, row4_nbits);
    vst1_u8(block_nbits + 5 * SIZE, row5_nbits);
    vst1_u8(block_nbits + 6 * SIZE, row6_nbits);
    vst1_u8(block_nbits + 7 * SIZE, row7_nbits);

    pop (block_nbits);
}

This currently generates:

movi v1.8b, #0x10

xtn v17.8b, v17.8h
xtn v23.8b, v23.8h
xtn v22.8b, v22.8h
xtn v4.8b, v21.8h
xtn v20.8b, v20.8h
xtn v19.8b, v19.8h
xtn v18.8b, v18.8h
xtn v24.8b, v24.8h

sub v17.8b, v1.8b, v17.8b
sub v23.8b, v1.8b, v23.8b
sub v22.8b, v1.8b, v22.8b
sub v16.8b, v1.8b, v4.8b
sub v8.8b, v1.8b, v20.8b
sub v4.8b, v1.8b, v19.8b
sub v2.8b, v1.8b, v18.8b
sub v1.8b, v1.8b, v24.8b

stp d17, d23, [sp, #224]
stp d22, d16, [sp, #240]
stp d8, d4, [sp, #256]
stp d2, d1, [sp, #272]

whereas the optimized codegen for this is:

movi v1.16b, #0x10

uzp1 v17.16b, v17.16b, v23.16b
uzp1 v22.16b, v22.16b, v4.16b
uzp1 v20.16b, v20.16b, v19.16b
uzp1 v24.16b, v18.16b, v24.16b

sub v17.16b, v1.16b, v17.16b
sub v18.16b, v1.16b, v22.16b
sub v19.16b, v1.16b, v20.16b
sub v20.16b, v1.16b, v24.16b

stp q17, q18, [sp, #224]
stp q19, q20, [sp, #256]

which requires us to recognize stores to sequential locations (the multiple
8-byte stp d stores in the current example) and merge them into wider ones.
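At the source level, the merged form for the first two rows of foo above
would correspond to something like the following sketch (illustrative only,
assuming little endian; uzp1 on the byte views keeps the low byte of each
16-bit lane, i.e. it performs both xtn's at once):

    uint8x16_t row01_nbits = vsubq_u8(vdupq_n_u8(16),
                                      vuzp1q_u8(vreinterpretq_u8_s16(row0),
                                                vreinterpretq_u8_s16(row1)));
    vst1q_u8(block_nbits + 0 * SIZE, row01_nbits);

The remaining rows pair up the same way, giving four q-register values that
can then be stored with two stp q instructions.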

This pattern happens reasonably often, but I am not sure how to handle it
yet.  For one, it requires st1 and friends to not be unspecs, which is
currently the focus of

https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579582.html

Thanks,
Tamar

--- inline copy of patch -- 

--