Message ID | DB6PR0801MB2053D467ED5AC3E8860BD0EB83D10@DB6PR0801MB2053.eurprd08.prod.outlook.com |
---|---|
State | New, archived |
On Wednesday 14 March 2018 07:34 PM, Wilco Dijkstra wrote:
> Why not use lsr limit_wd, limit, 3? We have 3-operand shifts on AArch64!
Because I was half asleep and just followed what Szabolcs said ;)
I'll fix that up later (I can barely sit today, my back is killing me)
or please feel free to fix up if you'd like to.
Thanks,
Siddhesh
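For readers following the exchange, here is a minimal C model of the two forms being compared. It is only a sketch: the function names are hypothetical, and it models the arithmetic rather than the actual glibc assembly. It shows that copying `limit` and then shifting in place gives the same value as a single three-operand shift that reads `limit` and writes `limit_wd`.

```c
#include <stdint.h>

/* Hypothetical model of the two-instruction sequence:
     mov limit_wd, limit
     lsr limit_wd, limit_wd, #3  */
static uint64_t limit_wd_two_step(uint64_t limit)
{
    uint64_t limit_wd = limit;  /* mov: copy the byte limit */
    limit_wd >>= 3;             /* lsr: logical shift right by 3, in place */
    return limit_wd;
}

/* Hypothetical model of the single three-operand form:
     lsr limit_wd, limit, #3  */
static uint64_t limit_wd_one_step(uint64_t limit)
{
    return limit >> 3;          /* source and destination registers differ */
}
```

Both functions leave `limit` itself unchanged, so the single shift is a drop-in replacement for the pair.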
On Wednesday 14 March 2018 07:50 PM, Siddhesh Poyarekar wrote:
> On Wednesday 14 March 2018 07:34 PM, Wilco Dijkstra wrote:
>> Why not use lsr limit_wd, limit, 3? We have 3-operand shifts on AArch64!
>
> Because I was half asleep and just followed what Szabolcs said ;)
>
> I'll fix that up later (I can barely sit today, my back is killing me)
> or please feel free to fix up if you'd like to.
I have fixed this now:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commitdiff;h=b47c3e7637efb77818cbef55dcd0ed1f0ea0ddf1
Thanks,
Siddhesh
Siddhesh Poyarekar wrote:
> I have fixed this now:
>
> https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commitdiff;h=b47c3e7637efb77818cbef55dcd0ed1f0ea0ddf1
Thanks, that's fine for now. We should look into tuning this further in the future,
I think both strcmp and strncmp should be able to be almost as fast as memcmp.
Wilco
On Thursday 15 March 2018 07:14 PM, Wilco Dijkstra wrote:
> Thanks, that's fine for now. We should look into tuning this further in the future,
> I think both strcmp and strncmp should be able to be almost as fast as memcmp.
Agreed, I haven't taken it off my plate. This was a pretty big gain to
keep holding on to though, which is why I pushed it out early.
Siddhesh
--- a/sysdeps/aarch64/strncmp.S
+++ b/sysdeps/aarch64/strncmp.S
@@ -208,13 +208,15 @@ L(done):
 /* Align the SRC1 to a dword by doing a bytewise compare and then do the dword loop. */
 L(try_misaligned_words):
-	mov	limit_wd, limit, lsr #3
+	mov	limit_wd, limit
+	lsr	limit_wd, limit_wd, #3
 	cbz	count, L(do_misaligned)
 	neg	count, count
 	and	count, count, #7
 	sub	limit, limit, count
-	mov	limit_wd, limit, lsr #3
+	mov	limit_wd, limit
+	lsr	limit_wd, limit_wd, #3
Also it seems to me it would be far easier to subtract 8 from limit in the main loop.
This means we don't ever need limit_wd, and avoids having to do this later: