strstr benchtest.

Message ID 20150512210946.GB24057@domone
State Not applicable
Headers

Commit Message

Ondrej Bilka May 12, 2015, 9:09 p.m. UTC
  On Tue, May 12, 2015 at 10:53:11PM +0200, Ondřej Bílka wrote:
> Hi,
>
As I mentioned strstr benchmarking I used following.

Current benchmark is as useless as rest of string benchmarks. To compile
you need to delete benchtest/bench-strstr.o as this is tested by header
inclusion and benchtest Makefile doesn't detect that strstr.c changed.

Main problem of current benchtest was very biased input. I replaced that
with randomly generated ones that are closer to average case.

A problem with periodic input with period 16 is that it would skew
branch prediction a lot. It also creates periodic pattern for branches
and they would be predicted perfectly while in reality performance would
be magnitude worse. 

Second problem is that it was a worst-case scenario for this heuristic
instead average as first three characters form searched triplet instead
randomly distributed when characters are selected independently.

I couldn't get definite answer. As I said before this aproach is good
when string has high entropy and problematic on lower-entropy ones.
Selection which is better depends on entropy of strings used by given
application. It isn't always better but benefits outweigth the cost.
  

Comments

Florian Weimer May 14, 2015, 9:33 a.m. UTC | #1
On 05/12/2015 11:09 PM, Ondřej Bílka wrote:
> Main problem of current benchtest was very biased input. I replaced that
> with randomly generated ones that are closer to average case.

But strstr is not run on random strings.

Is there some easy way to generated traces of actual strstr arguments
from a running system, perhaps using Systemtap?
  
Ondrej Bilka May 14, 2015, 10:41 a.m. UTC | #2
On Thu, May 14, 2015 at 11:33:58AM +0200, Florian Weimer wrote:
> On 05/12/2015 11:09 PM, Ondřej Bílka wrote:
> > Main problem of current benchtest was very biased input. I replaced that
> > with randomly generated ones that are closer to average case.
> 
> But strstr is not run on random strings.
> 
> Is there some easy way to generated traces of actual strstr arguments
> from a running system, perhaps using Systemtap?
>
Use my profiler here.

http://kam.mff.cuni.cz/~ondra/benchmark_string/strstr_profile250813.tar.bz2

But it won't help you much with this case. Haystacks are really short,
61% of calls end with match in first 16 bytes. See:

http://kam.mff.cuni.cz/~ondra/benchmark_string/profile/results/result.html strstr function
or
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/strstr_profile/results_gcc/result.html 

While its extremely valuable information and you should focus to
optimize header it would probably report little change as both
implementations just do initial strchr.

Longer strings where you could see improvement don't exist in practice.
So you are left with generating some data.
  
Ondrej Bilka June 3, 2015, 11:10 a.m. UTC | #3
On Thu, May 14, 2015 at 11:33:58AM +0200, Florian Weimer wrote:
> On 05/12/2015 11:09 PM, Ondřej Bílka wrote:
> > Main problem of current benchtest was very biased input. I replaced that
> > with randomly generated ones that are closer to average case.
> 
> But strstr is not run on random strings.
> 
> Is there some easy way to generated traces of actual strstr arguments
> from a running system, perhaps using Systemtap?
> 
To be able to rerun these I wrote a dryrun, I send a previous version of
it some time ago. As systemtap problem is that it offers only problems
with zero benefit as you need just write strings to disk which is easier
just in pure c without unnecessary wrapping of that, I put current
version here:

http://kam.mff.cuni.cz/~ondra/dryrun.tar.bz2

You should record data with LD_PRELOAD=path/record_strstr.so bash

These could be replayed with LD_PRELOADed suitable alternative
implementation with

LD_PRELOAD=new_strstr.so ./bench_strstr

Then you can inspect data with ./show_strstr and finally summaries from
these are created by

./summary #for brief one
or
./summary_strstr -u # to show only unique binaries
or
./summary_strstr -a # to also show repeated recordings of same binary

For me ./summary_strstr -u shows following and I wont submit raw data as
it contains at least list of my mail headers.

replaying mc

average size  4.62 needle size  1.01 comparisons   6.62 digraphs  0.02 trigraphs  0.00
calls     2440 succeed  99.7% latencies -18.7 -18.6
s1    aligned to 4 bytes  99.9% aligned to 8 bytes  94.5% aligned to 16 bytes  94.5%
s2    aligned to 4 bytes  99.8% aligned to 8 bytes  99.8% aligned to 16 bytes  99.7%
needle found in n bytes: n <= 0:   0.2% n <= 1:  17.0% n <= 2:  66.4% n <= 3:  69.1%  n <= 4:  71.0% n <= 8:  81.4% n <= 16:  95.6% n <= 32:  99.8% n <= 64: 100.0%
needle size: n <= 0:   0.0% n <= 1:  99.8% n <= 2:  99.8% n <= 3:  99.8%  n <= 4:  99.9% n <= 8: 100.0% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%

replaying mutt

average size 20.60 needle size  2.01 comparisons  20.90 digraphs  0.08 trigraphs  0.00
calls     3505 succeed   8.4% latencies -97.4 -88.6
s1    aligned to 4 bytes  99.7% aligned to 8 bytes  99.5% aligned to 16 bytes  99.5%
s2    aligned to 4 bytes   1.4% aligned to 8 bytes   0.7% aligned to 16 bytes   0.1%
needle found in n bytes: n <= 0:   8.0% n <= 1:   8.3% n <= 2:   8.3% n <= 3:   8.9%  n <= 4:   9.0% n <= 8:  12.2% n <= 16:  68.9% n <= 32:  84.1% n <= 64:  94.8%
needle size: n <= 0:   0.0% n <= 1:   0.0% n <= 2:  99.3% n <= 3:  99.9%  n <= 4:  99.9% n <= 8: 100.0% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%

replaying /usr/bin/python

average size 53.38 needle size  8.55 comparisons  54.57 digraphs  0.10 trigraphs  0.00
calls       60 succeed   0.0% latencies -73.0 -68.5
s1    aligned to 4 bytes 100.0% aligned to 8 bytes 100.0% aligned to 16 bytes 100.0%
s2    aligned to 4 bytes  26.7% aligned to 8 bytes  26.7% aligned to 16 bytes  26.7%
needle found in n bytes: n <= 0:   1.7% n <= 1:   1.7% n <= 2:   1.7% n <= 3:   1.7%  n <= 4:   1.7% n <= 8:   1.7% n <= 16:   1.7% n <= 32:  21.7% n <= 64:  60.0%
needle size: n <= 0:   1.7% n <= 1:   1.7% n <= 2:   1.7% n <= 3:   1.7%  n <= 4:  26.7% n <= 8:  26.7% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%

replaying /opt/google/chrome/chrome

average size  6.95 needle size  1.35 comparisons   8.63 digraphs  0.05 trigraphs  0.00
calls     1990 succeed  79.8% latencies -44.7 -46.4
s1    aligned to 4 bytes  96.7% aligned to 8 bytes  95.5% aligned to 16 bytes  95.0%
s2    aligned to 4 bytes  90.5% aligned to 8 bytes  89.6% aligned to 16 bytes  75.4%
needle found in n bytes: n <= 0:   0.2% n <= 1:   0.2% n <= 2:   0.3% n <= 3:   0.4%  n <= 4:  46.0% n <= 8:  78.3% n <= 16:  96.6% n <= 32:  99.7% n <= 64:  99.9%
needle size: n <= 0:   0.1% n <= 1:  90.6% n <= 2:  95.0% n <= 3:  95.7%  n <= 4:  95.7% n <= 8:  99.0% n <= 16:  99.8% n <= 32: 100.0% n <= 64: 100.0%

replaying /usr/lib/kde4/libexec/kdm_greet

average size 11.38 needle size  8.99 comparisons  11.41 digraphs  0.00 trigraphs  0.00
calls    23221 succeed   0.2% latencies -14.4 -13.0
s1    aligned to 4 bytes 100.0% aligned to 8 bytes 100.0% aligned to 16 bytes 100.0%
s2    aligned to 4 bytes   0.1% aligned to 8 bytes   0.0% aligned to 16 bytes   0.0%
needle found in n bytes: n <= 0:   0.0% n <= 1:   0.0% n <= 2:   0.0% n <= 3:   0.0%  n <= 4:   0.0% n <= 8:   0.0% n <= 16:  99.8% n <= 32:  99.8% n <= 64:  99.8%
needle size: n <= 0:   0.0% n <= 1:   0.0% n <= 2:   0.0% n <= 3:   0.0%  n <= 4:   0.2% n <= 8:   0.3% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%

replaying su

average size 40.09 needle size  9.60 comparisons  45.20 digraphs  0.29 trigraphs  0.00
calls       91 succeed   0.0% latencies  -2.1  -0.4
s1    aligned to 4 bytes 100.0% aligned to 8 bytes  51.6% aligned to 16 bytes  51.6%
s2    aligned to 4 bytes   1.1% aligned to 8 bytes   1.1% aligned to 16 bytes   1.1%
needle found in n bytes: n <= 0:   1.1% n <= 1:   1.1% n <= 2:   1.1% n <= 3:   1.1%  n <= 4:   1.1% n <= 8:   4.4% n <= 16:  26.4% n <= 32:  28.6% n <= 64: 100.0%
needle size: n <= 0:   1.1% n <= 1:   1.1% n <= 2:   1.1% n <= 3:   1.1%  n <= 4:   1.1% n <= 8:   1.1% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%

replaying make

average size 32.87 needle size  1.20 comparisons  34.73 digraphs  0.01 trigraphs  0.01
calls      277 succeed  91.7% latencies -12.5 -10.5
s1    aligned to 4 bytes  31.0% aligned to 8 bytes  17.3% aligned to 16 bytes  13.4%
s2    aligned to 4 bytes 100.0% aligned to 8 bytes 100.0% aligned to 16 bytes  98.6%
needle found in n bytes: n <= 0:   0.4% n <= 1:   4.3% n <= 2:   4.7% n <= 3:   6.9%  n <= 4:   7.6% n <= 8:   9.7% n <= 16:  14.8% n <= 32:  35.0% n <= 64:  99.6%
needle size: n <= 0:   0.4% n <= 1:  95.3% n <= 2:  95.7% n <= 3:  97.1%  n <= 4:  97.1% n <= 8: 100.0% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%

replaying gdb

average size 14.69 needle size  6.77 comparisons  15.46 digraphs  0.06 trigraphs  0.01
calls    17206 succeed   5.0% latencies   8.4  10.2
s1    aligned to 4 bytes  52.6% aligned to 8 bytes  45.0% aligned to 16 bytes  41.3%
s2    aligned to 4 bytes   9.6% aligned to 8 bytes   9.1% aligned to 16 bytes   9.1%
needle found in n bytes: n <= 0:   0.2% n <= 1:   1.0% n <= 2:   4.3% n <= 3:   5.3%  n <= 4:   7.6% n <= 8:  32.9% n <= 16:  81.9% n <= 32:  92.1% n <= 64:  96.7%
needle size: n <= 0:   0.0% n <= 1:   0.0% n <= 2:  27.9% n <= 3:  62.5%  n <= 4:  64.8% n <= 8:  65.1% n <= 16:  92.7% n <= 32: 100.0% n <= 64: 100.0%

replaying /usr/bin/urxvt

average size 10.04 needle size  9.01 comparisons  10.05 digraphs  0.00 trigraphs  0.00
calls    11674 succeed   0.0% latencies  -5.5  -0.7
s1    aligned to 4 bytes 100.0% aligned to 8 bytes 100.0% aligned to 16 bytes 100.0%
s2    aligned to 4 bytes   0.6% aligned to 8 bytes   0.3% aligned to 16 bytes   0.2%
needle found in n bytes: n <= 0:   0.0% n <= 1:   0.0% n <= 2:   0.0% n <= 3:   0.0%  n <= 4:   0.0% n <= 8:   0.0% n <= 16:  99.8% n <= 32: 100.0% n <= 64: 100.0%
needle size: n <= 0:   0.0% n <= 1:   0.0% n <= 2:   0.0% n <= 3:   0.0%  n <= 4:   0.0% n <= 8:   0.3% n <= 16:  99.9% n <= 32: 100.0% n <= 64: 100.0%

replaying man

average size  7.91 needle size 29.43 comparisons  10.99 digraphs  0.24 trigraphs  0.24
calls       74 succeed  33.8% latencies   0.0   0.0
s1    aligned to 4 bytes 100.0% aligned to 8 bytes 100.0% aligned to 16 bytes 100.0%
s2    aligned to 4 bytes  63.5% aligned to 8 bytes  63.5% aligned to 16 bytes  62.2%
needle found in n bytes: n <= 0:  47.3% n <= 1:  47.3% n <= 2:  47.3% n <= 3:  51.4%  n <= 4:  51.4% n <= 8:  51.4% n <= 16:  83.8% n <= 32: 100.0% n <= 64: 100.0%
needle size: n <= 0:   1.4% n <= 1:   1.4% n <= 2:   1.4% n <= 3:   2.7%  n <= 4:   5.4% n <= 8:   6.8% n <= 16:  39.2% n <= 32:  55.4% n <= 64: 100.0%

replaying troff

average size 11.87 needle size  9.00 comparisons  11.87 digraphs  0.00 trigraphs  0.00
calls     5498 succeed   0.0% latencies   0.0   0.0
s1    aligned to 4 bytes 100.0% aligned to 8 bytes 100.0% aligned to 16 bytes 100.0%
s2    aligned to 4 bytes   0.0% aligned to 8 bytes   0.0% aligned to 16 bytes   0.0%
needle found in n bytes: n <= 0:   0.0% n <= 1:   0.0% n <= 2:   0.0% n <= 3:   0.0%  n <= 4:   0.0% n <= 8:   0.0% n <= 16:  81.5% n <= 32: 100.0% n <= 64: 100.0%
needle size: n <= 0:   0.0% n <= 1:   0.0% n <= 2:   0.0% n <= 3:   0.0%  n <= 4:   0.0% n <= 8:   0.0% n <= 16: 100.0% n <= 32: 100.0% n <= 64: 100.0%
  

Patch

diff --git a/benchtests/bench-strstr.c b/benchtests/bench-strstr.c
index 74f3ee8..0412c1b 100644
--- a/benchtests/bench-strstr.c
+++ b/benchtests/bench-strstr.c
@@ -81,31 +81,15 @@  do_test (size_t align1, size_t align2, size_t len1, size_t len2,
   char *s1 = (char *) (buf1 + align1);
   char *s2 = (char *) (buf2 + align2);
 
-  static const char d[] = "1234567890abcdef";
-#define dl (sizeof (d) - 1)
-  char *ss2 = s2;
-  for (size_t l = len2; l > 0; l = l > dl ? l - dl : 0)
+  for (size_t l = 0; l < len2; l++)
     {
-      size_t t = l > dl ? dl : l;
-      ss2 = mempcpy (ss2, d, t);
+      s2[l] = (rand () % 128) + 1;
     }
   s2[len2] = '\0';
 
-  if (fail)
+  for (size_t l = 0; l < len1; l++)
     {
-      char *ss1 = s1;
-      for (size_t l = len1; l > 0; l = l > dl ? l - dl : 0)
-	{
-	  size_t t = l > dl ? dl : l;
-	  memcpy (ss1, d, t);
-	  ++ss1[len2 > 7 ? 7 : len2 - 1];
-	  ss1 += t;
-	}
-    }
-  else
-    {
-      memset (s1, '0', len1);
-      memcpy (s1 + len1 - len2, s2, len2);
+      s1[l] = (rand () % 128) + 1;
     }
   s1[len1] = '\0';