From patchwork Mon Sep 29 10:30:18 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Leonhard Holz <leonhard.holz@web.de>
X-Patchwork-Id: 3017
Received: (qmail 15037 invoked by alias); 29 Sep 2014 10:30:33 -0000
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
Delivered-To: mailing list libc-alpha@sourceware.org
Received: (qmail 15010 invoked by uid 89); 29 Sep 2014 10:30:29 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-2.3 required=5.0 tests=AWL, BAYES_00,
	FREEMAIL_FROM, KAM_INFOUSME, RCVD_IN_DNSWL_NONE,
	RP_MATCHES_RCVD, SPF_PASS autolearn=ham version=3.3.2
X-HELO: mout.web.de
Message-ID: <542934BA.9020000@web.de>
Date: Mon, 29 Sep 2014 12:30:18 +0200
From: Leonhard Holz <leonhard.holz@web.de>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
	rv:24.0) Gecko/20100101 Thunderbird/24.6.0
MIME-Version: 1.0
To: libc-alpha@sourceware.org
Subject: Re: [Patch] [BZ 15884] strcoll: improve performance by removing the
	cache
References: <542282C1.2080504@web.de>
	<20140924100247.GN1716@spoyarek.pnq.redhat.com>
In-Reply-To: <20140924100247.GN1716@spoyarek.pnq.redhat.com>
X-UI-Out-Filterresults: notjunk:1;

OK, following Siddhesh's remarks I split the patch in two: this one, which 
removes the cache, and a follow-up that introduces the inlining. 
The formatting should be better now, and I tried to minimize the number of 
changed lines.

2014-09-29  Leonhard Holz  <leonhard.holz@web.de>

	[BZ #15884]
	* string/strcoll_l.c: Remove weight and rules cache.
	* benchtests/bench-strcoll.c: New benchmark for strcoll.
	* benchtests/Makefile (string-bench): Add strcoll.
	* benchtests/strcoll-inputs/glibc_files.txt: New benchmark data.
	* benchtests/strcoll-inputs/lorem_ipsum_vietnamese.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_latin.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_arabic.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_l33tspeak.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_chinese.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_czech.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_old_english.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_danish.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_polish.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_french.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_portugese.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_greek.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_russian.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_hebrew.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_spain.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_hindi.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_swedish.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_hungarian.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_turkish.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_icelandic.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_italian.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_yugoslavian.txt: Likewise.
	* benchtests/strcoll-inputs/lorem_ipsum_japanese.txt: Likewise.
	* localedata/Makefile: Generate locales needed for benchtests.

The numbers (change in benchmark runtime relative to the old implementation, 
negative means faster) are:

glibc_files.txt                     en_US.UTF-8    -46.72%
lorem_ipsum_vietnamese.txt          vi_VN.UTF-8    -36.60%
lorem_ipsum_latin.txt               en_US.UTF-8    -45.83%
lorem_ipsum_arabic.txt              ar_SA.UTF-8    -34.11%
lorem_ipsum_l33tspeak.txt           en_US.UTF-8    -46.28%
lorem_ipsum_chinese.txt             zh_CN.UTF-8    +30.95%
lorem_ipsum_czech.txt               cs_CZ.UTF-8    -36.17%
lorem_ipsum_old_english.txt         en_GB.UTF-8    -35.22%
lorem_ipsum_danish.txt              da_DK.UTF-8    -39.22%
lorem_ipsum_polish.txt              pl_PL.UTF-8    -42.62%
lorem_ipsum_french.txt              fr_FR.UTF-8    -31.09%
lorem_ipsum_portugese.txt           pt_PT.UTF-8    -30.27%
lorem_ipsum_greek.txt               el_GR.UTF-8    -32.07%
lorem_ipsum_russian.txt             ru_RU.UTF-8    -36.00%
lorem_ipsum_hebrew.txt              iw_IL.UTF-8    -41.44%
lorem_ipsum_spain.txt               es_ES.UTF-8    -35.64%
lorem_ipsum_hindi.txt               hi_IN.UTF-8    -00.17%
lorem_ipsum_swedish.txt             sv_SE.UTF-8    -38.85%
lorem_ipsum_hungarian.txt           hu_HU.UTF-8    -21.82%
lorem_ipsum_turkish.txt             tr_TR.UTF-8    -38.08%
lorem_ipsum_icelandic.txt           is_IS.UTF-8    -43.40%
lorem_ipsum_italian.txt             it_IT.UTF-8    -30.52%
lorem_ipsum_yugoslavian.txt         sr_RS.UTF-8    -36.41%
lorem_ipsum_japanese.txt            ja_JP.UTF-8    +18.00%

Chinese and Japanese are a bit special: AFAIK every character is a word in 
these languages, so the benchmark is probably comparing whole sentences. 
These languages also complete much faster in absolute numbers, about 1e6 
vs. 3e6 (new) / 5e6 (old) for alphabetic languages.
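
For reference, here is a minimal sketch of how a percentage like the ones in 
the table can be derived from two timings. It assumes the figures are the 
relative change of the new timing against the old one; relative_change is a 
hypothetical helper for illustration and is not part of the patch.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the patch: relative change of the new
   timing against the old one, in percent.  A negative value means the
   new implementation is faster.  Assumes OLD_TIME is nonzero.  */
double
relative_change (double old_time, double new_time)
{
  return (new_time - old_time) / old_time * 100.0;
}

/* Print one table row in the same layout as the list above.  */
void
print_row (const char *file, const char *locale, double old_time,
	   double new_time)
{
  printf ("%-35s %-14s %+.2f%%\n", file, locale,
	  relative_change (old_time, new_time));
}
```

With the rough absolute numbers quoted above for alphabetic languages (5e6 
old, 3e6 new) this gives about -40%, which is in line with most rows.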

Best,
Leonhard

On 24.09.2014 12:02, Siddhesh Poyarekar wrote:
> On Wed, Sep 24, 2014 at 10:37:21AM +0200, Leonhard Holz wrote:
>> Hello everybody,
>>
>> this is a patch that should solve bug 15884. It complains about the
>> performance of strcoll(). It was found out that the runtime of strcoll() is
>> actually bound to strlen which is needed for calculating the size of a cache
>> that was installed to improve the comparison performance.
>>
>> The idea for this patch was that the cache is only useful in rare cases
>> (strings of same length and same first-level-chars) and that it would be
>> better to avoid memory allocation at all. To prove this I wrote a
>> performance test that is found in benchtests-strcoll.tar.bz2. Also
>> modifications in benchtests/Makefile and localedata/Makefile are necessary
>> to make it work.
>>
>> After removing the cache the strcoll method showed the predicted behavior
>> (getting slightly faster) in all but the test case for Hindi word sorting.
>> This was due to the Hindi text having many more equal words than the other
>> ones. For equal strings the performance was worse since all comparison
>> levels were run through and from the second level on the cache improved the
>> comparison performance of the original version.
>>
>> Therefore I added a bytewise test via strcmp iff the first level comparison
>> found that both strings did match because in this case it is very likely
>> that equal strings are compared. This solved the problem with the Hindi test
>> case and improved the performance of the others.
>
> Thanks for working on this and also writing a benchmark for it.  The
> general approach seems sound to me (I haven't done a deep review yet),
> but there are quite a few nits that will need to be worked out, most
> of them covered in the contributor checklist[1].
>
> - There are a lot of unrelated whitespace and formatting changes in
>    the patch.  Most of them seem to have been made using the GNU indent
>    program, which is mostly accurate, but not completely.  Please
>    review and fix them up.
>
> - The change needs a changelog which mentions all your changes,
>    including all the new files.
>
> - Please include bench-strcoll.c in the patch as well.  It's OK if you
>    post the input files in the tarball but the source needs to be
>    reviewed.
>
> - bench-strcoll.c has some code formatting issues, especially
>    unnecessary braces around single line for/if blocks.
>
>> Another improvement was achieved by inlining both static subroutines.
>
> - Please post the inlining change separately with separate numbers for
>    it.  In general we stay away from inlining functions and just let
>    the compiler do its job.  However if there is a case where such
>    inlining is especially useful, it needs to be accompanied with
>    numbers.  So a separate patch with separate numbers for the change
>    would be helpful.
>
> - Finally, I don't know if you have signed a copyright assignment with
>    the FSF for your changes.  Carlos seems to have mentioned that in
>    your previous email thread, but I don't know if you've followed
>    through on it since I am not an FSF maintainer.  Maybe one of the
>    FSF maintainers can confirm that.
>
> Siddhesh
>
> [1] https://sourceware.org/glibc/wiki/Contribution%20checklist
>
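
To illustrate the strcmp shortcut described in the quoted message, here is a 
toy sketch. It is not the glibc code: toy_collate, level1_compare and 
level2_compare are hypothetical names, and the two "levels" (case-insensitive 
first, case second) only stand in for the locale's real collation levels.

```c
#include <ctype.h>
#include <string.h>

/* Toy first level: compare the strings ignoring case.  */
static int
level1_compare (const char *a, const char *b)
{
  for (; *a != '\0' && *b != '\0'; a++, b++)
    {
      int ca = tolower ((unsigned char) *a);
      int cb = tolower ((unsigned char) *b);
      if (ca != cb)
	return ca < cb ? -1 : 1;
    }
  return (*a != '\0') - (*b != '\0');
}

/* Toy second level, only reached when the first level found no
   difference: at the first differing byte, lowercase sorts first.  */
static int
level2_compare (const char *a, const char *b)
{
  for (; *a != '\0' && *b != '\0'; a++, b++)
    if (*a != *b)
      return islower ((unsigned char) *a) ? -1 : 1;
  return 0;
}

int
toy_collate (const char *a, const char *b)
{
  int result = level1_compare (a, b);
  if (result != 0)
    return result;

  /* Shortcut: strings that match at the first level are very likely
     identical, so one bytewise test avoids running the other levels.  */
  if (strcmp (a, b) == 0)
    return 0;

  return level2_compare (a, b);
}
```

The bytewise test costs one extra pass only when the first level already 
matched, which is the case where the remaining levels would otherwise all 
have to be run.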
/* Measure strcoll implementation.
   Copyright (C) 2014 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <http://www.gnu.org/licenses/>.  */

#define TEST_MAIN
#define TEST_NAME "strcoll"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include "bench-timing.h"

/* Many thanks to http://generator.lorem-ipsum.info/.  */
const char *li_files[] = {
  "strcoll-inputs/lorem_ipsum_vietnamese.txt",
  "strcoll-inputs/lorem_ipsum_latin.txt",
  "strcoll-inputs/lorem_ipsum_arabic.txt",
  "strcoll-inputs/lorem_ipsum_l33tspeak.txt",
  "strcoll-inputs/lorem_ipsum_chinese.txt",
  "strcoll-inputs/lorem_ipsum_czech.txt",
  "strcoll-inputs/lorem_ipsum_old_english.txt",
  "strcoll-inputs/lorem_ipsum_danish.txt",
  "strcoll-inputs/lorem_ipsum_polish.txt",
  "strcoll-inputs/lorem_ipsum_french.txt",
  "strcoll-inputs/lorem_ipsum_portugese.txt",
  "strcoll-inputs/lorem_ipsum_greek.txt",
  "strcoll-inputs/lorem_ipsum_russian.txt",
  "strcoll-inputs/lorem_ipsum_hebrew.txt",
  "strcoll-inputs/lorem_ipsum_spain.txt",
  "strcoll-inputs/lorem_ipsum_hindi.txt",
  "strcoll-inputs/lorem_ipsum_swedish.txt",
  "strcoll-inputs/lorem_ipsum_hungarian.txt",
  "strcoll-inputs/lorem_ipsum_turkish.txt",
  "strcoll-inputs/lorem_ipsum_icelandic.txt",
  "strcoll-inputs/lorem_ipsum_italian.txt",
  "strcoll-inputs/lorem_ipsum_yugoslavian.txt",
  "strcoll-inputs/lorem_ipsum_japanese.txt"
};

const char *li_locales[] = {
  "vi_VN.UTF-8",
  "en_US.UTF-8",
  "ar_SA.UTF-8",
  "en_US.UTF-8",
  "zh_CN.UTF-8",
  "cs_CZ.UTF-8",
  "en_GB.UTF-8",
  "da_DK.UTF-8",
  "pl_PL.UTF-8",
  "fr_FR.UTF-8",
  "pt_PT.UTF-8",
  "el_GR.UTF-8",
  "ru_RU.UTF-8",
  "iw_IL.UTF-8",
  "es_ES.UTF-8",
  "hi_IN.UTF-8",
  "sv_SE.UTF-8",
  "hu_HU.UTF-8",
  "tr_TR.UTF-8",
  "is_IS.UTF-8",
  "it_IT.UTF-8",
  "sr_RS.UTF-8",
  "ja_JP.UTF-8"
};

const char *filenames_file = "strcoll-inputs/glibc_files.txt";

#define LI_DELIMITER " \n\r\t.,?!"
#define FILENAMES_DELIMITER "\n\r"

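/* Read the whole file into a newly allocated, NUL-terminated buffer.
   Return NULL if the file cannot be opened, sized or allocated.  */
char *
read_file (const char *filename)
{
  char *buffer = NULL;
  FILE *f = fopen (filename, "rb");

  if (f)
    {
      /* ftell returns a long and -1 on error.  */
      long length = -1;
      if (fseek (f, 0, SEEK_END) == 0)
	length = ftell (f);
      if (length >= 0 && fseek (f, 0, SEEK_SET) == 0)
	buffer = malloc ((size_t) length + 1);
      if (buffer)
	{
	  /* A short read just truncates the text.  */
	  size_t nread = fread (buffer, 1, (size_t) length, f);
	  buffer[nread] = '\0';
	}
      fclose (f);
    }

  return buffer;
}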

size_t
count_words (const char *text, const char *delim)
{
  size_t wordcount = 0;
  char *tmp = strdup (text);

  char *token = strtok (tmp, delim);
  while (token != NULL)
    {
      if (*token != '\0')
	wordcount++;
      token = strtok (NULL, delim);
    }

  free (tmp);
  return wordcount;
}

typedef struct
{
  char *str;
  size_t strlen;
  size_t size;
  char **words;
} word_list;

word_list
tokenize_string (const char *str, const char *delim)
{
  word_list list;
  list.strlen = strlen (str);
  list.size = count_words (str, delim);
  list.words = malloc (list.size * sizeof (char *));

  size_t n = 0;
  list.str = strdup (str);
  char *word = strtok (list.str, delim);
  while (word != NULL && n < list.size)
    {
      if (*word != '\0')
	list.words[n++] = word;
      word = strtok (NULL, delim);
    }

  return list;
}

word_list
copy_word_list (const word_list list)
{
  size_t i;
  word_list copy;

  copy.strlen = list.strlen;
  copy.str = malloc (list.strlen + 1);
  memcpy (copy.str, list.str, list.strlen + 1);

  copy.size = list.size;
  copy.words = malloc (list.size * sizeof (char *));

  for (i = 0; i < list.size; i++)
    copy.words[i] = list.words[i] - list.str + copy.str;

  return copy;
}

int
compare_words (const void *a, const void *b)
{
  const char *s1 = *(const char *const *) a;
  const char *s2 = *(const char *const *) b;
  return strcoll (s1, s2);
}

word_list
sort_word_list (const word_list list)
{
  word_list sorted = copy_word_list (list);
  qsort (sorted.words, sorted.size, sizeof (char *), compare_words);
  return sorted;
}

void
print_word_list (word_list list)
{
  size_t i;

  for (i = 0; i < list.size; i++)
    printf ("%s\n", list.words[i]);
}

void
free_word_list (word_list list)
{
  free (list.words);
  free (list.str);
}

#undef INNER_LOOP_ITERS
#define INNER_LOOP_ITERS 16

void
test_file (const char *filename, const char *delim, const char *locale)
{
  setlocale (LC_ALL, locale);
  timing_t start, stop, cur;
  size_t i, iters = INNER_LOOP_ITERS;

  printf ("%-50s %-10s", filename, setlocale (LC_ALL, NULL));

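  char *text = read_file (filename);
  if (text == NULL)
    {
      /* Do not tokenize a NULL pointer if the input is missing.  */
      printf (" could not read input file\n");
      setlocale (LC_ALL, "en_US.UTF-8");
      return;
    }
  word_list list = tokenize_string (text, delim);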

  word_list *tests = malloc (INNER_LOOP_ITERS * sizeof (word_list));
  for (i = 0; i < INNER_LOOP_ITERS; i++)
    tests[i] = copy_word_list (list);

  TIMING_NOW (start);
  for (i = 0; i < INNER_LOOP_ITERS; i++)
    qsort (tests[i].words, tests[i].size, sizeof (char *), compare_words);
  TIMING_NOW (stop);

  setlocale (LC_ALL, "en_US.UTF-8");

  TIMING_DIFF (cur, start, stop);
  TIMING_PRINT_MEAN ((double) cur, (double) iters);
  putchar ('\n');

  for (i = 0; i < INNER_LOOP_ITERS; i++)
    free_word_list (tests[i]);
  free (tests);

  free_word_list (list);
  free (text);
}


int
do_bench (void)
{
  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    {
      printf ("Failed to set default locale.\n");
      return 1;
    }

  timing_t res __attribute__ ((unused));
  TIMING_INIT (res);

  test_file (filenames_file, FILENAMES_DELIMITER, "en_US.UTF-8");

  size_t i;
  for (i = 0; i < (sizeof (li_files) / sizeof (li_files[0])); i++)
    test_file (li_files[i], LI_DELIMITER, li_locales[i]);

  return 0;
}

#define TEST_FUNCTION do_bench ()

/* On slower platforms this test needs more than the default 2 seconds.  */
#define TIMEOUT 10

#include "../test-skeleton.c"

diff --git a/localedata/Makefile b/localedata/Makefile
index b6235f2..cb24974 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -106,7 +106,10 @@ LOCALES := de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 \
 	   hr_HR.ISO-8859-2 sv_SE.ISO-8859-1 ja_JP.SJIS fr_FR.ISO-8859-1 \
 	   nb_NO.ISO-8859-1 nn_NO.ISO-8859-1 tr_TR.UTF-8 cs_CZ.UTF-8 \
 	   zh_TW.EUC-TW fa_IR.UTF-8 fr_FR.UTF-8 ja_JP.UTF-8 si_LK.UTF-8 \
-	   tr_TR.ISO-8859-9 en_GB.UTF-8
+	   tr_TR.ISO-8859-9 en_GB.UTF-8 vi_VN.UTF-8 ar_SA.UTF-8 zh_CN.UTF-8 \
+	   da_DK.UTF-8 pl_PL.UTF-8 pt_PT.UTF-8 el_GR.UTF-8 ru_RU.UTF-8 \
+	   iw_IL.UTF-8 es_ES.UTF-8 hi_IN.UTF-8 sv_SE.UTF-8 hu_HU.UTF-8 \
+	   is_IS.UTF-8 it_IT.UTF-8 sr_RS.UTF-8
 LOCALE_SRCS := $(shell echo "$(LOCALES)"|sed 's/\([^ .]*\)[^ ]*/\1/g')
 CHARMAPS := $(shell echo "$(LOCALES)" | \
 		    sed -e 's/[^ .]*[.]\([^ ]*\)/\1/g' -e s/SJIS/SHIFT_JIS/g)
diff --git a/string/strcoll_l.c b/string/strcoll_l.c
index d4f42a3..dbda23c 100644
--- a/string/strcoll_l.c
+++ b/string/strcoll_l.c
@@ -22,7 +22,6 @@
 #include <locale.h>
 #include <stddef.h>
 #include <stdint.h>
-#include <stdlib.h>
 #include <string.h>
 #include <sys/param.h>
 
@@ -55,8 +54,6 @@ typedef struct
   size_t backw;			/* Current Backward sequence index.  */
   size_t backw_stop;		/* Index where the backward sequences stop.  */
   const USTRING_TYPE *us;	/* The string.  */
-  int32_t *idxarr;		/* Array to cache weight indices.  */
-  unsigned char *rulearr;	/* Array to cache rules.  */
   unsigned char rule;		/* Saved rule for the first sequence.  */
   int32_t idx;			/* Index to weight of the current sequence.  */
   int32_t save_idx;		/* Save looked up index of a forward
@@ -65,179 +62,9 @@ typedef struct
   const USTRING_TYPE *back_us;	/* Beginning of the backward sequence.  */
 } coll_seq;
 
-/* Get next sequence.  The weight indices are cached, so we don't need to
-   traverse the string.  */
-static void
-get_next_seq_cached (coll_seq *seq, int nrules, int pass,
-		     const unsigned char *rulesets,
-		     const USTRING_TYPE *weights)
-{
-  size_t val = seq->val = 0;
-  int len = seq->len;
-  size_t backw_stop = seq->backw_stop;
-  size_t backw = seq->backw;
-  size_t idxcnt = seq->idxcnt;
-  size_t idxmax = seq->idxmax;
-  size_t idxnow = seq->idxnow;
-  unsigned char *rulearr = seq->rulearr;
-  int32_t *idxarr = seq->idxarr;
-
-  while (len == 0)
-    {
-      ++val;
-      if (backw_stop != ~0ul)
-	{
-	  /* There is something pushed.  */
-	  if (backw == backw_stop)
-	    {
-	      /* The last pushed character was handled.  Continue
-		 with forward characters.  */
-	      if (idxcnt < idxmax)
-		{
-		  idxnow = idxcnt;
-		  backw_stop = ~0ul;
-		}
-	      else
-		{
-		  /* Nothing any more.  The backward sequence
-		     ended with the last sequence in the string.  */
-		  idxnow = ~0ul;
-		  break;
-		}
-	    }
-	  else
-	    idxnow = --backw;
-	}
-      else
-	{
-	  backw_stop = idxcnt;
-
-	  while (idxcnt < idxmax)
-	    {
-	      if ((rulesets[rulearr[idxcnt] * nrules + pass]
-		   & sort_backward) == 0)
-		/* No more backward characters to push.  */
-		break;
-	      ++idxcnt;
-	    }
-
-	  if (backw_stop == idxcnt)
-	    {
-	      /* No sequence at all or just one.  */
-	      if (idxcnt == idxmax)
-		/* Note that LEN is still zero.  */
-		break;
-
-	      backw_stop = ~0ul;
-	      idxnow = idxcnt++;
-	    }
-	  else
-	    /* We pushed backward sequences.  */
-	    idxnow = backw = idxcnt - 1;
-	}
-      len = weights[idxarr[idxnow]++];
-    }
-
-  /* Update the structure.  */
-  seq->val = val;
-  seq->len = len;
-  seq->backw_stop = backw_stop;
-  seq->backw = backw;
-  seq->idxcnt = idxcnt;
-  seq->idxnow = idxnow;
-}
-
 /* Get next sequence.  Traverse the string as required.  */
 static void
 get_next_seq (coll_seq *seq, int nrules, const unsigned char *rulesets,
-	      const USTRING_TYPE *weights, const int32_t *table,
-	      const USTRING_TYPE *extra, const int32_t *indirect)
-{
-  size_t val = seq->val = 0;
-  int len = seq->len;
-  size_t backw_stop = seq->backw_stop;
-  size_t backw = seq->backw;
-  size_t idxcnt = seq->idxcnt;
-  size_t idxmax = seq->idxmax;
-  size_t idxnow = seq->idxnow;
-  unsigned char *rulearr = seq->rulearr;
-  int32_t *idxarr = seq->idxarr;
-  const USTRING_TYPE *us = seq->us;
-
-  while (len == 0)
-    {
-      ++val;
-      if (backw_stop != ~0ul)
-	{
-	  /* There is something pushed.  */
-	  if (backw == backw_stop)
-	    {
-	      /* The last pushed character was handled.  Continue
-		 with forward characters.  */
-	      if (idxcnt < idxmax)
-		{
-		  idxnow = idxcnt;
-		  backw_stop = ~0ul;
-		}
-	      else
-		/* Nothing any more.  The backward sequence ended with
-		   the last sequence in the string.  Note that LEN
-		   is still zero.  */
-		break;
-	    }
-	  else
-	    idxnow = --backw;
-	}
-      else
-	{
-	  backw_stop = idxmax;
-
-	  while (*us != L('\0'))
-	    {
-	      int32_t tmp = findidx (table, indirect, extra, &us, -1);
-	      rulearr[idxmax] = tmp >> 24;
-	      idxarr[idxmax] = tmp & 0xffffff;
-	      idxcnt = idxmax++;
-
-	      if ((rulesets[rulearr[idxcnt] * nrules]
-		   & sort_backward) == 0)
-		/* No more backward characters to push.  */
-		break;
-	      ++idxcnt;
-	    }
-
-	  if (backw_stop >= idxcnt)
-	    {
-	      /* No sequence at all or just one.  */
-	      if (idxcnt == idxmax || backw_stop > idxcnt)
-		/* Note that LEN is still zero.  */
-		break;
-
-	      backw_stop = ~0ul;
-	      idxnow = idxcnt;
-	    }
-	  else
-	    /* We pushed backward sequences.  */
-	    idxnow = backw = idxcnt - 1;
-	}
-      len = weights[idxarr[idxnow]++];
-    }
-
-  /* Update the structure.  */
-  seq->val = val;
-  seq->len = len;
-  seq->backw_stop = backw_stop;
-  seq->backw = backw;
-  seq->idxcnt = idxcnt;
-  seq->idxmax = idxmax;
-  seq->idxnow = idxnow;
-  seq->us = us;
-}
-
-/* Get next sequence.  Traverse the string as required.  This function does not
-   set or use any index or rule cache.  */
-static void
-get_next_seq_nocache (coll_seq *seq, int nrules, const unsigned char *rulesets,
 		      const USTRING_TYPE *weights, const int32_t *table,
 		      const USTRING_TYPE *extra, const int32_t *indirect,
 		      int pass)
@@ -366,10 +193,9 @@ get_next_seq_nocache (coll_seq *seq, int nrules, const unsigned char *rulesets,
   seq->idx = idx;
 }
 
-/* Compare two sequences.  This version does not use the index and rules
-   cache.  */
+/* Compare two sequences.  */
 static int
-do_compare_nocache (coll_seq *seq1, coll_seq *seq2, int position,
+do_compare (coll_seq *seq1, coll_seq *seq2, int position,
 		    const USTRING_TYPE *weights)
 {
   int seq1len = seq1->len;
@@ -417,56 +243,6 @@ out:
   return result;
 }
 
-/* Compare two sequences using the index cache.  */
-static int
-do_compare (coll_seq *seq1, coll_seq *seq2, int position,
-	    const USTRING_TYPE *weights)
-{
-  int seq1len = seq1->len;
-  int seq2len = seq2->len;
-  size_t val1 = seq1->val;
-  size_t val2 = seq2->val;
-  int32_t *idx1arr = seq1->idxarr;
-  int32_t *idx2arr = seq2->idxarr;
-  int idx1now = seq1->idxnow;
-  int idx2now = seq2->idxnow;
-  int result = 0;
-
-  /* Test for position if necessary.  */
-  if (position && val1 != val2)
-    {
-      result = val1 > val2 ? 1 : -1;
-      goto out;
-    }
-
-  /* Compare the two sequences.  */
-  do
-    {
-      if (weights[idx1arr[idx1now]] != weights[idx2arr[idx2now]])
-	{
-	  /* The sequences differ.  */
-	  result = weights[idx1arr[idx1now]] - weights[idx2arr[idx2now]];
-	  goto out;
-	}
-
-      /* Increment the offsets.  */
-      ++idx1arr[idx1now];
-      ++idx2arr[idx2now];
-
-      --seq1len;
-      --seq2len;
-    }
-  while (seq1len > 0 && seq2len > 0);
-
-  if (position && seq1len != seq2len)
-    result = seq1len - seq2len;
-
-out:
-  seq1->len = seq1len;
-  seq2->len = seq2len;
-  return result;
-}
-
 int
 STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
 {
@@ -483,6 +259,10 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
   if (nrules == 0)
     return STRCMP (s1, s2);
 
+  /* Catch empty strings.  */
+  if (__glibc_unlikely (*s1 == '\0') || __glibc_unlikely (*s2 == '\0'))
+    return (*s1 != '\0') - (*s2 != '\0');
+
   rulesets = (const unsigned char *)
     current->values[_NL_ITEM_INDEX (_NL_COLLATE_RULESETS)].string;
   table = (const int32_t *)
@@ -499,65 +279,12 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
   assert (((uintptr_t) extra) % __alignof__ (extra[0]) == 0);
   assert (((uintptr_t) indirect) % __alignof__ (indirect[0]) == 0);
 
-  /* We need this a few times.  */
-  size_t s1len = STRLEN (s1);
-  size_t s2len = STRLEN (s2);
-
-  /* Catch empty strings.  */
-  if (__glibc_unlikely (s1len == 0) || __glibc_unlikely (s2len == 0))
-    return (s1len != 0) - (s2len != 0);
-
-  /* Perform the first pass over the string and while doing this find
-     and store the weights for each character.  Since we want this to
-     be as fast as possible we are using `alloca' to store the temporary
-     values.  But since there is no limit on the length of the string
-     we have to use `malloc' if the string is too long.  We should be
-     very conservative here.
-
-     Please note that the localedef programs makes sure that `position'
-     is not used at the first level.  */
+  int result = 0, rule = 0;
 
   coll_seq seq1, seq2;
-  bool use_malloc = false;
-  int result = 0;
-
   memset (&seq1, 0, sizeof (seq1));
   seq2 = seq1;
 
-  size_t size_max = SIZE_MAX / (sizeof (int32_t) + 1);
-
-  if (MIN (s1len, s2len) > size_max
-      || MAX (s1len, s2len) > size_max - MIN (s1len, s2len))
-    {
-      /* If the strings are long enough to cause overflow in the size request,
-         then skip the allocation and proceed with the non-cached routines.  */
-    }
-  else if (! __libc_use_alloca ((s1len + s2len) * (sizeof (int32_t) + 1)))
-    {
-      seq1.idxarr = (int32_t *) malloc ((s1len + s2len) * (sizeof (int32_t) + 1));
-
-      /* If we failed to allocate memory, we leave everything as NULL so that
-	 we use the nocache version of traversal and comparison functions.  */
-      if (seq1.idxarr != NULL)
-	{
-	  seq2.idxarr = &seq1.idxarr[s1len];
-	  seq1.rulearr = (unsigned char *) &seq2.idxarr[s2len];
-	  seq2.rulearr = &seq1.rulearr[s1len];
-	  use_malloc = true;
-	}
-    }
-  else
-    {
-      seq1.idxarr = (int32_t *) alloca (s1len * sizeof (int32_t));
-      seq2.idxarr = (int32_t *) alloca (s2len * sizeof (int32_t));
-      seq1.rulearr = (unsigned char *) alloca (s1len);
-      seq2.rulearr = (unsigned char *) alloca (s2len);
-    }
-
-  int rule = 0;
-
-  /* Cache values in the first pass and if needed, use them in subsequent
-     passes.  */
   for (int pass = 0; pass < nrules; ++pass)
     {
       seq1.idxcnt = 0;
@@ -575,64 +302,44 @@ STRCOLL (const STRING_TYPE *s1, const STRING_TYPE *s2, __locale_t l)
       seq2.us = (const USTRING_TYPE *) s2;
 
       /* We assume that if a rule has defined `position' in one section
-	 this is true for all of them.  */
+	 this is true for all of them.  Please note that the localedef program
+	 makes sure that `position' is not used at the first level.  */
+
       int position = rulesets[rule * nrules + pass] & sort_position;
 
       while (1)
 	{
-	  if (__glibc_unlikely (seq1.idxarr == NULL))
-	    {
-	      get_next_seq_nocache (&seq1, nrules, rulesets, weights, table,
+	  get_next_seq (&seq1, nrules, rulesets, weights, table,
 				    extra, indirect, pass);
-	      get_next_seq_nocache (&seq2, nrules, rulesets, weights, table,
+	  get_next_seq (&seq2, nrules, rulesets, weights, table,
 				    extra, indirect, pass);
-	    }
-	  else if (pass == 0)
-	    {
-	      get_next_seq (&seq1, nrules, rulesets, weights, table, extra,
-			    indirect);
-	      get_next_seq (&seq2, nrules, rulesets, weights, table, extra,
-			    indirect);
-	    }
-	  else
-	    {
-	      get_next_seq_cached (&seq1, nrules, pass, rulesets, weights);
-	      get_next_seq_cached (&seq2, nrules, pass, rulesets, weights);
-	    }
-
 	  /* See whether any or both strings are empty.  */
 	  if (seq1.len == 0 || seq2.len == 0)
 	    {
 	      if (seq1.len == seq2.len)
-		/* Both ended.  So far so good, both strings are equal
-		   at this level.  */
-		break;
+		{
+		  /* Both ended.  Both strings are equal at this level.  */
+		  if (pass == 0 && STRCMP (s1, s2) == 0)
+		    /* Shortcut to avoid looping through all levels
+		       for totally equal strings.  */
+		    return result;
+		  else
+		    break;
+		}
 
 	      /* This means one string is shorter than the other.  Find out
 		 which one and return an appropriate value.  */
-	      result = seq1.len == 0 ? -1 : 1;
-	      goto free_and_return;
+	      return seq1.len == 0 ? -1 : 1;
 	    }
 
-	  if (__glibc_unlikely (seq1.idxarr == NULL))
-	    result = do_compare_nocache (&seq1, &seq2, position, weights);
-	  else
-	    result = do_compare (&seq1, &seq2, position, weights);
+	  result = do_compare (&seq1, &seq2, position, weights);
 	  if (result != 0)
-	    goto free_and_return;
+	    return result;
 	}
 
-      if (__glibc_likely (seq1.rulearr != NULL))
-	rule = seq1.rulearr[0];
-      else
-	rule = seq1.rule;
+      rule = seq1.rule;
     }
 
-  /* Free the memory if needed.  */
- free_and_return:
-  if (use_malloc)
-    free (seq1.idxarr);
-
   return result;
 }
 libc_hidden_def (STRCOLL)

diff --git a/benchtests/Makefile b/benchtests/Makefile
index fd3036d..e79ceee 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -34,7 +34,7 @@ string-bench := bcopy bzero memccpy memchr memcmp memcpy memmem memmove \
 		mempcpy memset rawmemchr stpcpy stpncpy strcasecmp strcasestr \
 		strcat strchr strchrnul strcmp strcpy strcspn strlen \
 		strncasecmp strncat strncmp strncpy strnlen strpbrk strrchr \
-		strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok
+		strspn strstr strcpy_chk stpcpy_chk memrchr strsep strtok strcoll
 string-bench-all := $(string-bench)
 
 stdlib-bench := strtod