[1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers

Message ID 1511138515-25996-1-git-send-email-palves@redhat.com
State New, archived

Commit Message

Pedro Alves Nov. 20, 2017, 12:41 a.m. UTC
  The find-upper-bound-for-completion algorithm in the name components
accelerator table in dwarf2read.c increments a char in a string, and
asserts that it's not incrementing a 0xff char, but that's incorrect.

First, we shouldn't be calling gdb_assert on input.

Then, if "char" is signed, comparing a caracther with "0xff" will
never yield true, which is caught by Clang with:

  error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
	    gdb_assert (after.back () != 0xff);
			~~~~~~~~~~~~~ ^  ~~~~
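
(For illustration only, not part of the patch: a minimal standalone
reproduction of the tautology, assuming a target where plain "char"
is signed:)

  #include <cassert>
  #include <string>

  int main ()
  {
    std::string after = "func\xff";

    /* With a signed "char", back () yields (char) -1, which promotes
       to the int -1, never to 255, so the assertion below can't fire;
       the comparison is tautologically true.  */
    assert (after.back () != 0xff);

    /* Comparing as unsigned char is what actually inspects the byte.  */
    assert ((unsigned char) after.back () == 0xff);
  }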

And then, 0xff is a valid character in non-UTF-8/ASCII character sets.
E.g., it's 'ÿ' in Latin1.  While neither GCC nor Clang supports
non-ASCII, non-UTF-8 characters in identifiers (GCC supports UTF-8
characters only via UCNs, see
https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html), other
compilers might (Visual Studio?), so it doesn't hurt to
handle it correctly.  Testing is covered by extending the
dw2_expand_symtabs_matching unit tests with relevant cases.

However, without further changes, the unit tests still fail...  The
problem is that cp-name-parser.y assumes that identifiers are ASCII
(via ISALPHA/ISALNUM).  This commit fixes that too, so that we can
unit test the dwarf2read.c changes.  (The regular C/C++ lexer in
c-lang.y needs a similar treatment, but I'm leaving that for another
patch.)
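
(Also for illustration only, not part of the patch: a sketch of why an
ISALPHA/ISALNUM-based lexer breaks on UTF-8 identifiers.  It uses
<cctype>'s isalnum in the default "C" locale as a stand-in for the
ISALNUM macro the lexer uses:)

  #include <cassert>
  #include <cctype>

  int main ()
  {
    /* 'ç' in UTF-8 is the byte sequence 0xc3 0xa7.  Neither byte is
       alphanumeric, so a lexer that only accepts ISALNUM bytes stops
       in the middle of u8"função".  */
    unsigned char c = 0xc3;
    assert (!isalnum (c));

    /* Accepting any byte with the high bit set (what the new
       cp_ident_is_alnum does) keeps the multi-byte sequence inside
       the identifier.  */
    assert (isalnum (c) || c >= 0x80);
  }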

While doing this, I noticed a thinko in the computation of the upper
bound for completion in dw2_expand_symtabs_matching_symbol.  We're
using std::upper_bound but we should use std::lower_bound.  I extended
the unit test with a case that I thought would expose it, this one:

 +  /* These are used to check that the increment-last-char in the
 +     matching algorithm for completion doesn't match "t1_fund" when
 +     completing "t1_func".  */
 +  "t1_func",
 +  "t1_func1",
 +  "t1_fund",
 +  "t1_fund1",

The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
matches incorrectly.  But it turns out the problem is masked because
later here:

  for (;lower != upper; ++lower)
    {
      const char *qualified = index.symbol_name_at (lower->idx);

      if (!lookup_name_matcher.matches (qualified)

the lookup_name_matcher.matches check above filters out "t1_fund"
because that doesn't start with "t1_func".
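
(To make the thinko concrete, here is a minimal standalone sketch --
not GDB code; it uses plain std::string and the default ordering
instead of the index's name_components and lookup_compare_upper:)

  #include <algorithm>
  #include <cassert>
  #include <string>
  #include <vector>

  int main ()
  {
    /* Sorted name list, as in the unit test.  */
    std::vector<std::string> names
      = { "t1_func", "t1_func1", "t1_fund", "t1_fund1" };

    /* Completing "t1_func", the end of the matching range is found by
       searching for "t1_func"-with-last-character-incremented.  */
    std::string after = "t1_fund";

    /* std::upper_bound points past "t1_fund", so "t1_fund" would fall
       inside the completion range...  */
    auto ub = std::upper_bound (names.begin (), names.end (), after);
    assert (*ub == "t1_fund1");

    /* ...while std::lower_bound points at "t1_fund" itself, correctly
       leaving it out of the half-open range.  */
    auto lb = std::lower_bound (names.begin (), names.end (), after);
    assert (*lb == "t1_fund");
  }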

I'll fix the latent bug in follow up patches, after factoring things
out a bit in a way that allows unit testing the relevant code more
directly.

gdb/ChangeLog:
yyyy-mm-dd  Pedro Alves  <palves@redhat.com>

	* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
	(symbol_end): Use cp_ident_is_alnum.
	(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
	* dwarf2read.c (make_sort_after_prefix_name): New function.
	(dw2_expand_symtabs_matching_symbol): Use it.
	(test_symbols): Add more symbols.
	(run_test): Add tests.
---
 gdb/cp-name-parser.y |  28 ++++++++++--
 gdb/dwarf2read.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 129 insertions(+), 18 deletions(-)
  

Comments

Simon Marchi Nov. 20, 2017, 1:38 a.m. UTC | #1
On 2017-11-19 07:41 PM, Pedro Alves wrote:
> The find-upper-bound-for-completion algorithm in the name components
> accelerator table in dwarf2read.c increments a char in a string, and
> asserts that it's not incrementing a 0xff char, but that's incorrect.
> 
> First, we shouldn't be calling gdb_assert on input.
> 
> Then, if "char" is signed, comparing a caracther with "0xff" will
> never yield true, which is caught by Clang with:
> 
>   error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
> 	    gdb_assert (after.back () != 0xff);
> 			~~~~~~~~~~~~~ ^  ~~~~
> 
> And then, 0xff is a valid character in non-UTF-8/ASCII character sets.
> E.g., it's 'ÿ' in Latin1.  While neither GCC nor Clang supports
> non-ASCII, non-UTF-8 characters in identifiers (GCC supports UTF-8
> characters only via UCNs, see
> https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html), other
> compilers might (Visual Studio?), so it doesn't hurt to
> handle it correctly.  Testing is covered by extending the
> dw2_expand_symtabs_matching unit tests with relevant cases.
> 
> However, without further changes, the unit tests still fail...  The
> problem is that cp-name-parser.y assumes that identifiers are ASCII
> (via ISALPHA/ISALNUM).  This commit fixes that too, so that we can
> unit test the dwarf2read.c changes.  (The regular C/C++ lexer in
> c-lang.y needs a similar treatment, but I'm leaving that for another
> patch.)
> 
> While doing this, I noticed a thinko in the computation of the upper
> bound for completion in dw2_expand_symtabs_matching_symbol.  We're
> using std::upper_bound but we should use std::lower_bound.  I extended
> the unit test with a case that I thought would expose it, this one:
> 
>  +  /* These are used to check that the increment-last-char in the
>  +     matching algorithm for completion doesn't match "t1_fund" when
>  +     completing "t1_func".  */
>  +  "t1_func",
>  +  "t1_func1",
>  +  "t1_fund",
>  +  "t1_fund1",
> 
> The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
> matches incorrectly.  But it turns out the problem is masked because
> later here:
> 
>   for (;lower != upper; ++lower)
>     {
>       const char *qualified = index.symbol_name_at (lower->idx);
> 
>       if (!lookup_name_matcher.matches (qualified)
> 
> the lookup_name_matcher.matches check above filters out "t1_fund"
> because that doesn't start with "t1_func".
> 
> I'll fix the latent bug in follow up patches, after factoring things
> out a bit in a way that allows unit testing the relevant code more
> directly.

Everything you said makes sense to me, the patch looks good to me.  I noted
one comment and a typo below.

> gdb/ChangeLog:
> yyyy-mm-dd  Pedro Alves  <palves@redhat.com>
> 
> 	* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
> 	(symbol_end): Use cp_ident_is_alnum.
> 	(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
> 	* dwarf2read.c (make_sort_after_prefix_name): New function.
> 	(dw2_expand_symtabs_matching_symbol): Use it.
> 	(test_symbols): Add more symbols.
> 	(run_test): Add tests.
> ---
>  gdb/cp-name-parser.y |  28 ++++++++++--
>  gdb/dwarf2read.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 129 insertions(+), 18 deletions(-)
> 
> diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
> index 33ecf13..fdfbf15 100644
> --- a/gdb/cp-name-parser.y
> +++ b/gdb/cp-name-parser.y
> @@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
>  		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
>  }
>  
> +/* Like ISALPHA, but also returns true for the union of all UTF-8
> +   multi-byte sequence bytes and non-ASCII characters in
> +   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
> +   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
> +   identifiers, but we don't need to be pedantic so for simplicity we
> +   ignore that here.  Plus this avoids the complication of actually
> +   knowing what was the right encoding.  */
> +
> +static inline bool
> +cp_ident_is_alpha (unsigned char ch)
> +{
> +  return ISALPHA (ch) || ch >= 0x80;
> +}
> +
> +/* Similarly, but like ISALNUM.  */
> +
> +static inline bool
> +cp_ident_is_alnum (unsigned char ch)
> +{
> +  return ISALNUM (ch) || ch >= 0x80;
> +}
> +
>  /* Find the end of a symbol name starting at LEXPTR.  */
>  
>  static const char *
> @@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr)
>  {
>    const char *p = lexptr;
>  
> -  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
> +  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
>      p++;
>  
>    return p;
> @@ -1791,7 +1813,7 @@ yylex (void)
>        return ERROR;
>      }
>  
> -  if (!(c == '_' || c == '$' || ISALPHA (c)))
> +  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
>      {
>        /* We must have come across a bad character (e.g. ';').  */
>        yyerror (_("invalid character"));
> @@ -1802,7 +1824,7 @@ yylex (void)
>    namelen = 0;
>    do
>      c = tokstart[++namelen];
> -  while (ISALNUM (c) || c == '_' || c == '$');
> +  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
>  
>    lexptr += namelen;
>  
> diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
> index 5437d21..b08e81c 100644
> --- a/gdb/dwarf2read.c
> +++ b/gdb/dwarf2read.c
> @@ -4188,6 +4188,60 @@ gdb_index_symbol_name_matcher::matches (const char *symbol_name)
>    return false;
>  }
>  
> +/* Starting from a search name, return the string that finds the upper
> +   bound of all strings that start with SEARCH_NAME in a sorted name
> +   list.  Returns the empty string to indicate that the upper bound is
> +   the end of the list.  */
> +
> +static std::string
> +make_sort_after_prefix_name (const char *search_name)
> +{
> +  /* When looking to complete "func", we find the upper bound of all
> +     symbols that start with "func" by looking for where we'd insert
> +     "func"-with-last-character-incremented, i.e. "fund".  */
> +  std::string after = search_name;
> +
> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
> +     allowing it in identifiers.  If we run into it, increment the
> +     previous character instead and shorten the string.  If the very
> +     first character turns out to be 0xff, then the upper bound is the
> +     end of the list.

It's a bit of a nit, but I think this explanation could be a bit more
precise, and maybe simpler.  Maybe you could just say that you strip all
trailing 0xff characters, and increment the last non-0xff character of
the string.  If the string is composed only of 0xff characters, then the
upper bound is the end of the list.

The "If the very first character turns out to be 0xff" threw me off a bit,
because if you have the string "\xffa\xff", the upper bound will be "\xffb",
not the end of the list, despite the very first character being 0xff.

> +
> +     E.g., with these symbols:
> +
> +      func
> +      func1
> +      fund
> +
> +     completing "func" looks for symbols between "func" and
> +     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
> +     which finds "func" and "func1", but not "fund".
> +
> +     And with:
> +
> +      funcÿ     (Latin1 'ÿ' [0xff])
> +      funcÿ1
> +      fund
> +
> +     completing "funcÿ", look for symbols between "funcÿ" and "fund"

looks

Simon
  
Pedro Alves Nov. 20, 2017, 11:56 a.m. UTC | #2
On 11/20/2017 01:38 AM, Simon Marchi wrote:

> Everything you said makes sense to me, the patch looks good to me.  I noted
> one comment and a typo below.

Thanks, see below.

>> +/* Starting from a search name, return the string that finds the upper
>> +   bound of all strings that start with SEARCH_NAME in a sorted name
>> +   list.  Returns the empty string to indicate that the upper bound is
>> +   the end of the list.  */
>> +
>> +static std::string
>> +make_sort_after_prefix_name (const char *search_name)
>> +{
>> +  /* When looking to complete "func", we find the upper bound of all
>> +     symbols that start with "func" by looking for where we'd insert
>> +     "func"-with-last-character-incremented, i.e. "fund".  */
>> +  std::string after = search_name;
>> +
>> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
>> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
>> +     allowing it in identifiers.  If we run into it, increment the
>> +     previous character instead and shorten the string.  If the very
>> +     first character turns out to be 0xff, then the upper bound is the
>> +     end of the list.
> 
> It's a bit of a nit, but I think this explanation could be a bit more
> precise, and maybe simpler.  Maybe you could just say that you strip all
> trailing 0xff characters, and increment the last non-0xff character of
> the string.  If the string is composed only of 0xff characters, then the
> upper bound is the end of the list.

My problem with that is that it wouldn't explain _why_ we strip
the 0xffs.

> 
> The "If the very first character turns out to be 0xff" threw me off a bit,
> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
> not the end of the list, despite the very first character being 0xff.

I like that example.  How about the following.  It's even longer, but
I think it's justified.

/* Starting from a search name, return the string that finds the upper
   bound of all strings that start with SEARCH_NAME in a sorted name
   list.  Returns the empty string to indicate that the upper bound is
   the end of the list.  */

static std::string
make_sort_after_prefix_name (const char *search_name)
{
  /* When looking to complete "func", we find the upper bound of all
     symbols that start with "func" by looking for where we'd insert
     the closest string that would follow "func" in lexicographical
     order.  Usually, that's "func"-with-last-character-incremented,
     i.e. "fund".  Mind non-ASCII characters, though.  Usually those
     will be UTF-8 multi-byte sequences, but we can't be certain.
     Especially mind the 0xff character, which is a valid character in
     non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
     rule out compilers allowing it in identifiers.  Note that
     conveniently, strcmp/strcasecmp are specified to compare
     characters interpreted as unsigned char.  So what we do is treat
     the whole string as a base 255 number composed of a sequence of
     base 255 "digits" and add 1 to it.  I.e., adding 1 to 0xff wraps
     to 0, and carries 1 to the following more-significant position.
     If the very first character carries/overflows, then the upper
     bound is the end of the list.  Also, the string after the empty
     string is the empty string.

     Some examples of this operation:

       SEARCH_NAME  => "+1" RESULT

       "abc"        => "abd"
       "ab\xff"     => "ac"
       "\xffa\xff"  => "\xffb"
       "\xff"       => ""
       "\xff\xff"   => ""
       ""           => ""

     Then, with these symbols for example:

      func
      func1
      fund

     completing "func" looks for symbols between "func" and
     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
     which finds "func" and "func1", but not "fund".

     And with:

      funcÿ     (Latin1 'ÿ' [0xff])
      funcÿ1
      fund

     completing "funcÿ" looks for symbols between "funcÿ" and "fund"
     (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund".

     And with:

      ÿÿ        (Latin1 'ÿ' [0xff])
      ÿÿ1

     completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and
     the end of the list.
  */
  std::string after = search_name;
  while (!after.empty () && (unsigned char) after.back () == 0xff)
    after.pop_back ();
  if (!after.empty ())
    after.back () = (unsigned char) after.back () + 1;
  return after;
}
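
(Not part of the patch: a quick sanity check of the "+1" table above,
assuming the function is reachable from a unit test and that GDB's
SELF_CHECK selftest macro is available.  Note the split string
literals; "\xffa" would otherwise parse as a single, overlong hex
escape:)

static void
test_make_sort_after_prefix_name ()
{
  SELF_CHECK (make_sort_after_prefix_name ("abc") == "abd");
  SELF_CHECK (make_sort_after_prefix_name ("ab\xff") == "ac");
  SELF_CHECK (make_sort_after_prefix_name ("\xff" "a" "\xff") == "\xff" "b");
  SELF_CHECK (make_sort_after_prefix_name ("\xff") == "");
  SELF_CHECK (make_sort_after_prefix_name ("\xff\xff") == "");
  SELF_CHECK (make_sort_after_prefix_name ("") == "");
}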

>> +     completing "funcÿ", look for symbols between "funcÿ" and "fund"
> 
> looks

Fixed above.

Thanks,
Pedro Alves
  
Simon Marchi Nov. 20, 2017, 4:50 p.m. UTC | #3
On 2017-11-20 06:56 AM, Pedro Alves wrote:
>>> +/* Starting from a search name, return the string that finds the upper
>>> +   bound of all strings that start with SEARCH_NAME in a sorted name
>>> +   list.  Returns the empty string to indicate that the upper bound is
>>> +   the end of the list.  */
>>> +
>>> +static std::string
>>> +make_sort_after_prefix_name (const char *search_name)
>>> +{
>>> +  /* When looking to complete "func", we find the upper bound of all
>>> +     symbols that start with "func" by looking for where we'd insert
>>> +     "func"-with-last-character-incremented, i.e. "fund".  */
>>> +  std::string after = search_name;
>>> +
>>> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
>>> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
>>> +     allowing it in identifiers.  If we run into it, increment the
>>> +     previous character instead and shorten the string.  If the very
>>> +     first character turns out to be 0xff, then the upper bound is the
>>> +     end of the list.
>>
>> It's a bit of a nit, but I think this explanation could be a bit more
>> precise, and maybe simpler.  Maybe you could just say that you strip all
>> trailing 0xff characters, and increment the last non-0xff character of
>> the string.  If the string is composed only of 0xff characters, then the
>> upper bound is the end of the list.
> 
> My problem with that is that it wouldn't explain _why_ we strip
> the 0xffs.

Right, the comment should say why, not how.

>>
>> The "If the very first character turns out to be 0xff" threw me off a bit,
>> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
>> not the end of the list, despite the very first character being 0xff.
> 
> I like that example.  How about the following.  It's even longer, but
> I think it's justified.
> 
> /* Starting from a search name, return the string that finds the upper
>    bound of all strings that start with SEARCH_NAME in a sorted name
>    list.  Returns the empty string to indicate that the upper bound is
>    the end of the list.  */
> 
> static std::string
> make_sort_after_prefix_name (const char *search_name)
> {
>   /* When looking to complete "func", we find the upper bound of all
>      symbols that start with "func" by looking for where we'd insert
>      the closest string that would follow "func" in lexicographical
>      order.  Usually, that's "func"-with-last-character-incremented,
>      i.e. "fund".  Mind non-ASCII characters, though.  Usually those
>      will be UTF-8 multi-byte sequences, but we can't be certain.
>      Especially mind the 0xff character, which is a valid character in
>      non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
>      rule out compilers allowing it in identifiers.  Note that
>      conveniently, strcmp/strcasecmp are specified to compare
>      characters interpreted as unsigned char.  So what we do is treat
>      the whole string as a base 255 number composed of a sequence of
>      base 255 "digits" and add 1 to it.  I.e., adding 1 to 0xff wraps
>      to 0, and carries 1 to the following more-significant position.
>      If the very first character carries/overflows, then the upper
>      bound is the end of the list.  Also, the string after the empty
>      string is the empty string.

Making an analogy with base-10 arithmetic is actually what made me
understand it.  The number after 149 is not 140, it's 150.  We're
doing the string equivalent of that.  Your explanation with base-255
numbers is very good.  It doesn't really work for all-0xff strings,
because adding one (with carry) to "\xff\xff" would give "\x01\x00\x00",
but it doesn't really matter for the explanation :).

Simon
  

Patch

diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index 33ecf13..fdfbf15 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -1304,6 +1304,28 @@  d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
 		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
 }
 
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+   multi-byte sequence bytes and non-ASCII characters in
+   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
+   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
+   identifiers, but we don't need to be pedantic so for simplicity we
+   ignore that here.  Plus this avoids the complication of actually
+   knowing what was the right encoding.  */
+
+static inline bool
+cp_ident_is_alpha (unsigned char ch)
+{
+  return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but like ISALNUM.  */
+
+static inline bool
+cp_ident_is_alnum (unsigned char ch)
+{
+  return ISALNUM (ch) || ch >= 0x80;
+}
+
 /* Find the end of a symbol name starting at LEXPTR.  */
 
 static const char *
@@ -1311,7 +1333,7 @@  symbol_end (const char *lexptr)
 {
   const char *p = lexptr;
 
-  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
+  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
     p++;
 
   return p;
@@ -1791,7 +1813,7 @@  yylex (void)
       return ERROR;
     }
 
-  if (!(c == '_' || c == '$' || ISALPHA (c)))
+  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
     {
       /* We must have come across a bad character (e.g. ';').  */
       yyerror (_("invalid character"));
@@ -1802,7 +1824,7 @@  yylex (void)
   namelen = 0;
   do
     c = tokstart[++namelen];
-  while (ISALNUM (c) || c == '_' || c == '$');
+  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
 
   lexptr += namelen;
 
diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
index 5437d21..b08e81c 100644
--- a/gdb/dwarf2read.c
+++ b/gdb/dwarf2read.c
@@ -4188,6 +4188,60 @@  gdb_index_symbol_name_matcher::matches (const char *symbol_name)
   return false;
 }
 
+/* Starting from a search name, return the string that finds the upper
+   bound of all strings that start with SEARCH_NAME in a sorted name
+   list.  Returns the empty string to indicate that the upper bound is
+   the end of the list.  */
+
+static std::string
+make_sort_after_prefix_name (const char *search_name)
+{
+  /* When looking to complete "func", we find the upper bound of all
+     symbols that start with "func" by looking for where we'd insert
+     "func"-with-last-character-incremented, i.e. "fund".  */
+  std::string after = search_name;
+
+  /* Mind 0xff though, which is a valid character in non-UTF-8 source
+     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
+     allowing it in identifiers.  If we run into it, increment the
+     previous character instead and shorten the string.  If the very
+     first character turns out to be 0xff, then the upper bound is the
+     end of the list.
+
+     E.g., with these symbols:
+
+      func
+      func1
+      fund
+
+     completing "func" looks for symbols between "func" and
+     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
+     which finds "func" and "func1", but not "fund".
+
+     And with:
+
+      funcÿ     (Latin1 'ÿ' [0xff])
+      funcÿ1
+      fund
+
+     completing "funcÿ", look for symbols between "funcÿ" and "fund"
+     (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund".
+
+     And with:
+
+      ÿÿ        (Latin1 'ÿ' [0xff])
+      ÿÿ1
+
+     completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and
+     the end of the list.
+  */
+  while (!after.empty () && (unsigned char) after.back () == 0xff)
+    after.pop_back ();
+  if (!after.empty ())
+    after.back () = (unsigned char) after.back () + 1;
+  return after;
+}
+
 /* Helper for dw2_expand_symtabs_matching that works with a
    mapped_index instead of the containing objfile.  This is split to a
    separate function in order to be able to unit test the
@@ -4303,21 +4357,20 @@  dw2_expand_symtabs_matching_symbol
     {
       if (lookup_name_in.completion_mode ())
 	{
-	  /* The string frobbing below won't work if the string is
-	     empty.  We don't need it then, anyway -- if we're
-	     completing an empty string, then we want to iterate over
-	     the whole range.  */
-	  if (cplus[0] == '\0')
+	  /* In completion mode, we want UPPER to point past all
+	     symbols names that have the same prefix.  I.e., with
+	     these symbols, and completing "func":
+
+	      function        << lower bound
+	      function1
+	      other_function  << upper bound
+
+	     We find the upper bound by looking for the insertion
+	     point of "func"-with-last-character-incremented,
+	     i.e. "fund".  */
+	  std::string after = make_sort_after_prefix_name (cplus);
+	  if (after.empty ())
 	    return end;
-
-	  /* In completion mode, increment the last character because
-	     we want UPPER to point past all symbols names that have
-	     the same prefix.  */
-	  std::string after = cplus;
-
-	  gdb_assert (after.back () != 0xff);
-	  after.back ()++;
-
 	  return std::upper_bound (lower, end, after.c_str (),
 				   lookup_compare_upper);
 	}
@@ -4493,6 +4546,22 @@  static const char *test_symbols[] = {
   "ns::foo<int>",
   "ns::foo<long>",
 
+  /* These are used to check that the increment-last-char in the
+     matching algorithm for completion doesn't match "t1_fund" when
+     completing "t1_func".  */
+  "t1_func",
+  "t1_func1",
+  "t1_fund",
+  "t1_fund1",
+
+  /* A UTF-8 name with multi-byte sequences to make sure that
+     cp-name-parser understands this as a single identifier ("função"
+     is "function" in PT).  */
+  u8"u8função",
+
+  /* \377 / 0xff is Latin1 'ÿ'.  */
+  "yfunc\377",
+
   /* A name with all sorts of complications.  Starts with "z" to make
      it easier for the completion tests below.  */
 #define Z_SYM_NAME \
@@ -4500,7 +4569,11 @@  static const char *test_symbols[] = {
     "::tuple<(anonymous namespace)::ui*, " \
     "std::default_delete<(anonymous namespace)::ui>, void>"
 
-  Z_SYM_NAME
+  Z_SYM_NAME,
+
+  /* \377 / 0xff is Latin1 'ÿ'.  */
+  "\377",
+  "\377\377123",
 };
 
 static void
@@ -4551,6 +4624,22 @@  run_test ()
 		   {});
     }
 
+  /* Check that the name matching algorithm for completion doesn't get
+     confused with Latin1 'ÿ' / 0xff.  */
+  {
+    static const char str[] = "\377";
+    CHECK_MATCH (str, symbol_name_match_type::FULL, true,
+		 EXPECT ("\377", "\377\377123"));
+  }
+
+  /* Check that the increment-last-char in the matching algorithm for
+     completion doesn't match "t1_fund" when completing "t1_func".  */
+  {
+    static const char str[] = "t1_func";
+    CHECK_MATCH (str, symbol_name_match_type::FULL, true,
+		 EXPECT ("t1_func", "t1_func1"));
+  }
+
   /* Check that completion mode works at each prefix of the expected
      symbol name.  */
   {