[1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers

Message ID 1511138515-25996-1-git-send-email-palves@redhat.com
State New, archived

Commit Message

Pedro Alves Nov. 20, 2017, 12:41 a.m. UTC
  The find-upper-bound-for-completion algorithm in the name components
accelerator table in dwarf2read.c increments a char in a string, and
asserts that it's not incrementing a 0xff char, but that's incorrect.

First, we shouldn't be calling gdb_assert on input.

Then, if "char" is signed, comparing a caracther with "0xff" will
never yield true, which is caught by Clang with:

  error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
	    gdb_assert (after.back () != 0xff);
			~~~~~~~~~~~~~ ^  ~~~~
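
(For illustration only, not part of the patch: a minimal standalone
reproduction of the tautology, assuming a target where plain "char"
is signed:)

  #include <cassert>
  #include <string>

  int main ()
  {
    std::string after = "func\xff";

    /* With a signed "char", back () yields (char) -1, which promotes
       to the int -1, never to 255, so the assertion below can't fire;
       the comparison is tautologically true.  */
    assert (after.back () != 0xff);

    /* Comparing as unsigned char is what actually inspects the byte.  */
    assert ((unsigned char) after.back () == 0xff);
  }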

And then, 0xff is a valid character in non-UTF-8/ASCII character sets.
E.g., it's 'ÿ' in Latin1.  While neither GCC nor Clang supports
non-ASCII, non-UTF-8 characters in identifiers (GCC supports UTF-8
characters only via UCNs, see
https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html), other
compilers might (Visual Studio?), so it doesn't hurt to
handle it correctly.  Testing is covered by extending the
dw2_expand_symtabs_matching unit tests with relevant cases.

However, without further changes, the unit tests still fail...  The
problem is that cp-name-parser.y assumes that identifiers are ASCII
(via ISALPHA/ISALNUM).  This commit fixes that too, so that we can
unit test the dwarf2read.c changes.  (The regular C/C++ lexer in
c-lang.y needs a similar treatment, but I'm leaving that for another
patch.)
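
(Also for illustration only, not part of the patch: a sketch of why an
ISALPHA/ISALNUM-based lexer breaks on UTF-8 identifiers.  It uses
<cctype>'s isalnum in the default "C" locale as a stand-in for the
ISALNUM macro the lexer uses:)

  #include <cassert>
  #include <cctype>

  int main ()
  {
    /* 'ç' in UTF-8 is the byte sequence 0xc3 0xa7.  Neither byte is
       alphanumeric, so a lexer that only accepts ISALNUM bytes stops
       in the middle of u8"função".  */
    unsigned char c = 0xc3;
    assert (!isalnum (c));

    /* Accepting any byte with the high bit set (what the new
       cp_ident_is_alnum does) keeps the multi-byte sequence inside
       the identifier.  */
    assert (isalnum (c) || c >= 0x80);
  }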

While doing this, I noticed a thinko in the computation of the upper
bound for completion in dw2_expand_symtabs_matching_symbol.  We're
using std::upper_bound but we should use std::lower_bound.  I extended
the unit test with a case that I thought would expose it, this one:

 +  /* These are used to check that the increment-last-char in the
 +     matching algorithm for completion doesn't match "t1_fund" when
 +     completing "t1_func".  */
 +  "t1_func",
 +  "t1_func1",
 +  "t1_fund",
 +  "t1_fund1",

The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
matches incorrectly.  But it turns out the problem is masked because
later here:

  for (;lower != upper; ++lower)
    {
      const char *qualified = index.symbol_name_at (lower->idx);

      if (!lookup_name_matcher.matches (qualified)

the lookup_name_matcher.matches check above filters out "t1_fund"
because that doesn't start with "t1_func".
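
(To make the thinko concrete, here is a minimal standalone sketch --
not GDB code; it uses plain std::string and the default ordering
instead of the index's name_components and lookup_compare_upper:)

  #include <algorithm>
  #include <cassert>
  #include <string>
  #include <vector>

  int main ()
  {
    /* Sorted name list, as in the unit test.  */
    std::vector<std::string> names
      = { "t1_func", "t1_func1", "t1_fund", "t1_fund1" };

    /* Completing "t1_func", the end of the matching range is found by
       searching for "t1_func"-with-last-character-incremented.  */
    std::string after = "t1_fund";

    /* std::upper_bound points past "t1_fund", so "t1_fund" would fall
       inside the completion range...  */
    auto ub = std::upper_bound (names.begin (), names.end (), after);
    assert (*ub == "t1_fund1");

    /* ...while std::lower_bound points at "t1_fund" itself, correctly
       leaving it out of the half-open range.  */
    auto lb = std::lower_bound (names.begin (), names.end (), after);
    assert (*lb == "t1_fund");
  }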

I'll fix the latent bug in follow up patches, after factoring things
out a bit in a way that allows unit testing the relevant code more
directly.

gdb/ChangeLog:
yyyy-mm-dd  Pedro Alves  <palves@redhat.com>

	* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
	(symbol_end): Use cp_ident_is_alnum.
	(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
	* dwarf2read.c (make_sort_after_prefix_name): New function.
	(dw2_expand_symtabs_matching_symbol): Use it.
	(test_symbols): Add more symbols.
	(run_test): Add tests.
---
 gdb/cp-name-parser.y |  28 ++++++++++--
 gdb/dwarf2read.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 129 insertions(+), 18 deletions(-)
  

Comments

Simon Marchi Nov. 20, 2017, 1:38 a.m. UTC | #1
On 2017-11-19 07:41 PM, Pedro Alves wrote:
> The find-upper-bound-for-completion algorithm in the name components
> accelerator table in dwarf2read.c increments a char in a string, and
> asserts that it's not incrementing a 0xff char, but that's incorrect.
> 
> First, we shouldn't be calling gdb_assert on input.
> 
> Then, if "char" is signed, comparing a caracther with "0xff" will
> never yield true, which is caught by Clang with:
> 
>   error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
> 	    gdb_assert (after.back () != 0xff);
> 			~~~~~~~~~~~~~ ^  ~~~~
> 
> And then, 0xff is a valid character in non-UTF-8/ASCII character sets.
> E.g., it's 'ÿ' in Latin1.  While neither GCC nor Clang supports
> non-ASCII, non-UTF-8 characters in identifiers (GCC supports UTF-8
> characters only via UCNs, see
> https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html), other
> compilers might (Visual Studio?), so it doesn't hurt to
> handle it correctly.  Testing is covered by extending the
> dw2_expand_symtabs_matching unit tests with relevant cases.
> 
> However, without further changes, the unit tests still fail...  The
> problem is that cp-name-parser.y assumes that identifiers are ASCII
> (via ISALPHA/ISALNUM).  This commit fixes that too, so that we can
> unit test the dwarf2read.c changes.  (The regular C/C++ lexer in
> c-lang.y needs a similar treatment, but I'm leaving that for another
> patch.)
> 
> While doing this, I noticed a thinko in the computation of the upper
> bound for completion in dw2_expand_symtabs_matching_symbol.  We're
> using std::upper_bound but we should use std::lower_bound.  I extended
> the unit test with a case that I thought would expose it, this one:
> 
>  +  /* These are used to check that the increment-last-char in the
>  +     matching algorithm for completion doesn't match "t1_fund" when
>  +     completing "t1_func".  */
>  +  "t1_func",
>  +  "t1_func1",
>  +  "t1_fund",
>  +  "t1_fund1",
> 
> The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
> matches incorrectly.  But it turns out the problem is masked because
> later here:
> 
>   for (;lower != upper; ++lower)
>     {
>       const char *qualified = index.symbol_name_at (lower->idx);
> 
>       if (!lookup_name_matcher.matches (qualified)
> 
> the lookup_name_matcher.matches check above filters out "t1_fund"
> because that doesn't start with "t1_func".
> 
> I'll fix the latent bug in follow up patches, after factoring things
> out a bit in a way that allows unit testing the relevant code more
> directly.

Everything you said makes sense to me, the patch looks good to me.  I noted
one comment and a typo below.

> gdb/ChangeLog:
> yyyy-mm-dd  Pedro Alves  <palves@redhat.com>
> 
> 	* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
> 	(symbol_end): Use cp_ident_is_alnum.
> 	(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
> 	* dwarf2read.c (make_sort_after_prefix_name): New function.
> 	(dw2_expand_symtabs_matching_symbol): Use it.
> 	(test_symbols): Add more symbols.
> 	(run_test): Add tests.
> ---
>  gdb/cp-name-parser.y |  28 ++++++++++--
>  gdb/dwarf2read.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 129 insertions(+), 18 deletions(-)
> 
> diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
> index 33ecf13..fdfbf15 100644
> --- a/gdb/cp-name-parser.y
> +++ b/gdb/cp-name-parser.y
> @@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
>  		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
>  }
>  
> +/* Like ISALPHA, but also returns true for the union of all UTF-8
> +   multi-byte sequence bytes and non-ASCII characters in
> +   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
> +   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
> +   identifiers, but we don't need to be pedantic so for simplicity we
> +   ignore that here.  Plus this avoids the complication of actually
> +   knowing what was the right encoding.  */
> +
> +static inline bool
> +cp_ident_is_alpha (unsigned char ch)
> +{
> +  return ISALPHA (ch) || ch >= 0x80;
> +}
> +
> +/* Similarly, but like ISALNUM.  */
> +
> +static inline bool
> +cp_ident_is_alnum (unsigned char ch)
> +{
> +  return ISALNUM (ch) || ch >= 0x80;
> +}
> +
>  /* Find the end of a symbol name starting at LEXPTR.  */
>  
>  static const char *
> @@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr)
>  {
>    const char *p = lexptr;
>  
> -  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
> +  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
>      p++;
>  
>    return p;
> @@ -1791,7 +1813,7 @@ yylex (void)
>        return ERROR;
>      }
>  
> -  if (!(c == '_' || c == '$' || ISALPHA (c)))
> +  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
>      {
>        /* We must have come across a bad character (e.g. ';').  */
>        yyerror (_("invalid character"));
> @@ -1802,7 +1824,7 @@ yylex (void)
>    namelen = 0;
>    do
>      c = tokstart[++namelen];
> -  while (ISALNUM (c) || c == '_' || c == '$');
> +  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
>  
>    lexptr += namelen;
>  
> diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
> index 5437d21..b08e81c 100644
> --- a/gdb/dwarf2read.c
> +++ b/gdb/dwarf2read.c
> @@ -4188,6 +4188,60 @@ gdb_index_symbol_name_matcher::matches (const char *symbol_name)
>    return false;
>  }
>  
> +/* Starting from a search name, return the string that finds the upper
> +   bound of all strings that start with SEARCH_NAME in a sorted name
> +   list.  Returns the empty string to indicate that the upper bound is
> +   the end of the list.  */
> +
> +static std::string
> +make_sort_after_prefix_name (const char *search_name)
> +{
> +  /* When looking to complete "func", we find the upper bound of all
> +     symbols that start with "func" by looking for where we'd insert
> +     "func"-with-last-character-incremented, i.e. "fund".  */
> +  std::string after = search_name;
> +
> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
> +     allowing it in identifiers.  If we run into it, increment the
> +     previous character instead and shorten the string.  If the very
> +     first character turns out to be 0xff, then the upper bound is the
> +     end of the list.

It's a bit of a nit, but I think this explanation could be a bit more
precise, and maybe simpler.  Maybe you could just say that you strip all
trailing 0xff characters, and increment the last non-0xff character of
the string.  If the string is composed only of 0xff characters, then the
upper bound is the end of the list.

The "If the very first character turns out to be 0xff" threw me off a bit,
because if you have the string "\xffa\xff", the upper bound will be "\xffb",
not the end of the list, despite the very first character being 0xff.

> +
> +     E.g., with these symbols:
> +
> +      func
> +      func1
> +      fund
> +
> +     completing "func" looks for symbols between "func" and
> +     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
> +     which finds "func" and "func1", but not "fund".
> +
> +     And with:
> +
> +      funcÿ     (Latin1 'ÿ' [0xff])
> +      funcÿ1
> +      fund
> +
> +     completing "funcÿ", look for symbols between "funcÿ" and "fund"

looks

Simon
  
Pedro Alves Nov. 20, 2017, 11:56 a.m. UTC | #2
On 11/20/2017 01:38 AM, Simon Marchi wrote:

> Everything you said makes sense to me, the patch looks good to me.  I noted
> one comment and a typo below.

Thanks, see below.

>> +/* Starting from a search name, return the string that finds the upper
>> +   bound of all strings that start with SEARCH_NAME in a sorted name
>> +   list.  Returns the empty string to indicate that the upper bound is
>> +   the end of the list.  */
>> +
>> +static std::string
>> +make_sort_after_prefix_name (const char *search_name)
>> +{
>> +  /* When looking to complete "func", we find the upper bound of all
>> +     symbols that start with "func" by looking for where we'd insert
>> +     "func"-with-last-character-incremented, i.e. "fund".  */
>> +  std::string after = search_name;
>> +
>> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
>> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
>> +     allowing it in identifiers.  If we run into it, increment the
>> +     previous character instead and shorten the string.  If the very
>> +     first character turns out to be 0xff, then the upper bound is the
>> +     end of the list.
> 
> It's a bit of a nit, but I think this explanation could be a bit more
> precise, and maybe simpler.  Maybe you could just say that you strip all
> trailing 0xff characters, and increment the last non-0xff character of
> the string.  If the string is composed only of 0xff characters, then the
> upper bound is the end of the list.

My problem with that is that it wouldn't explain _why_ we strip
the 0xffs.

> 
> The "If the very first character turns out to be 0xff" threw me off a bit,
> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
> not the end of the list, despite the very first character being 0xff.

I like that example.  How about the following.  It's even longer, but
I think it's justified.

/* Starting from a search name, return the string that finds the upper
   bound of all strings that start with SEARCH_NAME in a sorted name
   list.  Returns the empty string to indicate that the upper bound is
   the end of the list.  */

static std::string
make_sort_after_prefix_name (const char *search_name)
{
  /* When looking to complete "func", we find the upper bound of all
     symbols that start with "func" by looking for where we'd insert
     the closest string that would follow "func" in lexicographical
     order.  Usually, that's "func"-with-last-character-incremented,
     i.e. "fund".  Mind non-ASCII characters, though.  Usually those
     will be UTF-8 multi-byte sequences, but we can't be certain.
     Especially mind the 0xff character, which is a valid character in
     non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
     rule out compilers allowing it in identifiers.  Note that
     conveniently, strcmp/strcasecmp are specified to compare
     characters interpreted as unsigned char.  So what we do is treat
     the whole string as a base 255 number composed of a sequence of
     base 255 "digits" and add 1 to it.  I.e., adding 1 to 0xff wraps
     to 0, and carries 1 to the following more-significant position.
     If the very first character carries/overflows, then the upper
     bound is the end of the list.  Also, the string after the empty
     string is the empty string.

     Some examples of this operation:

       SEARCH_NAME  => "+1" RESULT

       "abc"        => "abd"
       "ab\xff"     => "ac"
       "\xffa\xff"  => "\xffb"
       "\xff"       => ""
       "\xff\xff"   => ""
       ""           => ""

     Then, with these symbols for example:

      func
      func1
      fund

     completing "func" looks for symbols between "func" and
     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
     which finds "func" and "func1", but not "fund".

     And with:

      funcÿ     (Latin1 'ÿ' [0xff])
      funcÿ1
      fund

     completing "funcÿ" looks for symbols between "funcÿ" and "fund"
     (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund".

     And with:

      ÿÿ        (Latin1 'ÿ' [0xff])
      ÿÿ1

     completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and
     the end of the list.
  */
  std::string after = search_name;
  while (!after.empty () && (unsigned char) after.back () == 0xff)
    after.pop_back ();
  if (!after.empty ())
    after.back () = (unsigned char) after.back () + 1;
  return after;
}
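
(Not part of the patch: a quick sanity check of the "+1" table above,
assuming the function is reachable from a unit test and that GDB's
SELF_CHECK selftest macro is available.  Note the split string
literals; "\xffa" would otherwise parse as a single, overlong hex
escape:)

static void
test_make_sort_after_prefix_name ()
{
  SELF_CHECK (make_sort_after_prefix_name ("abc") == "abd");
  SELF_CHECK (make_sort_after_prefix_name ("ab\xff") == "ac");
  SELF_CHECK (make_sort_after_prefix_name ("\xff" "a" "\xff") == "\xff" "b");
  SELF_CHECK (make_sort_after_prefix_name ("\xff") == "");
  SELF_CHECK (make_sort_after_prefix_name ("\xff\xff") == "");
  SELF_CHECK (make_sort_after_prefix_name ("") == "");
}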

>> +     completing "funcÿ", look for symbols between "funcÿ" and "fund"
> 
> looks

Fixed above.

Thanks,
Pedro Alves
  
Simon Marchi Nov. 20, 2017, 4:50 p.m. UTC | #3
On 2017-11-20 06:56 AM, Pedro Alves wrote:
>>> +/* Starting from a search name, return the string that finds the upper
>>> +   bound of all strings that start with SEARCH_NAME in a sorted name
>>> +   list.  Returns the empty string to indicate that the upper bound is
>>> +   the end of the list.  */
>>> +
>>> +static std::string
>>> +make_sort_after_prefix_name (const char *search_name)
>>> +{
>>> +  /* When looking to complete "func", we find the upper bound of all
>>> +     symbols that start with "func" by looking for where we'd insert
>>> +     "func"-with-last-character-incremented, i.e. "fund".  */
>>> +  std::string after = search_name;
>>> +
>>> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
>>> +     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
>>> +     allowing it in identifiers.  If we run into it, increment the
>>> +     previous character instead and shorten the string.  If the very
>>> +     first character turns out to be 0xff, then the upper bound is the
>>> +     end of the list.
>>
>> It's a bit of a nit, but I think this explanation could be a bit more
>> precise, and maybe simpler.  Maybe you could just say that you strip all
>> trailing 0xff characters, and increment the last non-0xff character of
>> the string.  If the string is composed only of 0xff characters, then the
>> upper bound is the end of the list.
> 
> My problem with that is that it wouldn't explain _why_ we strip
> the 0xffs.

Right, the comment should say why, not how.

>>
>> The "If the very first character turns out to be 0xff" threw me off a bit,
>> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
>> not the end of the list, despite the very first character being 0xff.
> 
> I like that example.  How about the following.  It's even longer, but
> I think it's justified.
> 
> /* Starting from a search name, return the string that finds the upper
>    bound of all strings that start with SEARCH_NAME in a sorted name
>    list.  Returns the empty string to indicate that the upper bound is
>    the end of the list.  */
> 
> static std::string
> make_sort_after_prefix_name (const char *search_name)
> {
>   /* When looking to complete "func", we find the upper bound of all
>      symbols that start with "func" by looking for where we'd insert
>      the closest string that would follow "func" in lexicographical
>      order.  Usually, that's "func"-with-last-character-incremented,
>      i.e. "fund".  Mind non-ASCII characters, though.  Usually those
>      will be UTF-8 multi-byte sequences, but we can't be certain.
>      Especially mind the 0xff character, which is a valid character in
>      non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't
>      rule out compilers allowing it in identifiers.  Note that
>      conveniently, strcmp/strcasecmp are specified to compare
>      characters interpreted as unsigned char.  So what we do is treat
>      the whole string as a base 255 number composed of a sequence of
>      base 255 "digits" and add 1 to it.  I.e., adding 1 to 0xff wraps
>      to 0, and carries 1 to the following more-significant position.
>      If the very first character carries/overflows, then the upper
>      bound is the end of the list.  Also, the string after the empty
>      string is the empty string.

Making an analogy with base-10 arithmetic is actually what made me
understand it.  The number after 149 is not 140, it's 150.  We're
doing the string equivalent of that.  Your explanation with base-255
numbers is very good.  It doesn't really work for all-0xff strings,
because adding one (with carry) to "\xff\xff" would give "\x01\x00\x00",
but it doesn't really matter for the explanation :).

Simon
  

Patch

diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index 33ecf13..fdfbf15 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -1304,6 +1304,28 @@  d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
 		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
 }
 
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+   multi-byte sequence bytes and non-ASCII characters in
+   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
+   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
+   identifiers, but we don't need to be pedantic so for simplicity we
+   ignore that here.  Plus this avoids the complication of actually
+   knowing what was the right encoding.  */
+
+static inline bool
+cp_ident_is_alpha (unsigned char ch)
+{
+  return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but like ISALNUM.  */
+
+static inline bool
+cp_ident_is_alnum (unsigned char ch)
+{
+  return ISALNUM (ch) || ch >= 0x80;
+}
+
 /* Find the end of a symbol name starting at LEXPTR.  */
 
 static const char *
@@ -1311,7 +1333,7 @@  symbol_end (const char *lexptr)
 {
   const char *p = lexptr;
 
-  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
+  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
     p++;
 
   return p;
@@ -1791,7 +1813,7 @@  yylex (void)
       return ERROR;
     }
 
-  if (!(c == '_' || c == '$' || ISALPHA (c)))
+  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
     {
       /* We must have come across a bad character (e.g. ';').  */
       yyerror (_("invalid character"));
@@ -1802,7 +1824,7 @@  yylex (void)
   namelen = 0;
   do
     c = tokstart[++namelen];
-  while (ISALNUM (c) || c == '_' || c == '$');
+  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
 
   lexptr += namelen;
 
diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
index 5437d21..b08e81c 100644
--- a/gdb/dwarf2read.c
+++ b/gdb/dwarf2read.c
@@ -4188,6 +4188,60 @@  gdb_index_symbol_name_matcher::matches (const char *symbol_name)
   return false;
 }
 
+/* Starting from a search name, return the string that finds the upper
+   bound of all strings that start with SEARCH_NAME in a sorted name
+   list.  Returns the empty string to indicate that the upper bound is
+   the end of the list.  */
+
+static std::string
+make_sort_after_prefix_name (const char *search_name)
+{
+  /* When looking to complete "func", we find the upper bound of all
+     symbols that start with "func" by looking for where we'd insert
+     "func"-with-last-character-incremented, i.e. "fund".  */
+  std::string after = search_name;
+
+  /* Mind 0xff though, which is a valid character in non-UTF-8 source
+     character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers
+     allowing it in identifiers.  If we run into it, increment the
+     previous character instead and shorten the string.  If the very
+     first character turns out to be 0xff, then the upper bound is the
+     end of the list.
+
+     E.g., with these symbols:
+
+      func
+      func1
+      fund
+
+     completing "func" looks for symbols between "func" and
+     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
+     which finds "func" and "func1", but not "fund".
+
+     And with:
+
+      funcÿ     (Latin1 'ÿ' [0xff])
+      funcÿ1
+      fund
+
+     completing "funcÿ", look for symbols between "funcÿ" and "fund"
+     (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund".
+
+     And with:
+
+      ÿÿ        (Latin1 'ÿ' [0xff])
+      ÿÿ1
+
+     completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and
+     the end of the list.
+  */
+  while (!after.empty () && (unsigned char) after.back () == 0xff)
+    after.pop_back ();
+  if (!after.empty ())
+    after.back () = (unsigned char) after.back () + 1;
+  return after;
+}
+
 /* Helper for dw2_expand_symtabs_matching that works with a
    mapped_index instead of the containing objfile.  This is split to a
    separate function in order to be able to unit test the
@@ -4303,21 +4357,20 @@  dw2_expand_symtabs_matching_symbol
     {
       if (lookup_name_in.completion_mode ())
 	{
-	  /* The string frobbing below won't work if the string is
-	     empty.  We don't need it then, anyway -- if we're
-	     completing an empty string, then we want to iterate over
-	     the whole range.  */
-	  if (cplus[0] == '\0')
+	  /* In completion mode, we want UPPER to point past all
+	     symbols names that have the same prefix.  I.e., with
+	     these symbols, and completing "func":
+
+	      function        << lower bound
+	      function1
+	      other_function  << upper bound
+
+	     We find the upper bound by looking for the insertion
+	     point of "func"-with-last-character-incremented,
+	     i.e. "fund".  */
+	  std::string after = make_sort_after_prefix_name (cplus);
+	  if (after.empty ())
 	    return end;
-
-	  /* In completion mode, increment the last character because
-	     we want UPPER to point past all symbols names that have
-	     the same prefix.  */
-	  std::string after = cplus;
-
-	  gdb_assert (after.back () != 0xff);
-	  after.back ()++;
-
 	  return std::upper_bound (lower, end, after.c_str (),
 				   lookup_compare_upper);
 	}
@@ -4493,6 +4546,22 @@  static const char *test_symbols[] = {
   "ns::foo<int>",
   "ns::foo<long>",
 
+  /* These are used to check that the increment-last-char in the
+     matching algorithm for completion doesn't match "t1_fund" when
+     completing "t1_func".  */
+  "t1_func",
+  "t1_func1",
+  "t1_fund",
+  "t1_fund1",
+
+  /* A UTF-8 name with multi-byte sequences to make sure that
+     cp-name-parser understands this as a single identifier ("função"
+     is "function" in PT).  */
+  u8"u8função",
+
+  /* \377 / 0xff is Latin1 'ÿ'.  */
+  "yfunc\377",
+
   /* A name with all sorts of complications.  Starts with "z" to make
      it easier for the completion tests below.  */
 #define Z_SYM_NAME \
@@ -4500,7 +4569,11 @@  static const char *test_symbols[] = {
     "::tuple<(anonymous namespace)::ui*, " \
     "std::default_delete<(anonymous namespace)::ui>, void>"
 
-  Z_SYM_NAME
+  Z_SYM_NAME,
+
+  /* \377 / 0xff is Latin1 'ÿ'.  */
+  "\377",
+  "\377\377123",
 };
 
 static void
@@ -4551,6 +4624,22 @@  run_test ()
 		   {});
     }
 
+  /* Check that the name matching algorithm for completion doesn't get
+     confused with Latin1 'ÿ' / 0xff.  */
+  {
+    static const char str[] = "\377";
+    CHECK_MATCH (str, symbol_name_match_type::FULL, true,
+		 EXPECT ("\377", "\377\377123"));
+  }
+
+  /* Check that the increment-last-char in the matching algorithm for
+     completion doesn't match "t1_fund" when completing "t1_func".  */
+  {
+    static const char str[] = "t1_func";
+    CHECK_MATCH (str, symbol_name_match_type::FULL, true,
+		 EXPECT ("t1_func", "t1_func1"));
+  }
+
   /* Check that completion mode works at each prefix of the expected
      symbol name.  */
   {