support C/C++ identifiers named with non-ASCII characters

Message ID: 9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name
State: New, archived

Commit Message

張俊芝 May 21, 2018, 8:52 a.m. UTC
  Hello, team.

This patch fixes the bug at
https://sourceware.org/bugzilla/show_bug.cgi?id=22973 .

Here is how to test the patch:

Step 1. If you are using Clang or any other C compiler that has implemented
         support for Unicode identifiers, then create a C file with the
         following content:

int main(int 參量, char* 參[])
{
   struct 集
   {
     int 數[3];
   } 集 = {100, 200, 300};
   int 序 = 2;
   return 0;
}

Or if you are using GCC, create a C file with the following content as a
workaround (GCC still doesn't actually support Unicode identifiers written
directly in source in 2018, which is a pity):

int main(int \u53C3\u91CF, char* \u53C3[])
{
   struct \u96C6
   {
     int \u6578[3];
   } \u96C6 = {100, 200, 300};
   int \u5E8F = 2;
   return 0;
}

Step 2. Compile the C file.

Step 3. Run GDB on the compiled executable and add a breakpoint at the "return 0;" line.

Step 4. Run until the breakpoint.

Step 5. Test the following commands to see if they work:
         p 參量
         p 參
         p 集
         p 集.數
         p 集.數[序]

Thanks for your review.
2018-05-20  張俊芝  <zjz@zjz.name>

	* gdb/c-exp.y (is_identifier_separator): New function.
	(lex_one_token): Now recognizes C and C++ Unicode identifiers by using
	is_identifier_separator to determine the boundary of a token.
  

Comments

Simon Marchi May 21, 2018, 2:03 p.m. UTC | #1
On 2018-05-21 04:52 AM, 張俊芝 wrote:
> Hello, team.
> 
> This patch fixes the bug at
> https://sourceware.org/bugzilla/show_bug.cgi?id=22973 .
> 
> Here is how to test the patch:
> 
> Step 1. If you are using Clang or any other C compilers that have 
> implemented
>          support for Unicode identifiers, then create a C file with the 
> following
>          content:
> 
> int main(int 參量, char* 參[])
> {
>    struct 集
>    {
>      int 數[3];
>    } 集 = {100, 200, 300};
>    int 序 = 2;
>    return 0;
> }
> 
> Or if you are using GCC, create a C file with the following content as a
> workaround(GCC still doesn't actually support Unicode identifiers in 
> 2018, which
> is a pity):
> 
> int main(int \u53C3\u91CF, char* \u53C3[])
> {
>    struct \u96C6
>    {
>      int \u6578[3];
>    } \u96C6 = {100, 200, 300};
>    int \u5E8F = 2;
>    return 0;
> }
> 
> Step 2. Compile the C file.
> 
> Step 3. Run GDB for the compiled executable, add a breakpoint in "return 0".
> 
> Step 4. Run until the breakpoint.
> 
> Step 5. Test the following commands to see if they work:
>          p 參量
>          p 參
>          p 集
>          p 集.數
>          p 集.數[序]
> 
> Thanks for your review.
> 

Hi Zhang,

Thanks for the patch.  I tested it quickly and it seems to work as expected.

Could you please write a small test case in testsuite/gdb.base with the example
you gave, so we make sure this doesn't get broken later?  If you can write it
in such a way that both clang and gcc understand it, that would be better, because
most people run the testsuite using gcc to compile test programs.

I am not a specialist in lexing and parsing C, so can you explain quickly why
you think this is a good solution?  Briefly, I understand that you change the
identifier recognition algorithm to use a blacklist of characters rather than
a whitelist, so bytes that are not recognized (such as those that compose
the UTF-8 encoded characters) are no longer rejected.

Given unlimited time, would the right solution be to use a lib to parse the
string as utf-8, and reject strings that are not valid utf-8?
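
For reference, such a check does not strictly need a library; a minimal
well-formedness test is only a few lines of C.  Something along these lines
(just a sketch with a hypothetical name, ignoring overlong forms and
surrogates -- not part of the patch):

#include <stdbool.h>

/* Return true if the NUL-terminated byte string STR is well-formed
   UTF-8, i.e. every lead byte is followed by the right number of
   continuation bytes.  Overlong forms and surrogates are not checked,
   so this is looser than a full validator.  */

static bool
utf8_well_formed (const char *str)
{
  const unsigned char *p = (const unsigned char *) str;

  while (*p != '\0')
    {
      int extra;

      if (*p < 0x80)
        extra = 0;              /* Plain ASCII byte.  */
      else if ((*p & 0xe0) == 0xc0)
        extra = 1;              /* Lead byte of a 2-byte sequence.  */
      else if ((*p & 0xf0) == 0xe0)
        extra = 2;              /* Lead byte of a 3-byte sequence.  */
      else if ((*p & 0xf8) == 0xf0)
        extra = 3;              /* Lead byte of a 4-byte sequence.  */
      else
        return false;           /* Stray continuation or invalid byte.  */

      p++;
      for (; extra > 0; extra--, p++)
        if ((*p & 0xc0) != 0x80)
          return false;         /* Missing continuation byte.  */
    }

  return true;
}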

Here are some notes and formatting comments:

> +static bool is_identifier_separator (char);

You don't have to forward declare the function if it's not necessary.

> +    /* '<' should not be a token separator, because it can be an open angle
> +       bracket followed by a nested template identifier in C++. */

Please use two spaces after the final period (...C++.  */).

> +  if (is_identifier_separator(c))

Please use a space before the parentheses:

  is_identifier_separator (c)


> +  for (c = tokstart[namelen]; !is_identifier_separator(c);)

Here too.

Thanks!

Simon
  
Koning, Paul May 21, 2018, 2:12 p.m. UTC | #2
> On May 21, 2018, at 10:03 AM, Simon Marchi <simark@simark.ca> wrote:
> 
> ...
> I am not a specialist in lexing and parsing C, so can you explain quickly why
> you think this is a good solution?  Quickly, I understand that you change the
> identifier recognition algorithm to a blacklist of characters rather than
> a whitelist, so bytes that are not recognized (such as those that compose
> the utf-8 encoded characters) are not rejected.
> 
> Given unlimited time, would the right solution be to use a lib to parse the
> string as utf-8, and reject strings that are not valid utf-8?

This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.

	paul
  
Eli Zaretskii May 21, 2018, 4:12 p.m. UTC | #3
> From: <Paul.Koning@dell.com>
> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
> Date: Mon, 21 May 2018 14:12:12 +0000
> 
> > Given unlimited time, would the right solution be to use a lib to parse the
> > string as utf-8, and reject strings that are not valid utf-8?
> 
> This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.

Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
can not include invalid UTF-8 sequences?
  
張俊芝 May 21, 2018, 5:45 p.m. UTC | #4
> 
> 
> On 2018/5/21 10:03 PM, Simon Marchi wrote:
>> Could you please write a small test case in testsuite/gdb.base with 
>> the example
>> you gave, so we make sure this doesn't get broken later?  If you can 
>> write it
>> in such a way that both clang and gcc understand it would be better, 
>> because
>> most people run the testuite using gcc to compile test programs.
>>
Oops, sorry, Simon, I forgot the test part in the second upload.

Clang is compatible with the GCC workaround \uXXXX. So I will write the 
test case in that format.

But it's late here, I will do it tomorrow.
  
Matt Rice May 21, 2018, 6 p.m. UTC | #5
On Mon, May 21, 2018 at 10:45 AM, 張俊芝 <zjz@zjz.name> wrote:
>>
>>
>> On 2018/5/21 10:03 PM, Simon Marchi wrote:
>>>
>>> Could you please write a small test case in testsuite/gdb.base with the
>>> example
>>> you gave, so we make sure this doesn't get broken later?  If you can
>>> write it
>>> in such a way that both clang and gcc understand it would be better,
>>> because
>>> most people run the testuite using gcc to compile test programs.
>>>
> Oops, sorry, Simon, I forgot the test part in the second upload.
>
> Clang is compatible with the GCC workaround \uXXXX. So I will write the test
> case in that format.
>
> But it's late here, I will do it tomorrow.

Just FYI, there is another bug in this area which I had noticed, that
occurs when trying to tab-complete symbols using GCC's \uXXXX.  It seems
like an issue in another place where gdb is not aware of the encoding.

https://sourceware.org/bugzilla/show_bug.cgi?id=18226
  
Koning, Paul May 21, 2018, 6:03 p.m. UTC | #6
> On May 21, 2018, at 12:12 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: <Paul.Koning@dell.com>
>> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 14:12:12 +0000
>> 
>>> Given unlimited time, would the right solution be to use a lib to parse the
>>> string as utf-8, and reject strings that are not valid utf-8?
>> 
>> This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.
> 
> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
> can not include invalid UTF-8 sequences?

Encoding is an I/O question.  "UTF-8" and "Unicode" are often mixed up, but they are distinct.  Unicode is a character set, in which each character has a numeric identification.  For example, 張 is Unicode character number 24373 (0x5f35).

UTF-8 is one of several ways to encode Unicode characters as a byte stream.  The UTF-8 encoding of 張 is e5 bc b5.
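
As a sketch of that mapping (a hypothetical standalone helper, shown only to
illustrate how the code point 0x5f35 turns into the bytes e5 bc b5):

#include <stdio.h>

/* Encode the Unicode code point CP (assumed to be a valid scalar value
   below 0x110000) as UTF-8 into BUF, which must have room for four
   bytes, and return the number of bytes written.  */

static int
encode_utf8 (unsigned int cp, unsigned char *buf)
{
  if (cp < 0x80)
    {
      buf[0] = cp;
      return 1;
    }
  else if (cp < 0x800)
    {
      buf[0] = 0xc0 | (cp >> 6);
      buf[1] = 0x80 | (cp & 0x3f);
      return 2;
    }
  else if (cp < 0x10000)
    {
      buf[0] = 0xe0 | (cp >> 12);
      buf[1] = 0x80 | ((cp >> 6) & 0x3f);
      buf[2] = 0x80 | (cp & 0x3f);
      return 3;
    }
  buf[0] = 0xf0 | (cp >> 18);
  buf[1] = 0x80 | ((cp >> 12) & 0x3f);
  buf[2] = 0x80 | ((cp >> 6) & 0x3f);
  buf[3] = 0x80 | (cp & 0x3f);
  return 4;
}

int
main (void)
{
  unsigned char buf[4];
  int n = encode_utf8 (0x5f35, buf);    /* U+5F35 is 張.  */

  for (int i = 0; i < n; i++)
    printf ("%02x ", buf[i]);           /* Prints "e5 bc b5 ".  */
  printf ("\n");
  return 0;
}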

I don't know what the C/C++ standards say about non-ASCII identifiers.  I assume they are stated to be Unicode, and presumably specific Unicode character classes.  So there are some sequences of Unicode characters that are valid identifiers, while others are not -- exactly as "abc" is a valid ASCII identifier while "a@bc" is not.

A separate question is the encoding of files.  The encoding rule could be that UTF-8 is required -- or that the encoding is selectable.  There also has to be an encoding in output files (debug data for example).  And when strings are entered at the GDB user interface, they arrive in some encoding.  For all these, UTF-8 is a logical answer.

Not all byte strings are valid UTF-8 strings.  When a byte string is delivered from the outside, it makes sense to validate if it's a valid encoding before it is used.  Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match.  For gdb, that may be good enough.

Yet another issue: for many characters, there are multiple ways to represent them in Unicode.  For example, ü (latin small letter u with dieresis) can be coded as the single Unicode character 0xfc, or as the pair 0x0308 0x75 (combining dieresis, latin small letter u).  These are supposed to be synonymous; when doing string matches, you'd want them to be taken as equivalent.  The stringprep library helps with this by offering a conversion to a standard form, at which point memcmp will give the correct answer.

	paul
  
Eli Zaretskii May 21, 2018, 6:14 p.m. UTC | #7
> From: <Paul.Koning@dell.com>
> CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org>
> Date: Mon, 21 May 2018 18:03:17 +0000
> 
> > Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
> > can not include invalid UTF-8 sequences?
> 
> Encoding is a I/O question.

Not necessarily.

I asked that question because scanning a string for certain ASCII
characters using a 'char *' pointer will only work reliably if the
string is in UTF-8 or in some single-byte encoding.  Otherwise, we
might find false hits for the delimiters, which are actually parts of
multibyte sequences.
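
To make that concrete: every byte of a UTF-8 multibyte sequence has the high
bit set, so a byte-wise scan for ASCII delimiters can never land in the
middle of a character; whereas in encodings such as Shift-JIS or UTF-16,
bytes belonging to a multibyte character can coincide with ASCII values, and
the false hits described above become possible.  A minimal sketch of the
kind of scan in question (hypothetical names, not the patch's actual code):

#include <stdbool.h>
#include <stddef.h>

/* Return true if byte C is one of the ASCII delimiters being scanned
   for (a stand-in for a predicate like is_identifier_separator).  */

static bool
is_ascii_delimiter (unsigned char c)
{
  return (c == '.' || c == '[' || c == ']' || c == '(' || c == ')'
          || c == ' ' || c == '\0');
}

/* Return the offset of the first delimiter in STR, scanning byte by
   byte.  This is reliable for UTF-8 and single-byte encodings, because
   no byte of a UTF-8 multibyte character is below 0x80 and therefore
   none can equal an ASCII delimiter.  It is not reliable for encodings
   whose multibyte characters contain ASCII byte values.  */

static size_t
find_delimiter (const char *str)
{
  size_t i = 0;

  while (!is_ascii_delimiter ((unsigned char) str[i]))
    i++;
  return i;
}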
  
Koning, Paul May 21, 2018, 6:34 p.m. UTC | #8
> On May 21, 2018, at 2:14 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: <Paul.Koning@dell.com>
>> CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 18:03:17 +0000
>> 
>>> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
>>> can not include invalid UTF-8 sequences?
>> 
>> Encoding is a I/O question.
> 
> Not necessarily.
> 
> I asked that question because scanning a string for certain ASCII
> characters using a 'char *' pointer will only work reliably if the
> string is in UTF-8 or in some single-byte encoding.  Otherwise, we
> might find false hits for the delimiters, which are actually parts of
> multibyte sequences.

I see your point.

The I/O encoding ties to the internal encoding.  UTF-8 can be read into char[] and processed using C string primitives.  Other encodings cannot.  For example, if you have UTF-16 or UTF-32, you'd have to read it into a wchar_t string of the correct character width and use the wchar string functions.

So there are two questions:

1. What are the valid characters?  (Unicode question, independent of encoding)
2. What encoding do we expect in I/O (UTF question) from which we conclude what processing functions we need.

	paul
  
Joseph Myers May 21, 2018, 8:26 p.m. UTC | #9
On Mon, 21 May 2018, Paul.Koning@dell.com wrote:

> I don't know what the C/C++ standards say about non-ASCII identifiers.  
> I assume they are stated to be Unicode, and presumably specific Unicode 

They are defined in terms of ISO 10646 (and so concepts from Unicode that 
don't appear in ISO 10646, such as normalization forms, are not relevant 
to them).

See C11 Annex D, which is also aligned with C++11 and later (older 
standard versions had generally more restrictive sets of allowed 
characters, different for C and C++, based on TR 10176).

> Yet another issue: for many characters, there are multiple ways to 
> represent them in Unicode.  For example, ü (latin small letter u with 
> dieresis) can be coded as the single Unicode character 0xfc, or as the 
> pair 0x0308 0x75 (combining dieresis, latin small letter u).  These are 

(The letter goes before the combining mark in that case, not after.  Thus 
such combining marks are not generally permitted at the start of 
identifiers.)

> supposed to be synonymous; when doing string matches, you'd want them to 

They are *not* synonymous in C or C++.  (GCC has -Wnormalized= options to 
warn about identifiers not in an appropriate normalization form, with 
-Wnormalized=nfc as the default.)
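
A small illustration of why the two forms are distinct under byte-wise
matching (the byte values are simply the standard UTF-8 encodings of the NFC
and NFD forms of ü; the program is only a sketch):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* U+00FC LATIN SMALL LETTER U WITH DIAERESIS, precomposed (NFC).  */
  const char nfc[] = "\xc3\xbc";
  /* 'u' (U+0075) followed by U+0308 COMBINING DIAERESIS (NFD); note
     that the base letter comes first, then the combining mark.  */
  const char nfd[] = "u\xcc\x88";

  /* Canonically equivalent in Unicode, yet different byte strings, and
     therefore distinct identifiers as far as C/C++ and a byte-wise
     symbol lookup are concerned.  */
  printf ("equal: %d\n", strcmp (nfc, nfd) == 0);               /* Prints "equal: 0".  */
  printf ("lengths: %zu vs %zu\n", strlen (nfc), strlen (nfd)); /* 2 vs 3.  */
  return 0;
}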


GCC always generates UTF-8 in its .s output for such identifiers, which 
gas then transfers straight through to its output (thus, UTF-8 ELF 
symbols).  The generic ELF ABI is silent on the encoding of such symbols 
(it just says "External C symbols have the same names in C and object 
files' symbol tables.").
  
張俊芝 May 22, 2018, 5:37 a.m. UTC | #10
On 2018/5/22 2:00 AM, Matt Rice wrote:
> On Mon, May 21, 2018 at 10:45 AM, 張俊芝 <zjz@zjz.name> wrote:

> Just FYI, there is another bug in this area, which i had noticed  that
> occurs when trying to
> tab complete symbols using GCC's \uXXXX.  It seems like an issue in
> another place where gdb is not aware of the encoding.
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=18226


Thank you for your helpful information, Matt.
  
張俊芝 May 22, 2018, 7:01 a.m. UTC | #11
On 2018/5/22 2:03 AM, Paul.Koning@dell.com wrote:
> 
> Not all byte strings are valid UTF-8 strings.  When a byte string is delivered from the outside, it makes sense to validate if it's a valid encoding before it is used.  Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match.  For gdb, that may be good enough.

I prefer the latter (i.e. assume all non-ASCII characters are valid and 
rely on "symbol not found"), and that is actually what the patch does. 
A compiler has to be strict about the validity of non-ASCII characters, 
but for GDB the latter solution is good enough: checking only ASCII 
characters makes GDB work well with all ASCII-compatible encodings.
  
張俊芝 May 22, 2018, 7:05 a.m. UTC | #12
On 2018/5/22 4:26 AM, Joseph Myers wrote:

> The generic ELF ABI is silent on the encoding of such symbols
> (it just says "External C symbols have the same names in C and object
> files' symbol tables.").

So this is another reason why it may be better to just check for ASCII 
delimiter characters.  That may be good enough for GDB.
  
Pedro Alves May 22, 2018, 2:15 p.m. UTC | #13
On 05/21/2018 06:45 PM, 張俊芝 wrote:
>>
>>
>> On 2018/5/21 10:03 PM, Simon Marchi wrote:
>>> Could you please write a small test case in testsuite/gdb.base with the example
>>> you gave, so we make sure this doesn't get broken later?  If you can write it
>>> in such a way that both clang and gcc understand it would be better, because
>>> most people run the testuite using gcc to compile test programs.
>>>
> Oops, sorry, Simon, I forgot the test part in the second upload.
> 
> Clang is compatible with the GCC workaround \uXXXX. So I will write the test case in that format.
> 
> But it's late here, I will do it tomorrow.

I actually already started writing a patch for this a few months
back, including a C testcase, after these discussions:

  https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
  https://sourceware.org/ml/gdb/2017-11/msg00022.html

Let me try to find it.  I don't recall exactly where I left off,
but I think I had something working.

Thanks,
Pedro Alves
  
張俊芝 May 22, 2018, 2:32 p.m. UTC | #14
On 2018/5/22 10:15 PM, Pedro Alves wrote:
> 
> I actually already started writing a patch for this a few months
> back, including a C testcase, after these discussions:
> 
>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
>    https://sourceware.org/ml/gdb/2017-11/msg00022.html
> 
> Let me try to find it.  I don't recall exactly where I left off,
> but I think I had something working.
> 
> Thanks,
> Pedro Alves
> 

I had just started writing a test case when I saw your message.

Could you shed some light on how you delimit identifiers in your patch, 
Pedro?  Does it check for all invalid non-ASCII characters, is it tied 
to a specific encoding such as UTF-8, or does it work with any encoding?
  
Pedro Alves May 22, 2018, 2:50 p.m. UTC | #15
On 05/22/2018 03:32 PM, 張俊芝 wrote:
> 
> On 2018/5/22 10:15 PM, Pedro Alves wrote:
>>
>> I actually already started writing a patch for this a few months
>> back, including a C testcase, after these discussions:
>>
>>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
>>    https://sourceware.org/ml/gdb/2017-11/msg00022.html
>>
>> Let me try to find it.  I don't recall exactly where I left off,
>> but I think I had something working.
> 
> I just started writing a test case when I saw your letter.
> 
> Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings?

I found the patch.  Let me rebase it and send it / post it.  It'll
be easier to just look at the patch.

Thanks,
Pedro Alves
  

Patch

diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..b0dd6c7caf 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -73,6 +73,8 @@  void yyerror (const char *);
 
 static int type_aggregate_p (struct type *);
 
+static bool is_identifier_separator (char);
+
 %}
 
 /* Although the yacc "value" of an expression is not used,
@@ -1718,6 +1720,53 @@  type_aggregate_p (struct type *type)
 	      && TYPE_DECLARED_CLASS (type)));
 }
 
+/* While iterating all the characters in an identifier, an identifier separator
+   is a boundary where we know the iteration is done. */
+
+static bool
+is_identifier_separator (char c)
+{
+  switch (c)
+    {
+    case ' ':
+    case '\t':
+    case '\n':
+    case '\0':
+    case '\'':
+    case '"':
+    case '\\':
+    case '(':
+    case ')':
+    case ',':
+    case '.':
+    case '+':
+    case '-':
+    case '*':
+    case '/':
+    case '|':
+    case '&':
+    case '^':
+    case '~':
+    case '!':
+    case '@':
+    case '[':
+    case ']':
+    /* '<' should not be a token separator, because it can be an open angle
+       bracket followed by a nested template identifier in C++. */
+    case '>':
+    case '?':
+    case ':':
+    case '=':
+    case '{':
+    case '}':
+    case ';':
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
 /* Validate a parameter typelist.  */
 
 static void
@@ -1920,7 +1969,7 @@  parse_number (struct parser_state *par_state,
 	 FIXME: This check is wrong; for example it doesn't find overflow
 	 on 0x123456789 when LONGEST is 32 bits.  */
       if (c != 'l' && c != 'u' && n != 0)
-	{	
+	{
 	  if ((unsigned_p && (ULONGEST) prevn >= (ULONGEST) n))
 	    error (_("Numeric constant too large."));
 	}
@@ -2741,16 +2790,13 @@  lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (is_identifier_separator(c))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
-  for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+  for (c = tokstart[namelen]; !is_identifier_separator(c);)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
@@ -2932,7 +2978,7 @@  classify_name (struct parser_state *par_state, const struct block *block,
 	 filename.  However, if the name was quoted, then it is better
 	 to check for a filename or a block, since this is the only
 	 way the user has of requiring the extension to be used.  */
-      if ((is_a_field_of_this.type == NULL && !is_after_structop) 
+      if ((is_a_field_of_this.type == NULL && !is_after_structop)
 	  || is_quoted_name)
 	{
 	  /* See if it's a file name. */