From patchwork Mon May 21 17:36:42 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?b?5by15L+K6Iqd?= <zjz@zjz.name>
X-Patchwork-Id: 27361
Received: (qmail 67690 invoked by alias); 21 May 2018 17:37:18 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Unsubscribe: <mailto:gdb-patches-unsubscribe-##L=##H@sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>,
	<http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Delivered-To: mailing list gdb-patches@sourceware.org
Received: (qmail 67675 invoked by uid 89); 21 May 2018 17:37:18 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No,
	score=-20.9 required=5.0 tests=BAYES_00, BODY_8BITS,
	FROM_EXCESS_BASE64, GARBLED_BODY, GIT_PATCH_0, GIT_PATCH_1,
	GIT_PATCH_2, GIT_PATCH_3, KAM_LAZY_DOMAIN_SECURITY,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.2 spammy==e5=af=ab=e9?=
X-HELO: m176116.mail.qiye.163.com
Received: from m176116.mail.qiye.163.com (HELO m176116.mail.qiye.163.com)
	(59.111.176.116) by sourceware.org
	(qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Mon, 21 May 2018 17:37:15 +0000
Received: from [192.168.0.105] (unknown [59.59.164.76])	by
	m176116.mail.qiye.163.com (HMail) with ESMTPSA id
	69E56B426E3	for <gdb-patches@sourceware.org>;
	Tue, 22 May 2018 01:37:02 +0800 (CST)
Subject: Re: support C/C++ identifiers named with non-ASCII characters
References: <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
To: gdb-patches@sourceware.org
From: =?UTF-8?B?5by15L+K6Iqd?= <zjz@zjz.name>
X-Forwarded-Message-Id: <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
Message-ID: <ffed3fc4-edcf-0e0b-c1fc-352fe73f52aa@zjz.name>
Date: Tue, 22 May 2018 01:36:42 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:55.0) Gecko/20100101 Thunderbird/55.0
MIME-Version: 1.0
In-Reply-To: <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
X-HM-Spam-Status: e1ktWUFJV1koWUFPN1dZCBgUCR5ZQUtVS1dZCQ4XHghZQVkyNS06NzI*QU
	tVS1kG
X-HM-Sender-Digest: e1kSHx4VD1lBWUc6NzI6Cjo4LzovDSosEStRMhdPNhoKCklVSlVKTklN
	QklPSUlJT0NMVTMWGhIXVQERATsBEQFVFRoWHkVZV1kMHhlZQR0aFwgeV1kIAVlBQ0lISzdXWRIL
	WUFZTkJVTkJVSk1PVUxNWQY+
X-HM-Tid: 0a6383c69ff8926akuws69e56b426e3

Simon Marchi 於 2018/5/21 下午10:03 寫道:
> On 2018-05-21 04:52 AM, 張俊芝 wrote:
> 
> Hi Zhang,
> 
> Thanks for the patch, I tested it quickly, it seems to work as expected.
> 
> Could you please write a small test case in testsuite/gdb.base with the example
> you gave, so we make sure this doesn't get broken later?  If you can write it
> in such a way that both clang and gcc understand it would be better, because
> most people run the testuite using gcc to compile test programs.
> 
> I am not a specialist in lexing and parsing C, so can you explain quickly why
> you think this is a good solution?  Quickly, I understand that you change the
> identifier recognition algorithm to a blacklist of characters rather than
> a whitelist, so bytes that are not recognized (such as those that compose
> the utf-8 encoded characters) are not rejected.
> 
> Given unlimited time, would the right solution be to use a lib to parse the
> string as utf-8, and reject strings that are not valid utf-8?
> 
> Here are some not and formatting comments:
> 
>> +static bool is_identifier_separator (char);
> 
> You don't have to forward declare the function if it's not necessary.
> 
>> +    /* '<' should not be a token separator, because it can be an open angle
>> +       bracket followed by a nested template identifier in C++. */
> 
> Please use two spaces after the final period (...C++.  */).
> 
>> +  if (is_identifier_separator(c))
> 
> Please use a space before the parentheses:
> 
>    is_identifier_separator (c)
> 
> 
>> +  for (c = tokstart[namelen]; !is_identifier_separator(c);)
> 
> Here too.
> 
> Thanks!
> 
> Simon
> 

Thank you for the reply, Simon.

This new diff addresses all the code style issues you mentioned.

Yes, you are right in that the blacklist is limited. Actually, 
`is_identifier_separator` only blacklists the invalid ASCII characters 
for an identifier, leaving all the invalid non-ASCII characters unchecked.

So seems it would be better if non-ASCII characters were also checked. 
However, unfortunately, GDB is neither aware of the encoding of the 
terminal input, nor the encoding of the generated symbolic information 
in the executable. So the blacklist is made to restrict to the invalid 
ASCII characters in order to support all ASCII-compliant encodings.

Having said that, I find that it does no harm to the user if we only 
check the ASCII characters. If the user is trying to print an identifier 
which includes an invalid non-ASCII character, say, p 測】, where 】is 
invalid, they will get an error message:

No symbol "測】" in current context.

which doesn't seem any worse than an error message like:

Invalid character "】" in expression.

Perhaps the former might be even more intuitional.

So personally, I think it might be safe enough to use this limited 
blacklist method.

diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..e10b6b474d 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -1718,6 +1718,53 @@ type_aggregate_p (struct type *type)
 	      && TYPE_DECLARED_CLASS (type)));
 }
 
+/* While iterating all the characters in an identifier, an identifier separator
+   is a boundary where we know the iteration is done. */
+
+static bool
+is_identifier_separator (char c)
+{
+  switch (c)
+    {
+    case ' ':
+    case '\t':
+    case '\n':
+    case '\0':
+    case '\'':
+    case '"':
+    case '\\':
+    case '(':
+    case ')':
+    case ',':
+    case '.':
+    case '+':
+    case '-':
+    case '*':
+    case '/':
+    case '|':
+    case '&':
+    case '^':
+    case '~':
+    case '!':
+    case '@':
+    case '[':
+    case ']':
+    /* '<' should not be a token separator, because it can be an open angle
+       bracket followed by a nested template identifier in C++.  */
+    case '>':
+    case '?':
+    case ':':
+    case '=':
+    case '{':
+    case '}':
+    case ';':
+      return true;
+    default:
+      break;
+    }
+  return false;
+}
+
 /* Validate a parameter typelist.  */
 
 static void
@@ -1920,7 +1967,7 @@ parse_number (struct parser_state *par_state,
 	 FIXME: This check is wrong; for example it doesn't find overflow
 	 on 0x123456789 when LONGEST is 32 bits.  */
       if (c != 'l' && c != 'u' && n != 0)
-	{	
+	{
 	  if ((unsigned_p && (ULONGEST) prevn >= (ULONGEST) n))
 	    error (_("Numeric constant too large."));
 	}
@@ -2741,16 +2788,13 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (is_identifier_separator (c))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
-  for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+  for (c = tokstart[namelen]; !is_identifier_separator (c);)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
@@ -2932,7 +2976,7 @@ classify_name (struct parser_state *par_state, const struct block *block,
 	 filename.  However, if the name was quoted, then it is better
 	 to check for a filename or a block, since this is the only
 	 way the user has of requiring the extension to be used.  */
-      if ((is_a_field_of_this.type == NULL && !is_after_structop) 
+      if ((is_a_field_of_this.type == NULL && !is_after_structop)
 	  || is_quoted_name)
 	{
 	  /* See if it's a file name. */