Message ID | 9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name |
---|---|
State | New, archived |
Headers |
Received: (qmail 29903 invoked by alias); 21 May 2018 08:53:12 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: <gdb-patches.sourceware.org> List-Unsubscribe: <mailto:gdb-patches-unsubscribe-##L=##H@sourceware.org> List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org> List-Archive: <http://sourceware.org/ml/gdb-patches/> List-Post: <mailto:gdb-patches@sourceware.org> List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs> Sender: gdb-patches-owner@sourceware.org Delivered-To: mailing list gdb-patches@sourceware.org Received: (qmail 29888 invoked by uid 89); 21 May 2018 08:53:11 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-20.9 required=5.0 tests=BAYES_00, BODY_8BITS, FROM_EXCESS_BASE64, GARBLED_BODY, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_LAZY_DOMAIN_SECURITY, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.2 spammy=bracket X-HELO: m176117.mail.qiye.163.com Received: from m176117.mail.qiye.163.com (HELO m176117.mail.qiye.163.com) (59.111.176.117) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 21 May 2018 08:53:09 +0000 Received: from [192.168.0.105] (unknown [59.59.164.76]) by m176117.mail.qiye.163.com (HMail) with ESMTPSA id 359F6781CC4 for <gdb-patches@sourceware.org>; Mon, 21 May 2018 16:52:56 +0800 (CST) To: gdb-patches@sourceware.org From: =?UTF-8?B?5by15L+K6Iqd?= <zjz@zjz.name> Subject: support C/C++ identifiers named with non-ASCII characters Message-ID: <9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name> Date: Mon, 21 May 2018 16:52:31 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Thunderbird/55.0 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------------2F412D02F9EF84FCBAAB16CE" X-HM-Spam-Status: e1ktWUFJV1koWUFPN1dZCBgUCR5ZQUtVS1dZCQ4XHghZQVkyNS06NzI*QU tVS1kG X-HM-Sender-Digest: e1kSHx4VD1lBWUc6MT46DAw*VjojHSo3E00qKikJDSsaCQFVSlVKTklN Q0JJTExNSUJMVTMWGhIXVQERATsBEQFVFRoWHkVZV1kMHhlZQR0aFwgeV1kIAVlBTUJLSzdXWRIL WUFZTkJVTkJVSk1PVUxNWQY+ X-HM-Tid: 0a6381e6cb2d926bkuws359f6781cc4 |
Commit Message
張俊芝
May 21, 2018, 8:52 a.m. UTC
Hello, team. This patch fixes the bug at https://sourceware.org/bugzilla/show_bug.cgi?id=22973 . Here is how to test the patch: Step 1. If you are using Clang or any other C compilers that have implemented support for Unicode identifiers, then create a C file with the following content: int main(int 參量, char* 參[]) { struct 集 { int 數[3]; } 集 = {100, 200, 300}; int 序 = 2; return 0; } Or if you are using GCC, create a C file with the following content as a workaround(GCC still doesn't actually support Unicode identifiers in 2018, which is a pity): int main(int \u53C3\u91CF, char* \u53C3[]) { struct \u96C6 { int \u6578[3]; } \u96C6 = {100, 200, 300}; int \u5E8F = 2; return 0; } Step 2. Compile the C file. Step 3. Run GDB for the compiled executable, add a breakpoint in "return 0". Step 4. Run until the breakpoint. Step 5. Test the following commands to see if they work: p 參量 p 參 p 集 p 集.數 p 集.數[序] Thanks for your review. 2018-05-20 張俊芝 <zjz@zjz.name> * gdb/c-exp.y (is_identifier_separator): New function. (lex_one_token): Now recognizes C and C++ Unicode identifiers by using is_identifier_separator to determine the boundary of a token.
Comments
On 2018-05-21 04:52 AM, 張俊芝 wrote: > Hello, team. > > This patch fixes the bug at > https://sourceware.org/bugzilla/show_bug.cgi?id=22973 . > > Here is how to test the patch: > > Step 1. If you are using Clang or any other C compilers that have > implemented > support for Unicode identifiers, then create a C file with the > following > content: > > int main(int 參量, char* 參[]) > { > struct 集 > { > int 數[3]; > } 集 = {100, 200, 300}; > int 序 = 2; > return 0; > } > > Or if you are using GCC, create a C file with the following content as a > workaround(GCC still doesn't actually support Unicode identifiers in > 2018, which > is a pity): > > int main(int \u53C3\u91CF, char* \u53C3[]) > { > struct \u96C6 > { > int \u6578[3]; > } \u96C6 = {100, 200, 300}; > int \u5E8F = 2; > return 0; > } > > Step 2. Compile the C file. > > Step 3. Run GDB for the compiled executable, add a breakpoint in "return 0". > > Step 4. Run until the breakpoint. > > Step 5. Test the following commands to see if they work: > p 參量 > p 參 > p 集 > p 集.數 > p 集.數[序] > > Thanks for your review. > Hi Zhang, Thanks for the patch, I tested it quickly, it seems to work as expected. Could you please write a small test case in testsuite/gdb.base with the example you gave, so we make sure this doesn't get broken later? If you can write it in such a way that both clang and gcc understand it would be better, because most people run the testuite using gcc to compile test programs. I am not a specialist in lexing and parsing C, so can you explain quickly why you think this is a good solution? Quickly, I understand that you change the identifier recognition algorithm to a blacklist of characters rather than a whitelist, so bytes that are not recognized (such as those that compose the utf-8 encoded characters) are not rejected. Given unlimited time, would the right solution be to use a lib to parse the string as utf-8, and reject strings that are not valid utf-8? Here are some not and formatting comments: > +static bool is_identifier_separator (char); You don't have to forward declare the function if it's not necessary. > + /* '<' should not be a token separator, because it can be an open angle > + bracket followed by a nested template identifier in C++. */ Please use two spaces after the final period (...C++. */). > + if (is_identifier_separator(c)) Please use a space before the parentheses: is_identifier_separator (c) > + for (c = tokstart[namelen]; !is_identifier_separator(c);) Here too. Thanks! Simon
> On May 21, 2018, at 10:03 AM, Simon Marchi <simark@simark.ca> wrote: > > ... > I am not a specialist in lexing and parsing C, so can you explain quickly why > you think this is a good solution? Quickly, I understand that you change the > identifier recognition algorithm to a blacklist of characters rather than > a whitelist, so bytes that are not recognized (such as those that compose > the utf-8 encoded characters) are not rejected. > > Given unlimited time, would the right solution be to use a lib to parse the > string as utf-8, and reject strings that are not valid utf-8? This sounds like a scenario where "stringprep" is helpful (or necessary). It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not. paul
> From: <Paul.Koning@dell.com> > CC: <zjz@zjz.name>, <gdb-patches@sourceware.org> > Date: Mon, 21 May 2018 14:12:12 +0000 > > > Given unlimited time, would the right solution be to use a lib to parse the > > string as utf-8, and reject strings that are not valid utf-8? > > This sounds like a scenario where "stringprep" is helpful (or necessary). It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not. Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and can not include invalid UTF-8 sequences?
> > > Simon Marchi 於 2018/5/21 下午10:03 寫道: >> Could you please write a small test case in testsuite/gdb.base with >> the example >> you gave, so we make sure this doesn't get broken later? If you can >> write it >> in such a way that both clang and gcc understand it would be better, >> because >> most people run the testuite using gcc to compile test programs. >> Oops, sorry, Simon, I forgot the test part in the second upload. Clang is compatible with the GCC workaround \uXXXX. So I will write the test case in that format. But it's late here, I will do it tomorrow.
On Mon, May 21, 2018 at 10:45 AM, 張俊芝 <zjz@zjz.name> wrote: >> >> >> Simon Marchi 於 2018/5/21 下午10:03 寫道: >>> >>> Could you please write a small test case in testsuite/gdb.base with the >>> example >>> you gave, so we make sure this doesn't get broken later? If you can >>> write it >>> in such a way that both clang and gcc understand it would be better, >>> because >>> most people run the testuite using gcc to compile test programs. >>> > Oops, sorry, Simon, I forgot the test part in the second upload. > > Clang is compatible with the GCC workaround \uXXXX. So I will write the test > case in that format. > > But it's late here, I will do it tomorrow. Just FYI, there is another bug in this area, which i had noticed that occurs when trying to tab complete symbols using GCC's \uXXXX. It seems like an issue in another place where gdb is not aware of the encoding. https://sourceware.org/bugzilla/show_bug.cgi?id=18226
> On May 21, 2018, at 12:12 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: <Paul.Koning@dell.com> >> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org> >> Date: Mon, 21 May 2018 14:12:12 +0000 >> >>> Given unlimited time, would the right solution be to use a lib to parse the >>> string as utf-8, and reject strings that are not valid utf-8? >> >> This sounds like a scenario where "stringprep" is helpful (or necessary). It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not. > > Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and > can not include invalid UTF-8 sequences? Encoding is a I/O question. "UTF-8" and "Unicode" are often mixed up, but they are distinct. Unicode is a character set, in which each character has a numeric identification. For example, 張 is Unicode character number 24373 (0x5f35). UTF-8 is one of several ways to encode Unicode characters as a byte stream. The UTF-8 encoding of 張 is e5 bc b5. I don't know what the C/C++ standards say about non-ASCII identifiers. I assume they are stated to be Unicode, and presumably specific Unicode character classes. So there are some sequences of Unicode characters that are valid identifiers, while others are not -- exactly as "abc" is a valid ASCII identifier while "a@bc" is not. A separate question is the encoding of files. The encoding rule could be that UTF-8 is required -- or that the encoding is selectable. There also has to be an encoding in output files (debug data for example). And when strings are entered at the GDB user interface, they arrive in some encoding. For all these, UTF-8 is a logical answer. Not all byte strings are valid UTF-8 strings. When a byte string is delivered from the outside, it makes sense to validate if it's a valid encoding before it is used. Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match. For gdb, that may be good enough. Yet another issue: for many characters, there are multiple ways to represent them in Unicode. For example, ü (latin small letter u with dieresis) can be coded as the single Unicode character 0xfc, or as the pair 0x0308 0x75 (combining dieresis, latin small letter u). These are supposed to be synonymous; when doing string matches, you'd want them to be taken as equivalent. The stringprep library helps with this by offering a conversion to a standard form, at which point memcmp will give the correct answer. paul
> From: <Paul.Koning@dell.com> > CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org> > Date: Mon, 21 May 2018 18:03:17 +0000 > > > Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and > > can not include invalid UTF-8 sequences? > > Encoding is a I/O question. Not necessarily. I asked that question because scanning a string for certain ASCII characters using a 'char *' pointer will only work reliably if the string is in UTF-8 or in some single-byte encoding. Otherwise, we might find false hits for the delimiters, which are actually parts of multibyte sequences.
> On May 21, 2018, at 2:14 PM, Eli Zaretskii <eliz@gnu.org> wrote: > >> From: <Paul.Koning@dell.com> >> CC: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org> >> Date: Mon, 21 May 2018 18:03:17 +0000 >> >>> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and >>> can not include invalid UTF-8 sequences? >> >> Encoding is a I/O question. > > Not necessarily. > > I asked that question because scanning a string for certain ASCII > characters using a 'char *' pointer will only work reliably if the > string is in UTF-8 or in some single-byte encoding. Otherwise, we > might find false hits for the delimiters, which are actually parts of > multibyte sequences. I see your point. The I/O encoding ties to the internal encoding. UTF-8 can be read into char[] and processed using C string primitives. Other encodings cannot. For example, if you have UTF-16 or UTF-32, you'd have to read it into a wchar_t string of the correct character width and use the wchar string functions. So there are two questions: 1. What are the valid characters? (Unicode question, independent of encoding) 2. What encoding do we expect in I/O (UTF question) from which we conclude what processing functions we need. paul
On Mon, 21 May 2018, Paul.Koning@dell.com wrote: > I don't know what the C/C++ standards say about non-ASCII identifiers. > I assume they are stated to be Unicode, and presumably specific Unicode They are defined in terms of ISO 10646 (and so concepts from Unicode that don't appear in ISO 10646, such as normalization forms, are not relevant to them). See C11 Annex D, which is also aligned with C++11 and later (older standard versions had generally more restrictive sets of allowed characters, different for C and C++, based on TR 10176). > Yet another issue: for many characters, there are multiple ways to > represent them in Unicode. For example, ü (latin small letter u with > dieresis) can be coded as the single Unicode character 0xfc, or as the > pair 0x0308 0x75 (combining dieresis, latin small letter u). These are (The letter goes before the combining mark in that case, not after. Thus such combining marks are not generally permitted at the start of identifiers.) > supposed to be synonymous; when doing string matches, you'd want them to They are *not* synonymous in C or C++. (GCC has -Wnormalized= options to warn about identifiers not in an appropriate normalization form, with -Wnormalized=nfc as the default.) GCC always generates UTF-8 in its .s output for such identifiers, which gas then transfers straight through to its output (thus, UTF-8 ELF symbols). The generic ELF ABI is silent on the encoding of such symbols (it just says "External C symbols have the same names in C and object files' symbol tables.").
Matt Rice 於 2018/5/22 上午2:00 寫道: > On Mon, May 21, 2018 at 10:45 AM, 張俊芝 <zjz@zjz.name> wrote: > Just FYI, there is another bug in this area, which i had noticed that > occurs when trying to > tab complete symbols using GCC's \uXXXX. It seems like an issue in > another place where gdb is not aware of the encoding. > > https://sourceware.org/bugzilla/show_bug.cgi?id=18226 Thank you for your helpful information, Matt.
Paul.Koning@dell.com 於 2018/5/22 上午2:03 寫道: > > Not all byte strings are valid UTF-8 strings. When a byte string is delivered from the outside, it makes sense to validate if it's a valid encoding before it is used. Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match. For gdb, that may be good enough. I preferred the latter(I.e. assume all non-ASCII characters are valid and rely on "symbol not found"), and it's actually what the patch does. Although a compiler has to be strict with validity of non-ASCII characters, but for GDB, the latter solution is just good enough - Checking only ASCII characters makes GDB work well with all ASCII-compliant encodings.
Joseph Myers 於 2018/5/22 上午4:26 寫道: > The generic ELF ABI is silent on the encoding of such symbols > (it just says "External C symbols have the same names in C and object > files' symbol tables."). So this is another reason why it may be better to just check ASCII character delimiters. May be good enough for GDB.
On 05/21/2018 06:45 PM, 張俊芝 wrote: >> >> >> Simon Marchi 於 2018/5/21 下午10:03 寫道: >>> Could you please write a small test case in testsuite/gdb.base with the example >>> you gave, so we make sure this doesn't get broken later? If you can write it >>> in such a way that both clang and gcc understand it would be better, because >>> most people run the testuite using gcc to compile test programs. >>> > Oops, sorry, Simon, I forgot the test part in the second upload. > > Clang is compatible with the GCC workaround \uXXXX. So I will write the test case in that format. > > But it's late here, I will do it tomorrow. I actually already started writing a patch for this a few months back, including a C testcase, after these discussions: https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html https://sourceware.org/ml/gdb/2017-11/msg00022.html Let me try to find it. I don't recall exactly where I left off, but I think I had something working. Thanks, Pedro Alves
Pedro Alves 於 2018/5/22 下午10:15 寫道: > > I actually already started writing a patch for this a few months > back, including a C testcase, after these discussions: > > https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html > https://sourceware.org/ml/gdb/2017-11/msg00022.html > > Let me try to find it. I don't recall exactly where I left off, > but I think I had something working. > > Thanks, > Pedro Alves > I just started writing a test case when I saw your letter. Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings?
On 05/22/2018 03:32 PM, 張俊芝 wrote: > > Pedro Alves 於 2018/5/22 下午10:15 寫道: >> >> I actually already started writing a patch for this a few months >> back, including a C testcase, after these discussions: >> >> https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html >> https://sourceware.org/ml/gdb/2017-11/msg00022.html >> >> Let me try to find it. I don't recall exactly where I left off, >> but I think I had something working. > > I just started writing a test case when I saw your letter. > > Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings? I found the patch. Let me rebase it and send it / post it. It'll be easier to just look at the patch. Thanks, Pedro Alves
diff --git a/gdb/c-exp.y b/gdb/c-exp.y index 5e10d2a3b4..b0dd6c7caf 100644 --- a/gdb/c-exp.y +++ b/gdb/c-exp.y @@ -73,6 +73,8 @@ void yyerror (const char *); static int type_aggregate_p (struct type *); +static bool is_identifier_separator (char); + %} /* Although the yacc "value" of an expression is not used, @@ -1718,6 +1720,53 @@ type_aggregate_p (struct type *type) && TYPE_DECLARED_CLASS (type))); } +/* While iterating all the characters in an identifier, an identifier separator + is a boundary where we know the iteration is done. */ + +static bool +is_identifier_separator (char c) +{ + switch (c) + { + case ' ': + case '\t': + case '\n': + case '\0': + case '\'': + case '"': + case '\\': + case '(': + case ')': + case ',': + case '.': + case '+': + case '-': + case '*': + case '/': + case '|': + case '&': + case '^': + case '~': + case '!': + case '@': + case '[': + case ']': + /* '<' should not be a token separator, because it can be an open angle + bracket followed by a nested template identifier in C++. */ + case '>': + case '?': + case ':': + case '=': + case '{': + case '}': + case ';': + return true; + default: + break; + } + return false; +} + /* Validate a parameter typelist. */ static void @@ -1920,7 +1969,7 @@ parse_number (struct parser_state *par_state, FIXME: This check is wrong; for example it doesn't find overflow on 0x123456789 when LONGEST is 32 bits. */ if (c != 'l' && c != 'u' && n != 0) - { + { if ((unsigned_p && (ULONGEST) prevn >= (ULONGEST) n)) error (_("Numeric constant too large.")); } @@ -2741,16 +2790,13 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name) } } - if (!(c == '_' || c == '$' - || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))) + if (is_identifier_separator(c)) /* We must have come across a bad character (e.g. ';'). */ error (_("Invalid character '%c' in expression."), c); /* It's a name. See how long it is. */ namelen = 0; - for (c = tokstart[namelen]; - (c == '_' || c == '$' || (c >= '0' && c <= '9') - || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');) + for (c = tokstart[namelen]; !is_identifier_separator(c);) { /* Template parameter lists are part of the name. FIXME: This mishandles `print $a<4&&$a>3'. */ @@ -2932,7 +2978,7 @@ classify_name (struct parser_state *par_state, const struct block *block, filename. However, if the name was quoted, then it is better to check for a filename or a block, since this is the only way the user has of requiring the extension to be used. */ - if ((is_a_field_of_this.type == NULL && !is_after_structop) + if ((is_a_field_of_this.type == NULL && !is_after_structop) || is_quoted_name) { /* See if it's a file name. */