libcpp: Handle extended characters in user-defined literal suffix [PR103902]

  Hello-

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902

The attached patch resolves PR preprocessor/103902 as described in the patch
message inline below. Bootstrap + regtest of all languages was successful on
x86-64 Linux, with no new failures:

FAIL 103 103
PASS 542338 542371
UNSUPPORTED 15247 15250
UNTESTED 136 136
XFAIL 4166 4166
XPASS 17 17

Please let me know if it looks OK.

A few questions I have:

- A difference introduced with this patch is that after lexing something
  like `operator ""_abc', `_abc' is added to the identifier hash map,
  whereas previously it was not. I think this must be OK, because with the
  optional space, as in `operator "" _abc', it would be added with or
  without the patch.

- The behavior of `#pragma GCC poison' is not consistent (including prior to
  my patch). I tried to make it more so, but there is still one thing I want
  to ask about. Leaving aside extended characters for now, the inconsistency
  is that currently the poison is only checked when the suffix appears as a
  standalone token.

  #pragma GCC poison _X
  bool operator ""_X (unsigned long long);   // accepted before the patch,
                                             // rejected after it
  bool operator "" _X (unsigned long long);  // rejected both before and after
  const char * operator ""_X (const char *, unsigned long); // accepted before,
                                                            // rejected after
  const char * operator "" _X (const char *, unsigned long); // rejected both
                                                             // before and after

  const char * s = ""_X; // accepted before the patch, rejected after it
  const bool b = 1_X;    // accepted both before and after ****

I feel like after the patch, the behavior is the expected behavior for all
cases but the last one. Here, we allow the poisoned identifier because it's
not lexed as an identifier, it's lexed as part of a pp-number. Does it seem OK
like this or does it need to be addressed?
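
To make the last case concrete, here is a toy model of why `1_X' escapes the
poison check. It is only a sketch, not libcpp code: toy_lex(), uses_poisoned(),
Kind and Token are hypothetical names invented for this illustration, and the
pp-number rule is simplified (no exponent signs, no digit separators). The
point is that the suffix is swallowed into the number token, so a check that
looks only at identifier tokens never sees it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy token kinds for this sketch; these are not libcpp's token types.
enum class Kind { Number, Identifier, Other };

struct Token {
  Kind kind;
  std::string text;
};

static bool is_id_char(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Split the input into pp-numbers, identifiers, and single-character tokens.
std::vector<Token> toy_lex(const std::string &in) {
  std::vector<Token> out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (std::isspace(c)) { ++i; continue; }
    std::size_t start = i;
    if (std::isdigit(c)) {
      // A pp-number swallows every following identifier character and '.',
      // so a suffix such as _X never becomes a token of its own.
      while (i < in.size() && (is_id_char(in[i]) || in[i] == '.')) ++i;
      out.push_back({Kind::Number, in.substr(start, i - start)});
    } else if (is_id_char(in[i])) {
      while (i < in.size() && is_id_char(in[i])) ++i;
      out.push_back({Kind::Identifier, in.substr(start, i - start)});
    } else {
      ++i;
      out.push_back({Kind::Other, in.substr(start, 1)});
    }
    assert(i > start);  // every iteration consumes at least one character
  }
  return out;
}

// In this model the poison check looks only at identifier tokens.
bool uses_poisoned(const std::vector<Token> &toks,
                   const std::set<std::string> &poisoned) {
  for (const Token &t : toks)
    if (t.kind == Kind::Identifier && poisoned.count(t.text))
      return true;
  return false;
}
```

With `_X' in the poisoned set, the model flags `operator "" _X' (the suffix is
its own identifier token) but not `1_X' (a single number token), which mirrors
the asymmetry described above.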

Thanks for taking a look!

-Lewis
Subject: [PATCH] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

The PR complains that we do not handle UTF-8 in the suffix for a user-defined
literal, such as:

bool operator ""_π (unsigned long long);

In fact we don't handle any extended identifier characters there, whether
UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
the "" tokens is included, since then the identifier is lexed in the "normal"
way as its own token. But when it is lexed as part of the string token, this
is handled in lex_string() with a one-off loop that is not aware of extended
characters.

This patch fixes it by adding a new function scan_cur_identifier() that can be
used to lex an identifier while in the middle of lexing another token. It is
somewhat duplicative of the code in lex_identifier(), which handles the normal
case, but I think there's no good way to avoid that without pessimizing the
usual case, since lex_identifier() takes advantage of the fact that the first
character of the identifier has already been analyzed. The code duplication is
somewhat offset by factoring out the identifier lexing diagnostics (e.g. for
poisoned identifiers), which were formerly duplicated in two places, into
their own function that is now used in three places.
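
As a rough illustration of what scanning an identifier from the middle of
another token involves, here is a minimal standalone sketch. It is not the code
in the patch: toy_scan_identifier(), ToyScanResult and the helper functions are
hypothetical names, it reads from a std::string rather than a cpp_buffer, and
it accepts any well-formed UTF-8 sequence or UCN instead of checking which
characters are actually permitted in identifiers. What it does show is that the
scanner cannot assume the first character has already been classified, and that
it has to recognize `$', UCNs and UTF-8 in addition to the basic characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; the patch's scan_id_result differs.
struct ToyScanResult {
  bool found = false;     // true if at least one identifier character was read
  std::size_t end = 0;    // index one past the end of the identifier
  bool extended = false;  // true if '$', a UCN, or a UTF-8 sequence was seen
};

static bool is_basic_id_char(unsigned char c, bool first) {
  return std::isalpha(c) || c == '_' || (!first && std::isdigit(c));
}

// Length of a UCN (\uXXXX or \UXXXXXXXX) starting at pos, or 0 if none.
static std::size_t ucn_length(const std::string &buf, std::size_t pos) {
  if (pos + 1 >= buf.size() || buf[pos] != '\\') return 0;
  std::size_t digits = buf[pos + 1] == 'u' ? 4 : buf[pos + 1] == 'U' ? 8 : 0;
  if (digits == 0 || pos + 2 + digits > buf.size()) return 0;
  for (std::size_t i = 0; i < digits; ++i)
    if (!std::isxdigit(static_cast<unsigned char>(buf[pos + 2 + i]))) return 0;
  return 2 + digits;
}

// Length of a UTF-8 multibyte sequence starting at pos, or 0 if none.
// Simplification: any well-formed lead/continuation pattern is accepted.
static std::size_t utf8_length(const std::string &buf, std::size_t pos) {
  unsigned char c = static_cast<unsigned char>(buf[pos]);
  std::size_t len = (c & 0xE0) == 0xC0 ? 2
                  : (c & 0xF0) == 0xE0 ? 3
                  : (c & 0xF8) == 0xF0 ? 4 : 0;
  if (len == 0 || pos + len > buf.size()) return 0;
  for (std::size_t i = 1; i < len; ++i)
    if ((static_cast<unsigned char>(buf[pos + i]) & 0xC0) != 0x80) return 0;
  return len;
}

// Scan an identifier beginning at pos, without assuming that its first
// character has already been examined by the caller.
ToyScanResult toy_scan_identifier(const std::string &buf, std::size_t pos) {
  assert(pos <= buf.size());
  ToyScanResult r;
  std::size_t i = pos;
  while (i < buf.size()) {
    unsigned char c = static_cast<unsigned char>(buf[i]);
    std::size_t n = 0;
    if (is_basic_id_char(c, i == pos)) {
      ++i;
    } else if (c == '$') {
      r.extended = true;
      ++i;
    } else if ((n = ucn_length(buf, i)) != 0 ||
               (n = utf8_length(buf, i)) != 0) {
      r.extended = true;
      i += n;
    } else {
      break;
    }
  }
  r.found = i > pos;
  r.end = i;
  return r;
}
```

A string lexer in this model would call toy_scan_identifier() right after the
closing quote; if `found' is set, the bytes up to `end' form the suffix, and
`extended' tells the caller that the extra diagnostics for extended characters
are worth running.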

BTW, the other place that was lexing identifiers is lex_identifier_intern(),
which is used to implement #pragma push_macro and #pragma pop_macro. This does
not support extended characters either. I will add that in a subsequent patch,
because it can't directly reuse the new function, but rather needs to lex from
a string instead of a cpp_buffer.

With scan_cur_identifier(), we do also correctly warn about bidi and
normalization issues in the extended identifiers comprising the suffix, and we
check for poisoned identifiers there as well.

PR preprocessor/103902

libcpp/ChangeLog:

	* lex.cc (identifier_diagnostics_on_lex): New function refactors
	common code from...
	(lex_identifier_intern): ...here, and...
	(lex_identifier): ...here.
	(struct scan_id_result): New struct to hold the result of...
	(scan_cur_identifier): ...new function.
	(create_literal2): New function.
	(is_macro): Removed function that is now handled directly in
	lex_string() and lex_raw_string().
	(is_macro_not_literal_suffix): Likewise.
	(lit_accum::create_literal2): New function.
	(lex_raw_string): Make use of new function scan_cur_identifier().
	(lex_string): Likewise.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.

Message ID	20220614212649.GA58025@ldh-imac.local
State	Committed
Date	Tue, 14 Jun 2022 17:26:49 -0400
From	Lewis Hyatt via Gcc-patches <gcc-patches@gcc.gnu.org>
Reply-To	Lewis Hyatt <lhyatt@gmail.com>
To	gcc-patches@gcc.gnu.org