c++: Add testcase for C++23 P2316R2 - consistent character literal encoding [PR102615]
Message ID | 20211007130049.GT304296@tucnak |
---|---|
State | Committed |
Headers | Date: Thu, 7 Oct 2021 15:00:49 +0200 From: Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Jakub Jelinek <jakub@redhat.com> To: Jason Merrill <jason@redhat.com>, "Joseph S. Myers" <joseph@codesourcery.com> Cc: gcc-patches@gcc.gnu.org Subject: [PATCH] c++: Add testcase for C++23 P2316R2 - consistent character literal encoding [PR102615] Message-ID: <20211007130049.GT304296@tucnak> |
Series | c++: Add testcase for C++23 P2316R2 - consistent character literal encoding [PR102615] |
Commit Message
Jakub Jelinek
Oct. 7, 2021, 1 p.m. UTC
Hi!

I believe we need no changes to the compiler for P2316R2, seems we treat
character literals the same between preprocessor and C++ expressions,
here is a testcase that should verify it.

Tested on x86_64-linux, ok for trunk?

Note, seems the internal charset for GCC can be either UTF-8 or UTF-EBCDIC,
but I bet it is very hard (at least for me) to actually test the latter.
I'd guess one needs all system headers to be in EBCDIC and the gcc sources too.
But looking around the source, I'm a little bit worried about the UTF-EBCDIC
case.
One is:

    #if '\n' == 0x0A && ' ' == 0x20 && '0' == 0x30 \
       && 'A' == 0x41 && 'a' == 0x61 && '!' == 0x21
    # define HOST_CHARSET HOST_CHARSET_ASCII
    #else
    # if '\n' == 0x15 && ' ' == 0x40 && '0' == 0xF0 \
       && 'A' == 0xC1 && 'a' == 0x81 && '!' == 0x5A
    #  define HOST_CHARSET HOST_CHARSET_EBCDIC
    # else
    #  define HOST_CHARSET HOST_CHARSET_UNKNOWN
    # endif
    #endif

in include/safe-ctype.h, does that mean we only support EBCDIC if -funsigned-char
and otherwise fail to build gcc?  Because with -fsigned-char, '0' is -0x10
rather than 0xF0, 'A' is -0x3F rather than 0xC1 and 'a' is -0x7F rather than
0x81.
And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c

      static const cppchar_t utf8_signifier = 0xC0;
    ...
      if (*buffer->cur >= utf8_signifier)
        {
          if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
                               state, &s))
            return true;
        }

work?  Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
multi-byte character, it is more complicated and seems _cpp_valid_utf8
assumes UTF-8 as the host charset.

2021-10-07  Jakub Jelinek  <jakub@redhat.com>

	PR c++/102615
	* g++.dg/cpp23/charlit-encoding1.C: New testcase for C++23 P2316R2.

	Jakub
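The signedness question about the EBCDIC branch of include/safe-ctype.h can be illustrated with a small sketch. It assumes an 8-bit, two's-complement `char`; `char_value` is a hypothetical helper written for this illustration and is not part of GCC. It only reproduces the arithmetic quoted in the message above: a code unit with the high bit set reads as a negative number when plain `char` is signed.

```cpp
// Sketch only: the value a character literal would have for a given 8-bit
// code unit, depending on whether plain char is signed.
// Assumption: 8-bit, two's-complement char. char_value is a hypothetical
// helper, not GCC code.
constexpr int char_value(unsigned byte, bool char_is_signed)
{
  return (char_is_signed && byte >= 0x80) ? static_cast<int>(byte) - 0x100
                                          : static_cast<int>(byte);
}

// EBCDIC code units named in the safe-ctype.h check, with unsigned char.
static_assert(char_value(0xF0, false) == 0xF0, "'0' with -funsigned-char");
static_assert(char_value(0xC1, false) == 0xC1, "'A' with -funsigned-char");
static_assert(char_value(0x81, false) == 0x81, "'a' with -funsigned-char");

// The same code units with signed char.
static_assert(char_value(0xF0, true) == -0x10, "'0' with -fsigned-char");
static_assert(char_value(0xC1, true) == -0x3F, "'A' with -fsigned-char");
static_assert(char_value(0x81, true) == -0x7F, "'a' with -fsigned-char");

// Code units below 0x80 are unaffected by signedness.
static_assert(char_value(0x40, true) == 0x40, "' ' in EBCDIC");
static_assert(char_value(0x5A, true) == 0x5A, "'!' in EBCDIC");
```

Under these assumptions, a comparison such as `'0' == 0xF0` would hold only in the unsigned case, which is why the testcase in this patch also accepts `-0x7F` for `'a'`.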
Comments
On 10/7/21 09:00, Jakub Jelinek wrote:
> Hi!
>
> I believe we need no changes to the compiler for P2316R2, seems we treat
> character literals the same between preprocessor and C++ expressions,
> here is a testcase that should verify it.
>
> Tested on x86_64-linux, ok for trunk?
>
> Note, seems the internal charset for GCC can be either UTF-8 or UTF-EBCDIC,
> but I bet it is very hard (at least for me) to actually test the latter.
> I'd guess one needs all system headers to be in EBCDIC and the gcc sources too.
> But looking around the source, I'm a little bit worried about the UTF-EBCDIC
> case.
> One is:
> #if '\n' == 0x0A && ' ' == 0x20 && '0' == 0x30 \
>    && 'A' == 0x41 && 'a' == 0x61 && '!' == 0x21
> # define HOST_CHARSET HOST_CHARSET_ASCII
> #else
> # if '\n' == 0x15 && ' ' == 0x40 && '0' == 0xF0 \
>    && 'A' == 0xC1 && 'a' == 0x81 && '!' == 0x5A
> #  define HOST_CHARSET HOST_CHARSET_EBCDIC
> # else
> #  define HOST_CHARSET HOST_CHARSET_UNKNOWN
> # endif
> #endif
> in include/safe-ctype.h, does that mean we only support EBCDIC if -funsigned-char
> and otherwise fail to build gcc?  Because with -fsigned-char, '0' is -0x10
> rather than 0xF0, 'A' is -0x3F rather than 0xC1 and 'a' is -0x7F rather than
> 0x81.
> And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c
>   static const cppchar_t utf8_signifier = 0xC0;
> ...
>   if (*buffer->cur >= utf8_signifier)
>     {
>       if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
>                            state, &s))
>         return true;
>     }
> work?  Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
> multi-byte character, it is more complicated and seems _cpp_valid_utf8
> assumes UTF-8 as the host charset.

Are there any supported platforms that use UTF-EBCDIC?

> 2021-10-07  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR c++/102615
> 	* g++.dg/cpp23/charlit-encoding1.C: New testcase for C++23 P2316R2.
>
> --- gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C.jj	2021-10-07 14:34:35.182132411 +0200
> +++ gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C	2021-10-07 14:34:02.902583774 +0200
> @@ -0,0 +1,33 @@
> +// PR c++/102615 - P2316R2 - Consistent character literal encoding
> +// { dg-do compile }

Doesn't this need to run?  OK with that change.

> +
> +extern "C" void abort ();
> +
> +int
> +main ()
> +{
> +#if ' ' == 0x20
> +  if (' ' != 0x20)
> +    abort ();
> +#elif ' ' == 0x40
> +  if (' ' != 0x40)
> +    abort ();
> +#else
> +  if (' ' == 0x20 || ' ' == 0x40)
> +    abort ();
> +#endif
> +#if 'a' == 0x61
> +  if ('a' != 0x61)
> +    abort ();
> +#elif 'a' == 0x81
> +  if ('a' != 0x81)
> +    abort ();
> +#elif 'a' == -0x7F
> +  if ('a' != -0x7F)
> +    abort ();
> +#else
> +  if ('a' == 0x61 || 'a' == 0x81 || 'a' == -0x7F)
> +    abort ();
> +#endif
> +  return 0;
> +}
>
> 	Jakub
>
On Thu, Oct 07, 2021 at 09:12:15AM -0400, Jason Merrill wrote:
> > And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c
> >   static const cppchar_t utf8_signifier = 0xC0;
> > ...
> >   if (*buffer->cur >= utf8_signifier)
> >     {
> >       if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
> >                            state, &s))
> >         return true;
> >     }
> > work?  Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
> > multi-byte character, it is more complicated and seems _cpp_valid_utf8
> > assumes UTF-8 as the host charset.
>
> Are there any supported platforms that use UTF-EBCDIC?

I have no idea.  From the libcpp/charset.c code, seems there is no built-in
conversion for UTF-EBCDIC, the only internally supported conversions are

    { "UTF-8/UTF-32LE", convert_utf8_utf32, (iconv_t)0 },
    { "UTF-8/UTF-32BE", convert_utf8_utf32, (iconv_t)1 },
    { "UTF-8/UTF-16LE", convert_utf8_utf16, (iconv_t)0 },
    { "UTF-8/UTF-16BE", convert_utf8_utf16, (iconv_t)1 },
    { "UTF-32LE/UTF-8", convert_utf32_utf8, (iconv_t)0 },
    { "UTF-32BE/UTF-8", convert_utf32_utf8, (iconv_t)1 },
    { "UTF-16LE/UTF-8", convert_utf16_utf8, (iconv_t)0 },
    { "UTF-16BE/UTF-8", convert_utf16_utf8, (iconv_t)1 },

and identity, so unless the C library iconv supports conversion to
UTF-EBCDIC, the only case that could be supported is when -finput-charset=
is also UTF-EBCDIC.  E.g. glibc iconv doesn't support that.
Never used z/VM nor OS/390 which I think are the only possible hosts
that could have UTF-EBCDIC.
CCing Andreas if he knows more...

> > --- gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C.jj	2021-10-07 14:34:35.182132411 +0200
> > +++ gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C	2021-10-07 14:34:02.902583774 +0200
> > @@ -0,0 +1,33 @@
> > +// PR c++/102615 - P2316R2 - Consistent character literal encoding
> > +// { dg-do compile }
>
> Doesn't this need to run?  OK with that change.

Thanks for catching that, fixed, retested and committed.

	Jakub
On Thu, Oct 7, 2021 at 9:01 AM Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> And another thing, if HOST_CHARSET == HOST_CHARSET_EBCDIC, how does the libcpp/lex.c
>   static const cppchar_t utf8_signifier = 0xC0;
> ...
>   if (*buffer->cur >= utf8_signifier)
>     {
>       if (_cpp_valid_utf8 (pfile, &buffer->cur, buffer->rlimit, 1 + !first,
>                            state, &s))
>         return true;
>     }
> work?  Because in UTF-EBCDIC, >= 0xC0 isn't the right test for start of
> multi-byte character, it is more complicated and seems _cpp_valid_utf8
> assumes UTF-8 as the host charset.

FWIW, here I was following Joseph's guidance from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c21
("You can ignore anything claiming to handle UTF-EBCDIC.")

-Lewis
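To make the `>= 0xC0` question concrete, here is a rough sketch of how a UTF-8 lead byte is classified. `utf8_sequence_length` is a hypothetical helper written for this note; it is not libcpp's `_cpp_valid_utf8`, and it does not validate continuation bytes or reject overlong or out-of-range sequences. The last assertion shows the kind of clash raised in the thread: the EBCDIC code unit 0xC1 ('A' in the safe-ctype.h check) passes a UTF-8 style `>= 0xC0` test even though it is a single character there.

```cpp
// Sketch only: classify one byte by the UTF-8 lead-byte bit patterns.
// utf8_sequence_length is a hypothetical helper, not the libcpp code.
// Returns the sequence length announced by the byte, or 0 if the byte
// cannot start a sequence. No validation of the following bytes is done.
constexpr int utf8_sequence_length(unsigned char b)
{
  return b < 0x80 ? 1    // 0xxxxxxx: single-byte (ASCII range)
       : b < 0xC0 ? 0    // 10xxxxxx: continuation byte, not a start
       : b < 0xE0 ? 2    // 110xxxxx
       : b < 0xF0 ? 3    // 1110xxxx
       : b < 0xF8 ? 4    // 11110xxx
       : 0;              // 0xF8..0xFF never appear in UTF-8
}

static_assert(utf8_sequence_length(0x61) == 1, "'a' in ASCII/UTF-8");
static_assert(utf8_sequence_length(0xA9) == 0, "continuation byte");
static_assert(utf8_sequence_length(0xC3) == 2, "two-byte lead");
static_assert(utf8_sequence_length(0xE2) == 3, "three-byte lead");
static_assert(utf8_sequence_length(0xF0) == 4, "four-byte lead");

// 0xC1 is 'A' in the EBCDIC table quoted from safe-ctype.h, yet a UTF-8
// classifier treats it as the start of a multi-byte sequence.
static_assert(utf8_sequence_length(0xC1) == 2, "EBCDIC 'A' looks like a lead byte");
```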
--- gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C.jj	2021-10-07 14:34:35.182132411 +0200
+++ gcc/testsuite/g++.dg/cpp23/charlit-encoding1.C	2021-10-07 14:34:02.902583774 +0200
@@ -0,0 +1,33 @@
+// PR c++/102615 - P2316R2 - Consistent character literal encoding
+// { dg-do compile }
+
+extern "C" void abort ();
+
+int
+main ()
+{
+#if ' ' == 0x20
+  if (' ' != 0x20)
+    abort ();
+#elif ' ' == 0x40
+  if (' ' != 0x40)
+    abort ();
+#else
+  if (' ' == 0x20 || ' ' == 0x40)
+    abort ();
+#endif
+#if 'a' == 0x61
+  if ('a' != 0x61)
+    abort ();
+#elif 'a' == 0x81
+  if ('a' != 0x81)
+    abort ();
+#elif 'a' == -0x7F
+  if ('a' != -0x7F)
+    abort ();
+#else
+  if ('a' == 0x61 || 'a' == 0x81 || 'a' == -0x7F)
+    abort ();
+#endif
+  return 0;
+}
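For comparison, the same property (a character literal has the same value in `#if` as in an ordinary expression) could in principle be checked at compile time. The following is only a sketch and is not the committed testcase, which uses runtime `abort ()` calls; `PP_SPACE` and `space_is_consistent` are hypothetical names introduced here.

```cpp
// Sketch only: compile-time variant of the consistency check for ' '.
// PP_SPACE records which value the preprocessor saw for ' ' in #if
// arithmetic; -1 means "neither 0x20 nor 0x40". Both names are hypothetical.
#if ' ' == 0x20
# define PP_SPACE 0x20
#elif ' ' == 0x40
# define PP_SPACE 0x40
#else
# define PP_SPACE (-1)
#endif

// True when the value of ' ' in a C++ expression agrees with what the
// preprocessor decided above.
constexpr bool space_is_consistent()
{
  return PP_SPACE == -1 ? (' ' != 0x20 && ' ' != 0x40)
                        : (' ' == PP_SPACE);
}

static_assert(space_is_consistent(),
              "character literal differs between #if and expressions");
```

With this form the check would fire during compilation rather than when the test is executed.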