From patchwork Sun Mar 20 16:43:04 2022
X-Patchwork-Submitter: Max Gautier
X-Patchwork-Id: 52153
Date: Sun, 20 Mar 2022 17:43:04 +0100
From: Max Gautier
To: libc-alpha@sourceware.org
Cc: Max Gautier
Subject: [PATCH v5 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
 <20211209093152.313872-1-mg@max.gautier.name>
 <20211209093152.313872-5-mg@max.gautier.name>
In-Reply-To: <20211209093152.313872-5-mg@max.gautier.name>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501[1] for
reference):

- The shift character is '&' instead of '+'
- There are no "optional direct characters", and the "direct characters"
  set is different
- There is no implicit shift back to US-ASCII from BASE64; all BASE64
  sequences MUST be terminated with '-'

[1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3

Signed-off-by: Max Gautier
Reviewed-by: Adhemerval Zanella
---
 iconvdata/TESTS                     |  1 +
 iconvdata/gconv-modules             |  4 ++++
 iconvdata/testdata/UTF-7-IMAP       |  1 +
 iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
 iconvdata/utf-7.c                   | 30 +++++++++++++++++++++------
 5 files changed, 62 insertions(+), 6 deletions(-)
 create mode 100644 iconvdata/testdata/UTF-7-IMAP
 create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index a0157c3350..3cc043c21b 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -94,6 +94,7 @@ EUC-TW EUC-TW Y UTF8
 GBK GBK Y UTF8
 BIG5HKSCS BIG5HKSCS Y UTF8
 UTF-7 UTF-7 N UTF8
+UTF-7-IMAP UTF-7-IMAP N UTF8
 IBM856 IBM856 N UTF8
 IBM922 IBM922 Y UTF8
 IBM930 IBM930 N UTF8
diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
index 4acbba062f..d120699394 100644
--- a/iconvdata/gconv-modules
+++ b/iconvdata/gconv-modules
@@ -113,3 +113,7 @@ module INTERNAL UTF-32BE// UTF-32 1
 alias UTF7// UTF-7//
 module UTF-7// INTERNAL UTF-7 1
 module INTERNAL UTF-7// UTF-7 1
+
+# from to module cost
+module UTF-7-IMAP// INTERNAL UTF-7 1
+module INTERNAL UTF-7-IMAP// UTF-7 1
diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
new file mode 100644
index 0000000000..6b5dada63c
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP
@@ -0,0 +1 @@
+&EqASGxItEps- Amharic&AAoBDQ-esky Czech&AAo-Dansk Danish&AAo-English English&AAo-Suomi Finnish&AAo-Fran&AOc-ais French&AAo-Deutsch German&AAoDlQO7A7sDtwO9A7kDugOs- Greek&AAoF4gXRBegF2QXq- Hebrew&AAo-Italiano Italian&AAo-Norsk Norwegian&AAoEIARDBEEEQQQ6BDgEOQ- Russian&AAo-Espa&APE-ol Spanish&AAo-Svenska Swedish&AAoOIA4yDikOMg5EDhcOIg- Thai&AAo-T&APw-rk&AOc-e Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4- Japanese&AApOLWWH- Chinese&AArVXK4A- Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
new file mode 100644
index 0000000000..8b9add3670
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
@@ -0,0 +1,32 @@
+አማርኛ Amharic
+česky Czech
+Dansk Danish
+English English
+Suomi Finnish
+Français French
+Deutsch German
+Ελληνικά Greek
+עברית Hebrew
+Italiano Italian
+Norsk Norwegian
+Русский Russian
+Español Spanish
+Svenska Swedish
+ภาษาไทย Thai
+Türkçe Turkish
+Tiếng Việt Vietnamese
+日本語 Japanese
+中文 Chinese
+한글 Korean
+
+// Checking for correct handling of shift characters ('&', '-') after base64 sequences
+한글&
+한글-
+
+// Checking for correct handling of litteral '&' and '-'
+---&&-
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index b639d8ff3e..5c2e17e50c 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -32,11 +32,13 @@
 enum variant
 {
   UTF7,
+  UTF_7_IMAP
 };
 
 /* Must be in the same order as enum variant above. */
 static const char names[] =
   "UTF-7//\0"
+  "UTF-7-IMAP//\0"
   "\0";
 
 static uint32_t
@@ -44,6 +46,8 @@ shift_character (enum variant const var)
 {
   if (var == UTF7)
     return '+';
+  else if (var == UTF_7_IMAP)
+    return '&';
   else
     abort ();
 }
@@ -58,6 +62,9 @@ between (uint32_t const ch,
 /* The set of "direct characters":
    FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   FOR UTF-7-IMAP
+   A-Z a-z 0-9 ' ( ) , - . / : ? space
+   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
 */
 
 static bool
@@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
             || between (ch, ',', '/')
             || ch == ':' || ch == '?'
             || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  else if (var == UTF_7_IMAP)
+    return (ch != '&' && between (ch, ' ', '~'));
   abort ();
 }
 
@@ -124,6 +133,8 @@ base64 (unsigned int i, enum variant var)
     return '+';
   else if (i == 63 && var == UTF7)
     return '/';
+  else if (i == 63 && var == UTF_7_IMAP)
+    return ',';
   else
     abort ();
 }
@@ -308,7 +319,8 @@ gconv_end (struct __gconv_step *data)
               i = ch - '0' + 52; \
             else if (ch == '+') \
               i = 62; \
-            else if (ch == '/') \
+            else if ((var == UTF7 && ch == '/') \
+                  || (var == UTF_7_IMAP && ch == ',')) \
               i = 63; \
             else \
               { \
@@ -316,8 +328,10 @@ gconv_end (struct __gconv_step *data)
                 \
                 /* If accumulated data is nonzero, the input is invalid. */ \
                 /* Also, partial UTF-16 characters are invalid. */ \
-                if (__builtin_expect (statep->__value.__wch != 0, 0) \
-                    || __builtin_expect ((statep->__count >> 3) <= 26, 0)) \
+                /* In IMAP variant, must be terminated by '-'. */ \
+                if (__glibc_unlikely (statep->__value.__wch != 0) \
+                    || __glibc_unlikely ((statep->__count >> 3) <= 26) \
+                    || __glibc_unlikely (var == UTF_7_IMAP && ch != '-')) \
                   { \
                     STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1)); \
                   } \
@@ -474,13 +488,15 @@ gconv_end (struct __gconv_step *data)
         else \
           { \
             /* base64 encoding active */ \
-            if (isdirect (ch, var)) \
+            if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var)) \
               { \
                 /* deactivate base64 encoding */ \
                 size_t count; \
                 \
                 count = ((statep->__count & 0x18) >= 0x10) \
-                  + needs_explicit_shift (ch) + 1; \
+                  + (var == UTF_7_IMAP || needs_explicit_shift (ch)) \
+                  + (var == UTF_7_IMAP && ch == '&') \
+                  + 1; \
                 if (__glibc_unlikely (outptr + count > outend)) \
                   { \
                     result = __GCONV_FULL_OUTPUT; \
@@ -489,9 +505,11 @@ gconv_end (struct __gconv_step *data)
                 \
                 if ((statep->__count & 0x18) >= 0x10) \
                   *outptr++ = base64 ((statep->__count >> 3) & ~3, var); \
-                if (needs_explicit_shift (ch)) \
+                if (var == UTF_7_IMAP || needs_explicit_shift (ch)) \
                   *outptr++ = '-'; \
                 *outptr++ = (unsigned char) ch; \
+                if (var == UTF_7_IMAP && ch == '&') \
+                  *outptr++ = '-'; \
                 statep->__count = 0; \
               } \
             else \
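
For readers who want to follow the encoder-side rules without the gconv
macro machinery, here is a minimal standalone sketch. It is not the glibc
implementation: the names imap_b64, imap_isdirect and imap_utf7_encode are
hypothetical, the input is assumed to be a complete array of UTF-16 code
units, and none of the streaming state kept in statep->__count is modelled.
It only illustrates the three rules from the commit message plus the ','
for '/' substitution in the base64 alphabet.

```c
#include <stddef.h>
#include <stdint.h>

/* Modified base64 alphabet: index 63 is ',' rather than '/'.  */
static const char imap_b64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Direct characters for UTF-7-IMAP: printable US-ASCII except '&'.  */
static int
imap_isdirect (uint32_t ch)
{
  return ch != '&' && ch >= ' ' && ch <= '~';
}

/* Hypothetical helper (not a glibc function): encode N UTF-16 code units
   from IN into OUT as a NUL-terminated UTF-7-IMAP string.  Returns the
   number of bytes written, excluding the NUL, or (size_t) -1 if OUT is
   too small.  */
size_t
imap_utf7_encode (const uint16_t *in, size_t n, char *out, size_t outsize)
{
  size_t o = 0;
  uint32_t bits = 0;       /* pending bits not yet emitted as base64 */
  unsigned int nbits = 0;  /* number of valid bits in BITS (0..5 between units) */
  int in_b64 = 0;

  if (outsize == 0)
    return (size_t) -1;

/* Append one byte, always leaving room for the final NUL.  */
#define PUT(c) \
  do { if (o + 1 >= outsize) return (size_t) -1; out[o++] = (char) (c); } while (0)

  for (size_t i = 0; i < n; i++)
    {
      uint32_t ch = in[i];
      if (ch == '&' || imap_isdirect (ch))
        {
          if (in_b64)
            {
              /* Flush leftover bits, zero-padded, then always write '-':
                 the IMAP variant has no implicit shift back.  */
              if (nbits > 0)
                PUT (imap_b64[(bits << (6 - nbits)) & 0x3f]);
              PUT ('-');
              in_b64 = 0;
              bits = 0;
              nbits = 0;
            }
          PUT (ch);
          if (ch == '&')
            PUT ('-');  /* a literal '&' is written as "&-" */
        }
      else
        {
          if (!in_b64)
            {
              PUT ('&');  /* shift character is '&' instead of '+' */
              in_b64 = 1;
            }
          bits = (bits << 16) | ch;
          nbits += 16;
          while (nbits >= 6)
            {
              nbits -= 6;
              PUT (imap_b64[(bits >> nbits) & 0x3f]);
            }
          bits &= (1u << nbits) - 1;
        }
    }

  if (in_b64)
    {
      if (nbits > 0)
        PUT (imap_b64[(bits << (6 - nbits)) & 0x3f]);
      PUT ('-');
    }
#undef PUT

  out[o] = '\0';
  return o;
}
```

Feeding it the code units of "Français" should give "Fran&AOc-ais", and
the units U+000A U+D55C U+AE00 '&' should give "&AArVXK4A-&-"; both
strings appear in iconvdata/testdata/UTF-7-IMAP above.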