From patchwork Thu Dec 9 09:31:49 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Max Gautier
X-Patchwork-Id: 48705
To: libc-alpha@sourceware.org
Subject: [PATCH v4 1/4] iconv: Always encode "optional direct" UTF-7 characters
Date: Thu, 9 Dec 2021 10:31:49 +0100
Message-Id: <20211209093152.313872-2-mg@max.gautier.name>
In-Reply-To:
 <20211209093152.313872-1-mg@max.gautier.name>
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
 <20211209093152.313872-1-mg@max.gautier.name>
From: Max Gautier

Signed-off-by: Max Gautier
Reviewed-by: Adhemerval Zanella
---
 iconvdata/utf-7.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 0ed46c948d..9ba0974959 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -29,14 +29,6 @@
 #include

-/* Define this to 1 if you want the so-called "optional direct" characters
-      ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
-   to be encoded. Define to 0 if you want them to be passed straight
-   through, like the so-called "direct" characters.
-   We set this to 1 because it's safer.
- */
-#define UTF7_ENCODE_OPTIONAL_CHARS 1
-

 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
@@ -323,7 +315,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0) \
       { \
         /* base64 encoding inactive */ \
-        if (UTF7_ENCODE_OPTIONAL_CHARS ? isdirect (ch) : isxdirect (ch)) \
+        if (isdirect (ch)) \
           { \
             *outptr++ = (unsigned char) ch; \
           } \
@@ -375,7 +367,7 @@ base64 (unsigned int i)
     else \
       { \
         /* base64 encoding active */ \
-        if (UTF7_ENCODE_OPTIONAL_CHARS ?
isdirect (ch) : isxdirect (ch)) \
+        if (isdirect (ch)) \
           { \
             /* deactivate base64 encoding */ \
             size_t count; \

From patchwork Thu Dec 9 09:31:50 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Max Gautier
X-Patchwork-Id: 48706
To: libc-alpha@sourceware.org
Subject: [PATCH v4 2/4] iconv: Better mapping to RFC for UTF-7
Date: Thu, 9 Dec 2021 10:31:50 +0100
Message-Id:
 <20211209093152.313872-3-mg@max.gautier.name>
In-Reply-To: <20211209093152.313872-1-mg@max.gautier.name>
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
 <20211209093152.313872-1-mg@max.gautier.name>
From: Max Gautier

- Use characters directly instead of arcane arrays.
- isxbase64 is not the Modified BASE64 alphabet, but the set of
  characters that need to trigger an explicit shift back to US-ASCII.
  Make that clearer.

Signed-off-by: Max Gautier
Reviewed-by: Adhemerval Zanella
---
 iconvdata/utf-7.c | 56 +++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 24 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 9ba0974959..ac7d78141a 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -30,20 +30,27 @@

+static int
+between(uint32_t const ch,
+        uint32_t const lower_bound, uint32_t const upper_bound)
+{
+  return (ch >= lower_bound && ch <= upper_bound);
+}
+
 /* The set of "direct characters":
    A-Z a-z 0-9 ' ( ) , - . / : ?
space tab lf cr
 */

-static const unsigned char direct_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0x81, 0xf3, 0xff, 0x87,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
 isdirect (uint32_t ch)
 {
-  return (ch < 128 && ((direct_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (between(ch, 'A', 'Z')
+          || between(ch, 'a', 'z')
+          || between(ch, '0', '9')
+          || ch == '\'' || ch == '(' || ch == ')'
+          || between(ch, ',', '/')
+          || ch == ':' || ch == '?'
+          || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
 }

@@ -52,33 +59,33 @@ isdirect (uint32_t ch)
    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
 */

-static const unsigned char xdirect_tab[128 / 8] =
-  {
-    0x00, 0x26, 0x00, 0x00, 0xff, 0xf7, 0xff, 0xff,
-    0xff, 0xff, 0xff, 0xef, 0xff, 0xff, 0xff, 0x3f
-  };
 static int
 isxdirect (uint32_t ch)
 {
-  return (ch < 128 && ((xdirect_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (ch == '\t'
+          || ch == '\n'
+          || ch == '\r'
+          || (between(ch, ' ','}')
+              && ch != '+' && ch != '\\')
+          );
 }

-/* The set of "extended base64 characters":
+/* Characters which need to trigger an explicit shift back to US-ASCII (UTF-7
+   only): Modified base64 + '-' (shift back character)
    A-Z a-z 0-9 + / -
 */

-static const unsigned char xbase64_tab[128 / 8] =
-  {
-    0x00, 0x00, 0x00, 0x00, 0x00, 0xa8, 0xff, 0x03,
-    0xfe, 0xff, 0xff, 0x07, 0xfe, 0xff, 0xff, 0x07
-  };
-
 static int
-isxbase64 (uint32_t ch)
+needs_explicit_shift (uint32_t ch)
 {
-  return (ch < 128 && ((xbase64_tab[ch >> 3] >> (ch & 7)) & 1));
+  return (between(ch, 'A', 'Z')
+          || between(ch, 'a', 'z')
+          || between(ch, '/', '9')
+          || ch == '+'
+          || ch == '-'
+          );
 }

@@ -372,7 +379,8 @@ base64 (unsigned int i)
             /* deactivate base64 encoding */ \
             size_t count; \
             \
-            count = ((statep->__count & 0x18) >= 0x10) + isxbase64 (ch) + 1; \
+            count = ((statep->__count & 0x18) >= 0x10) \
+              + needs_explicit_shift (ch) + 1; \
             if (__glibc_unlikely (outptr + count > outend)) \
               { \
                 result = __GCONV_FULL_OUTPUT; \
@@ -381,7 +389,7 @@ base64 (unsigned int i)
             \
             if
((statep->__count & 0x18) >= 0x10) \
               *outptr++ = base64 ((statep->__count >> 3) & ~3); \
-            if (isxbase64 (ch)) \
+            if (needs_explicit_shift (ch)) \
               *outptr++ = '-'; \
             *outptr++ = (unsigned char) ch; \
             statep->__count = 0; \

From patchwork Thu Dec 9 09:31:51 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Max Gautier
X-Patchwork-Id: 48708
To: libc-alpha@sourceware.org
Subject: [PATCH v4 3/4]
 iconv: make utf-7.c able to use variants
Date: Thu, 9 Dec 2021 10:31:51 +0100
Message-Id: <20211209093152.313872-4-mg@max.gautier.name>
In-Reply-To: <20211209093152.313872-1-mg@max.gautier.name>
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
 <20211209093152.313872-1-mg@max.gautier.name>
From: Max Gautier

Add infrastructure in utf-7.c to handle variants. The approach comes
from iso646.c. The variant is determined at gconv_init time and is passed
to the conversion loops as an additional argument.

Signed-off-by: Max Gautier
---
 iconvdata/utf-7.c | 239 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 174 insertions(+), 65 deletions(-)

diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index ac7d78141a..965d4220f1 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -29,6 +29,24 @@
 #include

+enum variant
+{
+  UTF7,
+};
+
+/* Must be in the same order as enum variant above.  */
+static const char names[] =
+  "UTF-7//\0"
+  "\0";
+
+static uint32_t
+shift_character(enum variant const var)
+{
+  if (var == UTF7)
+    return '+';
+  else
+    abort();
+}

 static int
 between(uint32_t const ch,
@@ -38,37 +56,43 @@ between(uint32_t const ch,
 }

 /* The set of "direct characters":
+   FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ?
space tab lf cr
 */

 static int
-isdirect (uint32_t ch)
+isdirect (uint32_t ch, enum variant var)
 {
-  return (between(ch, 'A', 'Z')
-          || between(ch, 'a', 'z')
-          || between(ch, '0', '9')
-          || ch == '\'' || ch == '(' || ch == ')'
-          || between(ch, ',', '/')
-          || ch == ':' || ch == '?'
-          || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  if (var == UTF7)
+    return (between(ch, 'A', 'Z')
+            || between(ch, 'a', 'z')
+            || between(ch, '0', '9')
+            || ch == '\'' || ch == '(' || ch == ')'
+            || between(ch, ',', '/')
+            || ch == ':' || ch == '?'
+            || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  abort();
 }


 /* The set of "direct and optional direct characters":
+   (UTF-7 only)
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
 */
-
 static int
-isxdirect (uint32_t ch)
+isxdirect (uint32_t ch, enum variant var)
 {
-  return (ch == '\t'
-          || ch == '\n'
-          || ch == '\r'
-          || (between(ch, ' ','}')
-              && ch != '+' && ch != '\\')
-          );
+  return(isdirect(ch, var)
+         || (var == UTF7 &&
+             (between(ch, '!', '&')
+              || ch == '*'
+              || between(ch, ';', '@')
+              || (between(ch, '[', '`') && ch != '\\')
+              || between(ch, '{', '}'))
+         )
+  );
 }


@@ -91,7 +115,7 @@ needs_explicit_shift (uint32_t ch)
 /* Converts a value in the range 0..63 to a base64 encoded char.  */
 static unsigned char
-base64 (unsigned int i)
+base64 (unsigned int i, enum variant var)
 {
   if (i < 26)
     return i + 'A';
@@ -101,7 +125,7 @@ base64 (unsigned int i)
     return i - 52 + '0';
   else if (i == 62)
     return '+';
-  else if (i == 63)
+  else if (i == 63 && var == UTF7)
     return '/';
   else
     abort ();
@@ -109,9 +133,8 @@ base64 (unsigned int i)


 /* Definitions used in the body of the `gconv' function.
*/
-#define CHARSET_NAME "UTF-7//"
-#define DEFINE_INIT 1
-#define DEFINE_FINI 1
+#define DEFINE_INIT 0
+#define DEFINE_FINI 0
 #define FROM_LOOP from_utf7_loop
 #define TO_LOOP to_utf7_loop
 #define MIN_NEEDED_FROM 1
@@ -119,11 +142,27 @@ base64 (unsigned int i)
 #define MIN_NEEDED_TO 4
 #define MAX_NEEDED_TO 4
 #define ONE_DIRECTION 0
+#define FROM_DIRECTION (dir == from_utf7)
 #define PREPARE_LOOP \
   mbstate_t saved_state; \
-  mbstate_t *statep = data->__statep;
-#define EXTRA_LOOP_ARGS , statep
+  mbstate_t *statep = data->__statep; \
+  enum direction dir = ((struct utf7_data *) step->__data)->dir; \
+  enum variant var = ((struct utf7_data *) step->__data)->var;
+#define EXTRA_LOOP_ARGS , statep, var
+
+
+enum direction
+{
+  illegal_dir,
+  from_utf7,
+  to_utf7
+};
+struct utf7_data
+{
+  enum direction dir;
+  enum variant var;
+};

 /* Since we might have to reset input pointer we must be able to save
    and restore the state.  */
@@ -133,6 +172,72 @@ base64 (unsigned int i)
   else \
     *statep = saved_state

+extern int gconv_init (struct __gconv_step *step);
+int
+gconv_init (struct __gconv_step *step)
+{
+  /* Determine which direction.
*/
+  struct utf7_data *new_data;
+  enum direction dir = illegal_dir;
+
+  enum variant var = 0;
+  for (const char *name = names; *name != '\0';
+       name = __rawmemchr (name, '\0') + 1)
+    {
+      if (__strcasecmp (step->__from_name, name) == 0)
+        {
+          dir = from_utf7;
+          break;
+        }
+      else if (__strcasecmp (step->__to_name, name) == 0)
+        {
+          dir = to_utf7;
+          break;
+        }
+      ++var;
+    }
+
+  if (__builtin_expect (dir, from_utf7) != illegal_dir)
+    {
+      new_data = malloc (sizeof (*new_data));
+      if (new_data == NULL)
+        return __GCONV_NOMEM;
+
+      new_data->dir = dir;
+      new_data->var = var;
+      step->__data = new_data;
+
+      if (dir == from_utf7)
+        {
+          step->__min_needed_from = MIN_NEEDED_FROM;
+          step->__max_needed_from = MAX_NEEDED_FROM;
+          step->__min_needed_to = MIN_NEEDED_TO;
+          step->__max_needed_to = MAX_NEEDED_TO;
+        }
+      else
+        {
+          step->__min_needed_from = MIN_NEEDED_TO;
+          step->__max_needed_from = MAX_NEEDED_TO;
+          step->__min_needed_to = MIN_NEEDED_FROM;
+          step->__max_needed_to = MAX_NEEDED_FROM;
+        }
+    }
+  else
+    return __GCONV_NOCONV;
+
+  step->__stateful = 1;
+
+  return __GCONV_OK;
+}
+
+extern void gconv_end (struct __gconv_step *data);
+void
+gconv_end (struct __gconv_step *data)
+{
+  free (data->__data);
+}
+
+
 /* First define the conversion function from UTF-7 to UCS4.

    The state is structured as follows:
@@ -160,13 +265,13 @@ base64 (unsigned int i)
     if ((statep->__count >> 3) == 0) \
       { \
         /* base64 encoding inactive.
*/ \
-        if (isxdirect (ch)) \
+        if (isxdirect (ch, var)) \
           { \
             inptr++; \
             put32 (outptr, ch); \
             outptr += 4; \
           } \
-        else if (__glibc_likely (ch == '+')) \
+        else if (__glibc_likely (ch == shift_character(var))) \
           { \
             if (__glibc_unlikely (inptr + 2 > inend)) \
               { \
@@ -291,7 +396,7 @@ base64 (unsigned int i)
       } \
   }
 #define LOOP_NEED_FLAGS
-#define EXTRA_LOOP_DECLS , mbstate_t *statep
+#define EXTRA_LOOP_DECLS , mbstate_t *statep, enum variant var
 #include

@@ -322,7 +427,7 @@ base64 (unsigned int i)
     if ((statep->__count & 0x18) == 0) \
       { \
         /* base64 encoding inactive */ \
-        if (isdirect (ch)) \
+        if (isdirect (ch, var)) \
           { \
             *outptr++ = (unsigned char) ch; \
           } \
@@ -330,7 +435,7 @@ base64 (unsigned int i)
           { \
             size_t count; \
             \
-            if (ch == '+') \
+            if (ch == shift_character(var)) \
               count = 2; \
             else if (ch < 0x10000) \
               count = 3; \
@@ -345,13 +450,13 @@ base64 (unsigned int i)
                 break; \
               } \
             \
-            *outptr++ = '+'; \
-            if (ch == '+') \
+            *outptr++ = shift_character(var); \
+            if (ch == shift_character(var)) \
               *outptr++ = '-'; \
             else if (ch < 0x10000) \
               { \
-                *outptr++ = base64 (ch >> 10); \
-                *outptr++ = base64 ((ch >> 4) & 0x3f); \
+                *outptr++ = base64 (ch >> 10, var); \
+                *outptr++ = base64 ((ch >> 4) & 0x3f, var); \
                 statep->__count = ((ch & 15) << 5) | (3 << 3); \
               } \
             else if (ch < 0x110000) \
@@ -360,11 +465,11 @@ base64 (unsigned int i)
                 uint32_t ch2 = 0xdc00 + ((ch - 0x10000) & 0x3ff); \
                 \
                 ch = (ch1 << 16) | ch2; \
-                *outptr++ = base64 (ch >> 26); \
-                *outptr++ = base64 ((ch >> 20) & 0x3f); \
-                *outptr++ = base64 ((ch >> 14) & 0x3f); \
-                *outptr++ = base64 ((ch >> 8) & 0x3f); \
-                *outptr++ = base64 ((ch >> 2) & 0x3f); \
+                *outptr++ = base64 (ch >> 26, var); \
+                *outptr++ = base64 ((ch >> 20) & 0x3f, var); \
+                *outptr++ = base64 ((ch >> 14) & 0x3f, var); \
+                *outptr++ = base64 ((ch >> 8) & 0x3f, var); \
+                *outptr++ = base64 ((ch >> 2) & 0x3f, var); \
                 statep->__count = ((ch & 3) << 7) | (2 << 3); \
               } \
             else \
@@ -374,7 +479,7 @@ base64 (unsigned int i)
     else \
       { \
         /*
base64 encoding active */ \
-        if (isdirect (ch)) \
+        if (isdirect (ch, var)) \
           { \
             /* deactivate base64 encoding */ \
             size_t count; \
@@ -388,7 +493,7 @@ base64 (unsigned int i)
               } \
             \
             if ((statep->__count & 0x18) >= 0x10) \
-              *outptr++ = base64 ((statep->__count >> 3) & ~3); \
+              *outptr++ = base64 ((statep->__count >> 3) & ~3, var); \
             if (needs_explicit_shift (ch)) \
               *outptr++ = '-'; \
             *outptr++ = (unsigned char) ch; \
@@ -416,22 +521,24 @@ base64 (unsigned int i)
                 switch ((statep->__count >> 3) & 3) \
                   { \
                   case 1: \
-                    *outptr++ = base64 (ch >> 10); \
-                    *outptr++ = base64 ((ch >> 4) & 0x3f); \
+                    *outptr++ = base64 (ch >> 10, var); \
+                    *outptr++ = base64 ((ch >> 4) & 0x3f, var); \
                     statep->__count = ((ch & 15) << 5) | (3 << 3); \
                     break; \
                   case 2: \
                     *outptr++ = \
-                      base64 (((statep->__count >> 3) & ~3) | (ch >> 12)); \
-                    *outptr++ = base64 ((ch >> 6) & 0x3f); \
-                    *outptr++ = base64 (ch & 0x3f); \
+                      base64 (((statep->__count >> 3) & ~3) | (ch >> 12), \
+                              var); \
+                    *outptr++ = base64 ((ch >> 6) & 0x3f, var); \
+                    *outptr++ = base64 (ch & 0x3f, var); \
                     statep->__count = (1 << 3); \
                     break; \
                   case 3: \
                     *outptr++ = \
-                      base64 (((statep->__count >> 3) & ~3) | (ch >> 14)); \
-                    *outptr++ = base64 ((ch >> 8) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 2) & 0x3f); \
+                      base64 (((statep->__count >> 3) & ~3) | (ch >> 14), \
+                              var); \
+                    *outptr++ = base64 ((ch >> 8) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 2) & 0x3f, var); \
                     statep->__count = ((ch & 3) << 7) | (2 << 3); \
                     break; \
                   default: \
@@ -447,30 +554,32 @@ base64 (unsigned int i)
                 switch ((statep->__count >> 3) & 3) \
                   { \
                   case 1: \
-                    *outptr++ = base64 (ch >> 26); \
-                    *outptr++ = base64 ((ch >> 20) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 14) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 8) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 2) & 0x3f); \
+                    *outptr++ = base64 (ch >> 26, var); \
+                    *outptr++ = base64 ((ch >> 20) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 14) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 8) & 0x3f, var); \
+
*outptr++ = base64 ((ch >> 2) & 0x3f, var); \
                     statep->__count = ((ch & 3) << 7) | (2 << 3); \
                     break; \
                   case 2: \
                     *outptr++ = \
-                      base64 (((statep->__count >> 3) & ~3) | (ch >> 28)); \
-                    *outptr++ = base64 ((ch >> 22) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 16) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 10) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 4) & 0x3f); \
+                      base64 (((statep->__count >> 3) & ~3) | (ch >> 28), \
+                              var); \
+                    *outptr++ = base64 ((ch >> 22) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 16) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 10) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 4) & 0x3f, var); \
                     statep->__count = ((ch & 15) << 5) | (3 << 3); \
                     break; \
                   case 3: \
                     *outptr++ = \
-                      base64 (((statep->__count >> 3) & ~3) | (ch >> 30)); \
-                    *outptr++ = base64 ((ch >> 24) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 18) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 12) & 0x3f); \
-                    *outptr++ = base64 ((ch >> 6) & 0x3f); \
-                    *outptr++ = base64 (ch & 0x3f); \
+                      base64 (((statep->__count >> 3) & ~3) | (ch >> 30), \
+                              var); \
+                    *outptr++ = base64 ((ch >> 24) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 18) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 12) & 0x3f, var); \
+                    *outptr++ = base64 ((ch >> 6) & 0x3f, var); \
+                    *outptr++ = base64 (ch & 0x3f, var); \
                     statep->__count = (1 << 3); \
                     break; \
                   default: \
@@ -486,7 +595,7 @@ base64 (unsigned int i)
     inptr += 4; \
   }
 #define LOOP_NEED_FLAGS
-#define EXTRA_LOOP_DECLS , mbstate_t *statep
+#define EXTRA_LOOP_DECLS , mbstate_t *statep, enum variant var
 #include

@@ -516,7 +625,7 @@ base64 (unsigned int i)
           { \
             /* Write out the shift sequence.
*/ \
             if ((state & 0x18) >= 0x10) \
-              *outbuf++ = base64 ((state >> 3) & ~3); \
+              *outbuf++ = base64 ((state >> 3) & ~3, var); \
             *outbuf++ = '-'; \
             \
             data->__statep->__count = 0; \

From patchwork Thu Dec 9 09:31:52 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Max Gautier
X-Patchwork-Id: 48709
To: libc-alpha@sourceware.org
Subject: [PATCH v4 4/4] iconv: Add UTF-7-IMAP variant in utf-7.c
Date: Thu, 9 Dec 2021 10:31:52 +0100
Message-Id: <20211209093152.313872-5-mg@max.gautier.name>
In-Reply-To: <20211209093152.313872-1-mg@max.gautier.name>
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
 <20211209093152.313872-1-mg@max.gautier.name>
From: Max Gautier

UTF-7-IMAP differs from UTF-7 in the following ways (see RFC 3501 [1]
for reference):

- The shift character is '&' instead of '+'
- There are no "optional direct characters", and the "direct characters"
  set is different
- ',' replaces '/' in the Modified Base64 alphabet
- There is no implicit shift back to US-ASCII from BASE64; all BASE64
  sequences MUST be terminated with '-'

[1]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3

Signed-off-by: Max Gautier
Reviewed-by: Adhemerval Zanella
---
 iconvdata/TESTS                     |  1 +
 iconvdata/gconv-modules             |  4 ++++
 iconvdata/testdata/UTF-7-IMAP       |  1 +
 iconvdata/testdata/UTF-7-IMAP..UTF8 | 32 +++++++++++++++++++++++++++++
 iconvdata/utf-7.c                   | 28 ++++++++++++++++++++-----
 5 files changed, 61 insertions(+), 5 deletions(-)
 create mode 100644 iconvdata/testdata/UTF-7-IMAP
 create mode 100644 iconvdata/testdata/UTF-7-IMAP..UTF8

diff --git a/iconvdata/TESTS b/iconvdata/TESTS
index a0157c3350..3cc043c21b 100644
--- a/iconvdata/TESTS
+++ b/iconvdata/TESTS
@@ -94,6 +94,7 @@
EUC-TW EUC-TW Y UTF8
 GBK GBK Y UTF8
 BIG5HKSCS BIG5HKSCS Y UTF8
 UTF-7 UTF-7 N UTF8
+UTF-7-IMAP UTF-7-IMAP N UTF8
 IBM856 IBM856 N UTF8
 IBM922 IBM922 Y UTF8
 IBM930 IBM930 N UTF8
diff --git a/iconvdata/gconv-modules b/iconvdata/gconv-modules
index 4acbba062f..d120699394 100644
--- a/iconvdata/gconv-modules
+++ b/iconvdata/gconv-modules
@@ -113,3 +113,7 @@ module INTERNAL UTF-32BE// UTF-32 1
 alias UTF7// UTF-7//
 module UTF-7// INTERNAL UTF-7 1
 module INTERNAL UTF-7// UTF-7 1
+
+# from to module cost
+module UTF-7-IMAP// INTERNAL UTF-7 1
+module INTERNAL UTF-7-IMAP// UTF-7 1
diff --git a/iconvdata/testdata/UTF-7-IMAP b/iconvdata/testdata/UTF-7-IMAP
new file mode 100644
index 0000000000..6b5dada63c
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP
@@ -0,0 +1 @@
+&EqASGxItEps- Amharic&AAoBDQ-esky Czech&AAo-Dansk Danish&AAo-English English&AAo-Suomi Finnish&AAo-Fran&AOc-ais French&AAo-Deutsch German&AAoDlQO7A7sDtwO9A7kDugOs- Greek&AAoF4gXRBegF2QXq- Hebrew&AAo-Italiano Italian&AAo-Norsk Norwegian&AAoEIARDBEEEQQQ6BDgEOQ- Russian&AAo-Espa&APE-ol Spanish&AAo-Svenska Swedish&AAoOIA4yDikOMg5EDhcOIg- Thai&AAo-T&APw-rk&AOc-e Turkish&AAo-Ti&Hr8-ng Vi&Hsc-t Vietnamese&AApl5Wcsip4- Japanese&AApOLWWH- Chinese&AArVXK4A- Korean&AAoACg-// Checking for correct handling of shift characters ('&-', '-') after base64 sequences&AArVXK4A-&-&AArVXK4A--&AAoACg-// Checking for correct handling of litteral '&-' and '-'&AAo----&-&--&AAoACg-// The last line of this file is missing the end-of-line terminator&AAo-// on purpose, in order to test that the conversion empties the bit buffer&AAo-// and shifts back to the initial state at the end of the conversion.&AAo-A&ImIDkQ-
\ No newline at end of file
diff --git a/iconvdata/testdata/UTF-7-IMAP..UTF8 b/iconvdata/testdata/UTF-7-IMAP..UTF8
new file mode 100644
index 0000000000..8b9add3670
--- /dev/null
+++ b/iconvdata/testdata/UTF-7-IMAP..UTF8
@@ -0,0 +1,32 @@
+አማርኛ Amharic
+česky Czech
+Dansk Danish
+English English
+Suomi Finnish
+Français French
+Deutsch
German
+Ελληνικά Greek
+עברית Hebrew
+Italiano Italian
+Norsk Norwegian
+Русский Russian
+Español Spanish
+Svenska Swedish
+ภาษาไทย Thai
+Türkçe Turkish
+Tiếng Việt Vietnamese
+日本語 Japanese
+中文 Chinese
+한글 Korean
+
+// Checking for correct handling of shift characters ('&', '-') after base64 sequences
+한글&
+한글-
+
+// Checking for correct handling of litteral '&' and '-'
+---&&-
+
+// The last line of this file is missing the end-of-line terminator
+// on purpose, in order to test that the conversion empties the bit buffer
+// and shifts back to the initial state at the end of the conversion.
+A≢Α
\ No newline at end of file
diff --git a/iconvdata/utf-7.c b/iconvdata/utf-7.c
index 965d4220f1..553636e324 100644
--- a/iconvdata/utf-7.c
+++ b/iconvdata/utf-7.c
@@ -32,11 +32,13 @@
 enum variant
 {
   UTF7,
+  UTF_7_IMAP
 };

 /* Must be in the same order as enum variant above.  */
 static const char names[] =
   "UTF-7//\0"
+  "UTF-7-IMAP//\0"
   "\0";

 static uint32_t
@@ -44,6 +46,8 @@ shift_character(enum variant const var)
 {
   if (var == UTF7)
     return '+';
+  else if (var == UTF_7_IMAP)
+    return '&';
   else
     abort();
 }
@@ -58,6 +62,9 @@ between(uint32_t const ch,
 /* The set of "direct characters":
    FOR UTF-7
    A-Z a-z 0-9 ' ( ) , - . / : ? space tab lf cr
+   FOR UTF-7-IMAP
+   A-Z a-z 0-9 ' ( ) , - . / : ? space
+   ! " # $ % + * ; < = > @ [ \ ] ^ _ ` { | } ~
 */

 static int
@@ -71,6 +78,8 @@ isdirect (uint32_t ch, enum variant var)
             || between(ch, ',', '/')
             || ch == ':' || ch == '?'
            || ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
+  else if (var == UTF_7_IMAP)
+    return (ch != '&' && between(ch, ' ', '~'));
   abort();
 }

@@ -127,6 +136,8 @@ base64 (unsigned int i, enum variant var)
     return '+';
   else if (i == 63 && var == UTF7)
     return '/';
+  else if (i == 63 && var == UTF_7_IMAP)
+    return ',';
   else
     abort ();
 }
@@ -313,7 +324,8 @@ gconv_end (struct __gconv_step *data)
               i = ch - '0' + 52; \
             else if (ch == '+') \
               i = 62; \
-            else if (ch == '/') \
+            else if ((var == UTF7 && ch == '/') \
+                     || (var == UTF_7_IMAP && ch == ',')) \
               i = 63; \
             else \
               { \
@@ -321,8 +333,10 @@ gconv_end (struct __gconv_step *data)
                 \
                 /* If accumulated data is nonzero, the input is invalid.  */ \
                 /* Also, partial UTF-16 characters are invalid.  */ \
+                /* In IMAP variant, must be terminated by '-'.  */ \
                 if (__builtin_expect (statep->__value.__wch != 0, 0) \
-                    || __builtin_expect ((statep->__count >> 3) <= 26, 0)) \
+                    || __builtin_expect ((statep->__count >> 3) <= 26, 0) \
+                    || __builtin_expect (var == UTF_7_IMAP && ch != '-', 0)) \
                   { \
                     STANDARD_FROM_LOOP_ERR_HANDLER ((statep->__count = 0, 1)); \
                   } \
@@ -479,13 +493,15 @@ gconv_end (struct __gconv_step *data)
     else \
       { \
         /* base64 encoding active */ \
-        if (isdirect (ch, var)) \
+        if ((var == UTF_7_IMAP && ch == '&') || isdirect (ch, var)) \
           { \
             /* deactivate base64 encoding */ \
             size_t count; \
             \
             count = ((statep->__count & 0x18) >= 0x10) \
-              + needs_explicit_shift (ch) + 1; \
+              + (var == UTF_7_IMAP || needs_explicit_shift (ch)) \
+              + (var == UTF_7_IMAP && ch == '&') \
+              + 1; \
             if (__glibc_unlikely (outptr + count > outend)) \
               { \
                 result = __GCONV_FULL_OUTPUT; \
@@ -494,9 +510,11 @@ gconv_end (struct __gconv_step *data)
             \
             if ((statep->__count & 0x18) >= 0x10) \
               *outptr++ = base64 ((statep->__count >> 3) & ~3, var); \
-            if (needs_explicit_shift (ch)) \
+            if (var == UTF_7_IMAP || needs_explicit_shift (ch)) \
               *outptr++ = '-'; \
             *outptr++ = (unsigned char) ch; \
+            if (var == UTF_7_IMAP && ch == '&') \
+              *outptr++ = '-'; \
             statep->__count = 0; \
           } \
         else \
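
The variant differences listed in the 4/4 commit message (shift character, direct set, base64 digit 63, mandatory '-' terminator) can be illustrated outside of the gconv skeleton. The following is only a minimal standalone sketch, not code from the series: every sketch_* name is hypothetical, it handles a single BMP code point per call (no surrogate pairs, no carried-over bit state), and it always writes the closing '-', which IMAP requires and plain UTF-7 merely permits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for this sketch; not the identifiers used in utf-7.c.  */
enum sketch_variant { SKETCH_UTF7, SKETCH_UTF7_IMAP };

/* Character that opens a base64 run: '+' for UTF-7, '&' for UTF-7-IMAP.  */
static char
sketch_shift_char (enum sketch_variant var)
{
  return var == SKETCH_UTF7 ? '+' : '&';
}

/* Map a 6-bit value to its base64 digit; only digit 63 differs.  */
static char
sketch_base64 (unsigned int i, enum sketch_variant var)
{
  /* 63 characters: indices 0..62 are shared by both variants.  */
  static const char digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+";
  assert (i < 64);
  if (i < 63)
    return digits[i];
  return var == SKETCH_UTF7 ? '/' : ',';
}

/* Nonzero if CH may be written as-is outside a base64 run.  */
static int
sketch_isdirect (uint32_t ch, enum sketch_variant var)
{
  if (var == SKETCH_UTF7_IMAP)
    return ch >= ' ' && ch <= '~' && ch != '&';
  if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
      || (ch >= '0' && ch <= '9'))
    return 1;
  /* Exclude 0 so that strchr does not match the string terminator.  */
  return ch != 0 && ch < 128
         && strchr ("'(),-./:? \t\n\r", (int) ch) != NULL;
}

/* Encode one BMP code point as a standalone, explicitly terminated
   base64 run.  The 16 bits are split 6 + 6 + 4, with the last group
   padded by two zero bits.  OUT must have room for 6 bytes.  */
static void
sketch_encode_bmp (uint32_t ch, enum sketch_variant var, char out[6])
{
  assert (ch < 0x10000);
  out[0] = sketch_shift_char (var);
  out[1] = sketch_base64 (ch >> 10, var);
  out[2] = sketch_base64 ((ch >> 4) & 0x3f, var);
  out[3] = sketch_base64 ((ch & 0xf) << 2, var);
  out[4] = '-';
  out[5] = '\0';
}
```

Calling sketch_encode_bmp (0x00E7, SKETCH_UTF7_IMAP, buf) yields "&AOc-", the same sequence that appears in "Fran&AOc-ais" in the new test data; U+FC00 exercises digit 63 and gives "&,AA-" for IMAP versus "+/AA-" for UTF-7.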