From patchwork Thu Dec 9 09:31:48 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Max Gautier
X-Patchwork-Id: 48707
To: libc-alpha@sourceware.org
Subject: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP
Date: Thu, 9 Dec 2021 10:31:48 +0100
Message-Id: <20211209093152.313872-1-mg@max.gautier.name>
In-Reply-To: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
References:
 <87blcw9ptq.fsf@oldenburg.str.redhat.com>
List-Id: Libc-alpha mailing list
X-Patchwork-Original-From: Max Gautier via Libc-alpha
From: Max Gautier
Reply-To: Max Gautier

I finally took the time to work on this again.

This new series implements UTF-7-IMAP in the UTF-7 module, using, as
advised, the same approach as in iso646.c.

Unresolved issues (I would appreciate advice on these):

- There is a slight inconsistency (as I read it) in the UTF-7 RFC[1],
  and the current implementation does not follow it exactly:

  In "UTF-7 Definition", Rule 2:
    "The '+' signals that subsequent octets are to be interpreted as
    elements of the Modified Base64 alphabet until a character not in
    that alphabet is encountered. Such characters include control
    characters such as carriage returns and line feeds"

  The UTF-7 module implements this by making the characters '\n', '\r'
  and '\t' part of the "direct characters" set, even though they are
  not part of it according to the definition given by the RFC. So, per
  the RFC, these characters should be encoded, yet they should also be
  interpreted literally and implicitly terminate base64 sequences.

  On this, I'm inclined to leave the current behavior as is. Changing
  it might break things, and I don't see many benefits.
- For UTF-7-IMAP: the IMAPv4 RFC (UTF-7-IMAP definition)[2] specifies
  that:

    - The character "&" (0x26) is represented by the two-octet
      sequence "&-"
    - null shifts ("-&" while in BASE64; note that "&-" while in
      US-ASCII means "&") are not permitted
    - The purpose of these modifications is to correct the following
      problems with UTF-7:
      ...
      5) UTF-7 permits multiple alternate forms to represent the same
         string; in particular, printable US-ASCII characters can be
         represented in encoded form.

  Consider the following cases:

  A- When encoding to UTF-7-IMAP, if we encounter '&' while in base64
     mode, should we:
     1) encode it in base64
     2) terminate the encoding with '-' and use "&-"

  B- When encoding to UTF-7-IMAP, if we encounter "&&" while in
     US-ASCII mode, should we:
     1) start base64 mode and encode the two '&'
     2) encode them as "&-&-"

  It seems to me that, for both A and B, solution 2 allows null shifts
  and solution 1 allows multiple representations. However, A-2 and B-2
  still feel cleaner to me, since they avoid alternate forms for '&'.
  The argument can be made that the resulting sequences are not null
  shifts, merely a special case in US-ASCII.

  I've used that approach in PATCH 4/4, but it should be quite easy to
  change if necessary.

- Also, I'm not sure how to add negative test cases, i.e. invalid
  sequences which need to trigger an iconv error.

Thanks for your time.

[1]: https://datatracker.ietf.org/doc/html/rfc2152
[2]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3
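
To make options A-2 and B-2 concrete, here is a minimal sketch of the
encoding side in Python. It is not the C code from PATCH 4/4, only an
illustration of the behavior I have in mind. The names
encode_utf7_imap and _flush are made up for this sketch, and it
assumes that every printable US-ASCII character (0x20-0x7e) other
than '&' is sent as-is.

```python
import base64


def _flush(run):
    """Encode a run of non-direct characters as one '&...-' sequence.

    Hypothetical helper: UTF-16BE, then base64 with ',' instead of '/'
    and without '=' padding, as in the RFC 3501 modified base64.
    """
    raw = "".join(run).encode("utf-16-be")
    b64 = base64.b64encode(raw, altchars=b"+,").decode("ascii")
    return "&" + b64.rstrip("=") + "-"


def encode_utf7_imap(text):
    """Sketch of a UTF-7-IMAP encoder following options A-2 and B-2."""
    out = []
    run = []  # pending characters that must go through base64
    for ch in text:
        if 0x20 <= ord(ch) <= 0x7E:
            # Printable US-ASCII: leave base64 mode first (this is the
            # A-2 choice when ch is '&').
            if run:
                out.append(_flush(run))
                run = []
            # '&' is always written as "&-", never in base64 (B-2).
            out.append("&-" if ch == "&" else ch)
        else:
            run.append(ch)
    if run:
        out.append(_flush(run))
    return "".join(out)


print(encode_utf7_imap("\u00e9&"))  # case A: prints &AOk-&-
print(encode_utf7_imap("&&"))       # case B: prints &-&-
```

With this choice '&' has exactly one encoded form. Sequences such as
"&AOk-&-" do contain "-&", which is the point where the "null shift"
wording of the RFC could be read against A-2.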