From patchwork Thu Dec  9 09:31:48 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Max Gautier <mg@max.gautier.name>
X-Patchwork-Id: 48707
Return-Path: <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 022513858435
	for <patchwork@sourceware.org>; Thu,  9 Dec 2021 09:34:12 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 022513858435
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1639042452;
	bh=SDj6m8DTl/K1vnqKP/u4i5F8Xgdb8T7zYaJi7B6YO14=;
	h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=qzV5VnfxleCvsijmGLhVCJPpN+QNtORh7UKYP1U87lIn5LvDG57L7ybLczReh/swd
	 8wAuowYyunhJeH5slp+d6rizjv+drRT+u7+qWKXoklIZ8IO3G4SOFFSRjYPfe82udX
	 o9QQ3qLdR7W0O8VKr3HoxYNpK50qWPVhJGgv+ouo=
X-Original-To: libc-alpha@sourceware.org
Delivered-To: libc-alpha@sourceware.org
Received: from mout-p-102.mailbox.org (mout-p-102.mailbox.org
 [IPv6:2001:67c:2050::465:102])
 by sourceware.org (Postfix) with ESMTPS id 969213858431
 for <libc-alpha@sourceware.org>; Thu,  9 Dec 2021 09:32:26 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 969213858431
Received: from smtp202.mailbox.org (smtp202.mailbox.org [80.241.60.245])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange ECDHE (P-384) server-signature RSA-PSS (4096 bits) server-digest
 SHA256) (No client certificate requested)
 by mout-p-102.mailbox.org (Postfix) with ESMTPS id 4J8pj41lMlzQkF0
 for <libc-alpha@sourceware.org>; Thu,  9 Dec 2021 10:32:24 +0100 (CET)
X-Virus-Scanned: amavisd-new at heinlein-support.de
To: libc-alpha@sourceware.org
Subject: [PATCH v4 0/4] iconv: Add support for UTF-7-IMAP 
Date: Thu,  9 Dec 2021 10:31:48 +0100
Message-Id: <20211209093152.313872-1-mg@max.gautier.name>
In-Reply-To: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
References: <87blcw9ptq.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-5.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-Patchwork-Original-From: Max Gautier via Libc-alpha
 <libc-alpha@sourceware.org>
From: Max Gautier <mg@max.gautier.name>
Reply-To: Max Gautier <mg@max.gautier.name>
Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org
Sender: "Libc-alpha"
 <libc-alpha-bounces+patchwork=sourceware.org@sourceware.org>

I finally took the time to work on this again.

This new series implements UTF-7-IMAP in the UTF-7 module, using, as
advised, the same approach than in iso646.c.

Unresolved issues (would appreciate advice on those):
- There is a slight incoherence (to me) in the UTF-7 RFC[1], and the
  current implementation do not follow it exactly :
  In the "UTF-7 Definition/Rule 2":

  "The '+' signals that subsequent octets are to be interpreted as
  elements of the Modified Base64 alphabet until a character not in that
  alphabet is encountered. Such characters include control characters
  such as carriage returns and line feeds"

  The UTF-7 module implements this by making characters '\n', '\r', '\t'
  part of the "direct characters" set, even though they are not
  according to the definition given by the RFC.

  So these characters should be encoded, but should also be interpreted
  literally and implicitly terminates base64 sequences.
  
  On this, I'm inclined to leave the current behavior as is. Changing it
  might mean breaking things; and I don't see many benefits.

- For UTF-7-IMAP:
  The IMAPv4 RFC (UTF-7-IMAP definition)[2] specifies that :
  
  - The character "&" (0x26) is represented by the two-octet sequence "&-"
  - null shifts ("-&" while in BASE64; note that "&-" while in US-ASCII
    means "&") are not permitted
  - The purpose of these modifications is to correct the following
    problems with UTF-7:
      ...

      5) UTF-7 permits multiple alternate forms to represent the same
         string; in particular, printable US-ASCII characters can be
         represented in encoded form.

   Consider the following cases:

   A- When encoding to UTF-7-IMAP, if we encounter '&' while in base64
   mode, should we:
       1) encode it in base64
       2) terminate the encoding with '-' and use "&-"
   B- When encoding to UTF-7-IMAP, if we encounter "&&" while in
   us-ascii mode, should we:
       1) start base64 mode and encode the two '&' 
       2) encode them as "&-&-"
   It seems to me than for A and B, the solution 2 allows null shifts,
   and solution 1 allows multiples representation.

   However, A-2 and B-2 still feels cleaner to me, since they avoid
   alternate forms for '&'. The arguments can be made that the resulting
   sequences are not null shifts, merely a special case in US-ASCII.
   I've use that approach in PATCH 4/4, but that should be quite easy to
   change if necessary.

- Also, I'm not sure how to add negative test cases, aka, invalid
  sequences which needs to trigger an iconv errors.


Thanks for your time.

[1]: https://datatracker.ietf.org/doc/html/rfc2152
[2]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3