[v4,0/4] iconv: Add support for UTF-7-IMAP

Message ID 20211209093152.313872-1-mg@max.gautier.name

Max Gautier Dec. 9, 2021, 9:31 a.m. UTC
I finally took the time to work on this again.

This new series implements UTF-7-IMAP in the UTF-7 module, using, as
advised, the same approach as in iso646.c.

Unresolved issues (would appreciate advice on those):
- There is a slight inconsistency (to me) in the UTF-7 RFC[1], and the
  current implementation does not follow it exactly:
  In the "UTF-7 Definition/Rule 2":

  "The '+' signals that subsequent octets are to be interpreted as
  elements of the Modified Base64 alphabet until a character not in that
  alphabet is encountered. Such characters include control characters
  such as carriage returns and line feeds"

  The UTF-7 module implements this by making characters '\n', '\r', '\t'
  part of the "direct characters" set, even though they are not
  according to the definition given by the RFC.

  So these characters should be encoded, but should also be interpreted
  literally, and they implicitly terminate base64 sequences.
  On this, I'm inclined to leave the current behavior as is. Changing it
  might break things, and I don't see many benefits.
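
  For reference, here is a minimal sketch of the "Set D" (directly
  encoded characters) membership test as listed in RFC 2152. The names
  is_set_d and check_set_d are made up for illustration and are not
  the ones used in the UTF-7 module. Space, tab, CR and LF are not in
  Set D; the RFC covers them in a separate rule.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: nonzero if C is in Set D of RFC 2152, i.e.
   letters, digits and the nine special characters ' ( ) , - . / : ?  */
static int
is_set_d (unsigned char c)
{
  if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
      || (c >= '0' && c <= '9'))
    return 1;
  /* strchr also matches the terminating NUL, so exclude it first.  */
  return c != '\0' && strchr ("'(),-./:?", c) != NULL;
}

/* Small self-check: control characters and '+' are not in Set D.  */
static void
check_set_d (void)
{
  assert (is_set_d ('a'));
  assert (is_set_d ('?'));
  assert (!is_set_d ('\n'));
  assert (!is_set_d ('\r'));
  assert (!is_set_d ('\t'));
  assert (!is_set_d ('+'));
}
```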

- For UTF-7-IMAP:
  The IMAPv4 RFC (UTF-7-IMAP definition)[2] specifies that:
  - The character "&" (0x26) is represented by the two-octet sequence "&-"
  - null shifts ("-&" while in BASE64; note that "&-" while in US-ASCII
    means "&") are not permitted
  - The purpose of these modifications is to correct the following
    problems with UTF-7:

      5) UTF-7 permits multiple alternate forms to represent the same
         string; in particular, printable US-ASCII characters can be
         represented in encoded form.

   Consider the following cases:

   A- When encoding to UTF-7-IMAP, if we encounter '&' while in base64
   mode, should we:
       1) encode it in base64
       2) terminate the encoding with '-' and use "&-"
   B- When encoding to UTF-7-IMAP, if we encounter "&&" while in
   us-ascii mode, should we:
       1) start base64 mode and encode the two '&' 
       2) encode them as "&-&-"
   It seems to me that for A and B, solution 2 allows null shifts,
   and solution 1 allows multiple representations.

   However, A-2 and B-2 still feel cleaner to me, since they avoid
   alternate forms for '&'. An argument can be made that the resulting
   sequences are not null shifts, merely a special case in US-ASCII.
   I've used that approach in PATCH 4/4, but it should be quite easy to
   change if necessary.
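
   To make A-2 and B-2 concrete, here is a rough standalone sketch of
   an encoder that always writes '&' as "&-", closing any open base64
   run first. It is not the code from PATCH 4/4: utf7imap_encode, its
   UTF-16 code unit input and the caller-sized output buffer are all
   assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Modified Base64 alphabet of RFC 3501: ',' takes the place of '/'.  */
static const char mb64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Hypothetical helper: encode N UTF-16 code units from IN into OUT,
   NUL-terminated.  OUT must be large enough.  Returns the length.  */
static size_t
utf7imap_encode (const uint16_t *in, size_t n, char *out)
{
  assert (in != NULL && out != NULL);
  size_t o = 0;
  uint32_t bits = 0;  /* pending bits not yet written */
  int nbits = 0;      /* how many bits are pending: 0, 2 or 4 */
  int in_b64 = 0;     /* nonzero while inside an "&...-" run */

  for (size_t i = 0; i <= n; i++)
    {
      int printable = i < n && in[i] >= 0x20 && in[i] <= 0x7e;
      /* Close the base64 run before printable ASCII or at the end.  */
      if (in_b64 && (printable || i == n))
        {
          if (nbits > 0)
            out[o++] = mb64[(bits << (6 - nbits)) & 0x3f];
          out[o++] = '-';
          in_b64 = 0;
          bits = 0;
          nbits = 0;
        }
      if (i == n)
        break;
      if (printable)
        {
          out[o++] = (char) in[i];
          if (in[i] == '&')
            out[o++] = '-';   /* A-2 and B-2: '&' is always "&-" */
          continue;
        }
      if (!in_b64)
        {
          out[o++] = '&';
          in_b64 = 1;
        }
      bits = (bits << 16) | in[i];
      nbits += 16;
      while (nbits >= 6)
        {
          nbits -= 6;
          out[o++] = mb64[(bits >> nbits) & 0x3f];
        }
      bits &= (1u << nbits) - 1;  /* keep only the pending bits */
    }
  out[o] = '\0';
  return o;
}

/* Self-check covering case B ("&&") and case A ('&' after base64).  */
static void
check_encode (void)
{
  char buf[32];
  const uint16_t two_amps[] = { '&', '&' };
  const uint16_t cjk_amp[] = { 0x53F0, '&' };

  utf7imap_encode (two_amps, 2, buf);
  assert (strcmp (buf, "&-&-") == 0);
  utf7imap_encode (cjk_amp, 2, buf);
  assert (strcmp (buf, "&U,A-&-") == 0);
}
```

   With solution 1 instead, the '&' in the second case would be folded
   into the base64 run rather than written as "&-".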

- Also, I'm not sure how to add negative test cases, i.e. invalid
  sequences that need to trigger an iconv error.
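
  Outside of the glibc test suite, the kind of check I have in mind
  could look like the sketch below, which only uses the standard
  iconv_open/iconv/iconv_close calls. rejects_input is a hypothetical
  helper, and the self-check uses UTF-8 only because it is available
  everywhere.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of any test harness.  Returns 1 if
   converting LEN bytes of INPUT from charset FROM to UTF-8 fails with
   EILSEQ, 0 if it does not, and -1 if FROM is not available.  */
static int
rejects_input (const char *from, const char *input, size_t len)
{
  assert (from != NULL && input != NULL);
  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return -1;

  char out[256];
  char *inptr = (char *) input;  /* iconv takes char **, input is only read */
  char *outptr = out;
  size_t inleft = len;
  size_t outleft = sizeof out;

  errno = 0;
  size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  int saved_errno = errno;       /* iconv_close may change errno */
  iconv_close (cd);
  return res == (size_t) -1 && saved_errno == EILSEQ;
}

/* Self-check: 0xff can never appear in well-formed UTF-8.  */
static void
check_rejects (void)
{
  assert (rejects_input ("UTF-8", "\xff", 1) == 1);
  assert (rejects_input ("UTF-8", "abc", 3) == 0);
}
```

  With this series applied, the charset name would be "UTF-7-IMAP" and
  the input whichever sequence the converter is supposed to reject.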

Thanks for your time.

[1]: https://datatracker.ietf.org/doc/html/rfc2152
[2]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3