[v4,0/4] iconv: Add support for UTF-7-IMAP

Message ID 20211209093152.313872-1-mg@max.gautier.name

Message

Max Gautier Dec. 9, 2021, 9:31 a.m. UTC
I finally took the time to work on this again.

This new series implements UTF-7-IMAP in the UTF-7 module, using, as
advised, the same approach as in iso646.c.

Unresolved issues (would appreciate advice on those):
- There seems to be a slight inconsistency in the UTF-7 RFC[1], and the
  current implementation does not follow it exactly.
  In the "UTF-7 Definition/Rule 2":

  "The '+' signals that subsequent octets are to be interpreted as
  elements of the Modified Base64 alphabet until a character not in that
  alphabet is encountered. Such characters include control characters
  such as carriage returns and line feeds"

  The UTF-7 module implements this by making characters '\n', '\r', '\t'
  part of the "direct characters" set, even though they are not
  according to the definition given by the RFC.

  So these characters should be encoded, but should also be interpreted
  literally and implicitly terminate base64 sequences.
  
  On this, I'm inclined to leave the current behavior as is. Changing it
  might mean breaking things; and I don't see many benefits.

- For UTF-7-IMAP:
  The IMAPv4 RFC (UTF-7-IMAP definition)[2] specifies that:
  
  - The character "&" (0x26) is represented by the two-octet sequence "&-"
  - null shifts ("-&" while in BASE64; note that "&-" while in US-ASCII
    means "&") are not permitted
  - The purpose of these modifications is to correct the following
    problems with UTF-7:
      ...

      5) UTF-7 permits multiple alternate forms to represent the same
         string; in particular, printable US-ASCII characters can be
         represented in encoded form.

   Consider the following cases:

   A- When encoding to UTF-7-IMAP, if we encounter '&' while in base64
   mode, should we:
       1) encode it in base64
       2) terminate the encoding with '-' and use "&-"
   B- When encoding to UTF-7-IMAP, if we encounter "&&" while in
   us-ascii mode, should we:
       1) start base64 mode and encode the two '&' 
       2) encode them as "&-&-"
   It seems to me that for both A and B, solution 2 allows null shifts
   and solution 1 allows multiple representations.

   However, A-2 and B-2 still feel cleaner to me, since they avoid
   alternate forms for '&'. The argument can be made that the resulting
   sequences are not null shifts, merely a special case in US-ASCII.
   I've used that approach in PATCH 4/4, but it should be quite easy to
   change if necessary.

- Also, I'm not sure how to add negative test cases, i.e. invalid
  sequences that need to trigger an iconv error.
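
To make cases A and B concrete, here is a minimal standalone sketch of
the A-2/B-2 behaviour. It is not the code from PATCH 4/4: the function
name utf7_imap_encode, its UTF-16 code-unit input and the fixed
worst-case buffer check are all assumptions made for this example. It
treats 0x20..0x7e as direct, always writes '&' as the literal "&-", and
closes any open base64 run before doing so.

```c
#include <stddef.h>
#include <stdint.h>

/* Modified Base64 alphabet from RFC 3501: ',' replaces '/'.  */
static const char mb64[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Hypothetical helper, not the glibc code.  Encodes N UTF-16 code units
   from IN into OUT as a NUL-terminated UTF-7-IMAP string.  Returns the
   number of bytes written (excluding the NUL), or (size_t) -1 if OUT
   is too small.  */
size_t
utf7_imap_encode (const uint16_t *in, size_t n, char *out, size_t outsize)
{
  /* Worst case is 5 output bytes per code unit ("&XXX-"), plus NUL.  */
  if (outsize < 5 * n + 1)
    return (size_t) -1;

  size_t o = 0;
  uint32_t bits = 0;	/* Pending bits; only the low NBITS are valid.  */
  int nbits = 0;
  int in_base64 = 0;

  /* One extra iteration (i == n) closes a base64 run left open.  */
  for (size_t i = 0; i <= n; i++)
    {
      int at_end = i == n;
      if (at_end || (in[i] >= 0x20 && in[i] <= 0x7e))
        {
          if (in_base64)
            {
              /* Flush leftover bits, zero-padded, then close with '-'.  */
              if (nbits > 0)
                out[o++] = mb64[(bits << (6 - nbits)) & 0x3f];
              out[o++] = '-';
              in_base64 = 0;
              nbits = 0;
            }
          if (at_end)
            break;
          out[o++] = (char) in[i];
          /* Options A-2 and B-2: '&' is always the literal "&-".  */
          if (in[i] == '&')
            out[o++] = '-';
        }
      else
        {
          if (!in_base64)
            {
              out[o++] = '&';
              in_base64 = 1;
            }
          /* Append 16 bits, then emit every complete 6-bit group.  */
          bits = (bits << 16) | in[i];
          nbits += 16;
          while (nbits >= 6)
            {
              nbits -= 6;
              out[o++] = mb64[(bits >> nbits) & 0x3f];
            }
        }
    }
  out[o] = '\0';
  return o;
}
```

With this sketch, U+00E9 followed by '&' comes out as "&AOk-&-"
(case A-2) and "&&" as "&-&-" (case B-2). Switching to A-1 or B-1
would mean treating '&' as non-direct in the first branch.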

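For the negative tests, one possible shape is a small helper that feeds
a byte sequence to iconv() and checks that it fails with EILSEQ. This is
only a standalone sketch, not wired into the glibc test infrastructure;
the helper name rejects_with_eilseq is made up for the example. The
"UTF-7-IMAP" charset name only resolves with this series applied, and
which inputs should count as invalid is exactly the open question, so
the sketch takes the source charset as a parameter.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper.  Returns 1 if converting INPUT (LEN bytes) from
   FROMCODE to UTF-8 fails with EILSEQ, 0 if the conversion succeeds or
   fails with another error, and -1 if the converter cannot be opened.  */
int
rejects_with_eilseq (const char *fromcode, const char *input, size_t len)
{
  iconv_t cd = iconv_open ("UTF-8", fromcode);
  if (cd == (iconv_t) -1)
    return -1;

  char outbuf[256];
  char *inptr = (char *) input;	/* iconv takes a non-const pointer.  */
  size_t inleft = len;
  char *outptr = outbuf;
  size_t outleft = sizeof outbuf;

  errno = 0;
  size_t ret = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  int saved_errno = errno;	/* iconv_close may clobber errno.  */
  iconv_close (cd);

  return ret == (size_t) -1 && saved_errno == EILSEQ;
}
```

A test could then call it with "UTF-7-IMAP" and each sequence that is
meant to be rejected. Incomplete input is normally reported as EINVAL
rather than EILSEQ, so truncated sequences would need a separate check.
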

Thanks for your time.

[1]: https://datatracker.ietf.org/doc/html/rfc2152
[2]: https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3

Comments

Max Gautier Dec. 17, 2021, 1:15 p.m. UTC | #1
Hi,

The contribution checklist on the wiki says to keep pinging weekly, so,
doing that.

Cheers
Max Gautier Jan. 17, 2022, 2:07 p.m. UTC | #2
Still pinging.
Max Gautier Jan. 24, 2022, 9:17 a.m. UTC | #3
Pinging the patch.
Adhemerval Zanella Jan. 24, 2022, 2:19 p.m. UTC | #4
On 17/12/2021 10:15, Max Gautier via Libc-alpha wrote:
> Hi,
> 
> The contribution checklist on the wiki says to keep pinging weekly, so,
> doing that.
> 
> Cheers
> 

Thanks for your patience.  I think it is too late for 2.35, but I want to
get back to this for 2.36.
Max Gautier Feb. 10, 2022, 1:16 p.m. UTC | #5
On Mon, Jan 24, 2022 at 11:19:46AM -0300, Adhemerval Zanella wrote:
> ... 
> I think it is too late for 2.35, but I want to get back to this for 2.36.

Since 2.35 has shipped, any chance of tackling this some time soon?

Cheers
Adhemerval Zanella Feb. 10, 2022, 1:17 p.m. UTC | #6
On 10/02/2022 10:16, Max Gautier wrote:
> On Mon, Jan 24, 2022 at 11:19:46AM -0300, Adhemerval Zanella wrote:
>> ... 
>> I think it is too late for 2.35, but I want to get back to this for 2.36.
> 
> Since 2.35 has shipped, any chance of tackling this some time soon?
> 
> Cheers
> 

Thanks for reminding me; I will try to spare some time to check on this.
Max Gautier March 4, 2022, 8:53 a.m. UTC | #7
Pinging for UTF-7-IMAP


Cheers