[0/5] iconv: module for MODIFIED-UTF-7

Message ID 20200819230702.229822-1-mg@max.gautier.name

Message

Max Gautier Aug. 19, 2020, 11:06 p.m. UTC
  These patches implement a conversion module for "modified UTF-7"
described by RFC 3501 as part of the IMAP4rev1 specification (in section
5.1.3[1]).
This is the encoding that IMAP servers use, by convention, to describe
internationalized mailbox names.

I'm trying to make isync[2] (an IMAP synchronizer) support that
encoding; implementing it in glibc (rather than writing a custom gconv
module) seems (to me) a sensible move, since it will allow other IMAP
clients to reuse that work. Also, it's easier to reuse the existing
boilerplate than to write my own.

The conversion is based on the existing UTF-7 module; I merely copied
it, then changed the parts needed to turn UTF-7 into MODIFIED-UTF-7.

I am unaware of an official name for the encoding, so I used
"MODIFIED-UTF-7". There might be better choices, if someone has insights
on that.

I added test files (last patch) but I'm not sure `make check` actually
tests the stateful character sets (I'm not very familiar with iconv or
the glibc build system).

I would appreciate feedback, even if it is only to say you think that
module does not belong in glibc.

Thank you,

Max Gautier

[1]: https://tools.ietf.org/html/rfc3501#section-5.1.3
[2]: https://isync.sourceforge.io/
  

Comments

Andreas Schwab Aug. 20, 2020, 7:18 a.m. UTC | #1
On Aug 20 2020, Max Gautier via Libc-alpha wrote:

> +// The last line of this file is missing the end-of-line terminator
> +// on purpose, in order to test that the conversion empties the bit buffer
> +// and shifts back to the initial state at the end of the conversion.
> +A&ImIDkQ-

That didn't work out, the newline is present nevertheless.

Andreas.
  
Florian Weimer Aug. 20, 2020, 8:03 a.m. UTC | #2
* Max Gautier via Libc-alpha:

> I am unaware of an official name for the encoding, so I used
> "MODIFIED-UTF-7". There might be better choices, if someone has insights
> on that.

Let's try to get it added to the IANA registry?  It's odd that a
charset defined in an RFC is not already contained in it.

The contact information for the registry seems to have atrophied a
bit.  I will try to figure out the current process.  Historically,
it's been Expert Review, so we shouldn't have to write an RFC for
this.
  
Max Gautier Aug. 20, 2020, 3:19 p.m. UTC | #3
Florian Weimer:
> * Max Gautier via Libc-alpha:
> 
> > I am unaware of an official name for the encoding, so I used
> > "MODIFIED-UTF-7". There might be better choices, if someone has insights
> > on that.
> 
> Let's try to get it added to the IANA registry?  It's odd that a
> charset defined in an RFC is not already contained in it.
> 
> The contact information for the registry seems to have atrophied a
> bit.  I will try to figure out the current process.  Historically,
> it's been Expert Review, so we shouldn't have to write an RFC for
> this.

Is the list here[1] (and the linked RFC) not accurate?

[1]: https://www.iana.org/assignments/charset-info/charset-info.xhtml
  
Florian Weimer Aug. 20, 2020, 3:58 p.m. UTC | #4
* Max Gautier:

> Florian Weimer:
>> * Max Gautier via Libc-alpha:
>> 
>> > I am unaware of an official name for the encoding, so I used
>> > "MODIFIED-UTF-7". There might be better choices, if someone has insights
>> > on that.
>> 
>> Let's try to get it added to the IANA registry?  It's odd that a
>> charset defined in an RFC is not already contained in it.
>> 
>> The contact information for the registry seems to have atrophied a
>> bit.  I will try to figure out the current process.  Historically,
>> it's been Expert Review, so we shouldn't have to write an RFC for
>> this.
>
> Is the list here (and the linked RFC) not accurate?
>
> [1]: https://www.iana.org/assignments/charset-info/charset-info.xhtml

I tried to subscribe earlier today, but have yet to receive a response
from the mailing list software.
  
Max Gautier Sept. 2, 2020, 3:24 p.m. UTC | #5
* Florian Weimer:
> Let's try to get it added to the IANA registry?  It's odd that a
> charset defined in an RFC is not already contained in it.

While we do that, is there someone who would have time to review the
code itself, so we'll be able to proceed once the charset name gets
registered?
  
Adhemerval Zanella Sept. 2, 2020, 8:01 p.m. UTC | #6
On 02/09/2020 12:24, Max Gautier via Libc-alpha wrote:
> * Florian Weimer:
>> Let's try to get it added to the IANA registry?  It's odd that a
>> charset defined in an RFC is not already contained in it.
> 
> While we do that, is there someone who would have time to review the
> code itself, so we'll be able to proceed once the charset name gets
> registered?
> 

I haven't read the RFC, so I can't comment on whether the resulting patch
matches the standard, but the patchset structure looks good.  Although I
am not very fond of duplicating the utf7 module, the resulting patch does
make clear how the modified encoding differs from default UTF-7 (and I am
not sure that trying to parametrize iconvdata/utf-7.c would really pay off
here).  I noticed only some minor style issues.

My only worry is whether this encoding is used widely enough in the wild
to justify its inclusion in glibc.  The fact that it has been defined for
about 17 years without anyone taking the trouble to register it with IANA
makes me doubt that it is really that useful.
  
Max Gautier Sept. 3, 2020, 9:47 a.m. UTC | #7
On Wed, Sep 02, 2020 at 05:01:14PM -0300, Adhemerval Zanella wrote:
> My only worry is whether this encoding is used widely enough in the wild
> to justify its inclusion in glibc.  The fact that it has been defined for
> about 17 years without anyone taking the trouble to register it with IANA
> makes me doubt that it is really that useful.

AFAIK, it's only used by IMAP clients and servers. Searching Google for
"IMAP4 utf-7 usage" shows that some implementations already exist
(PHP, Python, Perl (I think)), along with several questions on Stack
Overflow (over the years and until recently) on how to deal with this
UTF-7 variant. So it seems to be a narrow use case, but one that many
people hit. Since I suppose that many languages can interface with glibc,
including modified UTF-7 would avoid workarounds like converting to
original UTF-7 and then just replacing the shift character and '/' by
',', and that kind of thing.
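
To illustrate what such a workaround has to get right, here is a minimal
Python sketch of the RFC 3501 section 5.1.3 rules: '&' is the shift
character, "&-" stands for a literal '&', and shifted runs are UTF-16BE
in base64 with ',' in place of '/' and no '=' padding. It is not taken
from the patch series or from any existing library; the function names
encode_modified_utf7 and decode_modified_utf7 are hypothetical, and the
decoder does not reject malformed input beyond an unterminated shift.

```python
import base64


def encode_modified_utf7(text: str) -> str:
    """Encode text as RFC 3501 modified UTF-7 (illustrative sketch)."""
    out = []
    pending = []  # characters waiting to be emitted as one shifted run

    def flush() -> None:
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
            # Modified base64: ',' replaces '/', and '=' padding is dropped.
            out.append("&" + b64.replace("/", ",") + "-")
            pending.clear()

    for ch in text:
        if ch == "&":
            flush()
            out.append("&-")  # a literal '&' is written as "&-"
        elif " " <= ch <= "~":
            flush()
            out.append(ch)  # printable US-ASCII represents itself
        else:
            pending.append(ch)
    flush()  # shift back to the initial state at the end of the input
    return "".join(out)


def decode_modified_utf7(name: str) -> str:
    """Decode RFC 3501 modified UTF-7 (illustrative sketch, not strict)."""
    out = []
    i = 0
    while i < len(name):
        if name[i] != "&":
            out.append(name[i])
            i += 1
            continue
        end = name.index("-", i)  # raises ValueError if the shift is unterminated
        chunk = name[i + 1:end]
        if not chunk:
            out.append("&")  # "&-" decodes to a literal '&'
        else:
            b64 = chunk.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)  # restore the padding base64 expects
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)
```

For example, decode_modified_utf7("A&ImIDkQ-") yields "A" followed by
U+2262 and U+0391, and encoding that string gives back "A&ImIDkQ-".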
  
Andreas Schwab Sept. 3, 2020, 10:56 a.m. UTC | #8
On Sep 03 2020, Max Gautier via Libc-alpha wrote:

> AFAIK, it's only used by IMAP clients and servers. Searching Google for
> "IMAP4 utf-7 usage" shows that some implementations already exist
> (PHP, Python, Perl (I think))

Emacs also implements it (as utf-7-imap), and Gnus uses it.

Andreas.
  
Florian Weimer Jan. 12, 2021, 9:12 a.m. UTC | #9
* Max Gautier via Libc-alpha:

> These patches implement a conversion module for "modified UTF-7"
> described by RFC 3501 as part of the IMAP4rev1 specification (in
> section 5.1.3[1]).  This is the encoding used by convention by IMAP
> servers to describe internationalized mailbox names.

UTF-7-IMAP is now an official IMAP charset.  I suggest using this name.

Thanks,
Florian