gdb: add UTF16/UTF32 target charsets in phony_iconv

Message ID 20221002140010.106238-1-patrick@monnerat.net
State New
Headers
Series gdb: add UTF16/UTF32 target charsets in phony_iconv |

Commit Message

Patrick Monnerat Oct. 2, 2022, 2 p.m. UTC
  Function phony_iconv is substituted to the system-supplied iconv on
platforms where the latter is deficient. It implements too few possible
conversions for the current gdb requirements. In particular, Ada support
in gdb needs converting strings to UTF-32, which is not currently
featured: as this is used to determine the language, a warning is
issued in all cases.

Conditonal statements decide when the substitution occurs. This
currently enables it for mingw (wchar_t is not UTF-32) even when the
system-supplied iconv is suitable for gdb use.

This patch extends phony_iconv_open and phony_iconv functions to support
any conversion from/to host encoding, wchar_t, UTF-16 and UTF-32 with
endianness alternatives.

The value returned by phony_iconv_open is an integer token representing
the size and endianness of both character encodings involved.
---
 gdb/charset.c | 132 +++++++++++++++++++++++++++++---------------------
 1 file changed, 76 insertions(+), 56 deletions(-)
  

Comments

Tom Tromey Oct. 7, 2022, 8:10 p.m. UTC | #1
>>>>> "Patrick" == Patrick Monnerat via Gdb-patches <gdb-patches@sourceware.org> writes:

Patrick> Function phony_iconv is substituted to the system-supplied iconv on
Patrick> platforms where the latter is deficient.

I sort of hate to improve the phony iconv, but I get why one might want
to.

Patrick> Conditonal statements decide when the substitution occurs. This
Patrick> currently enables it for mingw (wchar_t is not UTF-32) even when the
Patrick> system-supplied iconv is suitable for gdb use.

I don't recall the outcome from this, but is there no way to improve gdb
to use this iconv?  If it truly works well enough then it seems like it
would be the better approach.

Patrick> +  static struct

This can also be const.

Patrick> +#if WORDS_BIGENDIAN
Patrick> +      {	"wchar_t", (sizeof (gdb_wchar_t) - 1) | TOKEN_BIGENDIAN },
Patrick> +#else
Patrick> +      {	"wchar_t",  (sizeof (gdb_wchar_t) - 1) },
Patrick> +#endif

This has tabs before the strings but there's no need.

Patrick> +      { NULL, -1 }
... 
Patrick> +  for (auto p = encodings; p->name; p++)
Patrick> +    if (strcmp (encoding, p->name) == 0)
Patrick> +      return p->token;

You can drop the trailing null entry and just do:

    for (auto p : encodings)

Patrick> +  if(token & ~((TOKEN_MASK << TOKEN_BITS) | TOKEN_MASK))

Missing space after the "if".

Patrick> +  maxval = 1UL << (7 * tosize);	/* Split shift to avoid count overflow. */

gdb doesn't generally use comments at the end of the line.

Tom
  
Patrick Monnerat Oct. 8, 2022, 12:12 a.m. UTC | #2
On 10/7/22 22:10, Tom Tromey wrote:
>>>>>> "Patrick" == Patrick Monnerat via Gdb-patches <gdb-patches@sourceware.org> writes:
Many thanks for your review and remarks.
> Patrick> Function phony_iconv is substituted to the system-supplied iconv on
> Patrick> platforms where the latter is deficient.
>
> I sort of hate to improve the phony iconv, but I get why one might want
> to.

Me too! In fact, I even hate the idea of phony_iconv: this is just a 
workaround for a particular case of unreliable external component.

Conversions to UTF-32 were probably not needed/mandatory when 
phony_iconv was implemented. This is not the case anymore and a true 
workaround should now support them.

>
> Patrick> Conditonal statements decide when the substitution occurs. This
> Patrick> currently enables it for mingw (wchar_t is not UTF-32) even when the
> Patrick> system-supplied iconv is suitable for gdb use.
>
> I don't recall the outcome from this, but is there no way to improve gdb
> to use this iconv?  If it truly works well enough then it seems like it
> would be the better approach.

There is a fuzzy source comment about an iconv problem in Solaris. Your 
bugzilla comment 
https://sourceware.org/bugzilla/show_bug.cgi?id=29315#c2 also goes in 
this direction: however we have no details about what the problem is, if 
we should still support it and/or if there are other systems affected by 
such an iconv problem.

Conditionals enabling phony_iconv rely on ISO-10646 availability (32-bit 
wchar_t?), Haible's implementation of iconv and/or btowc availability, 
but not on a particular OS (see gdb_wchar.h). This is very hard to 
establish a relation between the conditionals and what the implementor 
wanted to target.

In the absence of further info, this patch goes the safe way and 
probably resolves UTF-32 problems on some platforms. Ideally, 
phony_iconv should be removed if the problem has not to be supported 
anymore. Else, conditionals targeting the faulty cases (but what are 
they?) rather than the working ones would be preferable. If we have to 
keep phony_iconv, it should be able to support conversions to UTF-32 anyway.


Please note I had the intention to determine the Solaris problem and 
tried to install the last available OpenSolaris (dated 2009) in a VM 
without success: I gave up.

Patrick
  
Tom Tromey Oct. 8, 2022, 6:55 p.m. UTC | #3
>> I don't recall the outcome from this, but is there no way to improve gdb
>> to use this iconv?  If it truly works well enough then it seems like it
>> would be the better approach.

Patrick> There is a fuzzy source comment about an iconv problem in
Patrick> Solaris. Your bugzilla comment 
Patrick> https://sourceware.org/bugzilla/show_bug.cgi?id=29315#c2 also goes in
Patrick> this direction: however we have no details about what the problem is,
Patrick> if we should still support it and/or if there are other systems
Patrick> affected by such an iconv problem.

Ok.  Thank you for the link, I knew we had discussed this somewhere in
the past...

The comments at the top of gdb_wchar.h describe the situation somewhat,
though they don't really explain what was wrong with Solaris.  My
recollection, though, is that the Solaris wchar_t doesn't have any
ordinary encoding but is instead a weird hybrid thing, and furthermore
that the Solaris iconv doesn't accept "wchar_t" as an encoding name.
So, on Solaris, there's no convenient way to do the conversions (it's
possible to convert wchar_t to/from the locale's multi-byte encoding,
but I didn't implement that since it seemed like a pain).

All of this is based on the idea that it's convenient to work in a wide
character representation at some points in the code.  At the time, I
figured relying on wchar_t would be good for this because (presumably)
hosts would support that reasonably well and we wouldn't have to do
extra work in gdb.

However, it seems to me that it doesn't really have to be done this way.
We could use UTF-32 instead, by making our own tables (along the lines
of ada-unicode.py) for "isdigit" and "isprint".

In addition to this, I suppose we could simply require iconv.  Probably
any host that has iconv will support UTF-32 (if not, what good is it
really).  And libiconv exists and can even be conveniently dropped into
the source tree if there are any hosts that don't have it.  This may not
be a good plan if there are active host platforms where this would be a
pain to deal with.

Anyway, what do you think of this plan?

Patrick> Please note I had the intention to determine the Solaris problem and
Patrick> tried to install the last available OpenSolaris (dated 2009) in a VM 
Patrick> without success: I gave up.

Yeah, that's fine, I wouldn't have done it myself.

Tom
  
Patrick Monnerat Oct. 9, 2022, 12:47 a.m. UTC | #4
On 10/8/22 20:55, Tom Tromey wrote:
>
> The comments at the top of gdb_wchar.h describe the situation somewhat,
> though they don't really explain what was wrong with Solaris.  My
> recollection, though, is that the Solaris wchar_t doesn't have any
> ordinary encoding but is instead a weird hybrid thing, and furthermore
> that the Solaris iconv doesn't accept "wchar_t" as an encoding name.
> So, on Solaris, there's no convenient way to do the conversions (it's
> possible to convert wchar_t to/from the locale's multi-byte encoding,
> but I didn't implement that since it seemed like a pain).

Thanks for the additional explanation.

This describes the particular case of Solaris. Are there other OSes with 
similar implementations?

In all cases, this tends to demonstrate wchar_t is not reliable.

>
> All of this is based on the idea that it's convenient to work in a wide
> character representation at some points in the code.  At the time, I
> figured relying on wchar_t would be good for this because (presumably)
> hosts would support that reasonably well and we wouldn't have to do
> extra work in gdb.
>
> However, it seems to me that it doesn't really have to be done this way.
> We could use UTF-32 instead, by making our own tables (along the lines
> of ada-unicode.py) for "isdigit" and "isprint".

Totally agreed: we need to have something more "predictable". UTF-32 
seems a good choice, but the endian problem should still be resolved. 
Should it be fixed (UTF-32[BL]E) or machine dependent? Both have pros 
and cons. We could have a class implementing those chars + their 
ctype-like methods and even a basic_string instance subclass supporting 
conversions.

I nevertheless don't have any idea what is the amount of work required 
to change this.

> In addition to this, I suppose we could simply require iconv.  Probably
> any host that has iconv will support UTF-32 (if not, what good is it
> really).  And libiconv exists and can even be conveniently dropped into
> the source tree if there are any hosts that don't have it.  This may not
> be a good plan if there are active host platforms where this would be a
> pain to deal with.

IMO, only old platforms (>~15 years) have an iconv that does not feature 
UTF. Do we have to support them?

For the particular case of Solaris, did things changed nowadays and how 
old versions should be supported?

>
> Anyway, what do you think of this plan?

Globally and in the long term, I fully agree. Requiring iconv, using 
UTF-32 instead of wchar_t and dropping phony_iconv looks like the best 
solution. But again, I don't imagine the amount of rework implied.

As I'm mainly the Insight maintainer (since 2014) and a very small and 
recent committer to gdb, I don't want to make a revolution into the 
latter, just make it clean and usable from Insight when called in my 
test contexts (currently linux OK, cygwin OK, mingw BAD). That said, I'm 
not against a reasonable contribution that benefits to bare gdb too, 
within the limits of my knowledge and understanding of its code.

In the short and middle terms, I think the current patch is still 
useful: it immediately (and dirtily!) solves the problem introduced by 
Ada support and will allow a smooth and gentle UTF-32 transition until 
reaching a situation where phony_iconv can be dropped.


The questions I ask above are more to emphasize important strategic 
points rather than requiring an immediate answer, I guess!

Patrick
  
Tom Tromey Oct. 10, 2022, 4:11 p.m. UTC | #5
Patrick> This describes the particular case of Solaris. Are there other OSes
Patrick> with similar implementations?

I don't know.  It's not impossible, since the encoding of wchar_t isn't
specified.

Patrick> Totally agreed: we need to have something more "predictable". UTF-32
Patrick> seems a good choice, but the endian problem should still be resolved. 
Patrick> Should it be fixed (UTF-32[BL]E) or machine dependent? Both have pros
Patrick> and cons. We could have a class implementing those chars + their 
Patrick> ctype-like methods and even a basic_string instance subclass
Patrick> supporting conversions.

It's simplest if the characters are just scalars -- that is, in the host
ordering.  I don't think there's any need for a string type yet.

Patrick> I nevertheless don't have any idea what is the amount of work required
Patrick> to change this.

I did most of it already.

Patrick> For the particular case of Solaris, did things changed nowadays and
Patrick> how old versions should be supported?

I don't know, but it wouldn't be important any more, because we'd no
longer require any way to convert to wchar_t -- gdb simply wouldn't use
wchar_t any more.

Patrick> In the short and middle terms, I think the current patch is still
Patrick> useful: it immediately (and dirtily!) solves the problem introduced by 
Patrick> Ada support and will allow a smooth and gentle UTF-32 transition until
Patrick> reaching a situation where phony_iconv can be dropped.

I suspect we should probably move forward with your patch for GDB 13,
and then switch to my patch for GDB 14.  My reasoning here is just that
requiring iconv is a change that we may not want to spring on people so
late in the process.

Tom
  
Tom Tromey Oct. 16, 2022, 1:50 a.m. UTC | #6
[ UTF-32 as intermediate encoding ]

Patrick> I nevertheless don't have any idea what is the amount of work required
Patrick> to change this.

Tom> I did most of it already.

After digging deeper into this, I think it can't work.  iswprint can
take the locale into account, and without this, we can end up in a
situation where the host-wide-char (or in this case utf-32) conversion
to the host multibyte charset ends up having to emit unnecessary
escapes.

For example in wchar.exp, \242 ends up being emitted as '\242\0\0\0',
because the \242 can't be converted to ASCII.  I think this happens
because gdb_iswprint in the utf-32 formulation disagrees with the iconv
conversion.

However, I think there's a better way to fix all this.  It's very simple
and offhand I don't know why I didn't think of it before... the use of
wchar_t depends on knowing the encoding of wchar_t -- and I think we do
know this on mingw, as it's a form of UTF-16.  So, maybe all that's
needed is a host-is-windows check in gdb_wchar.h, combined with a
suitable definition of INTERMEDIATE_ENCODING.

Can you try this?  Or if you'd prefer, I can send a patch for you to try.

thanks,
Tom
  
Eli Zaretskii Oct. 16, 2022, 6:24 a.m. UTC | #7
> From: Tom Tromey <tom@tromey.com>
> Date: Sat, 15 Oct 2022 19:50:31 -0600
> Cc: Patrick Monnerat via Gdb-patches <gdb-patches@sourceware.org>
> 
> However, I think there's a better way to fix all this.  It's very simple
> and offhand I don't know why I didn't think of it before... the use of
> wchar_t depends on knowing the encoding of wchar_t -- and I think we do
> know this on mingw, as it's a form of UTF-16.

Beware: wchar_t on MS-Windows _is_ indeed UTF-16, but that means a
single wchar_t character can only represent characters within the BMP;
anything beyond the BMP will need a 'wchar_t *' string whose length is
at least 2.  So if the code converts single characters, on Windows it
can only do that with BMP codepoints.

(Ignore me if what I say makes no sense or is not useful: I wasn't
tracking this discussion.)
  
Tom Tromey Oct. 17, 2022, 11:10 p.m. UTC | #8
>> However, I think there's a better way to fix all this.  It's very simple
>> and offhand I don't know why I didn't think of it before... the use of
>> wchar_t depends on knowing the encoding of wchar_t -- and I think we do
>> know this on mingw, as it's a form of UTF-16.

Eli> Beware: wchar_t on MS-Windows _is_ indeed UTF-16, but that means a
Eli> single wchar_t character can only represent characters within the BMP;
Eli> anything beyond the BMP will need a 'wchar_t *' string whose length is
Eli> at least 2.  So if the code converts single characters, on Windows it
Eli> can only do that with BMP codepoints.

Eli> (Ignore me if what I say makes no sense or is not useful: I wasn't
Eli> tracking this discussion.)

Yes, it is helpful.  This may be an issue with this approach, we'd have
to make sure there is a test that covers this case.  If there is a
problem, I think it's fixable with a bit of work in the
character-printing code, though.

thanks,
Tom
  

Patch

diff --git a/gdb/charset.c b/gdb/charset.c
index a6261fc505c..c0528bfb9b5 100644
--- a/gdb/charset.c
+++ b/gdb/charset.c
@@ -98,27 +98,63 @@ 
 #undef ICONV_CONST
 #define ICONV_CONST const
 
-/* We allow conversions from UTF-32, wchar_t, and the host charset.
-   We allow conversions to wchar_t and the host charset.
-   Return 1 if we are converting from UTF-32BE, 2 if from UTF32-LE,
-   0 otherwise.  This is used as a flag in calls to iconv.  */
+/* We allow conversions from/to UTF-16, UTF-32, wchar_t, and the host charset.
+   Return a token representing the conversion or -1 if error.  The token
+   is unpacked in iconv.  */
 
-static iconv_t
-phony_iconv_open (const char *to, const char *from)
+#define TOKEN_BITS	3
+#define TOKEN_MASK	((1 << TOKEN_BITS) - 1)
+#define TOKEN_BIGENDIAN	(1 << (TOKEN_BITS - 1))
+
+static int
+phony_iconv_token (const char *encoding)
 {
-  if (strcmp (to, "wchar_t") && strcmp (to, GDB_DEFAULT_HOST_CHARSET))
-    return -1;
+  static struct
+    {
+      const char *name;
+      int token;
+    } const encodings[] =
+    {
+      { "UTF-16",   (2 - 1) | TOKEN_BIGENDIAN },
+      { "UTF-16BE", (2 - 1) | TOKEN_BIGENDIAN },
+      { "UTF-16LE", (2 - 1) },
+      { "UTF-32",   (4 - 1) | TOKEN_BIGENDIAN },
+      { "UTF-32BE", (4 - 1) | TOKEN_BIGENDIAN },
+      { "UTF-32LE", (4 - 1) },
+      { GDB_DEFAULT_HOST_CHARSET, (1 - 1) },
+#if WORDS_BIGENDIAN
+      {	"wchar_t", (sizeof (gdb_wchar_t) - 1) | TOKEN_BIGENDIAN },
+#else
+      {	"wchar_t",  (sizeof (gdb_wchar_t) - 1) },
+#endif
+      { NULL, -1 }
+    };
 
-  if (!strcmp (from, "UTF-32BE") || !strcmp (from, "UTF-32"))
-    return 1;
+  for (auto p = encodings; p->name; p++)
+    if (strcmp (encoding, p->name) == 0)
+      return p->token;
 
-  if (!strcmp (from, "UTF-32LE"))
-    return 2;
+  return -1;
+}
 
-  if (strcmp (from, "wchar_t") && strcmp (from, GDB_DEFAULT_HOST_CHARSET))
+static void
+phony_split_token (int token, size_t &size, enum bfd_endian &endian)
+{
+  /* Extract parameter values from the token. */
+  endian = token & TOKEN_BIGENDIAN ? BFD_ENDIAN_BIG : BFD_ENDIAN_LITTLE;
+  size = (token & TOKEN_MASK & ~TOKEN_BIGENDIAN) + 1;
+}
+
+static iconv_t
+phony_iconv_open (const char *to, const char *from)
+{
+  int totok = phony_iconv_token (to);
+  int fromtok = phony_iconv_token (from);
+
+  if (totok < 0 || fromtok < 0)
     return -1;
 
-  return 0;
+  return (totok << TOKEN_BITS) | fromtok;
 }
 
 static int
@@ -128,60 +164,44 @@  phony_iconv_close (iconv_t arg)
 }
 
 static size_t
-phony_iconv (iconv_t utf_flag, const char **inbuf, size_t *inbytesleft,
+phony_iconv (iconv_t token, const char **inbuf, size_t *inbytesleft,
 	     char **outbuf, size_t *outbytesleft)
 {
-  if (utf_flag)
+  enum bfd_endian toendian, fromendian;
+  size_t tosize, fromsize;
+  unsigned long maxval;
+
+  if(token & ~((TOKEN_MASK << TOKEN_BITS) | TOKEN_MASK))
     {
-      enum bfd_endian endian
-	= utf_flag == 1 ? BFD_ENDIAN_BIG : BFD_ENDIAN_LITTLE;
-      while (*inbytesleft >= 4)
-	{
-	  unsigned long c
-	    = extract_unsigned_integer ((const gdb_byte *)*inbuf, 4, endian);
+      errno = EBADF;
+      return (size_t) -1;
+    }
 
-	  if (c >= 256)
-	    {
-	      errno = EILSEQ;
-	      return -1;
-	    }
-	  if (*outbytesleft < 1)
-	    {
-	      errno = E2BIG;
-	      return -1;
-	    }
-	  **outbuf = c & 0xff;
-	  ++*outbuf;
-	  --*outbytesleft;
+  phony_split_token (token, fromsize, fromendian);
+  phony_split_token (token >> TOKEN_BITS, tosize, toendian);
+  maxval = 1UL << (7 * tosize);	/* Split shift to avoid count overflow. */
+  maxval = (maxval << tosize) - 1;
 
-	  *inbuf += 4;
-	  *inbytesleft -= 4;
-	}
-      if (*inbytesleft)
+  while (*inbytesleft >= fromsize)
+    {
+      unsigned long c = extract_unsigned_integer ((const gdb_byte *) *inbuf,
+                                                  fromsize, fromendian);
+
+      if (c > maxval)
 	{
-	  /* Partial sequence on input.  */
-	  errno = EINVAL;
+	  errno = EILSEQ;
 	  return -1;
 	}
-    }
-  else
-    {
-      /* In all other cases we simply copy input bytes to the
-	 output.  */
-      size_t amt = *inbytesleft;
-
-      if (amt > *outbytesleft)
-	amt = *outbytesleft;
-      memcpy (*outbuf, *inbuf, amt);
-      *inbuf += amt;
-      *outbuf += amt;
-      *inbytesleft -= amt;
-      *outbytesleft -= amt;
-      if (*inbytesleft)
+      if (*outbytesleft < tosize)
 	{
 	  errno = E2BIG;
 	  return -1;
 	}
+      store_unsigned_integer ((gdb_byte *) *outbuf, tosize, toendian, c);
+      *inbuf += fromsize;
+      *inbytesleft -= fromsize;
+      *outbuf += tosize;
+      *outbytesleft -= tosize;
     }
 
   /* The number of non-reversible conversions -- but they were all