[v2] diagnostics: Fix mojibake from displaying UTF-8 on Windows consoles

Message ID 20250912154503.478219-1-peter0x44@disroot.org
State New
Series [v2] diagnostics: Fix mojibake from displaying UTF-8 on Windows consoles

Commit Message

Peter Damianov Sept. 12, 2025, 3:45 p.m. UTC
UTF-8 characters in diagnostic output (such as the warning emoji ⚠️
used by -fanalyzer) display as mojibake on Windows consoles unless the
UTF-8 code page is in use.

This patch adds UTF-8 to UTF-16 conversion when outputting to a console
on Windows.

gcc/ChangeLog:
	* pretty-print.cc (decode_utf8_char): Move forward declaration.
	(utf8_to_utf16): New function to convert UTF-8 to UTF-16.
	(is_console_handle): New function to detect Windows console handles.
	(write_all): Add UTF-8 to UTF-16 conversion for console output,
	falling back to WriteFile for ASCII strings and regular files.

Signed-off-by: Peter Damianov <peter0x44@disroot.org>
---
v2:
Fix Linux build by moving the decode_utf8_char declaration outside of the #ifdef
Keep form feed

 gcc/pretty-print.cc | 132 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 129 insertions(+), 3 deletions(-)
  

Comments

Jonathan Yong Sept. 14, 2025, 3:26 a.m. UTC | #1
On 9/12/25 3:45 PM, Peter Damianov wrote:
> UTF-8 characters in diagnostic output (such as the warning emoji ⚠️
> used by fanalyzer) display as mojibake on Windows unless the utf8
> code page is being used
> 
> This patch adds UTF-8 to UTF-16 conversion when outputting to a console
> on Windows.
> 
> gcc/ChangeLog:
> 	* pretty-print.cc (decode_utf8_char): Move forward declaration.
> 	(utf8_to_utf16): New function to convert UTF-8 to UTF-16.
> 	(is_console_handle): New function to detect Windows console handles.
> 	(write_all): Add UTF-8 to UTF-16 conversion for console output,
> 	falling back to WriteFile for ASCII strings and regular files.
> 
> Signed-off-by: Peter Damianov <peter0x44@disroot.org>
> ---
> v2:
> Fix linux build by moving decode_utf8_char outside of ifdef
> Keep form feed
> 
>   gcc/pretty-print.cc | 132 +++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 129 insertions(+), 3 deletions(-)
> 

Bootstrapped on native amd64 Linux without failures. I will push to the
master branch soon if there is no further feedback.
  
David Malcolm Sept. 15, 2025, 4:10 p.m. UTC | #2
On Fri, 2025-09-12 at 16:45 +0100, Peter Damianov wrote:
> UTF-8 characters in diagnostic output (such as the warning emoji ⚠️
> used by fanalyzer) display as mojibake on Windows unless the utf8
> code page is being used
> 
> This patch adds UTF-8 to UTF-16 conversion when outputting to a
> console
> on Windows.
> 
> gcc/ChangeLog:
> 	* pretty-print.cc (decode_utf8_char): Move forward
> declaration.
> 	(utf8_to_utf16): New function to convert UTF-8 to UTF-16.
> 	(is_console_handle): New function to detect Windows console
> handles.
> 	(write_all): Add UTF-8 to UTF-16 conversion for console
> output,
> 	falling back to WriteFile for ASCII strings and regular
> files.
> 
> Signed-off-by: Peter Damianov <peter0x44@disroot.org>
> ---
> v2:
> Fix linux build by moving decode_utf8_char outside of ifdef
> Keep form feed
> 
>  gcc/pretty-print.cc | 132
> +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 129 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/pretty-print.cc b/gcc/pretty-print.cc
> index d79a8282cfb..c29e15a41f3 100644
> --- a/gcc/pretty-print.cc
> +++ b/gcc/pretty-print.cc
> @@ -38,11 +38,18 @@ along with GCC; see the file COPYING3.  If not
> see
>  #include <iconv.h>
>  #endif
>  
> +static int
> +decode_utf8_char (const unsigned char *, size_t len, unsigned int
> *);
> +
>  #ifdef __MINGW32__
>  
>  /* Replacement for fputs() that handles ANSI escape codes on Windows
> NT.
>     Contributed by: Liu Hao (lh_mouse at 126 dot com)
>  
> +   Extended by: Peter Damianov
> +   Converts UTF-8 to UTF-16 if outputting to a console, so that
> emojis and
> +   various other unicode characters don't get mojibak'd.
> +
>     XXX: This file is compiled into libcommon.a that will be self-
> contained.
>  	It looks like that these functions can be put nowhere else. 
> */
>  
> @@ -50,11 +57,132 @@ along with GCC; see the file COPYING3.  If not
> see
>  #define WIN32_LEAN_AND_MEAN 1
>  #include <windows.h>
>  
> +/* Convert UTF-8 string to UTF-16.
> +   Returns true if conversion was performed, false if string is pure
> ASCII.
> +
> +   If the string contains only ASCII characters, returns false
> +   without allocating any memory.  Otherwise, a buffer that the
> caller
> +   must free is allocated and the string is converted into it.  */
> +static bool
> +utf8_to_utf16 (const char *utf8_str, size_t utf8_len, wchar_t
> **utf16_str,
> +	       size_t *utf16_len)

Thanks for the patch.

I notice that libcpp/charset.cc defines a function convert_utf8_utf16
(albeit currently static).  Is there a way that this could be reused,
rather than adding a 2nd implementation?

[...snip...]

Sorry, I confess I don't know enough about Windows compat to comment
on the rest of the patch.  If it fixes things on Windows and
doesn't break other OSes, that's good, I suppose :/

Hope this is constructive
Dave
  
Peter Damianov Sept. 15, 2025, 4:33 p.m. UTC | #3
> 
> Thanks for the patch.
> 
> I notice that libcpp/charset.cc defines a function convert_utf8_utf16
> (albeit currently static).  Is there a way that this could be reused,
> rather than adding a 2nd implementation?

From what I can tell, it calls libiconv, which can't be assumed to be
present.
So although it isn't ideal, reimplementing it is necessary. It is also
not much code, so it's not the worst thing.
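
To illustrate why a standalone conversion is small, here is a minimal sketch of the core step: turning one already-decoded code point into UTF-16 code units. The helper name `encode_utf16_sketch` is hypothetical and is not part of the patch; it uses `char16_t` rather than `wchar_t` so that it behaves the same on non-Windows hosts.

```cpp
#include <string>

// Hypothetical helper, not taken from the patch: encode one Unicode code
// point as UTF-16 code units.  Returns an empty string for values that are
// not valid scalar values (surrogates, or anything above U+10FFFF).
std::u16string
encode_utf16_sketch (char32_t cp)
{
  std::u16string out;
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return out;

  if (cp <= 0xFFFF)
    {
      // Basic Multilingual Plane: a single code unit.
      out.push_back (static_cast<char16_t> (cp));
    }
  else
    {
      // Supplementary plane: split the 20-bit offset into a surrogate pair.
      cp -= 0x10000;
      out.push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
      out.push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
    }
  return out;
}
```

For example, U+26A0 (the warning sign) stays a single unit, while U+1F600 becomes the pair 0xD83D 0xDE00.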

> 
> [...snip...]
> 
> Sorry, I confess I don't know enough about Windows compat that I can't
> comment on the rest of the patch.  If it fixes things on Windows and
> doesn't break other OSes, that's good, I suppose :/

The problem is that Windows support for UTF-8 is nonexistent before
Windows 10, and quite poor afterwards.
By default, even fopen cannot open a UTF-8 filename!
The only way to do this before Windows 10 is to call the W (wide) Windows
APIs, which take UTF-16.
https://nullprogram.com/blog/2021/12/30/
This blog post should clarify it if you care to research further, though I
don't think you need to.

It is behind an ifdef, so it cannot break any other platforms.

> 
> Hope this is constructive
> Dave

Thanks for reviewing.
  
Jonathan Yong Sept. 15, 2025, 6:04 p.m. UTC | #4
On 9/15/25 4:33 PM, Peter0x44 wrote:
> 
>>
>> Thanks for the patch.
>>
>> I notice that libcpp/charset.cc defines a function convert_utf8_utf16
>> (albeit currently static).  Is there a way that this could be reused,
>> rather than adding a 2nd implementation?
> 
>  From what I can tell, it calls libiconv, which can't be assumed to be 
> present.
> So despite not being ideal, it's necessary to reimplement it. It is also 
> not that much code, not the worst thing.
> 

Can it be renamed to avoid link time optimization issues?
Thanks.
  
Peter Damianov Sept. 15, 2025, 6:26 p.m. UTC | #5
On 2025-09-15 19:04, Jonathan Yong wrote:
> On 9/15/25 4:33 PM, Peter0x44 wrote:
>> 
>>> 
>>> Thanks for the patch.
>>> 
>>> I notice that libcpp/charset.cc defines a function convert_utf8_utf16
>>> (albeit currently static).  Is there a way that this could be reused,
>>> rather than adding a 2nd implementation?
>> 
>>  From what I can tell, it calls libiconv, which can't be assumed to be 
>> present.
>> So despite not being ideal, it's necessary to reimplement it. It is 
>> also not that much code, not the worst thing.
>> 
> 
> Can it be renamed to avoid link time optimization issues?
> Thanks.
It's static. What link time optimization issues are you talking about?
  
Jonathan Yong Sept. 15, 2025, 6:48 p.m. UTC | #6
On 9/15/25 6:26 PM, Peter0x44 wrote:
> On 2025-09-15 19:04, Jonathan Yong wrote:
>> On 9/15/25 4:33 PM, Peter0x44 wrote:
>>>
>>>>
>>>> Thanks for the patch.
>>>>
>>>> I notice that libcpp/charset.cc defines a function convert_utf8_utf16
>>>> (albeit currently static).  Is there a way that this could be reused,
>>>> rather than adding a 2nd implementation?
>>>
>>>  From what I can tell, it calls libiconv, which can't be assumed to 
>>> be present.
>>> So despite not being ideal, it's necessary to reimplement it. It is 
>>> also not that much code, not the worst thing.
>>>
>>
>> Can it be renamed to avoid link time optimization issues?
>> Thanks.
> It's static. What link time optimization issues are you talking about?

Try checking with -Werror=odr -Werror=lto-type-mismatch; you may have to
enable -flto.
  
LIU Hao Sept. 16, 2025, 3:21 a.m. UTC | #7
On 2025-9-16 00:10, David Malcolm wrote:
> I notice that libcpp/charset.cc defines a function convert_utf8_utf16
> (albeit currently static).  Is there a way that this could be reused,
> rather than adding a 2nd implementation?

On Windows there's `MultiByteToWideChar()`, which can do UTF-8-to-UTF-16 conversion. However, we would have
to add some boilerplate code: for example, it takes string lengths as `int`, which might not be enough on
64-bit hosts, and ignoring the possibility of truncation is probably not good.
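
As a rough sketch of the kind of boilerplate meant here (an assumption about one possible approach, not code from any of the patches): a `size_t` length would have to be checked against `INT_MAX`, and an oversized buffer split into chunks that do not cut a UTF-8 sequence in half. The helper names `fits_in_int` and `utf8_chunk_len` are hypothetical, and the sketch deliberately does not call `MultiByteToWideChar()` itself, so it is host-independent.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: true if N can be passed to an API that takes a
// length as int without truncation.
bool
fits_in_int (std::size_t n)
{
  return n <= static_cast<std::size_t> (INT_MAX);
}

// Hypothetical helper: return how many bytes of S (LEN bytes long) to hand
// to the converter in one call, at most MAX_CHUNK, without ending the chunk
// in the middle of a UTF-8 sequence.  MAX_CHUNK is assumed to be nonzero.
std::size_t
utf8_chunk_len (const char *s, std::size_t len, std::size_t max_chunk)
{
  if (len <= max_chunk)
    return len;

  // Back up while the byte just past the chunk is a continuation byte
  // (10xxxxxx), so the next chunk starts on a sequence boundary.
  std::size_t k = max_chunk;
  while (k > 0 && (static_cast<unsigned char> (s[k]) & 0xC0) == 0x80)
    k--;

  // Malformed input could leave no boundary at all; return the full chunk
  // in that case so the caller still makes progress.
  return k > 0 ? k : max_chunk;
}
```

A caller would loop, converting `utf8_chunk_len (s, len, INT_MAX)` bytes at a time and advancing `s` until `len` reaches zero.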

Also there's a third patch so please comment on that instead.


-- 
Best regards,
LIU Hao
  

Patch

diff --git a/gcc/pretty-print.cc b/gcc/pretty-print.cc
index d79a8282cfb..c29e15a41f3 100644
--- a/gcc/pretty-print.cc
+++ b/gcc/pretty-print.cc
@@ -38,11 +38,18 @@  along with GCC; see the file COPYING3.  If not see
 #include <iconv.h>
 #endif
 
+static int
+decode_utf8_char (const unsigned char *, size_t len, unsigned int *);
+
 #ifdef __MINGW32__
 
 /* Replacement for fputs() that handles ANSI escape codes on Windows NT.
    Contributed by: Liu Hao (lh_mouse at 126 dot com)
 
+   Extended by: Peter Damianov
+   Converts UTF-8 to UTF-16 if outputting to a console, so that emojis and
+   various other unicode characters don't get mojibak'd.
+
    XXX: This file is compiled into libcommon.a that will be self-contained.
 	It looks like that these functions can be put nowhere else.  */
 
@@ -50,11 +57,132 @@  along with GCC; see the file COPYING3.  If not see
 #define WIN32_LEAN_AND_MEAN 1
 #include <windows.h>
 
+/* Convert UTF-8 string to UTF-16.
+   Returns true if conversion was performed, false if string is pure ASCII.
+
+   If the string contains only ASCII characters, returns false
+   without allocating any memory.  Otherwise, a buffer that the caller
+   must free is allocated and the string is converted into it.  */
+static bool
+utf8_to_utf16 (const char *utf8_str, size_t utf8_len, wchar_t **utf16_str,
+	       size_t *utf16_len)
+{
+  if (utf8_len == 0)
+    {
+      *utf16_str = NULL;
+      *utf16_len = 0;
+      return false;  /* No conversion needed for empty string.  */
+    }
+
+  /* First pass: scan for non-ASCII and count UTF-16 code units needed.  */
+  size_t utf16_count = 0;
+  const unsigned char *p = (const unsigned char *) utf8_str;
+  const unsigned char *end = p + utf8_len;
+  bool found_non_ascii = false;
+
+  while (p < end)
+    {
+      if (*p <= 127)
+	{
+	  /* ASCII character - count as 1 UTF-16 unit and advance.  */
+	  utf16_count++;
+	  p++;
+	}
+      else
+	{
+	  /* Non-ASCII character - decode UTF-8 sequence.  */
+	  found_non_ascii = true;
+	  unsigned int codepoint;
+	  int utf8_char_len = decode_utf8_char (p, end - p, &codepoint);
+
+	  if (utf8_char_len == 0)
+	    return false;  /* Invalid UTF-8.  */
+
+	  if (codepoint <= 0xFFFF)
+	    utf16_count += 1;  /* Single UTF-16 unit.  */
+	  else
+	    utf16_count += 2;  /* Surrogate pair.  */
+
+	  p += utf8_char_len;
+	}
+    }
+
+  /* If string is pure ASCII, no conversion needed.  */
+  if (!found_non_ascii)
+    return false;
+
+  *utf16_str = (wchar_t *) xmalloc (utf16_count * sizeof (wchar_t));
+  *utf16_len = utf16_count;
+
+  /* Second pass: convert UTF-8 to UTF-16.  */
+  wchar_t *out = *utf16_str;
+  p = (const unsigned char *) utf8_str;
+
+  while (p < end)
+    {
+      if (*p <= 127)
+	{
+	  /* ASCII character.  */
+	  *out++ = (wchar_t) *p++;
+	}
+      else
+	{
+	  /* Non-ASCII character - decode and convert.  */
+	  unsigned int codepoint;
+	  int utf8_char_len = decode_utf8_char (p, end - p, &codepoint);
+
+	  if (codepoint <= 0xFFFF)
+	    {
+	      *out++ = (wchar_t) codepoint;
+	    }
+	  else
+	    {
+	      /* Convert to UTF-16 surrogate pair.  */
+	      codepoint -= 0x10000;
+	      *out++ = (wchar_t) (0xD800 + (codepoint >> 10));
+	      *out++ = (wchar_t) (0xDC00 + (codepoint & 0x3FF));
+	    }
+
+	  p += utf8_char_len;
+	}
+    }
+
+  return true;
+}
+
+/* Check if the handle is a console.  */
+static bool
+is_console_handle (HANDLE h)
+{
+  DWORD mode;
+  return GetConsoleMode (h, &mode);
+}
+
 /* Write all bytes in [s,s+n) into the specified stream.
-   Errors are ignored.  */
+   If outputting to a Windows console, convert UTF-8 to UTF-16 if needed.
+   Errors are ignored.  */
 static void
 write_all (HANDLE h, const char *s, size_t n)
 {
+  /* If writing to console, try to convert from UTF-8 to UTF-16 and use
+     WriteConsoleW.  utf8_to_utf16 will return false if the string is pure
+     ASCII, in which case we fall back to the regular WriteFile path.  */
+  if (is_console_handle (h))
+    {
+      wchar_t *utf16_str;
+      size_t utf16_len;
+
+      if (utf8_to_utf16 (s, n, &utf16_str, &utf16_len))
+	{
+	  DWORD written;
+	  WriteConsoleW (h, utf16_str, utf16_len, &written, NULL);
+	  free (utf16_str);
+	  return;
+	}
+      /* If UTF-8 conversion returned false, fall back to WriteFile.  */
+    }
+
+  /* WriteFile for regular files or when UTF-16 conversion is not needed.  */
   size_t rem = n;
   DWORD step;
 
@@ -712,8 +840,6 @@  mingw_ansi_fputs (const char *str, FILE *fp)
 
 #endif /* __MINGW32__ */
 
-static int
-decode_utf8_char (const unsigned char *, size_t len, unsigned int *);
 static void pp_quoted_string (pretty_printer *, const char *, size_t = -1);
 
 extern void