From patchwork Fri Sep 12 15:45:03 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Peter Damianov
X-Patchwork-Id: 120159
From: Peter Damianov
To: gcc-patches@gcc.gnu.org
Cc: Liu Hao, David Malcolm, Jonathan Yong <10walls@gmail.com>,
 Christopher Wellons, Christoph Reiter, Peter Damianov
Subject: [PATCH v2] diagnostics: Fix mojibake from displaying UTF-8 on Windows consoles
Date: Fri, 12 Sep 2025 16:45:03 +0100
Message-Id: <20250912154503.478219-1-peter0x44@disroot.org>
List-Id: Gcc-patches mailing list

UTF-8 characters in diagnostic output (such as
the warning emoji ⚠️ used by fanalyzer) display as mojibake on Windows unless
the UTF-8 code page is being used.

This patch adds UTF-8 to UTF-16 conversion when outputting to a console on
Windows.

gcc/ChangeLog:

        * pretty-print.cc (decode_utf8_char): Move forward declaration.
        (utf8_to_utf16): New function to convert UTF-8 to UTF-16.
        (is_console_handle): New function to detect Windows console handles.
        (write_all): Add UTF-8 to UTF-16 conversion for console output,
        falling back to WriteFile for ASCII strings and regular files.

Signed-off-by: Peter Damianov
---
v2: Fix Linux build by moving decode_utf8_char outside of ifdef
    Keep form feed

 gcc/pretty-print.cc | 132 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 129 insertions(+), 3 deletions(-)

diff --git a/gcc/pretty-print.cc b/gcc/pretty-print.cc
index d79a8282cfb..c29e15a41f3 100644
--- a/gcc/pretty-print.cc
+++ b/gcc/pretty-print.cc
@@ -38,11 +38,18 @@ along with GCC; see the file COPYING3. If not see
 #include
 #endif
 
+static int
+decode_utf8_char (const unsigned char *, size_t len, unsigned int *);
+
 #ifdef __MINGW32__
 
 /* Replacement for fputs() that handles ANSI escape codes on Windows NT.
    Contributed by: Liu Hao (lh_mouse at 126 dot com)
 
+   Extended by: Peter Damianov
+   Converts UTF-8 to UTF-16 if outputting to a console, so that emojis and
+   various other unicode characters don't get mojibak'd.
+
    XXX: This file is compiled into libcommon.a that will be self-contained.
         It looks like that these functions can be put nowhere else. */
 
@@ -50,11 +57,132 @@ along with GCC; see the file COPYING3. If not see
 #define WIN32_LEAN_AND_MEAN 1
 #include
 
+/* Convert UTF-8 string to UTF-16.
+   Returns true if conversion was performed, false if string is pure ASCII.
+
+   If the string contains only ASCII characters, returns false
+   without allocating any memory. Otherwise, a buffer that the caller
+   must free is allocated and the string is converted into it. */
+static bool
+utf8_to_utf16 (const char *utf8_str, size_t utf8_len, wchar_t **utf16_str,
+               size_t *utf16_len)
+{
+  if (utf8_len == 0)
+    {
+      *utf16_str = NULL;
+      *utf16_len = 0;
+      return false; /* No conversion needed for empty string. */
+    }
+
+  /* First pass: scan for non-ASCII and count UTF-16 code units needed. */
+  size_t utf16_count = 0;
+  const unsigned char *p = (const unsigned char *) utf8_str;
+  const unsigned char *end = p + utf8_len;
+  bool found_non_ascii = false;
+
+  while (p < end)
+    {
+      if (*p <= 127)
+        {
+          /* ASCII character - count as 1 UTF-16 unit and advance. */
+          utf16_count++;
+          p++;
+        }
+      else
+        {
+          /* Non-ASCII character - decode UTF-8 sequence. */
+          found_non_ascii = true;
+          unsigned int codepoint;
+          int utf8_char_len = decode_utf8_char (p, end - p, &codepoint);
+
+          if (utf8_char_len == 0)
+            return false; /* Invalid UTF-8. */
+
+          if (codepoint <= 0xFFFF)
+            utf16_count += 1; /* Single UTF-16 unit. */
+          else
+            utf16_count += 2; /* Surrogate pair. */
+
+          p += utf8_char_len;
+        }
+    }
+
+  /* If string is pure ASCII, no conversion needed. */
+  if (!found_non_ascii)
+    return false;
+
+  *utf16_str = (wchar_t *) xmalloc (utf16_count * sizeof (wchar_t));
+  *utf16_len = utf16_count;
+
+  /* Second pass: convert UTF-8 to UTF-16. */
+  wchar_t *out = *utf16_str;
+  p = (const unsigned char *) utf8_str;
+
+  while (p < end)
+    {
+      if (*p <= 127)
+        {
+          /* ASCII character. */
+          *out++ = (wchar_t) *p++;
+        }
+      else
+        {
+          /* Non-ASCII character - decode and convert. */
+          unsigned int codepoint;
+          int utf8_char_len = decode_utf8_char (p, end - p, &codepoint);
+
+          if (codepoint <= 0xFFFF)
+            {
+              *out++ = (wchar_t) codepoint;
+            }
+          else
+            {
+              /* Convert to UTF-16 surrogate pair. */
+              codepoint -= 0x10000;
+              *out++ = (wchar_t) (0xD800 + (codepoint >> 10));
+              *out++ = (wchar_t) (0xDC00 + (codepoint & 0x3FF));
+            }
+
+          p += utf8_char_len;
+        }
+    }
+
+  return true;
+}
+
+/* Check if the handle is a console. */
+static bool
+is_console_handle (HANDLE h)
+{
+  DWORD mode;
+  return GetConsoleMode (h, &mode);
+}
+
 /* Write all bytes in [s,s+n) into the specified stream.
-   Errors are ignored. */
+   If outputting to a Windows console, convert UTF-8 to UTF-16 if needed.
+   Errors are ignored. */
 static void
 write_all (HANDLE h, const char *s, size_t n)
 {
+  /* If writing to console, try to convert from UTF-8 to UTF-16 and use
+     WriteConsoleW. utf8_to_utf16 will return false if the string is pure
+     ASCII, in which case we fall back to the regular WriteFile path. */
+  if (is_console_handle (h))
+    {
+      wchar_t *utf16_str;
+      size_t utf16_len;
+
+      if (utf8_to_utf16 (s, n, &utf16_str, &utf16_len))
+        {
+          DWORD written;
+          WriteConsoleW (h, utf16_str, utf16_len, &written, NULL);
+          free (utf16_str);
+          return;
+        }
+      /* If UTF-8 conversion returned false, fall back to WriteFile. */
+    }
+
+  /* WriteFile for regular files or when UTF-16 conversion is not needed. */
   size_t rem = n;
   DWORD step;
 
@@ -712,8 +840,6 @@ mingw_ansi_fputs (const char *str, FILE *fp)
 
 #endif /* __MINGW32__ */
 
-static int
-decode_utf8_char (const unsigned char *, size_t len, unsigned int *);
 static void pp_quoted_string (pretty_printer *, const char *, size_t = -1);
 
 extern void