diff mbox series

[gdb/tui] Handle unicode chars in prompt

Message ID	20230526132512.29496-1-tdevries@suse.de
State	Superseded
Headers	DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 28E3E3858CDB To: gdb-patches@sourceware.org Cc: Tom Tromey <tom@tromey.com> Subject: [PATCH] [gdb/tui] Handle unicode chars in prompt Date: Fri, 26 May 2023 15:25:12 +0200 Message-Id: <20230526132512.29496-1-tdevries@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list From: Tom de Vries via Gdb-patches <gdb-patches@sourceware.org> Reply-To: Tom de Vries <tdevries@suse.de> Errors-To: gdb-patches-bounces+patchwork=sourceware.org@sourceware.org Sender: "Gdb-patches" <gdb-patches-bounces+patchwork=sourceware.org@sourceware.org>
Series	[gdb/tui] Handle unicode chars in prompt \| [gdb/tui] Handle unicode chars in prompt

Commit Message

Tom de Vries May 26, 2023, 1:25 p.m. UTC

  Let's try to set the prompt using a unicode character, say '❯', aka U+276F
(heavy right-pointing angle quotation mark ornament).

This works fine on an xterm with CLI (with X marking the position of the
blinking cursor):
...
$ gdb -q -ex "set prompt GDB❯ "
GDB❯ X
...
but with TUI:
...
$ gdb -q -tui -ex "set prompt GDB❯ "
...
we get instead:
...
GDB  GDB  X
...

We can use the test-case gdb.tui/unicode-prompt.exp to get more details, using
tuiterm.

With Term::dump_screen we have:
...
   16 (gdb) set prompt GDB❯
   17 GDB❯ GDB❯ GDB❯ set prompt (gdb)
   18 (gdb)
...
and with Term::dump_screen_with_attrs (summarizing using attribute sets <attrs1>
and <attrs2>):
...
   16 (gdb) set prompt GDB❯
   17 GDB<attrs1>❯<attrs2> GDB<attrs1>❯<attrs2> GDB<attrs1>❯<attrs2> set prompt (gdb)
   18 (gdb)
...
where:
...
<attrs1> == <reverse:1><invisible:1><blinking:1><intensity:bold>
<attrs2> == <reverse:0><invisible:0><blinking:0><intensity:normal>
...

This explains why we didn't see the unicode char on xterm: it's hidden
because the invisible attribute is set.

So, there seem to be two problems:
- the attributes are incorrect, and
- the prompt is repeated a couple of times.

In TUI, the prompt is written out by tui_puts_internal, which outputs one byte
at a time using waddch, which apparantly breaks multi-byte char support.

Fix this by detecting multi-byte chars in tui_puts_internal, and printing them using
waddnstr.

Tested on x86_64-linux.

Reported-By: wuzy01@qq.com

PR tui/28800
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28800
---
 gdb/testsuite/gdb.tui/unicode-prompt.exp | 45 ++++++++++++++++
 gdb/tui/tui-io.c                         | 67 +++++++++++++++++++++++-
 2 files changed, 111 insertions(+), 1 deletion(-)
 create mode 100644 gdb/testsuite/gdb.tui/unicode-prompt.exp


base-commit: 5fd6b60d86ab6ab4bbd173524062b5d2aeac199a

Comments

Eli Zaretskii May 26, 2023, 1:56 p.m. UTC | #1

> Cc: Tom Tromey <tom@tromey.com>
> Date: Fri, 26 May 2023 15:25:12 +0200
> From: Tom de Vries via Gdb-patches <gdb-patches@sourceware.org>
> 
> +/* Return true if STRING starts with a multi-byte char.  Return the length of
> +   the multi-byte char in LEN, or 0 in case it's a multi-byte null char.
> +   Implementation based on _rl_read_mbchar.  */
> +
> +static bool
> +is_mb_char (const char *string, int &len)
> +{
> +  for (len = 1; len <= MB_CUR_MAX; len++)
> +    {
> +      size_t res;
> +
> +      {
> +	wchar_t wc;  <<<<<<<<<<<<<<<<<<<<<<<
> +	mbstate_t ps;
> +	memset (&ps, 0, sizeof (mbstate_t));
> +	res = mbrtowc (&wc, string, len, &ps);

The above assumes each call to mbrtowc produces only one wchar_t
value.  But that's non-portable: on MS-Windows wchar_t is a 16-bit
wide data type, and wchar_t "wide characters" are actually encoded in
UTF-16.  So characters beyond the BMP will yield 2 wchar_t values, not
one.

One additional caveat: "multibyte" != "UTF-8".  There's more than one
multibyte encoding, and the current locale could use some non-UTF-8
encoding instead.  For example, some encoding of the ISO-2022 family.
I'm not sure what this means for the issue at hand.

Yet another consideration is whether tui_puts_internal is used for
outputting text in the target charset, in which case you may have
problems with using mbrtowc, because AFAIK that supports only the
current locale's codeset.  If the target charset is different from the
locale's (basically, the host) charset, and we don't convert one to
the other before calling tui_puts_internal, mbrtowc will fail.

Yes, this is a mess.

Thanks.

Tom de Vries May 26, 2023, 3:44 p.m. UTC | #2

On 5/26/23 15:25, Tom de Vries via Gdb-patches wrote:
> In TUI, the prompt is written out by tui_puts_internal, which outputs one byte
> at a time using waddch, which apparantly breaks multi-byte char support.
> 
> Fix this by detecting multi-byte chars in tui_puts_internal, and printing them using
> waddnstr.

FWIW, I just came across this commit, which seems relevant:
...
commit 2c72d5e58a55d3e0f867ffd9421184852f051cb7
Author: Tom Tromey <tom@tromey.com>
Date:   Sun Sep 27 20:30:30 2020 -0600

Rewrite tui_puts

This rewrites tui_puts.  It now writes as many bytes as possible in a
call to waddnstr, letting curses handle multi-byte sequences properly.

Note that tui_puts_internal remains.  It is needed to handle computing
the start line of the readline prompt, which is difficult to do
properly in the case where redisplaying can also cause the command
window to scroll.  This might be possible to implement by reverting to
single "character" output, by using mbsrtowcs for its side effects to
find character boundaries in the input.  I have not attempted this.
...

This patch uses mbrtowc rather than mbsrtowcs, but I suppose the idea is 
the same.

Thanks,
- Tom

Tom Tromey May 30, 2023, 4:51 p.m. UTC | #3

>>>>> "Eli" == Eli Zaretskii via Gdb-patches <gdb-patches@sourceware.org> writes:

Eli> Yet another consideration is whether tui_puts_internal is used for
Eli> outputting text in the target charset

I think this shouldn't happen; text from the inferior goes through gdb's
target-to-host charset conversion.

Tom

Tom Tromey May 30, 2023, 5:03 p.m. UTC | #4

>> In TUI, the prompt is written out by tui_puts_internal, which outputs one byte
>> at a time using waddch, which apparantly breaks multi-byte char support.
>> Fix this by detecting multi-byte chars in tui_puts_internal, and
>> printing them using
>> waddnstr.

> FWIW, I just came across this commit, which seems relevant:

Tom> Note that tui_puts_internal remains.  It is needed to handle computing
Tom> the start line of the readline prompt, which is difficult to do
Tom> properly in the case where redisplaying can also cause the command
Tom> window to scroll.  This might be possible to implement by reverting to
Tom> single "character" output, by using mbsrtowcs for its side effects to
Tom> find character boundaries in the input.  I have not attempted this.
Tom> ...

I no longer remember what made this difficult.  I wonder if it's
possible to simply emit as many characters as possible in a single call,
and then use getyx to figure out the length of the prompt after it has
been fully displayed.  If the prompt wraps or if it takes multiple
lines, offhand it seems fine to just pick whatever the final column
happens to be.


Using wchar functions in gdb is a pain; at least in the past,
gdb_wchar.h was written to support systems that don't support these at
all (DJGPP - not sure if that host even builds any more).

Some characters may take multiple columns (see 'wcwidth').  I'd hope
that the display-and-getyx approach would avoid having to have gdb
understand this; though I suppose gdb's pager probably already gets this
wrong.

Tom

DJ Delorie May 30, 2023, 6:07 p.m. UTC | #5

Tom Tromey <tom@tromey.com> writes:
> Using wchar functions in gdb is a pain; at least in the past,
> gdb_wchar.h was written to support systems that don't support these at
> all (DJGPP - not sure if that host even builds any more).

The latest gdb build we have is 8.0 but I don't do the builds so I don't
know what the status is for anything newer.

As for wide chars, DJGPP has wchar_t and the required helper functions,
and that's about all the support we added.  I don't think I've ever seen
DJGPP produce anything beyond ASCII or the usual DOS code pages.

Tom Tromey May 31, 2023, 12:02 a.m. UTC | #6

DJ> As for wide chars, DJGPP has wchar_t and the required helper functions,
DJ> and that's about all the support we added.  I don't think I've ever seen
DJ> DJGPP produce anything beyond ASCII or the usual DOS code pages.

Maybe we could drop the compatibility code then.  It's hard to really be
sure, the comments refer to DJGPP but we wouldn't necessarily know if
they were in use on some other system.

Tom

Tom de Vries May 31, 2023, 11:29 a.m. UTC | #7

On 5/30/23 19:03, Tom Tromey wrote:
>>> In TUI, the prompt is written out by tui_puts_internal, which outputs one byte
>>> at a time using waddch, which apparantly breaks multi-byte char support.
>>> Fix this by detecting multi-byte chars in tui_puts_internal, and
>>> printing them using
>>> waddnstr.
> 
>> FWIW, I just came across this commit, which seems relevant:
> 
> Tom> Note that tui_puts_internal remains.  It is needed to handle computing
> Tom> the start line of the readline prompt, which is difficult to do
> Tom> properly in the case where redisplaying can also cause the command
> Tom> window to scroll.  This might be possible to implement by reverting to
> Tom> single "character" output, by using mbsrtowcs for its side effects to
> Tom> find character boundaries in the input.  I have not attempted this.
> Tom> ...
> 
> I no longer remember what made this difficult.  I wonder if it's
> possible to simply emit as many characters as possible in a single call,
> and then use getyx to figure out the length of the prompt after it has
> been fully displayed.  If the prompt wraps or if it takes multiple
> lines, offhand it seems fine to just pick whatever the final column
> happens to be.

I've given that a try, and that seems to work.

I also realized that we don't cover wrapping prompts in the testsuite, 
so I wrote a test-case ( 
https://sourceware.org/pipermail/gdb-patches/2023-May/199950.html ).

Thanks,
- Tom

Tom de Vries June 8, 2023, 10:44 p.m. UTC | #8

On 5/31/23 13:29, Tom de Vries wrote:
> On 5/30/23 19:03, Tom Tromey wrote:
>>>> In TUI, the prompt is written out by tui_puts_internal, which 
>>>> outputs one byte
>>>> at a time using waddch, which apparantly breaks multi-byte char 
>>>> support.
>>>> Fix this by detecting multi-byte chars in tui_puts_internal, and
>>>> printing them using
>>>> waddnstr.
>>
>>> FWIW, I just came across this commit, which seems relevant:
>>
>> Tom> Note that tui_puts_internal remains.  It is needed to handle 
>> computing
>> Tom> the start line of the readline prompt, which is difficult to do
>> Tom> properly in the case where redisplaying can also cause the command
>> Tom> window to scroll.  This might be possible to implement by 
>> reverting to
>> Tom> single "character" output, by using mbsrtowcs for its side 
>> effects to
>> Tom> find character boundaries in the input.  I have not attempted this.
>> Tom> ...
>>
>> I no longer remember what made this difficult.  I wonder if it's
>> possible to simply emit as many characters as possible in a single call,
>> and then use getyx to figure out the length of the prompt after it has
>> been fully displayed.  If the prompt wraps or if it takes multiple
>> lines, offhand it seems fine to just pick whatever the final column
>> happens to be.
> 
> I've given that a try, and that seems to work.
> 
> I also realized that we don't cover wrapping prompts in the testsuite, 
> so I wrote a test-case ( 
> https://sourceware.org/pipermail/gdb-patches/2023-May/199950.html ).

I've committed the test-case (excluding the proposed fix for now), with 
an extra check that FAILs in combination with this patch:
...
FAIL: gdb.tui/long-prompt.exp: prompt size == width + 1: end of screen: 
scrolling
...
because we have:
...
    17 (gdb) set prompt 123456789A123456789B123
    18 456789C123456789D>
    19 123456789A123456789B123456789C123456789D
    20 123456789A123456789B123456789C123456789D
    21 123456789A123456789B123456789C123456789D
    22 >set prompt (gdb)
    23 (gdb)
...
instead of the expected:
...
    19 (gdb) set prompt 123456789A123456789B123
    20 456789C123456789D>
    21 123456789A123456789B123456789C123456789D
    22 >set prompt (gdb)
    23 (gdb)
...

The logic I used in this patch:
...
+  if (height != nullptr)
+    {
+      int line = getcury (w);
+      *height += line - prev_line;
      }
...
was to use the current line to detect a wrap but that doesn't work if 
writing the prompt wraps at the last line which then generates a scroll.

This is the reason that a wrap is detected in the original code using a 
reduction in wrap position.

But AFAIU that doesn't work either for this "maximum-string" approach.
If the string is long enough, it's possible to wrap and increase column 
position.

In conclusion, AFAIU this approach doesn't work.

Thanks,
- Tom

Tom de Vries June 9, 2023, 9:34 a.m. UTC | #9

On 5/26/23 15:56, Eli Zaretskii wrote:
>> Cc: Tom Tromey <tom@tromey.com>
>> Date: Fri, 26 May 2023 15:25:12 +0200
>> From: Tom de Vries via Gdb-patches <gdb-patches@sourceware.org>
>>
>> +/* Return true if STRING starts with a multi-byte char.  Return the length of
>> +   the multi-byte char in LEN, or 0 in case it's a multi-byte null char.
>> +   Implementation based on _rl_read_mbchar.  */
>> +
>> +static bool
>> +is_mb_char (const char *string, int &len)
>> +{
>> +  for (len = 1; len <= MB_CUR_MAX; len++)
>> +    {
>> +      size_t res;
>> +
>> +      {
>> +	wchar_t wc;  <<<<<<<<<<<<<<<<<<<<<<<
>> +	mbstate_t ps;
>> +	memset (&ps, 0, sizeof (mbstate_t));
>> +	res = mbrtowc (&wc, string, len, &ps);
> 
> The above assumes each call to mbrtowc produces only one wchar_t
> value.  But that's non-portable: on MS-Windows wchar_t is a 16-bit
> wide data type, and wchar_t "wide characters" are actually encoded in
> UTF-16.  So characters beyond the BMP will yield 2 wchar_t values, not
> one.
> 

Hi Eli,

I see, thanks for pointing that out.  I've fixed this by using nullptr 
instead of &wc.

> One additional caveat: "multibyte" != "UTF-8".  There's more than one
> multibyte encoding, and the current locale could use some non-UTF-8
> encoding instead.  For example, some encoding of the ISO-2022 family.
> I'm not sure what this means for the issue at hand.
> 

AFAIU, interpreting the currently locale and encoding correctly is up to 
mbrtowc, so as long as it does that correctly I think there's no problem.

> Yet another consideration is whether tui_puts_internal is used for
> outputting text in the target charset, in which case you may have
> problems with using mbrtowc, because AFAIK that supports only the
> current locale's codeset.  If the target charset is different from the
> locale's (basically, the host) charset, and we don't convert one to
> the other before calling tui_puts_internal, mbrtowc will fail.
> 

[ Addressed by Tom Tromey in this thread. ]

> Yes, this is a mess.
> 

Indeed :)

V2 posted here ( 
https://sourceware.org/pipermail/gdb-patches/2023-June/200181.html ).

Thanks,
- Tom

Tom de Vries June 9, 2023, 9:48 a.m. UTC | #10

On 5/30/23 19:03, Tom Tromey wrote:
>>> In TUI, the prompt is written out by tui_puts_internal, which outputs one byte
>>> at a time using waddch, which apparantly breaks multi-byte char support.
>>> Fix this by detecting multi-byte chars in tui_puts_internal, and
>>> printing them using
>>> waddnstr.
> 
>> FWIW, I just came across this commit, which seems relevant:
> 
> Tom> Note that tui_puts_internal remains.  It is needed to handle computing
> Tom> the start line of the readline prompt, which is difficult to do
> Tom> properly in the case where redisplaying can also cause the command
> Tom> window to scroll.  This might be possible to implement by reverting to
> Tom> single "character" output, by using mbsrtowcs for its side effects to
> Tom> find character boundaries in the input.  I have not attempted this.
> Tom> ...
> 
> I no longer remember what made this difficult.  I wonder if it's
> possible to simply emit as many characters as possible in a single call,
> and then use getyx to figure out the length of the prompt after it has
> been fully displayed.  If the prompt wraps or if it takes multiple
> lines, offhand it seems fine to just pick whatever the final column
> happens to be.
> 
> 
> Using wchar functions in gdb is a pain; at least in the past,
> gdb_wchar.h was written to support systems that don't support these at
> all (DJGPP - not sure if that host even builds any more).
> 
> Some characters may take multiple columns (see 'wcwidth').  I'd hope
> that the display-and-getyx approach would avoid having to have gdb
> understand this; though I suppose gdb's pager probably already gets this
> wrong.

Thanks for the pointer.

In v2 I've used #ifdef HAVE_BTOWC to guard the use of mbrtowc, with a 
reference to gdb_wchar.h.

[ Though I do wonder whether we could rely on the c++ stdlib instead and 
just use std::mbrtowc.  ]

I've also fixed a bug, the v1 version didn't take care of wrapping due 
to printing a multi-byte character.

I've added a simplification patch to make the structure of the function 
easier to understand, making that bug easier to spot.

Thanks,
- Tom

Eli Zaretskii June 9, 2023, 10:21 a.m. UTC | #11

> Date: Fri, 9 Jun 2023 11:34:28 +0200
> Cc: gdb-patches@sourceware.org, tom@tromey.com
> From: Tom de Vries <tdevries@suse.de>
> 
> > One additional caveat: "multibyte" != "UTF-8".  There's more than one
> > multibyte encoding, and the current locale could use some non-UTF-8
> > encoding instead.  For example, some encoding of the ISO-2022 family.
> > I'm not sure what this means for the issue at hand.
> > 
> 
> AFAIU, interpreting the currently locale and encoding correctly is up to 
> mbrtowc, so as long as it does that correctly I think there's no problem.

Depends on how and for what purposes will the is_mb_char function be
used, I guess.  If it is used to mean "is this a Unicode character
encoded in UTF-8", then the results might not be what the caller
expects.

Tom Tromey June 9, 2023, 3:13 p.m. UTC | #12

>>>>> "Tom" == Tom de Vries via Gdb-patches <gdb-patches@sourceware.org> writes:

Sorry, I missed this email when looking at your latest TUI patches.

Tom> The logic I used in this patch:
Tom> ...
Tom> +  if (height != nullptr)
Tom> +    {
Tom> +      int line = getcury (w);
Tom> +      *height += line - prev_line;
Tom>      }
Tom> ...
Tom> was to use the current line to detect a wrap but that doesn't work if
Tom> writing the prompt wraps at the last line which then generates a
Tom> scroll.

Yeah, I see.  This whole thing is maddening to me because ncurses does
have a flag that indicates that the output wrapped, and while it is in
curses.h, it is clearly non-exported, there's not even an ncurses
extension for it.  And, PDcurses does not have this.

Tom> In conclusion, AFAIU this approach doesn't work.

Yeah :(

Tom

Tom Tromey June 9, 2023, 3:15 p.m. UTC | #13

Tom> [ Though I do wonder whether we could rely on the c++ stdlib instead
Tom> and just use std::mbrtowc.  ]

I don't know.  We learned with std::thread that not all host ports are
equal, so I think we'd just have to try it and see.

Tom

diff mbox series

Patch

diff --git a/gdb/testsuite/gdb.tui/unicode-prompt.exp b/gdb/testsuite/gdb.tui/unicode-prompt.exp
new file mode 100644
index 00000000000..6c2f9036921
--- /dev/null
+++ b/gdb/testsuite/gdb.tui/unicode-prompt.exp
@@ -0,0 +1,45 @@ 
+# Copyright 2023 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+require allow_tui_tests
+
+tuiterm_env
+
+save_vars { env(LC_ALL) env(LANG) env(LC_CTYPE) } {
+    # Override "C" settings from default_gdb_init.
+    setenv LC_ALL ""
+    setenv LANG en_US.UTF-8
+    setenv LC_CTYPE ""
+
+    Term::clean_restart 24 80
+
+    if {![Term::enter_tui]} {
+	unsupported "TUI not supported"
+	return
+    }
+
+    set unicode_char "\u276F"
+
+    set prompt "GDB$unicode_char "
+    set prompt_re [string_to_regexp $prompt]
+
+    # Set new prompt.
+    send_gdb "set prompt $prompt\n"
+    # Set old prompt back.
+    send_gdb "set prompt (gdb) \n"
+
+    gdb_assert { [Term::wait_for "^${prompt_re}set prompt $gdb_prompt "] } \
+	"prompt with unicode char"
+}
diff --git a/gdb/tui/tui-io.c b/gdb/tui/tui-io.c
index a1eadcd937d..f6412e2dbad 100644
--- a/gdb/tui/tui-io.c
+++ b/gdb/tui/tui-io.c
@@ -514,6 +514,51 @@  tui_puts (const char *string, WINDOW *w)
     update_cmdwin_start_line ();
 }
 
+/* Return true if STRING starts with a multi-byte char.  Return the length of
+   the multi-byte char in LEN, or 0 in case it's a multi-byte null char.
+   Implementation based on _rl_read_mbchar.  */
+
+static bool
+is_mb_char (const char *string, int &len)
+{
+  for (len = 1; len <= MB_CUR_MAX; len++)
+    {
+      size_t res;
+
+      {
+	wchar_t wc;
+	mbstate_t ps;
+	memset (&ps, 0, sizeof (mbstate_t));
+	res = mbrtowc (&wc, string, len, &ps);
+      }
+
+      if (res == (size_t)(-1))
+	{
+	  /* Not a multi-byte char.  */
+	  return false;
+	}
+
+      if (res == (size_t)(-2))
+	{
+	  /* Part of a multi-byte char.  */
+	  continue;
+	}
+
+      if (res == 0)
+	{
+	  /* Multi-byte null char.  */
+	  len = 0;
+	  return true;
+	}
+
+      /* Complete multi-byte char.  */
+      gdb_assert (res == len);
+      return true;
+    }
+
+  return false;
+}
+
 static void
 tui_puts_internal (WINDOW *w, const char *string, int *height)
 {
@@ -521,8 +566,28 @@  tui_puts_internal (WINDOW *w, const char *string, int *height)
   int prev_col = 0;
   bool saw_nl = false;
 
-  while ((c = *string++) != 0)
+  while (true)
     {
+      {
+	int mb_len;
+	if (is_mb_char (string, mb_len) && mb_len != 1)
+	  {
+	    if (mb_len == 0)
+	      {
+		/* Multi-byte null char.  */
+		break;
+	      }
+
+	    waddnstr (w, string, mb_len);
+	    string += mb_len;
+	    continue;
+	  }
+      }
+
+      c = *string++;
+      if (c == '\0')
+	break;
+
       if (c == '\n')
 	saw_nl = true;