[2/4] gdbsupport: allow for configuration of extract_string_maybe_quoted

Message ID 499135b7e99641c873df5c4ce419bf4e5a9751f9.1707409662.git.aburgess@redhat.com
State New
Headers
Series Rewrite gdb_argv using extract_string_maybe_quoted |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gdb_build--master-aarch64 success Testing passed
linaro-tcwg-bot/tcwg_gdb_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gdb_check--master-aarch64 success Testing passed

Commit Message

Andrew Burgess Feb. 8, 2024, 4:28 p.m. UTC
  This function has the extract_string_maybe_quoted function accept an
extra parameter, a control structure.  This control structure can then
be used to modify how extract_string_maybe_quoted operates.

What can be controlled exactly?

The new argument allows controls over which characters can and can't
be escaped within each of three contexts: outside any quotes, within
single quotes, and within double quotes.

The escaping rules for the current extract_string_maybe_quoted make
very little sense to me, we allow every character to be escaped within
all three contexts.

This doesn't make much sense to me, why do we need to allow so much
escaping?  In most places in GDB where extract_string_maybe_quoted is
used no characters have special meaning, so if I call
extract_string_maybe_quoted to extract from this:

  "Blah\@"

Then I'll get back (Blah@) without the parenthesis.  But why is the
escaping supported?  I could just as easily write:

  "Blah@"

In which case I'll get the same (Blah@) result.  If I want a result
of (Blah\@) then I need to type:

  "Blah\\@"

Now the argument could be used that this is consistent with shell
escaping, which is what users will expect.  Except it really isn't.
Outside of any quotes then yes, any \<CHAR> sequence will result in
<CHAR> even when <CHAR> doesn't have a special meaning, but within
double quotes only a limited set of characters need escaping, in all
other cases the backslash will be preserved, yet for GDB within double
quotes, our escaping is just like the unquoted escaping.

And within single quotes no characters can be escaped at a shell, and
yet, once again, GDB just uses the unquoted strategy.

So I think GDB's approach neither makes sense from a necessity point
of view, nor does it make sense from a "like a shell" point of view.

My plan, in a later commit, is to reuse extract_string_maybe_quoted as
the core of gdb_argv instead of calling into libiberty.  To do this I
need extract_string_maybe_quoted to behave more "shell like" just as
libiberties buildargv function does.

And so, by allowing control over which characters can, or can't be
quoted, I can retain the current extract_string_maybe_quoted behaviour
for the current users, but provide a different control structure for
when (later on) I reuse extract_string_maybe_quoted in gdb_argv.

I did consider changing the default behaviour of
extract_string_maybe_quoted, and I might still propose that in the
future.  Unfortunately, there is one test (gdb.base/options.exp) that
depends on the current extract_string_maybe_quoted behaviour,
specifically is relies on being able to escape a single quote
character within a single quote context -- at a POSIX shell it is not
possible to embed a single quote within a single quoted string.  So
as this is tested behaviour I'm reluctant to just go changing this.

In this commit I've added a new control structure extract_string_ctrl
in gdbsupport/common-utils.h, defined a default structure in
gdbsupport/common-utils.cc, and this default is used by
extract_string_maybe_quoted to maintain the current behaviour.

In gdb/unittests/extract-string-selftests.c I've defined a new
extract_string_ctrl object which implements shell like behaviour, I've
then added some new extract_string_maybe_quoted self tests which check
the shell like behaviour.
---
 gdb/unittests/extract-string-selftests.c |  20 +++-
 gdbsupport/common-utils.cc               |  75 +++++++------
 gdbsupport/common-utils.h                | 133 ++++++++++++++++++++++-
 3 files changed, 191 insertions(+), 37 deletions(-)
  

Comments

Tom Tromey Feb. 8, 2024, 7:05 p.m. UTC | #1
Andrew> And so, by allowing control over which characters can, or can't be
Andrew> quoted, I can retain the current extract_string_maybe_quoted behaviour
Andrew> for the current users, but provide a different control structure for
Andrew> when (later on) I reuse extract_string_maybe_quoted in gdb_argv.

Andrew> I did consider changing the default behaviour of
Andrew> extract_string_maybe_quoted, and I might still propose that in the
Andrew> future.  Unfortunately, there is one test (gdb.base/options.exp) that
Andrew> depends on the current extract_string_maybe_quoted behaviour,
Andrew> specifically is relies on being able to escape a single quote
Andrew> character within a single quote context -- at a POSIX shell it is not
Andrew> possible to embed a single quote within a single quoted string.  So
Andrew> as this is tested behaviour I'm reluctant to just go changing this.

To me the current behavior seems pretty surprising and not extremely
useful.  I think I'd be in favor of changing it to always be more
shell-like.

I think it might be better to also just remove the controls from this
function -- i.e., simplify this patch -- and have all commands that use
this present the same UI.

The idea here is to try to make gdb be a bit more consistent.  However,
I haven't looked at all the callers to know if this is infeasible for
some reason.  Do you know if there's a reason not to do this?

Internally at AdaCore, we have a change to the libiberty buildargv that
changes the backslash behavior on Windows hosts.  However, it seems to
me that fixing the backslash to not affect ordinary characters would
probably fill the same role.

Tom
  

Patch

diff --git a/gdb/unittests/extract-string-selftests.c b/gdb/unittests/extract-string-selftests.c
index c2eedd75342..3b9a5bd104c 100644
--- a/gdb/unittests/extract-string-selftests.c
+++ b/gdb/unittests/extract-string-selftests.c
@@ -23,20 +23,30 @@ 
 namespace selftests {
 namespace extract_string {
 
+static extract_string_ctrl shell_extract_string_ctrl
+  (nullptr, "", "\"$`\\", "\n", "", "\n");
+
 struct test_def
 {
   test_def (const char *input,
 	    const char *output,
-	    size_t offset)
+	    size_t offset,
+	    extract_string_ctrl *ctrl = nullptr)
     : m_input (input),
       m_output (output),
-      m_offset (offset)
+      m_offset (offset),
+      m_ctrl (ctrl)
   { /* Nothing.  */ }
 
   void run () const
   {
     const char *tmp = m_input;
-    std::string test_out = extract_string_maybe_quoted (&tmp);
+    std::string test_out;
+
+    if (m_ctrl == nullptr)
+      test_out = extract_string_maybe_quoted (&tmp);
+    else
+      test_out = extract_string_maybe_quoted (&tmp, *m_ctrl);
 
     if (run_verbose ())
       {
@@ -55,6 +65,7 @@  struct test_def
   const char *m_input;
   const char *m_output;
   size_t m_offset;
+  extract_string_ctrl *m_ctrl;
 };
 
 test_def tests[] = {
@@ -64,6 +75,9 @@  test_def tests[] = {
   { "ab\\ cd ef", "ab cd", 6 },
   { "\"abc\\\"def\" ghi", "abc\"def", 10 },
   { "\"'abc' 'def'\" ghi", "'abc' 'def'", 13 },
+  { "'ab\\ cd' ef", "ab\\ cd", 8, &shell_extract_string_ctrl },
+  { "ab\\\ncd ef", "abcd", 6, &shell_extract_string_ctrl },
+  { "\"ab\\\ncd\" ef", "abcd", 8, &shell_extract_string_ctrl },
 };
 
 static void
diff --git a/gdbsupport/common-utils.cc b/gdbsupport/common-utils.cc
index 99b9cb8609b..15ee2c43f83 100644
--- a/gdbsupport/common-utils.cc
+++ b/gdbsupport/common-utils.cc
@@ -164,58 +164,67 @@  savestring (const char *ptr, size_t len)
   return p;
 }
 
+/* See common-utils.h.  */
+
+extract_string_ctrl default_extract_string_ctrl (nullptr, nullptr, nullptr);
+
 /* See documentation in common-utils.h.  */
 
 std::string
-extract_string_maybe_quoted (const char **arg)
+extract_string_maybe_quoted (const char **arg,
+			     const extract_string_ctrl &ctrl)
 {
-  bool squote = false;
-  bool dquote = false;
-  bool bsquote = false;
+  char quote = '\0';
   std::string result;
   const char *p = *arg;
 
   /* Find the start of the argument.  */
   p = skip_spaces (p);
 
-  /* Parse p similarly to gdb_argv buildargv function.  */
+  /* Parse p similarly to the libiberty buildargv function, but respect
+     the escaping rules of CTRL.  */
   while (*p != '\0')
     {
-      if (ISSPACE (*p) && !squote && !dquote && !bsquote)
+      if (ISSPACE (*p) && quote == '\0')
 	break;
       else
 	{
-	  if (bsquote)
-	    {
-	      bsquote = false;
-	      result += *p;
-	    }
-	  else if (*p == '\\')
-	    bsquote = true;
-	  else if (squote)
+	  if (*p == '\\')
 	    {
-	      if (*p == '\'')
-		squote = false;
+	      /* The character after the backslash.  */
+	      char c = *(p + 1);
+
+	      if (ctrl.discard_escaped_char (c, quote))
+		{
+		  /* We are discarding the escape character (backslash) and
+		     the character after the escape.  This allows us to
+		     emulate POSIX newline handling where an escaped
+		     newline is discarded.  */
+		  ++p;
+		}
+	      else if (ctrl.is_escaped (c, quote))
+		{
+		  /* The character C is escaped.  We discard the escape
+		     character (backslash), but do include C.  */
+		  ++p;
+		  result += c;
+		}
 	      else
-		result += *p;
-	    }
-	  else if (dquote)
-	    {
-	      if (*p == '"')
-		dquote = false;
-	      else
-		result += *p;
+		{
+		  /* We are going to treat the backslash as a literal
+		     character with no special association to the following
+		     character.  */
+		  result += *p;
+		}
 	    }
+	  else if (*p == quote)
+	    quote = '\0';
+	  else if (quote == '\0' && (*p == '\'' || *p == '"'))
+	    quote = *p;
 	  else
-	    {
-	      if (*p == '\'')
-		squote = true;
-	      else if (*p == '"')
-		dquote = true;
-	      else
-		result += *p;
-	    }
-	  p++;
+	    result += *p;
+
+	  ++p;
 	}
     }
 
diff --git a/gdbsupport/common-utils.h b/gdbsupport/common-utils.h
index 23cd40c0207..60578cd7560 100644
--- a/gdbsupport/common-utils.h
+++ b/gdbsupport/common-utils.h
@@ -74,6 +74,135 @@  std::string &string_vappendf (std::string &dest, const char* fmt, va_list args)
 
 char *savestring (const char *ptr, size_t len);
 
+/* A control data structure, instances of this class are passed to the
+   extract_string_maybe_quoted function in order to modify its behaviour.
+
+   This class controls which characters can and can't be quoted within
+   different context (no quotes, single quotes, or double quotes).  */
+
+struct extract_string_ctrl
+{
+  /* Constructor.  The CAN_ESCAPE_* arguments are the set of characters
+     which can be escaped within each context.  If any character within
+     this set appears within the given context with a backslash before it,
+     then the backslash will be removed and the character retained in the
+     output.  If any of these arguments are nullptr then every character
+     can be escaped within that context.  If you want no character escaping
+     within a particular context then pass the empty string.
+
+     The DISCARD_* arguments are the set of characters which should be
+     discarded when escaped within a given context.  If any of these
+     characters appear within the given context with a backslash before
+     them, then both the backslash and the character will be removed from
+     the final output.  If any of these arguments are nullptr then no
+     characters will be discarded when escaped within that context.
+
+     During string extraction, if a backslash is encountered in the input,
+     and the next character is not in the relevant CAN_ESCAPE_* variable,
+     nor is in the relevant DISCARD_* variable, then the backslash will be
+     retained in the output.  */
+
+  explicit extract_string_ctrl (const char *can_escape_unquoted,
+				const char *can_escape_single,
+				const char *can_escape_double,
+				const char *discard_unquoted = nullptr,
+				const char *discard_single = nullptr,
+				const char *discard_double = nullptr)
+    : m_can_escape_unquoted (can_escape_unquoted),
+      m_can_escape_single (can_escape_single),
+      m_can_escape_double (can_escape_double),
+      m_discard_unquoted (discard_unquoted),
+      m_discard_single (discard_single),
+      m_discard_double (discard_double)
+  {
+    /* Nothing.  */
+  }
+
+  /* Return true if, when character C appears escaped within an input
+     string, both the backslash escape character, and the character C
+     should be discarded from the input stream.  */
+
+  bool discard_escaped_char (const char c, const char quote) const
+  {
+    const char *char_set = nullptr;
+    switch (quote)
+      {
+      case '\0':
+	char_set = m_discard_unquoted;
+	break;
+      case '\'':
+	char_set = m_discard_single;
+	break;
+      case '"':
+	char_set = m_discard_double;
+	break;
+      default:
+	gdb_assert_not_reached ("unexpected quote character '%c'", quote);
+      }
+
+    /* Discard nothing.  */
+    if (char_set == nullptr)
+      return false;
+
+    /* If C is in CHAR_SET then discard.  */
+    return strchr (char_set, c) != nullptr;
+  }
+
+  /* Return true if, when character C appears escaped within an input
+     string, we should discard the previous escape character and just
+     include character C.  Otherwise return false.  */
+
+  bool is_escaped (const char c, const char quote) const
+  {
+    const char *char_set = nullptr;
+    switch (quote)
+      {
+      case '\0':
+	char_set = m_can_escape_unquoted;
+	break;
+      case '\'':
+	char_set = m_can_escape_single;
+	break;
+      case '"':
+	char_set = m_can_escape_double;
+	break;
+      default:
+	gdb_assert_not_reached ("unexpected quote character '%c'", quote);
+      }
+
+    /* When no character set is defined we always return true, this means
+       that every character can be escaped.  */
+    if (char_set == nullptr)
+      return true;
+
+    /* Otherwise, return true if C is in CHAR_SET.  */
+    return strchr (char_set, c) != nullptr;
+  }
+
+private:
+  /* The set of characters that can be escaped within a particular
+     context.  See the constructor for more information.  */
+  const char *m_can_escape_unquoted;
+  const char *m_can_escape_single;
+  const char *m_can_escape_double;
+
+  /* The set of characters that can be discarded within a particular
+     context.  See the constructor for more information.  */
+  const char *m_discard_unquoted;
+  const char *m_discard_single;
+  const char *m_discard_double;
+};
+
+/* The default control configuration for extract_string_maybe_quoted.
+   This is the way it is for backwards compatibility reasons, though it is
+   unclear if this is actually a good thing or not.  Every character in
+   every context can be escaped.  This really makes very little sense as
+   in most locations in GDB no characters, other than quotes and white
+   space, actually have any special meaning, so really very little
+   escaping should be supported.  */
+
+extern extract_string_ctrl default_extract_string_ctrl;
+
 /* Extract the next word from ARG.  The next word is defined as either,
    everything up to the next space, or, if the next word starts with either
    a single or double quote, then everything up to the closing quote.  The
@@ -82,7 +211,9 @@  char *savestring (const char *ptr, size_t len);
    word, or, for quoted words, the first character after the closing
    quote.  */
 
-std::string extract_string_maybe_quoted (const char **arg);
+std::string extract_string_maybe_quoted
+  (const char **arg,
+   const extract_string_ctrl &ctrl = default_extract_string_ctrl);
 
 /* The strerror() function can return NULL for errno values that are
    out of range.  Provide a "safe" version that always returns a