[v2,RFC] libdep plugin: fix bugs in parser and drop escaping

Message ID butKCYrx-4tkZT6YHIgyIrVR6g6J-l2HDeF4EVBSejakFQgiCD-GuY5LGr5XpmTAMXxoqjxTeYygkWSEmlxJ0wOVpu4bd4uxe7h8_x1fHXo=@harmenstoppels.nl
State New
Headers
Series [v2,RFC] libdep plugin: fix bugs in parser and drop escaping |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_binutils_build--master-arm success Build passed
linaro-tcwg-bot/tcwg_binutils_build--master-aarch64 success Build passed
linaro-tcwg-bot/tcwg_binutils_check--master-aarch64 success Test passed
linaro-tcwg-bot/tcwg_binutils_check--master-arm success Test passed

Commit Message

Harmen Stoppels June 20, 2024, 7:49 a.m. UTC
  Second version of this patch addresses the following:

1. fix another buffer overflow in libdep_plugin.c when the libdep argument
   string was tab separated instead of ` ` white space.

2. drops handling of `\`: previously the parser simply dropped that character,
   it did not escape anything, and it was impossible to specify `-Lfoo\bar`
   with a literal `\` in the path. Further `\` triggered a bug where it did not
   memmove the trailing null byte. So it is not breaking to make the parser
   treat `\` like a literal character. There is no need for an escape character
   because single and double quotes allow one to express any character as an
   argument, include white space and quotes.

3. unterminated quotes are now a warning, the plugin does nothing in that case.

4. make the parser linear time. memmove made it quadratic worst case.

	PR ld/31906
	* libdep_plugin.c (str2vec): Fix bug where null byte was not copied on
	  memmove during quote handling and escaping, causing repeat of the last
	  character in the last argument.  Fix buffer overflow in **res when
	  arguments were separated by `\t` instead of ` `.  Remove handling of
	  the escape character `\`, as it made it impossible to specify paths
	  containing `\` -- the implementation merely dropped `\`, and was
	  affected by the memmove bug, so this should not be breaking; just
	  single and double quotes are sufficient to deal with white space and
	  quote characters, there is no need for escaping.  Handle syntax errors
	  on unterminated quotes.  Make the parser linear time instead of
	  quadratic.

Signed-off-by: Harmen Stoppels <me@harmenstoppels.nl>
---
 ld/libdep_plugin.c | 153 ++++++++++++++++++---------------------------
 1 file changed, 61 insertions(+), 92 deletions(-)
  

Comments

Howard Chu June 20, 2024, 5:36 p.m. UTC | #1
Harmen Stoppels wrote:
> Second version of this patch addresses the following:
> 
> 1. fix another buffer overflow in libdep_plugin.c when the libdep argument
>    string was tab separated instead of ` ` white space.

When would such an argument be produced? The string was itself passed in as a
command line argument, and a shell would delimit all arguments with a space. Any
embedded tabs would thus be data, and not delimiters.

> 2. drops handling of `\`: previously the parser simply dropped that character,
>    it did not escape anything, and it was impossible to specify `-Lfoo\bar`
>    with a literal `\` in the path. Further `\` triggered a bug where it did not
>    memmove the trailing null byte. So it is not breaking to make the parser
>    treat `\` like a literal character.

> There is no need for an escape character
>    because single and double quotes allow one to express any character as an
>    argument, include white space and quotes.

It's quite awkward to do the necessary intermixing of single and double quotes when
passing arguments through a Makefile and a shell, or thru autoconf to a shell to a
Makefile, etc. etc...

> 3. unterminated quotes are now a warning, the plugin does nothing in that case.
> 
> 4. make the parser linear time. memmove made it quadratic worst case.

Sure, quadratic worst case, but we're not talking about megabytes of arguments.

> 	PR ld/31906
> 	* libdep_plugin.c (str2vec): Fix bug where null byte was not copied on
> 	  memmove during quote handling and escaping, causing repeat of the last
> 	  character in the last argument.  Fix buffer overflow in **res when
> 	  arguments were separated by `\t` instead of ` `.  Remove handling of
> 	  the escape character `\`, as it made it impossible to specify paths
> 	  containing `\` -- the implementation merely dropped `\`, and was
> 	  affected by the memmove bug, so this should not be breaking; just
> 	  single and double quotes are sufficient to deal with white space and
> 	  quote characters, there is no need for escaping.  Handle syntax errors
> 	  on unterminated quotes.  Make the parser linear time instead of
> 	  quadratic.
  
Harmen Stoppels June 20, 2024, 7:12 p.m. UTC | #2
Hi Howard,

On Thursday, June 20th, 2024 at 19:36, Howard Chu <hyc@symas.com> wrote:

>
>
> Harmen Stoppels wrote:
>
> > Second version of this patch addresses the following:
> >
> > 1. fix another buffer overflow in libdep_plugin.c when the libdep argument
> > string was tab separated instead of white space.
>
>
> When would such an argument be produced? The string was itself passed in as a
> command line argument, and a shell would delimit all arguments with a space. Any
> embedded tabs would thus be data, and not delimiters.

It can happen in a pattern `ar cfl "-L$dir -lexample" libf.a f.o`. It may be
uncommon for directories to contain a `\t`, but it is allowed, so it shouldn't
segfault the linker.

> > 2. drops handling of `\\`: previously the parser simply dropped that character,
> > it did not escape anything, and it was impossible to specify `-Lfoo\\bar`
> > with a literal `\\` in the path. Further `\\` triggered a bug where it did not
> > memmove the trailing null byte. So it is not breaking to make the parser
> > treat `\\` like a literal character.
>
> > There is no need for an escape character
> > because single and double quotes allow one to express any character as an
> > argument, include white space and quotes.
>
>
> It's quite awkward to do the necessary intermixing of single and double quotes when
> passing arguments through a Makefile and a shell, or thru autoconf to a shell to a
> Makefile, etc. etc...

Sure, but not everyone uses autotools and makefiles. For example meson is python based
and can trivially generate a ninja file with proper escaping using shlex.quote(...),
it's a python builtin.

If \ is a special escaping character in the linker, then you actually have to do double
escaping from the shell when calling `ar`, which is only more confusing:

    $ ar crl "-La\\b"

would escape the `\` in the shell, but the linker receives `-La\b` and interprets it as
an escape character for `\b` parsing it as `-Lab`, so users would have to specify

   $ ar crl "-La\\\\b"

on the command line to get `-La\b` as a command line argument...

If `\` is taken as a literal character in the linker, then

   $ ar crl "-La\\b"

is sufficient to get `-La\b` which is much better.
  
Harmen Stoppels June 21, 2024, 8:06 a.m. UTC | #3
To me it seems that the current implementation of libdep.so is flawed, it's just
not comparable to how dynamic libraries are resolved, even though it claims to
make static linking behave similar to dynamic linking w.r.t. resolving
dependencies.

The relevant -L flags should have been local search paths at highest priority
scoped to the current static library (maybe inherited like rpaths). That's not
something that can be expressed easily in a bfd plugin:

- The gold linker gets it right to some extent, except that it only supports a
  single -L flag per -l library, making it hard to use. Every new -L flag
  overrides the previous one.
- The bfd linker adds a *global* search path at lowest priority, which results
  in two issues: the dependency is often located in system directories instead
  of the provided -L path, and the -L path affects locating of other libraries
  specified on the command line.

Fixing those issues would require how registration of paths is done in bfd and
gold, which would potentially break other user plugins that rely on the current
behavior.

Further, the double escaping issues (that always persist with quotes) could've
been avoided by using a better API in ar, like one of the following:

   ar cr --libdep-arg=-L/foo --libdep-arg=-L/bar --libdep-arg=-lbaz
   ar cr --libdep-path=/foo --libdep-path=/bar --libdep-lib=baz
   
Those arguments can be stored as consecutive C-strings in the __.LIBDEP member.
No escaping needed, and it mimics `cc -Wl,-rpath`. But I'm afraid that ship has
sailed.
  

Patch

diff --git a/ld/libdep_plugin.c b/ld/libdep_plugin.c
index 414f3cffe12..65372abdcb1 100644
--- a/ld/libdep_plugin.c
+++ b/ld/libdep_plugin.c
@@ -132,83 +132,56 @@  get_libdeps (int fd)
   return rc;
 }
 
-/* Turn a string into an argvec.  */
-static char **
-str2vec (char *in)
+/* Parse arguments in-place as contiguous C-strings
+   and return the number of arguments */
+int
+parse_libdep (char *str)
 {
-  char **res;
-  char *s, *first, *end;
-  char *sq, *dq;
-  int i;
-
-  end = in + strlen (in);
-  s = in;
-  while (isspace ((unsigned char) *s)) s++;
-  first = s;
-
-  i = 1;
-  while ((s = strchr (s, ' ')))
+  char *src, *dst;
+  char quote;
+  int narg;
+  src = dst = str;
+  for (; isspace ((unsigned char) *src); ++src);
+  if (*src == '\0')
+    return 0;
+  narg = 1;
+  quote = 0;
+  while (*src)
     {
-      s++;
-      i++;
-    }
-  res = (char **)malloc ((i+1) * sizeof (char *));
-  if (!res)
-    return res;
-
-  i = 0;
-  sq = NULL;
-  dq = NULL;
-  res[0] = first;
-  for (s = first; *s; s++)
-    {
-      if (*s == '\\')
-	{
-	  memmove (s, s+1, end-s-1);
-	  end--;
-	}
-      if (isspace ((unsigned char) *s))
-	{
-	  if (sq || dq)
-	    continue;
-	  *s++ = '\0';
-	  while (isspace ((unsigned char) *s)) s++;
-	  if (*s)
-	    res[++i] = s;
-	}
-      if (*s == '\'' && !dq)
+      if (*src == '\'' || *src == '\"')
 	{
-	  if (sq)
+	  if (!quote)
+	    quote = *src++;
+
+	  else if (*src == quote)
 	    {
-	      memmove (sq, sq+1, s-sq-1);
-	      memmove (s-2, s+1, end-s-1);
-	      end -= 2;
-	      s--;
-	      sq = NULL;
+	      ++src;
+	      quote = 0;
 	    }
+
 	  else
-	    {
-	      sq = s;
-	    }
+	    *dst++ = *src++;
 	}
-      if (*s == '"' && !sq)
+
+      else if (!quote && isspace ((unsigned char) *src))
 	{
-	  if (dq)
-	    {
-	      memmove (dq, dq+1, s-dq-1);
-	      memmove (s-2, s+1, end-s-1);
-	      end -= 2;
-	      s--;
-	      dq = NULL;
-	    }
-	  else
-	    {
-	      dq = s;
-	    }
+	  ++narg;
+	  ++src;
+	  *dst++ = '\0';
+	  for (; isspace ((unsigned char) *src); ++src);
 	}
+
+      else
+	*dst++ = *src++;
     }
-  res[++i] = NULL;
-  return res;
+  *dst = '\0';
+  if (quote)
+    {
+      TV_MESSAGE (LDPL_WARNING,
+		  "libdep syntax error: unterminated quoted string");
+      return 0;
+    }
+  return narg;
 }
 
 static char *prevfile;
@@ -260,39 +233,35 @@  static enum ld_plugin_status
 onall_symbols_read (void)
 {
   linerec *lr;
-  char **vec;
+  int nargs;
+  char const *arg;
   enum ld_plugin_status rv = LDPS_OK;
 
   while ((lr = line_head))
     {
       line_head = lr->next;
-      vec = str2vec (lr->line);
-      if (vec)
+      nargs = parse_libdep (lr->line);
+      arg = lr->line;
+      int i;
+      for (i = 0; i < nargs; i++, arg = strchr (arg, '\0') + 1)
 	{
-	  int i;
-	  for (i = 0; vec[i]; i++)
+	  if (arg[0] != '-')
+	    {
+	      TV_MESSAGE (LDPL_WARNING, "ignoring libdep argument %s", arg);
+	      fflush (NULL);
+	      continue;
+	    }
+	  if (arg[1] == 'l')
+	    rv = tv_add_input_library (arg + 2);
+	  else if (arg[1] == 'L')
+	    rv = tv_set_extra_library_path (arg + 2);
+	  else
 	    {
-	      if (vec[i][0] != '-')
-		{
-		  TV_MESSAGE (LDPL_WARNING, "ignoring libdep argument %s",
-			      vec[i]);
-		  fflush (NULL);
-		  continue;
-		}
-	      if (vec[i][1] == 'l')
-		rv = tv_add_input_library (vec[i]+2);
-	      else if (vec[i][1] == 'L')
-		rv = tv_set_extra_library_path (vec[i]+2);
-	      else
-		{
-		  TV_MESSAGE (LDPL_WARNING, "ignoring libdep argument %s",
-			      vec[i]);
-		  fflush (NULL);
-		}
-	      if (rv != LDPS_OK)
-		break;
+	      TV_MESSAGE (LDPL_WARNING, "ignoring libdep argument %s", arg);
+	      fflush (NULL);
 	    }
-	  free (vec);
+	  if (rv != LDPS_OK)
+	    break;
 	}
       free (lr);
     }