libcpp: Support raw strings with newlines while processing directives [PR55971]

Message ID 20220617222106.GA79014@ldh-imac.local
State Committed
Headers
Series libcpp: Support raw strings with newlines while processing directives [PR55971] |

Commit Message

Lewis Hyatt June 17, 2022, 10:21 p.m. UTC
  Hello-

The attached fixes PR preprocessor/55971:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55971

, which is the issue that we don't currently allow raw string literals
containing embedded newlines to appear in #define. With the patch, they can be
used in any preprocessing directive.

While I was in there, I also added support for the argument to _Pragma() to be
a raw string, which was not supported (with or without newlines) previously,
but I think is nice to have.

Bootstrap + regtest all languages on x86-64 Linux looks good, with no new
failures and the new testcases passing:

FAIL 103 103
PASS 542454 542574
UNSUPPORTED 15248 15248
UNTESTED 136 136
XFAIL 4166 4166
XPASS 17 17

Please let me know if it looks OK?
Thanks!

-Lewis
[PATCH] libcpp: Support raw strings with newlines while processing directives [PR55971]

It's not currently possible to use a C++11 raw string containing a newline as
part of the definition of a macro, or in any other preprocessing directive,
such as:

 #define X R"(two
lines)"

 #error R"(this error has
two lines)"

This patch adds support for that by relaxing the conditions under which
_cpp_get_fresh_line() refuses to get a new line. For the case of lexing a raw
string, it's OK to do so as long as there is another line within the current
buffer. The code in cpp_get_fresh_line() was refactored into a new function
get_fresh_line_impl(), so that the new logic is applied only when processing a
raw string and not any other times.

gcc -E needed a small tweak now that it's possible to get a token from
macro expansion which contains a newline; in that case, c-ppoutput.c needs
to check and avoid incrementing its internal line counter in that case,
otherwise it erroneously prints a line change marker after printing the
expansion of a macro with an embedded newline.

I have added testcases for all preprocessing directives to make sure they are
OK with these kinds of raw strings. While doing that it became apparent that
we do not currently accept a raw string (with or without embedded newlines) as
the argument to _Pragma(). That was pretty straightforward to add, so I have
done that as well, since it seems potentially handy to avoid needing to escape
all the quotes inside the pragma, plus clang accepts this as well.

PR preprocessor/55971

libcpp/ChangeLog:

	* directives.cc (destringize_and_run): Support C++11 raw strings
	as the argument to _Pragma().
	* lex.cc (get_fresh_line_impl): New function refactoring the code
	from...
	(_cpp_get_fresh_line): ...here.
	(lex_raw_string): Use the new version of get_fresh_line_impl() to
	support raw strings containing new lines when processing a directive.

gcc/testsuite/ChangeLog:

	* c-c++-common/raw-string-directive-1.c: New test.
	* c-c++-common/raw-string-directive-2.c: New test.

gcc/c-family/ChangeLog:

	* c-ppoutput.cc (token_streamer::stream): Don't call
	account_for_newlines() if the tokens came from macro expansion.
  

Patch

diff --git a/gcc/c-family/c-ppoutput.cc b/gcc/c-family/c-ppoutput.cc
index 9de46a9655f..4f3576fa273 100644
--- a/gcc/c-family/c-ppoutput.cc
+++ b/gcc/c-family/c-ppoutput.cc
@@ -292,11 +292,13 @@  token_streamer::stream (cpp_reader *pfile, const cpp_token *token,
       print.printed = true;
     }
 
-  /* CPP_COMMENT tokens and raw-string literal tokens can have
-     embedded new-line characters.  Rather than enumerating all the
-     possible token types just check if token uses val.str union
-     member.  */
-  if (cpp_token_val_index (token) == CPP_TOKEN_FLD_STR)
+  /* CPP_COMMENT tokens and raw-string literal tokens can have embedded
+     new-line characters.  Rather than enumerating all the possible token
+     types, just check if token uses val.str union member.  If the token came
+     from a macro expansion, then no adjustment should be made since the
+     new-line characters did not appear in the source.  */
+  if (cpp_token_val_index (token) == CPP_TOKEN_FLD_STR
+      && !from_macro_expansion_at (loc))
     account_for_newlines (token->val.str.text, token->val.str.len);
 }
 
diff --git a/gcc/testsuite/c-c++-common/raw-string-directive-1.c b/gcc/testsuite/c-c++-common/raw-string-directive-1.c
new file mode 100644
index 00000000000..810f11256fa
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/raw-string-directive-1.c
@@ -0,0 +1,77 @@ 
+/* { dg-do compile } */
+/* { dg-options "-std=gnu99" { target c } } */
+/* { dg-options "-std=c++11" { target c++ } } */
+
+/* Test that multi-line raw strings are lexed OK for all preprocessing
+   directives where one could appear. Test raw-string-directive-2.c
+   checks that #define is also processed properly.  */
+
+/* Note that in cases where we cause GCC to produce a multi-line error
+   message, we construct the string so that the second line looks enough
+   like an error message for DejaGNU to process it as such, so that we
+   can use dg-warning or dg-error directives to check for it.  */
+
+#warning R"delim(line1 /* { dg-warning "line1" } */
+file:15:1: warning: line2)delim" /* { dg-warning "line2" } */
+
+#error R"delim(line3 /* { dg-error "line3" } */
+file:18:1: error: line4)delim" /* { dg-error "line4" } */
+
+#define X1 R"(line 5
+line 6
+line 7
+line 8
+/*
+//
+line 9)" R"delim(
+line10)delim"
+
+#define X2(a) X1 #a R"(line 11
+/*
+line12
+)"
+
+#if R"(line 13 /* { dg-error "line13" } */
+file:35:1: error: line14)" /* { dg-error "line14\\)\"\" is not valid" } */
+#endif R"(line 15 /* { dg-warning "extra tokens at end of #endif" } */
+\
+line16)" ""
+
+#ifdef XYZ R"(line17 /* { dg-warning "extra tokens at end of #ifdef" } */
+\
+\
+line18)"
+#endif
+
+#if 1
+#else R"(line23 /* { dg-warning "extra tokens at end of #else" } */
+\
+
+line24)"
+#endif
+
+#if 0
+#elif R"(line 25 /* { dg-error "line25" } */
+file:55:1: error: line26)" /* { dg-error "line26\\)\"\" is not valid" } */
+#endif
+
+#line 60 R"(file:60:1: warning: this file has a space
+in it!)"
+#warning "line27" /* { dg-warning "line27" } */
+/* { dg-warning "this file has a space" "#line check" { target *-*-* } 60 } */
+#line 63 "file"
+
+#undef X1 R"(line28 /* { dg-warning "extra tokens at end of #undef" } */
+line29
+\
+)"
+
+#ident R"(line30
+line31)" R"(line 32 /* { dg-warning "extra tokens at end of #ident" } */
+line 33)"
+
+#pragma GCC diagnostic ignored R"(-Woption /* { dg-warning "-Wpragmas" } */
+-with-a-newline)"
+
+_Pragma(R"delim1(GCC diagnostic ignored R"delim2(-Woption /* { dg-warning "-Wpragmas" } */
+-with-a-newline)delim2")delim1")
diff --git a/gcc/testsuite/c-c++-common/raw-string-directive-2.c b/gcc/testsuite/c-c++-common/raw-string-directive-2.c
new file mode 100644
index 00000000000..6fc673ccd82
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/raw-string-directive-2.c
@@ -0,0 +1,33 @@ 
+/* { dg-do run } */
+/* { dg-options "-std=gnu99" { target c } } */
+/* { dg-options "-std=c++11" { target c++ } } */
+
+#define S1 R"(three
+line
+string)"
+
+#define S2 R"(pasted
+two line)" " string"
+
+#define X(a, b) a b R"(
+one more)"
+
+const char *s1 = S1;
+const char *s2 = S2;
+const char *s3 = X(S1, R"(
+with this line plus)");
+
+int main ()
+{
+  const char s1_correct[] = "three\nline\nstring";
+  if (__builtin_memcmp (s1, s1_correct, sizeof s1_correct))
+    __builtin_abort ();
+
+  const char s2_correct[] = "pasted\ntwo line string";
+  if (__builtin_memcmp (s2, s2_correct, sizeof s2_correct))
+    __builtin_abort ();
+
+  const char s3_correct[] = "three\nline\nstring\nwith this line plus\none more";
+  if (__builtin_memcmp (s3, s3_correct, sizeof s3_correct))
+    __builtin_abort ();
+}
diff --git a/libcpp/directives.cc b/libcpp/directives.cc
index f804a441f39..72c7f6fd0ab 100644
--- a/libcpp/directives.cc
+++ b/libcpp/directives.cc
@@ -1859,8 +1859,7 @@  static void
 destringize_and_run (cpp_reader *pfile, const cpp_string *in,
 		     location_t expansion_loc)
 {
-  const unsigned char *src, *limit;
-  char *dest, *result;
+  uchar *dest, *result;
   cpp_context *saved_context;
   cpp_token *saved_cur_token;
   tokenrun *saved_cur_run;
@@ -1868,15 +1867,34 @@  destringize_and_run (cpp_reader *pfile, const cpp_string *in,
   int count;
   const struct directive *save_directive;
 
-  dest = result = (char *) alloca (in->len - 1);
-  src = in->text + 1 + (in->text[0] == 'L');
-  limit = in->text + in->len - 1;
-  while (src < limit)
+  /* If we were given a raw string literal, we don't need to destringize it,
+     but we do need to strip off the prefix and the suffix.  */
+  if (in->text[0] == 'R')
     {
-      /* We know there is a character following the backslash.  */
-      if (*src == '\\' && (src[1] == '\\' || src[1] == '"'))
-	src++;
-      *dest++ = *src++;
+      cpp_string buf;
+      const bool ok
+	= cpp_interpret_string_notranslate (pfile, in, 1, &buf, CPP_STRING);
+      gcc_assert (ok);
+      result = (uchar *) alloca (buf.len);
+
+      /* (Terminating null will be replaced by a newline shortly.)  */
+      memcpy (result, buf.text, buf.len - 1);
+      dest = result + (buf.len - 1);
+      XDELETEVEC (buf.text);
+    }
+  else
+    {
+      const uchar *src, *limit;
+      dest = result = (uchar *) alloca (in->len - 1);
+      src = in->text + 1 + (in->text[0] == 'L');
+      limit = in->text + in->len - 1;
+      while (src < limit)
+	{
+	  /* We know there is a character following the backslash.  */
+	  if (*src == '\\' && (src[1] == '\\' || src[1] == '"'))
+	    src++;
+	  *dest++ = *src++;
+	}
     }
   *dest = '\n';
 
@@ -1896,9 +1914,10 @@  destringize_and_run (cpp_reader *pfile, const cpp_string *in,
 
   /* Inline run_directive, since we need to delay the _cpp_pop_buffer
      until we've read all of the tokens that we want.  */
-  cpp_push_buffer (pfile, (const uchar *) result, dest - result,
-		   /* from_stage3 */ true);
-  /* ??? Antique Disgusting Hack.  What does this do?  */
+  cpp_push_buffer (pfile, result, dest - result, /* from_stage3 */ true);
+
+  /* This is needed to make _Pragma("once") work correctly, as it needs
+   pfile->buffer->file to be set to the current source file.  */
   if (pfile->buffer->prev)
     pfile->buffer->file = pfile->buffer->prev->file;
 
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index f891d3e17df..73e239b2a01 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -1073,6 +1073,9 @@  _cpp_clean_line (cpp_reader *pfile)
   buffer->next_line = s + 1;
 }
 
+template <bool lexing_raw_string>
+static bool get_fresh_line_impl (cpp_reader *pfile);
+
 /* Return true if the trigraph indicated by NOTE should be warned
    about in a comment.  */
 static bool
@@ -2489,9 +2492,8 @@  lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 	{
 	  pos--;
 	  pfile->buffer->cur = pos;
-	  if (pfile->state.in_directive
-	      || (pfile->state.parsing_args
-		  && pfile->buffer->next_line >= pfile->buffer->rlimit))
+	  if ((pfile->state.in_directive || pfile->state.parsing_args)
+	      && pfile->buffer->next_line >= pfile->buffer->rlimit)
 	    {
 	      cpp_error_with_line (pfile, CPP_DL_ERROR, token->src_loc, 0,
 				   "unterminated raw string");
@@ -2506,7 +2508,7 @@  lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 	    CPP_INCREMENT_LINE (pfile, 0);
 	  pfile->buffer->need_line = true;
 
-	  if (!_cpp_get_fresh_line (pfile))
+	  if (!get_fresh_line_impl<true> (pfile))
 	    {
 	      /* We ran out of file and failed to get a line.  */
 	      location_t src_loc = token->src_loc;
@@ -2518,8 +2520,15 @@  lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
 		_cpp_release_buff (pfile, accum.first);
 	      cpp_error_with_line (pfile, CPP_DL_ERROR, src_loc, 0,
 				   "unterminated raw string");
-	      /* Now pop the buffer that _cpp_get_fresh_line did not.  */
+
+	      /* Now pop the buffer that get_fresh_line_impl() did not.  Popping
+		 is not safe if processing a directive, however this cannot
+		 happen as we already checked above that a line would be
+		 available, and get_fresh_line_impl() can't fail in this
+		 case.  */
+	      gcc_assert (!pfile->state.in_directive);
 	      _cpp_pop_buffer (pfile);
+
 	      return;
 	    }
 
@@ -3453,11 +3462,14 @@  _cpp_lex_token (cpp_reader *pfile)
 }
 
 /* Returns true if a fresh line has been loaded.  */
-bool
-_cpp_get_fresh_line (cpp_reader *pfile)
+template <bool lexing_raw_string>
+static bool
+get_fresh_line_impl (cpp_reader *pfile)
 {
-  /* We can't get a new line until we leave the current directive.  */
-  if (pfile->state.in_directive)
+  /* We can't get a new line until we leave the current directive, unless we
+     are lexing a raw string, in which case it will be OK as long as we don't
+     pop the current buffer.  */
+  if (!lexing_raw_string && pfile->state.in_directive)
     return false;
 
   for (;;)
@@ -3473,6 +3485,10 @@  _cpp_get_fresh_line (cpp_reader *pfile)
 	  return true;
 	}
 
+      /* We can't change buffers until we leave the current directive.  */
+      if (lexing_raw_string && pfile->state.in_directive)
+	return false;
+
       /* First, get out of parsing arguments state.  */
       if (pfile->state.parsing_args)
 	return false;
@@ -3500,6 +3516,13 @@  _cpp_get_fresh_line (cpp_reader *pfile)
     }
 }
 
+bool
+_cpp_get_fresh_line (cpp_reader *pfile)
+{
+  return get_fresh_line_impl<false> (pfile);
+}
+
+
 #define IF_NEXT_IS(CHAR, THEN_TYPE, ELSE_TYPE)		\
   do							\
     {							\