[v1] C++: Support constexpr strings for asm statements

Message ID 20240125121819.2759991-1-ak@linux.intel.com
State New
Headers
Series [v1] C++: Support constexpr strings for asm statements |

Checks

Context Check Description
linaro-tcwg-bot/tcwg_gcc_build--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_check--master-arm success Testing passed
linaro-tcwg-bot/tcwg_gcc_build--master-aarch64 success Testing passed

Commit Message

Andi Kleen Jan. 25, 2024, 12:18 p.m. UTC
  Some programing styles use a lot of inline assembler, and it is common
to use very complex preprocessor macros to generate the assembler
strings for the asm statements. In C++ there would be a typesafe alternative
using templates and constexpr to generate the assembler strings, but
unfortunately the asm statement requires plain string literals, so this
doesn't work.

This patch modifies the C++ parser to accept strings generated by
constexpr instead of just plain strings. This requires new syntax
because e.g. asm("..." : "r" (expr)) would be ambigious with a function
call. I chose () to make it unique. For example now you can write

constexpr const char *genasm() { return "insn"; }
constexpr const char *genconstraint() { return "r"; }

	asm(genasm() :: (genconstraint()) (input));

The constexpr strings are allowed for the asm template, the
constraints and the clobbers (every time current asm accepts a string)

The drawback of this scheme is that the constexpr doesn't have
full control over the input/output/clobber lists, but that can be
usually handled with a switch statement.  One could imagine
more flexible ways to handle that, for example supporting constexpr
vectors for the clobber list, or similar. But even without
that it is already useful.

Bootstrapped and full test on x86_64-linux.
---
 gcc/cp/parser.cc                       | 76 ++++++++++++++++++--------
 gcc/doc/extend.texi                    | 17 +++++-
 gcc/testsuite/g++.dg/constexpr-asm-1.C | 30 ++++++++++
 3 files changed, 99 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/constexpr-asm-1.C
  

Comments

Jakub Jelinek March 15, 2024, 10:29 a.m. UTC | #1
On Thu, Jan 25, 2024 at 04:18:19AM -0800, Andi Kleen wrote:
> Some programing styles use a lot of inline assembler, and it is common
> to use very complex preprocessor macros to generate the assembler
> strings for the asm statements. In C++ there would be a typesafe alternative
> using templates and constexpr to generate the assembler strings, but
> unfortunately the asm statement requires plain string literals, so this
> doesn't work.
> 
> This patch modifies the C++ parser to accept strings generated by
> constexpr instead of just plain strings. This requires new syntax
> because e.g. asm("..." : "r" (expr)) would be ambigious with a function
> call. I chose () to make it unique. For example now you can write
> 
> constexpr const char *genasm() { return "insn"; }
> constexpr const char *genconstraint() { return "r"; }
> 
> 	asm(genasm() :: (genconstraint()) (input));
> 
> The constexpr strings are allowed for the asm template, the
> constraints and the clobbers (every time current asm accepts a string)
> 
> The drawback of this scheme is that the constexpr doesn't have
> full control over the input/output/clobber lists, but that can be
> usually handled with a switch statement.  One could imagine
> more flexible ways to handle that, for example supporting constexpr
> vectors for the clobber list, or similar. But even without
> that it is already useful.
> 
> Bootstrapped and full test on x86_64-linux.
> ---
>  gcc/cp/parser.cc                       | 76 ++++++++++++++++++--------
>  gcc/doc/extend.texi                    | 17 +++++-
>  gcc/testsuite/g++.dg/constexpr-asm-1.C | 30 ++++++++++
>  3 files changed, 99 insertions(+), 24 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/constexpr-asm-1.C
> 
> diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
> index 3748ccd49ff3..cc323dc8557a 100644
> --- a/gcc/cp/parser.cc
> +++ b/gcc/cp/parser.cc
> @@ -22654,6 +22654,43 @@ cp_parser_using_directive (cp_parser* parser)
>    cp_parser_require (parser, CPP_SEMICOLON, RT_SEMICOLON);
>  }
>  
> +/* Parse a string literal or constant expression yielding a string.
> +   The constant expression uses extra parens to avoid ambiguity with "x" (expr).
> +
> +   asm-string-expr:
> +     string-literal
> +     ( constant-expr ) */
> +
> +static tree
> +cp_parser_asm_string_expression (cp_parser *parser)
> +{
> +  location_t sloc = cp_lexer_peek_token (parser->lexer)->location;
> +
> +  if (cp_lexer_next_token_is (parser->lexer, CPP_OPEN_PAREN))

Why should it be wrapped in ()s?

> +    {
> +      matching_parens parens;
> +      parens.consume_open (parser);
> +      tree string = cp_parser_constant_expression (parser);
> +      if (string != error_mark_node)
> +	string = cxx_constant_value (string, tf_error);
> +      if (TREE_CODE (string) == NOP_EXPR)
> +	string = TREE_OPERAND (string, 0);
> +      if (TREE_CODE (string) == ADDR_EXPR && TREE_CODE (TREE_OPERAND (string, 0)) == STRING_CST)

Too long line.

> +	string = TREE_OPERAND (string, 0);
> +      if (TREE_CODE (string) == VIEW_CONVERT_EXPR)
> +	string = TREE_OPERAND (string, 0);
> +      if (TREE_CODE (string) != STRING_CST && string != error_mark_node)
> +	{
> +	  error_at (sloc, "Expected string valued constant expression for %<asm%>, not type %qT",
> +			    TREE_TYPE (string));

Again, too long line, diagnostics should never start with a capital letter,
but more importantly, this will handle only a small subset of what one can
construct with constexpr functions, not everything they can return even if
they return const char * is necessarily a STRING_LITERAL, could be an array
of chars or something similar, especially if the intent is not just to
return prepared whole string literals, but construct the template etc. from
substrings.

Given the https://wg21.link/P2741R3 C++26 addition, I wonder if it wouldn't
be much better to stay compatible with the static_assert extension there and
use similar thing for inline asm.
See https://gcc.gnu.org/r14-5771 and r14-5956 follow-up for the actual
implementation.
One issue is the character set question.  The strings in inline asm right
now are untranslated, i.e. remain in SOURCE_CHARSET, usually UTF-8 (in
theory there is also UTF-EBCDIC support, but nobody knows if it actually
works), which is presumably what the assembler expects too.
Most of the string literals and character literals constexpr deals with
are in the execution character set though.  For static_assert we just assume
the user knows what he is doing when trying to emit non-ASCII characters in
the message when using say -fexec-charset=EBCDICUS , but should that be the
case for inline asm too?  Or should we try to translate those strings back
from execution character set to source character set?  Or require that it
actually constructs UTF-8 string literals and for the UTF-EBCDIC case
translate from UTF-8 to UTF-EBCDIC?  So the user constexpr functions then
would return u8"insn"; or construct from u8'i' etc. character literals...

In any case, as has been said earlier, this isn't stage4 material.

	Jakub
  
Andi Kleen March 15, 2024, 5:20 p.m. UTC | #2
Thanks Jakub.

> > +/* Parse a string literal or constant expression yielding a string.
> > +   The constant expression uses extra parens to avoid ambiguity with "x" (expr).
> > +
> > +   asm-string-expr:
> > +     string-literal
> > +     ( constant-expr ) */
> > +
> > +static tree
> > +cp_parser_asm_string_expression (cp_parser *parser)
> > +{
> > +  location_t sloc = cp_lexer_peek_token (parser->lexer)->location;
> > +
> > +  if (cp_lexer_next_token_is (parser->lexer, CPP_OPEN_PAREN))
> 
> Why should it be wrapped in ()s?

It's true it's only needed for the constraints, but not here.

> Too long line.
> 
> > +	string = TREE_OPERAND (string, 0);
> > +      if (TREE_CODE (string) == VIEW_CONVERT_EXPR)
> > +	string = TREE_OPERAND (string, 0);
> > +      if (TREE_CODE (string) != STRING_CST && string != error_mark_node)
> > +	{
> > +	  error_at (sloc, "Expected string valued constant expression for %<asm%>, not type %qT",
> > +			    TREE_TYPE (string));
> 
> Again, too long line, diagnostics should never start with a capital letter,
> but more importantly, this will handle only a small subset of what one can
> construct with constexpr functions, not everything they can return even if
> they return const char * is necessarily a STRING_LITERAL, could be an array
> of chars or something similar, especially if the intent is not just to
> return prepared whole string literals, but construct the template etc. from
> substrings.
> 
> Given the https://wg21.link/P2741R3 C++26 addition, I wonder if it wouldn't

That's a good find.

> be much better to stay compatible with the static_assert extension there and
> use similar thing for inline asm.
> See https://gcc.gnu.org/r14-5771 and r14-5956 follow-up for the actual
> implementation.

Makes sense. I will adapt the code.

> One issue is the character set question.  The strings in inline asm right
> now are untranslated, i.e. remain in SOURCE_CHARSET, usually UTF-8 (in
> theory there is also UTF-EBCDIC support, but nobody knows if it actually
> works), which is presumably what the assembler expects too.
> Most of the string literals and character literals constexpr deals with
> are in the execution character set though.  For static_assert we just assume
> the user knows what he is doing when trying to emit non-ASCII characters in
> the message when using say -fexec-charset=EBCDICUS , but should that be the
> case for inline asm too?  Or should we try to translate those strings back
> from execution character set to source character set?  Or require that it
> actually constructs UTF-8 string literals and for the UTF-EBCDIC case
> translate from UTF-8 to UTF-EBCDIC?  So the user constexpr functions then
> would return u8"insn"; or construct from u8'i' etc. character literals...

I think keeping it untranslated is fine for now. Any translation
if really needed would be a separate feature.


> 
> In any case, as has been said earlier, this isn't stage4 material.

Yes that merge can be deferred of course.

-Andi
  
Jakub Jelinek March 15, 2024, 5:54 p.m. UTC | #3
On Fri, Mar 15, 2024 at 10:20:06AM -0700, Andi Kleen wrote:
> > One issue is the character set question.  The strings in inline asm right
> > now are untranslated, i.e. remain in SOURCE_CHARSET, usually UTF-8 (in
> > theory there is also UTF-EBCDIC support, but nobody knows if it actually
> > works), which is presumably what the assembler expects too.
> > Most of the string literals and character literals constexpr deals with
> > are in the execution character set though.  For static_assert we just assume
> > the user knows what he is doing when trying to emit non-ASCII characters in
> > the message when using say -fexec-charset=EBCDICUS , but should that be the
> > case for inline asm too?  Or should we try to translate those strings back
> > from execution character set to source character set?  Or require that it
> > actually constructs UTF-8 string literals and for the UTF-EBCDIC case
> > translate from UTF-8 to UTF-EBCDIC?  So the user constexpr functions then
> > would return u8"insn"; or construct from u8'i' etc. character literals...
> 
> I think keeping it untranslated is fine for now. Any translation
> if really needed would be a separate feature.

I mean, unless you make extra effort, it is translated.
Even in your current version, try constexpr *foo () { return "nop"; }
and you'll see that it actually results in "\x95\x96\x97" with
-fexec-charset=EBCDICUS.
What is worse, constexpr *bar () { return "%0 %1"; }
results in "\x6c\xf0\x40\x6c\xf1", so the compiler will not be able to find
the % special characters in there etc.
The parsing of the string literal in asm definitions uses translate=false
to avoid the translations.
As the static_assert paper says, for static_assert it isn't that big a deal,
the program is already UB if it diagnoses static assertion failure, worst
case it prints garbage if one plays with -fexec-charset=.  But for inline
asm it would fail to compile...

So, the extension really should be well defined vs. the character set,
either it should be characters in the execution charset and the FE would
need to ask libcpp to translate it back, or it would need to be declared
to be e.g. in UTF-8 regardless of the charset (like u8'x' or u8"abc"
literals are; but then shouldn't the _M_data in that case return a pointer
to char8_t instead), something else?

	Jakub
  
Andi Kleen March 15, 2024, 11:47 p.m. UTC | #4
> > I think keeping it untranslated is fine for now. Any translation
> > if really needed would be a separate feature.
> 
> I mean, unless you make extra effort, it is translated.
> Even in your current version, try constexpr *foo () { return "nop"; }
> and you'll see that it actually results in "\x95\x96\x97" with
> -fexec-charset=EBCDICUS.

Perhaps I don't understand the use case for this option.
Would it ever be used on a non EBCDIC system?

> What is worse, constexpr *bar () { return "%0 %1"; }
> results in "\x6c\xf0\x40\x6c\xf1", so the compiler will not be able to find
> the % special characters in there etc.
> The parsing of the string literal in asm definitions uses translate=false
> to avoid the translations.
> As the static_assert paper says, for static_assert it isn't that big a deal,
> the program is already UB if it diagnoses static assertion failure, worst
> case it prints garbage if one plays with -fexec-charset=.  But for inline
> asm it would fail to compile...
> 
> So, the extension really should be well defined vs. the character set,
> either it should be characters in the execution charset and the FE would
> need to ask libcpp to translate it back, or it would need to be declared
> to be e.g. in UTF-8 regardless of the charset (like u8'x' or u8"abc"
> literals are; but then shouldn't the _M_data in that case return a pointer
> to char8_t instead), something else?

Okay then we can always translate to UTF-8.

-Andi
  

Patch

diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 3748ccd49ff3..cc323dc8557a 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -22654,6 +22654,43 @@  cp_parser_using_directive (cp_parser* parser)
   cp_parser_require (parser, CPP_SEMICOLON, RT_SEMICOLON);
 }
 
+/* Parse a string literal or constant expression yielding a string.
+   The constant expression uses extra parens to avoid ambiguity with "x" (expr).
+
+   asm-string-expr:
+     string-literal
+     ( constant-expr ) */
+
+static tree
+cp_parser_asm_string_expression (cp_parser *parser)
+{
+  location_t sloc = cp_lexer_peek_token (parser->lexer)->location;
+
+  if (cp_lexer_next_token_is (parser->lexer, CPP_OPEN_PAREN))
+    {
+      matching_parens parens;
+      parens.consume_open (parser);
+      tree string = cp_parser_constant_expression (parser);
+      if (string != error_mark_node)
+	string = cxx_constant_value (string, tf_error);
+      if (TREE_CODE (string) == NOP_EXPR)
+	string = TREE_OPERAND (string, 0);
+      if (TREE_CODE (string) == ADDR_EXPR && TREE_CODE (TREE_OPERAND (string, 0)) == STRING_CST)
+	string = TREE_OPERAND (string, 0);
+      if (TREE_CODE (string) == VIEW_CONVERT_EXPR)
+	string = TREE_OPERAND (string, 0);
+      if (TREE_CODE (string) != STRING_CST && string != error_mark_node)
+	{
+	  error_at (sloc, "Expected string valued constant expression for %<asm%>, not type %qT",
+			    TREE_TYPE (string));
+	  string = error_mark_node;
+	}
+      parens.require_close (parser);
+      return string;
+    }
+  return cp_parser_string_literal (parser, false, false);
+}
+
 /* Parse an asm-definition.
 
   asm-qualifier:
@@ -22666,19 +22703,19 @@  cp_parser_using_directive (cp_parser* parser)
     asm-qualifier-list asm-qualifier
 
    asm-definition:
-     asm ( string-literal ) ;
+     asm ( constant-expr ) ;
 
    GNU Extension:
 
    asm-definition:
-     asm asm-qualifier-list [opt] ( string-literal ) ;
-     asm asm-qualifier-list [opt] ( string-literal : asm-operand-list [opt] ) ;
-     asm asm-qualifier-list [opt] ( string-literal : asm-operand-list [opt]
+     asm asm-qualifier-list [opt] ( asm-string-expr ) ;
+     asm asm-qualifier-list [opt] ( asm-string-expr : asm-operand-list [opt] ) ;
+     asm asm-qualifier-list [opt] ( asm-string-expr : asm-operand-list [opt]
 				    : asm-operand-list [opt] ) ;
-     asm asm-qualifier-list [opt] ( string-literal : asm-operand-list [opt]
+     asm asm-qualifier-list [opt] ( asm-string-expr : asm-operand-list [opt]
 				    : asm-operand-list [opt]
 			  : asm-clobber-list [opt] ) ;
-     asm asm-qualifier-list [opt] ( string-literal : : asm-operand-list [opt]
+     asm asm-qualifier-list [opt] ( asm-string-expr : : asm-operand-list [opt]
 				    : asm-clobber-list [opt]
 				    : asm-goto-list ) ;
 
@@ -22797,8 +22834,7 @@  cp_parser_asm_definition (cp_parser* parser)
   if (!cp_parser_require (parser, CPP_OPEN_PAREN, RT_OPEN_PAREN))
     return;
   /* Look for the string.  */
-  tree string = cp_parser_string_literal (parser, /*translate=*/false,
-					  /*wide_ok=*/false);
+  tree string = cp_parser_asm_string_expression (parser);
   if (string == error_mark_node)
     {
       cp_parser_skip_to_closing_parenthesis (parser, true, false,
@@ -29352,7 +29388,7 @@  cp_parser_yield_expression (cp_parser* parser)
 /* Parse an (optional) asm-specification.
 
    asm-specification:
-     asm ( string-literal )
+     asm ( asm-string-expr )
 
    If the asm-specification is present, returns a STRING_CST
    corresponding to the string-literal.  Otherwise, returns
@@ -29375,9 +29411,7 @@  cp_parser_asm_specification_opt (cp_parser* parser)
   parens.require_open (parser);
 
   /* Look for the string-literal.  */
-  tree asm_specification = cp_parser_string_literal (parser,
-						     /*translate=*/false,
-						     /*wide_ok=*/false);
+  tree asm_specification = cp_parser_asm_string_expression (parser);
 
   /* Look for the `)'.  */
   parens.require_close (parser);
@@ -29392,8 +29426,8 @@  cp_parser_asm_specification_opt (cp_parser* parser)
      asm-operand-list , asm-operand
 
    asm-operand:
-     string-literal ( expression )
-     [ string-literal ] string-literal ( expression )
+     asm-string-expr ( expression )
+     [ asm-string-expr ] asm-string-expr ( expression )
 
    Returns a TREE_LIST representing the operands.  The TREE_VALUE of
    each node is the expression.  The TREE_PURPOSE is itself a
@@ -29426,10 +29460,8 @@  cp_parser_asm_operand_list (cp_parser* parser)
 	}
       else
 	name = NULL_TREE;
-      /* Look for the string-literal.  */
-      tree string_literal = cp_parser_string_literal (parser,
-						      /*translate=*/false,
-						      /*wide_ok=*/false);
+      /* Look for the string.  */
+      tree string_literal = cp_parser_asm_string_expression (parser);
 
       /* Look for the `('.  */
       matching_parens parens;
@@ -29462,8 +29494,8 @@  cp_parser_asm_operand_list (cp_parser* parser)
 /* Parse an asm-clobber-list.
 
    asm-clobber-list:
-     string-literal
-     asm-clobber-list , string-literal
+     const-expression
+     asm-clobber-list , const-expression
 
    Returns a TREE_LIST, indicating the clobbers in the order that they
    appeared.  The TREE_VALUE of each node is a STRING_CST.  */
@@ -29476,9 +29508,7 @@  cp_parser_asm_clobber_list (cp_parser* parser)
   while (true)
     {
       /* Look for the string literal.  */
-      tree string_literal = cp_parser_string_literal (parser,
-						      /*translate=*/false,
-						      /*wide_ok=*/false);
+      tree string_literal = cp_parser_asm_string_expression (parser);
       /* Add it to the list.  */
       clobbers = tree_cons (NULL_TREE, string_literal, clobbers);
       /* If the next token is not a `,', then the list is
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 0bc586d120e7..230b46fc6c75 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -10589,6 +10589,17 @@  contain any instructions recognized by the assembler, including directives.
 GCC does not parse the assembler instructions themselves and 
 does not know what they mean or even whether they are valid assembler input. 
 
+With gnu++11 or later it can also be a compile time constant expression inside parens.
+
+@example
+constexpr const char *genfoo() @{ return "foo"; @}
+
+void function()
+@{
+  asm((genfoo()));
+@}
+@end example
+
 You may place multiple assembler instructions together in a single @code{asm} 
 string, separated by the characters normally used in assembly code for the 
 system. A combination that works in most places is a newline to break the 
@@ -10739,20 +10750,24 @@  perform a jump to one of the labels listed in the @var{GotoLabels}.
 @item AssemblerTemplate
 This is a literal string that is the template for the assembler code. It is a 
 combination of fixed text and tokens that refer to the input, output, 
-and goto parameters. @xref{AssemblerTemplate}.
+and goto parameters. @xref{AssemblerTemplate}. With gnu++11 or later it can
+also be a constant expression inside parens.
 
 @item OutputOperands
 A comma-separated list of the C variables modified by the instructions in the 
 @var{AssemblerTemplate}.  An empty list is permitted.  @xref{OutputOperands}.
+With gnu++11 or later the strings can also be constant expressions inside parens.
 
 @item InputOperands
 A comma-separated list of C expressions read by the instructions in the 
 @var{AssemblerTemplate}.  An empty list is permitted.  @xref{InputOperands}.
+With gnu++11 or later the strings can also be constant expressions inside parens.
 
 @item Clobbers
 A comma-separated list of registers or other values changed by the 
 @var{AssemblerTemplate}, beyond those listed as outputs.
 An empty list is permitted.  @xref{Clobbers and Scratch Registers}.
+With gnu++11 or later the strings can also be constant expressions inside parens.
 
 @item GotoLabels
 When you are using the @code{goto} form of @code{asm}, this section contains 
diff --git a/gcc/testsuite/g++.dg/constexpr-asm-1.C b/gcc/testsuite/g++.dg/constexpr-asm-1.C
new file mode 100644
index 000000000000..b1305571a2c1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/constexpr-asm-1.C
@@ -0,0 +1,30 @@ 
+/* { dg-do compile } */
+/* { dg-options "-std=gnu++11" } */
+
+constexpr const char *genfoo()
+{
+	return "foo %1,%0";
+}
+
+constexpr const char *genoutput()
+{
+	return "=r";
+}
+
+constexpr const char *geninput()
+{
+	return "r";
+}
+
+constexpr const char *genclobber()
+{
+	return "memory";
+}
+
+void f()
+{
+	int a;
+	asm((genfoo()) : (genoutput()) (a) : (geninput()) (1) : (genclobber()));
+}
+
+/* { dg-final { scan-assembler "foo" } } */