[0/2] Re: [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026]

Message ID 20211102205801.1202228-1-dmalcolm@redhat.com


David Malcolm Nov. 2, 2021, 8:57 p.m. UTC
  On Mon, 2021-11-01 at 12:36 -0400, Marek Polacek via Gcc-patches wrote:
> From a link below:
> "An issue was discovered in the Bidirectional Algorithm in the
> Unicode
> Specification through 14.0. It permits the visual reordering of
> characters via control sequences, which can be used to craft source
> code
> that renders different logic than the logical ordering of tokens
> ingested by compilers and interpreters. Adversaries can leverage this
> to
> encode source code for compilers accepting Unicode such that targeted
> vulnerabilities are introduced invisibly to human reviewers."
> More info:
> https://nvd.nist.gov/vuln/detail/CVE-2021-42574
> https://trojansource.codes/
> This is not a compiler bug.  However, to mitigate the problem, this
> patch
> implements -Wbidirectional=[none|unpaired|any] to warn about possibly
> misleading Unicode bidirectional characters the preprocessor may
> encounter.
> The default is =unpaired, which warns about improperly terminated
> bidirectional characters; e.g. a LRE without its appertaining PDF. 
> The
> level =any warns about any use of bidirectional characters.
> This patch handles both UCNs and UTF-8 characters.  UCNs designating
> bidi characters in identifiers are accepted since r204886.  Then
> r217144
> enabled -fextended-identifiers by default.  Extended characters in
> C/C++
> identifiers have been accepted since r275979.  However, this patch
> still
> warns about mixing UTF-8 and UCN bidi characters; there seems to be
> no
> good reason to allow mixing them.
> We warn in different contexts: comments (both C and C++-style),
> string
> literals, character constants, and identifiers.  Expectedly, UCNs are
> ignored
> in comments and raw string literals.  The bidirectional characters
> can nest
> so this patch handles that as well.
> I have not included nor tested this at all with Fortran (which also
> has
> string literals and line comments).
> Dave M. posted patches improving diagnostic involving Unicode
> characters.
> This patch does not make use of this new infrastructure yet.

Challenge accepted :)

Here are a couple of patches on top of the v1 version of your patch
to make use of that new infrastructure.

The first patch is relatively non-invasive; the second patch reworks
things quite a bit to capture location_t values for the bidirectional
control characters, and use them in the diagnostics, with labelled
ranges, giving e.g.:

$ ./xgcc -B. -S ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c -fdiagnostics-escape-format=bytes
../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c: In function ‘main’:
../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c:5:28: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=]
    5 |     /* Say hello; newline<e2><81><a7>/*/ return 0 ;
      |                          ~~~~~~~~~~~~  ^
      |                          |             |
      |                          |             end of bidirectional context
      |                          U+2067 (RIGHT-TO-LEFT ISOLATE)

There's a more complicated example in the test case.

Not yet bootstrapped, but hopefully gives you some ideas on future
versions of the patch.

Note that the precise location_t values aren't going to make much sense
without the escaping feature [1], and I don't think that's backportable
to GCC 11, so these UX tweaks might be for GCC 12+ only.

Hope this is constructive

[1] what is a "column number" in a line of bidirectional text?  Right now
it's a 1-based offset w.r.t. the logical ordering of the characters, but
respecting tabs and counting certain characters as occupying two columns,
but it's not at all clear to me that there's such a thing as a
"column number" in bidirectional text.

David Malcolm (2):
  Flag CPP_W_BIDIRECTIONAL so that source lines are escaped
  Capture locations of bidi chars and underline ranges

 .../c-c++-common/Wbidirectional-ranges.c      |  54 ++++
 libcpp/lex.c                                  | 254 ++++++++++++++----
 2 files changed, 261 insertions(+), 47 deletions(-)
 create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-ranges.c