Initial implementation of -Whomoglyph [PR preprocessor/103027]

Message ID 20211101211412.1123930-1-dmalcolm@redhat.com

Commit Message

David Malcolm Nov. 1, 2021, 9:14 p.m. UTC
  [Resending to get around mailing list size limit; see notes below]

This patch implements a new -Whomoglyph diagnostic, enabled by default.

Internally it implements the "skeleton" algorithm from:
  http://www.unicode.org/reports/tr39/#Confusable_Detection
so that every new identifier is mapped to a "skeleton"; if that
skeleton is already in use by a different identifier, a -Whomoglyph
diagnostic is issued.
It uses the data from:
  https://www.unicode.org/Public/security/13.0.0/confusables.txt
to determine which characters are confusable.
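
For illustration, the gist of the skeleton computation can be sketched in a
few lines of Python.  This is not the patch's code: unicodedata supplies the
NFD step, and the two-entry confusables map below is a toy stand-in for the
full confusables.txt data.

```python
import unicodedata

# Toy subset of confusables.txt (source char -> exemplar replacement):
CONFUSABLES = {
    '\u041d': 'H',   # CYRILLIC CAPITAL LETTER EN
    '\u0455': 's',   # CYRILLIC SMALL LETTER DZE
}

def tr39_skeleton(ident):
    """TR39 skeleton(X): NFD(X), replace each confusable char, NFD again."""
    nfd = unicodedata.normalize('NFD', ident)
    mapped = ''.join(CONFUSABLES.get(ch, ch) for ch in nfd)
    return unicodedata.normalize('NFD', mapped)

# Two different identifiers sharing a skeleton are confusable:
assert tr39_skeleton('say\u041dello') == tr39_skeleton('sayHello')
```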

For example, given the example of CVE-2021-42694 at
https://trojansource.codes/, with this patch we emit:

t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
    7 | void say<U+041D>ello() {
      | ^~~~
t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
    3 | void sayHello() {
      | ^~~~

(the precise location of the token isn't quite right; the
identifiers should be underlined, rather than the "void" tokens)

This takes advantage of:
  "diagnostics: escape non-ASCII source bytes for certain diagnostics"
    https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
to escape non-ASCII characters when printing a source line for -Whomoglyph,
so that we print "say<U+041D>ello" when quoting the source line, making it
clearer that this is not "sayHello".

In order to implement "skeleton", I had to implement NFD support, so the
patch also contains some UTF-32 support code.
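
For reference, NFD (Unicode Normalization Form D, canonical decomposition)
splits precomposed characters into a base character plus combining marks and
sorts combining marks by canonical combining class; Python's unicodedata
module can serve as an oracle for the expected results:

```python
import unicodedata

# U+00E9 (e with acute) decomposes to U+0065 (e) + U+0301 (combining acute):
assert unicodedata.normalize('NFD', '\u00e9') == 'e\u0301'

# Combining marks are sorted by canonical combining class:
# U+0316 (ccc 220) must come before U+0301 (ccc 230).
assert unicodedata.normalize('NFD', 'a\u0301\u0316') == 'a\u0316\u0301'
```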

Known issues:
- I'm doing an extra hash_table lookup on every identifier lookup.
  I haven't yet measured the impact on the speed of the compiler.
  If this is an issue, is there a good place to stash an extra
  pointer in every identifier?
- doesn't yet bootstrap, as the confusables.txt data contains ASCII
  to ASCII confusables, leading to warnings such as:
../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] [-Whomoglyph]
../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier ‘OPT_I’ (‘OPT_I’) here
  Perhaps the option should have levels, where we don't complain about
  pure ASCII confusables at the default level?
- no docs yet
- as noted above the location_t of the token isn't quite right
- map_identifier_to_skeleton and map_skeleton_to_first_use aren't
  yet integrated with the garbage collector
- some other FIXMEs in the patch

[I had to trim the patch for space to get it to pass the size filter on the
mailing list; I trimmed:
  contrib/unicode/confusables.txt, 
  gcc/testsuite/selftests/NormalizationTest.txt
which can be downloaded from the URLs in the ChangeLog, and:
  gcc/confusables.inc
  gcc/decomposition.inc
which can be generated using the scripts in the patch ]

Thoughts?
Dave

contrib/ChangeLog:
	PR preprocessor/103027
	* unicode/confusables.txt: New file, downloaded from
	https://www.unicode.org/Public/security/13.0.0/confusables.txt
	* unicode/gen-confusables-inc.py: New file.
	* unicode/gen-decomposition-inc.py: New file.

gcc/ChangeLog:
	PR preprocessor/103027
	* common.opt (Whomoglyph): New.
	* confusables.inc: New file, generated by
	contrib/unicode/gen-confusables-inc.py.
	* decomposition.inc: New file, generated by
	contrib/unicode/gen-decomposition-inc.py.
	* selftest-run-tests.c (selftest::run_tests): Call
	stringpool_c_tests.
	* selftest.c (note_formatted): New.
	* selftest.h (note_formatted): New decl.
	(stringpool_c_tests): New decl.
	* stringpool.c: Include "cpplib.h", "pretty-print.h",
	"selftest.h", "diagnostic.h", "diagnostic-metadata.h",
	"gcc-rich-location.h"
	(init_stringpool): Set on_new_node and on_existing_node callbacks.
	(print_escaped_codepoint): New.
	(parse_hex): New.
	(class utf32_string): New.
	(class utf8_string): New.
	(utf32_string::append_decomposition): New.
	(utf32_string::convert_to_nfd): New.
	(cmp_combining_class): New.
	(utf32_string::sort_by_combining_class): New.
	(convert_homoglyphs_to_exemplars): New.
	(get_tr39_skeleton): New.
	(struct first_use): New.
	(map_identifier_to_skeleton): New.
	(map_skeleton_to_first_use): New.
	(class escaped_identifier): New.
	(maybe_warn_on_homoglyph): New.
	(stringpool_on_new_hashnode): New.
	(stringpool_on_existing_hashnode): New.
	(selftest::assert_dump_eq): New.
	(ASSERT_DUMP_EQ): New.
	(selftest::test_utf32_from_utf8): New.
	(selftest::test_from_hex): New.
	(selftest::test_combining_classes): New.
	(selftest::assert_utf32_eq_at): New.
	(ASSERT_UTF32_EQ_AT): New.
	(selftest::test_normalization_line): New.
	(selftest::test_normalization): New.
	(selftest::test_tr39_skeleton_1): New.
	(selftest::test_tr39_skeleton_2): New.
	(selftest::test_tr39_skeleton_3): New.
	(selftest::test_tr39_skeleton_4): New.
	(selftest::test_tr39_skeleton_5): New.
	(selftest::stringpool_c_tests): New.

gcc/testsuite/ChangeLog:
	PR preprocessor/103027
	* c-c++-common/Whomoglyph-1.c: New test.
	* c-c++-common/Whomoglyph-2.c: New test.
	* c-c++-common/Whomoglyph-3.c: New test.
	* selftests/NormalizationTest.txt: New file, downloaded from
	https://www.unicode.org/Public/13.0.0/ucd/NormalizationTest.txt

libcpp/ChangeLog:
	PR preprocessor/103027
	* charset.c (cpp_combining_class): New.
	* include/cpplib.h (cpp_warning_at): New decl.
	(cpp_combining_class): New decl.
	* include/symtab.h (ht::on_new_node): New callback field.
	(ht::on_existing_node): New callback field.
	* symtab.c (ht_lookup_with_hash): Call the new callbacks.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
---
 contrib/unicode/confusables.txt               |  9638 +++++
 contrib/unicode/gen-confusables-inc.py        |   120 +
 contrib/unicode/gen-decomposition-inc.py      |    41 +
 gcc/common.opt                                |     4 +
 gcc/confusables.inc                           | 34266 ++++++++++++++++
 gcc/decomposition.inc                         | 11337 +++++
 gcc/selftest-run-tests.c                      |     1 +
 gcc/selftest.c                                |    16 +
 gcc/selftest.h                                |     7 +
 gcc/stringpool.c                              |  1199 +
 gcc/testsuite/c-c++-common/Whomoglyph-1.c     |    18 +
 gcc/testsuite/c-c++-common/Whomoglyph-2.c     |    18 +
 gcc/testsuite/c-c++-common/Whomoglyph-3.c     |    20 +
 gcc/testsuite/selftests/NormalizationTest.txt | 18908 +++++++++
 libcpp/charset.c                              |    23 +
 libcpp/include/cpplib.h                       |     5 +
 libcpp/include/symtab.h                       |     4 +
 libcpp/symtab.c                               |    15 +-
 18 files changed, 75638 insertions(+), 2 deletions(-)
 create mode 100644 contrib/unicode/confusables.txt
 create mode 100644 contrib/unicode/gen-confusables-inc.py
 create mode 100644 contrib/unicode/gen-decomposition-inc.py
 create mode 100644 gcc/confusables.inc
 create mode 100644 gcc/decomposition.inc
 create mode 100644 gcc/testsuite/c-c++-common/Whomoglyph-1.c
 create mode 100644 gcc/testsuite/c-c++-common/Whomoglyph-2.c
 create mode 100644 gcc/testsuite/c-c++-common/Whomoglyph-3.c
 create mode 100644 gcc/testsuite/selftests/NormalizationTest.txt
  

Comments

Jakub Jelinek Nov. 2, 2021, 11:56 a.m. UTC | #1
On Mon, Nov 01, 2021 at 05:14:12PM -0400, David Malcolm via Gcc-patches wrote:
> Thoughts?

Resending my previously internally posted mail:

Thanks for working on this, but I'm afraid this is done at the wrong
location.

Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
On the Whomoglyph1.C testcase, I'd expect a warning, because there is clear
confusion for the reader, something that isn't visible in any of the emacs,
vim or joe editors or on the terminal: when f3 uses the scope identifier,
the casual reader will expect that it uses N1::N2::scope, but there is no
such variable, only N1::N2::ѕсоре, which visually looks the same but has
different UTF-8 chars in it.  So, name lookup will instead find N1::scope
and use that.
But Whomoglyph2.C will emit warnings that are IMHO not appropriate; I
believe there is no confusion at all there.  E.g., in the f5/f6 case, for
both C and C++, it doesn't really matter how each function names its own
parameter; one can never access another function's parameter.  Ditto for
different namespaces, provided that both namespaces aren't searched in the
same name lookup, and similarly for classes etc.
So, IMNSHO that warning belongs in name lookup (cp/name-lookup.c for the C++
FE).
And, another important thing is that most users don't really use Unicode in
identifiers; I bet over 99.9% of identifiers don't have any >= 0x80
characters in them, and even when people do use them, confusable identifiers
during the same lookup are far more unlikely still.
So, I think we should optimize for the common case, ASCII-only identifiers,
and spend as little compile time as possible on this stuff.

Another thing we've discussed on IRC is that the Unicode confusables.txt has
some canonicalizations that I think most users will not appreciate, or at
least not appreciate being warned about by default.  In particular, the
canonicalizations from ASCII to ASCII:
0 -> O
1 -> l
I -> l
m -> rn
I think most users use monospace fonts when working on C/C++ etc. source
code, anything else ruins indentation (ok, except for projects that indent
everything by tabs only).  In monospace fonts, I think m can't be confused
with rn.  And in most monospace fonts, 0 and O and 1/l/I are also fairly
well distinguishable.  So, I think we should warn about f00 vs. fOO or
home vs. horne or f1 vs. fl vs. fI only with -Whomoglyph=2, a non-default
extra level, rather than by default.
But, given the way confusables.txt is generated, while it is probably
easy to undo the m -> rn canonicalizations (assume everything that maps
to rn actually maps to m), for the 0 vs. O or 1 vs. l vs. I cases I'm afraid
trying to differentiate what a character actually maps to will be harder.
So, my proposal would be, by default, not to warn for the 0 vs. O, 1 vs. l
vs. I and m vs. rn differences if neither of the identifiers contains any
UTF-8 chars, and to warn otherwise.

To spend as little compile time as possible on this, I think libcpp
knows well whether a UTF-8 or UCN character has been encountered in an
identifier (it uses the libcpp/charset.c entrypoints then), so perhaps we
could mark the tokens with a new flag that stands for "identifier contains
at least one UTF-8 (non-ASCII) or UCN character".  Or, failing that, perhaps
just quickly scanning the identifier for characters with the MSB set might
be tolerable too.
So much for how to identify the identifiers for which we should compute the
canonical homoglyph form at all.
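
The gating described above could be sketched roughly as follows (a
hypothetical Python model, not GCC code; the names warrants_skeleton_check
and level are illustrative):

```python
def has_non_ascii(ident: bytes) -> bool:
    # Quick scan for bytes with the MSB set (UTF-8 lead/continuation bytes):
    return any(b & 0x80 for b in ident)

def warrants_skeleton_check(id_a: bytes, id_b: bytes, level: int = 1) -> bool:
    """At the default level, skip pure-ASCII pairs entirely; a hypothetical
       -Whomoglyph=2 would also consider the ASCII-to-ASCII confusables
       (0/O, 1/l/I, m/rn)."""
    if level >= 2:
        return True
    return has_non_ascii(id_a) or has_non_ascii(id_b)

assert not warrants_skeleton_check(b'home', b'horne')        # pure ASCII pair
assert warrants_skeleton_check('say\u041dello'.encode(), b'sayHello')
assert warrants_skeleton_check(b'f1', b'fl', level=2)
```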
As for how to store it: again, as we should IMHO optimize for the usual case
where even identifiers with UTF-8/UCN characters canonicalize to themselves,
I think the best representation would be not to waste a whole pointer on
each IDENTIFIER_NODE for it, but instead to reserve one bit on the
identifiers, "identifier has a canonical homoglyph version different from
itself"; that bit would mean we have some hash map etc. on the side that
maps the identifier to its canonical form.
As for the actual uses of the confusables.txt transformations, do you think
we need to work on UTF-32?  Can't it all be done on UTF-8 instead?  Let
whatever confusables.txt -> something.h generator we write prepare a
decision function that uses UTF-8 both in what to replace and, certainly,
just a UTF-8 string literal in what to replace it with.
Note, the above would mean we don't compute them for those 0 -> O, [1I] -> l
or m -> rn canonicalizations for the default -Whomoglyph mode.
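
Doing the replacement directly on UTF-8, as suggested above, might look
roughly like this (a hypothetical sketch; the two table entries are
illustrative, and the NFD step is omitted):

```python
# Both the keys and the replacements are UTF-8 byte strings:
UTF8_CONFUSABLES = {
    '\u041d'.encode('utf-8'): b'H',   # CYRILLIC CAPITAL LETTER EN
    '\u0455'.encode('utf-8'): b's',   # CYRILLIC SMALL LETTER DZE
}

def map_confusables_utf8(ident: bytes) -> bytes:
    """Replace confusable sequences with no UTF-32 round-trip."""
    out = bytearray()
    i = 0
    while i < len(ident):
        for src, dst in UTF8_CONFUSABLES.items():
            if ident.startswith(src, i):
                out += dst
                i += len(src)
                break
        else:
            out.append(ident[i])   # ASCII or unmapped byte: copy as-is
            i += 1
    return bytes(out)

assert map_confusables_utf8('say\u041dello'.encode()) == b'sayHello'
```

A real generator would presumably emit this decision logic as a C function
or a trie over the UTF-8 bytes rather than a dictionary scan.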
Perhaps we can use some bit/flag on the C FE scopes or C++ namespaces
and on the identifiers we map to through the hash map.  On the scopes etc.
it would mean this scope has some identifiers in it that have homoglyph
alternatives, and that the homoglyph canonical forms have already been added
wherever name lookup can find them.  On the canonical forms it would perhaps
mean this canonical form has some O, l or rn sequences in it.
Then, during actual name lookup: if the current identifier doesn't have the
alt-homoglyph-canonicalization flag set and the scope doesn't have the new
bit set either, there would be nothing further to do; for a name lookup
where the identifier we search for has the new bit set and the scope has it
too, we'd do two lookups, one normal and one for the alt homoglyph canonical
form; and if the scope doesn't have the new bit set, we'd perform whatever
is needed to figure that out.
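
The flow of that scheme might be modelled roughly like this (a heavily
simplified, hypothetical Python sketch; Scope, bind and canon_index are
illustrative names, not actual front-end data structures):

```python
class Scope:
    def __init__(self):
        self.bindings = {}          # identifier -> declaration
        self.canon_index = {}       # canonical homoglyph form -> identifier
        self.has_homoglyph_alts = False   # the proposed per-scope bit

def bind(scope, ident, decl, canonical=None):
    """Bind IDENT; if it has a distinct canonical homoglyph form, index it."""
    scope.bindings[ident] = decl
    if canonical is not None and canonical != ident:
        scope.canon_index[canonical] = ident
        scope.has_homoglyph_alts = True

def lookup(scope, ident, canonical=None):
    # Fast path: unflagged identifier, scope has no homoglyph alternatives.
    if canonical is None and not scope.has_homoglyph_alts:
        return scope.bindings.get(ident)
    # Slow path: normal lookup, plus a lookup of the canonical form.
    hit = scope.bindings.get(ident)
    if hit is None:
        other = scope.canon_index.get(canonical or ident)
        if other is not None:
            print("warning: %r is confusable with %r" % (ident, other))
    return hit

sc = Scope()
cyr = '\u0455\u0441\u043e\u0440\u0435'   # Cyrillic, looks like 'scope'
bind(sc, cyr, 'decl', canonical='scope')
assert lookup(sc, 'scope') is None        # warns: confusable with cyr
```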

	Jakub
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope; // ASCII
  int f1 () { return scope++; } // ASCII
  namespace N2
  {
    int ѕсоре; // Cyrillic
    int f2 () { return ѕсоре++; } // Cyrillic
    int f3 () { return scope++; } // ASCII	{ dg-warning "Whomoglyph" }
  }
}
// { dg-do compile }
// { dg-options "-Whomoglyph" }

namespace N1
{
  int scope; // ASCII
  int f1 () { return scope++; } // ASCII
}
namespace N2
{
  int ѕсоре; // Cyrillic	{ dg-bogus "Whomoglyph" }
  int f2 () { return ѕсоре++; } // Cyrillic
}
struct S1
{
  constexpr static int Hello = 1; // ASCII
  int f3 () { return Hello; } // ASCII
};
struct S2
{
  constexpr static int Ηello = 2; // Greek+ASCII	{ dg-bogus "Whomoglyph" }
  int f4 () { return Ηello; } // Greek+ASCII
};
int f5 (int s) // ASCII
{
  return s; // ASCII
}
int f6 (int ѕ) // Cyrillic	{ dg-bogus "Whomoglyph" }
{
  return ѕ; // Cyrillic
}
  
Jakub Jelinek Nov. 2, 2021, 12:06 p.m. UTC | #2
On Tue, Nov 02, 2021 at 12:56:53PM +0100, Jakub Jelinek wrote:
> Consider attached testcases Whomoglyph1.C and Whomoglyph2.C.
> On Whomoglyph1.C testcase, I'd expect a warning, because there is a clear
> confusion for the reader, something that isn't visible in any of emacs, vim,
> joe editors or on the terminal, when f3 uses scope identifier, the casual
> reader will expect that it uses N1::N2::scope, but there is no such
> variable, only one N1::N2::ѕсоре that visually looks the same, but has
> different UTF-8 chars in it.  So, name lookup will instead find N1::scope
> and use that.
> But Whomoglyph2.C will emit warnings that are IMHO not appropriate,
> I believe there is no confusion at all there, e.g. for both C and C++,
> the f5/f6 case, it doesn't really matter how each of the function names its
> own parameter, one can never access another function's parameter.
> Ditto for different namespace provided that both namespaces aren't searched
> in the same name lookup, or similarly classes etc.
> So, IMNSHO that warning belongs to name-lookup (cp/name-lookup.c for the C++
> FE).
> And, another important thing is that most users don't really use unicode in
> identifiers, I bet over 99.9% of identifiers don't have any >= 0x80
> characters in it and even when people do use them, confusable identifiers
> during the same lookup are even far more unlikely.
> So, I think we should optimize for the common case, ASCII only identifiers
> and spend as little compile time as possible on this stuff.

If we keep doing it in the stringpool, then e.g. one couldn't
#include <zlib.h>
in a program with Russian/Ukrainian/Serbian etc. identifiers where some parameter
or automatic variable etc. in some function in that file is called
с (Cyrillic letter es), etc. just because in zlib.h one of the arguments
in one of the function prototypes is called c (latin small letter c).
I'd be afraid that most of the users who actually want to use UTF-8 or UCNs
in their identifiers would then just have to disable this warning...

	Jakub
  
Martin Sebor Nov. 2, 2021, 7:49 p.m. UTC | #3
On 11/1/21 3:14 PM, David Malcolm via Gcc-patches wrote:
> [Resending to get around mailing list size limit; see notes below]
> 
> This patch implements a new -Whomoglyph diagnostic, enabled by default.
> 
> Internally it implements the "skeleton" algorithm from:
>    http://www.unicode.org/reports/tr39/#Confusable_Detection
> so that every new identifier is mapped to a "skeleton", and if
> the skeleton is already in use by a different identifier, issue
> a -Whomoglyph diagnostic.
> It uses the data from:
>    https://www.unicode.org/Public/security/13.0.0/confusables.txt
> to determine which characters are confusable.
> 
> For example, given the example of CVE-2021-42694 at
> https://trojansource.codes/, with this patch we emit:
> 
> t.cc:7:1: warning: identifier ‘sayНello’ (‘say\u041dello’)... [CWE-1007] [-Whomoglyph]
>      7 | void say<U+041D>ello() {
>        | ^~~~
> t.cc:3:1: note: ...confusable with non-equal identifier ‘sayHello’ here
>      3 | void sayHello() {
>        | ^~~~
> 
> (the precise location of the token isn't quite right; the
> identifiers should be underlined, rather than the "void" tokens)
> 
> This takes advantage of:
>    "diagnostics: escape non-ASCII source bytes for certain diagnostics"
>      https://gcc.gnu.org/pipermail/gcc-patches/2021-November/583020.html
> to escape non-ASCII characters when printing a source line for -Whomoglyph,
> so that we print "say<U+041D>ello" when quoting the source line, making it
> clearer that this is not "sayHello".
> 
> In order to implement "skeleton", I had to implement NFD support, so the
> patch also contains some UTF-32 support code.
> 
> Known issues:
> - I'm doing an extra hash_table lookup on every identifier lookup.
>    I haven't yet measured the impact on the speed of the compiler.
>    If this is an issue, is there a good place to stash an extra
>    pointer in every identifier?
> - doesn't yet bootstrap, as the confusables.txt data contains ASCII
>    to ASCII confusables, leading to warnings such as:
> ../../.././gcc/options.h:11273:3: warning: identifier ‘OPT_l’... [CWE-1007] [-Whomoglyph]
> ../../.././gcc/options.h:9959:3: note: ...confusable with non-equal identifier ‘OPT_I’ (‘OPT_I’) here
>    Perhaps the option should have levels, where we don't complain about
>    pure ASCII confusables at the default level?
> - no docs yet
> - as noted above the location_t of the token isn't quite right
> - map_identifier_to_skeleton and map_skeleton_to_first_use aren't
>    yet integrated with the garbage collector
> - some other FIXMEs in the patch
> 
> [I had to trim the patch for space to get it to pass the size filter on the
> mailing list; I trimmed:
>    contrib/unicode/confusables.txt,
>    gcc/testsuite/selftests/NormalizationTest.txt
> which can be downloaded from the URLs in the ChangeLog, and:
>    gcc/confusables.inc
>    gcc/decomposition.inc
> which can be generated using the scripts in the patch ]
> 
> Thoughts?

None from me on the actual feature -- even after our discussion
this morning I remain comfortably ignorant of the problem :)
I just have a quick comment on the two new string classes:

...
> +
> +/* A class for manipulating UTF-32 strings.  */
> +
> +class utf32_string
> +{
...
> + private:
...
> +  cppchar_t *m_buf;
> +  size_t m_alloc_len;
> +  size_t m_len;
> +};
> +
> +/* A class for constructing UTF-8 encoded strings.
> +   These are not NUL-terminated.  */
> +
> +class utf8_string
> +{
...
> + private:
> +  uchar  *m_buf;
> +  size_t m_alloc_sz;
> +  size_t m_len;
> +};

There are container abstractions both in C++ and in GCC that
these classes look like they could be implemented in terms of:
I'm thinking of std::string, std::vector, vec, and auto_vec.
They have the additional advantage of being safely copyable
and assignable, and of course, of having already been tested.
I see that the classes in your patch provide additional
functionality that the abstractions don't.  I'd expect it
be doable on top of the abstractions and without
reimplementing all the basic buffer management.

Martin
  

Patch

diff --git a/contrib/unicode/gen-confusables-inc.py b/contrib/unicode/gen-confusables-inc.py
new file mode 100644
index 00000000000..a19117ee6de
--- /dev/null
+++ b/contrib/unicode/gen-confusables-inc.py
@@ -0,0 +1,120 @@ 
+#!/usr/bin/env python3
+#
+# Script to generate confusables.inc from confusables.txt
+#
+# This file is part of GCC.
+#
+# GCC is free software; you can redistribute it and/or modify it under
+# the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 3, or (at your option) any later
+# version.
+#
+# GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+# WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+# for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with GCC; see the file COPYING3.  If not see
+# <http://www.gnu.org/licenses/>.
+
+#from pprint import pprint
+from collections import namedtuple
+import re
+import sys
+
+class Confusable(namedtuple('Confusable', ['src_chr', 'dst_chrs', 'comment'])):
+    pass
+
+class Confusables:
+    def __init__(self):
+        self.date = None
+        self.version = None
+        self.items = []
+
+    @staticmethod
+    def from_stream(f_in):
+        c = Confusables()
+        for line in f_in.readlines():
+            c.parse_line(line)
+        return c
+
+    def parse_line(self, line):
+        """
+        Parse a line of confusables.txt as specified in
+        http://www.unicode.org/reports/tr39/#Confusable_Detection
+        """
+        #print(repr(line))
+        m = re.match(r'# Date: (.+)', line)
+        if m:
+            self.date = m.group(1)
+            return
+        hexdigits_group = r'([0-9A-F]+)'
+        hexdigits_groups = r'([0-9A-F ]+)'
+        opt_ws = r'\s*'
+        pattern = ('^' + hexdigits_group + opt_ws + ';'
+                   + opt_ws + hexdigits_groups + opt_ws + ';'
+                   + opt_ws + 'MA' + opt_ws + '#(.*)$')
+        m = re.match(pattern, line)
+        if m:
+            # Convert from hex-encoded codepoint to a unicode character:
+            src_chr = chr(int(m.group(1), 16))
+
+            # Convert from space-separated hex-encoded codepoints to a
+            # unicode string:
+            dst_chrs = ''.join([chr(int(hstr, 16))
+                                for hstr in m.group(2).split()])
+            comment = m.group(3)
+            self.items.append(Confusable(src_chr, dst_chrs, comment))
+            #print(repr(self.items[-1]))
+            return
+        # Verify that we parsed all items
+        m = re.match(r'# total: ([0-9]+)', line)
+        if m:
+            total = int(m.group(1))
+            if total != len(self.items):
+                raise ValueError('mismatching total: %i; len(self.items): %i'
+                                 % (total, len(self.items)))
+
+    def write_as_inc(self, f_out):
+        f_out.write('/* Generated from unicode confusables.txt\n'
+                    '   Date: %s\n'
+                    '   Version: %s */\n' % (self.date, self.version))
+        f_out.write('/* Define the following macros:\n'
+                    '     BEGIN_SRC(CODEPOINT) : a handler for the codepoint.\n'
+                    '     DST(CODEPOINT): used to emit each destination codepoint.\n'
+                    '     END_SRC: finish this src codepoint.\n  */')
+        for item in self.items:
+            f_out.write('/* %s */\n' % item.comment)
+
+            # Verify that the only remappings we see from ASCII are the
+            # ones we expect
+            if (ord(item.src_chr) < 0x80
+                # U+0022 QUOTATION MARK
+                and ord(item.src_chr) != 0x22
+                # U+0025 PERCENT SIGN
+                and ord(item.src_chr) != 0x25
+                # U+0030 DIGIT ZERO
+                and ord(item.src_chr) != 0x30
+                # U+0031 DIGIT ONE
+                and ord(item.src_chr) != 0x31
+                # U+0049 LATIN CAPITAL LETTER I
+                and ord(item.src_chr) != 0x49
+                # U+0060 GRAVE ACCENT
+                and ord(item.src_chr) != 0x60
+                # U+006D LATIN SMALL LETTER M
+                and ord(item.src_chr) != 0x6d
+                # U+007C VERTICAL LINE
+                and ord(item.src_chr) != 0x7c):
+                raise ValueError('unexpected remapping from ASCII (0x%x)\n'
+                                 % (ord(item.src_chr)))
+
+            f_out.write('BEGIN_SRC (0x%x)\n' % (ord(item.src_chr)))
+            for dst_chr in item.dst_chrs:
+                f_out.write('  DST (0x%x);\n' % ord(dst_chr))
+            f_out.write('END_SRC\n\n')
+
+with open('confusables.txt') as f_in:
+    c = Confusables.from_stream(f_in)
+with open('../../gcc/confusables.inc', 'w') as f_out:
+    c.write_as_inc(f_out)
diff --git a/contrib/unicode/gen-decomposition-inc.py b/contrib/unicode/gen-decomposition-inc.py
new file mode 100644
index 00000000000..a50771dfd9d
--- /dev/null
+++ b/contrib/unicode/gen-decomposition-inc.py
@@ -0,0 +1,41 @@ 
+from pprint import pprint
+import re
+
+from from_glibc.unicode_utils import fill_attributes, UNICODE_ATTRIBUTES
+
+def parse_decomp(buf):
+    result = []
+    for hexdigits_group in buf.split():
+        ch = chr(int(hexdigits_group, 16))
+        result.append(ch)
+    return result
+
+fill_attributes('UnicodeData.txt')
+
+with open('../../gcc/decomposition.inc', 'w') as f_out:
+    f_out.write('/* Generated from UnicodeData.txt\n'
+                '   Define the following macros:\n'
+                '     BEGIN_SRC(CODEPOINT) : a handler for the codepoint.\n'
+                '     DST(CODEPOINT): used to emit each destination codepoint.\n'
+                '     END_SRC: finish this src codepoint.  */\n\n')
+
+    for codepoint in UNICODE_ATTRIBUTES:
+        attribs = UNICODE_ATTRIBUTES[codepoint]
+        if 0:
+            pprint(attribs)
+        if attribs['decomposition']:
+            #print('%r: %r: %r' % (codepoint, attribs['name'], attribs['decomposition']))
+            m = re.match(r'(<[a-zA-Z]+>)(.*)', attribs['decomposition'])
+            if m:
+                # print('has tag: %r' % m.group(1))
+                #compat_decomp = parse_decomp(m.group(2))
+                # Tagged decomposition mappings are compatibility; skip these
+                continue
+            else:
+                # Untagged decomposition mappings are canonical; write these
+                decomp = parse_decomp(attribs['decomposition'])
+            f_out.write('/* %s */\n' % attribs['name'])
+            f_out.write('BEGIN_SRC (0x%x)\n' % codepoint)
+            for dst_chr in decomp:
+                f_out.write('  DST (0x%x);\n' % ord(dst_chr))
+            f_out.write('END_SRC\n\n')
diff --git a/gcc/common.opt b/gcc/common.opt
index 1a5b9bfcca9..5cc73be0513 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -618,6 +618,10 @@  Wfree-nonheap-object
 Common Var(warn_free_nonheap_object) Init(1) Warning
 Warn when attempting to free a non-heap object.
 
+Whomoglyph
+Common Var(warn_homoglyph) Init(1) Warning
+Warn when an identifier has the same appearance as a different identifier.
+
 Whsa
 Common Ignore Warning
 Does nothing.  Preserved for backward compatibility.
diff --git a/gcc/selftest-run-tests.c b/gcc/selftest-run-tests.c
index 6a8f291f5dd..0eaff1afb08 100644
--- a/gcc/selftest-run-tests.c
+++ b/gcc/selftest-run-tests.c
@@ -80,6 +80,7 @@  selftest::run_tests ()
   opt_problem_cc_tests ();
   ordered_hash_map_tests_cc_tests ();
   splay_tree_cc_tests ();
+  stringpool_c_tests ();
 
   /* Mid-level data structures.  */
   input_c_tests ();
diff --git a/gcc/selftest.c b/gcc/selftest.c
index 13ec4f0f61c..07625c1be7e 100644
--- a/gcc/selftest.c
+++ b/gcc/selftest.c
@@ -63,6 +63,22 @@  fail_formatted (const location &loc, const char *fmt, ...)
   abort ();
 }
 
+/* As "fail_formatted", but do not abort.
+   Use this *before* failing to provide extra information.  */
+
+void
+note_formatted (const location &loc, const char *fmt, ...)
+{
+  va_list ap;
+
+  fprintf (stderr, "%s:%i: %s: NOTE: ", loc.m_file, loc.m_line,
+	   loc.m_function);
+  va_start (ap, fmt);
+  vfprintf (stderr, fmt, ap);
+  va_end (ap);
+  fprintf (stderr, "\n");
+}
+
 /* Implementation detail of ASSERT_STREQ.
    Compare val1 and val2 with strcmp.  They ought
    to be non-NULL; fail gracefully if either or both are NULL.  */
diff --git a/gcc/selftest.h b/gcc/selftest.h
index 24ef57cb6cc..2e069e3e1e4 100644
--- a/gcc/selftest.h
+++ b/gcc/selftest.h
@@ -65,6 +65,12 @@  extern void fail (const location &loc, const char *msg)
 extern void fail_formatted (const location &loc, const char *fmt, ...)
   ATTRIBUTE_PRINTF_2 ATTRIBUTE_NORETURN;
 
+/* As "fail_formatted", but do not abort.
+   Use this *before* failing to provide extra information.  */
+
+extern void note_formatted (const location &loc, const char *fmt, ...)
+  ATTRIBUTE_PRINTF_2;
+
 /* Implementation detail of ASSERT_STREQ.  */
 
 extern void assert_streq (const location &loc,
@@ -257,6 +263,7 @@  extern void spellcheck_tree_c_tests ();
 extern void splay_tree_cc_tests ();
 extern void sreal_c_tests ();
 extern void store_merging_c_tests ();
+extern void stringpool_c_tests ();
 extern void tree_c_tests ();
 extern void tree_cfg_c_tests ();
 extern void tree_diagnostic_path_cc_tests ();
diff --git a/gcc/stringpool.c b/gcc/stringpool.c
index 2f21466af24..843701310f2 100644
--- a/gcc/stringpool.c
+++ b/gcc/stringpool.c
@@ -29,11 +29,20 @@  along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "tree.h"
+#include "cpplib.h"
+#include "pretty-print.h"
+#include "selftest.h"
+#include "diagnostic.h"
+#include "diagnostic-metadata.h"
+#include "gcc-rich-location.h"
 
 struct ht *ident_hash;
 
 static hashnode alloc_node (cpp_hash_table *);
+static void stringpool_on_new_hashnode (hashnode);
+static void stringpool_on_existing_hashnode (hashnode);
 static int mark_ident (struct cpp_reader *, hashnode, const void *);
+static void maybe_warn_on_homoglyph (tree id);
 
 static void *
 stringpool_ggc_alloc (size_t x)
@@ -54,6 +63,8 @@  init_stringpool (void)
   ident_hash = ht_create (14);
   ident_hash->alloc_node = alloc_node;
   ident_hash->alloc_subobject = stringpool_ggc_alloc;
+  ident_hash->on_new_node = stringpool_on_new_hashnode;
+  ident_hash->on_existing_node = stringpool_on_existing_hashnode;
 }
 
 /* Allocate a hash node.  */
@@ -83,6 +94,557 @@  ggc_alloc_string (const char *contents, int length MEM_STAT_DECL)
   return (const char *) result;
 }
 
+/* Print CH to PP; if non-ASCII or non-printable, escape it as \uXXXX
+   or \UXXXXXXXX where X are hexadecimal digits.  */
+
+static void
+print_escaped_codepoint (pretty_printer *pp, cppchar_t ch)
+{
+  if (ch < 0x80 && ISPRINT (ch))
+    pp_character (pp, ch);
+  else if (ch <= 0xffff)
+    {
+      pp_string (pp, "\\u");
+      for (int j = 3; j >= 0; j--)
+	pp_character (pp, "0123456789abcdef"[(ch >> (4 * j)) & 0xF]);
+    }
+  else
+    {
+      pp_string (pp, "\\U");
+      for (int j = 7; j >= 0; j--)
+	pp_character (pp, "0123456789abcdef"[(ch >> (4 * j)) & 0xF]);
+    }
+}
+
+/* Attempt to parse a hexadecimal number from BUF of up to length LEN,
+   skipping leading non-digits, and potentially skipping trailing non-digits.
+   This is the format within columns of
+   https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+
+   Return the number of bytes consumed.
+   Write the result (if any) to *OUT.  */
+
+static size_t
+parse_hex (const char *buf, size_t len, cppchar_t *out)
+{
+  size_t chars_consumed = 0;
+  cppchar_t ch = 0;
+  unsigned num_digits = 0;
+  while (len > 0)
+    {
+      unsigned digit = 0;
+      switch (buf[0])
+	{
+	default:
+	  /* Not a hex digit.
+	     Skip until we see a hex digit, then consume digits.  */
+	  buf++;
+	  chars_consumed++;
+	  len--;
+	  if (num_digits == 0)
+	    continue;
+	  else
+	    /* First non-hex-digit after a hex digit.
+	       Terminate.  */
+	    {
+	      *out = ch;
+	      return chars_consumed;
+	    }
+
+	/* Hex digits.  */
+	case '0': digit = 0; break;
+	case '1': digit = 1; break;
+	case '2': digit = 2; break;
+	case '3': digit = 3; break;
+	case '4': digit = 4; break;
+	case '5': digit = 5; break;
+	case '6': digit = 6; break;
+	case '7': digit = 7; break;
+	case '8': digit = 8; break;
+	case '9': digit = 9; break;
+	case 'a': case 'A': digit = 10; break;
+	case 'b': case 'B': digit = 11; break;
+	case 'c': case 'C': digit = 12; break;
+	case 'd': case 'D': digit = 13; break;
+	case 'e': case 'E': digit = 14; break;
+	case 'f': case 'F': digit = 15; break;
+	}
+      /* Handling hex digits.  */
+      ch <<= 4;
+      ch += digit;
+      num_digits++;
+      buf++;
+      chars_consumed++;
+      len--;
+    }
+  *out = ch;
+  return chars_consumed;
+}
+
+/* A class for manipulating UTF-32 strings.  */
+
+class utf32_string
+{
+ public:
+  /* Construct an empty utf32_string, preallocated to hold ALLOC_LEN
+     codepoints.  */
+  utf32_string (size_t alloc_len)
+  : m_buf ((cppchar_t *)xmalloc (alloc_len * sizeof (cppchar_t))),
+    m_alloc_len (alloc_len),
+    m_len (0)
+  {
+  }
+
+  ~utf32_string () { free (m_buf); }
+
+  size_t get_length () const { return m_len; }
+
+  cppchar_t operator[] (size_t idx) const
+  {
+    gcc_assert (idx < m_len);
+    return m_buf[idx];
+  }
+
+  bool operator== (const utf32_string &other) const
+  {
+    if (m_len != other.m_len)
+      return false;
+    return memcmp (m_buf, other.m_buf, m_len * sizeof (cppchar_t)) == 0;
+  }
+
+  bool operator!= (const utf32_string &other) const
+  {
+    return !(*this == other);
+  }
+
+  void dump_to_pp (pretty_printer *pp) const
+  {
+    pp_character (pp, '"');
+    for (size_t idx = 0; idx < m_len; idx++)
+      print_escaped_codepoint (pp, m_buf[idx]);
+    pp_character (pp, '"');
+    // TODO: length, alloc sz, etc
+  }
+
+  DEBUG_FUNCTION void dump () const
+  {
+    pretty_printer pp;
+    pp.buffer->stream = stderr;
+    dump_to_pp (&pp);
+    pp_newline (&pp);
+    pp_flush (&pp);
+  }
+
+  static utf32_string
+  from_identifier (tree id)
+  {
+   return from_utf8 ((const uchar *)IDENTIFIER_POINTER (id),
+		     IDENTIFIER_LENGTH (id));
+  }
+
+  static utf32_string
+  from_utf8 (const char *utf8)
+  {
+    return from_utf8 ((const unsigned char *)utf8, strlen (utf8));
+  }
+
+  static utf32_string
+  from_utf8 (const unsigned char *buf, size_t len)
+  {
+    utf32_string result (len);
+    size_t idx = 0;
+    while (idx < len)
+      {
+	/* Compute the length of the src UTF-8 codepoint.  */
+	int ucn_len = 0;
+	uchar ch = buf[idx++];
+	for (uchar t = ch; t & 0x80; t <<= 1)
+	  ucn_len++;
+
+	cppchar_t src_utf32 = ch & (0x7F >> ucn_len);
+	for (int ucn_len_c = 1; ucn_len_c < ucn_len; ucn_len_c++)
+	  src_utf32 = (src_utf32 << 6) | (buf[idx++] & 0x3F);
+
+	result.append (src_utf32);
+      }
+    return result;
+  }
+
+  static utf32_string
+  from_cppchar_t (cppchar_t ch)
+  {
+    utf32_string result (1);
+    result.append (ch);
+    return result;
+  }
+
+  /* Convert from space-separated hex-encoded codepoints, as seen
+     in the columns of
+       https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+     for example.  */
+
+  static utf32_string
+  from_hex (const char *buf, size_t buf_len)
+  {
+    utf32_string result (3);
+
+    while (buf_len > 0)
+    {
+      cppchar_t next_ch;
+      size_t consumed = parse_hex (buf, buf_len, &next_ch);
+      if (consumed > 0)
+	result.append (next_ch);
+      buf += consumed;
+      buf_len -= consumed;
+    }
+
+    return result;
+  }
+
+  void append (cppchar_t ch)
+  {
+    ensure_space (m_len + 1);
+    m_buf[m_len++] = ch;
+  }
+
+  /* Move ctor.  */
+
+  utf32_string (utf32_string &&other)
+  : m_buf (other.m_buf),
+    m_alloc_len (other.m_alloc_len),
+    m_len (other.m_len)
+  {
+    other.m_buf = NULL;
+  }
+
+  /* Move assignment.  */
+
+  utf32_string &operator= (utf32_string &&other)
+  {
+   free (m_buf);
+   m_buf = other.m_buf;
+   m_alloc_len = other.m_alloc_len;
+   m_len = other.m_len;
+   other.m_buf = NULL;
+   return *this;
+  }
+
+  utf32_string convert_to_nfd () const;
+
+ private:
+  void ensure_space (size_t new_len)
+  {
+    if (m_alloc_len < new_len)
+      {
+	m_alloc_len = new_len * 2;
+	m_buf = (cppchar_t *)xrealloc (m_buf, m_alloc_len * sizeof (cppchar_t));
+      }
+  }
+
+  void append_decomposition (cppchar_t ch);
+  void sort_by_combining_class (size_t start_idx, size_t end_idx);
+
+  cppchar_t *m_buf;
+  size_t m_alloc_len;
+  size_t m_len;
+};
+
+/* A class for constructing UTF-8 encoded strings.
+   These are not NUL-terminated.  */
+
+class utf8_string
+{
+ public:
+  utf8_string (size_t alloc_sz)
+  : m_buf ((uchar *)xmalloc (alloc_sz)), m_alloc_sz (alloc_sz), m_len (0)
+  {
+  }
+  utf8_string (const utf32_string &other)
+  {
+    m_alloc_sz = (other.get_length () * 6) + 1;
+    m_buf = (uchar *)xmalloc (m_alloc_sz);
+    m_len = 0;
+    for (size_t idx = 0; idx < other.get_length (); idx++)
+      {
+	cppchar_t c = other[idx];
+
+	/* Adapted from libcpp/charset.c: one_cppchar_to_utf8 */
+	if (c < 0x80)
+	  quick_append (c);
+	else
+	  {
+	    static const uchar masks[6]
+	      =  { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
+	    static const uchar limits[6]
+	      = { 0x80, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE };
+	    size_t nbytes = 1;
+	    uchar buf[6], *p = &buf[6];
+	    do
+	      {
+		*--p = ((c & 0x3F) | 0x80);
+		c >>= 6;
+		nbytes++;
+	      }
+	    while (c >= 0x3F || (c & limits[nbytes-1]));
+	    *--p = (c | masks[nbytes-1]);
+	    while (p < &buf[6])
+	      quick_append (*p++);
+	  }
+      }
+  }
+
+  ~utf8_string () { free (m_buf); }
+
+  /* Move ctor.  */
+
+  utf8_string (utf8_string &&other)
+  : m_buf (other.m_buf),
+    m_alloc_sz (other.m_alloc_sz),
+    m_len (other.m_len)
+  {
+    other.m_buf = NULL;
+  }
+
+  void dump_to_pp (pretty_printer *pp) const
+  {
+    pp_character (pp, '"');
+    for (size_t idx = 0; idx < m_len; idx++)
+      {
+	uchar byte = m_buf[idx];
+	if (byte < 0x80 && ISPRINT (byte))
+	  pp_character (pp, byte);
+	else
+	  {
+	    pp_string (pp, "\\x");
+	    for (int j = 1; j >= 0; j--)
+	      pp_character (pp,
+			    "0123456789abcdef"[(byte >> (4 * j)) & 0xF]);
+	  }
+      }
+    pp_character (pp, '"');
+  }
+
+  DEBUG_FUNCTION void dump () const
+  {
+    pretty_printer pp;
+    pp.buffer->stream = stderr;
+    dump_to_pp (&pp);
+    pp_newline (&pp);
+    pp_flush (&pp);
+  }
+
+  const uchar *get_buffer () const { return m_buf; }
+  size_t get_length () const { return m_len; }
+
+  void quick_append (uchar ch)
+  {
+    m_buf[m_len++] = ch;
+  }
+
+ private:
+  uchar  *m_buf;
+  size_t m_alloc_sz;
+  size_t m_len;
+};
+
+/* Append the canonical decomposition of CH to this string.
+   Note that this is recursive. For example,
+     1E14 has decomposition: 0112 0300
+   but
+     0112 (LATIN CAPITAL LETTER E WITH MACRON) has decomposition: 0045 0304
+   we need to recursively decompose from:
+     1e14 to 0112 0300 to (0045 0304) 0300
+
+   FIXME: should we unroll the recursion in the data, or do
+   the recursion here?
+   Here we're doing it recursively in code (is this necessary,
+   to allow for the Hangul cases, or could we unroll it in the data?)  */
+
+void
+utf32_string::append_decomposition (cppchar_t ch)
+{
+  /* Hangul has its own algorithmic rules for decomposition,
+     so first deal with these as a special case.
+
+     See "Hangul Syllable Decomposition" within section
+     "3.12 Conjoining Jamo Behavior" of the Unicode standard
+     (pp143-145 of version 14.0).
+
+     These variable names are taken directly from the reference
+     algorithm in the Unicode standard, hence we slightly diverge
+     from GNU variable naming standards for the sake of clarity.  */
+  {
+    const cppchar_t SBase = 0xAC00;
+    const cppchar_t LBase = 0x1100;
+    const cppchar_t VBase = 0x1161;
+    const cppchar_t TBase = 0x11A7;
+    const cppchar_t LCount = 19;
+    const cppchar_t VCount = 21;
+    const cppchar_t TCount = 28;
+    const cppchar_t NCount = 588;
+    STATIC_ASSERT (NCount == VCount * TCount);
+    const cppchar_t SCount = 11172;
+    STATIC_ASSERT (SCount == LCount * NCount);
+    if (ch >= SBase && ch < (SBase + SCount))
+      {
+	const cppchar_t SIndex = ch - SBase;
+	const cppchar_t L = LBase + SIndex / NCount;
+	const cppchar_t V = VBase + (SIndex % NCount) / TCount;
+	const cppchar_t T = TBase + SIndex % TCount;
+	append_decomposition (L);
+	append_decomposition (V);
+	if (T != TBase)
+	  append_decomposition (T);
+	return;
+      }
+  }
+
+  /* Otherwise, use the decomposition mappings from the
+     Unicode Character Database.  */
+
+  switch (ch)
+    {
+    default:
+      append (ch);
+      break;
+
+#define BEGIN_SRC(SRC_CODEPOINT)		\
+    case SRC_CODEPOINT:			\
+      {
+
+#define DST(CODEPOINT) \
+	do { append_decomposition (CODEPOINT); } while (0)
+
+#define END_SRC					\
+      }						\
+      break;
+
+#include "decomposition.inc"
+
+#undef BEGIN_SRC
+#undef DST
+#undef END_SRC
+    }
+}
+
+/* Return the string that results from converting this string to NFD.  */
+
+utf32_string
+utf32_string::convert_to_nfd () const
+{
+  /* Two-pass algorithm.
+     FIXME: Should we convert it to a single-pass algorithm?  */
+  utf32_string result (m_len);
+
+  /* First pass: replace each canonical composite with
+     its canonical decomposition.  */
+  for (size_t idx = 0; idx < m_len; idx++)
+    {
+      const cppchar_t ch = m_buf[idx];
+      result.append_decomposition (ch);
+    }
+
+  /* Second pass: sort sequences of combining marks by
+     combining class.  */
+  for (size_t start_idx = 0; start_idx < result.get_length (); )
+    {
+      if (cpp_combining_class (result[start_idx]) == 0)
+	start_idx++;
+      else
+	{
+	  /* Find run of followup chars that also have a nonzero
+	     combining class.
+	     Specifically, beyond_idx will be the first index after
+	     such a run.  */
+	  size_t beyond_idx;
+	  for (beyond_idx = start_idx + 1;
+	       (beyond_idx < result.get_length ()
+		&& cpp_combining_class (result[beyond_idx]) != 0);
+	       beyond_idx++)
+	    {
+	      /* Empty.  */
+	    }
+	  /* Sort this run of combining marks.  */
+	  if (beyond_idx > start_idx + 1)
+	    result.sort_by_combining_class (start_idx, beyond_idx);
+	  start_idx = beyond_idx;
+	}
+    }
+
+  return result;
+}
+
+/* Comparator callback for sorting cppchar_t by combining class.  */
+
+static int
+cmp_combining_class (const void *p1, const void *p2)
+{
+  cppchar_t c1 = *(const cppchar_t *)p1;
+  cppchar_t c2 = *(const cppchar_t *)p2;
+  return ((int)cpp_combining_class (c1)) - ((int)cpp_combining_class (c2));
+}
+
+/* Subroutine of utf32_string::convert_to_nfd: Sort the characters
+   in the half-open range [START_IDX, BEYOND_IDX) by combining class.  */
+
+void
+utf32_string::sort_by_combining_class (size_t start_idx, size_t beyond_idx)
+{
+  gcc_stablesort (m_buf + start_idx, beyond_idx - start_idx,
+		  sizeof (cppchar_t), cmp_combining_class);
+}
+
+/* Subroutine of get_tr39_skeleton, for implementing step 2
+   of skeleton(X) from
+   http://www.unicode.org/reports/tr39/#Confusable_Detection  */
+
+static utf32_string
+convert_homoglyphs_to_exemplars (const utf32_string &in_str)
+{
+  utf32_string result (in_str.get_length ());
+  for (size_t idx = 0; idx < in_str.get_length (); idx++)
+    {
+      const cppchar_t ch = in_str[idx];
+      switch (ch)
+	{
+	default:
+	  result.append (ch);
+	  break;
+
+#define BEGIN_SRC(SRC_CODEPOINT)		\
+	  case SRC_CODEPOINT:			\
+	    {
+
+#define DST(CODEPOINT) \
+	  do { result.append (CODEPOINT); } while (0)
+
+#define END_SRC					\
+	    }					\
+	    break;
+
+#include "confusables.inc"
+
+#undef BEGIN_SRC
+#undef DST
+#undef END_SRC
+	}
+    }
+  return result;
+}
+
+/* Implementation of skeleton(X) from
+   http://www.unicode.org/reports/tr39/#Confusable_Detection  */
+
+static utf32_string
+get_tr39_skeleton (const utf32_string &in_str)
+{
+  utf32_string result (in_str.get_length ());
+  result = in_str.convert_to_nfd ();
+  result = convert_homoglyphs_to_exemplars (result);
+  result = result.convert_to_nfd ();
+  return result;
+}
+
 /* Return an IDENTIFIER_NODE whose name is TEXT (a null-terminated string).
    If an identifier with that name has previously been referred to,
    the same node is returned this time.  */
@@ -131,6 +693,179 @@  maybe_get_identifier (const char *text)
   return NULL_TREE;
 }
 
+/* Information about the first time a skeleton is used.  */
+
+struct first_use
+{
+  first_use (tree identifier, location_t loc)
+  : m_identifier (identifier), m_loc (loc)
+  {}
+
+  /* The first identifier we saw that uses the skeleton.  */
+  tree m_identifier;
+
+  /* The location of the identifier.  */
+  location_t m_loc;
+};
+
+static hash_map<tree, tree> map_identifier_to_skeleton;
+static hash_map<tree, first_use> map_skeleton_to_first_use;
+
+/* A class for use when warning about homoglyphs: generate a copy of the
+   identifier, but escaping non-printable-ASCII bytes in the UTF-8
+   representation as \xNN.  */
+
+class escaped_identifier
+{
+ public:
+  escaped_identifier (tree id)
+  {
+    gcc_assert (TREE_CODE (id) == IDENTIFIER_NODE);
+    utf32_string utf32 (utf32_string::from_identifier (id));
+    for (size_t idx = 0; idx < utf32.get_length (); idx++)
+      print_escaped_codepoint (&m_pp, utf32[idx]);
+  }
+  const char *get_str ()
+  {
+    return pp_formatted_text (&m_pp);
+  }
+
+ private:
+  pretty_printer m_pp;
+};
+
+/* Called when an identifier is first created, and on every identifier
+   lookup.
+
+   Ensure that ID has a skeleton, as per
+     http://www.unicode.org/reports/tr39/#Confusable_Detection
+
+   The first time we see a new ID, generate the skeleton, and
+   if the skeleton is already in use by a different ID, issue
+   a -Whomoglyph diagnostic.  */
+
+static void
+maybe_warn_on_homoglyph (tree id)
+{
+  if (!warn_homoglyph)
+    return;
+
+  /* If we've already got the skeleton for ID, bail out.
+     This ensures that we only warn for the first occurrence
+     of the identifier.  */
+  /* FIXME: this is likely to be slow; is there somewhere we can stash
+     this in the identifier itself?  */
+  if (map_identifier_to_skeleton.get (id))
+    return;
+
+  utf32_string str
+    (utf32_string::from_utf8 ((const unsigned char *)IDENTIFIER_POINTER (id),
+			      IDENTIFIER_LENGTH (id)));
+  utf32_string skel = get_tr39_skeleton (str);
+  utf8_string utf8_skel (skel);
+
+  tree skel_id;
+
+  /* The common case is that SKELETON(ID) is ID.  */
+  if (utf8_skel.get_length () == IDENTIFIER_LENGTH (id)
+      && memcmp (IDENTIFIER_POINTER (id),
+		 utf8_skel.get_buffer (),
+		 utf8_skel.get_length ()) == 0)
+    skel_id = id;
+  else
+    {
+      /* Otherwise, get an identifier for the skeleton, but without
+	 checking for homoglyphs (avoiding recursion issues).  */
+      ident_hash->on_new_node = NULL;
+      ident_hash->on_existing_node = NULL;
+      skel_id = HT_IDENT_TO_GCC_IDENT (ht_lookup (ident_hash,
+						  utf8_skel.get_buffer (),
+						  utf8_skel.get_length (),
+						  HT_ALLOC));
+      ident_hash->on_new_node = stringpool_on_new_hashnode;
+      ident_hash->on_existing_node = stringpool_on_existing_hashnode;
+    }
+  map_identifier_to_skeleton.put (id, skel_id);
+  if (first_use *slot = map_skeleton_to_first_use.get (skel_id))
+    {
+      /* This skeleton has already been used by a different identifier;
+	 issue a diagnostic.
+
+	 If we simply print both identifiers, the resulting diagnostics
+	 are themselves confusing, as they are visually identical in
+	 the diagnostics.  Hence we also print escaped versions of the
+	 identifiers in the diagnostics for the cases where the identifier
+	 is non-equal to the skeleton (which could be one or both of them).  */
+
+      auto_diagnostic_group d;
+
+      /* CWE-1007: "Insufficient Visual Distinction of Homoglyphs
+	 Presented to User" */
+      diagnostic_metadata m;
+      m.add_cwe (1007);
+
+      gcc_rich_location richloc (input_location);
+      richloc.set_escape_on_output (true);
+      bool warned;
+      if (id == skel_id)
+	{
+	  warned = warning_meta (&richloc, m, OPT_Whomoglyph,
+				 "identifier %qs...",
+				 IDENTIFIER_POINTER (id));
+	}
+      else
+	{
+	  escaped_identifier escaped_id (id);
+	  warned = warning_meta (&richloc, m, OPT_Whomoglyph,
+				 "identifier %qs (%qs)...",
+				 IDENTIFIER_POINTER (id),
+				 escaped_id.get_str ());
+	}
+      if (warned)
+	{
+	  gcc_rich_location slot_richloc (slot->m_loc);
+	  slot_richloc.set_escape_on_output (true);
+	  if (slot->m_identifier == skel_id)
+	    {
+	      inform (&slot_richloc,
+		      "...confusable with non-equal identifier %qs here",
+		      IDENTIFIER_POINTER (slot->m_identifier));
+	    }
+	  else
+	    {
+	      escaped_identifier escaped_first_use (slot->m_identifier);
+	      inform (&slot_richloc,
+		      "...confusable with non-equal identifier %qs (%qs) here",
+		      IDENTIFIER_POINTER (slot->m_identifier),
+		      escaped_first_use.get_str ());
+	    }
+	}
+    }
+  else
+    {
+      /* Otherwise, this is the first use of this skeleton.  */
+      map_skeleton_to_first_use.put (skel_id, first_use (id, input_location));
+    }
+}
+
+/* Callback for handling insertions into the identifier hashtable.  */
+
+static void
+stringpool_on_new_hashnode (hashnode ht_node)
+{
+  tree t = HT_IDENT_TO_GCC_IDENT (ht_node);
+  maybe_warn_on_homoglyph (t);
+}
+
+/* Callback for handling reuse of identifiers within the
+   identifier hashtable.  */
+
+static void
+stringpool_on_existing_hashnode (hashnode ht_node)
+{
+  tree t = HT_IDENT_TO_GCC_IDENT (ht_node);
+  maybe_warn_on_homoglyph (t);
+}
+
 /* Report some basic statistics about the string pool.  */
 
 void
@@ -277,3 +1012,467 @@  gt_pch_restore_stringpool (void)
 }
 
 #include "gt-stringpool.h"
+
+#if CHECKING_P
+
+namespace selftest {
+
+/* Implementation detail of ASSERT_DUMP_EQ.  */
+
+static void
+assert_dump_eq (const location &loc,
+		const utf32_string &str,
+		const char *expected)
+{
+  pretty_printer pp;
+  str.dump_to_pp (&pp);
+  ASSERT_STREQ_AT (loc, pp_formatted_text (&pp), expected);
+}
+
+static void
+assert_dump_eq (const location &loc,
+		const utf8_string &str,
+		const char *expected)
+{
+  pretty_printer pp;
+  str.dump_to_pp (&pp);
+  ASSERT_STREQ_AT (loc, pp_formatted_text (&pp), expected);
+}
+
+/* Assert that STR.dump_to_pp () is EXPECTED.  */
+
+#define ASSERT_DUMP_EQ(STR, EXPECTED) \
+  SELFTEST_BEGIN_STMT							\
+  assert_dump_eq ((SELFTEST_LOCATION), (STR), (EXPECTED)); \
+  SELFTEST_END_STMT
+
+/* Verify that utf32_string::from_utf8 works as expected.
+   Also verify that utf32_string::dump_to_pp works.
+   Also verify that utf32 to utf8 works.  */
+
+static void
+test_utf32_from_utf8 (void)
+{
+  /* Empty str.  */
+  {
+    utf32_string s (utf32_string::from_utf8 (""));
+    ASSERT_EQ (s.get_length (), 0);
+    ASSERT_DUMP_EQ(s, "\"\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, "\"\"");
+  }
+
+  /* Pure ASCII.  */
+  {
+    utf32_string s (utf32_string::from_utf8 ("hello world"));
+    ASSERT_EQ (s.get_length (), 11);
+    ASSERT_EQ (s[0], 'h');
+    ASSERT_EQ (s[1], 'e');
+    ASSERT_EQ (s[2], 'l');
+    ASSERT_EQ (s[3], 'l');
+    ASSERT_EQ (s[4], 'o');
+    ASSERT_EQ (s[5], ' ');
+    ASSERT_EQ (s[6], 'w');
+    ASSERT_EQ (s[7], 'o');
+    ASSERT_EQ (s[8], 'r');
+    ASSERT_EQ (s[9], 'l');
+    ASSERT_EQ (s[10], 'd');
+    ASSERT_DUMP_EQ(s, "\"hello world\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, "\"hello world\"");
+  }
+
+  /* 2 bytes per char: a string embedding U+03C0 GREEK SMALL LETTER PI
+     which has UTF-8 encoding: 0xCF 0x80.  */
+  {
+    const char *utf8 = "Happy \xcf\x80 day!";
+    utf32_string s (utf32_string::from_utf8 (utf8));
+    ASSERT_EQ (s.get_length (), 12);
+    ASSERT_EQ (s[0], 'H');
+    ASSERT_EQ (s[1], 'a');
+    ASSERT_EQ (s[2], 'p');
+    ASSERT_EQ (s[3], 'p');
+    ASSERT_EQ (s[4], 'y');
+    ASSERT_EQ (s[5], ' ');
+    ASSERT_EQ (s[6], 0x3c0);
+    ASSERT_EQ (s[7], ' ');
+    ASSERT_EQ (s[8], 'd');
+    ASSERT_EQ (s[9], 'a');
+    ASSERT_EQ (s[10], 'y');
+    ASSERT_EQ (s[11], '!');
+    ASSERT_DUMP_EQ(s, "\"Happy \\u03c0 day!\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, "\"Happy \\xcf\\x80 day!\"");
+  }
+
+  /* 3 bytes per char: the Japanese word "mojibake", written as the
+     4 codepoints:
+       U+6587 CJK UNIFIED IDEOGRAPH-6587
+       U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+       U+5316 CJK UNIFIED IDEOGRAPH-5316
+       U+3051 HIRAGANA LETTER KE.  */
+  {
+    const char *utf8 = (/* U+6587 CJK UNIFIED IDEOGRAPH-6587
+			   UTF-8: 0xE6 0x96 0x87
+			   C octal escaped UTF-8: \346\226\207.  */
+			"\346\226\207"
+
+			/* U+5B57 CJK UNIFIED IDEOGRAPH-5B57
+			   UTF-8: 0xE5 0xAD 0x97
+			   C octal escaped UTF-8: \345\255\227.  */
+			"\345\255\227"
+
+			/* U+5316 CJK UNIFIED IDEOGRAPH-5316
+			   UTF-8: 0xE5 0x8C 0x96
+			   C octal escaped UTF-8: \345\214\226.  */
+			 "\345\214\226"
+
+			 /* U+3051 HIRAGANA LETTER KE
+			      UTF-8: 0xE3 0x81 0x91
+			      C octal escaped UTF-8: \343\201\221.  */
+			"\343\201\221");
+    utf32_string s (utf32_string::from_utf8 (utf8));
+    ASSERT_EQ (s.get_length (), 4);
+    ASSERT_EQ (s[0], 0x6587);
+    ASSERT_EQ (s[1], 0x5b57);
+    ASSERT_EQ (s[2], 0x5316);
+    ASSERT_EQ (s[3], 0x3051);
+    ASSERT_DUMP_EQ(s, "\"\\u6587\\u5b57\\u5316\\u3051\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, ("\""
+			"\\xe6\\x96\\x87"
+			"\\xe5\\xad\\x97"
+			"\\xe5\\x8c\\x96"
+			"\\xe3\\x81\\x91"
+			"\""));
+  }
+
+  /* 4 bytes per char: the emoji U+1F602 FACE WITH TEARS OF JOY,
+     which has UTF-8 encoding "\xf0\x9f\x98\x82".  */
+  {
+    const char *utf8 = "pre \xf0\x9f\x98\x82 post";
+    utf32_string s (utf32_string::from_utf8 (utf8));
+    ASSERT_EQ (s.get_length (), 10);
+    ASSERT_EQ (s[0], 'p');
+    ASSERT_EQ (s[1], 'r');
+    ASSERT_EQ (s[2], 'e');
+    ASSERT_EQ (s[3], ' ');
+    ASSERT_EQ (s[4], 0x1f602);
+    ASSERT_EQ (s[5], ' ');
+    ASSERT_EQ (s[6], 'p');
+    ASSERT_EQ (s[7], 'o');
+    ASSERT_EQ (s[8], 's');
+    ASSERT_EQ (s[9], 't');
+    ASSERT_DUMP_EQ(s, "\"pre \\U0001f602 post\"");
+
+    utf8_string s8 (s);
+    ASSERT_DUMP_EQ(s8, ("\"pre \\xf0\\x9f\\x98\\x82 post\""));
+  }
+}
+
+/* Verify that utf32_string::from_hex works as expected.  */
+
+static void
+test_from_hex ()
+{
+  const char *buf = "11935 0334 11930";
+  utf32_string result (utf32_string::from_hex (buf, strlen (buf)));
+  ASSERT_EQ (result.get_length (), 3);
+  ASSERT_DUMP_EQ (result, "\"\\U00011935\\u0334\\U00011930\"");
+}
+
+/* Verify that cpp_combining_class works as expected.  */
+
+static void
+test_combining_classes ()
+{
+  /* LATIN CAPITAL LETTER A.  */
+  ASSERT_EQ (cpp_combining_class (0x41), 0);
+
+  /* COMBINING ACUTE ACCENT.  */
+  ASSERT_EQ (cpp_combining_class (0x301), 230);
+
+  /* COMBINING CEDILLA.  */
+  ASSERT_EQ (cpp_combining_class (0x327), 202);
+}
+
+/* Implementation detail of ASSERT_UTF32_EQ_AT.  */
+
+static void
+assert_utf32_eq_at (const location &code_loc,
+		    const location &data_loc,
+		    const utf32_string &str1,
+		    const utf32_string &str2,
+		    const char *desc1,
+		    const char *desc2)
+{
+  if (str1 == str2)
+    return;
+  ::selftest::note_formatted (data_loc,
+			      "ASSERT_UTF32_EQ_AT: on this data line");
+  fprintf (stderr, "str1: ");
+  str1.dump ();
+  fprintf (stderr, "str2: ");
+  str2.dump ();
+  ::selftest::fail_formatted (code_loc,
+			      "ASSERT_UTF32_EQ_AT: str1: %s, str2: %s",
+			      desc1, desc2);
+}
+
+/* Assert that STR1 equals STR2, showing both SELFTEST_LOCATION, DATA_LOC,
+   and a dump of the strings on failure.  */
+
+#define ASSERT_UTF32_EQ_AT(DATA_LOC, STR1, STR2)		\
+  SELFTEST_BEGIN_STMT							\
+  assert_utf32_eq_at ((SELFTEST_LOCATION), (DATA_LOC),			\
+		      (STR1), (STR2), (#STR1), (#STR2));		\
+  SELFTEST_END_STMT
+
+/* Handle one line from
+     https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+   where PATH is the path to the file,
+   LINE_NUM is the 1-based line number within the file.
+
+   Verify that utf32_string::convert_to_nfd works as expected.  */
+
+static void
+test_normalization_line (const char *path, unsigned line_num,
+			 const char *line_start, size_t line_len)
+{
+  /* Update LINE_LEN to ignore the first hash character,
+     and anything after it.  */
+  if (const char *hash = (const char *)memchr (line_start, '#', line_len))
+    line_len = hash - line_start;
+
+  if (line_len == 0)
+    return;
+
+  /* Ignore the "@PartN" lines.  */
+  if (line_start[0] == '@')
+    return;
+
+  location data_loc (path, line_num, __FUNCTION__);
+
+  /* Locate columns divided by semicolon characters.  */
+  auto_vec <std::pair <const char *, size_t>> columns (5);
+  while (const char *sep = (const char *)memchr (line_start, ';', line_len))
+    {
+      size_t column_width = sep - line_start;
+      columns.safe_push (std::pair <const char *, size_t> (line_start,
+							   column_width));
+      line_start = sep + 1;
+      line_len -= column_width + 1;
+    }
+
+  /* We expect each line to have 5 columns.  */
+  ASSERT_EQ (columns.length (), 5);
+
+  utf32_string col1 (utf32_string::from_hex (columns[0].first,
+					     columns[0].second));
+  utf32_string col2 (utf32_string::from_hex (columns[1].first,
+					     columns[1].second));
+  utf32_string col3 (utf32_string::from_hex (columns[2].first,
+					     columns[2].second));
+  utf32_string col4 (utf32_string::from_hex (columns[3].first,
+					     columns[3].second));
+  utf32_string col5 (utf32_string::from_hex (columns[4].first,
+					     columns[4].second));
+
+  /* Verify NFD.  */
+  /* "c3 ==  toNFD(c1) ==  toNFD(c2) ==  toNFD(c3)". */
+  ASSERT_UTF32_EQ_AT (data_loc, col3, col1.convert_to_nfd ());
+  ASSERT_UTF32_EQ_AT (data_loc, col3, col2.convert_to_nfd ());
+  ASSERT_UTF32_EQ_AT (data_loc, col3, col3.convert_to_nfd ());
+  /* "c5 ==  toNFD(c4) ==  toNFD(c5)". */
+  ASSERT_UTF32_EQ_AT (data_loc, col5, col4.convert_to_nfd ());
+  ASSERT_UTF32_EQ_AT (data_loc, col5, col5.convert_to_nfd ());
+}
+
+/* Call test_normalization_line on each line of a copy of the
+   https://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
+   conformance test.  */
+
+static void
+test_normalization ()
+{
+  char *path = locate_file ("NormalizationTest.txt");
+  char *data = read_file (SELFTEST_LOCATION, path);
+
+  const char *line_start = data;
+  unsigned line_num = 1;
+  while (1)
+    {
+      const char *line_end = strchr (line_start, '\n');
+      if (line_end == NULL)
+	break;
+      test_normalization_line (path, line_num,
+			       line_start, line_end - line_start);
+      line_start = line_end + 1;
+      line_num++;
+    }
+
+  free (data);
+  free (path);
+}
+
+/* Verify that get_tr39_skeleton works as expected.  */
+
+static void
+test_tr39_skeleton_1 ()
+{
+  utf32_string s1 (utf32_string::from_utf8 ("sayHello"));
+  ASSERT_EQ (s1.get_length (), 8);
+  ASSERT_EQ (s1[3], 'H');
+  ASSERT_DUMP_EQ(s1, "\"sayHello\"");
+  ASSERT_EQ (s1, s1);
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+
+  /* s1 should map to itself.  */
+  ASSERT_EQ (s1, skel1);
+
+  /* "CYRILLIC CAPITAL LETTER EN" (U+041D), with UTF-8 encoding 0xD0 0x9D,
+     as opposed to ASCII 'H' (U+0048).  */
+  utf32_string s2 (utf32_string::from_utf8 ("say" "\xd0\x9d" "ello"));
+  ASSERT_EQ (s2.get_length (), 8);
+  ASSERT_EQ (s2[3], 0x41d);
+  ASSERT_DUMP_EQ(s2, "\"say\\u041dello\"");
+  ASSERT_EQ (s2, s2);
+
+  ASSERT_NE (s1, s2);
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_DUMP_EQ(skel2, "\"sayHello\"");
+  ASSERT_NE (s2, skel2);
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+}
+
+/* Example of a single codepoint mapping to multiple codepoints.  */
+
+static void
+test_tr39_skeleton_2 ()
+{
+  utf32_string s1 (utf32_string::from_utf8 ("(v)"));
+  ASSERT_EQ (s1.get_length (), 3);
+  ASSERT_DUMP_EQ(s1, "\"(v)\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+
+  /* s1 should map to itself.  */
+  ASSERT_EQ (s1, skel1);
+
+  /* "PARENTHESIZED LATIN SMALL LETTER V" (U+24B1) should map to
+     LEFT PARENTHESIS, LATIN SMALL LETTER V, RIGHT PARENTHESIS.  */
+  utf32_string s2 (utf32_string::from_cppchar_t (0x24b1));
+  ASSERT_DUMP_EQ(s2, "\"\\u24b1\"");
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_DUMP_EQ(skel2, "\"(v)\"");
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+}
+
+/* Both LATIN CAPITAL LETTER H WITH DESCENDER (U+2C67)
+   and CYRILLIC CAPITAL LETTER EN WITH DESCENDER (U+04A2)
+   should map to LATIN CAPITAL LETTER H, COMBINING VERTICAL LINE BELOW
+   (U+0048, U+0329).  */
+
+static void
+test_tr39_skeleton_3 ()
+{
+  utf32_string s1 (utf32_string::from_cppchar_t (0x2c67));
+  ASSERT_DUMP_EQ(s1, "\"\\u2c67\"");
+
+  utf32_string s2 (utf32_string::from_cppchar_t (0x04a2));
+  ASSERT_DUMP_EQ(s2, "\"\\u04a2\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+  ASSERT_NE (s1, skel1);
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_NE (s2, skel2);
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+  ASSERT_DUMP_EQ(skel2, "\"H\\u0329\"");
+}
+
+static void
+test_tr39_skeleton_4 ()
+{
+  /* U+0124: LATIN CAPITAL LETTER H WITH CIRCUMFLEX
+     has canonical decomposition 0048 0302.
+     This verifies that we apply NFD to the skeleton.  */
+  utf32_string s1 (utf32_string::from_cppchar_t (0x0124));
+  ASSERT_DUMP_EQ(s1, "\"\\u0124\"");
+
+  /* "CYRILLIC CAPITAL LETTER EN" (U+041D) as opposed to ASCII 'H' (U+0048),
+     followed by COMBINING CIRCUMFLEX ACCENT (U+0302).
+     This is not in NFC form.  */
+  utf32_string s2 (2);
+  s2.append (0x041d);
+  s2.append (0x0302);
+  ASSERT_DUMP_EQ(s2, "\"\\u041d\\u0302\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+  ASSERT_NE (s1, skel1);
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_NE (s2, skel2);
+
+  /* Both should map to the same skeleton.  */
+  ASSERT_EQ (skel1, skel2);
+  ASSERT_DUMP_EQ(skel2, "\"H\\u0302\"");
+}
+
+static void
+test_tr39_skeleton_5 ()
+{
+  /* U+01C6: LATIN SMALL LETTER DZ WITH CARON
+     (has compatibility decomposition 0064 017E).  */
+  utf32_string s1 (utf32_string::from_cppchar_t (0x01c6));
+  ASSERT_DUMP_EQ(s1, "\"\\u01c6\"");
+
+  /* "LATIN SMALL LETTER D, LATIN SMALL LETTER Z, COMBINING CARON".
+     This is not in NFC form.  */
+  utf32_string s2 (3);
+  s2.append (0x0064); /* LATIN SMALL LETTER D. */
+  s2.append (0x007a); /* LATIN SMALL LETTER Z. */
+  s2.append (0x030c); /* COMBINING CARON.  */
+  ASSERT_DUMP_EQ(s2, "\"dz\\u030c\"");
+
+  utf32_string skel1 = get_tr39_skeleton (s1);
+  ASSERT_NE (s1, skel1);
+  ASSERT_DUMP_EQ(skel1, "\"dz\\u030c\"");
+
+  utf32_string skel2 = get_tr39_skeleton (s2);
+  ASSERT_NE (s2, skel2);
+  ASSERT_DUMP_EQ(skel2, "\"dz\\u0306\"");
+  ASSERT_NE (skel1, skel2);
+}
+
+/* Run all of the selftests within this file.  */
+
+void
+stringpool_c_tests ()
+{
+  test_utf32_from_utf8 ();
+  test_from_hex ();
+  test_combining_classes ();
+  test_normalization ();
+  test_tr39_skeleton_1 ();
+  test_tr39_skeleton_2 ();
+  test_tr39_skeleton_3 ();
+  test_tr39_skeleton_4 ();
+  test_tr39_skeleton_5 ();
+}
+
+} // namespace selftest
+
+#endif /* CHECKING_P */
diff --git a/gcc/testsuite/c-c++-common/Whomoglyph-1.c b/gcc/testsuite/c-c++-common/Whomoglyph-1.c
new file mode 100644
index 00000000000..9fc9d680bba
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Whomoglyph-1.c
@@ -0,0 +1,18 @@ 
+#include <stdio.h>
+
+/* ASCII 'H' (U+0048).  */
+
+void sayHello() { /* { dg-message "\\.\\.\\.confusable with non-equal identifier 'sayHello' here" } */
+  printf("Hello, World!\n");
+}
+
+/* CYRILLIC CAPITAL LETTER EN (U+041D), with UTF-8 encoding 0xD0 0x9D.  */
+
+void sayНello() { /* { dg-warning "identifier 'sayНello' \\('say\\\\u041dello'\\)\\.\\.\\." } */
+  printf("Goodbye, World!\n");
+}
+
+int main() {
+  sayНello();
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/Whomoglyph-2.c b/gcc/testsuite/c-c++-common/Whomoglyph-2.c
new file mode 100644
index 00000000000..0105cd19158
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Whomoglyph-2.c
@@ -0,0 +1,18 @@ 
+#include <stdio.h>
+
+/* CYRILLIC CAPITAL LETTER EN (U+041D), with UTF-8 encoding 0xD0 0x9D.  */
+
+void sayНello() { /* { dg-message "\\.\\.\\.confusable with non-equal identifier 'sayНello' \\('say\\\\u041dello'\\) here" } */
+  printf("Hello, World!\n");
+}
+
+/* ASCII 'H' (U+0048).  */
+
+void sayHello() { /* { dg-warning "identifier 'sayHello'\\.\\.\\." } */
+  printf("Goodbye, World!\n");
+}
+
+int main() {
+  sayНello();
+  return 0;
+}
diff --git a/gcc/testsuite/c-c++-common/Whomoglyph-3.c b/gcc/testsuite/c-c++-common/Whomoglyph-3.c
new file mode 100644
index 00000000000..4f2c773f0eb
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/Whomoglyph-3.c
@@ -0,0 +1,20 @@ 
+/* Two confusables that both map to a third (LATIN CAPITAL LETTER H).  */
+
+#include <stdio.h>
+
+/* CYRILLIC CAPITAL LETTER EN (U+041D), with UTF-8 encoding 0xD0 0x9D.  */
+
+void sayНello() { /* { dg-message "\\.\\.\\.confusable with non-equal identifier 'sayНello' \\('say\\\\u041dello'\\) here" } */
+  printf("Hello, World!\n");
+}
+
+/* GREEK CAPITAL LETTER ETA (U+0397).  */
+
+void sayΗello() { /* { dg-warning "identifier 'sayΗello' \\('say\\\\u0397ello'\\)\\.\\.\\." } */
+  printf("Goodbye, World!\n");
+}
+
+int main() {
+  sayНello();
+  return 0;
+}
diff --git a/libcpp/charset.c b/libcpp/charset.c
index 0b0ccc6c021..894c43d102d 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -921,6 +921,29 @@  struct ucnrange {
 /* ISO 10646 defines the UCS codespace as the range 0-0x10FFFF inclusive.  */
 #define UCS_LIMIT 0x10FFFF
 
+/* Get the canonical combining class of C, or 0 if C is outside the
+   UCS codespace.  */
+
+unsigned char
+cpp_combining_class (cppchar_t c)
+{
+  int mn, mx, md;
+
+  if (c > UCS_LIMIT)
+    return 0;
+
+  mn = 0;
+  mx = ARRAY_SIZE (ucnranges) - 1;
+  while (mx != mn)
+    {
+      md = (mn + mx) / 2;
+      if (c <= ucnranges[md].end)
+	mx = md;
+      else
+	mn = md + 1;
+    }
+  return ucnranges[mn].combine;
+}
+
 /* Returns 1 if C is valid in an identifier, 2 if C is valid except at
    the start of an identifier, and 0 if C is not valid in an
    identifier.  We assume C has already gone through the checks of
diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
index 176f8c5bbce..0f31115c2d7 100644
--- a/libcpp/include/cpplib.h
+++ b/libcpp/include/cpplib.h
@@ -1267,6 +1267,9 @@  extern bool cpp_pedwarning (cpp_reader *, enum cpp_warning_reason,
 extern bool cpp_warning_syshdr (cpp_reader *, enum cpp_warning_reason reason,
 				const char *msgid, ...)
   ATTRIBUTE_PRINTF_3;
+extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
+			    rich_location *richloc, const char *msgid, ...)
+  ATTRIBUTE_PRINTF_4;
 
 /* As their counterparts above, but use RICHLOC.  */
 extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
@@ -1544,4 +1547,6 @@  int cpp_wcwidth (cppchar_t c);
 bool cpp_input_conversion_is_trivial (const char *input_charset);
 int cpp_check_utf8_bom (const char *data, size_t data_length);
 
+unsigned char cpp_combining_class (cppchar_t ch);
+
 #endif /* ! LIBCPP_CPPLIB_H */
diff --git a/libcpp/include/symtab.h b/libcpp/include/symtab.h
index 6905753f839..c5be0a3474a 100644
--- a/libcpp/include/symtab.h
+++ b/libcpp/include/symtab.h
@@ -54,6 +54,10 @@  struct ht
   /* Call back, allocate something that hangs off a node like a cpp_macro.  
      NULL means use the usual allocator.  */
   void * (*alloc_subobject) (size_t);
+  /* Call back, a new node has been allocated.  */
+  void (*on_new_node) (hashnode);
+  /* Call back, a lookup has found an existing node.  */
+  void (*on_existing_node) (hashnode);
 
   unsigned int nslots;		/* Total slots in the entries array.  */
   unsigned int nelements;	/* Number of live elements.  */
diff --git a/libcpp/symtab.c b/libcpp/symtab.c
index 9a2fae0f78d..b33fed3ef1e 100644
--- a/libcpp/symtab.c
+++ b/libcpp/symtab.c
@@ -119,7 +119,11 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
       else if (node->hash_value == hash
 	       && HT_LEN (node) == (unsigned int) len
 	       && !memcmp (HT_STR (node), str, len))
-	return node;
+	{
+	  if (table->on_existing_node)
+	    (*table->on_existing_node) (node);
+	  return node;
+	}
 
       /* hash2 must be odd, so we're guaranteed to visit every possible
 	 location in the table during rehashing.  */
@@ -141,7 +145,11 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
 	  else if (node->hash_value == hash
 		   && HT_LEN (node) == (unsigned int) len
 		   && !memcmp (HT_STR (node), str, len))
-	    return node;
+	    {
+	      if (table->on_existing_node)
+		(*table->on_existing_node) (node);
+	      return node;
+	    }
 	}
     }
 
@@ -173,6 +181,9 @@  ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
     /* Must expand the string table.  */
     ht_expand (table);
 
+  if (table->on_new_node)
+    (*table->on_new_node) (node);
+
   return node;
 }