From patchwork Tue Nov 2 20:57:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Malcolm X-Patchwork-Id: 46967 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 0A96D385842B for ; Tue, 2 Nov 2021 20:58:51 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0A96D385842B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1635886731; bh=HKFjk911R6J0PIVLUmP5wg5OVmdyNjIDjCDAejShdZA=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=VeWiFDp8cDEKRgYM+3a7f+ZX3uyw9Fr12FSKXhDyjj9Z7brgrM/2Foj1oYTAyMqyr I842VreBZeUHL7V74UUNUacDlaYsZU6b96OgzOiwI6GTXDrryNwK9yjDUoaCiIYVgQ E6iK/FcNYTl+cQbutb+n55rQ7A27QUl11QxaD/wY= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTPS id 02B34385842A for ; Tue, 2 Nov 2021 20:58:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 02B34385842A Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-338-nSokI2YuOG-_Za3EzF5N4Q-1; Tue, 02 Nov 2021 16:58:16 -0400 X-MC-Unique: nSokI2YuOG-_Za3EzF5N4Q-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 9EFD1806688; Tue, 2 Nov 2021 20:58:15 +0000 (UTC) Received: from t14s.localdomain.com (ovpn-113-202.phx2.redhat.com [10.3.113.202]) by smtp.corp.redhat.com (Postfix) with ESMTP id CABF879450; Tue, 2 Nov 2021 20:58:06 +0000 (UTC) To: Marek Polacek , GCC Patches Subject: [PATCH 0/2] Re: [PATCH] libcpp: Implement -Wbidirectional for CVE-2021-42574 [PR103026] Date: Tue, 2 Nov 2021 16:57:59 -0400 Message-Id: <20211102205801.1202228-1-dmalcolm@redhat.com> In-Reply-To: <20211101163652.36794-1-polacek@redhat.com> References: <20211101163652.36794-1-polacek@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com X-Spam-Status: No, score=-7.4 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: David Malcolm via Gcc-patches From: David Malcolm Reply-To: David Malcolm Cc: Jakub Jelinek , Joseph Myers Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" On Mon, 2021-11-01 at 12:36 -0400, Marek Polacek via Gcc-patches wrote: > From a link below: > "An issue was discovered in the Bidirectional Algorithm in the > Unicode > Specification through 14.0. It permits the visual reordering of > characters via control sequences, which can be used to craft source > code > that renders different logic than the logical ordering of tokens > ingested by compilers and interpreters. Adversaries can leverage this > to > encode source code for compilers accepting Unicode such that targeted > vulnerabilities are introduced invisibly to human reviewers." > > More info: > https://nvd.nist.gov/vuln/detail/CVE-2021-42574 > https://trojansource.codes/ > > This is not a compiler bug. However, to mitigate the problem, this > patch > implements -Wbidirectional=[none|unpaired|any] to warn about possibly > misleading Unicode bidirectional characters the preprocessor may > encounter. > > The default is =unpaired, which warns about improperly terminated > bidirectional characters; e.g. a LRE without its appertaining PDF. > The > level =any warns about any use of bidirectional characters. > > This patch handles both UCNs and UTF-8 characters. UCNs designating > bidi characters in identifiers are accepted since r204886. Then > r217144 > enabled -fextended-identifiers by default. Extended characters in > C/C++ > identifiers have been accepted since r275979. However, this patch > still > warns about mixing UTF-8 and UCN bidi characters; there seems to be > no > good reason to allow mixing them. > > We warn in different contexts: comments (both C and C++-style), > string > literals, character constants, and identifiers. Expectedly, UCNs are > ignored > in comments and raw string literals. The bidirectional characters > can nest > so this patch handles that as well. > > I have not included nor tested this at all with Fortran (which also > has > string literals and line comments). > > Dave M. posted patches improving diagnostic involving Unicode > characters. > This patch does not make use of this new infrastructure yet. Challenge accepted :) Here are a couple of patches on top of the v1 version of your patch to make use of that new infrastructure. The first patch is relatively non-invasive; the second patch reworks things quite a bit to capture location_t values for the bidirectional control characters, and use them in the diagnostics, with labelled ranges, giving e.g.: $ ./xgcc -B. -S ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c -fdiagnostics-escape-format=bytes ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c: In function ‘main’: ../../src/gcc/testsuite/c-c++-common/Wbidirectional-2.c:5:28: warning: unpaired UTF-8 bidirectional character detected [-Wbidirectional=] 5 | /* Say hello; newline<81>/*/ return 0 ; | ~~~~~~~~~~~~ ^ | | | | | end of bidirectional context | U+2067 (RIGHT-TO-LEFT ISOLATE) There's a more complicated example in the test case. Not yet bootstrapped, but hopefully gives you some ideas on future versions of the patch. Note that the precise location_t values aren't going to make much sense without the escaping feature [1], and I don't think that's backportable to GCC 11, so these UX tweaks might be for GCC 12+ only. Hope this is constructive Dave [1] what is a "column number" in a line of bidirectional text? Right now it's a 1-based offset w.r.t. the logical ordering of the characters, but respecting tabs and counting certain characters as occupying two columns, but it's not at all clear to me that there's such a thing as a "column number" in bidirectional text. David Malcolm (2): Flag CPP_W_BIDIRECTIONAL so that source lines are escaped Capture locations of bidi chars and underline ranges .../c-c++-common/Wbidirectional-ranges.c | 54 ++++ libcpp/lex.c | 254 ++++++++++++++---- 2 files changed, 261 insertions(+), 47 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/Wbidirectional-ranges.c