[committed] diagnostics: escape non-ASCII source bytes for certain diagnostics

  This patch adds support to GCC's diagnostic subsystem for escaping certain
bytes and Unicode characters when quoting source code.

Specifically, this patch adds a new flag rich_location::m_escape_on_output
which is a hint from a diagnostic that non-ASCII bytes in the pertinent
lines of the user's source code should be escaped when printed.

The patch sets this for the following diagnostics:
- when complaining about stray bytes in the program (when these
are non-printable)
- when complaining about "null character(s) ignored");
- for -Wnormalized= (and generate source ranges for such warnings)

The escaping is controlled by a new option:
  -fdiagnostics-escape-format=[unicode|bytes]

For example, consider a diagnostic involing a source line containing the
string "before" followed by the Unicode character U+03C0 ("GREEK SMALL
LETTER PI", with UTF-8 encoding 0xCF 0x80) followed by the byte 0xBF
(a stray UTF-8 trailing byte), followed by the string "after", where the
diagnostic highlights the U+03C0 character.

By default, this line will be printed verbatim to the user when
reporting a diagnostic at it, as:

 beforeπXafter
       ^

(using X for the stray byte to avoid putting invalid UTF-8 in this
commit message)

If the diagnostic sets the "escape" flag, it will be printed as:

 before<U+03C0><BF>after
       ^~~~~~~~

with -fdiagnostics-escape-format=unicode (the default), or as:

  before<CF><80><BF>after
        ^~~~~~~~

if the user supplies -fdiagnostics-escape-format=bytes.

This only affects how the source is printed; it does not affect
how column numbers that are printed (as per -fdiagnostics-column-unit=
and -fdiagnostics-column-origin=).

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.

Also tested by attempting to compile a .png file as if it were C
source code.

Pushed to trunk as bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.

gcc/c-family/ChangeLog:
	* c-lex.c (c_lex_with_flags): When complaining about non-printable
	CPP_OTHER tokens, set the "escape on output" flag.

gcc/ChangeLog:
	* common.opt (fdiagnostics-escape-format=): New.
	(diagnostics_escape_format): New enum.
	(DIAGNOSTICS_ESCAPE_FORMAT_UNICODE): New enum value.
	(DIAGNOSTICS_ESCAPE_FORMAT_BYTES): Likewise.
	* diagnostic-format-json.cc (json_end_diagnostic): Add
	"escape-source" attribute.
	* diagnostic-show-locus.c
	(exploc_with_display_col::exploc_with_display_col): Replace
	"tabstop" param with a cpp_char_column_policy and add an "aspect"
	param.  Use these to compute m_display_col accordingly.
	(struct char_display_policy): New struct.
	(layout::m_policy): New field.
	(layout::m_escape_on_output): New field.
	(def_policy): New function.
	(make_range): Update for changes to exploc_with_display_col ctor.
	(default_print_decoded_ch): New.
	(width_per_escaped_byte): New.
	(escape_as_bytes_width): New.
	(escape_as_bytes_print): New.
	(escape_as_unicode_width): New.
	(escape_as_unicode_print): New.
	(make_policy): New.
	(layout::layout): Initialize new fields.  Update m_exploc ctor
	call for above change to ctor.
	(layout::maybe_add_location_range): Update for changes to
	exploc_with_display_col ctor.
	(layout::calculate_x_offset_display): Update for change to
	cpp_display_width.
	(layout::print_source_line): Pass policy
	to cpp_display_width_computation. Capture cpp_decoded_char when
	calling process_next_codepoint.  Move printing of source code to
	m_policy.m_print_cb.
	(line_label::line_label): Pass in policy rather than context.
	(layout::print_any_labels): Update for change to line_label ctor.
	(get_affected_range): Pass in policy rather than context, updating
	calls to location_compute_display_column accordingly.
	(get_printed_columns): Likewise, also for cpp_display_width.
	(correction::correction): Pass in policy rather than tabstop.
	(correction::compute_display_cols): Pass m_policy rather than
	m_tabstop to cpp_display_width.
	(correction::m_tabstop): Replace with...
	(correction::m_policy): ...this.
	(line_corrections::line_corrections): Pass in policy rather than
	context.
	(line_corrections::m_context): Replace with...
	(line_corrections::m_policy): ...this.
	(line_corrections::add_hint): Update to use m_policy rather than
	m_context.
	(line_corrections::add_hint): Likewise.
	(layout::print_trailing_fixits): Likewise.
	(selftest::test_display_widths): New.
	(selftest::test_layout_x_offset_display_utf8): Update to use
	policy rather than tabstop.
	(selftest::test_one_liner_labels_utf8): Add test of escaping
	source lines.
	(selftest::test_diagnostic_show_locus_one_liner_utf8): Update to
	use policy rather than tabstop.
	(selftest::test_overlapped_fixit_printing): Likewise.
	(selftest::test_overlapped_fixit_printing_utf8): Likewise.
	(selftest::test_overlapped_fixit_printing_2): Likewise.
	(selftest::test_tab_expansion): Likewise.
	(selftest::test_escaping_bytes_1): New.
	(selftest::test_escaping_bytes_2): New.
	(selftest::diagnostic_show_locus_c_tests): Call the new tests.
	* diagnostic.c (diagnostic_initialize): Initialize
	context->escape_format.
	(convert_column_unit): Update to use default character width policy.
	(selftest::test_diagnostic_get_location_text): Likewise.
	* diagnostic.h (enum diagnostics_escape_format): New enum.
	(diagnostic_context::escape_format): New field.
	* doc/invoke.texi (-fdiagnostics-escape-format=): New option.
	(-fdiagnostics-format=): Add "escape-source" attribute to examples
	of JSON output, and document it.
	* input.c (location_compute_display_column): Pass in "policy"
	rather than "tabstop", passing to
	cpp_byte_column_to_display_column.
	(selftest::test_cpp_utf8): Update to use cpp_char_column_policy.
	* input.h (class cpp_char_column_policy): New forward decl.
	(location_compute_display_column): Pass in "policy" rather than
	"tabstop".
	* opts.c (common_handle_option): Handle
	OPT_fdiagnostics_escape_format_.
	* selftest.c (temp_source_file::temp_source_file): New ctor
	overload taking a size_t.
	* selftest.h (temp_source_file::temp_source_file): Likewise.

gcc/testsuite/ChangeLog:
	* c-c++-common/diagnostic-format-json-1.c: Add regexp to consume
	"escape-source" attribute.
	* c-c++-common/diagnostic-format-json-2.c: Likewise.
	* c-c++-common/diagnostic-format-json-3.c: Likewise.
	* c-c++-common/diagnostic-format-json-4.c: Likewise, twice.
	* c-c++-common/diagnostic-format-json-5.c: Likewise.
	* gcc.dg/cpp/warn-normalized-4-bytes.c: New test.
	* gcc.dg/cpp/warn-normalized-4-unicode.c: New test.
	* gcc.dg/encoding-issues-bytes.c: New test.
	* gcc.dg/encoding-issues-unicode.c: New test.
	* gfortran.dg/diagnostic-format-json-1.F90: Add regexp to consume
	"escape-source" attribute.
	* gfortran.dg/diagnostic-format-json-2.F90: Likewise.
	* gfortran.dg/diagnostic-format-json-3.F90: Likewise.

libcpp/ChangeLog:
	* charset.c (convert_escape): Use encoding_rich_location when
	complaining about nonprintable unknown escape sequences.
	(cpp_display_width_computation::::cpp_display_width_computation):
	Pass in policy rather than tabstop.
	(cpp_display_width_computation::process_next_codepoint): Add "out"
	param and populate *out if non-NULL.
	(cpp_display_width_computation::advance_display_cols): Pass NULL
	to process_next_codepoint.
	(cpp_byte_column_to_display_column): Pass in policy rather than
	tabstop.  Pass NULL to process_next_codepoint.
	(cpp_display_column_to_byte_column): Pass in policy rather than
	tabstop.
	* errors.c (cpp_diagnostic_get_current_location): New function,
	splitting out the logic from...
	(cpp_diagnostic): ...here.
	(cpp_warning_at): New function.
	(cpp_pedwarning_at): New function.
	* include/cpplib.h (cpp_warning_at): New decl for rich_location.
	(cpp_pedwarning_at): Likewise.
	(struct cpp_decoded_char): New.
	(struct cpp_char_column_policy): New.
	(cpp_display_width_computation::cpp_display_width_computation):
	Replace "tabstop" param with "policy".
	(cpp_display_width_computation::process_next_codepoint): Add "out"
	param.
	(cpp_display_width_computation::m_tabstop): Replace with...
	(cpp_display_width_computation::m_policy): ...this.
	(cpp_byte_column_to_display_column): Replace "tabstop" param with
	"policy".
	(cpp_display_width): Likewise.
	(cpp_display_column_to_byte_column): Likewise.
	* include/line-map.h (rich_location::escape_on_output_p): New.
	(rich_location::set_escape_on_output): New.
	(rich_location::m_escape_on_output): New.
	* internal.h (cpp_diagnostic_get_current_location): New decl.
	(class encoding_rich_location): New.
	* lex.c (skip_whitespace): Use encoding_rich_location when
	complaining about null characters.
	(warn_about_normalization): Generate a source range when
	complaining about improperly normalized tokens, rather than just a
	point, and use encoding_rich_location so that the source code
	is escaped on printing.
	* line-map.c (rich_location::rich_location): Initialize
	m_escape_on_output.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
---
 gcc/c-family/c-lex.c                          |   6 +-
 gcc/common.opt                                |  13 +
 gcc/diagnostic-format-json.cc                 |   3 +
 gcc/diagnostic-show-locus.c                   | 580 +++++++++++++++---
 gcc/diagnostic.c                              |  10 +-
 gcc/diagnostic.h                              |  18 +
 gcc/doc/invoke.texi                           |  43 +-
 gcc/input.c                                   |  62 +-
 gcc/input.h                                   |   7 +-
 gcc/opts.c                                    |   4 +
 gcc/selftest.c                                |  15 +
 gcc/selftest.h                                |   2 +
 .../c-c++-common/diagnostic-format-json-1.c   |   1 +
 .../c-c++-common/diagnostic-format-json-2.c   |   1 +
 .../c-c++-common/diagnostic-format-json-3.c   |   1 +
 .../c-c++-common/diagnostic-format-json-4.c   |   2 +
 .../c-c++-common/diagnostic-format-json-5.c   |   1 +
 .../gcc.dg/cpp/warn-normalized-4-bytes.c      |  21 +
 .../gcc.dg/cpp/warn-normalized-4-unicode.c    |  19 +
 gcc/testsuite/gcc.dg/encoding-issues-bytes.c  | Bin 0 -> 595 bytes
 .../gcc.dg/encoding-issues-unicode.c          | Bin 0 -> 613 bytes
 .../gfortran.dg/diagnostic-format-json-1.F90  |   1 +
 .../gfortran.dg/diagnostic-format-json-2.F90  |   1 +
 .../gfortran.dg/diagnostic-format-json-3.F90  |   1 +
 libcpp/charset.c                              |  63 +-
 libcpp/errors.c                               |  82 ++-
 libcpp/include/cpplib.h                       |  76 ++-
 libcpp/include/line-map.h                     |  13 +
 libcpp/internal.h                             |  23 +
 libcpp/lex.c                                  |  38 +-
 libcpp/line-map.c                             |   3 +-
 31 files changed, 942 insertions(+), 168 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c
 create mode 100644 gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c
 create mode 100644 gcc/testsuite/gcc.dg/encoding-issues-bytes.c
 create mode 100644 gcc/testsuite/gcc.dg/encoding-issues-unicode.c

Message ID	20211101133811.1094753-1-dmalcolm@redhat.com
State	Committed
Commit	bd5e882cf6e0def3dd1bc106075d59a303fe0d1e
Headers	Return-Path: <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org> X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 804193858014 for <patchwork@sourceware.org>; Mon, 1 Nov 2021 13:39:03 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 804193858014 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1635773943; bh=Io+zkmorspsCFNjVGJfbsPqN6oyuxEPXu1WTwWJaQVA=; h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=es8wzJnWCXr0ezdiCW2q2PL7yIRuqddpPSj2A5WJFc05ZjJMbPbbF8cHvQifothtd Uie8QyOJSrjFgSp8WU+qqPsFMcSEV6nRZonQRRtJm1EsjmvA0h8HS2meAyuoEHI7RF 2YdJ20e/ZuzVmo/bvhm63Ty+jcIfjYiPEUbvvyVg= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTPS id E710F385842D for <gcc-patches@gcc.gnu.org>; Mon, 1 Nov 2021 13:38:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E710F385842D Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-482-NEwFZHLVP16qkMAQzO-tng-1; Mon, 01 Nov 2021 09:38:18 -0400 X-MC-Unique: NEwFZHLVP16qkMAQzO-tng-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id BD12FA40C5 for <gcc-patches@gcc.gnu.org>; Mon, 1 Nov 2021 13:38:17 +0000 (UTC) Received: from t14s.localdomain.com (ovpn-113-202.phx2.redhat.com [10.3.113.202]) by smtp.corp.redhat.com (Postfix) with ESMTP id 117505D9D5; Mon, 1 Nov 2021 13:38:15 +0000 (UTC) To: gcc-patches@gcc.gnu.org Subject: [committed] diagnostics: escape non-ASCII source bytes for certain diagnostics Date: Mon, 1 Nov 2021 09:38:11 -0400 Message-Id: <20211101133811.1094753-1-dmalcolm@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-13.4 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE, TXREP, T_FILL_THIS_FORM_SHORT autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org> List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe> List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/> List-Post: <mailto:gcc-patches@gcc.gnu.org> List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help> List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>, <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe> From: David Malcolm via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: David Malcolm <dmalcolm@redhat.com> Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org>
Series	[committed] diagnostics: escape non-ASCII source bytes for certain diagnostics \| [committed] diagnostics: escape non-ASCII source bytes for certain diagnostics

[committed] diagnostics: escape non-ASCII source bytes for certain diagnostics

Commit Message

Patch