From patchwork Mon Nov 1 15:32:21 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: David Malcolm
X-Patchwork-Id: 46911
Return-Path:
X-Original-To: patchwork@sourceware.org
Delivered-To: patchwork@sourceware.org
Received: from server2.sourceware.org (localhost [IPv6:::1])
 by sourceware.org (Postfix) with ESMTP id BF8553858432
 for ; Mon, 1 Nov 2021 15:32:54 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BF8553858432
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
 s=default; t=1635780774;
 bh=CJ4p/U9vevr+lz54hD68HTZv6aqWMoASXPY5m3mJ9oY=;
 h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe:
 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
 From;
 b=yXHEBK+IecdXnAF0NnxeMuKGCMsOgmgJZOdmp10n+kSUUMooSVbD0MVujH8LP1DaC
 c5JYtBykqv9TUFhjXP1BIoRixf2cVAhi2GUnh5/dpQ/ecFMl5/Id9EJ9enc8cFkbwj
 C8i4bV2gyfeiyK2SmDj0ZCIy4IRiD86T18B4IO3o=
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id E94F43858432
 for ; Mon, 1 Nov 2021 15:32:25 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org E94F43858432
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-118-Fm_LPP1sNWCzjil_4DgdHA-1; Mon, 01 Nov 2021 11:32:24 -0400
X-MC-Unique: Fm_LPP1sNWCzjil_4DgdHA-1
Received: from smtp.corp.redhat.com
 (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 990778030AF;
 Mon, 1 Nov 2021 15:32:23 +0000 (UTC)
Received: from t14s.localdomain.com (ovpn-113-202.phx2.redhat.com
 [10.3.113.202])
 by smtp.corp.redhat.com (Postfix) with ESMTP id 1BE176788F;
 Mon, 1 Nov 2021 15:32:23 +0000 (UTC)
To: =?utf-8?q?Martin_Li=C5=A1ka?= , gcc-patches@gcc.gnu.org
Subject: [PATCH] contrib: add unicode/utf8-dump.py
Date: Mon, 1 Nov 2021 11:32:21 -0400
Message-Id: <20211101153221.1102221-1-dmalcolm@redhat.com>
In-Reply-To: <5425251a-e35c-27f0-fde7-323365d73a13@suse.cz>
References: <5425251a-e35c-27f0-fde7-323365d73a13@suse.cz>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
X-Spam-Status: No, score=-13.4 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE, TXREP
 autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-Patchwork-Original-From: David Malcolm via Gcc-patches
From: David Malcolm
Reply-To: David Malcolm
Errors-To: gcc-patches-bounces+patchwork=sourceware.org@gcc.gnu.org
Sender: "Gcc-patches"

On Mon, 2021-11-01 at 15:36 +0100, Martin Liška wrote:
> On 11/1/21 15:14, David Malcolm via Gcc-patches wrote:
> > > This script may be useful when debugging issues relating to
> > > Unicode encoding (e.g. when investigating source files with
> > > bidirectional control characters).
>
> I like the script except the following flake8 issues:
>
> $ flake8 contrib/unicode/utf8-dump.py
> contrib/unicode/utf8-dump.py:35:1: E302 expected 2 blank lines, found
> 1
> contrib/unicode/utf8-dump.py:43:1: E302 expected 2 blank lines, found
> 1
> contrib/unicode/utf8-dump.py:53:1: E302 expected 2 blank lines, found
> 1
> contrib/unicode/utf8-dump.py:64:1: E305 expected 2 blank lines after
> class or function definition, found 1

Thanks. Here's an updated version of the script that fixes the above issues.

contrib/ChangeLog:
	* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm
---
 contrib/unicode/utf8-dump.py | 69 ++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100755 contrib/unicode/utf8-dump.py

diff --git a/contrib/unicode/utf8-dump.py b/contrib/unicode/utf8-dump.py
new file mode 100755
index 00000000000..f12ee79f9f2
--- /dev/null
+++ b/contrib/unicode/utf8-dump.py
@@ -0,0 +1,69 @@
+#!/usr/bin/env python3
+#
+# Script to dump a UTF-8 file as a list of numbered lines (mimicking GCC's
+# diagnostic output format), interleaved with lines per character showing
+# the Unicode codepoints, the UTF-8 encoding bytes, the name of the
+# character, and, where printable, the characters themselves.
+# The lines are printed in logical order, which may help the reader to grok
+# the relationship between visual and logical ordering in bi-di files.
+#
+# SPDX-License-Identifier: MIT
+#
+# Copyright (C) 2021 David Malcolm .
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
+# OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
+# OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+import sys
+import unicodedata
+
+
+def get_name(ch):
+    try:
+        return unicodedata.name(ch)
+    except ValueError:
+        if ch == '\n':
+            return 'LINE FEED (LF)'
+        return '(unknown)'
+
+
+def get_printable(ch):
+    cat = unicodedata.category(ch)
+    if cat == 'Cc':
+        return '(control character)'
+    elif cat == 'Cf':
+        return '(format control)'
+    elif cat[0] == 'Z':
+        return '(separator)'
+    return ch
+
+
+def dump_file(f_in):
+    line_num = 1
+    for line in f_in:
+        print('%4i | %s' % (line_num, line.rstrip()))
+        for ch in line:
+            utf8_desc = '%15s' % (' '.join(['0x%02x' % b
+                                            for b in ch.encode('utf-8')]))
+            print('%4s | U+%04X %s %40s %s'
+                  % ('', ord(ch), utf8_desc, get_name(ch), get_printable(ch)))
+        line_num += 1
+
+
+with open(sys.argv[1], mode='r') as f_in:
+    dump_file(f_in)
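
For illustration only (this is not part of the patch): a minimal sketch of the per-character fields the script reports, applied to a short string containing one of the bidirectional control characters mentioned in the thread. `get_name` repeats the fallback logic from the patch; `utf8_bytes` and `describe` are hypothetical helper names introduced here, and returning a tuple instead of printing a formatted row is a simplification.

```python
import unicodedata


def get_name(ch):
    # Same fallback as in the patch: unicodedata.name() raises ValueError
    # for characters that have no name, such as '\n'.
    try:
        return unicodedata.name(ch)
    except ValueError:
        if ch == '\n':
            return 'LINE FEED (LF)'
        return '(unknown)'


def utf8_bytes(ch):
    # Hypothetical helper: hex dump of the UTF-8 encoding of one character.
    return ' '.join('0x%02x' % b for b in ch.encode('utf-8'))


def describe(ch):
    # Hypothetical helper: (codepoint, UTF-8 bytes, name, general category).
    return ('U+%04X' % ord(ch), utf8_bytes(ch), get_name(ch),
            unicodedata.category(ch))


# U+202E RIGHT-TO-LEFT OVERRIDE is a format control character (category Cf).
for ch in 'a\u202e\n':
    print(describe(ch))
```

The category in the last field is what get_printable() in the patch keys on: 'Cc' and 'Cf' characters, and separators (categories starting with 'Z'), are replaced by a description rather than echoed to the terminal.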