contrib: add unicode/utf8-dump.py

Message ID 20211101141404.1096747-1-dmalcolm@redhat.com
State Superseded
Series contrib: add unicode/utf8-dump.py

Commit Message

David Malcolm Nov. 1, 2021, 2:14 p.m. UTC
This script may be useful when debugging issues relating to Unicode
encoding (e.g. when investigating source files with bidirectional control
characters).

It dumps a UTF-8 file as a list of numbered lines (mimicking GCC's
diagnostic output format), interleaved with lines per character showing
the Unicode codepoints, the UTF-8 encoding bytes, the name of the
character, and, where printable, the characters themselves.
The lines are printed in logical order, which may help the reader to grok
the relationship between visual and logical ordering in bi-di files.
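
As a rough illustration of the per-character fields described above, here
is a minimal sketch (not the script itself) that derives the codepoint, the
UTF-8 bytes and the name for a single character.  The helper name
"describe" is hypothetical and exists only for this example:

```python
import unicodedata


def describe(ch):
    """Return (codepoint, UTF-8 bytes, Unicode name) for one character."""
    codepoint = 'U+%04X' % ord(ch)
    utf8_bytes = ' '.join('0x%02x' % b for b in ch.encode('utf-8'))
    # unicodedata.name() raises ValueError for unnamed characters such as
    # control codes unless a default is supplied, so pass a fallback.
    name = unicodedata.name(ch, '(unknown)')
    return codepoint, utf8_bytes, name


# One ASCII letter, one Hebrew letter, one Tibetan letter.
for ch in 'i\u05d0\u0f43':
    print(*describe(ch))
```

Characters outside ASCII take two or more bytes in UTF-8, which is why the
byte column in the dump is wider than the codepoint column.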

For example:

$ cat test.c
int གྷ;
const char *אבג = "ALEF-BET-GIMEL";

$ ./contrib/unicode/utf8-dump.py test.c
   1 | int གྷ;
     |   U+0069            0x69                     LATIN SMALL LETTER I i
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0F43  0xe0 0xbd 0x83                       TIBETAN LETTER GHA གྷ
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
   2 | const char *אבג = "ALEF-BET-GIMEL";
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+006F            0x6f                     LATIN SMALL LETTER O o
     |   U+006E            0x6e                     LATIN SMALL LETTER N n
     |   U+0073            0x73                     LATIN SMALL LETTER S s
     |   U+0074            0x74                     LATIN SMALL LETTER T t
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0063            0x63                     LATIN SMALL LETTER C c
     |   U+0068            0x68                     LATIN SMALL LETTER H h
     |   U+0061            0x61                     LATIN SMALL LETTER A a
     |   U+0072            0x72                     LATIN SMALL LETTER R r
     |   U+0020            0x20                                    SPACE (separator)
     |   U+002A            0x2a                                 ASTERISK *
     |   U+05D0       0xd7 0x90                       HEBREW LETTER ALEF א
     |   U+05D1       0xd7 0x91                        HEBREW LETTER BET ב
     |   U+05D2       0xd7 0x92                      HEBREW LETTER GIMEL ג
     |   U+0020            0x20                                    SPACE (separator)
     |   U+003D            0x3d                              EQUALS SIGN =
     |   U+0020            0x20                                    SPACE (separator)
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+0041            0x41                   LATIN CAPITAL LETTER A A
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0046            0x46                   LATIN CAPITAL LETTER F F
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0042            0x42                   LATIN CAPITAL LETTER B B
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+0054            0x54                   LATIN CAPITAL LETTER T T
     |   U+002D            0x2d                             HYPHEN-MINUS -
     |   U+0047            0x47                   LATIN CAPITAL LETTER G G
     |   U+0049            0x49                   LATIN CAPITAL LETTER I I
     |   U+004D            0x4d                   LATIN CAPITAL LETTER M M
     |   U+0045            0x45                   LATIN CAPITAL LETTER E E
     |   U+004C            0x4c                   LATIN CAPITAL LETTER L L
     |   U+0022            0x22                           QUOTATION MARK "
     |   U+003B            0x3b                                SEMICOLON ;
     |   U+000A            0x0a                           LINE FEED (LF) (control character)
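
The placeholders in the last column above, such as "(separator)" and
"(control character)", are chosen from the Unicode general category of each
character.  A minimal sketch, assuming the same category mapping as the
patch, of how bidirectional control characters would be labelled; the
function name "printable_or_label" is hypothetical:

```python
import unicodedata


def printable_or_label(ch):
    """Return ch itself, or a placeholder when it would not display usefully."""
    cat = unicodedata.category(ch)
    if cat == 'Cc':
        return '(control character)'
    if cat == 'Cf':
        return '(format control)'
    if cat.startswith('Z'):
        return '(separator)'
    return ch


# U+202E RIGHT-TO-LEFT OVERRIDE and U+2066 LEFT-TO-RIGHT ISOLATE are
# bidirectional controls; both have general category Cf.
for ch in ' \n\u202e\u2066;':
    print('U+%04X %s' % (ord(ch), printable_or_label(ch)))
```

Printing a label instead of the raw character keeps the terminal from
applying the control character to the rest of the output line.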

Tested with Python 3.8

OK for trunk and to backport?

contrib/ChangeLog:
	* unicode/utf8-dump.py: New file.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
---
 contrib/unicode/utf8-dump.py | 65 ++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100755 contrib/unicode/utf8-dump.py
  

Comments

Martin Liška Nov. 1, 2021, 2:36 p.m. UTC | #1
On 11/1/21 15:14, David Malcolm via Gcc-patches wrote:
> This script may be useful when debugging issues relating to Unicode
> encoding (e.g. when investigating source files with bidirectional control
> characters).

I like the script, except for the following flake8 issues:

$ flake8 contrib/unicode/utf8-dump.py
contrib/unicode/utf8-dump.py:35:1: E302 expected 2 blank lines, found 1
contrib/unicode/utf8-dump.py:43:1: E302 expected 2 blank lines, found 1
contrib/unicode/utf8-dump.py:53:1: E302 expected 2 blank lines, found 1
contrib/unicode/utf8-dump.py:64:1: E305 expected 2 blank lines after class or function definition, found 1
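
For reference, E302 and E305 are the pycodestyle checks for blank lines
around top-level definitions.  A minimal sketch of the layout they expect,
using get_name from the patch plus a hypothetical second function
"get_category" that is not in the patch:

```python
import unicodedata


def get_name(ch):
    try:
        return unicodedata.name(ch)
    except ValueError:
        if ch == '\n':
            return 'LINE FEED (LF)'
        return '(unknown)'


def get_category(ch):
    # Hypothetical helper, present only to show the spacing between
    # two consecutive top-level functions (E302).
    return unicodedata.category(ch)


# Two blank lines also separate the last definition from the
# module-level code that follows it (E305).
print(get_name('\n'), get_category('\n'))
```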

Martin
  

Patch

diff --git a/contrib/unicode/utf8-dump.py b/contrib/unicode/utf8-dump.py
new file mode 100755
index 00000000000..21885a85bdc
--- /dev/null
+++ b/contrib/unicode/utf8-dump.py
@@ -0,0 +1,65 @@ 
+#!/usr/bin/env python3
+#
+# Script to dump a UTF-8 file as a list of numbered lines (mimicking GCC's
+# diagnostic output format), interleaved with lines per character showing
+# the Unicode codepoints, the UTF-8 encoding bytes, the name of the
+# character, and, where printable, the characters themselves.
+# The lines are printed in logical order, which may help the reader to grok
+# the relationship between visual and logical ordering in bi-di files.
+#
+# SPDX-License-Identifier: MIT
+#
+# Copyright (C) 2021 David Malcolm <dmalcolm@redhat.com>.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included
+# in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
+# OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE
+# OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+import sys
+import unicodedata
+
+def get_name(ch):
+    try:
+        return unicodedata.name(ch)
+    except ValueError:
+        if ch == '\n':
+            return 'LINE FEED (LF)'
+        return '(unknown)'
+
+def get_printable(ch):
+    cat = unicodedata.category(ch)
+    if cat == 'Cc':
+        return '(control character)'
+    elif cat == 'Cf':
+        return '(format control)'
+    elif cat[0] == 'Z':
+        return '(separator)'
+    return ch
+
+def dump_file(f_in):
+    line_num = 1
+    for line in f_in:
+        print('%4i | %s' % (line_num, line.rstrip()))
+        for ch in line:
+            utf8_desc = '%15s' % (' '.join(['0x%02x' % b
+                                            for b in ch.encode('utf-8')]))
+            print('%4s |   U+%04X %s %40s %s'
+                  % ('', ord(ch), utf8_desc, get_name(ch), get_printable(ch)))
+        line_num += 1
+
+with open(sys.argv[1], mode='r', encoding='utf-8') as f_in:
+    dump_file(f_in)