From patchwork Thu Apr 29 17:27:42 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Carlos O'Donell
X-Patchwork-Id: 43193
To: libc-alpha@sourceware.org, fweimer@redhat.com, joseph@codesourcery.com
Subject: [PATCH] Make Unicode generation reproducible.
Date: Thu, 29 Apr 2021 13:27:42 -0400
Message-Id: <20210429172742.3301414-1-carlos@redhat.com>
X-Mailer: git-send-email 2.26.3
From: Carlos O'Donell <carlos@redhat.com>

The following changes make Unicode generation reproducible.

First we create a UnicodeRelease.txt file with metadata about the
release. This metadata contains the release date for the Unicode
version that we imported into glibc. Then we add APIs to
unicode_utils.py to access the release metadata. Then we refactor all
of the code to use the release metadata, which includes consistently
using the date of the Unicode release for the required
LC_IDENTIFICATION dates. If the existing files like i18n_ctype or
tr_TR have newer dates, then we keep those; otherwise we use the newer
date from the Unicode release.

All data files are regenerated with:

cd localedata/unicode-gen
make
make install

Subsequent regeneration will not alter any file dates, which makes the
Unicode generation reproducible.

Tested on x86_64 and i686 without regression.
---
 localedata/locales/i18n_ctype | 4 +-
 localedata/locales/tr_TR | 2 +-
 localedata/locales/translit_circle | 2 +-
 localedata/locales/translit_cjk_compat | 2 +-
 localedata/locales/translit_combining | 2 +-
 localedata/locales/translit_compat | 2 +-
 localedata/locales/translit_font | 2 +-
 localedata/locales/translit_fraction | 2 +-
 localedata/unicode-gen/Makefile | 66 ++++++++-----------
 localedata/unicode-gen/UnicodeRelease.txt | 8 +++
 localedata/unicode-gen/gen_translit_circle.py | 20 +++---
 .../unicode-gen/gen_translit_cjk_compat.py | 20 +++---
 .../unicode-gen/gen_translit_combining.py | 20 +++---
 localedata/unicode-gen/gen_translit_compat.py | 20 +++---
 localedata/unicode-gen/gen_translit_font.py | 20 +++---
 .../unicode-gen/gen_translit_fraction.py | 20 +++---
 localedata/unicode-gen/gen_unicode_ctype.py | 50 ++++++--------
 localedata/unicode-gen/unicode_utils.py | 38 +++++++++++
 localedata/unicode-gen/utf8_compatibility.py | 27 ++++----
 localedata/unicode-gen/utf8_gen.py | 61 +++++++----------
 20 files changed, 189 insertions(+), 199 deletions(-)
 create mode 100644 localedata/unicode-gen/UnicodeRelease.txt

diff --git a/localedata/locales/i18n_ctype b/localedata/locales/i18n_ctype index c63e0790fc..f5063fe743 100644 --- a/localedata/locales/i18n_ctype +++ b/localedata/locales/i18n_ctype @@ -13,7 +13,7 @@ comment_char % % information, but with different transliterations, can include it % directly. -% Generated automatically by gen_unicode_ctype.py for Unicode 12.1.0. +% Generated automatically by gen_unicode_ctype.py.
LC_IDENTIFICATION title "Unicode 13.0.0 FDCC-set" @@ -26,7 +26,7 @@ fax "" language "" territory "Earth" revision "13.0.0" -date "2020-06-25" +date "2021-03-10" category "i18n:2012";LC_CTYPE END LC_IDENTIFICATION diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR index 7dbb923228..ff8b315b7b 100644 --- a/localedata/locales/tr_TR +++ b/localedata/locales/tr_TR @@ -43,7 +43,7 @@ fax "" language "Turkish" territory "Turkey" revision "1.0" -date "2020-06-25" +date "2021-03-10" category "i18n:2012";LC_IDENTIFICATION category "i18n:2012";LC_CTYPE diff --git a/localedata/locales/translit_circle b/localedata/locales/translit_circle index 5c07b44532..f2ef558e2d 100644 --- a/localedata/locales/translit_circle +++ b/localedata/locales/translit_circle @@ -9,7 +9,7 @@ comment_char % % otherwise be governed by that license. % Transliterations of encircled characters. -% Generated automatically from UnicodeData.txt by gen_translit_circle.py on 2020-06-25 for Unicode 13.0.0. +% Generated automatically from UnicodeData.txt by gen_translit_circle.py for Unicode 13.0.0. LC_CTYPE diff --git a/localedata/locales/translit_cjk_compat b/localedata/locales/translit_cjk_compat index ee0d7f83c6..2696445dbf 100644 --- a/localedata/locales/translit_cjk_compat +++ b/localedata/locales/translit_cjk_compat @@ -9,7 +9,7 @@ comment_char % % otherwise be governed by that license. % Transliterations of CJK compatibility characters. -% Generated automatically from UnicodeData.txt by gen_translit_cjk_compat.py on 2020-06-25 for Unicode 13.0.0. +% Generated automatically from UnicodeData.txt by gen_translit_cjk_compat.py for Unicode 13.0.0. LC_CTYPE diff --git a/localedata/locales/translit_combining b/localedata/locales/translit_combining index 36128f097a..b8e6b7efbd 100644 --- a/localedata/locales/translit_combining +++ b/localedata/locales/translit_combining @@ -10,7 +10,7 @@ comment_char % % Transliterations that remove all combining characters (accents, % pronounciation marks, etc.). -% Generated automatically from UnicodeData.txt by gen_translit_combining.py on 2020-06-25 for Unicode 13.0.0. +% Generated automatically from UnicodeData.txt by gen_translit_combining.py for Unicode 13.0.0. LC_CTYPE diff --git a/localedata/locales/translit_compat b/localedata/locales/translit_compat index ac24c4e938..61cdcccbc9 100644 --- a/localedata/locales/translit_compat +++ b/localedata/locales/translit_compat @@ -9,7 +9,7 @@ comment_char % % otherwise be governed by that license. % Transliterations of compatibility characters and ligatures. -% Generated automatically from UnicodeData.txt by gen_translit_compat.py on 2020-06-25 for Unicode 13.0.0. +% Generated automatically from UnicodeData.txt by gen_translit_compat.py for Unicode 13.0.0. LC_CTYPE diff --git a/localedata/locales/translit_font b/localedata/locales/translit_font index 680c4ed426..c3d7b44772 100644 --- a/localedata/locales/translit_font +++ b/localedata/locales/translit_font @@ -9,7 +9,7 @@ comment_char % % otherwise be governed by that license. % Transliterations of font equivalents. -% Generated automatically from UnicodeData.txt by gen_translit_font.py on 2020-06-25 for Unicode 13.0.0. +% Generated automatically from UnicodeData.txt by gen_translit_font.py for Unicode 13.0.0. 
LC_CTYPE diff --git a/localedata/locales/translit_fraction b/localedata/locales/translit_fraction index b52244969e..292fe3e806 100644 --- a/localedata/locales/translit_fraction +++ b/localedata/locales/translit_fraction @@ -9,7 +9,7 @@ comment_char % % otherwise be governed by that license. % Transliterations of fractions. -% Generated automatically from UnicodeData.txt by gen_translit_fraction.py on 2020-06-25 for Unicode 13.0.0. +% Generated automatically from UnicodeData.txt by gen_translit_fraction.py for Unicode 13.0.0. % The replacements have been surrounded with spaces, because fractions are % often preceded by a decimal number and followed by a unit or a math symbol. diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile index d0dd1b78a5..b5c9c5517b 100644 --- a/localedata/unicode-gen/Makefile +++ b/localedata/unicode-gen/Makefile @@ -18,11 +18,10 @@ # Makefile for generating and updating Unicode-extracted files. -# This Makefile is NOT used as part of the GNU libc build. It needs -# to be run manually, within the source tree, at Unicode upgrades -# (change UNICODE_VERSION below), to update ../locales/i18n_ctype ctype -# information (part of the file is preserved, so don't wipe it all -# out), and ../charmaps/UTF-8. +# This Makefile is NOT used as part of the GNU libc build. It needs to +# be run manually, within the source tree, at Unicode upgrades, to +# update ../locales/i18n_ctype ctype information (part of the file is +# preserved, so don't wipe it all out), and ../charmaps/UTF-8. # Use make all to generate the files used in the glibc build out of # the original Unicode files; make check to verify that they are what @@ -33,13 +32,14 @@ # running afoul of the LGPL corresponding sources requirements, even # though it's not clear that they are preferred over the generated # files for making modifications. - - -UNICODE_VERSION = 13.0.0 +# +# The UnicodeRelease.txt file must be updated manually to include the +# information about the downloaded Unicode release. PYTHON3 = python3 WGET = wget +RELEASEDATA = UnicodeRelease.txt DOWNLOADS = UnicodeData.txt DerivedCoreProperties.txt EastAsianWidth.txt PropList.txt GENERATED = i18n_ctype tr_TR UTF-8 translit_combining translit_compat translit_circle translit_cjk_compat translit_font translit_fraction REPORTS = i18n_ctype-report UTF-8-report @@ -66,12 +66,10 @@ mostlyclean: .PHONY: all check clean mostlyclean install -i18n_ctype: UnicodeData.txt DerivedCoreProperties.txt +i18n_ctype: UnicodeData.txt DerivedCoreProperties.txt $(RELEASEDATA) i18n_ctype: ../locales/i18n_ctype # Preserve non-ctype information. i18n_ctype: gen_unicode_ctype.py - $(PYTHON3) gen_unicode_ctype.py -u UnicodeData.txt \ - -d DerivedCoreProperties.txt -i ../locales/i18n_ctype -o $@ \ - --unicode_version $(UNICODE_VERSION) + $(PYTHON3) gen_unicode_ctype.py -i ../locales/i18n_ctype -o $@ i18n_ctype-report: i18n_ctype ../locales/i18n_ctype i18n_ctype-report: ctype_compatibility.py ctype_compatibility_test_cases.py @@ -86,55 +84,45 @@ check-i18n_ctype: i18n_ctype-report tr_TR: UnicodeData.txt DerivedCoreProperties.txt tr_TR: ../locales/tr_TR # Preserve non-ctype information. 
tr_TR: gen_unicode_ctype.py - $(PYTHON3) gen_unicode_ctype.py -u UnicodeData.txt \ - -d DerivedCoreProperties.txt -i ../locales/tr_TR -o $@ \ - --unicode_version $(UNICODE_VERSION) --turkish + $(PYTHON3) gen_unicode_ctype.py -i ../locales/tr_TR -o $@ \ + --turkish -UTF-8: UnicodeData.txt EastAsianWidth.txt +UTF-8: UnicodeData.txt EastAsianWidth.txt $(RELEASEDATA) UTF-8: utf8_gen.py - $(PYTHON3) utf8_gen.py -u UnicodeData.txt \ - -e EastAsianWidth.txt -p PropList.txt \ - --unicode_version $(UNICODE_VERSION) + $(PYTHON3) utf8_gen.py UTF-8-report: UTF-8 ../charmaps/UTF-8 UTF-8-report: utf8_compatibility.py - $(PYTHON3) ./utf8_compatibility.py -u UnicodeData.txt \ - -e EastAsianWidth.txt -o ../charmaps/UTF-8 \ + $(PYTHON3) ./utf8_compatibility.py -o ../charmaps/UTF-8 \ -n UTF-8 -a -m -c > $@ check-UTF-8: UTF-8-report @if grep '^Total.*: [^0]' UTF-8-report; \ then echo manual verification required; false; else true; fi -translit_combining: UnicodeData.txt +translit_combining: UnicodeData.txt $(RELEASEDATA) translit_combining: gen_translit_combining.py - $(PYTHON3) ./gen_translit_combining.py -u UnicodeData.txt \ - -o $@ --unicode_version $(UNICODE_VERSION) + $(PYTHON3) ./gen_translit_combining.py -o $@ -translit_compat: UnicodeData.txt +translit_compat: UnicodeData.txt $(RELEASEDATA) translit_compat: gen_translit_compat.py - $(PYTHON3) ./gen_translit_compat.py -u UnicodeData.txt \ - -o $@ --unicode_version $(UNICODE_VERSION) + $(PYTHON3) ./gen_translit_compat.py -o $@ -translit_circle: UnicodeData.txt +translit_circle: UnicodeData.txt $(RELEASEDATA) translit_circle: gen_translit_circle.py - $(PYTHON3) ./gen_translit_circle.py -u UnicodeData.txt \ - -o $@ --unicode_version $(UNICODE_VERSION) + $(PYTHON3) ./gen_translit_circle.py -o $@ -translit_cjk_compat: UnicodeData.txt +translit_cjk_compat: UnicodeData.txt $(RELEASEDATA) translit_cjk_compat: gen_translit_cjk_compat.py - $(PYTHON3) ./gen_translit_cjk_compat.py -u UnicodeData.txt \ - -o $@ --unicode_version $(UNICODE_VERSION) + $(PYTHON3) ./gen_translit_cjk_compat.py -o $@ -translit_font: UnicodeData.txt +translit_font: UnicodeData.txt $(RELEASEDATA) translit_font: gen_translit_font.py - $(PYTHON3) ./gen_translit_font.py -u UnicodeData.txt \ - -o $@ --unicode_version $(UNICODE_VERSION) + $(PYTHON3) ./gen_translit_font.py -o $@ -translit_fraction: UnicodeData.txt +translit_fraction: UnicodeData.txt $(RELEASEDATA) translit_fraction: gen_translit_fraction.py - $(PYTHON3) ./gen_translit_fraction.py -u UnicodeData.txt \ - -o $@ --unicode_version $(UNICODE_VERSION) + $(PYTHON3) ./gen_translit_fraction.py -o $@ .PHONY: downloads clean-downloads downloads: $(DOWNLOADS) diff --git a/localedata/unicode-gen/UnicodeRelease.txt b/localedata/unicode-gen/UnicodeRelease.txt new file mode 100644 index 0000000000..bd9cc14ae0 --- /dev/null +++ b/localedata/unicode-gen/UnicodeRelease.txt @@ -0,0 +1,8 @@ +% This metadata is used by glibc and updated by the developer(s) +% carrying out the Unicode update. 
+Version,13.0.0 +ReleaseDate,2021-03-10 +Data,UnicodeData.txt +DcpData,DerivedCoreProperties.txt +EawData,EastAsianWidth.txt +PlData,PropList.txt diff --git a/localedata/unicode-gen/gen_translit_circle.py b/localedata/unicode-gen/gen_translit_circle.py index a83dccc163..cc897b2f5f 100644 --- a/localedata/unicode-gen/gen_translit_circle.py +++ b/localedata/unicode-gen/gen_translit_circle.py @@ -67,7 +67,6 @@ def output_head(translit_file, unicode_version, head=''): translit_file.write('% Transliterations of encircled characters.\n') translit_file.write('% Generated automatically from UnicodeData.txt ' + 'by gen_translit_circle.py ' - + 'on {:s} '.format(time.strftime('%Y-%m-%d')) + 'for Unicode {:s}.\n'.format(unicode_version)) translit_file.write('\n') translit_file.write('LC_CTYPE\n') @@ -110,11 +109,11 @@ if __name__ == "__main__": Generate a translit_circle file from UnicodeData.txt. ''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -133,19 +132,16 @@ if __name__ == "__main__": “translit_start” line and the tail from the “translit_end” line to the end of the file will be copied unchanged into the output file. ''') - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) HEAD = TAIL = '' if ARGS.input_file: (HEAD, TAIL) = read_input_file(ARGS.input_file) with open(ARGS.output_file, mode='w') as TRANSLIT_FILE: - output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD) + output_head(TRANSLIT_FILE, unicode_version, head=HEAD) output_transliteration(TRANSLIT_FILE) output_tail(TRANSLIT_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/gen_translit_cjk_compat.py b/localedata/unicode-gen/gen_translit_cjk_compat.py index a040511d06..ac127a8e21 100644 --- a/localedata/unicode-gen/gen_translit_cjk_compat.py +++ b/localedata/unicode-gen/gen_translit_cjk_compat.py @@ -69,7 +69,6 @@ def output_head(translit_file, unicode_version, head=''): translit_file.write('characters.\n') translit_file.write('% Generated automatically from UnicodeData.txt ' + 'by gen_translit_cjk_compat.py ' - + 'on {:s} '.format(time.strftime('%Y-%m-%d')) + 'for Unicode {:s}.\n'.format(unicode_version)) translit_file.write('\n') translit_file.write('LC_CTYPE\n') @@ -180,11 +179,11 @@ if __name__ == "__main__": Generate a translit_cjk_compat file from UnicodeData.txt. ''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -203,19 +202,16 @@ if __name__ == "__main__": “translit_start” line and the tail from the “translit_end” line to the end of the file will be copied unchanged into the output file. 
''') - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) HEAD = TAIL = '' if ARGS.input_file: (HEAD, TAIL) = read_input_file(ARGS.input_file) with open(ARGS.output_file, mode='w') as TRANSLIT_FILE: - output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD) + output_head(TRANSLIT_FILE, unicode_version, head=HEAD) output_transliteration(TRANSLIT_FILE) output_tail(TRANSLIT_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/gen_translit_combining.py b/localedata/unicode-gen/gen_translit_combining.py index 88be8f4b8a..082c0da92c 100644 --- a/localedata/unicode-gen/gen_translit_combining.py +++ b/localedata/unicode-gen/gen_translit_combining.py @@ -69,7 +69,6 @@ def output_head(translit_file, unicode_version, head=''): translit_file.write('% pronounciation marks, etc.).\n') translit_file.write('% Generated automatically from UnicodeData.txt ' + 'by gen_translit_combining.py ' - + 'on {:s} '.format(time.strftime('%Y-%m-%d')) + 'for Unicode {:s}.\n'.format(unicode_version)) translit_file.write('\n') translit_file.write('LC_CTYPE\n') @@ -404,11 +403,11 @@ if __name__ == "__main__": Generate a translit_combining file from UnicodeData.txt. ''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -427,19 +426,16 @@ if __name__ == "__main__": “translit_start” line and the tail from the “translit_end” line to the end of the file will be copied unchanged into the output file. ''') - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) HEAD = TAIL = '' if ARGS.input_file: (HEAD, TAIL) = read_input_file(ARGS.input_file) with open(ARGS.output_file, mode='w') as TRANSLIT_FILE: - output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD) + output_head(TRANSLIT_FILE, unicode_version, head=HEAD) output_transliteration(TRANSLIT_FILE) output_tail(TRANSLIT_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/gen_translit_compat.py b/localedata/unicode-gen/gen_translit_compat.py index c8c63b23af..ba144e9bee 100644 --- a/localedata/unicode-gen/gen_translit_compat.py +++ b/localedata/unicode-gen/gen_translit_compat.py @@ -68,7 +68,6 @@ def output_head(translit_file, unicode_version, head=''): translit_file.write('and ligatures.\n') translit_file.write('% Generated automatically from UnicodeData.txt ' + 'by gen_translit_compat.py ' - + 'on {:s} '.format(time.strftime('%Y-%m-%d')) + 'for Unicode {:s}.\n'.format(unicode_version)) translit_file.write('\n') translit_file.write('LC_CTYPE\n') @@ -286,11 +285,11 @@ if __name__ == "__main__": Generate a translit_compat file from UnicodeData.txt. 
''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -309,19 +308,16 @@ if __name__ == "__main__": “translit_start” line and the tail from the “translit_end” line to the end of the file will be copied unchanged into the output file. ''') - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) HEAD = TAIL = '' if ARGS.input_file: (HEAD, TAIL) = read_input_file(ARGS.input_file) with open(ARGS.output_file, mode='w') as TRANSLIT_FILE: - output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD) + output_head(TRANSLIT_FILE, unicode_version, head=HEAD) output_transliteration(TRANSLIT_FILE) output_tail(TRANSLIT_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/gen_translit_font.py b/localedata/unicode-gen/gen_translit_font.py index db41b47fab..93b2f128fa 100644 --- a/localedata/unicode-gen/gen_translit_font.py +++ b/localedata/unicode-gen/gen_translit_font.py @@ -67,7 +67,6 @@ def output_head(translit_file, unicode_version, head=''): translit_file.write('% Transliterations of font equivalents.\n') translit_file.write('% Generated automatically from UnicodeData.txt ' + 'by gen_translit_font.py ' - + 'on {:s} '.format(time.strftime('%Y-%m-%d')) + 'for Unicode {:s}.\n'.format(unicode_version)) translit_file.write('\n') translit_file.write('LC_CTYPE\n') @@ -116,11 +115,11 @@ if __name__ == "__main__": Generate a translit_font file from UnicodeData.txt. ''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -139,19 +138,16 @@ if __name__ == "__main__": “translit_start” line and the tail from the “translit_end” line to the end of the file will be copied unchanged into the output file. 
''') - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) HEAD = TAIL = '' if ARGS.input_file: (HEAD, TAIL) = read_input_file(ARGS.input_file) with open(ARGS.output_file, mode='w') as TRANSLIT_FILE: - output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD) + output_head(TRANSLIT_FILE, unicode_version, head=HEAD) output_transliteration(TRANSLIT_FILE) output_tail(TRANSLIT_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/gen_translit_fraction.py b/localedata/unicode-gen/gen_translit_fraction.py index c3c1513eb9..097cb04ea0 100644 --- a/localedata/unicode-gen/gen_translit_fraction.py +++ b/localedata/unicode-gen/gen_translit_fraction.py @@ -67,7 +67,6 @@ def output_head(translit_file, unicode_version, head=''): translit_file.write('% Transliterations of fractions.\n') translit_file.write('% Generated automatically from UnicodeData.txt ' + 'by gen_translit_fraction.py ' - + 'on {:s} '.format(time.strftime('%Y-%m-%d')) + 'for Unicode {:s}.\n'.format(unicode_version)) translit_file.write('% The replacements have been surrounded ') translit_file.write('with spaces, because fractions are\n') @@ -157,11 +156,11 @@ if __name__ == "__main__": Generate a translit_cjk_compat file from UnicodeData.txt. ''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -180,19 +179,16 @@ if __name__ == "__main__": “translit_start” line and the tail from the “translit_end” line to the end of the file will be copied unchanged into the output file. 
''') - PARSER.add_argument( '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) HEAD = TAIL = '' if ARGS.input_file: (HEAD, TAIL) = read_input_file(ARGS.input_file) with open(ARGS.output_file, mode='w') as TRANSLIT_FILE: - output_head(TRANSLIT_FILE, ARGS.unicode_version, head=HEAD) + output_head(TRANSLIT_FILE, unicode_version, head=HEAD) output_transliteration(TRANSLIT_FILE) output_tail(TRANSLIT_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/gen_unicode_ctype.py b/localedata/unicode-gen/gen_unicode_ctype.py index 7548961df1..41760567cf 100755 --- a/localedata/unicode-gen/gen_unicode_ctype.py +++ b/localedata/unicode-gen/gen_unicode_ctype.py @@ -32,6 +32,7 @@ To see how this script is used, call it with the “-h” option: import argparse import time import re +import datetime import unicode_utils def code_point_ranges(is_class_function): @@ -123,7 +124,7 @@ def output_charmap(i18n_file, map_name, map_function): i18n_file.write(line+'\n') i18n_file.write('\n') -def read_input_file(filename): +def read_input_file(filename, unicode_release_date): '''Reads the original glibc i18n file to get the original head and tail. @@ -140,8 +141,13 @@ def read_input_file(filename): r'^(?P<key>date\s+)(?P<value>"[0-9]{4}-[0-9]{2}-[0-9]{2}")', line) if match: - line = match.group('key') \ + '"{:s}"\n'.format(time.strftime('%Y-%m-%d')) + # Update the file date if the Unicode standard date + # is newer. + orig_date = datetime.date.fromisoformat(match.group('value').strip('"')) + new_date = datetime.date.fromisoformat(unicode_release_date) + if new_date > orig_date: + line = match.group('key') \ + '"{:s}"\n'.format(unicode_release_date) head = head + line if line.startswith('LC_CTYPE'): break @@ -153,7 +159,7 @@ def read_input_file(filename): tail = tail + line return (head, tail) -def output_head(i18n_file, unicode_version, head=''): +def output_head(i18n_file, unicode_version, unicode_release_date, head=''): '''Write the header of the output file, i.e. the part of the file before the “LC_CTYPE” line. ''' @@ -180,8 +186,7 @@ i18n_file.write('language ""\n') i18n_file.write('territory "Earth"\n') i18n_file.write('revision "{:s}"\n'.format(unicode_version)) - i18n_file.write('date "{:s}"\n'.format( - time.strftime('%Y-%m-%d'))) + i18n_file.write('date "{:s}"\n'.format(unicode_release_date)) i18n_file.write('category "i18n:2012";LC_CTYPE\n') i18n_file.write('END LC_IDENTIFICATION\n') i18n_file.write('\n') @@ -267,18 +272,11 @@ if __name__ == "__main__": UnicodeData.txt and DerivedCoreProperties.txt files.
''') PARSER.add_argument( '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' - + 'default: %(default)s')) - PARSER.add_argument( - '-d', '--derived_core_properties_file', - nargs='?', - type=str, - default='DerivedCoreProperties.txt', - help=('The DerivedCoreProperties.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) PARSER.add_argument( '-i', '--input_file', @@ -298,27 +296,21 @@ if __name__ == "__main__": classes and the date stamp in LC_IDENTIFICATION will be copied unchanged into the output file. ''') - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') PARSER.add_argument( '--turkish', action='store_true', help='Use Turkish case conversions.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes( - ARGS.unicode_data_file) - unicode_utils.fill_derived_core_properties( - ARGS.derived_core_properties_file) + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_release_date = unicode_utils.release_date(ARGS.unicode_data_dir) + unicode_utils.fill_attributes(unicode_utils.release_data_file(ARGS.unicode_data_dir)) + unicode_utils.fill_derived_core_properties(unicode_utils.release_dcp_file(ARGS.unicode_data_dir)) unicode_utils.verifications() HEAD = TAIL = '' if ARGS.input_file: - (HEAD, TAIL) = read_input_file(ARGS.input_file) + (HEAD, TAIL) = read_input_file(ARGS.input_file, unicode_release_date) with open(ARGS.output_file, mode='w') as I18N_FILE: - output_head(I18N_FILE, ARGS.unicode_version, head=HEAD) - output_tables(I18N_FILE, ARGS.unicode_version, ARGS.turkish) + output_head(I18N_FILE, unicode_version, unicode_release_date, head=HEAD) + output_tables(I18N_FILE, unicode_version, ARGS.turkish) output_tail(I18N_FILE, tail=TAIL) diff --git a/localedata/unicode-gen/unicode_utils.py b/localedata/unicode-gen/unicode_utils.py index 3263f4510b..2b7c6aaa45 100644 --- a/localedata/unicode-gen/unicode_utils.py +++ b/localedata/unicode-gen/unicode_utils.py @@ -525,3 +525,41 @@ def verifications(): and (is_graph(code_point) or code_point == 0x0020)): sys.stderr.write('%(sym)s is graph|<space> but not print\n' %{ 'sym': unicode_utils.ucs_symbol(code_point)}) + +def release_metadata(data_dir, parameter): + ''' Parse the UnicodeRelease.txt metadata and return the value for + the specified parameter.''' + value = "" + with open(data_dir + '/' + "UnicodeRelease.txt", "r") as f: + for line in f: + if line.strip().startswith('%'): + continue + fields = line.strip().split(",") + if fields[0] == parameter: + value = fields[1].strip() + assert value != "" + return value + +def release_version(data_dir): + ''' Return the Unicode version of the data in use.''' + return release_metadata(data_dir, "Version") + +def release_date(data_dir): + ''' Return the release date for the Unicode version of the data.''' + return release_metadata(data_dir, "ReleaseDate") + +def release_data_file(data_dir): + ''' The name of the primary data file.''' + return data_dir + '/' + release_metadata(data_dir, 'Data') + +def release_dcp_file(data_dir): + ''' The name of the derived core properties data file.''' + return data_dir + '/' + release_metadata(data_dir, 'DcpData') + +def release_eaw_file(data_dir): + ''' The name of the East Asian width data file.''' + return data_dir + '/' + release_metadata(data_dir, 'EawData') + +def 
release_pl_file(data_dir): + ''' The name of the properties list data file.''' + return data_dir + '/' + release_metadata(data_dir, 'PlData') diff --git a/localedata/unicode-gen/utf8_compatibility.py b/localedata/unicode-gen/utf8_compatibility.py index eca2e8cddc..7e485ba759 100755 --- a/localedata/unicode-gen/utf8_compatibility.py +++ b/localedata/unicode-gen/utf8_compatibility.py @@ -216,6 +216,13 @@ if __name__ == "__main__": description=''' Compare the contents of LC_CTYPE in two files and check for errors. ''') + PARSER.add_argument( + '-u', '--unicode_data_dir', + nargs='?', + type=str, + default='.', + help=('The directory containing Unicode data to read, ' + + 'default: %(default)s')) PARSER.add_argument( '-o', '--old_utf8_file', nargs='?', @@ -228,16 +235,6 @@ required=True, type=str, help='The new UTF-8 file.') - PARSER.add_argument( - '-u', '--unicode_data_file', - nargs='?', - type=str, - help='The UnicodeData.txt file to read.') - PARSER.add_argument( - '-e', '--east_asian_width_file', - nargs='?', - type=str, - help='The EastAsianWidth.txt file to read.') PARSER.add_argument( '-a', '--show_added_characters', action='store_true', @@ -252,9 +249,11 @@ help='Show characters whose width was changed in detail.') ARGS = PARSER.parse_args() - if ARGS.unicode_data_file: - unicode_utils.fill_attributes(ARGS.unicode_data_file) - if ARGS.east_asian_width_file: - unicode_utils.fill_east_asian_widths(ARGS.east_asian_width_file) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + east_asian_width_file = unicode_utils.release_eaw_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) + unicode_utils.fill_east_asian_widths(east_asian_width_file) + check_charmap(ARGS.old_utf8_file, ARGS.new_utf8_file) check_width(ARGS.old_utf8_file, ARGS.new_utf8_file) diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py index 899840923a..4fc3038fe0 100755 --- a/localedata/unicode-gen/utf8_gen.py +++ b/localedata/unicode-gen/utf8_gen.py @@ -22,7 +22,7 @@ This script generates a glibc/localedata/charmaps/UTF-8 file from Unicode data.
-Usage: python3 utf8_gen.py UnicodeData.txt EastAsianWidth.txt +Usage: python3 utf8_gen.py It will output UTF-8 file ''' @@ -198,23 +198,27 @@ def write_header_charmap(outfile): outfile.write("% alias ISO-10646/UTF-8\n") outfile.write("CHARMAP\n") -def write_header_width(outfile, unicode_version): +def write_header_width(outfile, unicode_data_dir): '''Writes the header on top of the WIDTH section to the output file''' + unicode_version = unicode_utils.release_version(unicode_data_dir) + unicode_data = unicode_utils.release_metadata(unicode_data_dir, 'Data') + eaw_data = unicode_utils.release_metadata(unicode_data_dir, 'EawData') + pl_data = unicode_utils.release_metadata(unicode_data_dir, 'PlData') outfile.write('% Character width according to Unicode ' + '{:s}.\n'.format(unicode_version)) outfile.write('% - Default width is 1.\n') outfile.write('% - Double-width characters have width 2; generated from\n') - outfile.write('% "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n') + outfile.write('% "grep \'^[^;]*;[WF]\' ' + eaw_data + '"\n') outfile.write('% - Non-spacing characters have width 0; ' - + 'generated from PropList.txt or\n') + + 'generated from ' + pl_data + ' or\n') outfile.write('% "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' ' - + 'UnicodeData.txt"\n') + + unicode_data + '"\n') outfile.write('% - Format control characters have width 0; ' + 'generated from\n') - outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n") + outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' " + unicode_data + "\"\n") # Not needed covered by Cf # outfile.write("% - Zero width characters have width 0; generated from\n") -# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n") +# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' " + unicode_data + "\"\n") outfile.write("WIDTH\n") def process_width(outfile, ulines, elines, plines): @@ -302,41 +306,26 @@ def process_width(outfile, ulines, elines, plines): if __name__ == "__main__": PARSER = argparse.ArgumentParser( description=''' - Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt. + Generate a UTF-8 file from the Unicode release data files. 
''') PARSER.add_argument( - '-u', '--unicode_data_file', + '-u', '--unicode_data_dir', nargs='?', type=str, - default='UnicodeData.txt', - help=('The UnicodeData.txt file to read, ' + default='.', + help=('The directory containing Unicode data to read, ' + 'default: %(default)s')) - PARSER.add_argument( - '-e', '--east_asian_with_file', - nargs='?', - type=str, - default='EastAsianWidth.txt', - help=('The EastAsianWidth.txt file to read, ' - + 'default: %(default)s')) - PARSER.add_argument( - '-p', '--prop_list_file', - nargs='?', - type=str, - default='PropList.txt', - help=('The PropList.txt file to read, ' - + 'default: %(default)s')) - PARSER.add_argument( - '--unicode_version', - nargs='?', - required=True, - type=str, - help='The Unicode version of the input files used.') ARGS = PARSER.parse_args() - unicode_utils.fill_attributes(ARGS.unicode_data_file) - with open(ARGS.unicode_data_file, mode='r') as UNIDATA_FILE: + unicode_version = unicode_utils.release_version(ARGS.unicode_data_dir) + unicode_data_file = unicode_utils.release_data_file(ARGS.unicode_data_dir) + east_asian_width_file = unicode_utils.release_eaw_file(ARGS.unicode_data_dir) + prop_list_file = unicode_utils.release_pl_file(ARGS.unicode_data_dir) + + unicode_utils.fill_attributes(unicode_data_file) + with open(unicode_data_file, mode='r') as UNIDATA_FILE: UNICODE_DATA_LINES = UNIDATA_FILE.readlines() - with open(ARGS.east_asian_with_file, mode='r') as EAST_ASIAN_WIDTH_FILE: + with open(east_asian_width_file, mode='r') as EAST_ASIAN_WIDTH_FILE: EAST_ASIAN_WIDTH_LINES = [] for LINE in EAST_ASIAN_WIDTH_FILE: # If characters from EastAasianWidth.txt which are from @@ -352,7 +341,7 @@ if __name__ == "__main__": continue if re.match(r'^[^;]*;[WF]', LINE): EAST_ASIAN_WIDTH_LINES.append(LINE.strip()) - with open(ARGS.prop_list_file, mode='r') as PROP_LIST_FILE: + with open(prop_list_file, mode='r') as PROP_LIST_FILE: PROP_LIST_LINES = [] for LINE in PROP_LIST_FILE: if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE): @@ -363,7 +352,7 @@ if __name__ == "__main__": process_charmap(UNICODE_DATA_LINES, OUTFILE) OUTFILE.write("END CHARMAP\n\n") # Processing EastAsianWidth.txt and write WIDTH to UTF-8 file - write_header_width(OUTFILE, ARGS.unicode_version) + write_header_width(OUTFILE, ARGS.unicode_data_dir) process_width(OUTFILE, UNICODE_DATA_LINES, EAST_ASIAN_WIDTH_LINES,
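To make the new mechanism easy to review, here is a small standalone
sketch (not part of the patch) of how the two pieces fit together: the
UnicodeRelease.txt parser and the "keep the newer date" rule applied to
the LC_IDENTIFICATION date stamps. release_metadata() mirrors the helper
added to unicode_utils.py above; pick_date() is a hypothetical name used
only for this illustration (the patch inlines the equivalent comparison
in read_input_file() in gen_unicode_ctype.py).

import datetime

def release_metadata(data_dir, parameter):
    '''Return the value recorded for PARAMETER in UnicodeRelease.txt.'''
    value = ""
    with open(data_dir + '/UnicodeRelease.txt', 'r') as f:
        for line in f:
            if line.strip().startswith('%'):
                continue                     # '%' lines are comments
            fields = line.strip().split(',')
            if fields[0] == parameter:
                value = fields[1].strip()
    assert value != ""
    return value

def pick_date(existing_date, release_date):
    '''Keep the existing LC_IDENTIFICATION date unless the Unicode
    release date is newer, so repeated regeneration never changes an
    already-current date.'''
    if (datetime.date.fromisoformat(release_date)
            > datetime.date.fromisoformat(existing_date)):
        return release_date
    return existing_date

# i18n_ctype previously carried 2020-06-25; the Unicode 13.0.0 release
# date is 2021-03-10, so the date advances once and is then stable:
print(pick_date('2020-06-25', '2021-03-10'))  # -> 2021-03-10
print(pick_date('2021-03-10', '2021-03-10'))  # -> 2021-03-10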