[0/5] Assume UTF-8 encoding for localedef input files
Message ID | cover.1652994079.git.fweimer@redhat.com |
---|---|
Message
Florian Weimer
May 19, 2022, 9:06 p.m. UTC
This is a backwards-compatible change because of two localedef bugs that
cause bytes outside the ASCII range to produce unpredictable results:

  If char is signed, conversion from the assumed ISO-8859-1 input format
  to a UCS-4 codepoint does not produce the correct result.

  If the output character set is not overlapping ISO-8859-1 in the
  characters used in the locale, the required character set conversion
  is not applied.

This is why I think we can switch to UTF-8 without impacting backwards
compatibility, and there is no need for an option to restore the old
behavior.

Tested on i686-linux-gnu and x86_64-linux-gnu.

Thanks,
Florian

Florian Weimer (5):
  locale: Turn ADDC and ADDS into functions in linereader.c
  locale: Fix signed char bug in lr_getc
  locale: Introduce translate_unicode_codepoint into linereader.c
  locale: localdef input files are now encoded in UTF-8
  de_DE: Convert to UTF-8

 NEWS                         |   4 +
 locale/programs/linereader.c | 504 ++++++++++++++++++++++-------------
 locale/programs/linereader.h |   2 +-
 localedata/locales/de_DE     |  32 +--
 4 files changed, 338 insertions(+), 204 deletions(-)

base-commit: 2d5ec6692f5746ccb11db60976a6481ef8e9d74f
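
For readers unfamiliar with the first bug described in the cover letter, here is a minimal sketch of how widening a byte through a signed char can corrupt a codepoint. It is an illustration only, not the code from linereader.c: the function names latin1_to_ucs4_buggy and latin1_to_ucs4_fixed are hypothetical, and signed char is spelled out explicitly so the behavior does not depend on whether plain char is signed on a given ABI.

```c
#include <stdint.h>

/* Hypothetical illustration, not glibc code.
   In ISO-8859-1 every byte value equals its Unicode codepoint, so the
   conversion to UCS-4 should be a plain zero-extension of the byte.  */

/* Buggy pattern: the byte is held in a signed char and widened directly.
   A byte such as 0xE4 (LATIN SMALL LETTER A WITH DIAERESIS) has the value
   -28 as a signed char, and converting -28 to uint32_t wraps modulo 2^32.  */
uint32_t
latin1_to_ucs4_buggy (signed char byte)
{
  return (uint32_t) byte;
}

/* Fixed pattern: reinterpret the byte as unsigned char first, so that
   values 0x80..0xFF stay in the range 128..255 before widening.  */
uint32_t
latin1_to_ucs4_fixed (signed char byte)
{
  return (uint32_t) (unsigned char) byte;
}
```

With a signed char holding -28 (the bit pattern 0xE4 on two's complement systems), the buggy variant yields 0xFFFFFFE4, which is not a valid codepoint, while the fixed variant yields 0xE4. ASCII bytes are unaffected by either variant, which is consistent with the cover letter's point that only bytes outside the ASCII range misbehaved.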
Comments
On 5/19/22 17:06, Florian Weimer via Libc-alpha wrote:
> This is a backwards-compatible change because of two localedef bugs that
> cause bytes outside the ASCII range to produce unpredictable results:
>
>   If char is signed, conversion from the assumed ISO-8859-1 input format
>   to a UCS-4 codepoint does not produce the correct result.
>
>   If the output character set is not overlapping ISO-8859-1 in the
>   characters used in the locale, the required character set conversion
>   is not applied.
>
> This is why I think we can switch to UTF-8 without impacting backwards
> compatibility, and there is no need for an option to restore the old
> behavior.

I can agree with that. In some sense I think the parsing of locale files is something that can require developers and users to adjust the syntax, though we'd like for it to be backwards compatible. In this case it couldn't have worked.

Thank you for working on this to make locales easier to use. I particularly appreciate the example conversion of de_DE.

Overall the series looks good and we should commit this ahead of glibc 2.36 so we can get any new strings translated for the TP project. This series particularly adds some error messages for the use of UTF-8 in the locale sources.

Again, I really appreciate that this makes it easier for natural language speakers to write, adjust, and review locale sources. In cases where disambiguation is required we still have the capacity to write it differently if we need to. This continues the early work to convert from U-codes to ASCII.

Just like last time we had this discussion the idea that glibc would support compiling locale sources on a system that lacks UTF-8 is no longer a requirement that we should have for the library.

> Tested on i686-linux-gnu and x86_64-linux-gnu.
>
> Thanks,
> Florian
>
> Florian Weimer (5):
>   locale: Turn ADDC and ADDS into functions in linereader.c
>   locale: Fix signed char bug in lr_getc
>   locale: Introduce translate_unicode_codepoint into linereader.c
>   locale: localdef input files are now encoded in UTF-8
>   de_DE: Convert to UTF-8
>
>  NEWS                         |   4 +
>  locale/programs/linereader.c | 504 ++++++++++++++++++++++-------------
>  locale/programs/linereader.h |   2 +-
>  localedata/locales/de_DE     |  32 +--
>  4 files changed, 338 insertions(+), 204 deletions(-)
>
>
> base-commit: 2d5ec6692f5746ccb11db60976a6481ef8e9d74f