From patchwork Thu May 19 21:06:42 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Florian Weimer
X-Patchwork-Id: 54240
From: Florian Weimer
To: libc-alpha@sourceware.org
Subject: [PATCH 4/5] locale: localedef input files are now encoded in UTF-8
Date: Thu, 19 May 2022 23:06:42 +0200
X-From-Line: bab1c8587126515188cb6104cf6eba85d2e813e5 Mon Sep 17 00:00:00 2001

Previously, localedef input files were assumed to be encoded in
ISO-8859-1, and the output charset was assumed to overlap with
ISO-8859-1 for the characters actually used.  However, this did not
work as intended on many architectures even for an ISO-8859-1 output
encoding because of the char signedness bug in lr_getc.  Therefore,
this commit switches to UTF-8 without making provisions for backwards
compatibility.
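
The class of signedness problem mentioned above can be illustrated with a small standalone sketch. This is a generic illustration and not the actual lr_getc code; read_byte_signed, read_byte_unsigned, and sample are hypothetical names. The buffer uses signed char explicitly so that the behavior does not depend on the platform's plain char.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* EOF */

/* Hypothetical reader that returns the byte through a signed char,
   as happens on targets where plain char is signed.  Bytes >= 0x80
   come back negative and can be confused with EOF.  */
static int
read_byte_signed (const signed char *buf, size_t idx)
{
  assert (buf != NULL);
  return buf[idx];
}

/* Hypothetical corrected reader: convert to unsigned char first, so
   the result is always in the range 0..255.  */
static int
read_byte_unsigned (const signed char *buf, size_t idx)
{
  assert (buf != NULL);
  return (unsigned char) buf[idx];
}

/* The bytes 0xe9 and 0xff, written as signed char values.  */
static const signed char sample[] = { -23, -1 };
```

With the signed reader, a check such as ch >= 0x80 never fires for these bytes, and the second byte is negative like EOF. The unsigned reader keeps both bytes distinguishable from EOF.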

The following Elisp code can be used to convert locale definition
files to UTF-8:

(defun glibc/convert-localedef (from to)
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region from to)
      (goto-char (point-min))
      (save-match-data
        (while (re-search-forward "<U\\([0-9A-F]+\\)>" nil t)
          (let* ((codepoint (string-to-number (match-string 1) 16))
                 (converted
                  (cond
                   ((memq codepoint '(?/ ?\ ?< ?>))
                    (string ?/ codepoint))
                   ((= codepoint ?\")
                    "<U0022>")
                   (t
                    (string codepoint)))))
            (replace-match converted t)))))))

Reviewed-by: Carlos O'Donell
Tested-by: Carlos O'Donell
---
 NEWS                         |   4 +
 locale/programs/linereader.c | 144 ++++++++++++++++++++++++++++++++---
 2 files changed, 137 insertions(+), 11 deletions(-)

diff --git a/NEWS b/NEWS
index ad0c08d8ca..7ce0d8a135 100644
--- a/NEWS
+++ b/NEWS
@@ -20,6 +20,10 @@ Major new features:
   have been added.  The pidfd functionality provides access to a process
   while avoiding the issue of PID reuse on tranditional Unix systems.
 
+* localedef now accepts locale definition files encoded in UTF-8.
+  Previously, input bytes not within the ASCII range resulted in
+  unpredictable output.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * Support for prelink will be removed in the next release; this includes
diff --git a/locale/programs/linereader.c b/locale/programs/linereader.c
index f7292f0102..b484327969 100644
--- a/locale/programs/linereader.c
+++ b/locale/programs/linereader.c
@@ -42,6 +42,7 @@ static struct token *get_string (struct linereader *lr,
                                  struct localedef_t *locale,
                                  const struct repertoire_t *repertoire,
                                  int verbose);
+static bool utf8_decode (struct linereader *lr, uint8_t ch1, uint32_t *wch);
 
 
 struct linereader *
@@ -327,6 +328,17 @@ lr_token (struct linereader *lr, const struct charmap_t *charmap,
             }
           lr_ungetn (lr, 2);
           break;
+
+        case 0x80 ... 0xff:     /* UTF-8 sequence.  */
+          uint32_t wch;
+          if (!utf8_decode (lr, ch, &wch))
+            {
+              lr->token.tok = tok_error;
+              return &lr->token;
+            }
+          lr->token.tok = tok_ucs4;
+          lr->token.val.ucs4 = wch;
+          return &lr->token;
         }
 
       return get_ident (lr);
@@ -673,6 +685,87 @@ translate_unicode_codepoint (struct localedef_t *locale,
   return false;
 }
 
+/* Returns true if ch is not EOF (that is, non-negative) and a valid
+   UTF-8 trailing byte.  */
+static bool
+utf8_valid_trailing (int ch)
+{
+  return ch >= 0 && (ch & 0xc0) == 0x80;
+}
+
+/* Reports an error for a broken UTF-8 sequence.  CH2 to CH4 may be
+   EOF.  Always returns false.  */
+static bool
+utf8_sequence_error (struct linereader *lr, uint8_t ch1, int ch2, int ch3,
+                     int ch4)
+{
+  char buf[30];
+
+  if (ch2 < 0)
+    snprintf (buf, sizeof (buf), "0x%02x", ch1);
+  else if (ch3 < 0)
+    snprintf (buf, sizeof (buf), "0x%02x 0x%02x", ch1, ch2);
+  else if (ch4 < 0)
+    snprintf (buf, sizeof (buf), "0x%02x 0x%02x 0x%02x", ch1, ch2, ch3);
+  else
+    snprintf (buf, sizeof (buf), "0x%02x 0x%02x 0x%02x 0x%02x",
+              ch1, ch2, ch3, ch4);
+
+  lr_error (lr, _("invalid UTF-8 sequence %s"), buf);
+  return false;
+}
+
+/* Reads a UTF-8 sequence from LR, with the leading byte CH1, and
+   stores the decoded codepoint in *WCH.  Returns false on failure and
+   reports an error.  */
+static bool
+utf8_decode (struct linereader *lr, uint8_t ch1, uint32_t *wch)
+{
+  /* See RFC 3629 section 4 and __gconv_transform_utf8_internal.  */
+  if (ch1 < 0xc2)
+    return utf8_sequence_error (lr, ch1, -1, -1, -1);
+
+  int ch2 = lr_getc (lr);
+  if (!utf8_valid_trailing (ch2))
+    return utf8_sequence_error (lr, ch1, ch2, -1, -1);
+
+  if (ch1 <= 0xdf)
+    {
+      uint32_t result = ((ch1 & 0x1f) << 6) | (ch2 & 0x3f);
+      if (result < 0x80)
+        return utf8_sequence_error (lr, ch1, ch2, -1, -1);
+      *wch = result;
+      return true;
+    }
+
+  int ch3 = lr_getc (lr);
+  if (!utf8_valid_trailing (ch3) || ch1 < 0xe0)
+    return utf8_sequence_error (lr, ch1, ch2, ch3, -1);
+
+  if (ch1 <= 0xef)
+    {
+      uint32_t result = (((ch1 & 0x0f) << 12)
+                         | ((ch2 & 0x3f) << 6)
+                         | (ch3 & 0x3f));
+      if (result < 0x800)
+        return utf8_sequence_error (lr, ch1, ch2, ch3, -1);
+      *wch = result;
+      return true;
+    }
+
+  int ch4 = lr_getc (lr);
+  if (!utf8_valid_trailing (ch4) || ch1 < 0xf0 || ch1 > 0xf4)
+    return utf8_sequence_error (lr, ch1, ch2, ch3, ch4);
+
+  uint32_t result = (((ch1 & 0x07) << 18)
+                     | ((ch2 & 0x3f) << 12)
+                     | ((ch3 & 0x3f) << 6)
+                     | (ch4 & 0x3f));
+  if (result < 0x10000)
+    return utf8_sequence_error (lr, ch1, ch2, ch3, ch4);
+  *wch = result;
+  return true;
+}
 
 static struct token *
 get_string (struct linereader *lr, const struct charmap_t *charmap,
@@ -696,7 +789,11 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 
       buf2 = NULL;
       while ((ch = lr_getc (lr)) != '"' && ch != '\n' && ch != EOF)
-        addc (&lrb, ch);
+        {
+          if (ch >= 0x80)
+            lr_error (lr, _("illegal 8-bit character in untranslated string"));
+          addc (&lrb, ch);
+        }
 
       /* Catch errors with trailing escape character.  */
       if (lrb.act > 0 && lrb.buf[lrb.act - 1] == lr->escape_char
@@ -730,24 +827,49 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 
           if (ch != '<')
             {
-              /* The standards leave it up to the implementation to decide
-                 what to do with character which stand for themself.  We
-                 could jump through hoops to find out the value relative to
-                 the charmap and the repertoire map, but instead we leave
-                 it up to the locale definition author to write a better
-                 definition.  We assume here that every character which
-                 stands for itself is encoded using ISO 8859-1.  Using the
-                 escape character is allowed.  */
+              /* The standards leave it up to the implementation to
+                 decide what to do with characters which stand for
+                 themselves.  This implementation treats the input
+                 file as encoded in UTF-8.  */
               if (ch == lr->escape_char)
                 {
                   ch = lr_getc (lr);
+                  if (ch >= 0x80)
+                    {
+                      lr_error (lr, _("illegal 8-bit escape sequence"));
+                      illegal_string = true;
+                      break;
+                    }
                   if (ch == '\n' || ch == EOF)
                     break;
+                  addc (&lrb, ch);
+                  wch = ch;
+                }
+              else if (ch < 0x80)
+                {
+                  wch = ch;
+                  addc (&lrb, ch);
+                }
+              else /* UTF-8 sequence.  */
+                {
+                  if (!utf8_decode (lr, ch, &wch))
+                    {
+                      illegal_string = true;
+                      break;
+                    }
+                  if (!translate_unicode_codepoint (locale, charmap,
+                                                    repertoire, wch, &lrb))
+                    {
+                      /* Ignore the rest of the string.  Callers may
+                         skip this string because it cannot be encoded
+                         in the output character set.  */
+                      illegal_string = true;
+                      continue;
+                    }
                 }
 
-              addc (&lrb, ch);
               if (return_widestr)
-                ADDWC ((uint32_t) ch);
+                ADDWC (wch);
 
               continue;
             }
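
For experimenting with the acceptance rules of utf8_decode outside of localedef, the following is a minimal standalone sketch, assuming a plain byte buffer in place of the linereader. decode_one, byte_at, and is_trailing are hypothetical names and not part of the patch. The sketch rejects lead bytes above 0xf4 before reading further bytes, which is meant to yield the same accept/reject result as the patch but does not consume input in the same way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same test as utf8_valid_trailing in the patch.  */
static bool
is_trailing (int ch)
{
  return ch >= 0 && (ch & 0xc0) == 0x80;
}

/* Hypothetical helper: byte IDX of BUF, or -1 past the end (standing
   in for lr_getc returning EOF).  */
static int
byte_at (const uint8_t *buf, size_t len, size_t idx)
{
  return idx < len ? buf[idx] : -1;
}

/* Decodes one non-ASCII sequence starting at BUF[0].  Returns the
   number of bytes consumed, or 0 if the sequence is rejected.  */
static size_t
decode_one (const uint8_t *buf, size_t len, uint32_t *wch)
{
  assert (buf != NULL && wch != NULL);

  int ch1 = byte_at (buf, len, 0);
  if (ch1 < 0xc2 || ch1 > 0xf4)
    return 0;

  int ch2 = byte_at (buf, len, 1);
  if (!is_trailing (ch2))
    return 0;
  if (ch1 <= 0xdf)
    {
      /* Lead bytes >= 0xc2 cannot produce an overlong two-byte form.  */
      *wch = ((uint32_t) (ch1 & 0x1f) << 6) | (ch2 & 0x3f);
      return 2;
    }

  int ch3 = byte_at (buf, len, 2);
  if (!is_trailing (ch3))
    return 0;
  if (ch1 <= 0xef)
    {
      uint32_t r = (((uint32_t) (ch1 & 0x0f) << 12)
                    | ((ch2 & 0x3f) << 6)
                    | (ch3 & 0x3f));
      if (r < 0x800)            /* Overlong three-byte form.  */
        return 0;
      *wch = r;
      return 3;
    }

  int ch4 = byte_at (buf, len, 3);
  if (!is_trailing (ch4))
    return 0;
  uint32_t r = (((uint32_t) (ch1 & 0x07) << 18)
                | ((ch2 & 0x3f) << 12)
                | ((ch3 & 0x3f) << 6)
                | (ch4 & 0x3f));
  if (r < 0x10000)              /* Overlong four-byte form.  */
    return 0;
  *wch = r;
  return 4;
}
```

Following the checks visible in the diff, the sketch does not reject encoded surrogates such as ED A0 80 inside the decoder itself. Whether a later step such as translate_unicode_codepoint filters them is not shown here.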