From patchwork Thu May 19 21:06:38 2022
X-Patchwork-Submitter: Florian Weimer
X-Patchwork-Id: 54239
From: Florian Weimer
To: libc-alpha@sourceware.org
Subject: [PATCH 3/5] locale: Introduce translate_unicode_codepoint into linereader.c
Date: Thu, 19 May 2022 23:06:38 +0200

This will permit reusing the Unicode character processing for different
character encodings, not just the current encoding.

Reviewed-by: Carlos O'Donell
Tested-by: Carlos O'Donell
---
 locale/programs/linereader.c | 167 ++++++++++++++++++-----------------
 1 file changed, 85 insertions(+), 82 deletions(-)

diff --git a/locale/programs/linereader.c b/locale/programs/linereader.c
index d5367e0a1e..f7292f0102 100644
--- a/locale/programs/linereader.c
+++ b/locale/programs/linereader.c
@@ -596,6 +596,83 @@ get_ident (struct linereader *lr)
   return &lr->token;
 }
 
+/* Process a decoded Unicode codepoint WCH in a string, placing the
+   multibyte sequence into LRB.  Return false if the character is not
+   found in CHARMAP/REPERTOIRE.  */
+static bool
+translate_unicode_codepoint (struct localedef_t *locale,
+			     const struct charmap_t *charmap,
+			     const struct repertoire_t *repertoire,
+			     uint32_t wch, struct lr_buffer *lrb)
+{
+  /* See whether the charmap contains the Uxxxxxxxx names.  */
+  char utmp[10];
+  snprintf (utmp, sizeof (utmp), "U%08X", wch);
+  struct charseq *seq = charmap_find_value (charmap, utmp, 9);
+
+  if (seq == NULL)
+    {
+      /* No, this isn't the case.  Now determine from
+	 the repertoire the name of the character and
+	 find it in the charmap.  */
+      if (repertoire != NULL)
+	{
+	  const char *symbol = repertoire_find_symbol (repertoire, wch);
+	  if (symbol != NULL)
+	    seq = charmap_find_value (charmap, symbol, strlen (symbol));
+	}
+
+      if (seq == NULL)
+	{
+#ifndef NO_TRANSLITERATION
+	  /* Transliterate if possible.  */
+	  if (locale != NULL)
+	    {
+	      if ((locale->avail & CTYPE_LOCALE) == 0)
+		{
+		  /* Load the CTYPE data now.  */
+		  int old_needed = locale->needed;
+
+		  locale->needed = 0;
+		  locale = load_locale (LC_CTYPE, locale->name,
+					locale->repertoire_name,
+					charmap, locale);
+		  locale->needed = old_needed;
+		}
+
+	      uint32_t *translit;
+	      if ((locale->avail & CTYPE_LOCALE) != 0
+		  && ((translit = find_translit (locale, charmap, wch))
+		      != NULL))
+		/* The CTYPE data contains a matching
+		   transliteration.  */
+		{
+		  for (int i = 0; translit[i] != 0; ++i)
+		    {
+		      snprintf (utmp, sizeof (utmp), "U%08X", translit[i]);
+		      seq = charmap_find_value (charmap, utmp, 9);
+		      assert (seq != NULL);
+		      adds (lrb, seq->bytes, seq->nbytes);
+		    }
+		  return true;
+		}
+	    }
+#endif /* NO_TRANSLITERATION */
+
+	  /* Not a known name.  */
+	  return false;
+	}
+    }
+
+  if (seq != NULL)
+    {
+      adds (lrb, seq->bytes, seq->nbytes);
+      return true;
+    }
+  else
+    return false;
+}
+
 
 static struct token *
 get_string (struct linereader *lr, const struct charmap_t *charmap,
@@ -635,7 +712,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
     }
   else
     {
-      int illegal_string = 0;
+      bool illegal_string = false;
       size_t buf2act = 0;
       size_t buf2max = 56 * sizeof (uint32_t);
       int ch;
@@ -695,7 +772,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 	    {
 	      /* <> is no correct name.  Ignore it and also signal an
 		 error.  */
-	      illegal_string = 1;
+	      illegal_string = true;
 	      continue;
 	    }
 
@@ -709,8 +786,6 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 
 	      if (cp == &lrb.buf[lrb.act])
 		{
-		  char utmp[10];
-
 		  /* Yes, it is.  */
 		  addc (&lrb, '\0');
 		  wch = strtoul (lrb.buf + startidx + 1, NULL, 16);
@@ -721,81 +796,9 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 		  if (return_widestr)
 		    ADDWC (wch);
 
-		  /* See whether the charmap contains the Uxxxxxxxx names.  */
-		  snprintf (utmp, sizeof (utmp), "U%08X", wch);
-		  seq = charmap_find_value (charmap, utmp, 9);
-
-		  if (seq == NULL)
-		    {
-		      /* No, this isn't the case.  Now determine from
-			 the repertoire the name of the character and
-			 find it in the charmap.  */
-		      if (repertoire != NULL)
-			{
-			  const char *symbol;
-
-			  symbol = repertoire_find_symbol (repertoire, wch);
-
-			  if (symbol != NULL)
-			    seq = charmap_find_value (charmap, symbol,
-						      strlen (symbol));
-			}
-
-		      if (seq == NULL)
-			{
-#ifndef NO_TRANSLITERATION
-			  /* Transliterate if possible.  */
-			  if (locale != NULL)
-			    {
-			      uint32_t *translit;
-
-			      if ((locale->avail & CTYPE_LOCALE) == 0)
-				{
-				  /* Load the CTYPE data now.  */
-				  int old_needed = locale->needed;
-
-				  locale->needed = 0;
-				  locale = load_locale (LC_CTYPE,
-							locale->name,
-							locale->repertoire_name,
-							charmap, locale);
-				  locale->needed = old_needed;
-				}
-
-			      if ((locale->avail & CTYPE_LOCALE) != 0
-				  && ((translit = find_translit (locale,
-								 charmap, wch))
-				      != NULL))
-				/* The CTYPE data contains a matching
-				   transliteration.  */
-				{
-				  int i;
-
-				  for (i = 0; translit[i] != 0; ++i)
-				    {
-				      char utmp[10];
-
-				      snprintf (utmp, sizeof (utmp), "U%08X",
-						translit[i]);
-				      seq = charmap_find_value (charmap, utmp,
-								9);
-				      assert (seq != NULL);
-				      adds (&lrb, seq->bytes, seq->nbytes);
-				    }
-
-				  continue;
-				}
-			    }
-#endif /* NO_TRANSLITERATION */
-
-			  /* Not a known name.  */
-			  illegal_string = 1;
-			}
-		    }
-
-		  if (seq != NULL)
-		    adds (&lrb, seq->bytes, seq->nbytes);
-
+		  if (!translate_unicode_codepoint (locale, charmap,
+						    repertoire, wch, &lrb))
+		    illegal_string = true;
 		  continue;
 		}
 	    }
@@ -812,7 +815,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 	      /* This name is not in the charmap.  */
 	      lr_error (lr, _("symbol `%.*s' not in charmap"),
 			(int) (lrb.act - startidx), &lrb.buf[startidx]);
-	      illegal_string = 1;
+	      illegal_string = true;
 	    }
 
 	  if (return_widestr)
@@ -833,7 +836,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
 		  /* This name is not in the repertoire map.  */
 		  lr_error (lr, _("symbol `%.*s' not in repertoire map"),
 			    (int) (lrb.act - startidx), &lrb.buf[startidx]);
-		  illegal_string = 1;
+		  illegal_string = true;
 		}
 	      else
 		ADDWC (wch);
@@ -850,7 +853,7 @@ get_string (struct linereader *lr, const struct charmap_t *charmap,
       if (ch == '\n' || ch == EOF)
 	{
 	  lr_error (lr, _("unterminated string"));
-	  illegal_string = 1;
+	  illegal_string = true;
 	}
 
       if (illegal_string)
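
Illustration only, not part of the patch: a minimal standalone sketch of the first lookup step that translate_unicode_codepoint performs, namely formatting the codepoint as a "Uxxxxxxxx" name and looking that name up. The toy_* names and the two-entry table are hypothetical stand-ins for charmap_find_value and a real charmap. The repertoire and transliteration fallbacks are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a charmap entry: a symbolic name and the
   byte sequence that encodes it in the target character set.  */
struct toy_entry
{
  const char *name;
  const char *bytes;
  size_t nbytes;
};

/* Toy single-byte charmap with two characters (assumption for the demo).  */
static const struct toy_entry toy_charmap[] =
  {
    { "U00000041", "A", 1 },
    { "U000000E9", "\xE9", 1 },
  };

/* Build the "Uxxxxxxxx" name for WCH.  'U', eight hex digits and the
   terminating NUL fit exactly into ten bytes.  */
static void
toy_ucs_name (uint32_t wch, char utmp[10])
{
  snprintf (utmp, 10, "U%08X", (unsigned int) wch);
}

/* Append the encoding of WCH to OUT (capacity OUTSIZE, current length
   *OUTLEN, which must not exceed OUTSIZE).  Return false and leave OUT
   untouched if WCH has no entry or does not fit.  */
static bool
toy_translate (uint32_t wch, char *out, size_t outsize, size_t *outlen)
{
  char utmp[10];
  toy_ucs_name (wch, utmp);

  for (size_t i = 0; i < sizeof (toy_charmap) / sizeof (toy_charmap[0]); ++i)
    if (strcmp (toy_charmap[i].name, utmp) == 0)
      {
        const struct toy_entry *e = &toy_charmap[i];
        if (e->nbytes > outsize - *outlen)
          return false;
        memcpy (out + *outlen, e->bytes, e->nbytes);
        *outlen += e->nbytes;
        return true;
      }
  return false;
}
```

A caller would use the return value the way get_string does after this patch: record an error flag on false and carry on with the next character.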