From patchwork Mon Sep 6 15:43:34 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Carlos O'Donell
X-Patchwork-Id: 44864
To: libc-alpha@sourceware.org
Subject: [PATCH v12 0/2] C.UTF-8
Date: Mon, 6 Sep 2021 11:43:34 -0400
Message-Id: <20210906154336.610973-1-carlos@redhat.com>
X-Mailer: git-send-email 2.31.1
List-Id: Libc-alpha mailing list
X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha
From: Carlos O'Donell
Reply-To: Carlos O'Donell

The following changes implement a minimally sized C.UTF-8.

First we implement the 'codepoint_collation' directive. Then we
implement C.UTF-8 with an LC_COLLATE that uses the
'codepoint_collation' directive to support using strcmp or wcscmp for
collation, i.e. code point sorting.

The final C.UTF-8 is only ~396KiB, with the largest part, ~346KiB, in
LC_CTYPE for all of Unicode.

v12 fixes the commit message to match NEWS.

v11 fixes a defect in the tst-regex.c test. All modified tests were
reviewed for similar defects and none were found. v11 also removes an
obsolete Contributed-by line in C-collate-seq.c.

v10 fixes a defect in the transbug.c test.

v9 is rebased against the changes to remove ISO-8859-1 characters from
the bug-regex1.c test (69623c0db0a540f26ee537bae09446d3dcdf1f80).

v8 includes a NEWS entry for the updated C.UTF-8.

v7 fixed the regressions detected in Fedora Rawhide
(https://bugzilla.redhat.com/show_bug.cgi?id=1986421). It does so by
generating identity tables for _NL_COLLATE_COLLSEQMB and
_NL_COLLATE_COLLSEQWC to provide mappings for ASCII characters. This
ensures that static applications using the new C.UTF-8 have a
functioning fnmatch, regcomp, and regexec for ASCII ranges. This raises
the size of LC_COLLATE from 92 to 1406 bytes. Valgrind reports no
errors using the tables with C.UTF-8 under tst-fnmatch.

v7 also corrected collation sequence byte ordering on BE targets. I
verified this by cross-building locales with localedef --big-endian and
confirming that a C.UTF-8 built natively on s390x is the same as a
C.UTF-8 built on x86_64 with --big-endian.

The fixes that were in v4 for nrules == 0 will be included in the next
release of glibc. When those are proven correct, they can be backported
to provide dynamic or newly compiled static applications with the
ability to use all code points in ranges.
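As background for why strcmp can serve as the collation function here:
UTF-8 is designed so that comparing encoded strings byte by byte (as
unsigned bytes, which is what strcmp does) gives the same order as
comparing the code points numerically. The sketch below is illustrative
only and is not part of the patches; the helper names encode_utf8 and
byte_order_matches_codepoint_order are hypothetical and do not exist in
glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not glibc code: encode one Unicode scalar value
   as a NUL-terminated UTF-8 string in BUF (at least 5 bytes).  Returns
   the number of encoded bytes, or 0 for surrogates and out-of-range
   values.  */
size_t
encode_utf8 (uint32_t cp, char *buf)
{
  unsigned char *p = (unsigned char *) buf;
  size_t n;

  assert (buf != NULL);
  if (cp < 0x80)
    {
      p[0] = (unsigned char) cp;
      n = 1;
    }
  else if (cp < 0x800)
    {
      p[0] = (unsigned char) (0xC0 | (cp >> 6));
      p[1] = (unsigned char) (0x80 | (cp & 0x3F));
      n = 2;
    }
  else if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* Surrogates are not scalar values.  */
      p[0] = (unsigned char) (0xE0 | (cp >> 12));
      p[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      p[2] = (unsigned char) (0x80 | (cp & 0x3F));
      n = 3;
    }
  else if (cp <= 0x10FFFF)
    {
      p[0] = (unsigned char) (0xF0 | (cp >> 18));
      p[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      p[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      p[3] = (unsigned char) (0x80 | (cp & 0x3F));
      n = 4;
    }
  else
    return 0;

  p[n] = '\0';
  return n;
}

/* Hypothetical check: return 1 if strcmp on the UTF-8 encodings orders
   A and B the same way as comparing the code points numerically, and 0
   otherwise (including when either value cannot be encoded).  */
int
byte_order_matches_codepoint_order (uint32_t a, uint32_t b)
{
  char ea[5], eb[5];

  if (encode_utf8 (a, ea) == 0 || encode_utf8 (b, eb) == 0)
    return 0;

  int r = strcmp (ea, eb);
  int by_bytes = (r > 0) - (r < 0);     /* Normalize to -1, 0, or 1.  */
  int by_value = (a > b) - (a < b);
  return by_bytes == by_value;
}
```

For example, byte_order_matches_codepoint_order (0xFFFF, 0x10000)
compares a three-byte sequence against a four-byte sequence and still
agrees with numeric order, because a longer UTF-8 sequence always has a
larger lead byte.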
Carlos O'Donell (2):
  Add 'codepoint_collation' support for LC_COLLATE.
  Add generic C.UTF-8 locale (Bug 17318)

 NEWS                             |  10 +-
 iconv/Makefile                   |  22 +-
 iconv/tst-iconv9.c               |  87 +++++
 locale/C-collate-seq.c           | 100 ++++++
 locale/C-collate.c               |  78 +----
 locale/programs/ld-collate.c     |  36 +-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 299 ++++++++---------
 locale/programs/locfile-token.h  |   1 +
 localedata/C.UTF-8.in            | 157 +++++++++
 localedata/Makefile              |   2 +
 localedata/SUPPORTED             |   1 +
 localedata/locales/C             | 194 +++++++++++
 posix/Makefile                   |  16 +-
 posix/bug-regex1.c               |  20 ++
 posix/bug-regex19.c              |  22 +-
 posix/bug-regex4.c               |  25 ++
 posix/bug-regex6.c               |   2 +-
 posix/transbug.c                 |  24 +-
 posix/tst-fnmatch.input          | 549 ++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c    |   1 +
 posix/tst-regex.c                |  33 +-
 22 files changed, 1417 insertions(+), 263 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 locale/C-collate-seq.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C