From patchwork Wed Aug 25 13:34:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Carlos O'Donell X-Patchwork-Id: 44791 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 165AF3858427 for ; Wed, 25 Aug 2021 13:35:22 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 165AF3858427 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1629898522; bh=/kc2WTzH2bmmgWV/XiXwzcSNVMNib31SQefCGOLm85w=; h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=YRRFHtd7qltFLpL0UzhKsA/uOOHV9km8dCmeeXX3WpTudPEDOOtXdfgVIgSJHZIw2 XhQsjsb0OfdIhDJKgVMyL0gDZrQRr3tBNQJymKMu0dtooX4IrAxBsWuHhmLOGuhiwi CSSmwY+4cgzleuCyG1V2UxhGrFt124chm8v7/vxQ= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTP id 5219D3858039 for ; Wed, 25 Aug 2021 13:34:52 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 5219D3858039 Received: from mail-qt1-f200.google.com (mail-qt1-f200.google.com [209.85.160.200]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-22-v7ecExQpPF6fbnLO6dK7Tw-1; Wed, 25 Aug 2021 09:34:47 -0400 X-MC-Unique: v7ecExQpPF6fbnLO6dK7Tw-1 Received: by mail-qt1-f200.google.com with SMTP id x11-20020ac86b4b000000b00299d7592d31so12698566qts.0 for ; Wed, 25 Aug 2021 06:34:46 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=/kc2WTzH2bmmgWV/XiXwzcSNVMNib31SQefCGOLm85w=; b=fnqGkEEahlL+XTsKzaywP6H1oukal4gBmyWdo9XUGLZ7i8BQmiJiWEfjVN3GIucQ2n 
qH/pqldIMcFGJKkdPJpF4Eh+zeisovp0LXRA0z0shZBTjLRqkLGLB1Mi9Y0LJYwvPLoM nARyGHvVKDIPUiFb4k1x76lQrfKn2Xw5kSGe7eRlHiM50ZNKPi9Kkz5h0jusylWS7UYk ZhFTK/MhWnLO0QRmFwNtxka9XUmk6OPiyUc/nvkd5q3AUQsjZRBDSbx1mMyswQOwGYDc /mg3bf+1qpuemDV870lOrOtM9u568kcR7tKAbpceffM4gMnyv5O6cnsHZzH/MtA8Lt3G 4DjA== X-Gm-Message-State: AOAM5301NpNTQzH8+SA2yQxgNqT1+plPnHThSBIhwEv9+RJS0+j9Kkqt YLL/3l0fO11nROX/hvKyOnUbPNyLI1VJmmy7uCHnK8dhy2Ylxxen2Mj9LwctmrW+NHH0RPH/nAF /OoRk/L/LBZoX0n3tSMG9FBEcCOZ2gFvaIHxKPJeidr+pq+lm+diOc5zZLbaI/f2YKcwMUw== X-Received: by 2002:ae9:f005:: with SMTP id l5mr32223453qkg.355.1629898486072; Wed, 25 Aug 2021 06:34:46 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxjIZnl5B1u28SWm3wxdfxYoPARoL/pW7faaCSum9rQ/oWl+Mp/1LxdTOnU6a3kfwV/s0a0vg== X-Received: by 2002:ae9:f005:: with SMTP id l5mr32223428qkg.355.1629898485739; Wed, 25 Aug 2021 06:34:45 -0700 (PDT) Received: from athas.redhat.com (198-84-214-74.cpe.teksavvy.com. [198.84.214.74]) by smtp.gmail.com with ESMTPSA id v14sm12948500qkb.88.2021.08.25.06.34.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 25 Aug 2021 06:34:44 -0700 (PDT) To: libc-alpha@sourceware.org, fweimer@redhat.com Subject: [PATCH v8 0/2] C.UTF-8 Date: Wed, 25 Aug 2021 09:34:35 -0400 Message-Id: <20210825133437.2529307-1-carlos@redhat.com> X-Mailer: git-send-email 2.31.1 MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com X-Spam-Status: No, score=-6.0 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, KAM_NUMSUBJECT, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Carlos O'Donell via Libc-alpha From: Carlos 
O'Donell Reply-To: Carlos O'Donell Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha"

The following changes implement a minimally sized C.UTF-8.

First we implement the 'codepoint_collation' directive. Then we implement C.UTF-8 with an LC_COLLATE that uses the 'codepoint_collation' directive to support using strcmp or wcscmp for collation, i.e. code point sorting.

The final C.UTF-8 is only ~396KiB, with the largest part, ~346KiB, in LC_CTYPE for all of Unicode.

v8 includes a NEWS entry for the updated C.UTF-8.

v7 fixed the regressions detected in Fedora Rawhide here: https://bugzilla.redhat.com/show_bug.cgi?id=1986421, and did so by generating identity tables for _NL_COLLATE_COLLSEQMB and _NL_COLLATE_COLLSEQWC to provide mappings for ASCII characters. This ensures that static applications using the new C.UTF-8 have a functioning fnmatch, regcomp, and regexec for ASCII ranges. This raises the size of LC_COLLATE from 92 to 1406 bytes. Valgrind reports no errors using the tables with C.UTF-8 under tst-fnmatch.

v7 also corrected collation sequence byte ordering on BE targets. I verified this by cross-building locales with localedef --big-endian and confirming that a natively built s390x C.UTF-8 is the same as an x86_64 C.UTF-8 built with --big-endian.

The fixes that were in v4 for nrules == 0 will be included in the next release of glibc, and when those are proven correct they can be backported to provide dynamic or newly compiled static applications with the ability to use all code points in ranges.

Carlos O'Donell (2):
  Add 'codepoint_collation' support for LC_COLLATE.
  Add generic C.UTF-8 locale (Bug 17318)

 NEWS                             |  10 +-
 iconv/Makefile                   |  22 +-
 iconv/tst-iconv9.c               |  87 +++++
 locale/C-collate-seq.c           | 101 ++++++
 locale/C-collate.c               |  78 +----
 locale/programs/ld-collate.c     |  36 +-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 299 ++++++++---------
 locale/programs/locfile-token.h  |   1 +
 localedata/C.UTF-8.in            | 157 +++++++++
 localedata/Makefile              |   2 +
 localedata/SUPPORTED             |   1 +
 localedata/locales/C             | 194 +++++++++++
 posix/Makefile                   |  16 +-
 posix/bug-regex1.c               |  20 ++
 posix/bug-regex19.c              |  22 +-
 posix/bug-regex4.c               |  25 ++
 posix/bug-regex6.c               |   2 +-
 posix/transbug.c                 |  22 +-
 posix/tst-fnmatch.input          | 549 ++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c    |   1 +
 posix/tst-regex.c                |  25 +-
 22 files changed, 1413 insertions(+), 258 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 locale/C-collate-seq.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C
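
As an illustration of why code point collation can be served by strcmp: in well-formed UTF-8, byte-wise comparison orders strings the same way as comparing their decoded code points. The following is a minimal standalone sketch of that property only. It is not code from this series and does not touch the locale machinery; the names utf8_next, codepoint_cmp, orders_agree and self_check are hypothetical, and the decoder assumes valid UTF-8 input without checking it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one code point from well-formed UTF-8
   at *s and advance *s past it.  No validation is performed.  */
uint32_t
utf8_next (const unsigned char **s)
{
  const unsigned char *p = *s;
  uint32_t cp;
  int extra;

  if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
  else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
  else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
  else                  { cp = p[0] & 0x07; extra = 3; }

  p++;
  while (extra-- > 0)
    cp = (cp << 6) | (*p++ & 0x3F);

  *s = p;
  return cp;
}

/* Hypothetical helper: compare two well-formed UTF-8 strings by
   code point value.  Returns -1, 0 or 1.  */
int
codepoint_cmp (const char *a, const char *b)
{
  const unsigned char *pa = (const unsigned char *) a;
  const unsigned char *pb = (const unsigned char *) b;

  while (*pa != '\0' && *pb != '\0')
    {
      uint32_t ca = utf8_next (&pa);
      uint32_t cb = utf8_next (&pb);
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }
  /* The shorter string sorts first when one is a prefix of the other.  */
  return (*pa != '\0') - (*pb != '\0');
}

/* Reduce a comparison result to -1, 0 or 1.  */
int
sign (int v)
{
  return (v > 0) - (v < 0);
}

/* Return 1 if strcmp and code point comparison order A and B alike.  */
int
orders_agree (const char *a, const char *b)
{
  return sign (strcmp (a, b)) == sign (codepoint_cmp (a, b));
}

/* Spot checks across 1-, 2-, 3- and 4-byte sequences.  */
void
self_check (void)
{
  assert (orders_agree ("z", "\xC3\xA9"));                  /* U+007A, U+00E9 */
  assert (orders_agree ("\xC3\xA9", "\xE2\x82\xAC"));       /* U+00E9, U+20AC */
  assert (orders_agree ("\xE2\x82\xAC", "\xF0\x9F\x98\x80")); /* U+20AC, U+1F600 */
  assert (orders_agree ("abc", "ab"));
}
```

Calling self_check () from a test program exercises the property on a few samples. It is a spot check under the stated assumptions, not a proof, and it says nothing about how the 'codepoint_collation' directive is implemented in ld-collate.c.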