From: Carlos O'Donell
To: libc-alpha@sourceware.org
Subject: [PATCH v6 0/2] C.UTF-8
Date: Fri, 30 Jul 2021 15:48:43 -0400
Message-Id: <20210730194845.2165087-1-carlos@redhat.com>
X-Patchwork-Id: 44539

The following changes implement a minimally sized C.UTF-8.

First we implement the 'strcmp_collation' directive. Then we implement
C.UTF-8 with an LC_COLLATE that uses the 'strcmp_collation' directive so
that strcmp is used for collation, i.e. code point sorting.

The final C.UTF-8 is only ~396KiB, with the largest part, ~346KiB, in
LC_CTYPE for all of Unicode.

v6 fixes the regressions detected in Fedora Rawhide
(https://bugzilla.redhat.com/show_bug.cgi?id=1986421). It does so by
generating identity tables for _NL_COLLATE_COLLSEQMB and
_NL_COLLATE_COLLSEQWC to provide mappings for ASCII characters. This
ensures that static applications using the new C.UTF-8 have a functioning
fnmatch, regcomp, and regexec for ASCII ranges. This raises the size of
LC_COLLATE from 92 to 1406 bytes. Valgrind reports no errors using the
tables with C.UTF-8 under tst-fnmatch.

v6 also corrects collation sequence byte ordering on BE targets. I
verified this by cross-building locales with localedef --big-endian and
confirming that a C.UTF-8 built natively on s390x is the same as a
C.UTF-8 built on x86_64 with --big-endian.

The fixes that were in v4 for nrules == 0 will be included in the next
release of glibc, and when those are proven correct they can be
backported to provide dynamic or newly compiled static applications with
the ability to use all code points in ranges.

Carlos O'Donell (2):
  Add 'strcmp_collation' support for LC_COLLATE.
  Add generic C.UTF-8 locale (Bug 17318)

 iconv/Makefile                   |  22 +-
 iconv/tst-iconv9.c               |  87 +++++
 locale/C-collate-seq.c           |  97 ++++++
 locale/C-collate.c               |  78 +----
 locale/programs/ld-collate.c     |  36 +-
 locale/programs/locfile-kw.gperf |   1 +
 locale/programs/locfile-kw.h     | 306 ++++++++---------
 locale/programs/locfile-token.h  |   1 +
 localedata/C.UTF-8.in            | 157 +++++++++
 localedata/Makefile              |   2 +
 localedata/SUPPORTED             |   1 +
 localedata/locales/C             | 194 +++++++++++
 posix/Makefile                   |  16 +-
 posix/bug-regex1.c               |  20 ++
 posix/bug-regex19.c              |  22 +-
 posix/bug-regex4.c               |  25 ++
 posix/bug-regex6.c               |   2 +-
 posix/transbug.c                 |  22 +-
 posix/tst-fnmatch.input          | 549 ++++++++++++++++++++++++++++++-
 posix/tst-regcomp-truncated.c    |   1 +
 posix/tst-regex.c                |  25 +-
 21 files changed, 1404 insertions(+), 260 deletions(-)
 create mode 100644 iconv/tst-iconv9.c
 create mode 100644 locale/C-collate-seq.c
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C
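
As background on why strcmp can stand in for code point sorting in a
UTF-8 locale: comparing well-formed UTF-8 strings byte by byte (as
unsigned bytes) gives the same order as comparing their decoded code
points. The following is only an illustrative sketch and is not code
from this series. The helpers decode_utf8, codepoint_cmp, and
strcmp_matches_codepoint_order are hypothetical names, and the decoder
assumes well-formed input and performs no validation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): decode one UTF-8 sequence at
   *s and advance *s past it.  Assumes well-formed input.  */
static uint32_t
decode_utf8 (const unsigned char **s)
{
  const unsigned char *p = *s;
  uint32_t cp;
  int extra;

  assert (p != NULL);
  if (p[0] < 0x80)
    { cp = p[0]; extra = 0; }            /* 1-byte sequence (ASCII).  */
  else if ((p[0] & 0xE0) == 0xC0)
    { cp = p[0] & 0x1F; extra = 1; }     /* 2-byte sequence.  */
  else if ((p[0] & 0xF0) == 0xE0)
    { cp = p[0] & 0x0F; extra = 2; }     /* 3-byte sequence.  */
  else
    { cp = p[0] & 0x07; extra = 3; }     /* 4-byte sequence.  */

  for (int i = 1; i <= extra; i++)
    cp = (cp << 6) | (p[i] & 0x3F);      /* Append continuation bits.  */

  *s = p + extra + 1;
  return cp;
}

/* Compare two strings code point by code point; returns -1, 0, or 1.  */
static int
codepoint_cmp (const char *a, const char *b)
{
  const unsigned char *pa = (const unsigned char *) a;
  const unsigned char *pb = (const unsigned char *) b;

  for (;;)
    {
      if (*pa == '\0' || *pb == '\0')
        return (*pa != '\0') - (*pb != '\0');  /* Shorter string first.  */
      uint32_t ca = decode_utf8 (&pa);
      uint32_t cb = decode_utf8 (&pb);
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }
}

/* Returns 1 if strcmp orders A and B the same way as codepoint_cmp.  */
static int
strcmp_matches_codepoint_order (const char *a, const char *b)
{
  int r = strcmp (a, b);
  return ((r > 0) - (r < 0)) == codepoint_cmp (a, b);
}
```

For example, strcmp_matches_codepoint_order ("Z", "\xC3\xA9") compares
U+005A against U+00E9 and is expected to report agreement, since both
the byte comparison and the code point comparison place "Z" first.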