From patchwork Wed Oct 27 05:29:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Fangrui Song X-Patchwork-Id: 46688 Return-Path: X-Original-To: patchwork@sourceware.org Delivered-To: patchwork@sourceware.org Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 2C30E385AC1F for ; Wed, 27 Oct 2021 05:31:06 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2C30E385AC1F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1635312666; bh=MAZPH8iFKVZXcnAdIeG69BwfegDnDIFdbIGPYfiqd+c=; h=Date:Subject:To:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=fFc62o/9JiYaPENpqYfYfHetjvXgZ0hw4P18OpPZY0PCMDfgnGK49xCAnmvJNfzb8 GN4phOzj1kEpJvh66gnsq3FNTWMK1ty0krv9GKny3lxL4/GQuNlyTUC7wmvVCOtlmK Z/Tp77zuteFHxxNxLuxG2f8JGEcwWfq3LUR2MQLM= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-yb1-xb4a.google.com (mail-yb1-xb4a.google.com [IPv6:2607:f8b0:4864:20::b4a]) by sourceware.org (Postfix) with ESMTPS id 78161385843C for ; Wed, 27 Oct 2021 05:30:43 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 78161385843C Received: by mail-yb1-xb4a.google.com with SMTP id h185-20020a256cc2000000b005bdce4db0easo2218564ybc.12 for ; Tue, 26 Oct 2021 22:30:43 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc; bh=MAZPH8iFKVZXcnAdIeG69BwfegDnDIFdbIGPYfiqd+c=; b=XZU6zgjVF6tY80LVuq6FgTTRyQWDQhLcHY2dj+cUokVJ/gOCfhP6PXy7/LslWbH6Zb dPUn7/BEhdsoB4htcxg5Elkw4hqzdHvy6PAz6jcS20ZuiXJ6FinJuY/D9Bp3i+B7sbCO kk5xykTHOLjkhc0FlEznMEghG4oBO5kyAoY4ykh/P7j4p8CEQSNmnjaez85q+Vwn/81x kZUqftwJPcbv2DC4thIWduwDK/uSZqyuaFVsab4KRknAQ9zskgM9BlOjiwLpZGnJ/4oz jDn0fBVbfxCnDJBK+7SVdwFwX8P2RCE6WiJf1DDKt0M6VAjjFFIT36qgi2yAIzae+Rgp nWHw== X-Gm-Message-State: AOAM531LkLPgIHgogFLxjOMDfVf7yWYIchju6orFtGiuBVzjgaUcQMar hw0gBQ1N1vKcL1hfLUz+yVIPece3HhvwtA/cz7prH3ScmGfZS6V5m/blhs+4/Zx3ro6yD9pt0DM PdLCPKXoEf0KJi1oIcZBl4WJwrPh/FtaUwkLqLwIkasFJ9Hm0vgkah2Ee9/8PQklYAEjo X-Google-Smtp-Source: ABdhPJwFMl8IKUZlmv9wN3swxApminltcPkGxPKW3BR8kap0ZP75M1XaskvDM/NanT7Zrr5mRCXYZtJ6r0E0 X-Received: from maskray1.svl.corp.google.com ([2620:15c:2ce:200:36db:1f90:3716:bd72]) (user=maskray job=sendgmr) by 2002:a05:6902:1505:: with SMTP id q5mr32051996ybu.201.1635312642474; Tue, 26 Oct 2021 22:30:42 -0700 (PDT) Date: Tue, 26 Oct 2021 22:29:59 -0700 Message-Id: <20211027052959.2549214-1-maskray@google.com> Mime-Version: 1.0 Subject: [PATCH] regex: Unnest nested functions in regcomp.c To: libc-alpha@sourceware.org, Paul Eggert X-Spam-Status: No, score=-18.6 required=5.0 tests=BAYES_00, DKIMWL_WL_MED, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_INFOUSMEBIZ, KAM_STOCKGEN, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Fangrui Song via Libc-alpha From: Fangrui Song Reply-To: Fangrui Song Errors-To: libc-alpha-bounces+patchwork=sourceware.org@sourceware.org Sender: "Libc-alpha" If the non-buildable `sysdeps/generic/dl-machine.h` doesn't count, this patch removes the last `auto inline` usage and makes the file buildable by Clang once we can resolve the "asm label after first use" issue[1]. collseqwc, table_size, symb_table, extra are now initialized to appease GCC -Werror=maybe-uninitialized false positive. [1]: https://maskray.me/blog/2021-10-10-when-can-glibc-be-built-with-clang#asm-label-after-first-use --- posix/regcomp.c | 514 +++++++++++++++++++++++++----------------------- 1 file changed, 266 insertions(+), 248 deletions(-) diff --git a/posix/regcomp.c b/posix/regcomp.c index 887e5b5068..e2b5c2807a 100644 --- a/posix/regcomp.c +++ b/posix/regcomp.c @@ -2831,6 +2831,263 @@ build_collating_symbol (bitset_t sbcset, const unsigned char *name) } #endif /* not _LIBC */ +#ifdef _LIBC +/* Local function for parse_bracket_exp used in _LIBC environment. + Seek the collating symbol entry corresponding to NAME. + Return the index of the symbol in the SYMB_TABLE, + or -1 if not found. */ + +static inline int32_t +__attribute__ ((always_inline)) +seek_collating_symbol_entry (const unsigned char *name, size_t name_len, + const int32_t *symb_table, int32_t table_size, + const unsigned char *extra) +{ + int32_t elem; + + for (elem = 0; elem < table_size; elem++) + if (symb_table[2 * elem] != 0) + { + int32_t idx = symb_table[2 * elem + 1]; + /* Skip the name of collating element name. */ + idx += 1 + extra[idx]; + if (/* Compare the length of the name. */ + name_len == extra[idx] + /* Compare the name. */ + && memcmp (name, &extra[idx + 1], name_len) == 0) + /* Yep, this is the entry. */ + return elem; + } + return -1; +} + +/* Local function for parse_bracket_exp used in _LIBC environment. + Look up the collation sequence value of BR_ELEM. + Return the value if succeeded, UINT_MAX otherwise. */ + +static inline unsigned int +__attribute__ ((always_inline)) +lookup_collation_sequence_value (bracket_elem_t *br_elem, uint32_t nrules, + const unsigned char *collseqmb, + const char *collseqwc, int32_t table_size, + const int32_t *symb_table, + const unsigned char *extra) +{ + if (br_elem->type == SB_CHAR) + { + /* if (MB_CUR_MAX == 1) */ + if (nrules == 0) + return collseqmb[br_elem->opr.ch]; + else + { + wint_t wc = __btowc (br_elem->opr.ch); + return __collseq_table_lookup (collseqwc, wc); + } + } + else if (br_elem->type == MB_CHAR) + { + if (nrules != 0) + return __collseq_table_lookup (collseqwc, br_elem->opr.wch); + } + else if (br_elem->type == COLL_SYM) + { + size_t sym_name_len = strlen ((char *) br_elem->opr.name); + if (nrules != 0) + { + int32_t elem, idx; + elem = seek_collating_symbol_entry (br_elem->opr.name, + sym_name_len, + symb_table, table_size, + extra); + if (elem != -1) + { + /* We found the entry. */ + idx = symb_table[2 * elem + 1]; + /* Skip the name of collating element name. */ + idx += 1 + extra[idx]; + /* Skip the byte sequence of the collating element. */ + idx += 1 + extra[idx]; + /* Adjust for the alignment. */ + idx = (idx + 3) & ~3; + /* Skip the multibyte collation sequence value. */ + idx += sizeof (unsigned int); + /* Skip the wide char sequence of the collating element. */ + idx += sizeof (unsigned int) * + (1 + *(unsigned int *) (extra + idx)); + /* Return the collation sequence value. */ + return *(unsigned int *) (extra + idx); + } + else if (sym_name_len == 1) + { + /* No valid character. Match it as a single byte + character. */ + return collseqmb[br_elem->opr.name[0]]; + } + } + else if (sym_name_len == 1) + return collseqmb[br_elem->opr.name[0]]; + } + return UINT_MAX; +} + +/* Local function for parse_bracket_exp used in _LIBC environment. + Build the range expression which starts from START_ELEM, and ends + at END_ELEM. The result are written to MBCSET and SBCSET. + RANGE_ALLOC is the allocated size of mbcset->range_starts, and + mbcset->range_ends, is a pointer argument since we may + update it. */ + +static inline reg_errcode_t +__attribute__ ((always_inline)) +build_range_exp (bitset_t sbcset, re_charset_t *mbcset, int *range_alloc, + bracket_elem_t *start_elem, bracket_elem_t *end_elem, + re_dfa_t *dfa, reg_syntax_t syntax, uint32_t nrules, + const unsigned char *collseqmb, const char *collseqwc, + int32_t table_size, const int32_t *symb_table, + const unsigned char *extra) +{ + unsigned int ch; + uint32_t start_collseq; + uint32_t end_collseq; + + /* Equivalence Classes and Character Classes can't be a range + start/end. */ + if (__glibc_unlikely (start_elem->type == EQUIV_CLASS + || start_elem->type == CHAR_CLASS + || end_elem->type == EQUIV_CLASS + || end_elem->type == CHAR_CLASS)) + return REG_ERANGE; + + /* FIXME: Implement rational ranges here, too. */ + start_collseq = lookup_collation_sequence_value (start_elem, nrules, collseqmb, collseqwc, + table_size, symb_table, extra); + end_collseq = lookup_collation_sequence_value (end_elem, nrules, collseqmb, collseqwc, + table_size, symb_table, extra); + /* Check start/end collation sequence values. */ + if (__glibc_unlikely (start_collseq == UINT_MAX + || end_collseq == UINT_MAX)) + return REG_ECOLLATE; + if (__glibc_unlikely ((syntax & RE_NO_EMPTY_RANGES) + && start_collseq > end_collseq)) + return REG_ERANGE; + + /* Got valid collation sequence values, add them as a new entry. + However, if we have no collation elements, and the character set + is single byte, the single byte character set that we + build below suffices. */ + if (nrules > 0 || dfa->mb_cur_max > 1) + { + /* Check the space of the arrays. */ + if (__glibc_unlikely (*range_alloc == mbcset->nranges)) + { + /* There is not enough space, need realloc. */ + uint32_t *new_array_start; + uint32_t *new_array_end; + int new_nranges; + + /* +1 in case of mbcset->nranges is 0. */ + new_nranges = 2 * mbcset->nranges + 1; + new_array_start = re_realloc (mbcset->range_starts, uint32_t, + new_nranges); + new_array_end = re_realloc (mbcset->range_ends, uint32_t, + new_nranges); + + if (__glibc_unlikely (new_array_start == NULL + || new_array_end == NULL)) + return REG_ESPACE; + + mbcset->range_starts = new_array_start; + mbcset->range_ends = new_array_end; + *range_alloc = new_nranges; + } + + mbcset->range_starts[mbcset->nranges] = start_collseq; + mbcset->range_ends[mbcset->nranges++] = end_collseq; + } + + /* Build the table for single byte characters. */ + for (ch = 0; ch < SBC_MAX; ch++) + { + uint32_t ch_collseq; + /* if (MB_CUR_MAX == 1) */ + if (nrules == 0) + ch_collseq = collseqmb[ch]; + else + ch_collseq = __collseq_table_lookup (collseqwc, __btowc (ch)); + if (start_collseq <= ch_collseq && ch_collseq <= end_collseq) + bitset_set (sbcset, ch); + } + return REG_NOERROR; +} + +/* Local function for parse_bracket_exp used in _LIBC environment. + Build the collating element which is represented by NAME. + The result are written to MBCSET and SBCSET. + COLL_SYM_ALLOC is the allocated size of mbcset->coll_sym, is a + pointer argument since we may update it. */ + +static inline reg_errcode_t +__attribute__ ((always_inline)) +build_collating_symbol (bitset_t sbcset, re_charset_t *mbcset, + int *coll_sym_alloc, const unsigned char *name, + uint32_t nrules, int32_t table_size, + const int32_t *symb_table, const unsigned char *extra) +{ + int32_t elem, idx; + size_t name_len = strlen ((const char *) name); + if (nrules != 0) + { + elem = seek_collating_symbol_entry (name, name_len, symb_table, + table_size, extra); + if (elem != -1) + { + /* We found the entry. */ + idx = symb_table[2 * elem + 1]; + /* Skip the name of collating element name. */ + idx += 1 + extra[idx]; + } + else if (name_len == 1) + { + /* No valid character, treat it as a normal + character. */ + bitset_set (sbcset, name[0]); + return REG_NOERROR; + } + else + return REG_ECOLLATE; + + /* Got valid collation sequence, add it as a new entry. */ + /* Check the space of the arrays. */ + if (__glibc_unlikely (*coll_sym_alloc == mbcset->ncoll_syms)) + { + /* Not enough, realloc it. */ + /* +1 in case of mbcset->ncoll_syms is 0. */ + int new_coll_sym_alloc = 2 * mbcset->ncoll_syms + 1; + /* Use realloc since mbcset->coll_syms is NULL + if *alloc == 0. */ + int32_t *new_coll_syms = re_realloc (mbcset->coll_syms, int32_t, + new_coll_sym_alloc); + if (__glibc_unlikely (new_coll_syms == NULL)) + return REG_ESPACE; + mbcset->coll_syms = new_coll_syms; + *coll_sym_alloc = new_coll_sym_alloc; + } + mbcset->coll_syms[mbcset->ncoll_syms++] = idx; + return REG_NOERROR; + } + else + { + if (__glibc_unlikely (name_len != 1)) + return REG_ECOLLATE; + else + { + bitset_set (sbcset, name[0]); + return REG_NOERROR; + } + } +} +#endif /* _LIBC */ + /* This function parse bracket expression like "[abc]", "[a-c]", "[[.a-a.]]" etc. */ @@ -2840,253 +3097,11 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token, { #ifdef _LIBC const unsigned char *collseqmb; - const char *collseqwc; + const char *collseqwc = NULL; uint32_t nrules; - int32_t table_size; - const int32_t *symb_table; - const unsigned char *extra; - - /* Local function for parse_bracket_exp used in _LIBC environment. - Seek the collating symbol entry corresponding to NAME. - Return the index of the symbol in the SYMB_TABLE, - or -1 if not found. */ - - auto inline int32_t - __attribute__ ((always_inline)) - seek_collating_symbol_entry (const unsigned char *name, size_t name_len) - { - int32_t elem; - - for (elem = 0; elem < table_size; elem++) - if (symb_table[2 * elem] != 0) - { - int32_t idx = symb_table[2 * elem + 1]; - /* Skip the name of collating element name. */ - idx += 1 + extra[idx]; - if (/* Compare the length of the name. */ - name_len == extra[idx] - /* Compare the name. */ - && memcmp (name, &extra[idx + 1], name_len) == 0) - /* Yep, this is the entry. */ - return elem; - } - return -1; - } - - /* Local function for parse_bracket_exp used in _LIBC environment. - Look up the collation sequence value of BR_ELEM. - Return the value if succeeded, UINT_MAX otherwise. */ - - auto inline unsigned int - __attribute__ ((always_inline)) - lookup_collation_sequence_value (bracket_elem_t *br_elem) - { - if (br_elem->type == SB_CHAR) - { - /* - if (MB_CUR_MAX == 1) - */ - if (nrules == 0) - return collseqmb[br_elem->opr.ch]; - else - { - wint_t wc = __btowc (br_elem->opr.ch); - return __collseq_table_lookup (collseqwc, wc); - } - } - else if (br_elem->type == MB_CHAR) - { - if (nrules != 0) - return __collseq_table_lookup (collseqwc, br_elem->opr.wch); - } - else if (br_elem->type == COLL_SYM) - { - size_t sym_name_len = strlen ((char *) br_elem->opr.name); - if (nrules != 0) - { - int32_t elem, idx; - elem = seek_collating_symbol_entry (br_elem->opr.name, - sym_name_len); - if (elem != -1) - { - /* We found the entry. */ - idx = symb_table[2 * elem + 1]; - /* Skip the name of collating element name. */ - idx += 1 + extra[idx]; - /* Skip the byte sequence of the collating element. */ - idx += 1 + extra[idx]; - /* Adjust for the alignment. */ - idx = (idx + 3) & ~3; - /* Skip the multibyte collation sequence value. */ - idx += sizeof (unsigned int); - /* Skip the wide char sequence of the collating element. */ - idx += sizeof (unsigned int) * - (1 + *(unsigned int *) (extra + idx)); - /* Return the collation sequence value. */ - return *(unsigned int *) (extra + idx); - } - else if (sym_name_len == 1) - { - /* No valid character. Match it as a single byte - character. */ - return collseqmb[br_elem->opr.name[0]]; - } - } - else if (sym_name_len == 1) - return collseqmb[br_elem->opr.name[0]]; - } - return UINT_MAX; - } - - /* Local function for parse_bracket_exp used in _LIBC environment. - Build the range expression which starts from START_ELEM, and ends - at END_ELEM. The result are written to MBCSET and SBCSET. - RANGE_ALLOC is the allocated size of mbcset->range_starts, and - mbcset->range_ends, is a pointer argument since we may - update it. */ - - auto inline reg_errcode_t - __attribute__ ((always_inline)) - build_range_exp (bitset_t sbcset, re_charset_t *mbcset, int *range_alloc, - bracket_elem_t *start_elem, bracket_elem_t *end_elem) - { - unsigned int ch; - uint32_t start_collseq; - uint32_t end_collseq; - - /* Equivalence Classes and Character Classes can't be a range - start/end. */ - if (__glibc_unlikely (start_elem->type == EQUIV_CLASS - || start_elem->type == CHAR_CLASS - || end_elem->type == EQUIV_CLASS - || end_elem->type == CHAR_CLASS)) - return REG_ERANGE; - - /* FIXME: Implement rational ranges here, too. */ - start_collseq = lookup_collation_sequence_value (start_elem); - end_collseq = lookup_collation_sequence_value (end_elem); - /* Check start/end collation sequence values. */ - if (__glibc_unlikely (start_collseq == UINT_MAX - || end_collseq == UINT_MAX)) - return REG_ECOLLATE; - if (__glibc_unlikely ((syntax & RE_NO_EMPTY_RANGES) - && start_collseq > end_collseq)) - return REG_ERANGE; - - /* Got valid collation sequence values, add them as a new entry. - However, if we have no collation elements, and the character set - is single byte, the single byte character set that we - build below suffices. */ - if (nrules > 0 || dfa->mb_cur_max > 1) - { - /* Check the space of the arrays. */ - if (__glibc_unlikely (*range_alloc == mbcset->nranges)) - { - /* There is not enough space, need realloc. */ - uint32_t *new_array_start; - uint32_t *new_array_end; - Idx new_nranges; - - /* +1 in case of mbcset->nranges is 0. */ - new_nranges = 2 * mbcset->nranges + 1; - new_array_start = re_realloc (mbcset->range_starts, uint32_t, - new_nranges); - new_array_end = re_realloc (mbcset->range_ends, uint32_t, - new_nranges); - - if (__glibc_unlikely (new_array_start == NULL - || new_array_end == NULL)) - return REG_ESPACE; - - mbcset->range_starts = new_array_start; - mbcset->range_ends = new_array_end; - *range_alloc = new_nranges; - } - - mbcset->range_starts[mbcset->nranges] = start_collseq; - mbcset->range_ends[mbcset->nranges++] = end_collseq; - } - - /* Build the table for single byte characters. */ - for (ch = 0; ch < SBC_MAX; ch++) - { - uint32_t ch_collseq; - /* - if (MB_CUR_MAX == 1) - */ - if (nrules == 0) - ch_collseq = collseqmb[ch]; - else - ch_collseq = __collseq_table_lookup (collseqwc, __btowc (ch)); - if (start_collseq <= ch_collseq && ch_collseq <= end_collseq) - bitset_set (sbcset, ch); - } - return REG_NOERROR; - } - - /* Local function for parse_bracket_exp used in _LIBC environment. - Build the collating element which is represented by NAME. - The result are written to MBCSET and SBCSET. - COLL_SYM_ALLOC is the allocated size of mbcset->coll_sym, is a - pointer argument since we may update it. */ - - auto inline reg_errcode_t - __attribute__ ((always_inline)) - build_collating_symbol (bitset_t sbcset, re_charset_t *mbcset, - Idx *coll_sym_alloc, const unsigned char *name) - { - int32_t elem, idx; - size_t name_len = strlen ((const char *) name); - if (nrules != 0) - { - elem = seek_collating_symbol_entry (name, name_len); - if (elem != -1) - { - /* We found the entry. */ - idx = symb_table[2 * elem + 1]; - /* Skip the name of collating element name. */ - idx += 1 + extra[idx]; - } - else if (name_len == 1) - { - /* No valid character, treat it as a normal - character. */ - bitset_set (sbcset, name[0]); - return REG_NOERROR; - } - else - return REG_ECOLLATE; - - /* Got valid collation sequence, add it as a new entry. */ - /* Check the space of the arrays. */ - if (__glibc_unlikely (*coll_sym_alloc == mbcset->ncoll_syms)) - { - /* Not enough, realloc it. */ - /* +1 in case of mbcset->ncoll_syms is 0. */ - Idx new_coll_sym_alloc = 2 * mbcset->ncoll_syms + 1; - /* Use realloc since mbcset->coll_syms is NULL - if *alloc == 0. */ - int32_t *new_coll_syms = re_realloc (mbcset->coll_syms, int32_t, - new_coll_sym_alloc); - if (__glibc_unlikely (new_coll_syms == NULL)) - return REG_ESPACE; - mbcset->coll_syms = new_coll_syms; - *coll_sym_alloc = new_coll_sym_alloc; - } - mbcset->coll_syms[mbcset->ncoll_syms++] = idx; - return REG_NOERROR; - } - else - { - if (__glibc_unlikely (name_len != 1)) - return REG_ECOLLATE; - else - { - bitset_set (sbcset, name[0]); - return REG_NOERROR; - } - } - } + int32_t table_size = 0; + const int32_t *symb_table = NULL; + const unsigned char *extra = NULL; #endif re_token_t br_token; @@ -3230,7 +3245,9 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token, #ifdef _LIBC *err = build_range_exp (sbcset, mbcset, &range_alloc, - &start_elem, &end_elem); + &start_elem, &end_elem, + dfa, syntax, nrules, collseqmb, collseqwc, + table_size, symb_table, extra); #else # ifdef RE_ENABLE_I18N *err = build_range_exp (syntax, sbcset, @@ -3283,7 +3300,8 @@ parse_bracket_exp (re_string_t *regexp, re_dfa_t *dfa, re_token_t *token, #ifdef RE_ENABLE_I18N mbcset, &coll_sym_alloc, #endif /* RE_ENABLE_I18N */ - start_elem.opr.name); + start_elem.opr.name, + nrules, table_size, symb_table, extra); if (__glibc_unlikely (*err != REG_NOERROR)) goto parse_bracket_exp_free_return; break;