remove attribute access from regexec

Message ID 15a32181-8060-4135-cb72-9e79831697d5@gmail.com
State Superseded
Headers
Series remove attribute access from regexec |

Checks

Context Check Description
dj/TryBot-apply_patch fail Patch failed to apply to master at the time it was sent
dj/TryBot-32bit fail Patch series failed to apply

Commit Message

Martin Sebor Aug. 13, 2021, 6:26 p.m. UTC
  A recent GCC enhancement to detect accesses by functions declared
with attribute access that are in conflict with the mode specified
by the attribute triggers a warning for the definition of regexec:

regexec.c: In function ‘__regexec’:
regexec.c:204:13: warning: ‘*pmatch.rm_so’ may be used uninitialized 
[-Wmaybe-uninitialized]
regexec.c:192:101: note: accessing argument 4 of a function declared 
with attribute ‘access (write_only, 4, 3)’

The attribute was added based on the POSIX description of
the function that implies the pmatch array referenced by it is
write-only.  However, when the REG_STARTEND bit is set in flags,
Glibc also reads from the object.  This seems to be an extension
to POSIX that's not mentioned in the Glibc manual but that is
documented in the Linux man pages:

   https://man7.org/linux/man-pages/man3/regcomp.3.html

The patch below changes the attribute's mode to read_write to
reflect this extension.

Martin
  

Comments

Paul Eggert Aug. 13, 2021, 8:11 p.m. UTC | #1
On 8/13/21 11:26 AM, Martin Sebor via Libc-alpha wrote:
> -    __attr_access ((__write_only__, 4, 3));
> +    __attr_access ((__read_write__, 4, 3));

Wouldn't it be simpler to remove the __attr_access instead?

What utility does __attr_access ((__read_write__, 4, 3)) have? FWIW, 
Glibc currently does not use such an attribute anywhere.
  
Martin Sebor Aug. 13, 2021, 9:30 p.m. UTC | #2
On 8/13/21 2:11 PM, Paul Eggert wrote:
> On 8/13/21 11:26 AM, Martin Sebor via Libc-alpha wrote:
>> -    __attr_access ((__write_only__, 4, 3));
>> +    __attr_access ((__read_write__, 4, 3));
> 
> Wouldn't it be simpler to remove the __attr_access instead?
> 
> What utility does __attr_access ((__read_write__, 4, 3)) have? FWIW, 
> Glibc currently does not use such an attribute anywhere.

The attribute serves two purposes: it specifies 1) how the function
might access the array (to determine whether it should be initialized
by the caller) and 2) the minimum number of elements the caller must
provide and the maximum number of elements the function definition
might access.

GCC checks to make sure these constraints are satisfied, both at call
sites and in the definitions of these functions.

The read_write mode means that a function may both read from and write
to the array, and expects it to be initialized.  But since regexec
might just write to the array and not read from it, depending on
flags, the read_write mode wouldn't be correct either.

There is no attribute access mode that would capture this but in C99
(though sadly not in C++) and thanks to the nmatch argument preceding
pmatch, the VLA argument syntax does make the checking possible.  It's
like attribute access with an unspecified mode.  (It might be worth
extending the attribute to express this mode so it can be used when
the bound doesn't precede the VLA argument and in C++.)

Attached is a revised patch with this approach.

Martin

PS POSIX says regexec() ignores pnmatch when REG_NOSUB is set, so
strictly speaking, warning for such calls to it in that case is
also a false positive.  I think that's an acceptable trade-off for
the buffer overflow detection (passing a zero nmatch in that case
is a trivial way to avoid the warning).
  
Paul Eggert Aug. 13, 2021, 10:34 p.m. UTC | #3
On 8/13/21 2:30 PM, Martin Sebor wrote:
> Attached is a revised patch with this approach.

The revised patch is to include/regex.h but the original patch was to 
posix/regex.h. Is that intentional?

We need to check whether __STDC_VERSION__ is defined. Also, no need for 
parens around arg of 'defined'. Something like this perhaps:

   #if (defined __STDC_VERSION__ && 199901L <= __STDC_VERSION__ \
        && !defined __STDC_NO_VLA__)

Also, the duplication of the declarations make the headers harder to 
read and encourage typos (I noticed one typo: "_Restrict_arr" without 
the trailing "_"). Instead, I suggest something like this:

   #if (defined __STDC_VERSION__ && 199901L <= __STDC_VERSION__ \
        && !defined __STDC_NO_VLA__)
   # define _REGEX_VLA(arg) arg
   #else
   # define _REGEX_VLA(arg)
   #endif

That way, we can simply change "regmatch_t __pmatch[_Restrict_arr_]" to 
"regmatch_t __pmatch[_Restrict_arr_ _REGEX_VLA (__nmatch)]" without 
having to duplicate the entire function declaration.

> PS POSIX says regexec() ignores pnmatch when REG_NOSUB is set, so
> strictly speaking, warning for such calls to it in that case is
> also a false positive.

Ouch, this casts doubt on the entire exercise. It's not simply about 
warnings: it's about the code being generated for the matcher. For 
example, for:

int
f (_Bool flag, unsigned long n, int a[n])
{
   return n == 0 ? 0 : flag ? a[n - 1] : a[0];
}

a compiler is allowed to generate code that loads a[n - 1] even when 
FLAG is false. Similarly, if we add this VLA business to regexec, the 
generated machine code could dereference pmatch unconditionally even if 
our source code makes the dereferencing conditional on REG_NOSUB, and 
the resulting behavior would fail to conform to POSIX.
  
Martin Sebor Aug. 14, 2021, 8:08 p.m. UTC | #4
On 8/13/21 4:34 PM, Paul Eggert wrote:
> On 8/13/21 2:30 PM, Martin Sebor wrote:
>> Attached is a revised patch with this approach.
> 
> The revised patch is to include/regex.h but the original patch was to 
> posix/regex.h. Is that intentional?

Yes, they need to be consistent, otherwise GCC issues -Wvla-parameter.
(That's to help detect inadvertent array/VLA mismatches as well as
mismatches in the VLA parameter bounds.)

> 
> We need to check whether __STDC_VERSION__ is defined. Also, no need for 
> parens around arg of 'defined'. Something like this perhaps:
> 
>    #if (defined __STDC_VERSION__ && 199901L <= __STDC_VERSION__ \
>         && !defined __STDC_NO_VLA__)
> 
> Also, the duplication of the declarations make the headers harder to 
> read and encourage typos (I noticed one typo: "_Restrict_arr" without 
> the trailing "_"). Instead, I suggest something like this:
> 
>    #if (defined __STDC_VERSION__ && 199901L <= __STDC_VERSION__ \
>         && !defined __STDC_NO_VLA__)
>    # define _REGEX_VLA(arg) arg
>    #else
>    # define _REGEX_VLA(arg)
>    #endif
> 
> That way, we can simply change "regmatch_t __pmatch[_Restrict_arr_]" to 
> "regmatch_t __pmatch[_Restrict_arr_ _REGEX_VLA (__nmatch)]" without 
> having to duplicate the entire function declaration.

Sounds good.  I've defined the macro in cdefs.h and mamed it _VLA_ARG
to make it usable in other contexts.  Please see the attached revision.

> 
>> PS POSIX says regexec() ignores pnmatch when REG_NOSUB is set, so
>> strictly speaking, warning for such calls to it in that case is
>> also a false positive.
> 
> Ouch, this casts doubt on the entire exercise. It's not simply about 
> warnings: it's about the code being generated for the matcher. For 
> example, for:
> 
> int
> f (_Bool flag, unsigned long n, int a[n])
> {
>    return n == 0 ? 0 : flag ? a[n - 1] : a[0];
> }
> 
> a compiler is allowed to generate code that loads a[n - 1] even when 
> FLAG is false. Similarly, if we add this VLA business to regexec, the 
> generated machine code could dereference pmatch unconditionally even if 
> our source code makes the dereferencing conditional on REG_NOSUB, and 
> the resulting behavior would fail to conform to POSIX.

The VLA bound by itself doesn't affect codegen.  I suspect you're
thinking of a[static n]?  With just a[n], without static, there
is no requirement that a point to an array with n elements.  It
simply declares an ordinary pointer, same as [] or *.

Martin
  
Paul Eggert Aug. 18, 2021, 7:52 p.m. UTC | #5
On 8/14/21 1:08 PM, Martin Sebor wrote:
> The VLA bound by itself doesn't affect codegen.  I suspect you're
> thinking of a[static n]?  With just a[n], without static, there
> is no requirement that a point to an array with n elements.  It
> simply declares an ordinary pointer, same as [] or *.

Thanks for clarifying.

I tried using a patch like that on coreutils, but it caused the build to 
fail like this:

   In file included from lib/exclude.c:35:
   ./lib/regex.h:661:7: error: ISO C90 forbids variable length array 
'__pmatch' [-Werror=vla]
     661 |       regmatch_t __pmatch[_Restrict_arr_ _VLA_ARG (__nmatch)],
         |       ^~~~~~~~~~
   cc1: all warnings being treated as errors
   make[2]: *** [Makefile:10648: lib/exclude.o] Error 1

This is because coreutils is compiled with -Wvla -Werror, to catch 
inadvertent attempts to use VLAs in local variables (this helps avoid 
stack-overflow problems). It'd be unfortunate if we had to give that 
useful diagnostic up simply due to this declaration, as there's no 
stack-overflow problem here.

If you can think of a way around this issue, here are some other things 
I ran into while trying this idea out on Coreutils.

* Other cdefs.h macros (__NTH, __REDIRECT, etc.) start with two 
underscores, so shouldn't this new macro too?

* Come to think of it, the name _VLA_ARG could be improved, as its 
argument is not actually a VLA; it's the number of elements in a 
VLA-like array. Also, its formal-parameter "arg" is confusingly-named, 
as it's an arbitrary integer expression and need not be a function 
parameter name. How about naming the macro __ARG_NELTS instead?

* regex.h cannot use __ARG_NELTS directly, for the same reason it can't 
use __restrict_arr directly: regex.h is shared with Gnulib and can't 
assume that a glibc-like sys/cdefs.h is present. I suppose regex.h would 
need something like this:

   #ifndef _ARG_NELTS_
   # ifdef __ARG_NELTS
   #  define _ARG_NELTS_(arg) __ARG_NELTS (arg)
   # elif (defined __STDC_VERSION__ && 199901L <= __STDC_VERSION__ \
	  && !defined __STDC_NO_VLA__)
   #  define _ARG_NELTS_(n) n
   # else
   #  define _ARG_NELTS_(n)
   # endif
   #endif

and then use _ARG_NELTS_ later.

* The cdefs.h comment has a stray 'n', its wording could be improved (I 
misread "variable bound" as a variable that's bound to something), 
there's a stray empty line, and it's nicer to put the comment in front 
of all the lines that define the macro. Perhaps something like this:

   /* Specify the number of elements of a function's array parameter,
      as in 'int f (int n, int a[__ARG_NELTS (n)]);'.  */
   #if (defined __STDC_VERSION__ && 199901L <= __STDC_VERSION__ \
        && !defined __STDC_NO_VLA__)
   # define __ARG_NELTS(n) n
   #else
   # define __ARG_NELTS(n)
   #endif

Though again, it's not clear to me that this idea will fly at all, due 
to the -Wvla issue.

Maybe GCC's -Wvla should be fixed to not report an error in this case? 
It's actually not a VLA if you ask me (the C standard is unclear).
  

Patch

diff --git a/posix/regex.h b/posix/regex.h
index 14fb1d8364..ba8351f873 100644
--- a/posix/regex.h
+++ b/posix/regex.h
@@ -656,7 +656,7 @@  extern int regexec (const regex_t *_Restrict_ __preg,
                     const char *_Restrict_ __String, size_t __nmatch,
                     regmatch_t __pmatch[_Restrict_arr_],
                     int __eflags)
-    __attr_access ((__write_only__, 4, 3));
+    __attr_access ((__read_write__, 4, 3));

  extern size_t regerror (int __errcode, const regex_t *_Restrict_ __preg,
                         char *_Restrict_ __errbuf, size_t __errbuf_size)