[10/17] Regex: Portability to non-GCC compilers.

Message ID 201712080916.vB89GxSs005509@skeeve.com
State New, archived
Headers

Commit Message

Arnold Robbins Dec. 8, 2017, 9:16 a.m. UTC
  This patch removes or ifdefs various GCC-only constructs so that the
code can be compiled with other compilers.

2017-11-27         Arnold D. Robbins     <arnold@skeeve.com>

	* posix/regex_internal.c (re_string_skip_chars): Portability to
	non-GCC compilers.
	* posix/regexec.c (check_arrival): Ditto.
	* posix/regcomp.c (utf8_sb_map): Ditto.
	(init_dfa): Ditto.
  

Comments

Florian Weimer Dec. 8, 2017, 9:26 a.m. UTC | #1
On 12/08/2017 10:16 AM, Arnold Robbins wrote:
> +#if !defined(__GNUC__) || __GNUC__ < 3
> +	  static short utf8_sb_map_inited = 0;
> +
> +	  if (! utf8_sb_map_inited)
> +	    {
> +		int i;
> +
> +	  	utf8_sb_map_inited = 0;
> +		for (i = 0; i <= 0x80 / BITSET_WORD_BITS - 1; i++)
> +		  utf8_sb_map[i] = BITSET_WORD_MAX;
> +	    }
> +#endif

This doesn't look like a good idea because it's not thread-safe.

Thanks,
Florian
  
Arnold Robbins Dec. 8, 2017, 9:47 a.m. UTC | #2
Florian Weimer <fweimer@redhat.com> wrote:

> On 12/08/2017 10:16 AM, Arnold Robbins wrote:
> > +#if !defined(__GNUC__) || __GNUC__ < 3
> > +	  static short utf8_sb_map_inited = 0;
> > +
> > +	  if (! utf8_sb_map_inited)
> > +	    {
> > +		int i;
> > +
> > +	  	utf8_sb_map_inited = 0;
> > +		for (i = 0; i <= 0x80 / BITSET_WORD_BITS - 1; i++)
> > +		  utf8_sb_map[i] = BITSET_WORD_MAX;
> > +	    }
> > +#endif
>
> This doesn't look like a good idea because it's not thread-safe.

Is the rest of regex thread safe?

I've no objection to something else that does the trick for GLIBC.
Gawk doesn't have multiple threads.

Thanks,

Arnold
  
Florian Weimer Dec. 8, 2017, 11:04 a.m. UTC | #3
On 12/08/2017 10:47 AM, arnold@skeeve.com wrote:
> Florian Weimer <fweimer@redhat.com> wrote:
> 
>> On 12/08/2017 10:16 AM, Arnold Robbins wrote:
>>> +#if !defined(__GNUC__) || __GNUC__ < 3
>>> +	  static short utf8_sb_map_inited = 0;
>>> +
>>> +	  if (! utf8_sb_map_inited)
>>> +	    {
>>> +		int i;
>>> +
>>> +	  	utf8_sb_map_inited = 0;
>>> +		for (i = 0; i <= 0x80 / BITSET_WORD_BITS - 1; i++)
>>> +		  utf8_sb_map[i] = BITSET_WORD_MAX;
>>> +	    }
>>> +#endif
>>
>> This doesn't look like a good idea because it's not thread-safe.
> 
> Is the rest of regex thread safe?
> 
> I've no objection to something else that does the trick for GLIBC.
> Gawk doesn't have multiple threads.

It does not matter for glibc because it's compiled with GCC.  It might 
matter if this code is merged into gnulib, which is supposed to be 
portable to many compilers and environments.

Thanks,
Florian
  
Adhemerval Zanella Netto Dec. 8, 2017, 12:56 p.m. UTC | #4
On 08/12/2017 09:04, Florian Weimer wrote:
> On 12/08/2017 10:47 AM, arnold@skeeve.com wrote:
>> Florian Weimer <fweimer@redhat.com> wrote:
>>
>>> On 12/08/2017 10:16 AM, Arnold Robbins wrote:
>>>> +#if !defined(__GNUC__) || __GNUC__ < 3
>>>> +      static short utf8_sb_map_inited = 0;
>>>> +
>>>> +      if (! utf8_sb_map_inited)
>>>> +        {
>>>> +        int i;
>>>> +
>>>> +          utf8_sb_map_inited = 0;
>>>> +        for (i = 0; i <= 0x80 / BITSET_WORD_BITS - 1; i++)
>>>> +          utf8_sb_map[i] = BITSET_WORD_MAX;
>>>> +        }
>>>> +#endif
>>>
>>> This doesn't look like a good idea because it's not thread-safe.
>>
>> Is the rest of regex thread safe?
>>
>> I've no objection to something else that does the trick for GLIBC.
>> Gawk doesn't have multiple threads.
> 
> It does not matter for glibc because it's compiled with GCC.  It might matter if this code is merged into gnulib, which is supposed to be portable to many compilers and environments.

My understanding is we still aim to keep in sync with gnulib, so I think
we should first integrate with current gnulib code adding any glibc code
(if any and with the idea of integrating it back to gnulib). I do not see
much point in deviate even more from gnulib, unless the idea is to really
decouple both implementations.
  
Arnold Robbins Dec. 9, 2017, 6:07 p.m. UTC | #5
Florian Weimer <fweimer@redhat.com> wrote:

> On 12/08/2017 10:47 AM, arnold@skeeve.com wrote:
> > Florian Weimer <fweimer@redhat.com> wrote:
> > 
> >> On 12/08/2017 10:16 AM, Arnold Robbins wrote:
> >>> +#if !defined(__GNUC__) || __GNUC__ < 3
> >>> +	  static short utf8_sb_map_inited = 0;
> >>> +
> >>> +	  if (! utf8_sb_map_inited)
> >>> +	    {
> >>> +		int i;
> >>> +
> >>> +	  	utf8_sb_map_inited = 0;
> >>> +		for (i = 0; i <= 0x80 / BITSET_WORD_BITS - 1; i++)
> >>> +		  utf8_sb_map[i] = BITSET_WORD_MAX;
> >>> +	    }
> >>> +#endif
> >>
> >> This doesn't look like a good idea because it's not thread-safe.
> > 
> > Is the rest of regex thread safe?

I don't think it is, what with re_set_syntax.

> > I've no objection to something else that does the trick for GLIBC.
> > Gawk doesn't have multiple threads.
>
> It does not matter for glibc because it's compiled with GCC.  It might 
> matter if this code is merged into gnulib, which is supposed to be 
> portable to many compilers and environments.

I think gnulib deals with it in a different way.  They're smart enough
to merge things if and when it becomes an issue.  If it doesn't matter
for glibc, then I hope it can be merged in.

Thanks,

Arnold
  
Paul Eggert Dec. 9, 2017, 6:46 p.m. UTC | #6
On 12/09/2017 10:07 AM, arnold@skeeve.com wrote:
> I think gnulib deals with it in a different way.  They're smart enough  > to merge things if and when it becomes an issue. If it doesn't 
matter > for glibc, then I hope it can be merged in.
How about if we merge it into gnulib first? Gnulib is easier to merge into.
  

Patch

diff --git a/posix/regcomp.c b/posix/regcomp.c
index 83fcc40..22c07ce 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -572,12 +572,15 @@  weak_alias (__regerror, regerror)
    UTF-8 is used.  Otherwise we would allocate memory just to initialize
    it the same all the time.  UTF-8 is the preferred encoding so this is
    a worthwhile optimization.  */
-static const bitset_t utf8_sb_map =
-{
+#if __GNUC__ >= 3
+static const bitset_t utf8_sb_map = {
   /* Set the first 128 bits.  */
   [0 ... 0x80 / BITSET_WORD_BITS - 1] = BITSET_WORD_MAX
 };
-#endif
+#else /* ! (__GNUC__ >= 3) */
+static bitset_t utf8_sb_map;
+#endif /* __GNUC__ >= 3 */
+#endif /* RE_ENABLE_I18N */
 
 
 static void
@@ -884,7 +887,21 @@  init_dfa (re_dfa_t *dfa, size_t pat_len)
   if (dfa->mb_cur_max > 1)
     {
       if (dfa->is_utf8)
+        {
+#if !defined(__GNUC__) || __GNUC__ < 3
+	  static short utf8_sb_map_inited = 0;
+
+	  if (! utf8_sb_map_inited)
+	    {
+		int i;
+
+	  	utf8_sb_map_inited = 0;
+		for (i = 0; i <= 0x80 / BITSET_WORD_BITS - 1; i++)
+		  utf8_sb_map[i] = BITSET_WORD_MAX;
+	    }
+#endif
 	dfa->sb_char = (re_bitset_ptr_t) utf8_sb_map;
+	}
       else
 	{
 	  int i, j, ch;
diff --git a/posix/regex_internal.c b/posix/regex_internal.c
index 19c1bd8..85daf67 100644
--- a/posix/regex_internal.c
+++ b/posix/regex_internal.c
@@ -515,7 +515,7 @@  re_string_skip_chars (re_string_t *pstr, int new_raw_idx, wint_t *last_wc)
       /* Then proceed the next character.  */
       rawbuf_idx += mbclen;
     }
-  *last_wc = wc;
+  *last_wc = (wint_t) wc;
   return rawbuf_idx;
 }
 #endif /* RE_ENABLE_I18N  */
diff --git a/posix/regexec.c b/posix/regexec.c
index c73294e..7b0f6d8 100644
--- a/posix/regexec.c
+++ b/posix/regexec.c
@@ -2850,7 +2850,7 @@  check_arrival (re_match_context_t *mctx, state_array_t *path, int top_node,
 	      sizeof (re_dfastate_t *) * (path->alloc - old_alloc));
     }
 
-  str_idx = path->next_idx ?: top_str;
+  str_idx = path->next_idx ? path->next_idx : top_str;
 
   /* Temporary modify MCTX.  */
   backup_state_log = mctx->state_log;